Statistics for Data Science and Analytics - Peter C. Bruce, Janet Dobbins, Peter Gedeck


Statistics for Data Science and Analytics (eBook)

Peter C. Bruce, Janet Dobbins, Peter Gedeck (Authors)

eBook Download: EPUB

2024
563 pages
Wiley (Publisher)
978-1-394-25381-4 (ISBN)


Introductory statistics textbook with a focus on data science topics such as prediction, correlation, and data exploration

Statistics for Data Science and Analytics is a comprehensive guide to statistical analysis using Python, presenting important topics useful for data science such as prediction, correlation, and data exploration. The authors provide an introduction to statistical science and big data, as well as an overview of Python data structures and operations.

A range of statistical techniques are presented with their implementation in Python, including hypothesis testing, probability, exploratory data analysis, categorical variables, surveys and sampling, A/B testing, and correlation. The text introduces binary classification, a foundational element of machine learning, validation of statistical models by applying them to holdout data, and probability and inference via the easy-to-understand method of resampling and the bootstrap instead of using a myriad of 'kitchen sink' formulas. Regression is taught both as a tool for explanation and for prediction.

This book is informed by the authors' experience designing and teaching both introductory statistics and machine learning at Statistics.com. Each chapter includes practical examples, explanations of the underlying concepts, and Python code snippets to help readers apply the techniques themselves.

Statistics for Data Science and Analytics includes information on sample topics such as:

Int, float, and string data types, numerical operations, manipulating strings, converting data types, and advanced data structures like lists, dictionaries, and sets
Experiment design via randomizing, blinding, and before-after pairing, as well as proportions and percents when handling binary data
Specialized Python packages like numpy, scipy, pandas, scikit-learn, and statsmodels (the workhorses of data science) and how to get the most value from them
Statistical versus practical significance, random number generators, functions for code reuse, and binomial and normal probability distributions

Written by and for data science instructors, Statistics for Data Science and Analytics is an excellent learning resource for data science instructors prescribing a required intro stats course for their programs, as well as other students and professionals seeking to transition to the data science field.

Peter C. Bruce is Founder of the Institute for Statistics Education at Statistics.com, now part of Elder Research, Inc. He is the developer of Resampling Stats software, and the author or co-author of a number of peer-reviewed articles and several books.

Peter Gedeck, PhD, is a scientist in the research informatics team at Collaborative Drug Discovery, specializing in the development of machine learning algorithms to predict biological and physicochemical properties of drug candidates.

Janet Dobbins is the Chair of the Board of Directors for Data Community DC, a non-profit 501(c)(3) corporation committed to promoting data science by fostering education, opportunity, and professional development through high-quality community-driven events. She previously served as the Vice President of Business Development and Strategic Partnership at The Institute for Statistics Education at Statistics.com.

Bruce and Gedeck are part of the author teams for the best-selling books Machine Learning for Business Analytics (Wiley) and Practical Statistics for Data Scientists (O'Reilly).


1 Statistics and Data Science

Statistical methods first came into use before homes had electricity, and had several phases of rapid growth:

The first big boost came from manufacturers and farmers who were able to decrease costs, produce better products, and improve crop yields via statistical experiments.
Similar experiments helped drug companies graduate from snake oil purveyors to makers of scientifically proven remedies.
In the late 20th century, computing power enabled a new class of computationally intensive methods, like the resampling methods that we will study.
In the early decades of the current millennium, organizations discovered that the rapidly growing repositories of data they were collecting (“big data”) could be mined for useful insights.

As with any powerful tool, the more you know about it the better you can apply it and the less likely you will go astray. The lurking dangers are illustrated when you type the phrase “How to lie with...” into a web search engine. The likely autocompletion is “statistics.”

Much of the book that follows deals with important issues that can determine whether data yields meaningful information or not:

How to assess the role that random chance can play in creating apparently interesting results or patterns in data
How to design experiments and surveys to get useful and reliable information
How to formulate simple statistical models to describe relationships between one variable and another

We will start our study in the next chapter with a look at how to design experiments, but, before we dive in, let’s look at some statistical wins and losses from different arenas.

1.1 Big Data: Predicting Pregnancy

In 2010, a statistician from Target described how the company used customer transaction data to make educated guesses about whether customers are pregnant or not. On the strength of these guesses, Target sent out advertising flyers to likely prospects, centered around the needs of pregnant women.

How did Target use data to make those guesses? The key was data used to “train” a statistical model: data in which the outcome of interest—pregnant/not pregnant—was known in advance. Where did Target get such data? The “not pregnant” data was easy—the vast majority of customers are not pregnant, so data on their purchases is easy to come by. The “pregnant” data came from a baby shower registry. Both datasets were quite large, containing lists of items purchased by thousands of customers.

Some clues are obvious: the purchase of a crib and baby clothes is a dead giveaway. But, from Target’s perspective, by the time a customer purchased these obvious big-ticket items, it was too late: they had already chosen their shopping venue. Target wanted to reach customers earlier, before they decided where to do their shopping for the big day. For that, Target used statistical modeling to make use of non-obvious patterns in the data that distinguish pregnant from non-pregnant customers. One clue that emerged was a shift in the pattern of supplement purchases, e.g., a customer who was not buying supplements 60 days ago but is buying them now.
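The idea of learning from data where the outcome is already known can be sketched in a few lines. The example below is not Target's model, which the text does not describe in detail. It is a much simpler stand-in that compares how often each item appears in two labeled groups of baskets. All baskets, item names other than "supplements", and the function names `purchase_rates` and `rank_items` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical labeled data: each set is the basket of one customer.
# "registry" customers are known to be pregnant; "general" customers
# are assumed not to be. None of these records are real.
registry_baskets = [
    {"supplements", "bread"},
    {"supplements", "milk"},
    {"supplements", "coffee", "soap"},
    {"bread", "milk"},
]
general_baskets = [
    {"bread", "milk"},
    {"coffee", "bread"},
    {"supplements", "coffee"},
    {"milk", "coffee", "soap"},
]


def purchase_rates(baskets):
    """Return the share of baskets that contain each item."""
    counts = Counter(item for basket in baskets for item in basket)
    return {item: n / len(baskets) for item, n in counts.items()}


def rank_items(group_a, group_b):
    """Sort items by (rate in group_a) minus (rate in group_b), largest first."""
    rate_a = purchase_rates(group_a)
    rate_b = purchase_rates(group_b)
    items = set(rate_a) | set(rate_b)
    diffs = {i: rate_a.get(i, 0.0) - rate_b.get(i, 0.0) for i in items}
    return sorted(diffs.items(), key=lambda pair: pair[1], reverse=True)


ranking = rank_items(registry_baskets, general_baskets)
for item, diff in ranking:
    print(f"{item:12s} {diff:+.2f}")
```

Items near the top of the ranking are bought more often by the group with the known outcome. A real model would combine many such signals and would be checked on data it was not trained on.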

1.2 Phantom Protection from Vitamin E

In 1993, researchers examining a database on nurses’ health found that nurses who took vitamin E supplements had 30% to 40% fewer heart attacks than those who didn’t. These data fit with theories that antioxidants such as vitamins E and C could slow damaging processes within the body. Linus Pauling, winner of the Nobel Prize in Chemistry in 1954, was a major proponent of these theories, which were one driver of the nutritional supplements industry.

However, the heart health benefits of vitamin E turned out to be illusory. A study completed in 2007 divided 14,641 male physicians randomly into four groups:

Take 268 mg of vitamin E every other day
Take 500 mg of vitamin C every day
Take both vitamin E and C
Take placebo.

Those who took vitamin E fared no better than those who did not take vitamin E. Since the only difference between the two groups was whether or not they took vitamin E, if there were a vitamin E effect, it would have shown up. Several meta-analyses, which are consolidated reviews of the results of multiple published studies, have reached the same conclusion. One found that vitamin E at the above dosage might even increase mortality.
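Random assignment of this kind is easy to sketch. The text does not say how the study carried out its randomization, so the example below makes its own assumptions: participants are shuffled and then dealt out to the four groups in turn, which gives groups of nearly equal size. The function name `assign_groups` and the seed value are hypothetical.

```python
import random


def assign_groups(participant_ids, group_names, seed=None):
    """Shuffle participants, then deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    groups = {name: [] for name in group_names}
    for position, pid in enumerate(shuffled):
        groups[group_names[position % len(group_names)]].append(pid)
    return groups


arms = ["vitamin E", "vitamin C", "vitamin E and C", "placebo"]
# 14,641 participants, as in the study described above; the seed is arbitrary.
groups = assign_groups(range(14641), arms, seed=1)
sizes = {name: len(members) for name, members in groups.items()}
print(sizes)
```

Because chance alone decides who lands in which group, other characteristics such as age or diet tend to balance out across groups, which is what allows a difference in outcomes to be attributed to the treatment.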

What happened to make the researchers in 1993 think they had found a link between vitamin E and disease inhibition? In reviewing a vast quantity of data, researchers thought they saw an interesting association. In retrospect, with the benefit of a well-designed experiment, it appears that this association was merely a chance coincidence. Unfortunately, coincidences happen all the time in life. In fact, they happen to a greater extent than we think possible.
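A small simulation suggests how easily chance produces an apparent association when many candidates are examined. This is not a reanalysis of the nurses' data. It is a sketch in which every number is an assumption: 1,000 simulated people, a 10% outcome rate, and 200 made-up "habits" that are assigned by coin flip and therefore have no real connection to the outcome.

```python
import random
import statistics


def rate_difference(exposed, outcome):
    """Outcome rate among the exposed minus the rate among the unexposed."""
    with_habit = [o for e, o in zip(exposed, outcome) if e]
    without_habit = [o for e, o in zip(exposed, outcome) if not e]
    if not with_habit or not without_habit:
        return 0.0  # no comparison is possible if one group is empty
    return sum(with_habit) / len(with_habit) - sum(without_habit) / len(without_habit)


rng = random.Random(0)  # arbitrary seed, so the run is repeatable
n_people, n_habits = 1000, 200

# The outcome is generated independently of every habit below.
outcome = [rng.random() < 0.10 for _ in range(n_people)]

differences = []
for _ in range(n_habits):
    habit = [rng.random() < 0.50 for _ in range(n_people)]
    differences.append(abs(rate_difference(habit, outcome)))

typical = statistics.median(differences)
largest = max(differences)
print(f"typical gap: {typical:.3f}, largest gap: {largest:.3f}")
```

The habit with the largest gap looks far more interesting than a typical one, even though no habit has any effect here. Picking the most striking pattern out of many is one way a coincidence can pass for a finding.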

1.3 Statistician, Heal Thyself

In 1993, Mathsoft Corp., the developer of Mathcad mathematical software, acquired StatSci, the developer of S-PLUS statistical software, the precursor to R. Mathcad was an affordable tool popular with engineers—prices were in the hundreds of dollars and the number of users was in the hundreds of thousands. S-PLUS was a high-end graphical and statistical tool used primarily by statisticians—prices were in the thousands of dollars and the number of users was in the thousands.

In looking to boost revenues, Mathsoft turned to an established marketing principle—cross-selling. In other words, try to convince the people who bought product A to buy product B. With the acquisition of a highly regarded niche product, S-PLUS, and an existing large customer base for Mathcad, Mathsoft decided that the logical thing to do would be to ramp up S-PLUS sales via direct mail to its installed Mathcad user base. It also decided to purchase lists of similar prospective customers for both Mathcad and S-PLUS.

This major mailing program boosted revenues, but it boosted expenses even more. The company lost over $13 million in 1993 and 1994 combined—significant numbers for a company that had only $11 million in 1992 revenue.

What happened?

In retrospect, it was clear that the mailings were not well targeted. The costs of the unopened mail exceeded the revenue from the few recipients who did respond. Mathcad users turned out not to be likely users of S-PLUS. The huge losses could have been avoided through the use of two common statistical techniques:

Doing a test mailing to the various lists being considered to (1) determine whether the list is productive and (2) test different headlines, copy, pricing, etc., to see what works best.
Using predictive modeling techniques to identify which names on a list are most likely to turn into customers.
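The first technique, a test mailing, comes down to simple arithmetic: a list pays for itself only if its response rate exceeds the cost of a mail piece divided by the profit from an order. The sketch below uses entirely hypothetical figures (list names, mailing counts, costs, and profit) and hypothetical function names. It also ignores the sampling variability of a small test, which a fuller analysis would need to account for.

```python
def break_even_rate(cost_per_piece, profit_per_order):
    """Response rate at which a mailing exactly covers its own cost."""
    return cost_per_piece / profit_per_order


def evaluate_lists(test_results, cost_per_piece, profit_per_order):
    """Compare each list's test response rate with the break-even rate."""
    threshold = break_even_rate(cost_per_piece, profit_per_order)
    report = {}
    for name, (mailed, orders) in test_results.items():
        rate = orders / mailed
        report[name] = {"response_rate": rate, "profitable": rate > threshold}
    return report


# Hypothetical test mailings: (pieces mailed, orders received).
test_results = {"list_a": (2000, 30), "list_b": (2000, 4)}

report = evaluate_lists(test_results, cost_per_piece=1.0, profit_per_order=100.0)
for name, row in report.items():
    print(name, f"{row['response_rate']:.3%}", row["profitable"])
```

With these made-up numbers the break-even rate is 1%, so only the first list would justify a full mailing.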

1.4 Identifying Terrorists in Airports

Since the September 11, 2001, Al Qaeda attacks in the United States and subsequent attacks elsewhere, security screening programs at airports have become a major undertaking, costing billions of dollars per year in the United States alone. Most of these resources are consumed in an exhaustive screening process. All passengers and their tickets are reviewed, their baggage is screened, and individuals pass through detectors of varying sophistication. An individual and his or her bag can only receive a limited amount of attention in a screening process that is applied to everyone. The process is largely the same for each individual. Potential terrorists can see the process and its workings in detail and identify weaknesses.

To improve the effectiveness of the system, security officials have studied ways of focusing more concentrated attention on a small number of travelers. In the years after the attacks, one technique used enhanced screening for a limited number of randomly selected travelers. While it adds some uncertainty to the screening process, which acts as a deterrent to attackers, random selection does nothing to focus attention on high-risk individuals.

Determining who is high-risk is, of course, the problem. How do you know who the high-risk passengers are?

One method is passenger profiling—specifying some guidelines about what passenger characteristics merit special attention. These characteristics were determined by a reasoned, logical approach. For example, purchasing a ticket for cash, as the 2001 hijackers did, raises a red flag. The Transportation Security Administration trains a cadre of Behavior Detection Officers. The Administration also maintains a specific no-fly list of individuals who trigger special screening.

There are several problems with the profiling and no-fly approaches.

Profiling can generate backlash and controversy because it comes close to stereotyping. American National Public Radio commentator Juan Williams was fired when he made an offhand comment to the effect that he would be nervous about boarding an aircraft in the company of people in full Muslim garb.
Profiling, since it does tend to merge with stereotyping and is based on logic and reason, enables terrorist organizations to engineer attackers that do not meet profile criteria.
No-fly lists are imprecise (a name may match thousands of individuals) and often erroneous. Senator Edward Kennedy was once pulled aside because he supposedly showed up on a no-fly list.

An alternative or supplemental approach is a...

Publication date (per publisher)	6 August 2024
Language	English
Subject area	Mathematics / Computer Science ► Mathematics ► Statistics
Subject area	Mathematics / Computer Science ► Mathematics ► Probability / Combinatorics
Keywords	A/B Testing • Big Data • Bootstrap • Data Science • NumPy • Python • python data structures • Python Libraries • python operations • python textbook • Regression • resampling • SciPy • Statistical Analysis • Statistical Science • Statistical Techniques • statistics textbook
ISBN-10	1-394-25381-8 / 1394253818
ISBN-13	978-1-394-25381-4 / 9781394253814


EPUB (Adobe DRM)
Size: 19.0 MB

Copy protection: Adobe DRM
Adobe DRM is a copy-protection scheme intended to protect the eBook from misuse. The eBook is authorized to your personal Adobe ID at download time. You can then read the eBook only on devices that are also registered to your Adobe ID.

File format: EPUB (Electronic Publication)
EPUB is an open standard for eBooks and is particularly well suited to fiction and non-fiction. The body text adapts dynamically to the display and font size, so EPUB also works well on mobile reading devices.

System requirements:
PC/Mac: You can read this eBook on a PC or Mac. You need an Adobe ID and the free Adobe Digital Editions software. We advise against using the OverDrive Media Console, which in our experience frequently has problems with Adobe DRM.
eReader: This eBook can be read on (almost) all eBook readers. It is not compatible with the Amazon Kindle, however.
Smartphone/Tablet: You can read this eBook on both Apple and Android devices. You need an Adobe ID and a free app.

Buying eBooks from abroad
For tax law reasons, we can sell eBooks only within Germany and Switzerland. Regrettably, we cannot fulfill eBook orders from other countries.