
The False Discovery Rate (eBook)

Its Meaning, Interpretation and Application in Data Science

N. W. Galwey (Author)

eBook Download: EPUB
2024 | 1st edition
288 pages
Wiley-Blackwell (publisher)
978-1-119-88979-3 (ISBN)

€76.99 incl. VAT
(CHF 75.20)

The False Discovery Rate

An essential tool for statisticians and data scientists seeking to interpret the vast troves of data that increasingly power our world

First developed in the 1990s, the False Discovery Rate (FDR) describes the expected proportion of 'significant' results – rejected null hypotheses – that are in fact false positives. It has since become an essential tool for interpreting large datasets. In recent years, as datasets have become ever larger, and as the importance of 'big data' to scientific research has grown, the significance of the FDR has grown correspondingly.

The False Discovery Rate provides an analysis of the FDR's value as a tool, including why it should generally be preferred to the Bonferroni correction and other methods by which multiplicity can be accounted for. It offers a systematic overview of the FDR, its core claims, and its applications.

Readers of The False Discovery Rate will also find:

  • Case studies throughout, rooted in real and simulated data sets
  • Detailed discussion of topics including representation of the FDR on a Q-Q plot, consequences of non-monotonicity, and many more
  • Wide-ranging analysis suited for a broad readership

The False Discovery Rate is ideal for Statistics and Data Science courses, and short courses associated with conferences. It is also useful as supplementary reading in courses in other disciplines that require the statistical interpretation of 'big data'. The book will also be of great value to statisticians and researchers looking to learn more about the FDR.

STATISTICS IN PRACTICE

A series of practical books outlining the use of statistical techniques in a wide range of application areas:

  • HUMAN AND BIOLOGICAL SCIENCES
  • EARTH AND ENVIRONMENTAL SCIENCES
  • INDUSTRY, COMMERCE AND FINANCE


N. W. GALWEY is a Statistics Leader, Research Statistics, at GlaxoSmithKline Research and Development (Retired).

1
Introduction


1.1 A Brief History of Multiple Testing


In the beginning was the significance threshold. By the early twentieth century, researchers with an awareness of random variation became concerned that the interesting results that they wished to report might have occurred by chance. Mathematicians worked to develop methods for quantifying this risk, and in 1926, R.A. Fisher wrote,

…it is convenient to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials’.

That is, he suggested a threshold of α = 0.05, adding,

If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point) or one in a hundred (the 1 per cent. point).

(Fisher, 1926, p. 504)

Fisher's suggestion was taken up by researchers, but for the next few decades they did not usually calculate the probability of obtaining, by coincidence, the observed result of each particular study. Such a calculation required a substantial amount of work by a trained mathematician. Instead, the researcher calculated a test statistic – z, t, F, χ2 or r (the correlation coefficient), depending on the design of the study and the question asked – and compared the value obtained with a published table of values corresponding to particular thresholds, typically α = 0.05, 0.01 and 0.001. For example, suppose that a researcher analysing data from an experiment obtained the result t = −3.1, with 8 degrees of freedom (d.f. = 8). If they were interested in large effects either positive or negative, they would consult a table of values of the t statistic for a two‐sided test, and find that P(|T8| > 2.306) = 0.05 and P(|T8| > 3.355) = 0.01. Hence, they would conclude that their result was significant at the 5% (α = 0.05) level, but not at the 1% (α = 0.01) level. Though no probability had been calculated, such a conclusion could be reported in terms of a p‐value – in this case, p < 0.05.

By the late 1970s, many researchers had desktop or pocket calculators offering statistical functions, or even had access to programmable computers. This enabled them to present the actual p‐value associated with their result, rather than comparing the result to pre‐specified thresholds. In the present case, they would report p = 0.015. However, the preoccupation with thresholds that had its origin in arithmetical convenience persisted, and a value of p > 0.05 was (and is) typically presented as ‘non‐significant’ (‘NS’), whereas p < 0.05 is ‘significant’ (often indicated by ‘*’); p < 0.01 is ‘highly significant’ (‘**’); and p < 0.001 is ‘very highly significant’ (‘***’).
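As a concrete illustration of this calculation (a minimal sketch of my own, not taken from the book), both the table lookup and the exact p‐value for t = −3.1 with 8 degrees of freedom can be reproduced in a few lines of Python using scipy.stats:

# Two-sided t-test example from the text: t = -3.1 with d.f. = 8.
from scipy import stats

t_obs, df = -3.1, 8

# Critical values that a printed table would give for the 5% and 1% two-sided thresholds.
crit_05 = stats.t.ppf(1 - 0.05 / 2, df)  # about 2.306
crit_01 = stats.t.ppf(1 - 0.01 / 2, df)  # about 3.355

# Exact two-sided p-value, as a calculator or computer would report it.
p_value = 2 * stats.t.sf(abs(t_obs), df)  # about 0.015

print(crit_05, crit_01, p_value)

Since |−3.1| exceeds 2.306 but not 3.355, the result is significant at the 5% level but not at the 1% level, consistent with the exact p‐value of about 0.015.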

By this time, such significance tests had become the mainstay of statistical data analysis in the biological and social sciences – a status that they still retain. However, it was apparent from the outset that there are conceptual problems associated with such tests. Firstly, the test does not address precisely the question that the researcher most wants to answer. The researcher is not primarily interested in the probability of their dataset – in a sense its probability is irrelevant, as it is an event that has actually happened. What they really want to know is the probability of the hypothesis that the experiment was designed to test. This is the problem of ‘inverse’ or ‘Bayesian’ probability, the probability of things that are not – and cannot be – observed. Secondly, although the probability that a single experiment will give a significant result by coincidence is low, if more tests are conducted, the probability that at least one of them will do so increases.
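To see how quickly this second problem grows (a worked illustration of my own, not an example from the book): if m independent tests are each carried out at the α = 0.05 level and every null hypothesis is true, the probability that at least one test is significant purely by chance is 1 − (1 − α)^m.

# Probability of at least one false positive among m independent tests at alpha = 0.05.
alpha = 0.05
for m in (1, 5, 20, 100):
    print(m, round(1 - (1 - alpha) ** m, 3))  # 0.05, 0.226, 0.642, 0.994

With 100 independent tests, at least one spurious 'significant' result is almost guaranteed.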

Initially, these difficulties were dealt with by an informal understanding that if results were unlikely to be obtained by coincidence, then the probability that they were indeed produced by coincidence was low, and hence the hypothesis that this had occurred – the null hypothesis, H0 – could be rejected. It followed that among all the ‘discoveries’ announced by many researchers working over many years, it would not often turn out that the null hypothesis had, after all, been correct. As long as every study and every statistical analysis conducted required a considerable effort on the part of the researcher, there was some reason for confidence in this argument: researchers would not usually waste their time and other resources striving to detect effects that were unlikely to exist.

However, in the last decades of the twentieth century, technological developments changed the situation. There were several aspects to this expansion of the scope for statistical analysis, namely:

  • Increased capacity for statistical calculations, initially by centralised mainframe computers, and later by personal computers and other devices that could be under the control of a small research group or an individual.
  • Increased capacity for electronic data storage. ‘The world’s technological per‐capita capacity to store information has roughly doubled every 40 months since the 1980s’ (Hilbert and López 2011, quoted by https://en.wikipedia.org/wiki/Big_data, accessed 15 April 2024).
  • Development of electronic automated measuring devices, electronic data loggers for capturing the measurements from traditional devices such as thermometers, and high‐throughput laboratory technologies for obtaining experimental data (the latter particularly in genetics and genomics).
  • Development of user‐friendly software, usually with a ‘point‐and‐click’ interface, enabling researchers to perform their own routine statistical analyses and ending their dependence on specialist programmers or statisticians for this service.

By the 1990s, the term ‘big data’ started to be used to refer to such developments. The management, manipulation and exploration of these huge datasets were characterised as a discipline called ‘data science’, distinct from classical statistics:

While the term data science is not new, the meanings and connotations have changed over time. The word first appeared in the ’60s as an alternative name for statistics. In the late ’90s, computer science professionals formalized the term. A proposed definition for data science saw it as a separate field with three aspects: data design, collection, and analysis. It still took another decade for the term to be used outside of academia. (Amazon Web Services, https://aws.amazon.com/what‐is/data‐science/, accessed 15 April 2024)

When thousands of statistical hypothesis tests could be performed with negligible effort either in the collection or the analysis of the data, the prospect that multiple testing would lead to significant results in cases where H0 was true – false positives – became effectively a certainty.
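A minimal simulation sketch (my own illustration, with arbitrary sample sizes, not an example from the book) makes this concrete: if 10,000 two-sample t-tests are run on data in which the null hypothesis is true by construction, roughly 5% of them – about 500 – will still come out 'significant' at the 5% level.

# Count false positives among many tests in which H0 is true by construction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_per_group = 10_000, 20

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=n_per_group)  # both groups drawn from the same distribution,
    b = rng.normal(size=n_per_group)  # so any 'significant' difference is a false positive
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(false_positives)  # roughly 500 out of 10,000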

The problem of false‐positive results is exacerbated if the multiple testing that has caused it is not apparent when the results are reported. This can occur, inadvertently or deliberately, due to several distinct mechanisms, for example, as follows:

  1. Repeatedly testing the same null hypothesis. Specifically,
    • repeated testing of the same hypothesis by the same method, stopping when a significant result is obtained and reporting only the last test: a flagrant abuse of statistical significance testing;
    • less obviously, testing of the same null hypothesis in unconnected studies, reporting only the one or a few that give significant results, perhaps with no awareness that the other studies existed.
  2. Testing a number of related null hypotheses, and reporting only those that give a significant result. Specifically,
    • simple failure to report the number of hypotheses, directly comparable with each other, that have been considered (e.g., the number of candidate genes that have been tested for association with a heritable disease);
    • testing each of several types of event (e.g., adverse health outcomes) for association with each of several exposures (e.g., potential environmental risk factors) resulting in a large number of pair‐wise combinations;
    • in clinical trials, testing the effect of the medical intervention on secondary outcomes in addition to the pre‐specified primary end‐point;
    • also in clinical trials, testing the effect of the intervention on subsets of the patients recruited (e.g., older males, a particular ethnic group, patients with a particular comorbidity…).
  3. Testing the same hypothesis by a number of different statistical methods. Specifically,
    • selection of different variables for inclusion as covariates in a multivariable regression model;
    • inclusion versus deletion of outliers;
    • application of different analysis methods to the same data (logarithmic transformation, square root transformation, choice of a value for a ‘tuning...

Publication date (per publisher): 29.10.2024
Series: Statistics in Practice
Language: English
Subject areas: Mathematics / Computer Science › Mathematics › Statistics; Mathematics / Computer Science › Mathematics › Probability / Combinatorics
Keywords: Big Data • Bonferroni correction • collective interpretation • electronic data collection • False Discovery Rate • FDR • high-throughput screening system • non-monotonicity • p-value • p-value-FDR relationship • Q-Q Plot • quantile-quantile plot
ISBN-10 1-119-88979-0 / 1119889790
ISBN-13 978-1-119-88979-3 / 9781119889793
File format: EPUB (Adobe DRM)
Size: 8.2 MB

