Statistical significance for genomewide studies.
Journal: 2003/October - Proceedings of the National Academy of Sciences of the United States of America
ISSN: 0027-8424
Abstract:
With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.
Relations:
Content
Citations
(3K+)
References
(15)
Clinical trials
(2)
Organisms
(2)
Processes
(8)
Affiliates
(1)
Similar articles
Articles by the same authors
Discussion board
Proc Natl Acad Sci U S A 100(16): 9440-9445

Statistical significance for genomewide studies

Department of Biostatistics, University of Washington, Seattle, WA 98195; and Departments of Health Research and Policy and Statistics, Stanford University, Stanford, CA 94305
To whom correspondence should be addressed. E-mail: ude.notgnihsaw.u@yerotsj.
Edited by Philip P. Green, University of Washington School of Medicine, Seattle, WA, and approved May 30, 2003
Edited by Philip P. Green, University of Washington School of Medicine, Seattle, WA, and approved May 30, 2003
Received 2003 Jan 28

Abstract

With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.

Keywords: false discovery rates, genomics, multiple hypothesis testing, q values
Abstract

Some of the earliest genomewide studies involved testing for linkage at loci spanning a large portion of the genome. Because a separate statistical test is performed at each locus, traditional p-value cutoffs of 0.01 or 0.05 had to be made stricter to avoid an abundance of false positive results. The threshold for significance in linkage analysis is usually chosen so that the probability of any single false positive among all loci tested is ≤0.05. This strict criterion is used mainly because one or very few loci are expected to show linkage in any given study (1, 2). Because of the recent surge in high-throughput technologies and genome projects, many more types of genomewide studies are now underway. The analyses of these data also involve performing statistical tests on thousands of features in a genome. As opposed to the linkage case, it is expected that many more than one or two of the tested features are statistically significant. Guarding against any single false positive occurring is often going to be much too strict and will lead to many missed findings. The goal is therefore to identify as many significant features in the genome as possible, while incurring a relatively low proportion of false positives.

We are specifically concerned with situations in which a well defined statistical hypothesis test is performed on each of thousands of features represented in a genome. These “features” can be genes, all nucleotide words of a certain length, single-nucleotide polymorphism markers, etc. Several motivating examples are given below. For each feature, a null hypothesis is tested against an alternative hypothesis. In this work, we say that a feature is truly null if the null hypothesis is true, and a feature is truly alternative if the alternative hypothesis is true. If a feature is called significant, then the null hypothesis is rejected in favor of the alternative hypothesis. The goal is to propose and estimate a measure of significance for each feature that meets the practical goals of the genomewide study and that is easily interpreted in terms of the simultaneous testing of thousands of features.

We propose that the recently introduced q value (3, 4) is a well suited measure of significance for this growing class of genomewide tests of significance. The q value is an extension of a quantity called the “false discovery rate” (FDR) (5), which has received much recent attention in the statistics literature (611). A FDR method has been used in detecting differential gene expression in DNA microarray experiments (12), which can be shown to be equivalent to the method in ref. 5 under certain assumptions. Also, ideas similar to FDRs have appeared in the genetics literature (1, 13).

Similarly to the p value, the q value gives each feature its own individual measure of significance. Whereas the p value is a measure of significance in terms of the false positive rate, the q value is a measure in terms of the FDR. The false positive rate and FDR are often mistakenly equated, but their difference is actually very important. Given a rule for calling features significant, the false positive rate is the rate that truly null features are called significant. The FDR is the rate that significant features are truly null. For example, a false positive rate of 5% means that on average 5% of the truly null features in the study will be called significant. A FDR of 5% means that among all features called significant, 5% of these are truly null on average.

The q value provides a measure of each feature's significance, automatically taking into account the fact that thousands are simultaneously being tested. Suppose that features with q values ≤5% are called significant in some genomewide test of significance. This results in a FDR of 5% among the significant features. A p-value threshold of 5% yields a false positive rate of 5% among all null features in the data set. In light of the definition of the false positive rate, a p-value cutoff says little about the content of the features actually called significant. The q values directly provide a meaningful measure among the features called significant. Because significant features will likely undergo some subsequent biological verification, a q-value threshold can be phrased in practical terms as the proportion of significant features that turn out to be false leads.

Here we show that the FDR is a sensible measure of the balance between the number of true positives and false positives in many genomewide studies. We motivate our proposed approach in the context of several recent and prominent papers in which awkwardly chosen p-value cutoffs were used in an attempt to achieve at least qualitatively what the q value directly achieves. We also introduce a fully automated method for estimating q values, with an initial treatment of dependence issues between the features and guidelines as to when the estimates are accurate. The proposed methodology is applied to some gene expression data taken from cancer tumors (14), supporting previously shown results and providing some additional information.

Appendix

Remark A: FDR, Positive FDR (pFDR), and the q Value. In this article, we have used FDR and FDR = E[F/S] somewhat loosely. It will almost always be the case that S = 0 with positive probability, which implies that E[F/S] is undefined. The quantity E[F/S|S > 0]·Pr(S > 0) was proposed as a solution to this problem (5), which is the result of setting F/S = 0 whenever S = 0 in the original E[F/S]. This quantity is technically called the FDR in the statistics literature. In our case we want to place a measure of significance on each feature, which is done under the assumption that the feature is called significant. Thus, the inclusion of Pr(S > 0) is somewhat awkward. An alternative quantity, called the pFDR, was recently proposed (23), which is simply defined as pFDR = E[F/S|S > 0]. The q value is most technically defined as the minimum pFDR at which the feature can be called significant (24). Because m is large in genomewide studies, we have that Pr(S > 0) ≈ 1 and FDR ≈ pFDR ≈ E[F]/E[S], so the distinction is not crucial here. Also, the estimate we use is easily motivated for either quantity (4, 10).

Suppose that each feature's statistic probabilistically follows a random mixture of a null distribution and an alternative distribution. Then under a fixed significance rule, the pFDR can be written as Pr(feature i is truly null|feature i is significant), for any i = 1,..., m (3). Similarly, the false positive rate can be written as Pr(feature i is significant|feature i is truly null), for any i = 1, 2,..., m. Notice the similarity between the pFDR and false positive rate: the arguments have simply been swapped in the conditional probabilities. This connection is the motivation for calling our proposed quantity q value. Indeed, the p value of a feature is technically defined to be the minimum possible false positive rate when calling that feature significant (26). Likewise, the q value is based on the minimum possible pFDR.

Remark B: General Algorithm for Estimating q Values. There is a tradeoff between bias and variance in choosing the λ to use in equation M18. For well formed p values, it should be the case that the bias of equation M19decreases with increasing λ, the bias being the smallest when λ → 1 (4). Therefore, the method we use here is to estimate equation M20. In doing so, we will borrow strength across the equation M21over a range of λ, giving an implicit balance between bias and variance.

Consider Fig. 3, where we have plotted equation M22versus λ for λ = 0, 0.01, 0.02,..., 0.95. By fitting a natural cubic spline to these data (solid line), we have estimated the overall trend of equation M23as λ increases. We purposely set the degrees of freedom of the natural cubic spline to 3; this means we limit its curvature to be like a quadratic function, which is suitable for our purposes. It can be seen from Fig. 3 that the natural cubic spline fits the points quite well. The natural cubic spline evaluated at λ = 1 is our final estimate of π0. For a variety of simulations and forms of dependence (data not shown), this method performed well, often eliminating all bias in equation M24.

Fig. 3.

An external file that holds a picture, illustration, etc.
Object name is pq1530509003.jpg

The equation M25versus λ for the data of Hedenfalk et al. (14). The solid line is a natural cubic spline fit to these points to estimate equation M26.

The following is the general algorithm for estimating q values from a list of p values.

  1. Let p(1)p(2) ≤... ≤ p(m) be the ordered p values. This also denotes the ordering of the features in terms of their evidence against the null hypothesis.

  2. For a range of λ, say λ = 0, 0.01, 0.02,..., 0.95, calculate

    equation M27

  3. Let f̂be the natural cubic spline with 3 df of equation M28on λ.

  4. Set the estimate of π0 to be

    equation M29

  5. Calculate

    equation M30

  6. For i = m – 1, m – 2,..., 1, calculate

    equation M31

  7. The estimated q value for the ith most significant feature is q̂(p(i)).

Remark C: Analysis of the Hedenfalk et al. Data. The data from ref. 14 can be obtained at http://research.nhgri.nih.gov/microarray/NEJMSupplement. The data consist of 3,226 genes on n1 = 7 BRCA1 arrays and n2 = 8 BRCA2 arrays, along with some arrays from sporadic breast cancer, which we did not use. If any gene had one or more measurement exceeding 20, then this gene was eliminated. A value of 20 is several interquartile ranges away from the interquartile range of all of the data and did not seem trustworthy for this example. This left m = 3,170 genes.

We tested each gene for differential expression between these two tumor types by using a two-sample t statistic. Let the log2 expression value from the jth array and the ith gene be denoted by xij. Then x̄i2 = 1/n2jBRCA2xij and equation M32are the sample mean and variance for gene i among the arrays taken from BRCA2 tumors. We can similarly define x̄i1 and equation M33to be the sample mean and variance for the ith gene among the BRCA1 tumor arrays. The two-sample t statistic for the ith gene, allowing for the possibility that the tumors have different variances, is then

equation M34

for i = 1, 2,..., 3,170.

We next calculated null versions of t1, t2,..., t3170 when there is no differential gene expression. Because it is not clearly valid to assume that the ti follow a t distribution, we calculate these by a permutation method. Consider all possible ways to assign n = 15 arrays to n1 = 7 arrays from BRCA1 and n2 = 8 arrays from BRCA2. Under the assumption that there is no differential gene expression, the t statistic should have the same distribution regardless of how we make these assignments. Specifically, the labels on the arrays are randomly scrambled, and the t statistics are recomputed. Therefore, for B = 100 permutations of the array labels we get a set of null statistics equation M35, b = 1,... B. The p value for gene i, i = 1, 2,..., 3,170 was calculated by

equation M36

We estimated the q values for differential gene expression between the BRCA1 and BRCA2 tumors by using the algorithm presented above. All results, including the computer code used to analyze the data, can be found at http://genomine.org/qvalue/results.html.

Remark D: Theoretical Properties. Some mathematical results hold under “weak dependence” of the p values (or features in the genome). These mathematical results indicate that our method yields conservative q-value estimates. The conservative property is desirable because one does not want to underestimate the true q values. (For the same reason one would not want to underestimate a p value.)

Suppose that with probability 1, we have S(t)/mG(t) and F(t)/m0G0(t) for each t ∈ [0, 1] as m → ∞, where G and G0 are continuous functions. In words, this says that the empirical distribution functions of the observed p values and null p values converge pointwise to some continuous functions. Weak dependence is defined as dependence that allows this pointwise convergence. (As a rule of thumb, the more local the dependence is, the more likely it is to meet the weak dependence criterion.) Also suppose that G0(t) ≤ t (i.e., uniform distribution or more conservative), and that m0/m converges. If we constrain equation M37(which should usually be the case), then it can be shown that for any δ > 0,

equation M38

which means that the estimated q values are simultaneously conservative for the true q values, even when taking the worst-case scenario over [δ, 1] for arbitrarily small δ. Also, we can conclude that

equation M39

which means that if we call all genes with q values ≤ α, then in the long run the FDR will be ≤ α. The proofs of these claims follow from minor modifications to some of the main results in ref. 10.

Appendix

Notes

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: FDR, false discovery rate; pFDR, positive FDR.

Notes
This paper was submitted directly (Track II) to the PNAS office.
Abbreviations: FDR, false discovery rate; pFDR, positive FDR.

References

  • 1. Morton, N. E. (1955) Am. J. Hum. Gen.7, 277–318.
  • 2. Lander, E. S. & Kruglyak, L. (1995) Nat. Genet.11, 241–247. [[PubMed]
  • 3. Storey, J. D. (2003) Ann. Stat., in press.
  • 4. Storey, J. D. (2002) J. R. Stat. Soc. B64, 479–498. [PubMed]
  • 5. Benjamini, Y. & Hochberg, Y. (1995) J. R. Stat. Soc. B85, 289–300. [PubMed]
  • 6. Yekutieli, D. & Benjamini, Y. (1999) J. Stat. Plan. Inf.82, 171–196. [PubMed]
  • 7. Benjamini, Y. & Hochberg, Y. (2000) J. Ed. Behav. Stat.25, 60–83. [PubMed]
  • 8. Efron, B., Tibshirani, R., Storey, J. D. & Tusher, V. (2001) J. Am. Stat. Assoc.96, 1151–1160. [PubMed]
  • 9. Genovese, C. & Wasserman, L. (2002) J. R. Stat. Soc. B64, 499–517. [PubMed]
  • 10. Storey, J. D., Taylor, J. E. & Siegmund, D. (2003) J. R. Stat. Soc. B, in press.
  • 11. Tzeng, J. Y., Byerley, W., Devlin, B., Roeder, K. & Wasserman, L. (2003) J. Am. Stat. Assoc.98, 236–246. [PubMed]
  • 12. Tusher, V., Tibshirani, R. & Chu, C. (2001) Proc. Natl. Acad. Sci. USA98, 5116–5121.
  • 13. Tatusov, R. L., Altschul, S. F. & Koonin, E. V. (1994) Proc. Natl. Acad. Sci. USA91, 12091–12095.
  • 14. Hedenfalk, I., Duggan, D., Chen, Y. D., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O. P., et al. (2001) N. Engl. J. Med.344, 539–548. [[PubMed]
  • 15. Slonim, D. K. (2002) Nat. Genet.32, Suppl., 502–508. [[PubMed]
  • 16. Blencowe, B. J. (2000) Trends Biochem. Sci.25, 106–110. [[PubMed]
  • 17. Fairbrother, W. G., Yeh, R. F., Sharp, P. A. & Burge, C. B. (2002) Science297, 1007–1013. [[PubMed]
  • 18. Brem, R. B., Yvert, G., Clinton, R. & Kruglyak, L. (2002) Science296, 752–755. [[PubMed]
  • 19. Churchill, G. A. & Doerge, R. W. (1994) Genetics138, 963–971.
  • 20. Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., et al. (2002) Science298, 799–804. [[PubMed]
  • 21. Kolodner, R(1996) Genes Dev.10, 1433–1442. [[PubMed][Google Scholar]
  • 22. Liu, H. T., Wang, Y. G., Zhang, Y. M., Song, Q. S., Di, C. H., Chen, G. H., Tang, J. & Ma, D. L. (1999) Biochem. Biophys. Res. Commun.254, 203–210. [[PubMed]
  • 23. Hishikawa, K., Oemar, B. S., Tanner, F. C., Nakaki, T., Luscher, T. F. & Fujii, T. (1999) J. Biol. Chem.274, 37461–37466. [[PubMed]
  • 24. Efron, B., Storey, J. D. & Tibshirani, R. (2001) Technical Report 2001-217 (Stanford Univ., Palo Alto, CA).
  • 25. Efron, B. & Tibshirani, R. (2002) Genet. Epidemiol.23, 70–86. [[PubMed]
  • 26. Lehmann, E. L. (1986) Testing Statistical Hypotheses (Springer, New York), 2nd Ed.
Collaboration tool especially designed for Life Science professionals.Drag-and-drop any entity to your messages.