Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles
Freely available online through the PNAS open access option.
Abstract
Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.
Genomewide expression analysis with DNA microarrays has become a mainstay of genomics research (1, 2). The challenge no longer lies in obtaining gene expression profiles, but rather in interpreting the results to gain insights into biological mechanisms.
In a typical experiment, mRNA expression profiles are generated for thousands of genes from a collection of samples belonging to one of two classes, for example, tumors that are sensitive vs. resistant to a drug. The genes can be ordered in a ranked list L, according to their differential expression between the classes. The challenge is to extract meaning from this list.
A common approach involves focusing on a handful of genes at the top and bottom of L (i.e., those showing the largest difference) to discern telltale biological clues. This approach has a few major limitations.
(i) After correcting for multiple hypothesis testing, no individual gene may meet the threshold for statistical significance, because the relevant biological differences are modest relative to the noise inherent to the microarray technology.
(ii) Alternatively, one may be left with a long list of statistically significant genes without any unifying biological theme. Interpretation can be daunting and ad hoc, being dependent on a biologist's area of expertise.
(iii) Single-gene analysis may miss important effects on pathways. Cellular processes often affect sets of genes acting in concert. An increase of 20% in all genes encoding members of a metabolic pathway may dramatically alter the flux through the pathway and may be more important than a 20-fold increase in a single gene.
(iv) When different groups study the same biological system, the list of statistically significant genes from the two studies may show distressingly little overlap (3).
To overcome these analytical challenges, we recently developed a method called Gene Set Enrichment Analysis (GSEA) that evaluates microarray data at the level of gene sets. The gene sets are defined based on prior biological knowledge, e.g., published information about biochemical pathways or coexpression in previous experiments. The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L, in which case the gene set is correlated with the phenotypic class distinction.
We used a preliminary version of GSEA to analyze data from muscle biopsies from diabetics vs. healthy controls (4). The method revealed that genes involved in oxidative phosphorylation show reduced expression in diabetics, although the average decrease per gene is only 20%. The results from this study have been independently validated by other microarray studies (5) and by in vivo functional studies (6).
Given this success, we have developed GSEA into a robust technique for analyzing molecular profiling data. We studied its characteristics and performance and substantially revised and generalized the original method for broader applicability.
In this paper, we provide a full mathematical description of the GSEA methodology and illustrate its utility by applying it to several diverse biological problems. We have also created a software package, called GSEA-P, and an initial inventory of gene sets (Molecular Signature Database, MSigDB), both of which are freely available.
For detailed results, see Table 4, which is published as supporting information on the PNAS web site.
Acknowledgments
We acknowledge discussions with or data from D. Altshuler, N. Patterson, J. Lamb, X. Xie, J.-Ph. Brunet, S. Ramaswamy, J.-P. Bourquin, B. Sellers, L. Sturla, C. Nutt, and J. C. Florez and comments from reviewers.
Appendix: Mathematical Description of Methods
Inputs to GSEA.
1. Expression data set D with N genes and k samples.
2. Ranking procedure to produce Gene List L. Includes a correlation (or other ranking metric) and a phenotype or profile of interest C. We use only one probe per gene to prevent overestimation of the enrichment statistic (Supporting Text; see also Table 8, which is published as supporting information on the PNAS web site).
3. An exponent p to control the weight of the step.
4. Independently derived Gene Set S of N_H genes (e.g., a pathway, a cytogenetic band, or a GO category). In the analyses above, we used only gene sets with at least 15 members to focus on robust signals (78% of MSigDB) (Table 3).
Enrichment Score ES(S).
1. Rank order the N genes in D to form L = {g_1, ..., g_N} according to the correlation, r(g_j) = r_j, of their expression profiles with C.
2. Evaluate the fraction of genes in S (“hits”) weighted by their correlation and the fraction of genes not in S (“misses”) present up to a given position i in L:

$$P_{hit}(S, i) = \sum_{g_j \in S,\ j \le i} \frac{|r_j|^p}{N_R}, \quad \text{where } N_R = \sum_{g_j \in S} |r_j|^p$$

$$P_{miss}(S, i) = \sum_{g_j \notin S,\ j \le i} \frac{1}{N - N_H}$$

The ES is the maximum deviation from zero of P_hit − P_miss. For a randomly distributed S, ES(S) will be relatively small, but if S is concentrated at the top or bottom of the list, or otherwise nonrandomly distributed, then ES(S) will be correspondingly high. When p = 0, ES(S) reduces to the standard Kolmogorov–Smirnov statistic; when p = 1, we are weighting the genes in S by their correlation with C normalized by the sum of the correlations over all of the genes in S. We set p = 1 for the examples in this paper. (See Fig. 7, which is published as supporting information on the PNAS web site.)
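As an illustration only, the running-sum computation described above can be sketched in a few lines of Python. The function name `enrichment_score` and its argument names are hypothetical choices made for this sketch and are not part of the GSEA-P package; the input is assumed to be already rank ordered by correlation.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Sketch of ES(S): maximum deviation from zero of P_hit - P_miss.

    ranked_genes : gene identifiers, already sorted to form the list L
    correlations : r_j for each gene, in the same order as ranked_genes
    gene_set     : identifiers of the genes in S
    p            : exponent controlling the weight of each hit
    """
    members = set(gene_set)
    in_set = [g in members for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits  # N - N_H
    if n_hits == 0 or n_miss == 0:
        raise ValueError("S must contain some, but not all, of the genes in L")

    # N_R: sum of |r_j|^p over the genes in S
    n_r = sum(abs(r) ** p for r, hit in zip(correlations, in_set) if hit)
    if n_r == 0:
        raise ValueError("all genes in S have zero weight")

    running = 0.0  # current value of P_hit - P_miss
    es = 0.0       # signed value with the largest magnitude seen so far
    for r, hit in zip(correlations, in_set):
        if hit:
            running += abs(r) ** p / n_r
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(es):
            es = running
    return es


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
r = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
top = enrichment_score(genes, r, {"g1", "g2"})     # set at the top of L
bottom = enrichment_score(genes, r, {"g5", "g6"})  # set at the bottom of L
mixed = enrichment_score(genes, r, {"g1", "g4"})   # set spread through L
```

In this toy list, a set concentrated at the top gives a score near +1, a set at the bottom gives a score near −1, and a set spread through the list gives a smaller magnitude.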
Estimating Significance. We assess the significance of an observed ES by comparing it with the set of scores ES_NULL computed with randomly assigned phenotypes.
1. Randomly assign the original phenotype labels to samples, reorder genes, and re-compute ES(S).
2. Repeat step 1 for 1,000 permutations, and create a histogram of the corresponding enrichment scores ES_NULL.
3. Estimate nominal P value for S from ES_NULL by using the positive or negative portion of the distribution corresponding to the sign of the observed ES(S).
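The permutation procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GSEA-P implementation: genes are ranked by the difference of class means as a simple stand-in for the ranking metric, and the names `running_es`, `es_for_labels`, and `nominal_p_value` are hypothetical.

```python
import random


def running_es(scores, in_set, p=1.0):
    """Maximum deviation from zero of P_hit - P_miss along a ranked list."""
    n_miss = in_set.count(False)
    n_r = sum(abs(r) ** p for r, hit in zip(scores, in_set) if hit)
    if n_r == 0 or n_miss == 0:
        return 0.0  # degenerate case, treated here as no enrichment (assumption)
    running, best = 0.0, 0.0
    for r, hit in zip(scores, in_set):
        running += abs(r) ** p / n_r if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def es_for_labels(expr, labels, gene_set):
    """Rank genes for the given 0/1 labels, then score the gene set.

    expr maps each gene to its expression values, one per sample.
    The ranking metric (difference of class means) is an assumption.
    """
    def mean_diff(values):
        a = [v for v, c in zip(values, labels) if c == 1]
        b = [v for v, c in zip(values, labels) if c == 0]
        return sum(a) / len(a) - sum(b) / len(b)

    ranked = sorted(((mean_diff(v), g) for g, v in expr.items()),
                    key=lambda t: -t[0])
    scores = [s for s, _ in ranked]
    in_set = [g in gene_set for _, g in ranked]
    return running_es(scores, in_set)


def nominal_p_value(expr, labels, gene_set, n_perm=1000, seed=0):
    """Return (observed ES, nominal P value) from label permutations."""
    rng = random.Random(seed)
    observed = es_for_labels(expr, labels, gene_set)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomly reassign the phenotype labels
        null.append(es_for_labels(expr, shuffled, gene_set))
    # Use only the portion of the null distribution with the observed sign.
    if observed >= 0:
        side = [x for x in null if x >= 0]
        extreme = [x for x in side if x >= observed]
    else:
        side = [x for x in null if x < 0]
        extreme = [x for x in side if x <= observed]
    p_value = len(extreme) / len(side) if side else 0.0
    return observed, p_value
```

Permuting the phenotype labels, rather than the genes, keeps the gene-to-gene correlation structure of the data intact in the null distribution.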
Multiple Hypothesis Testing.
1. Determine ES(S) for each gene set in the collection or database.
2. For each S and 1,000 fixed permutations π of the phenotype labels, reorder the genes in L and determine ES(S, π).
3. Adjust for variation in gene set size. Normalize the ES(S, π) and the observed ES(S), separately rescaling the positive and negative scores by dividing by the mean of the ES(S, π) to yield the normalized scores NES(S, π) and NES(S) (see Supporting Text).
4. Compute FDR. Control the ratio of false positives to the total number of gene sets attaining a fixed level of significance, separately for positive and negative NES(S) and NES(S, π).
Create a histogram of all NES(S, π) over all S and π. Use this null distribution to compute an FDR q value for a given NES(S) = NES* ≥ 0. The FDR is the ratio of the percentage of all (S, π) with NES(S, π) ≥ 0 whose NES(S, π) ≥ NES* to the percentage of observed S with NES(S) ≥ 0 whose NES(S) ≥ NES*. The case NES(S) = NES* ≤ 0 is handled analogously.
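The normalization and FDR steps above can be sketched as follows, assuming the observed scores ES(S) and the permutation scores ES(S, π) have already been computed. The names `normalize` and `fdr_q_values` are hypothetical, and capping the ratio at 1 is a convention adopted for this sketch rather than something stated in the text.

```python
def normalize(es_value, null_scores):
    """Divide a score by the mean magnitude of the same-signed null scores."""
    same = [x for x in null_scores if (x >= 0) == (es_value >= 0)]
    mean_mag = abs(sum(same) / len(same)) if same else 0.0
    if mean_mag == 0:
        raise ValueError("no usable same-signed null scores for normalization")
    return es_value / mean_mag


def fdr_q_values(observed_es, null_es):
    """Return (NES per set, FDR q value per set).

    observed_es : dict mapping gene set name -> ES(S)
    null_es     : dict mapping gene set name -> list of ES(S, pi)
    """
    nes = {s: normalize(observed_es[s], null_es[s]) for s in observed_es}
    # Pool the normalized null scores over all gene sets and permutations.
    null_nes = [normalize(x, null_es[s]) for s in null_es for x in null_es[s]]

    q = {}
    for s, star in nes.items():
        if star >= 0:
            null_side = [x for x in null_nes if x >= 0]
            obs_side = [x for x in nes.values() if x >= 0]
            null_hits = sum(1 for x in null_side if x >= star)
            obs_hits = sum(1 for x in obs_side if x >= star)
        else:
            null_side = [x for x in null_nes if x < 0]
            obs_side = [x for x in nes.values() if x < 0]
            null_hits = sum(1 for x in null_side if x <= star)
            obs_hits = sum(1 for x in obs_side if x <= star)
        null_frac = null_hits / len(null_side) if null_side else 0.0
        obs_frac = obs_hits / len(obs_side)  # obs_side always contains star
        q[s] = min(1.0, null_frac / obs_frac)  # cap at 1 (assumption)
    return nes, q


observed = {"A": 0.8, "B": 0.3, "C": -0.5}
null = {
    "A": [0.4, 0.2, -0.3, -0.1],
    "B": [0.3, 0.5, -0.2, -0.4],
    "C": [0.25, 0.35, -0.5, -0.3],
}
nes, q = fdr_q_values(observed, null)
```

In this toy input, set A has a normalized score larger than every pooled null score and therefore receives a q value of 0, whereas set B is matched by most of the positive null scores and receives a high q value.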
Notes
Author contributions: A.S., P.T., V.K.M., E.S.L., and J.P.M. designed research; A.S., P.T., V.K.M., E.S.L., and J.P.M. performed research; A.S., P.T., V.K.M., S.M., E.S.L., and J.P.M. contributed new reagents/analytic tools; A.S., P.T., V.K.M., B.L.E., M.A.G., T.R.G., E.S.L., and J.P.M. analyzed data; A.S., P.T., V.K.M., E.S.L., and J.P.M. wrote the paper; and A.P. and S.L.P. contributed data.
Abbreviations: ALL, acute lymphoid leukemia; AML, acute myeloid leukemia; ES, enrichment score; FDR, false discovery rate; GSEA, Gene Set Enrichment Analysis; MAPK, mitogen-activated protein kinase; MSigDB, Molecular Signature Database; NES, normalized enrichment score.
See Commentary on page 15278.
References
- 1. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995) Science 270, 467–470.
- 2. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., et al. (1996) Nat. Biotechnol. 14, 1675–1680.
- 3. Fortunel, N. O., Otu, H. H., Ng, H. H., Chen, J., Mu, X., Chevassut, T., Li, X., Joseph, M., Bailey, C., Hatzfeld, J. A., et al. (2003) Science 302, 393, author reply 393.
- 4. Mootha, V. K., Lindgren, C. M., Eriksson, K. F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. (2003) Nat. Genet. 34, 267–273.
- 5. Patti, M. E., Butte, A. J., Crunkhorn, S., Cusi, K., Berria, R., Kashyap, S., Miyazaki, Y., Kohane, I., Costello, M., Saccone, R., et al. (2003) Proc. Natl. Acad. Sci. USA 100, 8466–8471.
- 6. Petersen, K. F., Dufour, S., Befroy, D., Garcia, R. & Shulman, G. I. (2004) N. Engl. J. Med. 350, 664–671.
- 7. Hollander, M. & Wolfe, D. A. (1999) Nonparametric Statistical Methods (Wiley, New York).
- 8. Benjamini, Y., Drai, D., Elmer, G., Kafkafi, N. & Golani, I. (2001) Behav. Brain Res. 125, 279–284.
- 9. Reiner, A., Yekutieli, D. & Benjamini, Y. (2003) Bioinformatics 19, 368–375.
- 10. Lamb, J., Ramaswamy, S., Ford, H. L., Contreras, B., Martinez, R. V., Kittrell, F. S., Zahnow, C. A., Patterson, N., Golub, T. R. & Ewen, M. E. (2003) Cell 114, 323–334.
- 11. Xie, X., Lu, J., Kulbokas, E. J., Golub, T. R., Mootha, V., Lindblad-Toh, K., Lander, E. S. & Kellis, M. (2005) Nature 434, 338–345.
- 12. Plath, K., Mlynarczyk-Evans, S., Nusinow, D. A. & Panning, B. (2002) Annu. Rev. Genet. 36, 233–278.
- 13. Carrel, L., Cottle, A. A., Goglin, K. C. & Willard, H. F. (1999) Proc. Natl. Acad. Sci. USA 96, 14440–14444.
- 14. Disteche, C. M., Filippova, G. N. & Tsuchiya, K. D. (2002) Cytogenet. Genome Res. 99, 36–43.
- 15. Olivier, M., Eeles, R., Hollstein, M., Khan, M. A., Harris, C. C. & Hainaut, P. (2002) Hum. Mutat. 19, 607–614.
- 16. Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L., Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R. & Korsmeyer, S. J. (2002) Nat. Genet. 30, 41–47.
- 17. Zhao, N., Stoffel, A., Wang, P. W., Eisenbart, J. D., Espinosa, R., 3rd, Larson, R. A. & Le Beau, M. M. (1997) Proc. Natl. Acad. Sci. USA 94, 6948–6953.
- 18. Barbouti, A., Hoglund, M., Johansson, B., Lassen, C., Nilsson, P. G., Hagemeijer, A., Mitelman, F. & Fioretos, T. (2003) Cancer Res. 63, 1202–1206.
- 19. Tanaka, K., Arif, M., Eguchi, M., Guo, S. X., Hayashi, Y., Asaoku, H., Kyo, T., Dohy, H. & Kamada, N. (1999) Leukemia 13, 1367–1373.
- 20. Morelli, C., Karayianni, E., Magnanini, C., Mungall, A. J., Thorland, E., Negrini, M., Smith, D. I. & Barbanti-Brodano, G. (2002) Oncogene 21, 7266–7276.
- 21. Mrozek, K., Heerema, N. A. & Bloomfield, C. D. (2004) Blood Rev. 18, 115–136.
- 22. Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 13790–13795.
- 23. Beer, D. G., Kardia, S. L., Huang, C. C., Giordano, T. J., Levin, A. M., Misek, D. E., Lin, L., Chen, G., Gharib, T. G., Thomas, D. G., et al. (2002) Nat. Med. 8, 816–824.
- 24. Garber, M. E., Troyanskaya, O. G., Schluens, K., Petersen, S., Thaesler, Z., Pacyna-Gengelbach, M., van de Rijn, M., Rosen, G. D., Perou, C. M., Whyte, R. I., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 13784–13789.
- 25. Smith, L. L., Coller, H. A. & Roberts, J. M. (2003) Nat. Cell Biol. 5, 474–479.
- 26. Acker, T. & Plate, K. H. (2002) J. Mol. Med. 80, 562–575.
- 27. Peng, T., Golub, T. R. & Sabatini, D. M. (2002) Mol. Cell. Biol. 22, 5575–5584.
- 28. Boffa, D. J., Luan, F., Thomas, D., Yang, H., Sharma, V. K., Lagman, M. & Suthanthiran, M. (2004) Clin. Cancer Res. 10, 293–300.
- 29. Monti, S., Savage, K. J., Kutok, J. L., Feuerhake, F., Kurtin, P., Mihm, M., Wu, B., Pasqualucci, L., Neuberg, D., Aguiar, R. C., et al. (2004) Blood 105, 1851–1861.
- 30. Majumder, P. K., Febbo, P. G., Bikoff, R., Berger, R., Xue, Q., McMahon, L. M., Manola, J., Brugarolas, J., McDonnell, T. J., Golub, T. R., et al. (2004) Nat. Med. 10, 594–601.
- 31. Sweet-Cordero, A., Mukherjee, S., Subramanian, A., You, H., Roix, J. J., Ladd-Acosta, C., Mesirov, J., Golub, T. R. & Jacks, T. (2005) Nat. Genet. 37, 48–55.
- 32. Doniger, S. W., Salomonis, N., Dahlquist, K. D., Vranizan, K., Lawlor, S. C. & Conklin, B. R. (2003) Genome Biol. 4, R7.
- 33. Zhong, S., Storch, K. F., Lipan, O., Kao, M. C., Weitz, C. J. & Wong, W. H. (2004) Appl. Bioinformatics 3, 261–264.
- 34. Berriz, G. F., King, O. D., Bryant, B., Sander, C. & Roth, F. P. (2003) Bioinformatics 19, 2502–2504.