Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.
Journal: 2005/December - Proceedings of the National Academy of Sciences of the United States of America
ISSN: 0027-8424
Abstract:
Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.
Proc Natl Acad Sci U S A 102(43): 15545-15550

Broad Institute of Massachusetts Institute of Technology and Harvard, 320 Charles Street, Cambridge, MA 02141; Department of Systems Biology, Alpert 536, Harvard Medical School, 200 Longwood Avenue, Boston, MA 02446; Institute for Genome Sciences and Policy, Center for Interdisciplinary Engineering, Medicine, and Applied Sciences, Duke University, 101 Science Drive, Durham, NC 27708; Department of Medical Oncology, Dana–Farber Cancer Institute, 44 Binney Street, Boston, MA 02115; Division of Pulmonary and Critical Care Medicine, Massachusetts General Hospital, 55 Fruit Street, Boston, MA 02114; Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, C2-023, P.O. Box 19024, Seattle, WA 98109-1024; Department of Neurology, Enders 260, Children's Hospital, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115; Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02142; and Whitehead Institute for Biomedical Research, Massachusetts Institute of Technology, Cambridge, MA 02142
To whom correspondence may be addressed. E-mail: lander@broad.mit.edu or mesirov@broad.mit.edu.
A.S. and P.T. contributed equally to this work.
Contributed by Eric S. Lander, August 2, 2005

Freely available online through the PNAS open access option.


Keywords: microarray

Genomewide expression analysis with DNA microarrays has become a mainstay of genomics research (1, 2). The challenge no longer lies in obtaining gene expression profiles, but rather in interpreting the results to gain insights into biological mechanisms.

In a typical experiment, mRNA expression profiles are generated for thousands of genes from a collection of samples belonging to one of two classes, for example, tumors that are sensitive vs. resistant to a drug. The genes can be ordered in a ranked list L, according to their differential expression between the classes. The challenge is to extract meaning from this list.

A common approach involves focusing on a handful of genes at the top and bottom of L (i.e., those showing the largest difference) to discern telltale biological clues. This approach has a few major limitations.

(i) After correcting for multiple hypothesis testing, no individual gene may meet the threshold for statistical significance, because the relevant biological differences are modest relative to the noise inherent to the microarray technology.

(ii) Alternatively, one may be left with a long list of statistically significant genes without any unifying biological theme. Interpretation can be daunting and ad hoc, being dependent on a biologist's area of expertise.

(iii) Single-gene analysis may miss important effects on pathways. Cellular processes often affect sets of genes acting in concert. An increase of 20% in all genes encoding members of a metabolic pathway may dramatically alter the flux through the pathway and may be more important than a 20-fold increase in a single gene.

(iv) When different groups study the same biological system, the list of statistically significant genes from the two studies may show distressingly little overlap (3).

To overcome these analytical challenges, we recently developed a method called Gene Set Enrichment Analysis (GSEA) that evaluates microarray data at the level of gene sets. The gene sets are defined based on prior biological knowledge, e.g., published information about biochemical pathways or coexpression in previous experiments. The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L, in which case the gene set is correlated with the phenotypic class distinction.

We used a preliminary version of GSEA to analyze data from muscle biopsies from diabetics vs. healthy controls (4). The method revealed that genes involved in oxidative phosphorylation show reduced expression in diabetics, although the average decrease per gene is only 20%. The results from this study have been independently validated by other microarray studies (5) and by in vivo functional studies (6).

Given this success, we have developed GSEA into a robust technique for analyzing molecular profiling data. We studied its characteristics and performance and substantially revised and generalized the original method for broader applicability.

In this paper, we provide a full mathematical description of the GSEA methodology and illustrate its utility by applying it to several diverse biological problems. We have also created a software package, called gsea-p, and an initial inventory of gene sets (Molecular Signature Database, MSigDB), both of which are freely available.

For detailed results, see Table 4, which is published as supporting information on the PNAS web site.


Acknowledgments

We acknowledge discussions with or data from D. Altshuler, N. Patterson, J. Lamb, X. Xie, J.-Ph. Brunet, S. Ramaswamy, J.-P. Bourquin, B. Sellers, L. Sturla, C. Nutt, and J. C. Florez and comments from reviewers.


Appendix: Mathematical Description of Methods

Inputs to GSEA.

  1. Expression data set D with N genes and k samples.

  2. Ranking procedure to produce Gene List L. Includes a correlation (or other ranking metric) and a phenotype or profile of interest C. We use only one probe per gene to prevent overestimation of the enrichment statistic (Supporting Text; see also Table 8, which is published as supporting information on the PNAS web site).

  3. An exponent p to control the weight of the step.

  4. Independently derived Gene Set S of N_H genes (e.g., a pathway, a cytogenetic band, or a GO category). In the analyses above, we used only gene sets with at least 15 members to focus on robust signals (78% of MSigDB) (Table 3).
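As an illustration of input 2, the sketch below ranks genes by a signal-to-noise score between two phenotype classes. The choice of metric, the function names `signal_to_noise` and `rank_genes`, and the toy expression values are assumptions made for this example; the description above only requires a correlation or other ranking metric.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """Score one gene as (mean_0 - mean_1) / (sd_0 + sd_1).

    `labels` gives the class (0 or 1) of each sample. The metric is an
    assumed choice for illustration; each class needs at least two samples
    and the two standard deviations must not both be zero.
    """
    class0 = [v for v, c in zip(values, labels) if c == 0]
    class1 = [v for v, c in zip(values, labels) if c == 1]
    return (mean(class0) - mean(class1)) / (stdev(class0) + stdev(class1))

def rank_genes(expression, labels):
    """Return [(gene, score), ...] ordered from most positive to most negative."""
    scores = {gene: signal_to_noise(vals, labels) for gene, vals in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: three genes, three samples per class.
expression = {
    "geneA": [5.0, 6.0, 5.5, 1.0, 1.5, 1.2],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "geneC": [1.0, 1.2, 0.8, 4.0, 4.5, 4.2],
}
labels = [0, 0, 0, 1, 1, 1]
ranked = rank_genes(expression, labels)
```

The resulting list plays the role of L, with each gene carrying its score r_j.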

Enrichment Score ES(S).

  1. Rank order the N genes in D to form L = {g_1, ..., g_N} according to the correlation, r(g_j) = r_j, of their expression profiles with C.

  2. Evaluate the fraction of genes in S (“hits”) weighted by their correlation and the fraction of genes not in S (“misses”) present up to a given position i in L.

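$$P_{\mathrm{hit}}(S, i) = \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^p}{N_R}, \quad \text{where } N_R = \sum_{g_j \in S} |r_j|^p$$
[1]

$$P_{\mathrm{miss}}(S, i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H}$$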

The ES is the maximum deviation from zero of P_hit − P_miss. For a randomly distributed S, ES(S) will be relatively small, but if S is concentrated at the top or bottom of the list, or otherwise nonrandomly distributed, then ES(S) will be correspondingly high. When p = 0, ES(S) reduces to the standard Kolmogorov–Smirnov statistic; when p = 1, we are weighting the genes in S by their correlation with C normalized by the sum of the correlations over all of the genes in S. We set p = 1 for the examples in this paper. (See Fig. 7, which is published as supporting information on the PNAS web site.)
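The running-sum statistic described above can be sketched in a few lines of Python. This is a minimal illustration, not the gsea-p implementation: the function name `enrichment_score`, the input layout (a list of (gene, r) pairs already sorted by r), and the toy values are assumptions.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Return ES(S): the maximum deviation from zero of P_hit - P_miss.

    ranked   -- list of (gene, r) pairs, already sorted by r (assumed input form)
    gene_set -- the set S
    p        -- exponent weighting each hit by |r|**p (p = 0 gives equal steps)
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n, n_h = len(ranked), sum(hits)
    if n_h == 0 or n_h == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # N_R: total weight of the genes in S.
    n_r = sum(abs(r) ** p for (_, r), hit in zip(ranked, hits) if hit)
    if n_r == 0:
        raise ValueError("all genes in the set have zero weight")
    running, es = 0.0, 0.0
    for (_, r), hit in zip(ranked, hits):
        # Step up by the weighted hit fraction, or down by 1 / (N - N_H) on a miss.
        running += abs(r) ** p / n_r if hit else -1.0 / (n - n_h)
        if abs(running) > abs(es):
            es = running
    return es

# Hypothetical ranked list of six genes with their correlations.
ranked = [("g1", 3.0), ("g2", 2.0), ("g3", 1.0),
          ("g4", -1.0), ("g5", -2.0), ("g6", -3.0)]
es_top = enrichment_score(ranked, {"g1", "g2"})     # set at the top of L
es_bottom = enrichment_score(ranked, {"g5", "g6"})  # set at the bottom of L
```

In this toy list, a set at the very top gives a score near +1 and a set at the very bottom gives a score near −1; a set scattered through the list gives a score closer to zero.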

Estimating Significance. We assess the significance of an observed ES by comparing it with the set of scores ES_NULL computed with randomly assigned phenotypes.

  1. Randomly assign the original phenotype labels to samples, reorder genes, and re-compute ES(S).

  2. Repeat step 1 for 1,000 permutations, and create a histogram of the corresponding enrichment scores ES_NULL.

  3. Estimate nominal P value for S from ES_NULL by using the positive or negative portion of the distribution corresponding to the sign of the observed ES(S).
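The three steps above might be sketched as follows. The helper names `null_enrichment_scores` and `nominal_p_value` are hypothetical, and `es_for_labels` stands for a caller-supplied function (an assumption of this sketch) that re-ranks the genes for a given label assignment and returns ES(S).

```python
import random

def null_enrichment_scores(es_for_labels, labels, n_perm=1000, seed=0):
    """Steps 1 and 2: recompute ES(S) under shuffled phenotype labels.

    es_for_labels -- caller-supplied function mapping a label list to ES(S)
    seed          -- fixed only to make the sketch reproducible
    """
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # same labels, randomly reassigned to samples
        null.append(es_for_labels(shuffled))
    return null

def nominal_p_value(observed_es, null_es):
    """Step 3: compare the observed ES with the same-sign part of the null."""
    if observed_es >= 0:
        side = [es for es in null_es if es >= 0]
        extreme = sum(1 for es in side if es >= observed_es)
    else:
        side = [es for es in null_es if es < 0]
        extreme = sum(1 for es in side if es <= observed_es)
    if not side:
        return float("nan")  # no null scores share the observed sign
    return extreme / len(side)
```

Restricting the comparison to null scores of the same sign keeps a positive observed ES from being judged against the negative half of the histogram, and vice versa.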

Multiple Hypothesis Testing.

  1. Determine ES(S) for each gene set in the collection or database.

  2. For each S and 1,000 fixed permutations π of the phenotype labels, reorder the genes in L and determine ES(S, π).

  3. Adjust for variation in gene set size. Normalize the ES(S, π) and the observed ES(S), separately rescaling the positive and negative scores by dividing by the mean of the ES(S, π) to yield the normalized scores NES(S, π) and NES(S) (see Supporting Text).

  4. Compute FDR. Control the ratio of false positives to the total number of gene sets attaining a fixed level of significance separately for positive (negative) NES(S) and NES(S, π).

Create a histogram of all NES(S, π) over all S and π. Use this null distribution to compute an FDR q value for a given NES(S) = NES* ≥ 0. The FDR is the percentage of all (S, π) with NES(S, π) ≥ 0 whose NES(S, π) ≥ NES*, divided by the percentage of observed S with NES(S) ≥ 0 whose NES(S) ≥ NES*. The case NES(S) = NES* ≤ 0 is handled similarly.
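A rough Python sketch of steps 3 and 4 follows. The function names `normalize` and `fdr_q_value` are hypothetical, and two details are assumptions of this sketch rather than statements from the text: negative scores are divided by the absolute value of the negative-side mean so that they keep their sign, and the q value is capped at 1.

```python
from statistics import mean

def normalize(es, null_es):
    """Rescale one ES by the mean of the same-sign null scores for its gene set.

    Assumes the same-sign part of null_es is non-empty with a nonzero mean.
    """
    if es >= 0:
        scale = mean([x for x in null_es if x >= 0])
    else:
        # Assumption: use the magnitude of the mean so the sign is preserved.
        scale = abs(mean([x for x in null_es if x < 0]))
    return es / scale

def fdr_q_value(nes_star, observed_nes, null_nes):
    """Ratio of null to observed tail fractions on the side matching nes_star.

    observed_nes -- NES(S) for every gene set (should include nes_star)
    null_nes     -- NES(S, pi) pooled over all gene sets and permutations
    """
    if nes_star >= 0:
        null_side = [x for x in null_nes if x >= 0]
        obs_side = [x for x in observed_nes if x >= 0]
        null_frac = sum(1 for x in null_side if x >= nes_star) / len(null_side)
        obs_frac = sum(1 for x in obs_side if x >= nes_star) / len(obs_side)
    else:
        null_side = [x for x in null_nes if x <= 0]
        obs_side = [x for x in observed_nes if x <= 0]
        null_frac = sum(1 for x in null_side if x <= nes_star) / len(null_side)
        obs_frac = sum(1 for x in obs_side if x <= nes_star) / len(obs_side)
    # Capping at 1 is a convenience added here, not part of the description.
    return min(1.0, null_frac / obs_frac)

# Hypothetical normalized scores.
observed = [2.0, 1.5, 0.8, -1.2, -0.5]
null = [0.5, 1.0, 1.6, 2.2, 0.3, 0.9, 1.1, 0.7, 1.4, 0.2,
        -0.4, -1.0, -1.3, -0.6]
q = fdr_q_value(1.5, observed, null)
```

In the toy data, 2 of the 10 non-negative null scores and 2 of the 3 non-negative observed scores reach 1.5, so the ratio is 0.2 / (2/3) = 0.3.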


Notes

Author contributions: A.S., P.T., V.K.M., E.S.L., and J.P.M. designed research; A.S., P.T., V.K.M., E.S.L., and J.P.M. performed research; A.S., P.T., V.K.M., S.M., E.S.L., and J.P.M. contributed new reagents/analytic tools; A.S., P.T., V.K.M., B.L.E., M.A.G., T.R.G., E.S.L., and J.P.M. analyzed data; A.S., P.T., V.K.M., E.S.L., and J.P.M. wrote the paper; and A.P. and S.L.P. contributed data.

Abbreviations: ALL, acute lymphoid leukemia; AML, acute myeloid leukemia; ES, enrichment score; FDR, false discovery rate; GSEA, Gene Set Enrichment Analysis; MAPK, mitogen-activated protein kinase; MSigDB, Molecular Signature Database; NES, normalized enrichment score.

See Commentary on page 15278.


References

  • 1. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995) Science 270, 467–470.
  • 2. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., et al. (1996) Nat. Biotechnol. 14, 1675–1680.
  • 3. Fortunel, N. O., Otu, H. H., Ng, H. H., Chen, J., Mu, X., Chevassut, T., Li, X., Joseph, M., Bailey, C., Hatzfeld, J. A., et al. (2003) Science 302, 393, author reply 393.
  • 4. Mootha, V. K., Lindgren, C. M., Eriksson, K. F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. (2003) Nat. Genet. 34, 267–273.
  • 5. Patti, M. E., Butte, A. J., Crunkhorn, S., Cusi, K., Berria, R., Kashyap, S., Miyazaki, Y., Kohane, I., Costello, M., Saccone, R., et al. (2003) Proc. Natl. Acad. Sci. USA 100, 8466–8471.
  • 6. Petersen, K. F., Dufour, S., Befroy, D., Garcia, R. & Shulman, G. I. (2004) N. Engl. J. Med. 350, 664–671.
  • 7. Hollander, M. & Wolfe, D. A. (1999) Nonparametric Statistical Methods (Wiley, New York).
  • 8. Benjamini, Y., Drai, D., Elmer, G., Kafkafi, N. & Golani, I. (2001) Behav. Brain Res. 125, 279–284.
  • 9. Reiner, A., Yekutieli, D. & Benjamini, Y. (2003) Bioinformatics 19, 368–375.
  • 10. Lamb, J., Ramaswamy, S., Ford, H. L., Contreras, B., Martinez, R. V., Kittrell, F. S., Zahnow, C. A., Patterson, N., Golub, T. R. & Ewen, M. E. (2003) Cell 114, 323–334.
  • 11. Xie, X., Lu, J., Kulbokas, E. J., Golub, T. R., Mootha, V., Lindblad-Toh, K., Lander, E. S. & Kellis, M. (2005) Nature 434, 338–345.
  • 12. Plath, K., Mlynarczyk-Evans, S., Nusinow, D. A. & Panning, B. (2002) Annu. Rev. Genet. 36, 233–278.
  • 13. Carrel, L., Cottle, A. A., Goglin, K. C. & Willard, H. F. (1999) Proc. Natl. Acad. Sci. USA 96, 14440–14444.
  • 14. Disteche, C. M., Filippova, G. N. & Tsuchiya, K. D. (2002) Cytogenet. Genome Res. 99, 36–43.
  • 15. Olivier, M., Eeles, R., Hollstein, M., Khan, M. A., Harris, C. C. & Hainaut, P. (2002) Hum. Mutat. 19, 607–614.
  • 16. Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L., Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R. & Korsmeyer, S. J. (2002) Nat. Genet. 30, 41–47.
  • 17. Zhao, N., Stoffel, A., Wang, P. W., Eisenbart, J. D., Espinosa, R., 3rd, Larson, R. A. & Le Beau, M. M. (1997) Proc. Natl. Acad. Sci. USA 94, 6948–6953.
  • 18. Barbouti, A., Hoglund, M., Johansson, B., Lassen, C., Nilsson, P. G., Hagemeijer, A., Mitelman, F. & Fioretos, T. (2003) Cancer Res. 63, 1202–1206.
  • 19. Tanaka, K., Arif, M., Eguchi, M., Guo, S. X., Hayashi, Y., Asaoku, H., Kyo, T., Dohy, H. & Kamada, N. (1999) Leukemia 13, 1367–1373.
  • 20. Morelli, C., Karayianni, E., Magnanini, C., Mungall, A. J., Thorland, E., Negrini, M., Smith, D. I. & Barbanti-Brodano, G. (2002) Oncogene 21, 7266–7276.
  • 21. Mrozek, K., Heerema, N. A. & Bloomfield, C. D. (2004) Blood Rev. 18, 115–136.
  • 22. Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 13790–13795.
  • 23. Beer, D. G., Kardia, S. L., Huang, C. C., Giordano, T. J., Levin, A. M., Misek, D. E., Lin, L., Chen, G., Gharib, T. G., Thomas, D. G., et al. (2002) Nat. Med. 8, 816–824.
  • 24. Garber, M. E., Troyanskaya, O. G., Schluens, K., Petersen, S., Thaesler, Z., Pacyna-Gengelbach, M., van de Rijn, M., Rosen, G. D., Perou, C. M., Whyte, R. I., et al. (2001) Proc. Natl. Acad. Sci. USA 98, 13784–13789.
  • 25. Smith, L. L., Coller, H. A. & Roberts, J. M. (2003) Nat. Cell Biol. 5, 474–479.
  • 26. Acker, T. & Plate, K. H. (2002) J. Mol. Med. 80, 562–575.
  • 27. Peng, T., Golub, T. R. & Sabatini, D. M. (2002) Mol. Cell. Biol. 22, 5575–5584.
  • 28. Boffa, D. J., Luan, F., Thomas, D., Yang, H., Sharma, V. K., Lagman, M. & Suthanthiran, M. (2004) Clin. Cancer Res. 10, 293–300.
  • 29. Monti, S., Savage, K. J., Kutok, J. L., Feuerhake, F., Kurtin, P., Mihm, M., Wu, B., Pasqualucci, L., Neuberg, D., Aguiar, R. C., et al. (2004) Blood 105, 1851–1861.
  • 30. Majumder, P. K., Febbo, P. G., Bikoff, R., Berger, R., Xue, Q., McMahon, L. M., Manola, J., Brugarolas, J., McDonnell, T. J., Golub, T. R., et al. (2004) Nat. Med. 10, 594–601.
  • 31. Sweet-Cordero, A., Mukherjee, S., Subramanian, A., You, H., Roix, J. J., Ladd-Acosta, C., Mesirov, J., Golub, T. R. & Jacks, T. (2005) Nat. Genet. 37, 48–55.
  • 32. Doniger, S. W., Salomonis, N., Dahlquist, K. D., Vranizan, K., Lawlor, S. C. & Conklin, B. R. (2003) Genome Biol. 4, R7.
  • 33. Zhong, S., Storch, K. F., Lipan, O., Kao, M. C., Weitz, C. J. & Wong, W. H. (2004) Appl. Bioinformatics 3, 261–264.
  • 34. Berriz, G. F., King, O. D., Bryant, B., Sander, C. & Roth, F. P. (2003) Bioinformatics 19, 2502–2504.