Metagenes and molecular pattern discovery using matrix factorization.
Journal: 2004/May - Proceedings of the National Academy of Sciences of the United States of America
ISSN: 0027-8424
Abstract:
We describe here the use of nonnegative matrix factorization (NMF), an algorithm based on decomposition by parts that can reduce the dimension of expression data from thousands of genes to a handful of metagenes. Coupled with a model selection mechanism, adapted to work for any stochastic clustering algorithm, NMF is an efficient method for identification of distinct molecular patterns and provides a powerful method for class discovery. We demonstrate the ability of NMF to recover meaningful biological information from cancer-related microarray data. NMF appears to have advantages over other methods such as hierarchical clustering or self-organizing maps. We found it less sensitive to a priori selection of genes or initial conditions and able to detect alternative or context-dependent patterns of gene expression in complex biological systems. This ability, similar to semantic polysemy in text, provides a general method for robust molecular pattern discovery.
Proc Natl Acad Sci U S A 101(12): 4164-4169


The Eli and Edythe L. Broad Institute, Massachusetts Institute of Technology and Harvard University, 320 Charles Street, Cambridge, MA 02141; and Dana–Farber Cancer Institute and Harvard Medical School, 44 Binney Street, Boston, MA 02115
To whom correspondence should be addressed. E-mail: mesirov@broad.mit.edu.
Communicated by Eric S. Lander, Massachusetts Institute of Technology, Cambridge, MA, December 20, 2003
Received 2003 Nov 1


With the advent of DNA microarrays, it is now possible to simultaneously monitor expression of all genes in the genome. Increasingly, the challenge is to interpret such data to gain insight into biological processes and the mechanisms of human disease.

Various methods have been developed for clustering genes or samples that show similar expression patterns (1–5). However, these methods have serious limitations in their ability to capture the full structure inherent in the data. They typically focus on the predominant structures in a data set and fail to capture alternative structures and local behavior.

Hierarchical clustering (HC) is a frequently used and valuable approach. It has been successfully used to analyze temporal expression patterns (1), to predict patient outcome among lymphoma patients (2), and to provide molecular portraits of breast tumors (3). However, HC has the disadvantages that it imposes a stringent tree structure on the data, is highly sensitive to the metric used to assess similarity, and typically requires subjective evaluation to define clusters. Self-organizing maps (SOM) provide another powerful approach (4). They have been successfully used in similar applications, including identification of pathways involved in differentiation of hematopoietic cells and recognition of subtypes of leukemia (5). SOMs, however, can be unstable, yielding different decompositions of the data depending on the choice of initial conditions. Recently, various dimensionality reduction and matrix decomposition methods have been introduced (6–8). However, many questions remain to be resolved about such methods. These include the key issue of model selection (that is, how to select the dimensionality of the reduced representation) and the accuracy and robustness of the representation.

Here, we describe a technique for extracting relevant biological correlations, or “molecular logic,” in gene expression data. The method is designed to capture alternative structures inherent in the data and, by organizing both the genes and samples, to provide biological insight. The method is based on nonnegative matrix factorization (NMF). Lee and Seung (9) introduced NMF in its modern formulation as a method to decompose images. In this context, NMF yielded a decomposition of human faces into parts reminiscent of features such as eyes, nose, etc. By contrast, they noted that the application of traditional factorization methods, such as principal component analysis, to image data yielded components with no obvious visual interpretation. When applied to text, NMF gave some evidence of differentiating meanings of the same word depending on context (semantic polysemy) (9).

Here, we use NMF to describe the tens of thousands of genes in a genome in terms of a small number of metagenes. Samples can then be analyzed by summarizing their gene expression patterns in terms of expression patterns of the metagenes. The metagenes provide an interesting decomposition of genes, analogous to facial features in Lee and Seung's work (9) on images. The metagene expression patterns provide a robust clustering of samples. Importantly, we also introduce a methodology for model selection that highlights alternative decompositions and assesses their robustness.
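The decomposition described above can be illustrated with a minimal sketch. It assumes a nonnegative genes-by-samples matrix A and uses the Euclidean-distance multiplicative updates of Lee and Seung (10), one commonly used variant; this section does not state which update rule the authors used. The function names (nmf, assign_clusters), the column normalization, and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def nmf(A, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a nonnegative genes x samples matrix A as W @ H.

    W (genes x k) holds the k metagenes; H (k x samples) holds the
    expression pattern of each metagene across the samples.
    Assumption: Euclidean-distance multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = A.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    # Rescale so each metagene sums to 1; this leaves W @ H unchanged
    # and makes the rows of H comparable (an illustrative choice).
    scale = W.sum(axis=0)
    return W / scale, H * scale[:, None]

def assign_clusters(H):
    """Assign each sample to the metagene with the largest coefficient."""
    return H.argmax(axis=0)

# Toy data (hypothetical): 40 genes, 10 samples, two sample groups.
rng = np.random.default_rng(1)
A = rng.random((40, 10)) * 0.1
A[:20, :5] += 1.0   # genes 0-19 are high in samples 0-4
A[20:, 5:] += 1.0   # genes 20-39 are high in samples 5-9

W, H = nmf(A, k=2)
labels = assign_clusters(H)
rel_error = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Here the 40 genes are summarized by 2 metagenes, and each sample is grouped by its dominant metagene. Real expression data would need to be made nonnegative before factorization.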

We apply NMF and our model selection criterion to the problem of elucidating cancer subtypes by clustering tumor samples. We are able to demonstrate multiple robust decompositions of leukemia and brain cancer data sets.
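One way to picture how robustness of a stochastic clustering can be assessed is to repeat the clustering from different random starting points and record how often each pair of samples lands in the same cluster. The sketch below is an assumption-laden illustration, not the authors' procedure, which this section does not detail. The names nmf_labels, consensus_matrix and dispersion are hypothetical, and the dispersion score is simply a convenient summary chosen for this example.

```python
import numpy as np

def nmf_labels(A, k, seed, n_iter=300, eps=1e-9):
    """Cluster samples with a small NMF (Euclidean multiplicative updates)."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    # Fold the scale of each metagene into H so its rows are comparable.
    H = H * W.sum(axis=0)[:, None]
    return H.argmax(axis=0)

def consensus_matrix(A, k, cluster_fn, n_runs=20):
    """Average the sample-by-sample co-clustering indicator over runs.

    cluster_fn(A, k, seed) may be any stochastic clustering routine
    that returns one integer label per sample.
    """
    n = A.shape[1]
    C = np.zeros((n, n))
    for seed in range(n_runs):
        labels = cluster_fn(A, k, seed)
        C += (labels[:, None] == labels[None, :])
    return C / n_runs

def dispersion(C):
    """Illustrative stability score in [0, 1]; 1 means all entries are 0 or 1."""
    return float(np.mean(4.0 * (C - 0.5) ** 2))

# Toy data (hypothetical): two clearly separated groups of samples.
rng = np.random.default_rng(1)
A = rng.random((40, 10)) * 0.1
A[:20, :5] += 1.0
A[20:, 5:] += 1.0

C = consensus_matrix(A, k=2, cluster_fn=nmf_labels)
score = dispersion(C)
```

Entries of C near 0 or 1 indicate that the grouping does not depend on the random start. Comparing such a score across several values of k is one possible way to choose the number of metagenes under these assumptions.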


Acknowledgments

We acknowledge useful discussions with members of the Cancer Genomics program (The Eli and Edythe L. Broad Institute, Massachusetts Institute of Technology and Harvard University), in particular Stefano Monti. This work was funded by grants from the National Institutes of Health. J.-Ph.B. is funded by an Informatics Fellowship grant from AstraZeneca.


Notes

Abbreviations: NMF, nonnegative matrix factorization; HC, hierarchical clustering; SOM, self-organizing maps; AML, acute myelogenous leukemia; ALL, acute lymphoblastic leukemia.


References

  • 1. Eisen, M., Spellman, P., Brown, P. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863–14868.
  • 2. Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., et al. (2000) Nature 403, 503–511.
  • 3. Perou, C. M., Sorlie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Rees, C. A., Pollack, J. R., Ross, D. T., Johnsen, H., Akslen, L. A., et al. (2000) Nature 406, 747–752.
  • 4. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Dmitrovsky, E., Lander, E. S. & Golub, T. R. (1999) Proc. Natl. Acad. Sci. USA 96, 2907–2912.
  • 5. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. (1999) Science 286, 531–537.
  • 6. Moloshok, T. D., Klevecz, R. R., Grant, J. D., Manion, F. J., Speier, W. F. 4th, & Ochs, M. F. (2002) Bioinformatics 18, 566–575.
  • 7. Gasch, A. P. & Eisen, M. B. (2002) Genome Biol. 3, research0059.1–0059.22.
  • 8. Alter, O., Brown, P. O. & Botstein, D. (2000) Proc. Natl. Acad. Sci. USA 97, 10101–10106.
  • 9. Lee, D. D. & Seung, H. S. (1999) Nature 401, 788–793.
  • 10. Lee, D. D. & Seung, H. S. (2001) Adv. Neural Info. Proc. Syst. 13, 556–562.
  • 11. Monti, S., Tamayo, P., Golub, T. R. & Mesirov, J. P. (2003) Machine Learn. J. 52, 91–118.
  • 12. Slonim, D. K., Tamayo, P., Mesirov, J. P., Golub, T. R. & Lander, E. S. (2000) Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, RECOMB 2000, 263–272.
  • 13. Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., et al. (2002) Nature 415, 436–442.
  • 14. Kim, P. M. & Tidor, B. (2003) Genome Res. 13, 1706–1718.
  • 15. Heger, A. & Holm, L. (2003) Bioinformatics 19, Suppl., i130–i137.