Multiclass cancer diagnosis using tumor gene expression signatures
Abstract
The optimal treatment of patients with cancer depends on establishing accurate diagnoses by using a complex combination of clinical and histopathological data. In some instances, this task is difficult or impossible because of atypical clinical presentation or histopathology. To determine whether the diagnosis of multiple common adult malignancies could be achieved purely by molecular classification, we subjected 218 tumor samples, spanning 14 common tumor types, and 90 normal tissue samples to oligonucleotide microarray gene expression analysis. The expression levels of 16,063 genes and expressed sequence tags were used to evaluate the accuracy of a multiclass classifier based on a support vector machine algorithm. Overall classification accuracy was 78%, far exceeding the accuracy of random classification (9%). Poorly differentiated cancers resulted in low-confidence predictions and could not be accurately classified according to their tissue of origin, indicating that they are molecularly distinct entities with dramatically different gene expression patterns compared with their well differentiated counterparts. Taken together, these results demonstrate the feasibility of accurate, multiclass molecular cancer classification and suggest a strategy for future clinical implementation of molecular cancer diagnostics.
Cancer classification relies on the subjective interpretation of both clinical and histopathological information with an eye toward placing tumors in currently accepted categories based on the tissue of origin of the tumor. However, clinical information can be incomplete or misleading. In addition, there is a wide spectrum in cancer morphology and many tumors are atypical or lack morphologic features that are useful for differential diagnosis (1). These difficulties can result in diagnostic confusion, prompting calls for mandatory second opinions in all surgical pathology cases (2). In the aggregate, these are significant limitations that may hinder patient care, add expense, and confound the results of clinical trials.
Molecular diagnostics offer the promise of precise, objective, and systematic human cancer classification, but these tests are not widely applied because characteristic molecular markers for most solid tumors have yet to be identified (3). Recently, DNA microarray-based tumor gene expression profiles have been used for cancer diagnosis. However, studies have been limited to a few cancer types and have spanned multiple technology platforms, complicating comparison among different datasets (4–10). The feasibility of cancer diagnosis across all of the common malignancies based on a single reference database has not been explored. In addition, comprehensive gene expression databases have yet to be developed, and there are no established analytical methods capable of solving complex, multiclass, gene expression-based classification problems.
To address these challenges, we created a gene expression database containing the expression profiles of 218 tumor samples representing 14 common human cancer classes. By using a multiclass classifier based on a support vector machine algorithm, we demonstrate that accurate multiclass cancer classification is indeed possible, suggesting the feasibility of molecular cancer diagnosis by means of comparison with a comprehensive and commonly accessible catalog of gene expression profiles.
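The abbreviations list names a one-vs-all (OVA) scheme and support vector machines (SVM), and the abstract refers to low-confidence predictions. As a minimal sketch of how an OVA scheme could combine per-class binary outputs into a single multiclass call, consider the Python below. The function name `ova_predict`, the example scores, and the choice of "gap between the top two scores" as the confidence measure are all assumptions made for illustration; this section does not state the exact combination rule or confidence definition used in the study.

```python
from typing import Dict, Tuple


def ova_predict(scores: Dict[str, float]) -> Tuple[str, float]:
    """Combine one-vs-all (OVA) binary classifier outputs into one call.

    `scores` maps each class label to the real-valued output of that
    class's binary classifier (positive means "this class", negative
    means "all other classes").

    Returns the label with the highest score and a confidence value,
    defined here (as an assumption) as the gap between the highest and
    second-highest scores.
    """
    if len(scores) < 2:
        raise ValueError("need scores for at least two classes")
    # Rank classes from highest to lowest classifier output.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_label, best_score - second_score


# Hypothetical classifier outputs for a single tumor sample
# (illustrative values only, not data from the study).
example_scores = {
    "breast": -0.8,
    "prostate": 1.5,
    "lung": -0.25,
    "colorectal": -1.0,
}
label, confidence = ova_predict(example_scores)
print(label, confidence)
```

Under this sketch, a sample for which no single classifier clearly wins yields a small gap, which is one way a low-confidence prediction could be flagged for further review.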
Acknowledgments
We thank Scott Pomeroy, Margaret Shipp, Raphael Bueno, Kevin Loughlin, and Phil Febbo for contributing tumor samples to this study. We thank David Waltregny for initial review of pathology, Christine Huard and Michelle Gaasenbeek for expert technical assistance, and Leslie Gaffney for insightful editorial review. We are also indebted to members of the Cancer Genomics Group (Whitehead/Massachusetts Institute of Technology Center for Genome Research) and the Golub Laboratory (Dana–Farber Cancer Institute) for many valuable discussions. This work was supported in part by a Harvard/National Institutes of Health training grant in Molecular Hematology (S.R.) and by grants from Affymetrix, Millennium Pharmaceuticals (Cambridge, MA), and Bristol-Myers Squibb (E.S.L.).
Abbreviations
| Abbreviation | Definition |
| --- | --- |
| SVM | support vector machine |
| OVA | one vs. all |
| S2N | signal to noise |
Note Added in Proof.
Recently, Su et al. (30) also reported using human tumor gene expression profiles to distinguish a number of carcinoma classes.
References
- 1. Ramaswamy S, Osteen R T, Shulman L N. In: Clinical Oncology. Lenhard R E, Osteen R T, Gansler T, editors. Atlanta: Am. Cancer Soc.; 2001. pp. 711–719.
- 2. Tomaszewski J E, LiVolsi V A. Cancer. 1999;86:2198–2200.
- 3. Connolly J L, Schnitt S J, Wang H H, Dvorak A M, Dvorak H F. In: Cancer Medicine. Holland J F, Frei E, Bast R C, Kufe D W, Morton D L, Weichselbaum R R, editors. Baltimore: Williams & Wilkins; 1997. pp. 533–555.
- 4. Golub T R, Slonim D K, Tamayo P, Huard C, Gaasenbeek M, Mesirov J P, Coller H, Loh M L, Downing J R, Caligiuri M A, et al. Science. 1999;286:531–537.
- 5. Alizadeh A A, Eisen M B, Davis R E, Ma C, Lossos I S, Rosenwald A, Boldrick J C, Sabet H, Tran T, Yu X, et al. Nature (London). 2000;403:503–511.
- 6. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, et al. Nature (London). 2000;406:536–540.
- 7. Perou C M, Sorlie T, Eisen M B, van de Rijn M, Jeffrey S S, Rees C A, Pollack J R, Ross D T, Johnsen H, Akslen L A, et al. Nature (London). 2000;406:747–752.
- 8. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi O P, Wilfond B, et al. N Engl J Med. 2001;344:539–548.
- 9. Khan J, Wei J S, Ringner M, Saal L H, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C R, Peterson C, et al. Nat Med. 2001;7:673–679.
- 10. Dhanasekaran S M, Barrette T R, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta K J, Rubin M A, Chinnaiyan A M. Nature. 2001;412:822–826.
- 11. Eisen M B, Spellman P T, Brown P O, Botstein D. Proc Natl Acad Sci USA. 1998;95:14863–14868.
- 12. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E S, Golub T R. Proc Natl Acad Sci USA. 1999;96:2907–2912.
- 13. Guyon I, Weston J, Barnhill S, Vapnik V. Mach Learn. 2002; in press.
- 14. Hair J F, Anderson R E, Tatham R L, Black W C. Multivariate Data Analysis. Englewood Cliffs, NJ: Prentice–Hall; 1998.
- 15. Slonim D K. Proceedings of the Fourth Annual International Conference on Computational Molecular Biology. Tokyo: Universal Acad. Press; 2000. pp. 263–272.
- 16. Dasarathy V B. NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Comp. Soc. Press; 1991.
- 17. Brown M P, Grundy W N, Lin D, Christianini N, Sugnet C W, Furey T S, Ares M, Haussler D. Proc Natl Acad Sci USA. 2000;97:262–267.
- 18. Furey T, Christianini N, Duffy N, Bednarski D W, Schummer M, Haussler D. Bioinformatics. 2000;16:906–914.
- 19. Vapnik V N. Statistical Learning Theory. New York: Wiley; 1998.
- 20. Evgeniou T, Pontil M, Poggio T. Adv Comput Math. 2000;13:1–50.
- 21. Hainsworth J D, Greco F A. N Engl J Med. 1993;329:257–263.
- 22. Chapelle O, Vapnik V, Bousquet O, Mukherjee S. Mach Learn. 2002; in press.
- 23. Hastie T, Tibshirani R, Eisen M B, Alizadeh A, Levy R, Staudt L, Chan W C, Botstein D, Brown P. Genome Biol. 2000;1:RESEARCH003.
- 24. Taipale J, Beachy P A. Nature (London). 2001;411:349–354.
- 25. Lickert H, Domon C, Huls G, Wehrle C, Duluc I, Clevers H, Meyer B I, Freund J N, Kemler R. Development (Cambridge, UK). 2000;127:3805–3813.
- 26. Ziemer L T, Pennica D, Levine A J. Mol Cell Biol. 2001;21:562–574.
- 27. Bienz M, Clevers H. Cell. 2000;103:311–320.
- 28. Scherf U, Ross D T, Waltham M, Smith L H, Lee J K, Tanabe L, Kohn K W, Reinhold W C, Myers T G, Andrews D T, et al. Nat Genet. 2000;24:236–244.
- 29. Staunton J E, Slonim D K, Coller H A, Tamayo P, Angelo M J, Park J, Scherf U, Lee J K, Reinhold W O, Weinstein J N, et al. Proc Natl Acad Sci USA. 2001;98:10787–10792.
- 30. Su A I, Welsh J B, Sapinoso L M, Kern S G, Dimitrov P, Lapp H, Schultz P G, Powell S M, Moskaluk C A, Frierson H F, Jr, Hampton G M. Cancer Res. 2001;61:7388–7393.




