Proc Natl Acad Sci U S A 102(7): 2454-2459

Fast and reliable prediction of noncoding RNAs

Department of Theoretical Chemistry and Structural Biology, University of Vienna, Währingerstrasse 17, A-1090 Wien, Austria; and Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany

To whom correspondence should be addressed. E-mail: studla@bioinf.uni-leipzig.de.

Communicated by Hans Frauenfelder, Los Alamos National Laboratory, Los Alamos, NM, December 14, 2004

Received 2004 Nov 2

Abstract

We report an efficient method for detecting functional RNAs. The approach, which combines comparative sequence analysis and structure prediction, already has yielded excellent results for a small number of aligned sequences and is suitable for large-scale genomic screens. It consists of two basic components: (i) a measure for RNA secondary structure conservation based on computing a consensus secondary structure, and (ii) a measure for thermodynamic stability, which, in the spirit of a z score, is normalized with respect to both sequence length and base composition but can be calculated without sampling from shuffled sequences. Functional RNA secondary structures can be identified in multiple sequence alignments with high sensitivity and high specificity. We demonstrate that this approach is not only much more accurate than previous methods but also significantly faster. The method is implemented in the program rnaz, which can be downloaded from www.tbi.univie.ac.at/~wash/RNAz. We screened all alignments of length n ≥ 50 in the Comparative Regulatory Genomics database, which compiles conserved noncoding elements in upstream regions of orthologous genes from human, mouse, rat, Fugu, and zebrafish. We recovered all of the known noncoding RNAs and cis-acting elements with high significance and found compelling evidence for many other conserved RNA secondary structures not described so far to our knowledge.

Keywords: comparative genomics, conserved RNA secondary structure

Traditionally, the role of RNA in the cell was considered mostly in the context of protein gene expression, limiting RNA to its function as mRNA, tRNA, and rRNA. The discovery of a diverse array of transcripts that are not translated to proteins but rather function as RNAs has changed this view profoundly (1–3). Noncoding RNAs (ncRNAs) are involved in a large variety of processes, including gene regulation (4), maturation of mRNAs, rRNAs, and tRNAs, or X-chromosome inactivation in mammals (5). In fact, a large fraction of the mouse transcriptome consists of ncRNAs (6), and about half of the transcripts from human chromosomes 21 and 22 are noncoding (7, 8). Structured RNA motifs furthermore function as cis-acting regulatory elements within protein-coding genes. Also in this context, new intriguing mechanisms are being discovered (9).

Hence, a comprehensive understanding of cellular processes is impossible without considering RNAs as key players. Efficient identification of functional RNAs (ncRNAs as well as cis-acting elements) in genomic sequences is, therefore, one of the major goals of current bioinformatics. Despite its biological relevance, de novo prediction is still a largely unsolved problem. Unlike protein-coding genes, functional RNAs lack common statistical signals in their primary sequence that could be exploited by reliable detection algorithms. Many functional RNAs, however, depend on a defined secondary structure. In particular, evolutionary conservation of secondary structures serves as compelling evidence for biologically relevant RNA function. Comparative studies therefore seem to be the most promising approach. To date, complete genomes of related species have been sequenced for almost all genetic model organisms, including bacteria (10, 11), yeasts (12), nematodes (13, 14), and even mammals (15–17). Several studies (18–21) have identified a large collection of evolutionarily conserved noncoding elements in mammalian (or, more generally, vertebrate) genomes, and it must be expected that a significant fraction of them are functional RNAs.

Possible candidates, however, have been identified only sporadically so far (19, 21), simply because there are no reliable tools to scan multiple sequence alignments for functional RNAs. The most widely used program, qrna (22), which has been successfully used to identify ncRNAs in bacteria (23) and yeast (24), is not suitable for screens of large genomes. qrna is limited to pairwise alignments, and its reliability is low, especially if the evolutionary distance between the two sequences lies outside the optimal range. An alternative approach, ddbrna (25), suffers from similar problems and so far has not been used in a real-life application. msari (26), on the other hand, gains its drastically enhanced accuracy from the large amount of information contained in large multiple sequence alignments of 10–15 sequences with high sequence diversity. At present, however, data sets of this kind are not available at a genomewide scale, at least for multicellular organisms.

In this article we address the problem by using an alternative approach: we combine a measure for thermodynamic stability with a measure for structure conservation. Using a combination of both scores we are able to efficiently detect functional RNAs in multiple sequence alignments of only a few sequences. Our method is substantially more accurate than qrna or ddbrna and performs better on pairwise alignments than msari does on alignments with 15 sequences. On the large, diverse alignments used for testing msari in ref. 26, our rnaz program achieved 100% sensitivity at 100% specificity.
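The two scores can be illustrated with a minimal sketch. It assumes the classical sampling-based z score, z = (m − μ)/σ, where m is the minimum free energy (MFE) of the native sequence and μ and σ are the mean and standard deviation of the MFEs of shuffled sequences. The method reported here is designed to avoid this sampling step, so the code is shown for intuition only. The structure conservation index (SCI) is assumed to be the ratio of the consensus folding energy of the alignment to the mean MFE of the individual sequences. The function names and all energy values are hypothetical, and no folding program is called.

```python
from statistics import mean, stdev


def z_score(native_mfe, shuffled_mfes):
    """Classical z score of a native MFE against a shuffled background.

    Negative values mean the native sequence folds more stably than
    shuffled sequences of the same length and base composition.
    """
    mu = mean(shuffled_mfes)
    sigma = stdev(shuffled_mfes)  # sample standard deviation
    return (native_mfe - mu) / sigma


def structure_conservation_index(consensus_energy, single_mfes):
    """Assumed SCI: consensus energy divided by the mean single-sequence MFE.

    Values near 1 suggest the sequences share a common fold; values near 0
    suggest that no common structure exists.
    """
    return consensus_energy / mean(single_mfes)


# Hypothetical energies in kcal/mol, for illustration only.
z = z_score(-30.0, [-20.0, -22.0, -18.0])
sci = structure_conservation_index(-24.0, [-30.0, -28.0, -32.0])
print(f"z = {z:.2f}, SCI = {sci:.2f}")
```

In this sketch a strongly negative z score together with an SCI close to 1 would point toward a stable and conserved structure, which is the combination of evidence the classifier is meant to exploit.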

Results for alignments with two to four sequences and mean pairwise identities between 60% and 100% are shown. N is the number of alignments in the test set. For each native alignment, one randomized alignment was produced, and randomized alignments classified as ncRNA were counted as false positives. Sensitivity and specificity are shown in percentage for three cutoffs of the RNA class probability predicted by the SVM. Absolute numbers of true positives and false negatives are shown in parentheses.
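The evaluation scheme described in this note can be sketched as follows, assuming one list of predicted RNA class probabilities for the native alignments and one for their randomized counterparts. The function name, the probability values, the cutoffs, and the use of a strict "greater than" comparison are all assumptions made for illustration; both lists are assumed to be non-empty.

```python
def sensitivity_specificity(native_probs, random_probs, cutoff):
    """Return (sensitivity, specificity) in percent at one probability cutoff.

    Native alignments scoring above the cutoff count as true positives;
    randomized alignments scoring above it count as false positives.
    """
    tp = sum(p > cutoff for p in native_probs)
    fn = len(native_probs) - tp
    fp = sum(p > cutoff for p in random_probs)
    tn = len(random_probs) - fp
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical class probabilities, one randomized alignment per native one.
native = [0.99, 0.95, 0.60, 0.20]
randomized = [0.10, 0.70, 0.05, 0.30]

for cutoff in (0.5, 0.9):
    sens, spec = sensitivity_specificity(native, randomized, cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.1f}%, specificity {spec:.1f}%")
```

Raising the cutoff trades sensitivity for specificity, which is why results are reported for several cutoffs rather than one.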

IRE, iron response element. IRES, internal ribosome entry site.

Acknowledgments

We thank Christoph Dieterich and Martin Vingron for permission to use release 2.0 of their CORG databases before publication and Paul Gardner and Andrea Tanzer for discussion. This work was supported in part by the Austrian Genome Research Program Bioinformatics Integration Network sponsored by Bundesministerium für Bildung, Wissenschaft und Kultur and Bundesministeriums für Wirtschaft und Arbeit, Austrian Fonds zur Förderung der Wissenschaftlichen Forschung Project P-15893, and the Bioinformatics Initiative of the Deutsche Forschungsgemeinschaft (Grant BIZ-6/1-2).

Notes

Abbreviations: ncRNA, noncoding RNA; snoRNA, small nucleolar RNA; SRP, signal-recognition particle; MFE, minimum free energy; SCI, structure conservation index; SVM, support vector machine; CNB, conserved noncoding block; CORG, Comparative Regulatory Genomics.

See Commentary on page 2269.

References

1. Eddy, S. R. (2001) Nat. Rev. Genet. 2, 919-929.
2. Storz, G. (2002) Science 296, 1260-1263.
3. Mattick, J. S. (2003) BioEssays 25, 930-939.
4. He, L. & Hannon, G. J. (2004) Nat. Rev. Genet. 5, 522-531.
5. Avner, P. & Heard, E. (2001) Nat. Rev. Genet. 2, 59-67.
6. Suzuki, M. & Hayashizaki, Y. (2004) BioEssays 26, 833-843.
7. Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. (2004) Cell 116, 499-509.
8. Kampa, D., Cheng, J., Kapranov, P., Yamanaka, M., Brubaker, S., Cawley, S., Drenkow, J., Piccolboni, A., Bekiranov, S., Helt, G., et al. (2004) Genome Res. 14, 331-342.
9. Nudler, E. & Mironov, A. S. (2004) Trends Biochem. Sci. 29, 11-17.
10. McClelland, M., Florea, L., Sanderson, K., Clifton, S. W., Parkhill, J., Churcher, C., Dougan, G., Wilson, R. K. & Miller, W. (2000) Nucleic Acids Res. 28, 4974-4986.
11. Florea, L., McClelland, M., Riemer, C., Schwartz, S. & Miller, W. (2003) Nucleic Acids Res. 31, 3527-3532.
12. Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E. S. (2003) Nature 423, 241-254.
13. C. elegans Sequencing Consortium (1998) Science 282, 2012-2018.
14. Stein, L. D., Bao, Z., Blasiar, D., Blumenthal, T., Brent, M. R., Chen, N., Chinwalla, A., Clarke, L., Clee, C., Coghlan, A., et al. (2003) PLoS Biol. 1, E45.
15. International Human Genome Sequencing Consortium (2001) Nature 409, 860-921.
16. International Mouse Genome Sequencing Consortium (2002) Nature 420, 520-562.
17. Rat Genome Sequencing Project Consortium (2004) Nature 428, 493-521.
18. Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W. J., Mattick, J. S. & Haussler, D. (2004) Science 304, 1321-1325.
19. Bejerano, G., Haussler, D. & Blanchette, M. (2004) Bioinformatics 20, Suppl. 1, I40-I48.
20. Thomas, J. W., Touchman, J. W., Blakesley, R. W., Bouffard, G. G., Beckstrom-Sternberg, S. M., Margulies, E. H., Blanchette, M., Siepel, A. C., Thomas, P. J., McDowell, J. C., et al. (2003) Nature 424, 788-793.
21. Margulies, E. H., Blanchette, M., Haussler, D. & Green, E. D. (2003) Genome Res. 13, 2507-2518.
22. Rivas, E. & Eddy, S. R. (2001) BMC Bioinformatics 2, 8.
23. Rivas, E., Klein, R. J., Jones, T. A. & Eddy, S. R. (2001) Curr. Biol. 11, 1369-1373.
24. McCutcheon, J. P. & Eddy, S. R. (2003) Nucleic Acids Res. 31, 4119-4128.
25. di Bernardo, D., Down, T. & Hubbard, T. (2003) Bioinformatics 19, 1606-1611.
26. Coventry, A., Kleitman, D. J. & Berger, B. (2004) Proc. Natl. Acad. Sci. USA 101, 12102-12107.
27. Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, S., Tacker, M. & Schuster, P. (1994) Monatsh. Chemie 125, 167-188.
28. Hofacker, I. L., Fekete, M. & Stadler, P. F. (2002) J. Mol. Biol. 319, 1059-1066.
29. Blanchette, M., Kent, W. J., Riemer, C., Elnitski, L., Smit, A. F., Roskin, K. M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E. D., et al. (2004) Genome Res. 14, 708-715.
30. Workman, C. & Krogh, A. (1999) Nucleic Acids Res. 27, 4816-4822.
31. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. & Eddy, S. R. (2003) Nucleic Acids Res. 31, 439-441.
32. Rosenblad, M. A., Gorodkin, J., Knudsen, B., Zwieb, C. & Samuelsson, T. (2003) Nucleic Acids Res. 31, 363-364.
33. Brown, J. W. (1999) Nucleic Acids Res. 27, 314.
34. Washietl, S. & Hofacker, I. L. (2004) J. Mol. Biol. 342, 19-30.
35. Zuker, M. & Stiegler, P. (1981) Nucleic Acids Res. 9, 133-148.
36. Walter, A. E., Turner, D. H., Kim, J., Lyttle, M. H., Muller, P., Mathews, D. H. & Zuker, M. (1994) Proc. Natl. Acad. Sci. USA 91, 9218-9222.
37. Mathews, D. H., Sabina, J., Zuker, M. & Turner, D. H. (1999) J. Mol. Biol. 288, 911-940.
38. Rivas, E. & Eddy, S. R. (2000) Bioinformatics 16, 583-605.
39. Bonnet, E., Wuyts, J., Rouze, P. & Van De Peer, Y. (2004) Bioinformatics 20, 2911-2917.
40. Le, S. V., Chen, J. H., Currey, K. M. & Maizel, J. V., Jr. (1988) Comput. Appl. Biosci. 4, 153-159.
41. Cristianini, N. & Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines (Cambridge Univ. Press, Cambridge, U.K.).
42. Dieterich, C., Wang, H., Rateitschak, K., Luz, H. & Vingron, M. (2003) Nucleic Acids Res. 31, 55-57.
43. Griffiths-Jones, S. (2004) Nucleic Acids Res. 32, D109-D111.
44. Pesole, G., Liuni, S., Grillo, G., Licciulli, F., Mignone, F., Gissi, C. & Saccone, C. (2002) Nucleic Acids Res. 30, 335-340.
45. Le, S. Y. & Maizel, J. V., Jr. (1997) Nucleic Acids Res. 25, 362-369.
46. Hentze, M. W. & Kuhn, L. C. (1996) Proc. Natl. Acad. Sci. USA 93, 8175-8182.
47. Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M. & Haussler, D. (2002) Genome Res. 12, 996-1006.