Proc Natl Acad Sci U S A 106(30): 12353-12358

PMC: PMC2708976

PMID: 19592507

Chimeric transcript discovery by paired-end transcriptome sequencing

Christopher Maher

Nallasivam Palanisamy

John Brenner

Xuhong Cao

Shanker Kalyana-Sundaram

Shujun Luo

Irina Khrebtukova

Terrence Barrette

Catherine Grasso (+5 authors)

Results

Chimera Discovery via Paired-End Transcriptome Sequencing.

Here, we employ transcriptome sequencing to restrict chimera nominations to “expressed sequences,” thus enriching for potentially functional mutations. To evaluate massively parallel paired-end transcriptome sequencing to identify novel gene fusions, we generated cDNA libraries from the prostate cancer cell line VCaP, CML cell line K562, universal human reference total RNA (UHR; Stratagene), and human brain reference (HBR) total RNA (Ambion). Using the Illumina Genome Analyzer II, we generated 16.9 million VCaP, 20.7 million K562, 25.5 million UHR, and 23.6 million HBR transcriptome mate pairs (2 × 50 nt). The mate pairs were mapped against the transcriptome and categorized as (i) mapping to same gene, (ii) mapping to different genes (chimera candidates), (iii) nonmapping, (iv) mitochondrial, (v) quality control, or (vi) ribosomal (Table S1). Overall, the chimera candidates represent a minor fraction of the mate pairs, comprising <1% of the reads for each sample.

We believe that a paired-end strategy offers multiple advantages over single-read-based approaches: it alleviates the reliance on sequencing reads that traverse the fusion junction, it increases coverage by sequencing reads from both ends of a transcribed fragment, and it can resolve ambiguous mappings (Fig. S1). Therefore, to nominate chimeras, we leveraged each of these aspects in our bioinformatics analysis. We focused on mate pairs that encompass or span the fusion junction by analyzing 2 main categories of sequence reads: chimera candidates and nonmapping (Fig. S2A). The resulting chimera candidates from the nonmapping category that span the fusion boundary were merged with the chimeras found to encompass the fusion boundary, revealing 119, 144, 205, and 294 chimeras in VCaP, K562, HBR, and UHR, respectively.

Comparison of a Paired-End Strategy Against Existing Single Read Approaches.

To assess the merit of adopting a paired-end transcriptome approach, we compared the results against existing single read approaches. Although current RNA sequencing (RNA-Seq) studies have been using 36-nt single reads (16, 17), we increased the likelihood of spanning a fusion junction by generating 100-nt long single reads using the Illumina Genome Analyzer II. Also, we chose this length because it would facilitate a more comparable amount of sequencing time as required for sequencing both 50-nt mate pairs. In total, we generated 7.0, 59.4, and 53.0 million 100-nt transcriptome reads for VCaP, UHR, and HBR, respectively, for comparison against paired-end transcriptome reads from matched samples.

Because the UHR is a mixture of cancer cell lines, we expected to find numerous previously identified gene fusions. Therefore, we first assessed the depth of coverage of a paired-end approach against long single reads by directly comparing the normalized frequency of sequence reads supporting 4 previously identified gene fusions [TMPRSS2-ERG (5, 6), BCR-ABL1 (18), BCAS4-BCAS3 (19), and ARFGEF2-SULF2 (20)]. As shown in Fig. 1A, we observed a marked enrichment of paired-end reads compared with long single reads for each of these well characterized gene fusions.

Fig. 1.

Dynamic range and sensitivity of the paired-end transcriptome analysis relative to single read approaches. (A) Comparison of paired-end (blue) and long single transcriptome reads (black) supporting known gene fusions TMPRSS2-ERG, BCR-ABL1, BCAS4-BCAS3, and ARFGEF2-SULF2. (B) Schematic representation of TMPRSS2-ERG in VCaP, comparing mate pairs with long single transcriptome reads. (Upper) Frequency of mate pairs, shown in log scale, are divided based on whether they encompass or span the fusion boundary; (Lower) 100-mer single transcriptome reads spanning TMPRSS2-ERG fusion boundary. First 36 nt are highlighted in red. (C) Venn diagram of chimera nominations from both a paired-end (orange) and long single read (blue) strategy for UHR and HBR.

We observed that TMPRSS2-ERG had a >10-fold enrichment of supporting reads with the paired-end approach relative to the single read approach. The schematic representation in Fig. 1B indicates the distribution of reads confirming the TMPRSS2-ERG gene fusion from both paired-end and single read sequencing. As expected, the longer reads improve the number of reads spanning known gene fusions. For example, had we sequenced a single 36-mer (shown in red text), 11 of the 17 chimeric reads, shown in the bottom portion of the long single reads, would not have spanned the gene fusion boundary, but instead would have terminated before the junction and, therefore, only aligned to TMPRSS2. However, despite the improved results, only 17 chimeric reads were generated from 7.0 million long single read sequences. In contrast, paired-end sequencing resulted in 552 reads supporting the TMPRSS2-ERG gene fusion from ≈17 million sequences.
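The >10-fold figure can be checked from the read counts quoted above. The following is a minimal sketch, assuming that normalization simply means supporting reads per million sequenced reads; the function name is illustrative and not part of the original pipeline. Under that assumption the enrichment comes out at roughly 13-fold.

```python
def reads_per_million(supporting_reads, total_reads):
    """Supporting reads per million sequenced reads (assumed normalization)."""
    return supporting_reads / total_reads * 1e6

# Counts quoted in the text for TMPRSS2-ERG in VCaP.
paired_end = reads_per_million(552, 16.9e6)  # 552 mate pairs out of 16.9 million
single_read = reads_per_million(17, 7.0e6)   # 17 reads out of 7.0 million 100-nt reads
fold_enrichment = paired_end / single_read

print(f"paired-end: {paired_end:.1f} per million")
print(f"single read: {single_read:.1f} per million")
print(f"fold enrichment: {fold_enrichment:.1f}")
```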

Because we are using sequence-based evidence to nominate a chimera, we hypothesized that the approach providing the maximum nucleotide coverage is more likely to capture a fusion junction. We calculated an in silico insert size for each sample using mate pairs aligning to the same gene, and found a mean insert size of ≈200 nt. Then, we compared the total coverage from single reads (coverage is equivalent to the total number of pass filter reads multiplied by the read length) with the paired-end approach (coverage per mate pair is equivalent to the sum of the insert size and the length of each read) (Fig. S2B). Overall, we observed an average coverage of 848.7 and 757.3 MB using single read technology, compared with 2,553.3 and 2,363 MB from paired-end in UHR and HBR, respectively. This ≈3-fold increase in coverage in the paired-end samples compared with the long read approach, per lane, could explain the increased dynamic range we observed using a paired-end strategy.
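The two coverage definitions can be written out as a short sketch. It assumes that "insert size" refers to the stretch between the two sequenced mates, so that each mate pair covers the insert plus both reads; if the insert size is instead taken to include the reads, the paired-end figure would be smaller. The read counts below are hypothetical and chosen equal so that only the per-read geometry differs; they are not the per-lane values reported in the text.

```python
def single_read_coverage(n_reads, read_len):
    """Nucleotides covered: pass-filter reads multiplied by read length."""
    return n_reads * read_len

def paired_end_coverage(n_pairs, read_len, insert_size):
    """Nucleotides covered: each pair covers the insert plus both reads.

    Assumption: insert_size excludes the two sequenced mates.
    """
    return n_pairs * (insert_size + 2 * read_len)

# Hypothetical example: 1 million reads or pairs from one lane.
single = single_read_coverage(1_000_000, read_len=100)
paired = paired_end_coverage(1_000_000, read_len=50, insert_size=200)
ratio = paired / single
print(single, paired, ratio)
```

With a 200-nt insert and 2 × 50-nt mates, each pair accounts for 300 nt against 100 nt for a long single read, which is consistent with the ≈3-fold difference described above.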

Next we wanted to identify chimeras common to both strategies. The long read approach nominated 1,375 and 1,228 chimeras, whereas with a paired-end strategy, we only nominated 225 and 144 chimeras in UHR and HBR, respectively. As shown in the Venn diagram (Fig. 1C), there were 32 and 31 candidates common to both technologies for UHR and HBR, respectively. Within the common UHR chimeric candidates, we observed previously identified gene fusions BCAS4-BCAS3, BCR-ABL1, ARFGEF2-SULF2, and RPS6KB1-TMEM49 (13). The remaining chimeras, nominated by both approaches, represent a high fidelity set. Therefore, to further assess whether a paired-end strategy has an increased dynamic range, we compared the ratio of normalized mate pair reads against single reads for the remaining chimeras common to both technologies. We observed that 93.5 and 93.9% of UHR and HBR candidates, respectively, had a higher ratio of normalized mate pair reads to single reads (Table S2), confirming the increased dynamic range offered by a paired-end strategy. We hypothesize that the greater number of nominated candidates specific to the long read approach represents an enrichment of false positives, as observed when using the 454 long read technology (15, 21).

Paired-End Approach Reveals Novel Gene Fusions.

We were interested in determining whether the paired-end libraries could detect novel gene fusions. Among the top chimeras nominated from VCaP, HBR, UHR, and K562, many were already known, including TMPRSS2-ERG, BCAS4-BCAS3, BCR-ABL1, USP10-ZDHHC7, and ARFGEF2-SULF2. Also ranking among these well known gene fusions in UHR was a fusion on chromosome 13 between GAS6 and RASA3 (Fig. S3A and Table S2). The fact that GAS6-RASA3 ranked higher than BCR-ABL1 suggests that it may be a driving fusion in one of the cancer cell lines in the RNA pool.

Another observation was that there were 2 candidates among the top 10 found in both UHR and K562. This observation was intriguing, because hematological malignancies are not considered to have multiple gene fusion events. In addition to BCR-ABL1, we were able to detect a previously undescribed interchromosomal gene fusion between exon 23 of NUP214 located at chromosome 9q34.13 with exon 2 of XKR3 located at chromosome 22q11.1. These genes reside on chromosomes 9 and 22 in close proximity to ABL1 and BCR, respectively (Fig. S3B). We confirmed the presence of NUP214-XKR3 in K562 cells using qRT-PCR, but were unable to detect it across an additional 5 CML cell lines tested (SUP-B15, MEG-01, KU812, GDM-1, and Kasumi-4) (Fig. S3C). These results suggest that NUP214-XKR3 is a “private” fusion that originated from additional complex rearrangements after the translocation that generated BCR-ABL1 and a focal amplification of both gene regions.

Although we were able to detect BCR-ABL1 and NUP214-XKR3 in both UHR and K562, there was a marked reduction in the mate pairs supporting these fusions in UHR. Although a diluted signal is expected, because UHR is pooled samples, it provides evidence that pooling samples can serve as a useful approach for nominating top expressing chimeras, and potentially enrich for “driver” chimeras.

Previously Undescribed Prostate Gene Fusions.

Our previous work using integrative transcriptome sequencing to detect gene fusions in cancer revealed multiple gene fusions, demonstrating the complexity of the prostate transcriptomes of VCaP and LNCaP (15). Here, we exploit the comprehensiveness of a paired-end strategy on the same cell lines to reveal novel chimeras. In the circular plot shown in Fig. S4A, we displayed all experimentally validated paired-end chimeras in the larger red circle. We found that all of the previously discovered chimeras in VCaP and LNCaP comprised a subset of the paired-end candidates, as displayed in the inner black circle.

As expected, TMPRSS2-ERG was the top VCaP candidate. In addition to “rediscovering” the USP10-ZDHHC7, HJURP-INPP4A, and EIF4E2-HJURP gene fusions, a paired-end approach revealed several previously undescribed gene fusions in VCaP. One such example was an interchromosomal gene fusion between ZDHHC7, on chromosome 16, with ABCB9, residing on chromosome 12, that was validated by qRT-PCR (Fig. S3D). Interestingly, the 5′ partner, ZDHHC7, had previously been validated as a complex intrachromosomal gene fusion with USP10 (15). Both fusions have mate pairs aligning to the same exon of ZDHHC7 (15), suggesting that their breakpoints are in adjacent introns (Fig. S3D).

Another previously undescribed VCaP interchromosomal gene fusion that we discovered was between exon 2 of TIA1, residing on chromosome 2, with exon 3 of DIRC2, or disrupted in renal carcinoma 2, located on chromosome 3. TIA1-DIRC2 was validated by qRT-PCR and FISH (Fig. S5). In total, we confirmed an additional 4 VCaP and 2 LNCaP chimeras (Fig. S6). Overall, these fusions demonstrate that paired-end transcriptome sequencing can nominate candidates that have eluded previous techniques, including other massively parallel transcriptome sequencing approaches.

Distinguishing Causal Gene Fusions from Secondary Mutations.

We were next interested in determining whether the dynamic range provided by paired-end sequencing can distinguish known high-level “driving” gene fusions, such as known recurrent gene fusions BCR-ABL1 and TMPRSS2-ERG, from lower level “passenger” fusions. Therefore, we plotted the normalized mate pair coverage at the fusion boundary for all experimentally validated gene fusions for the 2 cell lines that we sequenced harboring recurrent gene fusions, VCaP and K562. As shown in Fig. S4B, we observed that both driver fusions, TMPRSS2-ERG and BCR-ABL1, show the highest expression among the validated chimeras in VCaP and K562, respectively. This observation suggests a paired-end nomination strategy for selecting putative driver gene fusions among private nonspecific gene fusions that lack detectable levels of expression across a panel of samples (15).

Previously Undescribed Breast Cancer Gene Fusions.

Our ability to detect previously undescribed prostate gene fusions in VCaP and LNCaP demonstrated the comprehensiveness of paired-end transcriptome sequencing compared with an integrated approach, using short and long transcriptome reads. Therefore, we extended our paired-end analysis by using breast cancer cell line MCF-7, which has been mined for fusions using numerous approaches such as expressed sequence tags (ESTs) (22), array CGH (23), single nucleotide polymorphism arrays (24), gene expression arrays (25), end sequence profiling (20, 26), and paired-end diTag (PET) (13).

A histogram (Fig. S4C) of the top ranking MCF-7 candidates highlights BCAS4-BCAS3 and ARFGEF2-SULF2 as the top 2 ranking candidates, whereas other previously reported candidates, such as SULF2-PRICKLE, DEPDC1B-ELOVL7, RPS6KB1-TMEM49, and CXorf15-SYAP1, were interspersed among a comprehensive list of previously undescribed putative chimeras. To confirm that these previously undescribed nominations were not false positives, we experimentally validated 2 interchromosomal and 3 intrachromosomal candidates using qRT-PCR (Fig. S6). Overall, not only was a paired-end approach able to detect gene fusions that have eluded numerous existing technologies, it has revealed 5 previously undescribed mutations in breast cancer.

RNA-Based Chimeras.

Although many of the inter- and intrachromosomal rearrangements that we nominated were found within a single sample, we observed many chimeric events shared across samples. We identified 11 chimeric events common to UHR, VCaP, K562, and HBR (Table S3). A heatmap representation (Fig. 2A) of the normalized frequency of mate pairs supporting each chimeric event shows that these events are broadly transcribed, in contrast to the top restricted chimeric events. Also, we found that 100% of the broadly expressed chimeras resided adjacent to one another on the genome, whereas only 7.7% of the restricted candidates were neighboring genes. This discrepancy can be explained by the enrichment of inter- and intrachromosomal rearrangements in the restricted set.

Fig. 2.

RNA based chimeras. (A) Heatmaps showing the normalized number of reads supporting each read-through chimera across samples ranging from 0 (white) to 30 (red). (Upper) The heatmap highlights broadly expressed chimeras in UHR, HBR, VCaP, and K562. (Lower) The heatmap highlights the expression of the top ranking restricted gene fusions that are enriched with interchromosomal and intrachromosomal rearrangements. (B) Illustrative examples classifying RNA-based chimeras into (i) read-throughs, (ii) converging transcripts, (iii) diverging transcripts, and (iv) overlapping transcripts. (C Upper) Paired-end approach links reads from independent genes as belonging to the same transcriptional unit (Right), whereas a single read approach would assign these reads to independent genes (Left). (Lower) The single read approach requires that a chimera span the fusion junction (Left), whereas a paired-end approach can link mate pairs independent of gene annotation (Right).

Unlike previously characterized restricted read-throughs, such as SLC45A3-ELK4 (15), which are found adjacent to one another and in the same orientation, we found that the majority of the broadly expressed chimera candidates resided adjacent to one another in different orientations. Therefore, we have categorized these events as (i) read-throughs, adjacent genes in the same orientation, (ii) diverging genes, adjacent genes in opposite orientation whose 5′ ends are in close proximity, (iii) convergent genes, adjacent genes in opposite orientation whose 3′ ends are in close proximity, and (iv) overlapping genes, adjacent genes that share common exons (Fig. 2B). Based on this classification, we found 1 read-through, 2 convergent genes, 6 divergent genes, and 2 overlapping genes. Also, we found that ≈81.8% of these chimeras had at least 1 supporting EST, providing independent confirmation of the event (Table S3). In contrast to paired-end, single read approaches would likely miss these instances, as each read would have aligned to its respective gene based on the current annotations (Fig. 2C). Also, these instances may represent extensions of a transcriptional unit, which would not be detectable by a single read approach that identifies chimeric reads that span exon boundaries of independent genes. Overall, we believe that many of these broadly expressed RNA chimeras represent instances where mate pairs are revealing previously undescribed annotation for a transcriptional unit.
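The four categories can be expressed as a small decision rule over gene coordinates and strand. This is a minimal sketch, not the authors' code; the `Gene` record, its field names, and the example genes are hypothetical, and the two genes are assumed to be neighbors on the same chromosome.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    start: int                      # leftmost genomic coordinate
    end: int                        # rightmost genomic coordinate
    strand: str                     # "+" or "-"
    exons: frozenset = frozenset()  # set of (start, end) exon coordinates

def classify_adjacent_pair(a, b):
    """Classify two neighboring genes into the categories of Fig. 2B."""
    left, right = sorted((a, b), key=lambda g: g.start)
    if left.exons & right.exons:
        return "overlapping"    # the genes share at least one exon
    if left.strand == right.strand:
        return "read-through"   # same orientation
    if left.strand == "-" and right.strand == "+":
        return "diverging"      # 5' ends face each other
    return "converging"         # left "+", right "-": 3' ends face each other

# Hypothetical neighbors on one chromosome.
g1 = Gene("GENE_A", 1_000, 5_000, "-")
g2 = Gene("GENE_B", 6_000, 9_000, "+")
print(classify_adjacent_pair(g1, g2))
```

A gene on the minus strand has its 5′ end at its rightmost coordinate, which is why a minus-strand gene followed by a plus-strand gene is labeled diverging.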

Previously Undescribed ETS Gene Fusions in Clinically Localized Prostate Cancer.

Given the high prevalence of gene fusions involving ETS oncogenic transcription factor family members in prostate tumors, we applied paired-end transcriptome sequencing for gene fusion discovery in prostate tumors lacking previously reported ETS fusions. For 2 prostate tumors, aT52 and aT64, we generated 6.2 and 7.4 million transcriptome mate pairs, respectively. In aT64, we found that HERPUD1, residing on chromosome 16, was juxtaposed in front of exon 4 of ERG (Fig. 3A), which was validated by qRT-PCR (Fig. S6) and FISH (Fig. 3B). This identifies a third 5′ fusion partner for ERG, after TMPRSS2 (6) and SLC45A3 (27); presumably, HERPUD1 also mediates the overexpression of ERG in a subset of prostate cancer patients. Just as TMPRSS2 and SLC45A3 have been shown to be androgen regulated by qRT-PCR (5), we found HERPUD1 expression, via RNA-Seq, to be responsive to androgen treatment (Fig. S7). In addition, ChIP-Seq analysis revealed androgen binding at the 5′ end of HERPUD1 (Fig. S7).

Fig. 3.

Discovery of previously undescribed ETS gene fusions in localized prostate cancer. (A) Schematic representation of the interchromosomal gene fusion between exon 1 of HERPUD1 (red), residing on chromosome 16, with exon 4 of ERG (blue), located on chromosome 21. (B) Schematic representation showing genomic organization of HERPUD1 and ERG genes. Horizontal red and green bars indicate the location of BAC clones. (Lower) FISH analysis using BAC clones showing HERPUD1 and ERG in a normal tissue (Left), deletion of the ERG 5′ region in tumor (Center), and HERPUD1-ERG fusion in a tumor sample (Right). (C) Schematic representation of the interchromosomal gene fusion between FLJ35294 (green), residing on chromosome 17, with exon 4 of ETV1 (orange), located on chromosome 7. (D Upper) Schematic representation of the genomic organization of FLJ35294 and ETV1 genes. (Lower) FISH analysis using BAC clones showing split of ETV1 in tumor sample (Left) and the colocalization of FLJ35294 and ETV1 in a tumor sample (Right).

Also, in the second prostate tumor sample (aT52), we discovered an interchromosomal gene fusion between the 5′ end of a prostate cDNA clone, AX747630 (FLJ35294), residing on chromosome 17, with exon 4 of ETV1, located on chromosome 7 (Fig. 3C), which was validated via qRT-PCR (Fig. S6) and FISH (Fig. 3D). Interestingly, this fusion has previously been reported in an independent sample found by a fluorescence in situ hybridization screen (27), thus demonstrating that it is recurrent in a subset of prostate cancer patients. As previously reported, gene expression via RNA-Seq confirmed that AX747630 is an androgen-inducible gene (Fig. S7). Also, ChIP-Seq revealed androgen occupancy at the 5′ end of AX747630 (Fig. S7).

Discussion

This study demonstrates the effectiveness of paired-end massively parallel transcriptome sequencing for fusion gene discovery. By using a paired-end approach, we were able to rediscover known gene fusions, comprehensively discover previously undescribed gene fusions, and home in on causal gene fusions. The ability to detect 12 previously undescribed gene fusions in 4 commonly used cell lines that eluded any previous efforts conveys the superior sensitivity of a paired-end RNA-Seq strategy compared with existing approaches. Also, it suggests that we may be able to unveil previously undescribed chimeric events in previously characterized samples believed to be devoid of any known driver gene fusions, as exemplified by the discovery of previously undescribed ETS gene fusions in 2 clinically localized prostate tumor samples that lacked known driver gene fusions.

By analyzing the transcriptome at unprecedented depth, we have revealed numerous gene fusions, demonstrating the prevalence of a relatively under-represented class of mutations. However, one of the major goals remains to discover recurrent gene fusions and to distinguish them from secondary, nonspecific chimeras. Although quantifying expression levels is not proof of whether a gene fusion is a driver or passenger, because a low-level gene fusion could still be causative, it is still of major significance that a paired-end strategy clearly distinguished known high-level driving gene fusions, such as BCR-ABL1 and TMPRSS2-ERG, from potential lower level passenger chimeras. Overall, these fusions serve as a model for employing a paired-end nomination strategy for prioritizing leads likely to be high-level driving gene fusions, which would subsequently undergo further functional and experimental evaluation.

One of the major advantages of using a transcriptome approach is that it enables us to identify rearrangements that are not detectable at the DNA level. For example, conventional cytogenetic methods would miss gene fusions produced by paracentric inversions, or submicroscopic events, such as GAS6-RASA3. Also, transcriptome sequencing can unveil RNA chimeras lacking DNA aberrations, as demonstrated by the discovery of a recurrent, prostate-specific read-through of SLC45A3 with ELK4 in prostate cancers. Further classification of RNA-based events using paired-end sequencing revealed numerous broadly expressed chimeras between adjacent genes. Although these events were not necessarily read-through events, because they typically had different orientations, we believe they represent extensions of transcriptional units beyond their annotated boundaries. Unlike single-read-based approaches, which require chimeras to span exon boundaries of independent genes, we were able to detect these events using paired-end sequencing, which could have significant impact for improving how we annotate transcriptional units.

Overall, we have demonstrated the advantages of employing a paired-end transcriptome strategy for chimera discovery, established a methodology for mining chimeras, and extensively catalogued chimeras in prostate and hematological cancer models. We believe that the sensitivity of this approach will be of broad impact and significance for revealing novel causative gene fusions in various cancers while revealing additional private gene fusions that may contribute to tumorigenesis or cooperate with driver gene fusions.

Methods

Paired-End Gene Fusion Discovery Pipeline.

Mate pair transcriptome reads were mapped to the human genome (hg18) and Refseq transcripts, allowing up to 2 mismatches, using Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) pair within the Illumina Genome Analyzer Pipeline software. Illumina export output files were parsed to categorize passing filter mate pairs as (i) mapping to the same transcript, (ii) ribosomal, (iii) mitochondrial, (iv) quality control, (v) chimera candidates, and (vi) nonmapping. Chimera candidates and nonmapping categories were used for gene fusion discovery. For the chimera candidates category, the following criteria were used: (i) mate pairs must be of high mapping quality (best unique match across genome), (ii) best unique mate pairs do not have a more logical alternative combination (i.e., best mate pairs suggest an interchromosomal rearrangement, whereas the second best mapping for a mate reveals the pair has an alignment within the expected insert size), (iii) the sum of the distances between the most 5′ and 3′ mate on both partners of the gene fusion must be <500 nt, and (iv) mate pairs supporting a chimera must be nonredundant.
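The four criteria for the chimera candidates category can be sketched as two filtering steps. This is an illustration under stated assumptions rather than the pipeline itself: the `MatePair` record and its fields are hypothetical, the flags `unique` and `has_better_alt` stand in for information that would be derived from the ELAND export files, and criterion (iii) is interpreted as the span of mate positions on the 5′ partner plus the span on the 3′ partner.

```python
from collections import namedtuple

# Hypothetical record for one mate pair whose ends map to different genes.
MatePair = namedtuple("MatePair", "gene5 pos5 gene3 pos3 unique has_better_alt")

MAX_SPAN_SUM = 500  # nt, criterion (iii)

def filter_candidates(pairs):
    """Keep mate pairs passing criteria (i), (ii), and (iv)."""
    seen, kept = set(), []
    for p in pairs:
        if not p.unique or p.has_better_alt:       # (i) unique best hit, (ii) no better alternative
            continue
        key = (p.gene5, p.pos5, p.gene3, p.pos3)
        if key in seen:                            # (iv) drop redundant pairs
            continue
        seen.add(key)
        kept.append(p)
    return kept

def passes_span_check(pairs):
    """Criterion (iii) for the non-empty set of pairs supporting one fusion."""
    span5 = max(p.pos5 for p in pairs) - min(p.pos5 for p in pairs)
    span3 = max(p.pos3 for p in pairs) - min(p.pos3 for p in pairs)
    return span5 + span3 < MAX_SPAN_SUM

# Hypothetical input: one duplicate, one non-unique pair, one with a better alternative.
pairs = [
    MatePair("GENE_A", 100, "GENE_B", 900, True, False),
    MatePair("GENE_A", 100, "GENE_B", 900, True, False),
    MatePair("GENE_A", 150, "GENE_B", 950, False, False),
    MatePair("GENE_A", 220, "GENE_B", 1000, True, True),
    MatePair("GENE_A", 180, "GENE_B", 980, True, False),
]
kept = filter_candidates(pairs)
print(len(kept), passes_span_check(kept))
```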

In addition to mining mate pairs encompassing a fusion boundary, the nonmapping category was mined for mate pairs in which 1 read maps to a gene, whereas its corresponding read fails to align because it spans the fusion boundary. First, the annotated transcript that the “mapping” mate aligned against was extracted, because this transcript represents one of the potential partners involved in the gene fusion. The “nonmapping” mate was then aligned against all of the exon boundaries of the known gene partner to identify a perfect partial alignment. A partial alignment confirms that the nonmapping mate maps to our expected gene partner while revealing the portion of the nonmapping mate, or overhang, aligning to the unknown partner. The overhang was then aligned against the exon boundaries of all known transcripts to identify the fusion partner. This process was done using a Perl script that extracts all possible University of California Santa Cruz (UCSC) and Refseq exon boundaries, looking for a single perfect best hit.
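The overhang search can be illustrated with a short sketch. The original analysis used a Perl script; this Python version is an independent illustration, not a port of it. It handles only the case where the known partner is the 5′ gene, and the `min_anchor` parameter, the function names, and the toy sequences are all hypothetical.

```python
def split_at_exon_end(read, exon_seq, min_anchor=10):
    """Return the overhang if a prefix of `read` perfectly matches the 3' end of `exon_seq`.

    The longest matching prefix of at least `min_anchor` nt is used; the rest of
    the read is the overhang that should belong to the unknown partner.
    """
    longest = min(len(read) - 1, len(exon_seq))  # leave at least 1 nt of overhang
    for k in range(longest, min_anchor - 1, -1):
        if exon_seq.endswith(read[:k]):
            return read[k:]
    return None

def find_partner(overhang, exon_starts):
    """Return the exon whose 5' end starts with `overhang`, if exactly one does."""
    hits = [name for name, seq in exon_starts.items() if seq.startswith(overhang)]
    return hits[0] if len(hits) == 1 else None  # require a single perfect hit

# Toy example: the read starts in the known partner and ends in an unknown exon.
known_exon = "ACGTACGTTTGACCA"
read = "TTGACCA" + "GGCATCGA"
overhang = split_at_exon_end(read, known_exon, min_anchor=5)
partner = find_partner(overhang, {"GENE_X:exon2": "GGCATCGATTT", "GENE_Y:exon1": "GGCTTTT"})
print(overhang, partner)
```

Requiring exactly one hit mirrors the "single perfect best hit" condition; an overhang that matches several exons is left unassigned rather than guessed.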

Mate pairs spanning the fusion boundary are merged with mate pairs encompassing the fusion boundary. At least 2 independent mate pairs are required to support a chimera nomination, which can be achieved by (i) 2 or more nonredundant mate pairs spanning the fusion boundary, (ii) 2 or more nonredundant mate pairs encompassing a fusion boundary, or (iii) 1 or more mate pairs encompassing a fusion boundary and 1 or more mate pairs spanning the fusion boundary. All chimera nominations were normalized based on the cumulative number of mate pairs encompassing or spanning the fusion junction per million mate pairs passing filter.
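The nomination rule and the normalization can be sketched in a few lines. The three listed cases together are equivalent to requiring at least 2 nonredundant supporting mate pairs in total, which is what the sketch tests. The function names are illustrative, and the counts are assumed to have already been made nonredundant.

```python
def is_nominated(n_encompassing, n_spanning):
    """True if at least 2 independent mate pairs support the chimera.

    Covers cases (i) >=2 spanning, (ii) >=2 encompassing, and
    (iii) >=1 encompassing plus >=1 spanning.
    """
    return n_encompassing + n_spanning >= 2

def normalized_support(n_encompassing, n_spanning, pass_filter_pairs):
    """Supporting mate pairs per million mate pairs passing filter."""
    return (n_encompassing + n_spanning) / pass_filter_pairs * 1e6

# Hypothetical candidate: 1 encompassing and 1 spanning pair in a 20-million-pair run.
print(is_nominated(1, 1), normalized_support(1, 1, 20_000_000))
```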

RNA Chimera Analysis.

Chimeras found in UHR, HBR, VCaP, and K562 were grouped based on whether they showed expression in all samples (“broadly expressed”) or in a single sample (“restricted expression”). Because UHR includes K562, chimeras found in only these 2 samples were also considered as restricted. Heatmap visualization was conducted by using TIGR's MultiExperiment Viewer (TMeV) version 4.0 (www.tm4.org).
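The grouping rule can be written as a small function. This is a sketch of the rule as stated; the text does not say how chimeras found in other sample combinations were handled, so the "unclassified" label is an assumption introduced here.

```python
ALL_SAMPLES = frozenset({"UHR", "HBR", "VCaP", "K562"})

def classify_expression(samples_with_chimera):
    """Group a chimera by the set of samples in which it was detected."""
    found = set(samples_with_chimera)
    if found == ALL_SAMPLES:
        return "broadly expressed"
    # UHR contains K562, so a chimera seen only in these 2 counts as restricted.
    if len(found) == 1 or found == {"UHR", "K562"}:
        return "restricted expression"
    return "unclassified"  # assumption: not covered by the stated rule

print(classify_expression(["UHR", "K562"]))
```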

Additional Details.

Additional details can be found in SI Text.

Supplementary Material

Supporting Information:

Michigan Center for Translational Pathology, Ann Arbor, MI 48109;

Departments of Pathology and

Urology, University of Michigan, Ann Arbor, MI 48109;

Howard Hughes Medical Institute and

Comprehensive Cancer Center, University of Michigan Medical School, Ann Arbor, MI 48109; and

Illumina Inc., 25861 Industrial Boulevard, Hayward, CA 94545

To whom correspondence should be addressed. E-mail: arul@umich.edu

Communicated by David Ginsburg, University of Michigan Medical School, Ann Arbor, MI, May 4, 2009.

Author contributions: C.A.M. and A.M.C. designed research; C.A.M., N.P., J.C.B., X.C., S.L., I.K., T.R.B., R.J.L., G.S., C.K.-S., and A.M.C. performed research; C.A.M., S.L., I.K., R.J.L., and G.S. contributed new reagents/analytic tools; C.A.M., N.P., J.C.B., S.K.-S., C.G., J.Y., R.J.L., G.S., C.K.-S., and A.M.C. analyzed data; and C.A.M., N.P., X.C., C.K.-S., and A.M.C. wrote the paper.

Received 2009 Mar 16

Freely available online through the PNAS open access option.

Abstract

Recurrent gene fusions are a prevalent class of mutations arising from the juxtaposition of 2 distinct regions, which can generate novel functional transcripts that could serve as valuable therapeutic targets in cancer. Therefore, we aim to establish a sensitive, high-throughput methodology to comprehensively catalog functional gene fusions in cancer by evaluating a paired-end transcriptome sequencing strategy. Not only did a paired-end approach provide a greater dynamic range in comparison with single read based approaches, but it clearly distinguished the high-level “driving” gene fusions, such as BCR-ABL1 and TMPRSS2-ERG, from potential lower level “passenger” gene fusions. Also, the comprehensiveness of a paired-end approach enabled the discovery of 12 previously undescribed gene fusions in 4 commonly used cell lines that eluded previous approaches. Using the paired-end transcriptome sequencing approach, we observed read-through mRNA chimeras, tissue-type restricted chimeras, converging transcripts, diverging transcripts, and overlapping mRNA transcripts. Last, we successfully used paired-end transcriptome sequencing to detect previously undescribed ETS gene fusions in prostate tumors. Together, this study establishes a highly specific and sensitive approach for accurately and comprehensively cataloguing chimeras within a sample using paired-end transcriptome sequencing.

Keywords: bioinformatics, gene fusions, prostate cancer, breast cancer, RNA-Seq

One of the most common classes of genetic alterations is gene fusions, resulting from chromosomal rearrangements (1). Intriguingly, >80% of all known gene fusions are attributed to leukemias, lymphomas, and bone and soft tissue sarcomas that account for only 10% of all human cancers. In contrast, common epithelial cancers, which account for 80% of cancer-related deaths, account for only 10% of known recurrent gene fusions (2–4). However, the recent discovery of a recurrent gene fusion, TMPRSS2-ERG, in a majority of prostate cancers (5, 6), and EML4-ALK in non-small-cell lung cancer (NSCLC) (7), has expanded the realm of gene fusions as an oncogenic mechanism in common solid cancers. Also, the restricted expression of gene fusions to cancer cells makes them desirable therapeutic targets. One successful example is imatinib mesylate, or Gleevec, which targets BCR-ABL1 in chronic myeloid leukemia (CML) (8–10). Therefore, the identification of novel gene fusions in a broad range of cancers is of enormous therapeutic significance.

The lack of known gene fusions in epithelial cancers has been attributed to their clonal heterogeneity and to the technical limitations of cytogenetic analysis, spectral karyotyping, FISH, and microarray-based comparative genomic hybridization (aCGH). Not surprisingly, TMPRSS2-ERG was discovered by circumventing these limitations through bioinformatics analysis of gene expression data to nominate genes with marked overexpression, or outliers, a signature of a fusion event (6). Building on this success, more recent strategies have adopted unbiased high-throughput approaches, with increased resolution, for genome-wide detection of chromosomal rearrangements in cancer involving BAC end sequencing (11), fosmid paired-end sequences (12), serial analysis of gene expression (SAGE)-like sequencing (13), and next-generation DNA sequencing (14). Although these approaches have unveiled many novel genomic rearrangements, solid tumors accumulate multiple nonspecific aberrations throughout tumor progression, thus making causal and driver aberrations indistinguishable from secondary and insignificant mutations, respectively.

The deep, unbiased view of a cancer cell enabled by massively parallel transcriptome sequencing has greatly facilitated gene fusion discovery. As shown in our previous work, integrating long and short read transcriptome sequencing technologies was an effective approach for enriching “expressed” fusion transcripts (15). However, despite its success, this methodology required substantial overhead to leverage 2 sequencing platforms. Therefore, in this study, we adopted a single-platform paired-end strategy to comprehensively elucidate novel chimeric events in cancer transcriptomes. Not only was this single platform more economical, but it also allowed us to map chimeric mRNA more comprehensively, home in on driver gene fusion products because of its quantitative nature, and observe rare classes of transcripts that were overlapping, diverging, or converging.

Acknowledgments.

We thank Lu Zhang, Eric Vermaas, Victor Quijano, and Juying Yan for assistance with sequencing, Shawn Baker and Steffen Durinck for helpful discussions, Rohit Mehra and Javed Siddiqui for collecting tissue samples, and Bo Han and Kalpana Ramnarayanan for technical assistance. C.A.M. was supported by a National Institutes of Health (NIH) Ruth L. Kirschstein postdoctoral training grant, and currently derives support from the American Association of Cancer Research Amgen Fellowship in Clinical/Translational Research and the Canary Foundation and American Cancer Society Early Detection Postdoctoral Fellowship. J.Y. was supported by NIH Grant 1K99CA129565-01A1 and Department of Defense (DOD) Grant PC080665. A.M.C. was supported in part by the NIH (Prostate SPORE P50CA69568, R01 CA132874), the DOD (BC075023, W81XWH-08-0110), the Early Detection Research Network (U01 CA111275), a Burroughs Wellcome Foundation Award in Clinical Translational Research, a Doris Duke Charitable Foundation Distinguished Clinical Investigator Award, and the Howard Hughes Medical Institute. This work was also supported by National Center for Integrative Biomedical Informatics Grant U54 DA021519.

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0904720106/DCSupplemental.

References

1. Futreal PA, et al. A census of human cancer genes. Nat Rev Cancer. 2004;4:177–183.
2. Kumar-Sinha C, Tomlins SA, Chinnaiyan AM. Recurrent gene fusions in prostate cancer. Nat Rev Cancer. 2008;8:497–511.
3. Mitelman F, Johansson B, Mertens F. Fusion genes and rearranged genes as a linear function of chromosome aberrations in cancer. Nat Genet. 2004;36:331–334.
4. Mitelman F, Mertens F, Johansson B. Prevalence estimates of recurrent balanced cytogenetic aberrations and gene fusions in unselected patients with neoplastic disorders. Gene Chromosome Canc. 2005;43:350–366.
5. Tomlins SA, et al. Distinct classes of chromosomal rearrangements create oncogenic ETS gene fusions in prostate cancer. Nature. 2007;448:595–599.
6. Tomlins SA, et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science. 2005;310:644–648.
7. Soda M, et al. Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature. 2007;448:561–566.
8. Druker BJ, et al. Five-year follow-up of patients receiving imatinib for chronic myeloid leukemia. New Engl J Med. 2006;355:2408–2417.
9. Druker BJ, et al. Effects of a selective inhibitor of the Abl tyrosine kinase on the growth of Bcr-Abl positive cells. Nat Med. 1996;2:561–566.
10. Kantarjian H, et al. Hematologic and cytogenetic responses to imatinib mesylate in chronic myelogenous leukemia. New Engl J Med. 2002;346:645–652.
11. Volik S, et al. End-sequence profiling: Sequence-based analysis of aberrant genomes. Proc Natl Acad Sci USA. 2003;100:7696–7701.
12. Tuzun E, et al. Fine-scale structural variation of the human genome. Nat Genet. 2005;37:727–732.
13. Ruan Y, et al. Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End diTags (PETs). Genome Res. 2007;17:828–838.
14. Campbell PJ, et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet. 2008;40:722–729.
15. Maher CA, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009;458:97–101.
16. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18:1509–1517.
17. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–628.
18. Shtivelman E, Lifshitz B, Gale RP, Canaani E. Fused transcript of abl and bcr genes in chronic myelogenous leukaemia. Nature. 1985;315:550–554.
19. Barlund M, et al. Cloning of BCAS3 (17q23) and BCAS4 (20q13) genes that undergo amplification, overexpression, and fusion in breast cancer. Gene Chromosome Canc. 2002;35:311–317.
20. Hampton OA, et al. A sequence-level map of chromosomal breakpoints in the MCF-7 breast cancer cell line yields insights into the evolution of a cancer genome. Genome Res. 2009;19:167–177.
21. Zhao Q, et al. Transcriptome-guided characterization of genomic rearrangements in a breast cancer cell line. Proc Natl Acad Sci USA. 2009;106:1886–1891.
22. Hahn Y, et al. Finding fusion genes resulting from chromosome rearrangement by analyzing the expressed sequence databases. Proc Natl Acad Sci USA. 2004;101:13257–13261.
23. Shadeo A, Lam WL. Comprehensive copy number profiles of breast cancer cell model genomes. Breast Cancer Res. 2006;8:R9.
24. Huang J, et al. Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum Genom. 2004;1:287–299.
25. Neve RM, et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell. 2006;10:515–527.
26. Volik S, et al. Decoding the fine-scale structure of a breast cancer genome and transcriptome. Genome Res. 2006;16:394–404.
27. Han B, et al. A fluorescence in situ hybridization screen for E26 transformation-specific aberrations: Identification of DDX5-ETV4 fusion protein in prostate cancer. Cancer Res. 2008;68:7629–7637.