Martin Bens, Arne Sahm, Marco Groth, Niels Jahn, Michaela Morhart, Susanne Holtze, Thomas B Hildebrandt, Matthias Platzer, Karol Szafranski.
Abstract
BACKGROUND: Advances in second-generation sequencing of RNA have made a near-complete characterization of transcriptomes affordable. However, the reconstruction of full-length mRNAs via de novo RNA-seq assembly is still difficult due to the complexity of eukaryotic transcriptomes, with their highly similar paralogs and multiple alternative splice variants. Here, we present FRAMA, a genome-independent annotation tool for de novo mRNA assemblies that addresses several post-assembly tasks, such as reduction of contig redundancy, ortholog assignment, correction of misassembled transcripts, scaffolding of fragmented transcripts and coding sequence identification.
Year: 2016 PMID: 26763976 PMCID: PMC4712544 DOI: 10.1186/s12864-015-2349-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1. Stages of the FRAMA procedure. Black arrows show the flow of data, red arrows indicate which stages make use of input data, and light red arrows indicate optional use of input data
Fig. 2. Schematic illustration of complex processing stages in FRAMA: (a) inference of CDS using orthologous transcripts from related species; (b) ortholog-based detection of fusion contigs; (c) scaffolding; (d) clipping of transcript 3’ termini by the use of weighted scores for indicative features. Horizontal bars indicate contigs and mRNAs, thicker regions indicate CDS. Colors code the origin of sequence data: Trinity contig (blue), orthologous transcript (green), final FRAMA transcript (red)
Fig. 3. Completeness of CDS regions: (a) classified according to ORF status, where “full length” refers to existing start and stop codons; (b) histogram of correspondence between (partly) recovered CDS and orthologous CDS
Fig. 4. A genome-based transcript map showing misassembled Trinity contigs (purple track) and improvements made by FRAMA’s mRNA boundary clipping (red track). Human RefSeq counterparts to FRAMA transcripts are shown in green. Trinity provides a plethora of (putative) transcript isoforms (63 contigs) for the HYAL1-NAT6-HYAL3 locus, many of them being read-through variants that join neighboring genes (informative subset in purple track). Although FRAMA is not able to resolve the shared first exon of the NAT6-HYAL3 locus correctly, mRNA boundary clipping improved the raw assembly substantially by separating the gene loci. Genome-based methods (brown tracks) also struggle to predict the correct gene loci: TKIM shows the best performance, separating each gene locus correctly. GENSCAN correctly separates the HYAL1, NAT6 and HYAL3 loci, but joins neighboring loci (HYAL1 with HYAL2 and HYAL3 with IFRD2). GNOMON correctly provides several different HYAL3 variants, but misses NAT6 completely. Throughout the figure, thick bars represent coding regions, thin bars untranslated regions and lines introns. Arrows on lines or bars indicate the direction of transcription. Accession numbers of external gene models are listed in Additional file 1: Table S11
Results of structural agreement of overlapping loci in the hetgla2 genome sequence
| Transcript set | Recovered (a) | Identical (b) | Matching (b) | Other (b) |
|---|---|---|---|---|
| Reference: TCUR (loci: 136) | | | | |
| TFRAMA | 129; 94.9 % | 100; 77.5 % | 15; 11.6 % | 14; 10.9 % |
| TGNOMON | 135; 99.3 % | 114; 84.4 % | 8; 5.9 % | 13; 9.6 % |
| TKIM | 122; 89.7 % | 50; 41.0 % | 16; 13.1 % | 56; 45.9 % |
| TGENSCAN | 133; 97.8 % | 13; 9.8 % | 6; 4.5 % | 114; 85.7 % |
| Reference: TGNOMON (loci: 19,746) | | | | |
| TFRAMA | 14,387; 72.9 % | 8463; 58.8 % | 2127; 14.8 % | 3797; 26.4 % |
| TKIM | 14,933; 75.6 % | 5382; 36.0 % | 2647; 17.7 % | 6904; 46.2 % |
| TGENSCAN | 16,082; 81.4 % | 1584; 9.8 % | 1044; 6.5 % | 13,454; 83.7 % |
Each orthologous set of transcripts was compared to TCUR and TGNOMON, after filtering of alignments with perfectly aligned CDS (>99 % recovered in genome). CDSs are considered overlapping if they share nucleotides on the same strand. CDS overlap cases were classified into the following categories: identical (identical exons), matching (shared exons), or ‘other’ (unequal number of exons).
(a) Number of overlapping loci and their proportion of the loci in the reference.
(b) Number of identical, matching and other transcript models and their proportion of the loci in overlap.
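
The overlap rule and the two kinds of percentages in the table can be made concrete with a small sketch. It is not FRAMA code. The function names `cds_overlap` and `summarize`, the exon representation as `(start, end)` tuples, and the inclusive coordinate convention are all assumptions made for illustration. The only things taken from the table are the overlap definition (shared nucleotides on the same strand), the two denominators from the footnotes, and the observation that each "recovered" count equals the sum of the three category counts in the same row.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one CDS exon as (start, end), inclusive coordinates.
Exon = Tuple[int, int]


def cds_overlap(exons_a: List[Exon], strand_a: str,
                exons_b: List[Exon], strand_b: str) -> bool:
    """Return True if two CDS models share at least one nucleotide on the same strand."""
    if strand_a != strand_b:
        return False
    # Two inclusive intervals intersect when each starts before the other ends.
    return any(a_start <= b_end and b_start <= a_end
               for a_start, a_end in exons_a
               for b_start, b_end in exons_b)


def summarize(reference_loci: int, identical: int,
              matching: int, other: int) -> Dict[str, float]:
    """Recompute one table row from its raw counts.

    'recovered' is relative to all reference loci (footnote a); the three
    categories are relative to the overlapping loci only (footnote b).
    """
    recovered = identical + matching + other

    def pct(part: int, whole: int) -> float:
        return round(100.0 * part / whole, 1)

    return {
        "recovered": recovered,
        "recovered_pct": pct(recovered, reference_loci),
        "identical_pct": pct(identical, recovered),
        "matching_pct": pct(matching, recovered),
        "other_pct": pct(other, recovered),
    }


# TFRAMA row against the TCUR reference (136 loci), counts taken from the table.
row = summarize(reference_loci=136, identical=100, matching=15, other=14)
print(row)
```

Note the change of denominator between columns: 94.9 % is 129 of 136 reference loci, whereas 77.5 % is 100 of the 129 overlapping loci, so the three category percentages sum to 100 % within each row.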