Roberto T Arrial, Roberto C Togawa, Marcelo de M Brigido.
Abstract
BACKGROUND: Transcriptome sequences complement structural genomic information and provide snapshots of an organism's transcriptional profile. Such sequences also offer an alternative means of characterizing neglected species that are not expected to undergo whole-genome sequencing. One difficulty in transcriptome sequencing of these organisms is the low quality of reads and incomplete coverage of transcripts, both of which compromise further bioinformatics analyses. Another complicating factor is the lack of known protein homologs, which frustrates searches against established protein databases. This lack of homologs may be caused by divergence from well-characterized and over-represented model organisms. Another explanation is that non-coding RNAs (ncRNAs) may be captured during sequencing. ncRNAs are RNA sequences that, unlike messenger RNAs, do not code for protein products and instead perform their functions by folding into higher-order structural conformations. ncRNA screening software specific for transcriptome sequences is available, but its analyses are optimized for transcriptomes that are well represented in protein databases, and it assumes that input ESTs are full-length and of high quality.
Year: 2009 PMID: 19653905 PMCID: PMC2731755 DOI: 10.1186/1471-2105-10-239
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Construction of the training database (dbTR). The dbTR comprises both negative and positive instances and was subdivided into transcripts having identified ORFs (dbTR_OP) and transcripts lacking ORFs (dbTR_OA). Each of these subsets harbors its own negative and positive instances. The dbTR_OP training subset was used to induce the protein-dependent SVM model, while the dbTR_OA training subset generated the protein-independent SVM model.
Feature vector description. Cited references either support the coding/non-coding discrimination power of the feature or describe the corresponding program.
| Feature | Number of features | Description | Reference |
| Nucleotide composition* | 84 | Individual nucleotide frequency divided by total nucleotide frequency | [ |
| Transcript length** | 4 | Binary coding: length intervals < 100, 400, 900 and > 900. | [ |
| Amino acid composition§ | 20 | Individual amino acid frequency divided by total amino acid frequency | [ |
| ORF length§ | 4 | Binary coding: length intervals < 20, 60, 100 and > 100. | [ |
| Isoelectric point§ | 1 | Value divided by 14 | [ |
| Compositional entropy§ | 1 | Amount of low complexity residues divided by sequence length | [ |
| Mean hydropathy§ | 1 | Summed means from sliding 3nt window | [ |
*Includes nucleotide, dinucleotide and trinucleotide composition.
**Used only on protein-independent models.
§ Used only on protein-dependent models.
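As an illustration of the two protein-independent feature groups in the table above, the following minimal Python sketch builds the 84 nucleotide composition values (4 mononucleotide, 16 dinucleotide and 64 trinucleotide frequencies) and the 4 binary transcript length values. It is not the authors' implementation. The function names, the use of overlapping windows, the skipping of windows that contain ambiguous bases, and the exact boundaries of the length intervals (< 100, 100 to 399, 400 to 899, >= 900) are all assumptions made for this example.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over ACGT, in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Assumption: overlapping windows; windows with N or other codes are skipped.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


def nucleotide_composition(seq):
    """Concatenate mono-, di- and trinucleotide frequencies: 4 + 16 + 64 = 84."""
    seq = seq.upper()
    return (kmer_frequencies(seq, 1)
            + kmer_frequencies(seq, 2)
            + kmer_frequencies(seq, 3))


def length_bins(length, edges=(100, 400, 900)):
    """One-hot coding of the length interval (interval boundaries are assumed)."""
    bins = [0] * (len(edges) + 1)
    bins[sum(length >= edge for edge in edges)] = 1
    return bins


def protein_independent_features(seq):
    """Hypothetical helper: 84 composition values followed by 4 length bits."""
    return nucleotide_composition(seq) + length_bins(len(seq))
```

Calling `protein_independent_features("ACGTACGT")` under these assumptions yields an 88-element vector. The protein-dependent features (amino acid composition, ORF length, isoelectric point, compositional entropy and mean hydropathy) depend on external programs and are not sketched here.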
Speed performance (in minutes), standard efficiency measures and cross validation accuracy. Indices were calculated from the mean of predictions of the classifiers regarding dbTS_OP and dbTS_OA sets.
| Classifier | ACC (%) | SPC (%) | SEN (%) | F-M (%) | PPV (%) | Time (min.)* | CV acc. (%) |
| PORTRAIT | 91.9 | 95.6 | 86.5 | 90.8 | 92.9 | 21.6 | 92.1 |
| nB classifier | 73.2 | 75.1 | 70.1 | 72.6 | 65.3 | 16.1 | 72.9 |
| CPC | 90.8 | 90.9 | 90.7 | 90.8 | 87.0 | 1,789.7 | 95.8** |
| Random | 49.1 | 45.6 | 54.6 | 49.7 | 44.0 | 0.07 | - |
ACC = accuracy; SPC = specificity; SEN = sensitivity; F-M = F-measure; PPV = positive predictive value; CV acc. = cross-validation accuracy. Prediction threshold in all measures was 0.5 for all classifiers.
*All processes were run on an Intel® XEON™ 1.80 MHz x86 computer with 512 Mb RAM. Runtime refers to prediction of 6,022 sequences from dbPB.
**As reported by Kong et al., 2007. Training set used was not dbTS.
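The efficiency measures listed above have standard definitions in terms of confusion-matrix counts. The sketch below shows those definitions in Python; the function name and the example counts are hypothetical, and which class is treated as positive is left to the caller. Because the table reports means over the dbTS_OP and dbTS_OA sets, its columns may not be exactly reproducible from one another with these formulas.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures from true/false positive and negative counts.

    Values are returned as fractions in [0, 1]; multiply by 100 for percent.
    """
    def ratio(num, den):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    sen = ratio(tp, tp + fn)   # sensitivity (recall)
    spc = ratio(tn, tn + fp)   # specificity
    ppv = ratio(tp, tp + fp)   # positive predictive value (precision)
    acc = ratio(tp + tn, tp + fp + tn + fn)
    f_m = ratio(2 * ppv * sen, ppv + sen)  # F-measure: harmonic mean of PPV and SEN
    return {"ACC": acc, "SPC": spc, "SEN": sen, "F-M": f_m, "PPV": ppv}


# Made-up counts, for illustration only.
example = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```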
Figure 2. ROC curves showing performance of classifiers on dbTS sets. Sensitivity is plotted against (1-specificity), allowing accuracy comparisons among classifiers. A perfect classifier would yield a curve passing through the point (0,1) and ending at (1,1); that is, curves closer to the top-left corner indicate better classification performance. Classification threshold was set to 0.5 for all classifiers.
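A ROC curve of this kind can be traced by sweeping the decision threshold over the prediction scores and recording (1-specificity, sensitivity) at each step. The following Python sketch illustrates the generic procedure, not the authors' plotting code; the function names and the toy scores are assumptions, and a label of 1 marks whichever class is treated as positive.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) points from (0, 0) to (1, 1).

    scores: prediction scores, higher meaning "more likely positive".
    labels: 1 for positive instances, 0 for negative instances.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative instance")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process tied scores together so each threshold yields one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: a classifier that ranks every positive above every negative.
perfect = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

With the toy data, the curve passes through (0, 1) before reaching (1, 1), which is the perfect-classifier shape described in the caption.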
Proportion of transcripts predicted as ncRNA by three classifiers.
| 15.8% | 60.5% | 26.4% | |
| 3.2% | 14.6% | 12.1% | |
| 33.1% | 100% | 49.8% |
*A transcript is considered ncRNA if its prediction score is below 0.5.
**A transcript is considered ncRNA if it is labeled as "noncoding" by the program.
Figure 3. Distribution of . Annotations of the 6,022 transcripts [32] were considered only after classifier prediction, so even transcripts previously manually annotated as proteins were evaluated for coding potential. A "Confident annotation" refers to a transcript description that lacks the words "putative", "probable" and "hypothetical". The numbers of transcripts classified as ncRNA are shown in the legend (except for dbPB, which shows the total of Pb transcripts).