Manuel Landesfeind, Peter Meinicke.
Abstract
BACKGROUND: The annotation of biomolecular functions is an essential step in the analysis of newly sequenced organisms. Usually, the functions are inferred from predicted genes on the genome using homology search techniques. A high quality genomic sequence is an important prerequisite which, however, is difficult to achieve for certain organisms, such as hybrids or organisms with a large genome. For functional analysis it is also possible to use a de novo transcriptome assembly but the computational requirements can be demanding. Up to now, it is unclear how much of the functional repertoire of an organism can be reliably predicted from unassembled RNA-seq short reads alone.
Year: 2014 PMID: 25409897 PMCID: PMC4258056 DOI: 10.1186/1471-2164-15-1003
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Prediction performance for different tools before filtering
| Tool | Sensitivity (in %) | AUC (in %) | | |
|---|---|---|---|---|
| BLASTX | 97.21 | 92.02 | 93.09 | **94.95** |
| RAPSearch | 97.04 | 96.13 | **96.26** | 95.38 |
| PAUDA | 96.36 | 96.89 | **96.94** | 94.66 |
| UProC | 97.10 | 92.20 | 95.62 | **95.68** |

| Tool | Maximum F1–Score (in %) | | |
|---|---|---|---|
| BLASTX | 74.19 | 76.77 | **82.80** |
| RAPSearch | 87.03 | **87.72** | 86.13 |
| PAUDA | 88.30 | **88.63** | 83.09 |
| UProC | 75.01 | 87.67 | **88.60** |
The area under the curve (AUC) was calculated on sorted functions. The maximum F1–Score corresponds to the best possible separation between false and true predictions. Quality scores are averaged over all samples. The maximum AUC and F1–Score per tool are marked in bold text.
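As an illustration of the two quality scores named in this caption, the following minimal sketch ranks predictions by their evidence value and computes the area under the ROC curve together with the maximum F1–Score over all possible cut points. The function name and the input format (pairs of evidence value and a true/false label) are assumptions made for this example; the exact procedure used in the study is not given in this record.

```python
def roc_auc_and_max_f1(scored):
    """Return (AUC, maximum F1) for (evidence_value, is_annotated) pairs.

    Hypothetical helper: predictions are ranked from highest to lowest
    evidence value and every distinct value is tried as a cut point.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false prediction")

    tp = fp = 0
    auc = 0.0
    best_f1 = 0.0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(ranked):
        # Predictions sharing the same evidence value enter together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        tpr = tp / n_pos
        fpr = fp / n_neg
        # Trapezoidal rule on the ROC curve.
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        fn = n_pos - tp
        best_f1 = max(best_f1, 2 * tp / (2 * tp + fp + fn))
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc, best_f1
```

Calling `roc_auc_and_max_f1` on the ranked predictions of one sample yields fractions between 0 and 1; multiplying by 100 gives percentages comparable to the table.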
Figure 1. Score distribution and fitted Gamma mixture model. Histogram of scores from sample SRR360152 with threshold estimator using scaled mean–score and Gamma Mixture Model. The evidence value histograms of the falsely predicted and the annotated functions are colored in red and green, respectively. The curves correspond to the probability distributions of the two component mixture model. Although the probability density curves are shown colored in the plot, the fitting of the model was performed in an unsupervised manner. Histograms were generated from sample SRR360152 based on the results from BLASTX (a), RAPSearch (b), PAUDA (c), and UProC (d).
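The unsupervised fit described in this caption can be sketched as follows. This is a simplified illustration, not the authors' estimator: it fits a two-component Gamma mixture with an EM-style loop whose update step uses moment matching (shape = mean²/variance, scale = variance/mean) instead of a full maximum-likelihood update, and it takes the threshold as the point where the posterior of the high-score component reaches 0.5. All function names, the initialization by splitting the sorted scores in half, and the simulated data are assumptions made for this example.

```python
import math
import random


def gamma_pdf(x, shape, scale):
    """Density of a Gamma(shape, scale) distribution at x > 0."""
    log_p = ((shape - 1.0) * math.log(x) - x / scale
             - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_p)


def posterior_high(x, params):
    """Posterior probability that x belongs to the high-score component."""
    (w0, a0, s0), (w1, a1, s1) = params
    p0 = w0 * gamma_pdf(x, a0, s0)
    p1 = w1 * gamma_pdf(x, a1, s1)
    total = p0 + p1
    if total == 0.0:
        # Both densities underflowed: fall back to the nearer component mean.
        return 1.0 if abs(x - a1 * s1) < abs(x - a0 * s0) else 0.0
    return p1 / total


def fit_gamma_mixture(scores, iterations=50):
    """Return [(weight, shape, scale)] for the low and high component.

    Scores must be strictly positive. No labels are used at any point.
    """
    xs = sorted(scores)
    n = len(xs)
    # Assumed initialization: lower half -> component 0, upper half -> 1.
    resp = [0.0 if i < n // 2 else 1.0 for i in range(n)]
    params = []
    for _ in range(iterations):
        # Update step by weighted moment matching (an approximation).
        params = []
        for k in (0, 1):
            w = [r if k == 1 else 1.0 - r for r in resp]
            total = sum(w)
            if total < 1e-9:
                raise ValueError("a mixture component collapsed")
            mean = sum(wi * x for wi, x in zip(w, xs)) / total
            var = sum(wi * (x - mean) ** 2 for wi, x in zip(w, xs)) / total
            var = max(var, 1e-12)
            params.append((total / n, mean * mean / var, var / mean))
        # Expectation step: recompute responsibilities.
        resp = [posterior_high(x, params) for x in xs]
    return params


def decision_threshold(params, steps=60):
    """Bisect between the two component means for the 0.5 posterior crossing.

    Assumes the posterior is below 0.5 at the low mean and above it at
    the high mean, which holds for reasonably separated components.
    """
    lo = params[0][1] * params[0][2]
    hi = params[1][1] * params[1][2]
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if posterior_high(mid, params) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    random.seed(1)
    # Simulated scores only; these are not data from the study.
    simulated = ([random.gammavariate(2.0, 1.0) for _ in range(300)]
                 + [random.gammavariate(50.0, 1.0) for _ in range(300)])
    fitted = fit_gamma_mixture(simulated)
    print("threshold:", decision_threshold(fitted))
```

Functions whose evidence value lies above the returned threshold would be kept; everything below it would be treated as a likely false prediction.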
Average performance after filtering
| Tool | TPR | FPR | PPV | F1 |
|---|---|---|---|---|
| BLASTX | 92.85 | 8.57 | 72.64 | 81.50 |
| RAPSearch | 96.66 | 17.19 | 57.98 | 72.47 |
| PAUDA | 94.68 | 9.38 | 71.41 | 81.34 |
| UProC | 94.97 | 9.34 | 71.34 | 81.47 |
| | | | | |
| BLASTX | 54.94 | 1.00 | 93.60 | 68.54 |
| RAPSearch | 93.75 | 4.26 | 84.35 | 88.80 |
| PAUDA | 87.31 | 2.33 | 90.20 | 88.72 |
| UProC | 94.02 | 4.76 | 82.88 | 88.10 |
Performance averaged over all samples after filtering the mean–score (MS) using a Gaussian mixture model and the scaled mean–score (SMS) using a Gamma mixture model. True positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), and F1–Score (F1) are used as performance measures. All values are given in percent.
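For reference, the four performance measures follow directly from the counts of true and false predictions. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and do not come from the study.

```python
def performance_measures(tp, fp, tn, fn):
    """Return TPR, FPR, PPV and F1 in percent from confusion counts."""

    def ratio(numerator, denominator):
        # Define empty ratios as 0 to avoid division by zero.
        return numerator / denominator if denominator else 0.0

    tpr = ratio(tp, tp + fn)           # true positive rate (sensitivity)
    fpr = ratio(fp, fp + tn)           # false positive rate
    ppv = ratio(tp, tp + fp)           # positive predictive value (precision)
    f1 = ratio(2 * tp, 2 * tp + fp + fn)  # harmonic mean of PPV and TPR
    return {
        "TPR": 100.0 * tpr,
        "FPR": 100.0 * fpr,
        "PPV": 100.0 * ppv,
        "F1": 100.0 * f1,
    }
```

Because the table reports averages over samples, the F1 column need not equal the harmonic mean of the averaged TPR and PPV exactly, although the two are usually close.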
Figure 2. Prediction performance after filtering and consensus on evidence values. The arrows indicate the increasing consensus threshold ranging from one to five.
Figure 3. Prediction performance after filtering and consensus on evidence values. The arrows indicate the increasing consensus threshold ranging from one to five.
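A consensus threshold of this kind can be illustrated with a simple voting rule: a function is kept only if it survives filtering in at least a given number of prediction sets. What exactly the sets are in the study is not stated in this record, so the sketch below treats them as generic sets of function identifiers; the function name and the identifiers in the usage note are placeholders.

```python
from collections import Counter


def consensus(prediction_sets, min_support):
    """Keep functions present in at least `min_support` prediction sets."""
    counts = Counter()
    for predicted in prediction_sets:
        # set() guards against duplicates within one prediction set.
        counts.update(set(predicted))
    return {func for func, n in counts.items() if n >= min_support}
```

For example, with the sets `{"func_a", "func_b"}`, `{"func_a"}` and `{"func_a", "func_c"}`, a threshold of one keeps all three functions, while a threshold of two keeps only `func_a`. Raising the threshold trades sensitivity for precision, which is the direction the arrows in the figures indicate.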
Performance of the functional prediction from transcriptomic assembly using different E–value thresholds
| E-value cutoff | FPR | TPR | PPV | F1 |
|---|---|---|---|---|
| 10 | 89.63 | 95.69 | 20.16 | 33.30 |
| 1e-1 | 50.86 | 95.69 | 30.79 | 46.59 |
| 1e-5 | 35.67 | 95.59 | 38.79 | 55.19 |
| 1e-10 | 26.62 | 95.49 | 45.89 | 61.99 |
| 1e-25 | 15.24 | 94.31 | 59.41 | 72.90 |
| 1e-50 | 8.33 | 89.44 | 71.74 | 79.62 |
| 1e-75 | 4.59 | 80.43 | 80.56 | 80.49 |
| 1e-100 | 2.95 | 72.52 | 85.31 | 78.40 |
True positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), and F1–Score (F1) are utilized as performance measures. All values are given in percent.
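The trade-off in this table can be checked with a few lines of arithmetic. The sketch below copies the rows of the table, recomputes F1 as the harmonic mean of TPR and PPV, and selects the cutoff with the highest reported F1. The variable and function names are chosen for this example only.

```python
# (E-value cutoff, FPR, TPR, PPV, F1), all in percent, copied from the table.
ROWS = [
    ("10", 89.63, 95.69, 20.16, 33.30),
    ("1e-1", 50.86, 95.69, 30.79, 46.59),
    ("1e-5", 35.67, 95.59, 38.79, 55.19),
    ("1e-10", 26.62, 95.49, 45.89, 61.99),
    ("1e-25", 15.24, 94.31, 59.41, 72.90),
    ("1e-50", 8.33, 89.44, 71.74, 79.62),
    ("1e-75", 4.59, 80.43, 80.56, 80.49),
    ("1e-100", 2.95, 72.52, 85.31, 78.40),
]


def f1_from(tpr, ppv):
    """Harmonic mean of TPR (recall) and PPV (precision)."""
    return 2.0 * tpr * ppv / (tpr + ppv)


def best_cutoff(rows):
    """Return the cutoff label of the row with the highest reported F1."""
    return max(rows, key=lambda row: row[4])[0]


if __name__ == "__main__":
    for cutoff, _fpr, tpr, ppv, f1 in ROWS:
        print(cutoff, round(f1_from(tpr, ppv), 2), f1)
    print("best cutoff by F1:", best_cutoff(ROWS))
```

Stricter cutoffs lower the FPR and raise the PPV at the cost of TPR, so F1 peaks at an intermediate cutoff (1e-75 in this table) rather than at either extreme.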