Jonathan L Klassen, Cameron R Currie.
Abstract
The high-throughput annotation of open reading frames (ORFs) required by modern genome sequencing projects necessitates computational protocols that sometimes annotate orthologous ORFs inconsistently. Such inconsistencies hinder comparative analyses by non-uniformly extending or truncating 5' and/or 3' sequence ends, causing ORFs that are in fact identical to artificially diverge. Whereas strategies exist to correct such inconsistencies during whole-genome annotation, equivalent software designed to correct subsets of these data without genome reannotation is lacking. We therefore developed ORFcor, which corrects annotation inconsistencies using consensus start and stop positions derived from sets of closely related orthologs. ORFcor corrects inconsistent ORF annotations in diverse test datasets with specificities and sensitivities approaching 100% when sufficiently related orthologs (e.g., from the same taxonomic family) are available for comparison. The ORFcor package is implemented in Perl, multithreaded to handle large datasets, includes related scripts to facilitate high-throughput phylogenomic analyses, and is freely available at www.currielab.wisc.edu/downloads.html.
Year: 2013 PMID: 23484025 PMCID: PMC3590147 DOI: 10.1371/journal.pone.0058387
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Illustrative examples of the ORFcor approach.
A sequence alignment is given where one sequence is overextended (red box), one is chimeric (green box) and one is truncated (blue box). For each altered sequence, the consensus unaligned sequence positions for both the query and reference are indicated, compared with the relevant (non-default) parameters, and the resulting alterations to the sequences indicated.
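As a rough illustration of the idea in this figure, the sketch below compares the first aligned residue of a query with the most common start column among sufficiently similar reference sequences. It is not ORFcor's implementation (which is in Perl): the function names, the `min_offset` tolerance, and the toy alignment are hypothetical, and only the identity cutoff of 0.75 echoes the d value reported on this page. Only the 5' end is handled; chimeras are not modelled.

```python
from collections import Counter

GAP = "-"

def start_column(seq):
    """Column index of the first non-gap character in an aligned sequence."""
    return next(i for i, c in enumerate(seq) if c != GAP)

def identity(a, b):
    """Fraction of identical residues over columns where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def classify_five_prime(query, aligned, min_identity=0.75, min_offset=5):
    """Label the query's 5' end relative to the consensus start of its close references.

    min_identity plays the role of an identity cutoff; min_offset is a
    hypothetical tolerance (in alignment columns) chosen for this example.
    """
    refs = [name for name, seq in aligned.items()
            if name != query and identity(aligned[query], seq) >= min_identity]
    if not refs:
        return "no references"
    # Consensus start = most frequent start column among the accepted references.
    consensus = Counter(start_column(aligned[n]) for n in refs).most_common(1)[0][0]
    delta = start_column(aligned[query]) - consensus
    if delta <= -min_offset:
        return "overextended"
    if delta >= min_offset:
        return "truncated"
    return "consistent"

# Toy alignment (hypothetical data): one overextended and one truncated sequence.
aligned = {
    "q_long":  "MKTLLAMSTNPKPQRK",
    "ref1":    "------MSTNPKPQRK",
    "ref2":    "------MSTNPKPQRK",
    "ref3":    "------MSTNPKPQRR",
    "q_short": "------------PQRK",
}

for name in ("q_long", "q_short", "ref1"):
    print(name, classify_five_prime(name, aligned))
```

A real correction step would then trim or flag the query at the consensus position; the sketch stops at classification to keep the comparison logic visible.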
Performance of ORFcor run on simulated inconsistency-containing data in comparison to known values using the parameters: a = 5; b = 10; d = 0.75 or 0.90; f = 10; g = 30; l = k = 1000.
| | Test Dataset #1 | Test Dataset #2 | Test Dataset #1 | Test Dataset #2 |
| | | | | |
| Sp | 100.00% | 99.99% | 100.00% | 100.00% |
| Sn | 67.33% | 92.00% | 64.00% | 88.33% |
| Mean 100% accurate corrections | 83.66±1.85% | 55.43±1.19% | 86.98±1.56% | 62.26±1.53% |
| Mean deviation from perfect correction | 1.18±0.46 AA | 3.76±2.92 AA | 1.04±0.20 AA | 3.29±2.79 AA |
| | | | | |
| Sp | 100.00% | 100.00% | 100.00% | 100.00% |
| Sn | 65.00% | 87.33% | 62.67% | 85.33% |
| Mean 100% accurate corrections | 82.05±1.56% | 53.82±1.52% | 85.11±1.78% | 63.28±1.53% |
| Mean deviation from perfect correction | 1.34±0.68 AA | 3.06±2.24 AA | 1.14±0.36 AA | 3.29±2.64 AA |
| | | | | |
| Sp | 99.72% | 99.72% | 99.64% | 99.66% |
| Sn | 99.82% | 99.82% | 98.58% | 98.52% |
| Mean 100% accurate corrections | 98.65±1.91% | 98.75±2.13% | 99.52±3.23% | 99.49±3.23% |
| Mean deviation from perfect correction | 9.37±22.49 AA | 9.14±20.97 AA | 10.56±17.26 AA | 9.51±16.79 AA |
| | | | | |
| Sp | 100.00% | 100.00% | 100.00% | 100.00% |
| Sn | 98.15% | 100.00% | 97.78% | 98.80% |
| Mean 100% accurate corrections | 95.59±2.09% | 94.80±1.75% | 96.99±2.04% | 96.61±1.86% |
| Mean deviation from perfect correction | 2.10±2.60 AA | 2.01±2.44 AA | 1.00±0.00 AA | 1.21±1.00 AA |
| | | | | |
| Sp | 99.80% | 99.81% | 99.99% | 99.99% |
| Sn | 98.97% | 99.41% | 98.07% | 97.99% |
| Mean 100% accurate corrections | 93.14±5.10% | 90.65±4.58% | 95.91±5.05% | 94.38±4.95% |
| Mean deviation from perfect correction | 2.61±12.08 AA | 1.64±1.52 AA | 2.40±3.27 AA | 1.80±1.33 AA |
| | | | | |
| Sp | 100.00% | 100.00% | 100.00% | 100.00% |
| Sn | 97.45% | 99.49% | 97.56% | 97.78% |
| Mean 100% accurate corrections | 88.80±2.53% | 86.77±2.64% | 94.51±2.42% | 92.51±2.57% |
| Mean deviation from perfect correction | 1.67±1.50 AA | 2.12±2.15 AA | 1.21±1.59 AA | 0.97±1.34 AA |
The lengths of the simulated errors are indicated for each error type; test dataset #1 represents the shortest possible errors detectable using the tested ORFcor parameters. Errors were introduced into test sequences at rates: 5′ overextensions and truncations - 5%; 3′ overextensions and truncations - 1%; 5′ and 3′ chimeras - 0.1%. See File S1 for the complete simulation results.
Mean 100% accurate correction values and their standard deviations were derived from all true-positive values, averaged over the 50 replicates run for each parameter set.
Mean deviation from perfect correction is derived from true-positive values that are not 100% accurate.
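The four row types in the table can be computed as in the minimal sketch below, assuming the standard definitions of specificity (Sp) and sensitivity (Sn) from confusion counts. The function names and all input numbers are hypothetical; the page does not show how ORFcor's evaluation scripts tally these values.

```python
from statistics import mean

def sensitivity(tp, fn):
    """Sn: fraction of introduced errors that were detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Sp: fraction of unaltered sequences that were left alone."""
    return tn / (tn + fp)

def correction_accuracy(deviations):
    """Summarise true-positive corrections for one replicate.

    deviations: absolute distance, in amino acids, between each corrected
    boundary and the known boundary (0 means a perfect correction).
    Returns (percent perfectly corrected, mean deviation of the imperfect ones).
    """
    inexact = [d for d in deviations if d != 0]
    pct_exact = 100 * (len(deviations) - len(inexact)) / len(deviations)
    mean_dev = mean(inexact) if inexact else 0.0
    return pct_exact, mean_dev

# Hypothetical replicates; the table reports means over 50 replicates per parameter set.
replicates = [
    [0, 0, 0, 1, 0, 2, 0, 0],
    [0, 0, 3, 0],
]
per_replicate = [correction_accuracy(r) for r in replicates]
mean_pct_exact = mean(p for p, _ in per_replicate)

print(sensitivity(tp=90, fn=10), specificity(tn=990, fp=10))
print(per_replicate, mean_pct_exact)
```

Note that the deviation statistic excludes perfect corrections, so a small mean deviation alongside a low percentage of perfect corrections indicates many near misses rather than a few large ones.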
Figure 2. Effect of varying d (the minimum identity required between two protein sequences to be used for correction) on ORFcor performance measured using the F-score.
Simulations using test datasets #1 and #2 are shown in panels A and B, respectively.
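For readers unfamiliar with the metric, the sketch below computes the F-score in its common form, the weighted harmonic mean of precision and recall. The page does not state which variant or weighting was used for this figure, so the default `beta=1.0` (the balanced F1 score) and the example counts are assumptions.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from confusion counts; beta=1.0 gives the balanced F1 score."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 80 correct corrections, 20 false corrections, 20 missed errors.
print(f_score(tp=80, fp=20, fn=20))
```

Because the F-score ignores true negatives, it is less flattered than specificity by the large number of unaltered sequences in a dataset where errors are rare.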
Performance of ORFcor on 123 complete genome sequences belonging to the Enterobacteriaceae using default settings, except as indicated.
| | Minimum identity threshold | Minimum identity threshold |
| | | |
| True corrections (% total ORFs) | 1 (0.02) | 1 (0.02) |
| False corrections (% total ORFs) | 0 (0.00) | 0 (0.00) |
| | | |
| True corrections (% total ORFs) | 1 (0.02) | 1 (0.02) |
| False corrections (% total ORFs) | 0 (0.00) | 0 (0.00) |
| | | |
| True corrections (% total ORFs) | 81 (1.45) | 78 (1.40) |
| False corrections (% total ORFs) | 34 (0.61) | 0 (0.00) |
| | | |
| True corrections (% total ORFs) | 4 (0.07) | 4 (0.07) |
| False corrections (% total ORFs) | 0 (0.00) | 0 (0.00) |
| | | |
| True corrections (% total ORFs) | 51 (0.91) | 50 (0.90) |
| False corrections (% total ORFs) | 1 (0.02) | 0 (0.00) |
| | | |
| True corrections (% total ORFs) | 8 (0.14) | 6 (0.11) |
| False corrections (% total ORFs) | 0 (0.00) | 0 (0.00) |
| | | |
| True corrections (% total ORFs) | 146 (2.62) | 140 (2.51) |
| False corrections (% total ORFs) | 35 (0.63) | 0 (0.00) |
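The counts in the last block of this table equal the column sums of the six blocks above it, which suggests it is a totals row; the block labels did not survive on this page, so that reading is an inference from the arithmetic rather than something stated here. The short check below uses the counts transcribed from the table; the variable and function names are illustrative.

```python
# (true corrections, false corrections) for the first six blocks of the table.
first_threshold = [(1, 0), (1, 0), (81, 34), (4, 0), (51, 1), (8, 0)]
second_threshold = [(1, 0), (1, 0), (78, 0), (4, 0), (50, 0), (6, 0)]

def totals(rows):
    """Sum true and false corrections over all blocks."""
    return sum(t for t, _ in rows), sum(f for _, f in rows)

def false_share(rows):
    """Fraction of all corrections made that were false."""
    true_total, false_total = totals(rows)
    return false_total / (true_total + false_total)

print(totals(first_threshold))    # matches the final block: 146 true, 35 false
print(totals(second_threshold))   # matches the final block: 140 true, 0 false
print(round(false_share(first_threshold), 3), false_share(second_threshold))
```

Read this way, moving from the first to the second threshold column removes all 35 false corrections at a cost of 6 true corrections.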
Figure 3. Stability of ORFcor corrections with increasing taxonomic divergence of the dataset.
Values shown indicate ORFcor corrections that differ between test datasets subdivided at the lowest (genus in A; family in B) and higher taxonomic subdivisions. Genomes not classified at particular taxonomic levels were excluded from analysis.