Jia-Feng Yu, Ke Xiao, Dong-Ke Jiang, Jing Guo, Ji-Hua Wang, Xiao Sun.
Abstract
Falsely annotated protein-coding genes are considered one of the major causes of annotation errors in public databases. Although many filtering approaches have been designed to remove over-annotated protein-coding genes, some are questionable because they increase the number of false negatives. Furthermore, no webserver or software has been devised specifically for the problem of over-annotation. In this study, we propose an integrative algorithm for detecting over-annotated protein-coding genes in microorganisms. Overall, an average accuracy of 99.94% is achieved over 61 microbial genomes. This extremely high accuracy indicates that the presented algorithm effectively differentiates protein-coding genes from non-coding open reading frames. Extensive analyses show that the predictions are reliable and that the integrative algorithm is robust and convenient. Our analysis also indicates that over-annotated protein-coding genes can cause false positives in the detection of horizontal gene transfers. The webserver for the proposed algorithm is freely accessible at www.cbi.seu.edu.cn/RPGM.
Year: 2011 PMID: 21903723 PMCID: PMC3223076 DOI: 10.1093/dnares/dsr030
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Numerical descriptors for three short sequences: (a) ATG CAT TTA, (b) CAT ATG TTA and (c) ATG TTA CAT
| Strategy I, Seq. a | Strategy I, Seq. b | Strategy I, Seq. c | Strategy II, Seq. a | Strategy II, Seq. b | Strategy II, Seq. c | Strategy III, Seq. a | Strategy III, Seq. b | Strategy III, Seq. c |
|---|---|---|---|---|---|---|---|---|
| 7/3 | 7/3 | 7/3 | 8/3 | 8/3 | 8/3 | 2 | 2 | 2 |
| −1 | −1 | −1 | 1/3 | 1/3 | 1/3 | 7/3 | 7/3 | 7/3 |
| 8/3 | 8/3 | 8/3 | 2 | 2 | 2 | 3 | 3 | 3 |
| 14/3 | 3 | 19/3 | 13/3 | 5 | 14/3 | 7/3 | 11/3 | 8/3 |
| −1 | −3 | 0 | 1/3 | 7/3 | 4/3 | 14/3 | 3 | 19/3 |
| 28/3 | 8 | 20/3 | 14/3 | 28/3 | −2/3 | −2/3 | −1/3 | 17/3 |
Accuracies of mutual validations for the genomes with different G + C content based on the 75D vector
| Species | ||||||
|---|---|---|---|---|---|---|
| 100 | 88.79 | 50.25 | 9.01 | 0.97 | 0 | |
| 99.79 | 99.87 | 87.11 | 60.66 | 12.84 | 2.15 | |
| 99.38 | 99.53 | 99.85 | 99.47 | 99.32 | 96.97 | |
| 91.99 | 96.86 | 96.59 | 100 | 99.46 | 99.24 | |
| 98.36 | 99.13 | 99.75 | 100 | 99.78 | 100 | |
| 2.67 | 13.35 | 70.06 | 99.34 | 99.18 | 100 |
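For reference, G + C content is the fraction of bases in a sequence that are G or C. The sketch below is only an illustration of that definition, applied to the three short example sequences (a), (b) and (c) listed earlier; the function name `gc_content` and the dictionary `examples` are assumptions introduced here, not part of the published algorithm.

```python
from fractions import Fraction


def gc_content(seq: str) -> Fraction:
    """Return the fraction of A/C/G/T bases in seq that are G or C.

    Spaces between codons and any other non-base characters are ignored.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc_count = sum(1 for b in bases if b in "GC")
    return Fraction(gc_count, len(bases))


# Hypothetical container for the three example sequences (a), (b), (c).
examples = {
    "a": "ATG CAT TTA",
    "b": "CAT ATG TTA",
    "c": "ATG TTA CAT",
}

gc = {name: gc_content(seq) for name, seq in examples.items()}
for name, value in gc.items():
    print(f"Seq. {name}: G + C content = {value} ({float(value):.2%})")
```

Because the three sequences contain the same codons in a different order, their G + C content is identical; a composition measure of this kind cannot by itself tell them apart.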
Figure 1. Comparing the Fisher coefficients (C) between P. aeruginosa and Buchnera.
Prediction results for genomes of different sizes
| Species | |||||
|---|---|---|---|---|---|
| 98.36 | 99.13 | 99.75 | 100 | 100 | |
| 0.21 | 0.80 | 41.62 | 91.91 | 98.99 | |
| 71.46 | 81.58 | 93.68 | 99.47 | 99.12 | |
| 98.15 | 99.53 | 99.80 | 100 | 100 | |
| 7.60 | 71.09 | 95.64 | 99.54 | 99.87 |
Figure 2. Scatter plot of axis 1 against CAI.
Figure 3. Projecting the annotated ORFs into 2D coordinates by PCA.
Distribution of sequence length among the 925 recognized non-coding ORFs
| 300 bp < | Average (bp) | |||||
|---|---|---|---|---|---|---|
| Number | Percentage | Number | Percentage | Number | Percentage | |
| 367 | 39.68 | 335 | 36.22 | 223 | 24.11 | 411 |
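The percentages in this table follow directly from the three counts and the total of 925 recognized non-coding ORFs. A minimal check of that arithmetic, with variable names chosen here purely for illustration:

```python
# Counts of recognized non-coding ORFs in the three length classes,
# copied from the table above.
counts = [367, 335, 223]

# The three classes should add up to the 925 ORFs named in the caption.
total = sum(counts)

# Percentage of the total in each class, rounded to two decimals as in the table.
percentages = [round(100 * n / total, 2) for n in counts]

print(total)
print(percentages)
```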
Figure 4. Correlations among the 10 vectors composed of each genome.