Dan Lin, Xin Yin, Xianlong Wang, Peng Zhou, Feng-Biao Guo.
Abstract
The annotation of the well-studied organism Saccharomyces cerevisiae has improved over the past decade, yet the number of biologically significant open reading frames (ORFs) in the yeast genome remains debated. We revisited the total count of protein-coding genes in the S. cerevisiae S288c genome using a theoretical approach that combines the Support Vector Machine (SVM) method with six widely used measurements of sequence statistical features. The accuracy of our method is over 99.5% in 10-fold cross-validation. Based on the annotation data in the Saccharomyces Genome Database (SGD), we studied the coding capacity of all 1744 ORFs that lack experimental results and suggest that, after removing 488 spurious ORFs, the overall number of chromosomal ORFs encoding proteins in yeast should be 6091. The importance of the present work lies in at least two aspects. First, cross-validation and retrospective examination showed the fidelity of our method in recognizing ORFs that likely encode proteins. Second, we have provided a web service, accessible at http://cobi.uestc.edu.cn/services/yeast/, which enables the prediction of protein-coding ORFs of the genus Saccharomyces with high accuracy.
Year: 2013 PMID: 23874379 PMCID: PMC3707884 DOI: 10.1371/journal.pone.0064477
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Performance for all the 63 groups of measurements in cross-validation.*
| Measurements | Sn | Stdev (Sn) | Sp | Stdev (Sp) | Accuracy |
| 1 | 97.52% | 0.39% | 96.68% | 0.74% | 97.10% |
| 2 | 99.23% | 0.34% | 99.21% | 0.46% | 99.22% |
| 3 | 99.71% | 0.28% | 99.77% | 0.16% | 99.74% |
| 4 | 99.67% | 0.24% | 99.69% | 0.24% | 99.68% |
| 5 | 96.61% | 0.69% | 95.66% | 1.24% | 96.13% |
| 6 | 98.04% | 0.64% | 96.48% | 1.05% | 97.26% |
| 1, 2 | 99.21% | 0.38% | 99.29% | 0.37% | 99.25% |
| 1, 3 | 99.67% | 0.30% | 99.80% | 0.19% | 99.74% |
| 1, 4 | 99.67% | 0.28% | 99.74% | 0.25% | 99.71% |
| 1, 5 | 99.36% | 0.45% | 99.55% | 0.31% | 99.45% |
| 1, 6 | 99.30% | 0.38% | 99.01% | 0.68% | 99.15% |
| 2, 3 | 99.69% | 0.25% | 99.74% | 0.16% | 99.72% |
| 2, 4 | 99.71% | 0.28% | 99.74% | 0.25% | 99.73% |
| 2, 5 | 99.65% | 0.38% | 99.66% | 0.26% | 99.65% |
| 2, 6 | 99.44% | 0.37% | 99.35% | 0.37% | 99.39% |
| 3, 4 | 99.71% | 0.31% | 99.74% | 0.26% | 99.73% |
| 3, 5 | 99.71% | 0.28% | 99.72% | 0.19% | 99.71% |
| 3, 6 | 99.52% | 0.31% | 99.40% | 0.44% | 99.46% |
| 4, 5 | 99.69% | 0.30% | 99.72% | 0.26% | 99.70% |
| 4, 6 | 99.67% | 0.29% | 99.55% | 0.37% | 99.61% |
| 5, 6 | 98.68% | 0.57% | 98.07% | 0.52% | 98.37% |
| 1, 2, 3 | 99.67% | 0.25% | 99.74% | 0.16% | 99.71% |
| 1, 2, 4 | 99.69% | 0.24% | 99.72% | 0.25% | 99.70% |
| 1, 2, 5 | 99.65% | 0.38% | 99.63% | 0.27% | 99.64% |
| 1, 2, 6 | 99.50% | 0.42% | 99.38% | 0.34% | 99.44% |
| 1, 3, 4 | 99.69% | 0.27% | 99.72% | 0.26% | 99.70% |
| 1, 3, 5 | 99.67% | 0.28% | 99.80% | 0.19% | 99.74% |
| 1, 3, 6 | 99.59% | 0.33% | 99.46% | 0.37% | 99.52% |
| 1, 4, 5 | 99.67% | 0.28% | 99.77% | 0.23% | 99.72% |
| 1, 4, 6 | 99.63% | 0.27% | 99.72% | 0.27% | 99.67% |
| 1, 5, 6 | 99.32% | 0.38% | 99.12% | 0.56% | 99.22% |
| 2, 3, 4 | 99.71% | 0.28% | 99.77% | 0.26% | 99.74% |
| 2, 3, 5 | 99.69% | 0.25% | 99.74% | 0.16% | 99.72% |
| 2, 3, 6 | 99.67% | 0.29% | 99.49% | 0.23% | 99.58% |
| 2, 4, 5 | 99.69% | 0.28% | 99.72% | 0.29% | 99.70% |
| 2, 4, 6 | 99.65% | 0.16% | 99.72% | 0.26% | 99.68% |
| 2, 5, 6 | 99.50% | 0.36% | 99.26% | 0.44% | 99.38% |
| 3, 4, 5 | 99.69% | 0.30% | 99.74% | 0.26% | 99.72% |
| 3, 4, 6 | 99.67% | 0.24% | 99.72% | 0.31% | 99.69% |
| 3, 5, 6 | 99.57% | 0.32% | 99.46% | 0.47% | 99.51% |
| 4, 5, 6 | 99.67% | 0.28% | 99.57% | 0.38% | 99.62% |
| 1, 2, 3, 4 | 99.71% | 0.28% | 99.77% | 0.25% | 99.74% |
| 1, 2, 3, 5 | 99.67% | 0.25% | 99.77% | 0.16% | 99.72% |
| 1, 2, 3, 6 | 99.67% | 0.29% | 99.49% | 0.24% | 99.58% |
| 1, 2, 4, 5 | 99.69% | 0.28% | 99.72% | 0.23% | 99.70% |
| 1, 2, 4, 6 | 99.65% | 0.26% | 99.77% | 0.24% | 99.71% |
| 1, 2, 5, 6 | 99.52% | 0.39% | 99.38% | 0.33% | 99.45% |
| 1, 3, 4, 5 | 99.71% | 0.28% | 99.74% | 0.25% | 99.73% |
| 1, 3, 4, 6 | 99.69% | 0.26% | 99.74% | 0.26% | 99.72% |
| 1, 3, 5, 6 | 99.57% | 0.33% | 99.49% | 0.36% | 99.53% |
| 1, 4, 5, 6 | 99.63% | 0.27% | 99.72% | 0.27% | 99.67% |
| 2, 3, 4, 5 | 99.71% | 0.28% | 99.74% | 0.26% | 99.73% |
| 2, 3, 4, 6 | 99.67% | 0.28% | 99.80% | 0.24% | 99.74% |
| 2, 3, 5, 6 | 99.67% | 0.26% | 99.52% | 0.24% | 99.59% |
| 2, 4, 5, 6 | 99.67% | 0.26% | 99.74% | 0.26% | 99.71% |
| 3, 4, 5, 6 | 99.69% | 0.24% | 99.74% | 0.31% | 99.72% |
| 1, 2, 3, 4, 5 | 99.71% | 0.24% | 99.74% | 0.26% | 99.73% |
| 1, 2, 3, 4, 6 | 99.69% | 0.28% | 99.77% | 0.20% | 99.73% |
| 1, 2, 3, 5, 6 | 99.65% | 0.29% | 99.77% | 0.24% | 99.71% |
| 1, 2, 4, 5, 6 | 99.65% | 0.26% | 99.77% | 0.24% | 99.71% |
| 1, 3, 4, 5, 6 | 99.67% | 0.24% | 99.77% | 0.26% | 99.72% |
| 2, 3, 4, 5, 6 | 99.69% | 0.22% | 99.80% | 0.24% | 99.75% |
| 1, 2, 3, 4, 5, 6 | 99.69% | 0.28% | 99.80% | 0.20% | 99.75% |
The six measurements are numbered as follows: 1, mono-nucleotide frequencies; 2, di-nucleotide frequencies; 3, mono-codon composition; 4, di-codon composition; 5, mono-amino acid usages; 6, di-amino acid usages.
Boldface indicates the group with the highest accuracy among the 63 combinations.
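The 63 groups in the table are the non-empty subsets of the six measurements (2^6 - 1 = 63). The following is a minimal sketch, not the authors' implementation, of how the four nucleotide-level measurements might be computed and how the groups can be enumerated. The function names, the overlapping windows for di-nucleotides, and the codon-step windows for di-codons are assumptions made for illustration; measurements 5 and 6 would be computed analogously on the translated sequence.

```python
from collections import Counter
from itertools import combinations, product

# Measurement numbering as given in the table footnote.
MEASUREMENTS = {
    1: "mono-nucleotide frequencies",
    2: "di-nucleotide frequencies",
    3: "mono-codon composition",
    4: "di-codon composition",
    5: "mono-amino acid usages",
    6: "di-amino acid usages",
}


def kmer_frequencies(seq, k, step, alphabet="ACGT"):
    """Relative frequency of every possible k-mer, sampling a window every `step` symbols.

    Windows containing symbols outside `alphabet` (e.g. N) are counted in the
    total but not reported, so the vector may then sum to less than 1.
    """
    windows = [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]
    counts = Counter(windows)
    total = len(windows) or 1  # avoid division by zero on very short input
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


def nucleotide_features(orf):
    """Hypothetical helper: feature vectors for measurements 1 to 4 of one in-frame ORF."""
    orf = orf.upper()
    return {
        1: kmer_frequencies(orf, 1, 1),  # 4 values
        2: kmer_frequencies(orf, 2, 1),  # 16 values, overlapping windows (assumption)
        3: kmer_frequencies(orf, 3, 3),  # 64 values, in-frame codons
        4: kmer_frequencies(orf, 6, 3),  # 4096 values, adjacent codon pairs (assumption)
    }


def all_measurement_groups(ids=tuple(MEASUREMENTS)):
    """Every non-empty subset of the measurement ids, smallest groups first."""
    return [g for r in range(1, len(ids) + 1) for g in combinations(ids, r)]


groups = all_measurement_groups()
print(len(groups))            # number of groups evaluated
print(groups[0], groups[-1])  # first and last group
```

With this enumeration order the groups come out in the same order as the table rows, from the six single measurements to the full set.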
List of misclassified genes in 10-fold cross-validation.*
| ORF ID | Gene Name | GC% | Length(bp) |
| YAL064W | – | 37.30887 | 327 |
| YDR504C | SPG3 | 26.30208 | 384 |
| YGL032C | AGA2 | 40.5303 | 264 |
| YJL028W | – | 47.61905 | 336 |
| YJR120W | – | 46.72365 | 351 |
| YNL269W | BSC4 | 40.40404 | 396 |
| YOR302W | – | 41.02564 | 78 |
| YBR058C-A | TSC3 | 34.97942 | 243 |
| YFL010W-A | AUA1 | 42.45614 | 285 |
| YGL168W | HUR1 | 32.43243 | 333 |
| YJL077C | ICS3 | 40.90909 | 396 |
| YKL037W | AIM26 | 49.85994 | 357 |
| YOR031W | CRS5 | 42.38095 | 210 |
| YPL096C-A | ERI1 | 44.44444 | 207 |
| YPL183W-A | RTC6 | 43.97163 | 282 |
All the 15 misclassified ORFs (with an average length of 296.6 nucleotides) are small ORFs, which are usually difficult to identify.
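The average length quoted in the footnote can be re-derived from the Length(bp) column of the table above. This is only a small arithmetic check; the dictionary name is chosen for illustration.

```python
# Lengths in bp copied from the table of misclassified ORFs.
misclassified_lengths_bp = {
    "YAL064W": 327, "YDR504C": 384, "YGL032C": 264, "YJL028W": 336,
    "YJR120W": 351, "YNL269W": 396, "YOR302W": 78, "YBR058C-A": 243,
    "YFL010W-A": 285, "YGL168W": 333, "YJL077C": 396, "YKL037W": 357,
    "YOR031W": 210, "YPL096C-A": 207, "YPL183W-A": 282,
}

mean_length = sum(misclassified_lengths_bp.values()) / len(misclassified_lengths_bp)
longest = max(misclassified_lengths_bp.values())
print(f"{len(misclassified_lengths_bp)} ORFs, mean {mean_length:.1f} bp, longest {longest} bp")
```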
Results of retrospective examination into historical snapshots.*
| Compared Snapshots | n new_veri | n predicted | Misclassified ORFs | Gene Name | Coverage% |
| 2004–2005 | 127 | 1784 | YDR504C | SPG3 | 99.21% |
| 2005–2006 | 94 | 1709 | YJR120W | – | 98.94% |
| 2006–2007 | 216 | 1614 | YAL064W, YJL077C, YGL168W, YJL028W | –, ICS3, HUR1, – | 98.15% |
| 2007–2008 | 39 | 1436 | – | – | 100% |
| 2008–2009 | 103 | 1388 | YKL037W, YPL183W-A | AIM26, RTC6 | 98.06% |
| 2009–2010 | 45 | 1317 | – | – | 100% |
n new_veri denotes the number of verified genes newly added in the updated snapshot of the SGD database; n predicted denotes the total number of protein-coding ORFs predicted from each historical version of the SGD database. Taking the first snapshot as an example, we predicted 1784 coding ORFs based on the 2004 data, which covered 99.21% of the 127 verified genes newly added in 2005.
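The coverage values appear to be the share of newly verified genes that were not misclassified. Under that reading, which is an assumption inferred from the numbers rather than a formula stated in the text, each row can be reproduced from n new_veri and the count of misclassified ORFs listed for it:

```python
def coverage_percent(n_new_verified, n_misclassified):
    """Percentage of newly verified genes that the earlier snapshot predicted as coding."""
    return 100.0 * (n_new_verified - n_misclassified) / n_new_verified


# (snapshot pair, n new_veri, number of misclassified ORFs listed in the table)
snapshots = [
    ("2004-2005", 127, 1),
    ("2005-2006", 94, 1),
    ("2006-2007", 216, 4),
    ("2007-2008", 39, 0),
    ("2008-2009", 103, 2),
    ("2009-2010", 45, 0),
]

coverage_by_snapshot = {
    name: f"{coverage_percent(n_new, n_missed):.2f}%"
    for name, n_new, n_missed in snapshots
}
for name, value in coverage_by_snapshot.items():
    print(name, value)
```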
Figure 1. Histogram distribution of predicted scores of samples on the training set.
Genes with scores <0 are predicted as non-coding and those with scores >0 as coding. As denoted in the overlapping areas, there are 15 false negatives (FN, coding genes with a predicted score below zero) and 7 false positives (FP, intergenic sequences with a score above zero). The 15 misclassified genes are listed in Table 2.
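The decision rule and the error counts in this caption can be turned into sensitivity (Sn) and specificity (Sp) as sketched below. The training-set sizes of 4835 positive and 3515 negative samples are taken from the Figure 2 caption; the function names are illustrative, and treating a score of exactly zero as non-coding is an assumption, since the caption does not define that case.

```python
def predict_label(score):
    """Sign-based decision rule: positive scores are called coding.

    A score of exactly 0 is treated as non-coding here (assumption).
    """
    return "coding" if score > 0 else "non-coding"


def sensitivity_specificity(n_positive, n_negative, false_negatives, false_positives):
    """Return (Sn, Sp) as fractions from class sizes and error counts."""
    true_positives = n_positive - false_negatives
    true_negatives = n_negative - false_positives
    return true_positives / n_positive, true_negatives / n_negative


sn, sp = sensitivity_specificity(
    n_positive=4835, n_negative=3515, false_negatives=15, false_positives=7
)
balanced = (sn + sp) / 2  # mean of Sn and Sp
print(f"Sn = {sn:.2%}, Sp = {sp:.2%}, mean = {balanced:.2%}")
```

These values agree with the Sn, Sp and Accuracy reported for the full six-measurement group in the cross-validation table, which suggests that the Accuracy column is the mean of Sn and Sp rather than the fraction of all samples classified correctly.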
Figure 2. G-T nucleotide distribution on 1st codon position of all four sets of ORFs.
(a) G-T distribution of the 4835 positive and 3515 negative samples in the training sets. (b) G-T distribution of the 1256 predicted genes and 488 rejected spurious ORFs; all 1744 of these ORFs were originally labeled as dubious or uncharacterized in the SGD annotation.
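The excerpt does not spell out how the G-T distribution is computed. One plausible reading, sketched here as an assumption, is that each ORF is represented by the fractions of G and of T among the first bases of its codons. The function name is hypothetical.

```python
from collections import Counter


def first_position_gt(orf):
    """Return (G fraction, T fraction) at the first position of each complete codon."""
    orf = orf.upper()
    n_codons = len(orf) // 3  # ignore any trailing partial codon
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    first_bases = Counter(orf[3 * i] for i in range(n_codons))
    return first_bases["G"] / n_codons, first_bases["T"] / n_codons


# Toy in-frame sequence (not a real yeast ORF): codons ATG, GCT, GAA, TAA.
g_frac, t_frac = first_position_gt("ATGGCTGAATAA")
print(g_frac, t_frac)
```

Each ORF then gives one (G, T) point, so a set of ORFs can be compared as a two-dimensional scatter.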