Xiao Li, Qingan Ren, Yang Weng, Haoyang Cai, Yunmin Zhu, Yizheng Zhang.
Abstract
Predicting protein-coding genes remains a significant challenge. Although a variety of computational programs based on common machine learning methods have emerged, prediction accuracy remains low when they are applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because no training set of abundant validated genes is available. Here we present a new gene-finding program, SCGPred, which improves prediction accuracy by combining multiple sources of evidence. SCGPred can run in a supervised mode on previously well-studied genomes and in an unsupervised mode on novel genomes. When tested on datasets composed of large DNA sequences from human and from the novel genome of Ustilago maydis, SCGPred shows a significant improvement over popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, an approach that is superior to other unsupervised methods. Therefore, SCGPred can serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.
Year: 2008 PMID: 19329068 PMCID: PMC5054121 DOI: 10.1016/S1672-0229(09)60005-X
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Fig. 1 Schematic illustration of SCGPred combining three ab initio gene predictions with the results of two sequence alignments. All predicted exons have the same orientation of transcription (from left to right) and are encoded on the plus strand of a given genomic sequence. The text in each black rectangle denotes the evidence type and the corresponding probabilistic score.
Fig. 2 The relationship between the probabilistic scores of exons predicted by the four ab initio gene finders (GENSCAN, AUGUSTUS, Fgenesh and GeneID) and the proportion of correctly predicted exons.
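The relationship described in the Fig. 2 caption can be approximated by binning predicted exons by score and computing the fraction that are correct in each bin. The following is a minimal sketch, not the authors' code: the function name `accuracy_by_score_bin`, the equal-width binning, and the assumption that scores lie in [0, 1] are all illustrative choices.

```python
def accuracy_by_score_bin(predictions, n_bins=10):
    """Group predicted exons by score and report the fraction correct per bin.

    predictions: iterable of (score, is_correct) pairs, with score assumed
    to lie in [0, 1] (an assumption; each gene finder defines its own score).
    Returns a list of (bin_start, bin_end, proportion_correct) tuples,
    skipping bins that contain no predictions.
    """
    totals = [0] * n_bins
    correct = [0] * n_bins
    for score, is_correct in predictions:
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        totals[idx] += 1
        correct[idx] += int(is_correct)
    return [
        (i / n_bins, (i + 1) / n_bins, correct[i] / totals[i])
        for i in range(n_bins)
        if totals[i] > 0
    ]


# Hypothetical toy data: (exon score, whether the exon matches the annotation).
toy = [(0.1, False), (0.2, True), (0.9, True), (0.95, True)]
for start, end, proportion in accuracy_by_score_bin(toy, n_bins=2):
    print(f"[{start:.2f}, {end:.2f}): {proportion:.2f} correct")
```

If higher-scoring bins show a higher proportion of correct exons, the score can reasonably be treated as a confidence measure when weighing evidence from different predictors.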
Prediction accuracy of SCGPred and other gene finders on human chromosome 22*
| Method | Gene Sn | Gene Sp | Exon Sn | Exon Sp | Exon (Sn+Sp)/2 | Base Sn | Base Sp |
|---|---|---|---|---|---|---|---|
| GENSCAN (GS) | 0.09 | 0.05 | 0.71 | 0.40 | 0.56 | | 0.48 |
| GeneID (GI) | 0.16 | 0.09 | 0.68 | 0.55 | 0.62 | 0.83 | 0.63 |
| Fgenesh (FS) | 0.15 | 0.09 | 0.73 | 0.53 | 0.63 | 0.86 | 0.61 |
| AUGUSTUS (AG) | 0.19 | 0.09 | 0.67 | 0.53 | 0.60 | 0.83 | 0.60 |
| SCGPred: | |||||||
| GS | 0.08 | 0.08 | 0.65 | 0.65 | 0.65 | 0.75 | 0.70 |
| GS+GI | 0.15 | 0.14 | 0.69 | 0.68 | 0.69 | 0.78 | 0.71 |
| GS+GI+FS | 0.18 | 0.16 | 0.73 | 0.68 | 0.71 | 0.82 | 0.71 |
| GS+GI+FS+AG | 0.21 | 0.83 | |||||
| SGP2 | 0.14 | 0.60 | 0.67 | 0.85 | 0.69 | ||
The highest value at each level is indicated in bold. Sn, sensitivity; Sp, specificity.
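For readers unfamiliar with the Sn and Sp columns, the sketch below shows how exon-level values are commonly computed in gene-finding evaluations. It rests on assumptions not stated in this record: that Sn = TP/(TP+FN) and Sp = TP/(TP+FP) (the usual convention in this field, where Sp corresponds to what is elsewhere called precision), and that a predicted exon counts as correct only if both boundaries match the annotation exactly. The function name `exon_accuracy` is hypothetical.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp, (Sn+Sp)/2) at the exon level.

    predicted, annotated: iterables of (start, end) exon coordinates.
    Assumption: an exon is a true positive only if both boundaries
    match an annotated exon exactly.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    # Sn: fraction of annotated exons that were predicted correctly.
    sn = true_positives / len(annotated) if annotated else 0.0
    # Sp: fraction of predicted exons that are correct.
    sp = true_positives / len(predicted) if predicted else 0.0
    return sn, sp, (sn + sp) / 2


# Hypothetical coordinates, for illustration only.
annotated_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted_exons = [(100, 200), (300, 400)]
print(exon_accuracy(predicted_exons, annotated_exons))  # (0.5, 1.0, 0.75)
```

Reporting the average (Sn+Sp)/2 alongside the two values guards against a predictor that looks good on one measure only, for example by predicting very few exons.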
Prediction accuracy of SCGPred and other gene finders on ENCODE regions*
| Method | Gene Sn | Gene Sp | Exon Sn | Exon Sp | Exon (Sn+Sp)/2 | Base Sn | Base Sp |
|---|---|---|---|---|---|---|---|
| GENSCAN (GS) | 0.10 | 0.04 | 0.67 | 0.38 | 0.52 | | 0.43 |
| GeneID (GI) | 0.13 | 0.05 | 0.62 | 0.48 | 0.55 | 0.82 | 0.48 |
| Fgenesh (FS) | 0.15 | 0.05 | | 0.43 | 0.57 | 0.87 | 0.44 |
| AUGUSTUS (AG) | 0.15 | 0.07 | 0.58 | 0.53 | 0.56 | 0.78 | 0.59 |
| SCGPred: | |||||||
| GS | 0.08 | 0.06 | 0.61 | 0.64 | 0.63 | 0.71 | 0.61 |
| GS+GI | 0.14 | 0.10 | 0.64 | 0.66 | 0.65 | 0.77 | 0.60 |
| GS+GI+FS | 0.18 | 0.10 | 0.70 | 0.60 | 0.65 | 0.83 | 0.56 |
| GS+GI+FS+AG | 0.70 | 0.80 | 0.66 | ||||
| SGP2 | 0.14 | 0.10 | 0.70 | 0.61 | 0.66 | 0.85 | |
Prediction accuracy of SCGPred and other gene finders on the U. maydis genome*
| Method | Parameter model | Gene Sn | Gene Sp | Exon Sn | Exon Sp | Exon (Sn+Sp)/2 | Base Sn | Base Sp |
|---|---|---|---|---|---|---|---|---|
| GENSCAN | human | 0.35 | 0.47 | 0.33 | 0.35 | 0.34 | 0.75 | 0.90 |
| GeneID | human | 0.27 | 0.26 | 0.25 | 0.23 | 0.24 | 0.75 | 0.90 |
| | yeast | 0.42 | 0.28 | 0.32 | 0.27 | 0.30 | 0.81 | 0.87 |
| Fgenesh | human | 0.33 | 0.44 | 0.31 | 0.38 | 0.35 | 0.77 | 0.93 |
| 0.40 | 0.48 | 0.39 | 0.43 | 0.41 | 0.77 | 0.94 | ||
| AUGUSTUS | human | 0.41 | 0.39 | 0.35 | 0.37 | 0.36 | 0.81 | 0.93 |
| 0.46 | 0.47 | 0.43 | 0.45 | 0.90 | ||||
| SCGPred | human | 0.42 | 0.51 | 0.38 | 0.47 | 0.43 | 0.73 | 0.95 |
| phylogenetic neighbors | 0.55 | 0.77 | ||||||
| GeneMark-ES | – | 0.52 | 0.56 | 0.45 | 0.53 | 0.49 | 0.82 | 0.94 |
| Agene | 0.16 | 0.22 | 0.36 | 0.34 | 0.35 | 0.83 | 0.95 | |
The highest value at each level is indicated in bold. Sn, sensitivity; Sp, specificity.
Fig. 3 Prediction accuracy of SCGPred on human chromosome 22 versus penalty factor, without (A) and with (B) validation evidence. Panel A shows exon-level accuracy only: for each penalty factor, the top marks exon sensitivity, the bottom marks exon specificity, and the dark point in the middle is the average of the two. Panel B shows accuracy at both the base and exon levels (ESn, exon sensitivity; ESp, exon specificity; BSn, base sensitivity; BSp, base specificity).