Abstract
Translation initiation sites (TISs) are important signals in cDNA sequences. Three major factors have affected performance in previous attempts to predict TISs in cDNA sequences: the nature of the cDNA sequence sets, the relevant features selected, and the classification methods used. In this paper, we examine different approaches to selecting and integrating relevant features for TIS prediction. The top-ranked significant features include those from the position weight matrix and the propensity matrix, the number of C nucleotides in the sequence downstream of the ATG, the number of downstream stop codons, the number of upstream ATGs, and the counts of certain amino acids, such as A and D. Using the numerical data generated from these features, different classification methods, including decision tree, naïve Bayes, and support vector machine, were applied to three independent sequence sets. The identified significant features were found to be biologically meaningful, and the experiments showed promising results.
Year: 2005 PMID: 16393144 PMCID: PMC5172590 DOI: 10.1016/s1672-0229(05)03012-3
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Top Selected Features from Three Different Feature Selection Methods
| Rank | Relief method | Chi2-based method | Information gain method |
|---|---|---|---|
| 1 | 2-gram PWM | 1-gram PWM | 2-gram PWM |
| 2 | # G upstream | # C in the region of [−36, −7] | # C in the region of [−36, −7] |
| 3 | 3-gram PWM | # G at upstream codon position 1 | 3-gram PWM |
| 4 | 1-gram propensity matrix | 2-gram PWM | # G downstream |
| 5 | # C upstream | 3-gram PWM | # G at upstream codon position 1 |
| 6 | # T upstream | # G downstream | # C downstream |
| 7 | # amino acids AA downstream | # C at upstream codon position 3 | 1-gram propensity matrix |
| 8 | # ATG downstream | # downstream stop codon | # T downstream |
| 9 | # A upstream | C at position 139 | # stop codon downstream |
| 10 | # G at downstream codon position 1 | # C downstream | # amino acid A downstream |
| 11 | # C downstream | # downstream in-frame stop codon | # downstream in-frame stop codon |
| 12 | G at position 127 | # 2-gram amino acids AG downstream | # A downstream |
| 13 | C at position 3 of potential downstream codons | T at position 149 | # C at upstream codon position 3 |
| 14 | # 2-gram amino acids GC upstream | C at position 148 | 2-gram propensity matrix |
| 15 | # amino acid A upstream | # amino acid D downstream | # ATG upstream |
Note: # denotes the number of occurrences of the item that follows it. For example, “# G upstream” means the number of G nucleotides in the sequence upstream of the corresponding ATG, and “# 2-gram amino acids GC upstream” means the number of occurrences of the amino acid 2-gram GC that could be encoded by the sequence upstream of the corresponding ATG.
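The count-based features in the note above can be illustrated with a minimal sketch. The function name `count_features`, the use of the whole sequence on either side of the ATG as the counting window, and the in-frame translation of the downstream region are assumptions made for illustration; the source does not specify window sizes or how the counts were implemented.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def in_frame_codons(region):
    """Split a region into complete, non-overlapping codons starting at offset 0."""
    return [region[i:i + 3] for i in range(0, len(region) - 2, 3)]


def count_features(seq, atg_pos):
    """Hypothetical helper: count-based features for the candidate ATG at atg_pos.

    Assumption: the windows are everything before and everything after the ATG.
    """
    seq = seq.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos must point at an ATG")

    upstream = seq[:atg_pos]
    downstream = seq[atg_pos + 3:]
    feats = {}

    # Nucleotide counts on each side of the candidate ATG.
    for base in "ACGT":
        feats[f"# {base} upstream"] = upstream.count(base)
        feats[f"# {base} downstream"] = downstream.count(base)

    # Other ATGs that precede the candidate.
    feats["# ATG upstream"] = upstream.count("ATG")

    # Stop codons in the reading frame set by the candidate ATG.
    codons = in_frame_codons(downstream)
    feats["# downstream in-frame stop codon"] = sum(c in STOP_CODONS for c in codons)

    # Amino acids the downstream region could encode in that frame
    # ("X" marks codons containing non-ACGT symbols, "*" marks stops).
    protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
    for aa in "AD":
        feats[f"# amino acid {aa} downstream"] = protein.count(aa)

    return feats
```

For instance, `count_features("GGATGCCATGGCTTAAGG", 7)` treats the second ATG as the candidate, so `GGATGCC` is the upstream region and `GCTTAAGG` the downstream region.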
Fig. 1. The Matthews correlation coefficients for different numbers of features with information-gain-ranked features and the decision tree method.
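The tables report sensitivity (Se), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC). The sketch below computes them from a confusion matrix, assuming the standard definitions; the source text here does not state its formulas, and the function name `classification_metrics` is hypothetical.

```python
import math


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute Se, Sp, Acc, and MCC from confusion-matrix counts.

    Assumed definitions (standard, not confirmed by the source):
      Se  = TP / (TP + FN)
      Sp  = TN / (TN + FP)
      Acc = (TP + TN) / total
      MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    se = _ratio(tp, tp + fn)
    sp = _ratio(tn, tn + fp)
    acc = _ratio(tp + tn, tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)
    return {"Se": se, "Sp": sp, "Acc": acc, "MCC": mcc}
```

With illustrative counts of 80 true positives, 10 false positives, 90 true negatives, and 20 false negatives, this gives Se = 0.80, Sp = 0.90, Acc = 0.85, and an MCC of about 0.70. Unlike accuracy, MCC stays informative when true TISs are a minority of the candidate ATGs, which is one reason to track it alongside Se and Sp.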
Classification Results with 20 Top-ranked Features for Three Different Feature Selection Methods and Three Different Classification Methods
| Feature selection method | Classification method | Training Se | Training Sp | Training Acc | Training MCC | Testing Se | Testing Sp | Testing Acc | Testing MCC |
|---|---|---|---|---|---|---|---|---|---|
| Relief | Decision tree | 83% | 92% | 90% | 0.74 | 81% | 91% | 89% | 0.71 |
| Relief | Naïve Bayes | 98% | 75% | 81% | 0.63 | 97% | 75% | 81% | 0.63 |
| Relief | SVM | 75% | 93% | 89% | 0.69 | 75% | 93% | 89% | 0.69 |
| Chi2 | Decision tree | 70% | 92% | 86% | 0.63 | 66% | 90% | 84% | 0.56 |
| Chi2 | Naïve Bayes | 86% | 80% | 81% | 0.59 | 86% | 80% | 81% | 0.59 |
| Chi2 | SVM | 57% | 90% | 82% | 0.50 | 57% | 90% | 82% | 0.50 |
| Information gain | Decision tree | 66% | 92% | 85% | 0.59 | 65% | 91% | 85% | 0.58 |
| Information gain | Naïve Bayes | 97% | 70% | 77% | 0.58 | 97% | 70% | 77% | 0.58 |
| Information gain | SVM | 68% | 90% | 85% | 0.58 | 68% | 90% | 85% | 0.58 |
Classification Results with 100 Top-ranked Features for Three Different Feature Selection Methods and Three Different Classification Methods
| Feature selection method | Classification method | Training Se | Training Sp | Training Acc | Training MCC | Testing Se | Testing Sp | Testing Acc | Testing MCC |
|---|---|---|---|---|---|---|---|---|---|
| Relief | Decision tree | 98% | 98% | 98% | 0.95 | 95% | 97% | 97% | 0.91 |
| Relief | Naïve Bayes | 99% | 80% | 85% | 0.70 | 99% | 80% | 85% | 0.70 |
| Relief | SVM | 96% | 97% | 97% | 0.92 | 95% | 97% | 97% | 0.91 |
| Chi2 | Decision tree | 96% | 98% | 97% | 0.93 | 72% | 90% | 86% | 0.61 |
| Chi2 | Naïve Bayes | 100% | 76% | 82% | 0.66 | 100% | 76% | 82% | 0.66 |
| Chi2 | SVM | 84% | 93% | 91% | 0.75 | 82% | 92% | 90% | 0.73 |
| Information gain | Decision tree | 94% | 98% | 97% | 0.91 | 76% | 92% | 88% | 0.68 |
| Information gain | Naïve Bayes | 99% | 77% | 83% | 0.67 | 99% | 77% | 83% | 0.67 |
| Information gain | SVM | 84% | 93% | 91% | 0.76 | 83% | 93% | 90% | 0.75 |
Comparison of the Results from Different Methods
| Method | Se | Sp | Acc | MCC |
|---|---|---|---|---|
| Neural network | 82.4% | 64.5% | 84.6% | 0.627 |
| Salzberg method | 68.1% | 73.7% | 86.2% | 0.619 |
| SVM Salzberg kernel | 78.4% | 76.0% | 88.6% | 0.696 |
| SVM edit kernel III ASCM250 | 99.8% | 99.9% | 99.9% | 0.997 |
| Decision tree (20 features from Relief) | 81% | 91% | 89% | 0.71 |
| Decision tree and SVM (100 features from Relief) | 95% | 97% | 97% | 0.91 |
The results from Zien et al.
The results from Li and Jiang.
Results on Sequence Sets 2 and 3 with 100 Top-ranked Features
| Feature selection method | Classification method | Set 2 Se | Set 2 Sp | Set 2 Acc | Set 2 MCC | Set 3 Se | Set 3 Sp | Set 3 Acc | Set 3 MCC |
|---|---|---|---|---|---|---|---|---|---|
| Relief | Decision tree | 97% | 76% | 77% | 0.30 | 80% | 78% | 78% | 0.30 |
| Relief | Naïve Bayes | 100% | 46% | 47% | 0.17 | 95% | 45% | 48% | 0.18 |
| Relief | SVM | 98% | 76% | 76% | 0.30 | 81% | 78% | 78% | 0.30 |
| Chi2 | Decision tree | 82% | 47% | 48% | 0.11 | 71% | 54% | 55% | 0.12 |
| Chi2 | Naïve Bayes | 99% | 8% | 11% | 0.05 | 90% | 23% | 26% | 0.07 |
| Chi2 | SVM | 91% | 52% | 53% | 0.16 | 81% | 57% | 58% | 0.17 |
| Information gain | Decision tree | 88% | 59% | 60% | 0.17 | 80% | 64% | 65% | 0.20 |
| Information gain | Naïve Bayes | 100% | 10% | 13% | 0.06 | 90% | 24% | 27% | 0.08 |
| Information gain | SVM | 92% | 63% | 64% | 0.20 | 86% | 68% | 69% | 0.25 |