Bo Xie, Boris R Jankovic, Vladimir B Bajic, Le Song, Xin Gao.
Abstract
MOTIVATION: Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA. Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge.
Year: 2013 PMID: 23813000 PMCID: PMC3694652 DOI: 10.1093/bioinformatics/btt218
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Comparison of the error rates of our method (HMM) with PPK, SPE and WD
| Variant | Size | Error rate (%) PPK | Error rate (%) SPE | Error rate (%) WD | Error rate (%) HMM | Error rate (%) Rel | False-negative rate (%) PPK | False-negative rate (%) SPE | False-negative rate (%) WD | False-negative rate (%) HMM | False-negative rate (%) Rel | False-positive rate (%) PPK | False-positive rate (%) SPE | False-positive rate (%) WD | False-positive rate (%) HMM | False-positive rate (%) Rel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 5190 | — | 23.08 | 23.72 | | 19.45 | — | 21.93 | 23.70 | | 15.47 | — | 24.24 | 23.74 | | 23.05 |
| | 2400 | 27.13 | 20.17 | 18.29 | | 19.63 | 32.50 | 22.83 | 21.50 | | 20.44 | 21.75 | 17.50 | 15.08 | | 18.57 |
| | 1250 | 31.28 | 14.72 | 16.72 | | 36.41 | 37.12 | 14.08 | 19.68 | | 19.32 | 25.44 | 15.36 | 13.76 | | 52.08 |
| | 1230 | 15.04 | 13.25 | 7.80 | | 58.90 | 25.20 | 8.94 | 8.46 | | 32.73 | 4.88 | 17.56 | 7.15 | | 72.22 |
| | 880 | 31.48 | 18.98 | 23.18 | | 19.16 | 35.91 | 19.55 | 30.68 | | 2.33 | 27.05 | 18.41 | 15.68 | | 37.04 |
| | 780 | 29.87 | 16.28 | 18.46 | | 31.50 | 34.36 | 22.31 | 21.54 | | 29.89 | 25.38 | 10.26 | 15.38 | | 35.00 |
| | 690 | 40.72 | 24.35 | 30.29 | | 30.36 | 43.48 | 28.41 | 39.42 | | 29.59 | 37.97 | 20.29 | 21.16 | | 31.43 |
| | 670 | 31.19 | 20.90 | 23.88 | | 31.43 | 33.73 | 30.75 | 25.67 | | 33.01 | 28.66 | 11.04 | 22.09 | | 27.03 |
| | 460 | 25.43 | 17.39 | 14.13 | | 45.00 | 35.22 | 21.74 | 16.96 | | 52.00 | 15.65 | 13.04 | 11.30 | | 33.33 |
| | 410 | 29.51 | 15.85 | 18.78 | | 41.54 | 31.22 | 23.90 | 25.85 | | 38.78 | 27.80 | 7.80 | 11.71 | | 50.00 |
| | 410 | 32.68 | 18.78 | 22.20 | | 32.47 | 40.98 | 22.93 | 27.80 | | 10.64 | 24.39 | 14.63 | 16.59 | | 66.67 |
| | 370 | 24.05 | 8.11 | 14.86 | | 36.67 | 22.16 | 6.49 | 9.73 | | 0.00 | 25.95 | 9.73 | 20.00 | | 61.11 |
| Average | — | — | 19.56 | 20.22 | | 28.09 | — | 20.60 | 22.47 | | 20.75 | — | 18.52 | 17.96 | | 34.17 |
‘Average’ denotes the weighted average of the corresponding column. ‘Size’ denotes the number of samples for the corresponding motif variant. ‘Error rate’ is the proportion of false results in the dataset, which equals one minus accuracy. ‘False-negative rate’ is the proportion of true poly(A) motifs that are predicted to be false, which equals one minus sensitivity. ‘False-positive rate’ is the proportion of false poly(A) motifs that are predicted to be true, which equals one minus specificity. ‘Rel’ denotes the relative improvement of HMM with respect to SPE. The lowest error rate for each motif variant is indicated in bold. PPK could not finish running within 48 h on AATAAA.
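The definitions in this note can be written out as a minimal Python sketch. The confusion-matrix counts and all function names below are hypothetical, and the formula for ‘Rel’, (baseline − ours) / baseline, is an assumption: the note only calls it the relative improvement of HMM with respect to SPE.

```python
def error_rates(tp, fn, tn, fp):
    """Return (error rate, false-negative rate, false-positive rate) in percent.

    tp/fn count true poly(A) motifs predicted true/false;
    tn/fp count false poly(A) motifs predicted false/true.
    """
    total = tp + fn + tn + fp
    error = 100.0 * (fn + fp) / total   # one minus accuracy
    fnr = 100.0 * fn / (tp + fn)        # one minus sensitivity
    fpr = 100.0 * fp / (tn + fp)        # one minus specificity
    return error, fnr, fpr


def relative_improvement(baseline, ours):
    """Assumed form of 'Rel': percentage reduction relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline


def weighted_average(values, sizes):
    """Average of per-variant values, weighted by the number of samples."""
    return sum(v * s for v, s in zip(values, sizes)) / sum(sizes)


# Hypothetical counts, not taken from the paper.
print(error_rates(tp=80, fn=20, tn=90, fp=10))
print(relative_improvement(baseline=20.0, ours=15.0))
print(weighted_average([10.0, 20.0], [300, 100]))
```

A negative value from `relative_improvement` would mean the compared method has a higher rate than the baseline.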
Runtime comparisons on two variants AATAAA and ATTAAA for one train/test split, with k = 3 and all other parameters set to optimal
| Time (s) | ||||||||
|---|---|---|---|---|---|---|---|---|
| PPK | SPE | WD | HMM | PPK | SPE | WD | HMM | |
| Training | — | 46.16 | 37.38 | 2722.81 | 9.46 | 6.47 | ||
| Testing | — | 6.81 | 1.43 | 674.08 | 1.54 | 0.69 | ||
Note: PPK could not finish running within 48 h on AATAAA. The values in bold indicate better results.
Comparison of our method (HMM) with RF
| Variant | Size | Error rate (%) RF | Error rate (%) HMM | Error rate (%) Rel | False-negative rate (%) RF | False-negative rate (%) HMM | False-negative rate (%) Rel | False-positive rate (%) RF | False-positive rate (%) HMM | False-positive rate (%) Rel |
|---|---|---|---|---|---|---|---|---|---|---|
| | 5190 | 20.06 | | 7.31 | 19.74 | | 6.10 | 20.37 | | 8.44 |
| | 2400 | 18.42 | | 12.01 | 18.68 | | 2.75 | 18.15 | | 21.49 |
| | 1250 | 16.64 | | 43.75 | 16.53 | | 31.28 | 16.75 | | 56.06 |
| | 1230 | 11.06 | | 50.75 | 11.92 | | 49.53 | 10.15 | | 51.94 |
| | 880 | 19.55 | | 21.53 | | 19.09 | −5.47 | 20.87 | | 44.46 |
| | 780 | 19.36 | | 42.39 | 18.13 | | 13.73 | 20.49 | | 67.46 |
| | 690 | 27.83 | | 39.07 | 25.24 | | 20.76 | 29.92 | | 53.50 |
| | 670 | 22.09 | | 35.14 | 20.69 | | 0.45 | 23.36 | | 65.50 |
| | 460 | 20.00 | | 52.17 | 21.01 | | 50.33 | 18.92 | | 54.04 |
| | 410 | 18.54 | | 50.01 | 16.92 | | 13.51 | 20.00 | | 80.49 |
| | 410 | 24.88 | | 49.02 | 24.12 | | 15.06 | 25.59 | | 80.94 |
| | 370 | 18.38 | | 72.06 | 19.37 | | 66.51 | 17.32 | | 78.15 |
| Average | — | 19.19 | | 25.62 | 18.83 | | 14.81 | 19.48 | | 35.40 |
Note: The performance of both RF and HMM is evaluated on the same 5-fold cross-validation. ‘Rel’ denotes the relative improvement of HMM with respect to RF. The lowest value for each criterion of each motif variant is indicated in bold.
Fig. 1. Visualization of the importance of different dimers at different positions for the 12 variants of human poly(A) motifs. The x-axis gives the positions in the sequence. The y-axis lists all 16 possible dimers. The colors denote the levels of importance: the light green color for the positions 0–6 is the background color, which indicates that no effects differentiate true and false motifs; the darker the red, the more important the dimer at that position is to identifying true motifs; the darker the blue, the more important the dimer at that position is to identifying false motifs.
Fig. 2. Visualization of the importance of different positions for the 12 motif variants. The x-axis gives the position in the sequence. For each k from 1 to 5, the y-axis is the importance score of a position by summing over the absolute values of the importance for all possible k-mers at that position.
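The position score described for Fig. 2 can be illustrated with a small Python sketch. It assumes the per-k-mer importance is available as one signed score per position, with positive values favoring true motifs and negative values favoring false motifs as in the Fig. 1 color scale. The data layout, the function name and the toy numbers are hypothetical and do not come from the paper.

```python
from itertools import product


def position_importance(kmer_importance):
    """Sum |importance| over all k-mers at each position.

    kmer_importance maps each k-mer to a list of signed scores,
    one per sequence position (all lists have the same length).
    """
    n_positions = len(next(iter(kmer_importance.values())))
    return [
        sum(abs(scores[p]) for scores in kmer_importance.values())
        for p in range(n_positions)
    ]


# Toy example for k = 2: all 16 dimers over three positions.
dimers = ["".join(pair) for pair in product("ACGT", repeat=2)]
toy = {d: [0.0, 0.0, 0.0] for d in dimers}
toy["AA"] = [0.5, -1.0, 0.0]    # hypothetical signed scores
toy["GT"] = [-0.25, 0.5, 0.0]

print(position_importance(toy))  # [0.75, 1.5, 0.0]
```

Taking absolute values means a position scores highly whether its k-mers point toward true or false motifs, so the result measures how informative the position is rather than the direction of its effect.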