Eric S Ho, Samuel I Gunderson, Siobain Duffy.
Abstract
BACKGROUND: Polyadenylation is present in all three domains of life, making it the most conserved post-transcriptional process compared with splicing and 5'-capping. Most mammalian poly(A) sites contain a highly conserved hexanucleotide in the upstream region and a far less conserved U/GU-rich sequence in the downstream region, but there are many exceptions. Furthermore, poly(A) sites in other species, such as plants and invertebrates, deviate substantially from this genomic structure, which makes constructing a general poly(A) site recognition model challenging. We surveyed nine poly(A) site prediction methods published between 1999 and 2011. All of them exploit the skewed nucleotide profile across poly(A) sites and the highly conserved poly(A) signal as the primary features for recognition. These methods typically use a large number of features, which inflates model dimensionality to a crippling degree, and they are typically not validated against many kinds of genomes.
Year: 2013 PMID: 23368518 PMCID: PMC3549828 DOI: 10.1186/1471-2105-14-S2-S9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Dimer PCA of real poly(A) sequences. Blue dots represent positions, and red arrows represent k-mers. Only arrows longer than a specified threshold are labeled. DUR, PUR, CS, and DR denote distal upstream region, proximal upstream region, cleavage site, and downstream region, respectively. A) Arabidopsis, B) human.
Figure 2. Predicted nucleosome occupancy in the region 300 nt upstream and downstream of poly(A) sites. The vertical axis is the predicted probability. Red and blue lines represent the average probability of occupancy for real poly(A) sites and for second-order Markov sequences (i.e., false sequences), respectively. The shaded region denotes the 95% confidence interval. A) Arabidopsis, B) human.
Multispecies poly(A) signal predictions are assessed by sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC) in seven diverse species.
| Species | LDA Sn | LDA Sp | LDA MCC | LR Sn | LR Sp | LR MCC |
|---|---|---|---|---|---|---|
| Human | 85 | 93 | 0.78 | 86 | 90 | 0.77 |
| Mouse | 91 | 96 | 0.87 | 92 | 94 | 0.86 |
| Chicken | 87 | 91 | 0.78 | 87 | 90 | 0.77 |
| C. elegans | 85 | 89 | 0.74 | 90 | 85 | 0.75 |
| Oryza sativa | 88 | 89 | 0.77 | 88 | 89 | 0.77 |
| Arabidopsis | 92 | 92 | 0.84 | 91 | 91 | 0.82 |
| S. lycopersicum | 91 | 92 | 0.83 | 90 | 91 | 0.81 |
Sn and Sp are expressed in percentages.
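For readers who want to relate the three columns to each other, here is a minimal sketch that computes Sn, Sp, and MCC from a confusion matrix. It assumes the common definitions Sn = TP/(TP+FN) and Sp = TN/(TN+FP); the record does not spell out its formulas, and some sequence-analysis papers define specificity differently. The function name and the counts in the usage lines are hypothetical.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts.

    Assumed definitions: Sn = TP/(TP+FN), Sp = TN/(TN+FP).
    MCC is set to 0.0 when its denominator is zero.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc


# Hypothetical counts: 100 real and 100 false sequences, chosen so that
# Sn = 85% and Sp = 93% as in the human LDA row.
sn, sp, mcc = classification_metrics(tp=85, fp=7, tn=93, fn=15)
print(round(sn * 100), round(sp * 100), round(mcc, 2))  # 85 93 0.78
```

Under the assumption of equally sized real and false sets, these counts give an MCC that rounds to the tabulated 0.78; the actual test-set sizes are not given in this record.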
Comparing LDA and LR methods with polya_svm and PolyA-EP.
| Species | Method | Sn | Sp | MCC |
|---|---|---|---|---|
| Human | LR | 86 | 90 | 0.77 |
| Human | LDA | 85 | 93 | 0.78 |
| Human | polya_svm | 84 | 89 | 0.73 |
| Arabidopsis | LR | 91 | 91 | 0.82 |
| Arabidopsis | LDA | 92 | 92 | 0.84 |
| Arabidopsis | PolyA-EP | 95 | 41 | 0.43 |
| Arabidopsis | PAC | - | - | 0.65-0.70 |
Sn and Sp are expressed in percentages. The MCC of PAC is from Figure 3 of [20].
Phylogenetic distances between species based on A) 18S, B) GAPDH protein, and C) CPSF3.
| Species1 | Species2 | 18S | GAPDH | CPSF3 | rSn |
|---|---|---|---|---|---|
| mouse | human | 0.008099 | 0.083105 | 0.01452 | 88.5 |
| chicken | human | 0.034382 | 0.080086 | 0.052188 | 85.0 |
| c.elegans | human | 0.336445 | 0.309192 | 0.642712 | 68.5 |
| O.sativa | human | 0.245416 | 0.3565 | 0.674258 | 61.0 |
| Arabidopsis | human | 0.240724 | 0.3788 | 0.678983 | 65.5 |
| S.lycopersicum | human | 0.232462 | 0.390498 | 0.838763 | 58.0 |
| chicken | mouse | 0.035031 | 0.098663 | 0.061379 | 86.0 |
| c.elegans | mouse | 0.336259 | 0.318807 | 0.649077 | 67.0 |
| O.sativa | mouse | 0.242306 | 0.376424 | 0.677389 | 60.5 |
| Arabidopsis | mouse | 0.24307 | 0.40044 | 0.683073 | 55.5 |
| S.lycopersicum | mouse | 0.233802 | 0.390069 | 0.843015 | 58.0 |
| c.elegans | chicken | 0.331649 | 0.26794 | 0.63646 | 67.5 |
| O.sativa | chicken | 0.242221 | 0.338926 | 0.688556 | 61.5 |
| Arabidopsis | chicken | 0.241121 | 0.353368 | 0.685943 | 62.0 |
| S.lycopersicum | chicken | 0.231997 | 0.36337 | 0.843798 | 54.0 |
| O.sativa | c.elegans | 0.392745 | 0.388677 | 0.848847 | 65.0 |
| Arabidopsis | c.elegans | 0.3879 | 0.40471 | 0.885896 | 70.5 |
| S.lycopersicum | c.elegans | 0.377503 | 0.387265 | 0.997973 | 64.0 |
| Arabidopsis | O.sativa | 0.054837 | 0.260299 | 0.245492 | 63.5 |
| S.lycopersicum | O.sativa | 0.045931 | 0.273969 | 0.346818 | 68.5 |
| S.lycopersicum | Arabidopsis | 0.031789 | 0.218126 | 0.31507 | 65.5 |
Distances are in substitutions per site. The last column, rSn, is defined in the Materials and methods section.
Cross species predictions measured by sensitivity.
| Model (rows) / poly(A) sites (columns) | Human | Mouse | Chicken | C. elegans | Oryza sativa | Arabidopsis | S. lycopersicum |
|---|---|---|---|---|---|---|---|
| Human | | | | | 58 | 86 | 90 |
| Mouse | | | | | 60 | 75 | 90 |
| Chicken | | | | | 53 | 78 | 81 |
| C. elegans | | | | | 60 | 85 | 83 |
| Oryza sativa | 64 | 61 | 70 | 70 | | | |
| Arabidopsis | 45 | 36 | 46 | 56 | | | |
| S. lycopersicum | 26 | 26 | 27 | 45 | | | |
Sensitivity is calculated by using the model from one species (indicated in the leftmost column) to predict real and false poly(A) sequences from other species (top row). The table need not be symmetrical because poly(A) sites from different species tend to possess different characteristics, according to the PCA profiles discussed above.
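The rSn column of the phylogenetic distance table is defined in the Materials and methods section, which is not reproduced here. As a hedged sketch, the Python below tests one plausible reading: that rSn is the mean of the two reciprocal cross-species sensitivities. It also assumes that the values in each animal-model row belong to the plant columns and the values in each plant-model row to the animal columns. Both the function name and these assumptions are illustrative, not taken from the source.

```python
# (model species, target species) -> sensitivity in %, copied from the
# cross-species table under the column assignment assumed above.
sens = {
    ("human", "O.sativa"): 58, ("O.sativa", "human"): 64,
    ("human", "Arabidopsis"): 86, ("Arabidopsis", "human"): 45,
    ("mouse", "S.lycopersicum"): 90, ("S.lycopersicum", "mouse"): 26,
    ("c.elegans", "Arabidopsis"): 85, ("Arabidopsis", "c.elegans"): 56,
}


def reciprocal_sensitivity(a, b, table):
    """Assumed rSn: mean of model-a-on-b and model-b-on-a sensitivities."""
    return (table[(a, b)] + table[(b, a)]) / 2


for pair in [("human", "O.sativa"), ("human", "Arabidopsis"),
             ("mouse", "S.lycopersicum"), ("c.elegans", "Arabidopsis")]:
    print(pair, reciprocal_sensitivity(*pair, sens))
```

For these four pairs the sketch yields 61.0, 65.5, 58.0, and 70.5, which agree with the rSn values listed for the same species pairs, so the averaging reading is at least consistent with the tabulated numbers.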