Olivier Andrieu, Anna-Sophie Fiston, Dominique Anxolabéhère, Hadi Quesneville.
Abstract
BACKGROUND: Transposable elements (TEs) are mobile genetic entities present in nearly all genomes. Previous work has shown that TEs tend to differ from host genes in nucleotide composition, whether measured by codon usage bias or by dinucleotide frequencies. We show here how these compositional differences can be used as a tool for the detection and analysis of TE sequences.
Year: 2004 PMID: 15251040 PMCID: PMC497039 DOI: 10.1186/1471-2105-5-94
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Sensitivity and specificity of sequence class prediction.
| gene | 93.80% | 84.50% |
| class I TE | 64.29% | 97.67% |
| class II TE | 93.33% | 94.78% |
| gene | 83.20% | 70.00% |
| class I TE | 88.89% | 99.41% |
| class II TE | 45.45% | 83.89% |
| gene | 87.20% | 86.87% |
| class I TE | 68.75% | 92.29% |
| class II TE | 84.21% | 93.28% |
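As a reading aid for the table above, here is a minimal sketch of how sensitivity and specificity can be computed from prediction counts. The record does not say which definition of specificity was used, so both common conventions are shown: the classical one (true negatives over all negatives) and the one often used in gene-finding work (true positives over all positive predictions). The function names and the counts are hypothetical and are not taken from the paper.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true members of a class that were predicted as that class."""
    return tp / (tp + fn)


def specificity_classical(tn: int, fp: int) -> float:
    """Classical specificity: fraction of non-members correctly rejected."""
    return tn / (tn + fp)


def specificity_gene_finding(tp: int, fp: int) -> float:
    """Gene-finding convention: fraction of positive predictions that are correct."""
    return tp / (tp + fp)


# Hypothetical counts for one sequence class (illustration only).
tp, fn, fp, tn = 45, 5, 10, 140

print(f"sensitivity                = {sensitivity(tp, fn):.2%}")
print(f"specificity (classical)    = {specificity_classical(tn, fp):.2%}")
print(f"specificity (gene-finding) = {specificity_gene_finding(tp, fp):.2%}")
```

The two specificity conventions can differ substantially when one class is much rarer than the others, so it is worth checking the paper's methods before comparing these figures with other tools.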
Interspecific tests. Predictions (as a %) of TE sequences for one species obtained using the host gene model of the same species versus the value found using the TE models of the other two species (counts are given in parentheses).
| 39% (28) | 34% (24) | 28% (19) |
| 75% (15) | 20% (4) | 5% (1) |
| 87% (86) | 9% (9) | 4% (4) |
Class I vs genes: The four most discriminating dinucleotides when comparing the HMM trained on class I TEs and the HMM trained on genes for each state of the HMM. E1, E2, E3 are the three coding states; I is the non-coding state. Log2 of the frequency ratio is given in parentheses. A value of +1 (resp. -1) thus indicates a twofold increase (resp. decrease) of the frequency in the TE model.
| CC (-0.32) | AC (0.20) | TC (0.17) | |
| CA (0.28) | TA (0.26) | AC (0.17) | TC (-0.14) |
| GC (-0.30) | TC (-0.23) | CC (-0.18) | |
| CG (2.23) | CC (1.76) | TT (-1.62) | TA (-1.35) |
| TG (-1.71) | CT (1.12) | CC (0.96) | AG (-0.85) |
| GA (-1.66) | GG (-1.17) | TA (0.80) | TC (0.73) |
| TA (1.03) | CC (0.90) | TT (-0.68) | AT (-0.67) |
| TT (-1.73) | CG (1.22) | TA (-0.87) | CC (0.87) |
| AG (0.89) | AA (0.87) | TA (0.66) | GG (-0.58) |
| AG (0.59) | TG (0.38) | GA (-0.25) | GG (-0.24) |
| TA (1.06) | CA (0.73) | GG (0.72) | TG (-0.68) |
| GG (-1.03) | GC (-0.56) | AA (0.55) | CG (-0.51) |
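The values in parentheses above are log2 frequency ratios between two models. The following minimal sketch shows how such ratios can be computed and ranked. It assumes that "most discriminating" means largest absolute log2 ratio, which matches the ordering within the rows but is not stated in the legend. The frequencies are hypothetical stand-ins for the per-state HMM emission probabilities, and the function names are illustrative.

```python
import math


def log2_ratios(te_freqs: dict[str, float], gene_freqs: dict[str, float]) -> dict[str, float]:
    """log2(TE frequency / gene frequency) for dinucleotides present in both models."""
    return {
        dn: math.log2(te_freqs[dn] / gene_freqs[dn])
        for dn in te_freqs
        if gene_freqs.get(dn, 0.0) > 0.0 and te_freqs[dn] > 0.0
    }


def most_discriminating(ratios: dict[str, float], n: int = 4) -> list[tuple[str, float]]:
    """Top-n dinucleotides ranked by absolute log2 ratio (assumed ranking rule)."""
    return sorted(ratios.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]


# Hypothetical frequencies for a single HMM state (illustration only).
te_model = {"CG": 0.08, "TT": 0.02, "AA": 0.05, "GC": 0.06}
gene_model = {"CG": 0.04, "TT": 0.04, "AA": 0.05, "GC": 0.06}

for dinucleotide, value in most_discriminating(log2_ratios(te_model, gene_model), n=2):
    print(f"{dinucleotide} ({value:+.2f})")
```

With these made-up inputs, CG is twice as frequent in the TE model (log2 ratio +1) and TT is half as frequent (log2 ratio -1), which is the reading the legend describes.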
Class II vs genes (cf. legend of table 3).
| CT (-0.51) | CC (-0.48) | CA (-0.29) | |
| CT (-0.59) | TA (0.35) | TC (-0.28) | TG (0.27) |
| TC (-0.72) | CC (-0.58) | AT (0.38) | |
| CC (-1.06) | CA (-0.48) | GC (-0.41) | GT (0.38) |
| GT (0.42) | GC (0.41) | CC (0.35) | TG (-0.35) |
| TG (0.43) | AG (0.38) | TA (0.33) | |
| GG (0.67) | GC (0.57) | CA (-0.36) | |
| AA (-0.26) | CG (0.22) | TC (0.15) | AC (0.13) |
| AA (1.59) | TA (1.32) | GG (-1.28) | AT (1.24) |
| GG (-0.84) | TT (0.67) | AA (0.64) | CT (-0.62) |
| TA (1.47) | CC (-1.34) | AA (1.28) | TT (1.01) |
| GC (-1.08) | CC (-1.00) | TC (-0.76) | |
Class I vs Class II (cf. legend of table 3).
| CT (0.46) | CA (0.36) | AT (-0.28) | AC (0.22) |
| CT (0.52) | TG (-0.30) | GA (-0.20) | AG (-0.16) |
| TC (0.50) | CC (0.39) | AT (-0.24) | GG (-0.23) |
| CC (2.81) | TT (-1.94) | GC (1.56) | |
| TG (-1.37) | CT (0.97) | GG (-0.91) | |
| GA (-1.62) | GG (-0.95) | GT (-0.71) | TC (0.69) |
| TA (1.01) | CC (0.73) | GT (-0.66) | AT (-0.58) |
| TT (-1.75) | GG (0.94) | CC (0.87) | |
| TT (-1.28) | CC (0.91) | AA (-0.72) | AT (-0.70) |
| TT (-0.76) | GC (0.69) | CT (0.65) | GG (0.60) |
| CC (0.89) | AT (-0.82) | TT (-0.79) | GG (0.75) |
| CA (0.87) | AC (0.85) | TT (-0.79) | CC (0.78) |
Figure 1. Specificity and sensitivity of sequence predictions with varying minimal length for: (a) Drosophila melanogaster TEs, (b) Arabidopsis thaliana TEs, (c) Caenorhabditis elegans TEs. The unbroken line indicates the sensitivity, the dotted line the specificity.
Quality of coding region prediction
| TE class II | 97% | 85% |
| TE class I | 95% | 51% |
| TE class I (alt. model) | 75% | 80% |
Figure 2. HMM structures. States are depicted by circles and transitions by arrows. The thickness of the arrows indicates the probability of the transition. The initial transition probabilities used for starting the estimation algorithm are 0.995 for thick arrows, the remaining 0.005 being equally divided between the thin arrows. Three different HMM structures were used: (a) 3 coding states, 1 non-coding state; (b) 3 coding states, 2 non-coding states (intronic and non-intronic); (c) 3 coding states, 3 non-coding states. In (b) a path in the HMM starts and ends in the T state; in (a) and (c) a path can begin and end in any state.
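The initialization rule in the Figure 2 legend (0.995 on the thick arrow, the remaining 0.005 shared equally among the thin arrows) can be sketched as follows. The topology used here for structure (a) is an assumption made for illustration only, since the figure itself is not reproduced in this record: the coding states are assumed to cycle E1, E2, E3 with thick arrows, the non-coding state I is assumed to loop on itself with a thick arrow, and the thin arrows are hypothetical. The function name `initial_transitions` is likewise illustrative.

```python
def initial_transitions(states, thick, thin, p_thick=0.995):
    """Build an initial transition matrix as nested dicts.

    thick maps each state to the single successor reached by a thick arrow.
    thin maps each state to the list of successors reached by thin arrows.
    """
    matrix = {}
    for source in states:
        row = {target: 0.0 for target in states}
        thin_targets = thin.get(source, [])
        if thin_targets:
            row[thick[source]] = p_thick
            share = (1.0 - p_thick) / len(thin_targets)
            for target in thin_targets:
                row[target] = share
        else:
            # No thin arrows leave this state: all mass goes to the thick arrow.
            row[thick[source]] = 1.0
        matrix[source] = row
    return matrix


# ASSUMED topology for structure (a); not taken from the figure.
states = ["E1", "E2", "E3", "I"]
thick = {"E1": "E2", "E2": "E3", "E3": "E1", "I": "I"}
thin = {"E1": ["I"], "E2": ["I"], "E3": ["I"], "I": ["E1", "E2", "E3"]}

matrix = initial_transitions(states, thick, thin)
for source in states:
    print(source, {t: round(p, 5) for t, p in matrix[source].items() if p > 0})
```

Starting the estimation from values close to, but not exactly, 0 and 1 keeps every allowed transition reachable during re-estimation while still encoding the intended structure.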
Sources of genomic sequences and annotations
| BDGP [22] | release 3 |
| TIGR [23] | release ATH1 |
| EnsEMBL/WormBase [24] | release WS85 |
Contents of TE training sets
| 56 | 2483 | 6411 | 10654 |
| 15 | 912 | 2167 | 4347 |
| 9 | 261 | 3259 | 4082 |
| 11 | 1242 | 5625 | 7227 |
| 78 | 330 | 4496 | 10633 |
| 19 | 265 | 4548 | 9233 |