Teresa M Creanza, David S Horner, Annarita D'Addabbo, Rosalia Maglietta, Flavio Mignone, Nicola Ancona, Graziano Pesole.
Abstract
BACKGROUND: The identification of protein coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding and non-coding conserved sequences, which makes a systematic statistical assessment of their differences necessary. A comprehensive study should comprise an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study in which the prediction accuracies of classifiers trained on single features and on groups of features are analyzed, conditional on the compared species and on the sequence lengths.
Year: 2009 PMID: 19534745 PMCID: PMC2697643 DOI: 10.1186/1471-2105-10-S6-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The reading frame conservation. This figure shows the reading frame conservation test of M. Kellis et al. (2004).
The Wilcoxon-Mann-Whitney test P-values
| (a) | | | | (b) | | | | (c) | | | |
| Feature | Mean (class 1) | Mean (class 2) | P-value | Feature | Mean (class 1) | Mean (class 2) | P-value | Feature | Mean (class 1) | Mean (class 2) | P-value |
| CG-Content | 0.514 | 0.407 | 0 | CG-Content | 0.517 | 0.419 | 0 | CG-Content | 0.517 | 0.420 | 0 |
| FFT | 0.017 | 0.007 | 0 | FFT | 0.016 | 0.007 | 0 | FFT | 0.016 | 0.007 | 0 |
| % | 0.086 | 2.846 | 0 | % | 0.081 | 2.652 | 0 | % | 0.166 | 2.623 | 0 |
| | 0.263 | 0.150 | 0 | | 0.264 | 0.153 | 0 | | 0.265 | 0.154 | 0 |
| RFC H-M | 1.000 | 0.915 | 0 | RFC H-M | 1.000 | 0.915 | 0 | RFC H-R | 0.999 | 0.915 | 0 |
| | 7.393 | 5.763 | 0 | | 7.393 | 5.763 | 0 | | 7.405 | 5.699 | 0 |
| | 0.132 | 0.061 | 0 | | 0.132 | 0.061 | 0 | | 0.133 | 0.062 | 0 |
| | 0.938 | 0.758 | 0 | | 0.938 | 0.758 | 0 | | 0.939 | 0.760 | 0 |
| | 0.672 | 0.358 | 0 | | 0.672 | 0.358 | 0 | | 0.672 | 0.354 | 0 |
| RFC H-R | 0.999 | 0.915 | 0 | | 7.513 | 6.592 | 0 | | 7.513 | 6.592 | 0 |
| | 7.405 | 5.699 | 0 | | 0.073 | 0.048 | 0 | | 0.073 | 0.048 | 0 |
| | 0.133 | 0.062 | 0 | | 0.956 | 0.823 | 0 | | 0.956 | 0.823 | 0 |
| | 0.940 | 0.760 | 0 | | 0.747 | 0.526 | 0 | | 0.747 | 0.526 | 0 |
| | 0.672 | 0.354 | 0 | | 1.178 | 1.094 | 10^-275 | | 1.169 | 1.092 | 10^-220 |
| | 1.179 | 1.108 | 10^-198 | RFC M-R | 0.990 | 0.940 | 10^-205 | RFC M-R | 0.990 | 0.940 | 10^-205 |
| | 0.155 | 0.129 | 10^-46 | NPI M-R | 92.233 | 89.585 | 10^-51 | NPI M-R | 92.233 | 89.585 | 10^-51 |
| NPI H-M | 84.213 | 85.102 | 10^-14 | | 0.157 | 0.132 | 10^-39 | | 0.157 | 0.132 | 10^-40 |
| | 0.134 | 0.148 | 10^-7 | NPI H-M | 84.213 | 85.102 | 10^-15 | NPI H-R | 84.248 | 84.448 | 0.0002 |
| % | 2.264 | 2.257 | 10^-6 | | 0.136 | 0.145 | 0.005 | % | 2.126 | 2.163 | 0.0012 |
| NPI H-R | 84.248 | 84.450 | 0.0002 | % | 2.136 | 2.188 | 0.007 | | 0.135 | 0.146 | 0.0017 |
For each species, the first two columns show the mean values of each variable in the two classes and the last column shows the P-value of the Wilcoxon-Mann-Whitney test. The suffixes H-M, H-R and M-R indicate the species being compared. The features are ranked by increasing P-value. Note that all P-values are less than 0.0025 except for the CG-skewness and the stop codon spread %Stop for M. musculus.
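The association step summarized in this table (per-class means, a Wilcoxon-Mann-Whitney P-value per feature, ranking by increasing P-value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the P-value uses the large-sample normal approximation with tie correction and no continuity correction, the function names `average_ranks`, `mann_whitney_p` and `rank_features` are hypothetical, and the feature values are made-up toy numbers rather than data from the study.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_p(x, y):
    """Two-sided P-value of the Wilcoxon-Mann-Whitney test (normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    # Tie correction: sum of t^3 - t over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))

def rank_features(class_a, class_b):
    """Return (feature, mean in class_a, mean in class_b, P) sorted by increasing P."""
    rows = []
    for name in class_a:
        a, b = class_a[name], class_b[name]
        rows.append((name, sum(a) / len(a), sum(b) / len(b), mann_whitney_p(a, b)))
    return sorted(rows, key=lambda row: row[3])

# Toy values only (assumption): two hypothetical features measured in two classes.
coding = {"feat_a": [1, 2, 3, 4, 5], "feat_b": [1, 3, 5, 7, 9]}
noncoding = {"feat_a": [6, 7, 8, 9, 10], "feat_b": [2, 4, 6, 8, 10]}

for name, mean_a, mean_b, p in rank_features(coding, noncoding):
    print(f"{name}\t{mean_a:.3f}\t{mean_b:.3f}\t{p:.3g}")
```

With samples as large as those implied by P-values such as 10^-275, the normal approximation is the usual choice; for very small samples an exact test would be more appropriate.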
Figure 2. Learning curves for the intrinsic features. The plots show the error rate as a function of sequence length (in bp) for H. sapiens and for each intrinsic feature.
Figure 3. Learning curves for the comparative features. The plots show the error rate as a function of sequence length (in bp) for comparative features based on the H. sapiens versus M. musculus and on the M. musculus versus R. norvegicus comparisons, shown by blue and green lines, respectively.
Figure 4. The summary plot. The error bars represent the median and the 25% and 75% quantiles of the error rates for H. sapiens sequences and for their pairwise alignments; the red, blue, yellow and green bars refer to the four classes of ascending sequence length given in the legend.
Figure 5. Sensitivity and specificity. The plots refer to the H. sapiens sequences and their alignments with the rat and mouse genomes. On the right are the three plots of prediction accuracy as a function of sequence length for the combination of the comparative metrics only (a), of the intrinsic metrics only (c) and of all metrics (e); on the left are the respective plots (b), (d), (f) of sensitivity and specificity.
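The quantities plotted in the figures above (error rate, prediction accuracy, sensitivity and specificity) follow from the confusion counts of a two-class classifier. The sketch below is only an illustration under assumptions: it treats "coding" as the positive class, which the captions do not state, and the function name `classification_metrics` and the labels are hypothetical toy values.

```python
def classification_metrics(y_true, y_pred, positive="coding"):
    """Compute accuracy, error rate, sensitivity and specificity.

    Assumption: `positive` names the class treated as positive ("coding" here).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    nan = float("nan")
    accuracy = (tp + tn) / total if total else nan
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy if total else nan,
        # Sensitivity: fraction of true positives that are recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Specificity: fraction of true negatives that are recovered.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }

# Toy labels only (assumption): three coding and two non-coding sequences.
truth = ["coding", "coding", "coding", "noncoding", "noncoding"]
predicted = ["coding", "coding", "noncoding", "noncoding", "coding"]
print(classification_metrics(truth, predicted))
```

Computing these metrics separately for each sequence-length class would give the length-dependent curves that the captions describe.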