Chuanhua Xing, Donald L Bitzer, Winser E Alexander, Mladen A Vouk, Anne-Marie Stomp.
Abstract
We introduce a new approach for distinguishing protein-coding sequences from non-coding sequences using a period-3 free energy signal that arises from the interactions of the 3'-terminal nucleotides of the 18S rRNA with mRNA. We extracted the special features of the amplitude and the phase of the period-3 signal in protein-coding regions, which are not found in non-coding regions, and used them to distinguish protein-coding sequences from non-coding sequences. We tested the method on all the experimental genes from Saccharomyces cerevisiae and Schizosaccharomyces pombe. The identification was consistent with the corresponding information from GenBank and performed better than existing methods that use a period-3 signal. The primary tests on some fly, mouse and human genes suggest that our method is applicable to higher eukaryotic genes. The tests on pseudogenes indicated that most pseudogenes have no period-3 signal. Some exploration of the 3'-tail of 18S rRNA and pattern analysis of protein-coding sequences further supported our assumption that the 3'-tail of 18S rRNA plays a synchronization role throughout the translation elongation process. This, in turn, can be utilized for the identification of protein-coding sequences.
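The amplitude and phase of a period-3 signal mentioned in the abstract can be illustrated with a minimal sketch. It assumes the input is any per-nucleotide numeric series; the free energy values obtained from the 18S rRNA and mRNA interaction are not reproduced here, and the function name `period3_component` is hypothetical rather than taken from the paper.

```python
import cmath
import math


def period3_component(signal):
    """Return (amplitude, phase_degrees) of the period-3 Fourier component.

    `signal` is a sequence of numbers, one value per nucleotide position.
    This is an illustrative stand-in for a free energy signal, not the
    authors' implementation.
    """
    n = len(signal)
    if n == 0:
        raise ValueError("signal must not be empty")
    mean = sum(signal) / n
    # Project the mean-centred signal onto a complex exponential with period 3.
    acc = sum(
        (x - mean) * cmath.exp(-2j * math.pi * k / 3)
        for k, x in enumerate(signal)
    )
    amplitude = abs(acc) / n
    phase_degrees = math.degrees(cmath.phase(acc))
    return amplitude, phase_degrees


# A series that repeats every three positions has a clear period-3 component;
# a flat series has essentially none.
periodic_amp, periodic_phase = period3_component([1.0, 0.0, 0.0] * 30)
flat_amp, _ = period3_component([0.5] * 90)
```

In this toy setting, shifting the repeating pattern by one position changes the phase by 120 degrees while leaving the amplitude unchanged, which is why amplitude and phase can be treated as separate features.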
Year: 2008 PMID: 19073698 PMCID: PMC2632891 DOI: 10.1093/nar/gkn917
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The comparison of the polar plots for the protein-coding gene YRF1-3 and a randomly selected non-coding sequence.
Datasets for experiments
| Fly | Mouse | Human | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4670 | 5664 | 182 | 591 | 1997 | 4000 | 4000 | 4000 | 4000 | 4401 | 4000 | 4000 | 3576 |
The first row gives the species names, the second row the dataset names and the third row the number of sequences in each dataset.
Figure 2. The comparison of the histograms for the terminal phases of the protein-coding and non-protein-coding sequences. (a) The histogram of the terminal phases for the protein-coding sequences, where T1 and T2 mark the boundaries of the 95% CI and 99.9% CI. (b) The histogram of the terminal phases for the non-coding sequences.
Figure 3. The position boundaries versus the phase variations for three measures of phase variation.
Figure 4. Two groups of amplitude rate plots for the protein-coding and non-coding sequences.
Accuracy of the synchronization-based coding-region identification algorithm on different coding/non-coding subsets
| | Our work | | | | Gao† | | | | for Our work | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Se. (%) | | Sp. (%) | | | Se./Sp. (%) | Se./Sp. (%) | | Se. (%) | | Sp. (%) |
| 17 | 2670 | 92.38 | 2670 | 73.39 | – | – | – | – | 591 | 92.89 | 1997 | 75.79 |
| 50 | 2646 | 92.55 | 2396 | 78.27 | – | – | – | – | 587 | 93.19 | 1782 | 83.09 |
| 85 | 2579 | 93.15 | 1902 | 84.92 | 4067 | 4186 | 85.7 | 86.7 | 583 | 93.57 | 1601 | 87.73 |
| 171 | 2325 | 94.15 | 877 | 94.41 | 3756 | 1948 | 89.89 | 89.4 | 510 | 95.29 | 1168 | 92.00 |
| 200 | 2219 | 94.61 | 722 | 95.57 | – | – | – | – | 483 | 95.61 | 1021 | 92.81 |
| 342 | 1663 | 95.53 | 314 | 97.89 | 2674 | 650 | 95.4 | 94.4 | 349 | 96.28 | 554 | 96.93 |
| 440 | 1263 | 95.74 | 215 | 98.79 | – | – | – | – | 273 | 96.34 | 380 | 97.89 |
| 500 | 1054 | 96.14 | 178 | 99.1 | – | – | – | – | 225 | 97.11 | 306 | 99.02 |
| 800 | 430 | 97.35 | 83 | 99.32 | – | – | – | – | 78 | 98.72 | 121 | 100 |
†These are the results from Gao et al. (10).
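The Se. and Sp. columns above report sensitivity and specificity as percentages. This record does not state how they were computed, so the sketch below assumes the common definitions Se = TP / (TP + FN) over the coding set and Sp = TN / (TN + FP) over the non-coding set. Some gene-finding work instead defines Sp as TP / (TP + FP), so this is an assumption, not a confirmed detail of the paper. The function name and the example counts are hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (Se, Sp) as percentages from confusion-matrix counts.

    Assumed definitions (not confirmed by the source):
      Se = TP / (TP + FN)  -- fraction of coding sequences called coding
      Sp = TN / (TN + FP)  -- fraction of non-coding sequences called non-coding
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each class needs at least one sequence")
    se = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    return se, sp


# Hypothetical counts, chosen only to illustrate the arithmetic.
se, sp = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
print(f"Se = {se:.2f}%, Sp = {sp:.2f}%")
```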