Matthieu Legendre, Daniel Gautheret.
Abstract
BACKGROUND: Differential polyadenylation is a widespread mechanism in higher eukaryotes producing mRNAs with different 3' ends in different contexts. This involves several alternative polyadenylation sites in the 3' UTR, each with its specific strength. Here, we analyze the vicinity of human polyadenylation signals in search of patterns that would help discriminate strong and weak polyadenylation sites, or true sites from randomly occurring signals.
Year: 2003 PMID: 12600277 PMCID: PMC151664 DOI: 10.1186/1471-2164-4-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Schematic view of EST-based polyadenylation site identification. Each UTR is aligned onto the complete EST database. A poly(A) site is validated when at least two ESTs match this site while respecting specific length, position and quality criteria (see Methods). The -300/+300 nt fragment surrounding each site (here called "terminal sequence") is then extracted for further analysis. For sites located near the 3' end of the UTR, we use the corresponding genomic sequence to complete the terminal sequence.
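The extraction step described in this caption can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' pipeline: the function name `terminal_sequence` and its parameters are hypothetical, the site is given as a 0-based index, the window is taken as 300 nt on each side (the caption does not say whether the site base itself is counted separately), and `genomic_downstream` stands for the genomic bases that immediately follow the last base of the UTR.

```python
def terminal_sequence(transcript, site, genomic_downstream="", flank=300):
    """Return the -flank/+flank window around `site`, or None if it cannot be filled.

    transcript: mRNA/UTR sequence containing the site (hypothetical input).
    site: 0-based index of the poly(A) site within `transcript`.
    genomic_downstream: genomic bases following the last transcript base;
        only needed when the site lies near the 3' end (assumption).
    """
    # Append genomic sequence so that sites near the 3' end can be completed.
    extended = transcript + genomic_downstream
    start, end = site - flank, site + flank
    if start < 0 or end > len(extended):
        return None  # not enough sequence on one side of the site
    return extended[start:end]
```

With `flank=300` the returned fragment is 600 nt long; a site too close to the 3' end returns `None` unless enough genomic sequence is supplied.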
Figure 2. Nucleotide composition in terminal sequences. Position 0 corresponds to the 3' base of the polyadenylation signal. Nucleotide positions were averaged in a sliding 11 nt window.
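The 11 nt smoothing used for these composition profiles can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, the input sequences are assumed to have equal length and to be aligned on the polyadenylation signal, and positions without a full 11 nt window are simply dropped, since the caption does not say how edges were handled.

```python
def base_frequency(seqs, base="U"):
    """Per-position frequency of `base` across equal-length, aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned and of equal length")
    # Treat DNA-alphabet input (T) as RNA (U) so both notations work.
    norm = [s.upper().replace("T", "U") for s in seqs]
    return [sum(s[i] == base for s in norm) / len(norm) for i in range(length)]


def sliding_mean(values, window=11):
    """Mean over each full window; result[k] is centred on position k + window // 2."""
    if window % 2 == 0 or window > len(values):
        raise ValueError("window must be odd and no longer than the profile")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide by one position: add the entering value, drop the leaving one.
        running += values[i] - values[i - window]
        out.append(running / window)
    return out
```

A uracil profile for a set of terminal sequences would then be `sliding_mean(base_frequency(seqs, "U"))`.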
Figure 3. Uracil frequencies in an 11 nt window, in the vicinity of "strong" poly(A) sites (645 sequences), "weak" sites (1200 sequences), "unique" sites (3776 sequences) and controls (1249 sequences).
Figure 4. Uracil frequencies in an 11 nt window, in the vicinity of "control" sites and two types of "unique" poly(A) sites: those located less than 300 nt from the Stop codon (CDS overlap: 1328 sequences), and those located more than 300 nt from the Stop codon (no CDS overlap: 2448 sequences).
Figure 5. Uracil frequencies in an 11 nt window in the vicinity of alternative poly(A) sites, distinguishing proximal sites from distal sites. (a) "Strong" poly(A) sites (129 proximal, 499 distal); (b) "weak" poly(A) sites (655 proximal, 210 distal).
Table 2. Negative predictions and accuracy of the ERPIN and POLYADQ programs, evaluated on different control sequences not containing polyadenylation sites: coding sequences (CDS), introns, and two types of randomized UTR sequences (simple shuffling or first-order Markov simulation).
| Control set | Signals per 100 kb | Program | TN | FP | FP per 100 kb | SP | CC |
|---|---|---|---|---|---|---|---|
| CDS | 31.2 | Erpin | 880 | 102 | 3.7 | 84.33 % | 0.483 |
| | | Polyadq | 862 | 120 | 3.8 | 82.01 % | 0.459 |
| Introns | 156.4 | Erpin | 741 | 241 | 38.9 | 69.49 % | 0.320 |
| | | Polyadq | 718 | 264 | 42.0 | 67.45 % | 0.293 |
| UTR shuffled | 109.6 | Erpin | 888 | 94 | 11.0 | 85.38 % | 0.494 |
| | | Polyadq | 826 | 156 | 17.4 | 77.81 % | 0.415 |
| UTR Markov (first order) | 94.49 | Erpin | 772 | 210 | 21.9 | 72.33 % | 0.354 |
| | | Polyadq | 733 | 249 | 23.9 | 68.72 % | 0.309 |
See Methods for information on database construction. Each row shows the number of potential A(A/U)UAAA signals per 100 kb in the dataset, True Negatives (TN), False Positives (FP), False Positives per 100 kb, Specificity (SP) and Accuracy (CC). Calculation of CC uses TP and FN from Table 1.
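The "signals per 100 kb" figure can be approximated by scanning each control sequence for the A(A/U)UAAA hexamer. The following Python sketch is an illustration only: the function names are hypothetical, both DNA (T) and RNA (U) input are accepted, and overlapping hexamers are each counted, which may differ from the counting rule used in the paper.

```python
import re

# A(A/U)UAAA on the RNA alphabet; the lookahead makes overlapping hits count.
SIGNAL = re.compile(r"(?=A[AU]UAAA)")


def signal_positions(seq):
    """0-based start positions of every A(A/U)UAAA hexamer in `seq`."""
    rna = seq.upper().replace("T", "U")
    return [m.start() for m in SIGNAL.finditer(rna)]


def signals_per_100kb(seqs):
    """Signal density over a collection of sequences, scaled to 100 kb."""
    total_len = sum(len(s) for s in seqs)
    if total_len == 0:
        raise ValueError("no sequence to scan")
    hits = sum(len(signal_positions(s)) for s in seqs)
    return 100_000 * hits / total_len
```

This counts potential signals only; whether a given hit is predicted as a true site is decided by ERPIN or POLYADQ, not by this scan.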
Table 1. Measure of True Positives (TP), False Negatives (FN) and Sensitivity (SN) in the prediction of polyadenylation signals by the POLYADQ and ERPIN programs, based on a dataset of 982 annotated UTR sequences from the EMBL database. See Methods for information on database construction. ERPIN parameters were adjusted to match the sensitivity of POLYADQ.
| Program | TP | FN | SN |
|---|---|---|---|
| Erpin | 549 | 433 | 55.9 % |
| Polyadq | 547 | 435 | 55.7 % |
Table 3. Comparison of the accuracy of the ERPIN and POLYADQ programs for the prediction of EMBL annotated poly(A) sites and of the EST-derived weak poly(A) sites identified in this study. The negative set for SP and CC calculation is "UTR shuffled" from Table 2.
| Program | Dataset | SN | SP | CC |
|---|---|---|---|---|
| Erpin | EMBL annotated sites | 55.91 % | 85.38 % | 0.494 |
| Polyadq | EMBL annotated sites | 55.70 % | 77.81 % | 0.415 |
| Erpin | Weak sites | 31.28 % | 80.21 % | 0.262 |
| Polyadq | Weak sites | 30.54 % | 70.45 % | 0.171 |
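The SN, SP and CC columns can be recomputed from the TP, FN, TN and FP counts. The Python sketch below assumes SN = TP/(TP+FN), SP = TP/(TP+FP) and CC equal to the Matthews correlation coefficient. These definitions are not given in this excerpt (the text defers to Methods); they are adopted here because they reproduce the reported Erpin values for EMBL annotated sites against shuffled UTRs (TP 549, FN 433, TN 888, FP 94). The function names are hypothetical.

```python
from math import sqrt


def sensitivity(tp, fn):
    """Fraction of true sites that are predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predictions that are true sites: TP / (TP + FP) (assumed definition)."""
    return tp / (tp + fp)


def correlation_coefficient(tp, fn, tn, fp):
    """Matthews correlation coefficient (assumed to be the CC of the tables)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        raise ValueError("CC is undefined when a marginal count is zero")
    return (tp * tn - fp * fn) / denom


# Erpin, EMBL annotated sites vs. shuffled UTRs (counts taken from the tables).
tp, fn, tn, fp = 549, 433, 888, 94
print(f"SN = {100 * sensitivity(tp, fn):.2f} %")
print(f"SP = {100 * specificity(tp, fp):.2f} %")
print(f"CC = {correlation_coefficient(tp, fn, tn, fp):.3f}")
```

Note that SP as assumed here is what is often called precision or positive predictive value, not TN/(TN+FP); the latter would give 888/982, which does not match the tabulated 85.38 %.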