Shuba Gopal, Saria Awadalla, Terry Gaasterland, George A M Cross.
Abstract
Trans-splicing is an unusual process in which two separate RNA strands are spliced together to yield a mature mRNA. We present a novel computational approach which has an overall accuracy of 82% and can predict 92% of known trans-splicing sites. We have applied our method to chromosomes 1 and 3 of Leishmania major, with high-confidence predictions for 85% and 88% of annotated genes respectively. We suggest some extensions of our method to other systems.
Year: 2005 PMID: 16277750 PMCID: PMC1297651 DOI: 10.1186/gb-2005-6-11-r95
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Classification of sequences by linear discriminant analysis (LDA)
| | Known trans-splicing regions | Known coding regions |
| Predicted trans-splicing region | True positive: 20 (14-24) | False positive: 0.9 (0-3) |
| Predicted coding region | False negative: 0.7 (0-2) | True negative: 18.5 (10-31) |
| Sensitivity: 0.97 (0.91-1.00) | Specificity: 0.96 (0.88-1.00) | Accuracy: 0.96 (0.90-1.00) |
The overall performance of the LDA method after tenfold cross-validation using 214 known trans-splicing regions and 198 coding regions is shown here. The average across all ten testing sets is reported, with the range of values indicated in parentheses for each class of sequence. Each test dataset had on average 20.7 known trans-splicing regions and 19.4 known coding regions.
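The summary statistics in the LDA table can be approximately recovered from the mean counts. The sketch below is only an illustration of that arithmetic; the function name `confusion_metrics` is hypothetical and not part of the published method.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, accuracy) for a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)            # fraction of true regions recovered
    specificity = tn / (tn + fp)            # fraction of coding regions rejected
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy


# Mean counts across the ten test sets, taken from the LDA table above.
sens, spec, acc = confusion_metrics(tp=20, fp=0.9, fn=0.7, tn=18.5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

Pooling the mean counts this way gives a specificity of about 0.95 rather than the reported 0.96. A small difference of this kind is expected if the reported values are averages of per-fold ratios rather than ratios of averaged counts, although the source does not say which was used.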
Figure 1. Inter-AG lengths in known splicing and coding regions. Inter-AG distances in known coding and trans-splicing regions show different distributions. The distance between AG dinucleotides is significantly greater in known trans-splicing regions than in known coding regions. Distances are shown for 214 known trans-splicing regions and 198 coding regions. The mean inter-AG distance for the coding region data is 42 nucleotides, compared with a mean inter-AG distance of 81 nucleotides for known trans-splicing regions.
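As a rough illustration of the quantity plotted in Figure 1, inter-AG distances can be measured on a sequence as follows. This is a minimal sketch that assumes the distance is counted from the start of one AG dinucleotide to the start of the next; the source does not state the exact convention, and the function name and toy sequence are hypothetical.

```python
import re


def inter_ag_distances(seq):
    """Distances (in nucleotides) between the starts of consecutive AG dinucleotides."""
    # Upper-case first so that soft-masked (lower-case) sequence is handled too.
    positions = [m.start() for m in re.finditer("AG", seq.upper())]
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence, invented for illustration only.
toy = "TTAGCCCCAGTAGCCCCCCCCAG"
print(inter_ag_distances(toy))
```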
Identification of splice junctions
| | Known trans-splicing regions | Known coding regions |
| Predicted splice sites | True positive: 17 (10-22) | False positive: 2.5 (0-4) |
| Predicted nonsplice sites | False negative: 4.5 (1-8) | True negative: 13.9 (11-16) |
| Sensitivity: 0.80 (0.71-0.93) | Specificity: 0.85 (0.75-1.00) | Accuracy: 0.82 (0.74-0.93) |
The overall performance of the method in identifying splice junctions was determined by comparing the number of known splice junctions that were identified by the method in known trans-splicing regions versus those in known coding regions. These results are from tenfold cross-validation, and each test dataset had on average 21.5 known trans-splicing regions and 16.4 known coding regions.
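The source does not describe how the ten test sets were drawn. One common way to set up tenfold cross-validation is sketched below, purely as an assumption: shuffle the region indices once with a fixed seed and deal them into ten folds, each of which serves once as the test set. The function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Split range(n_items) into k disjoint folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)     # fixed seed keeps the split reproducible
    return [indices[i::k] for i in range(k)]  # deal indices round-robin into k folds


folds = k_fold_indices(214, k=10)  # 214 known trans-splicing regions
sizes = [len(fold) for fold in folds]
print(sizes)
```

Each fold is held out in turn while the remaining nine are used for training, so every region is tested exactly once.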
High-confidence predictions for known trans-splicing regions
| Distance from known site (nucleotides) | Number of regions with sites predicted, mean (range) |
| Exact matches | 12.6 (7-16) |
| 10 | 1.38 (1-2) |
| 25 | 1.5 (1-3) |
| 50 | 1.38 (1-2) |
| Missing predictions | 4.5 (1-8) |
Overall performance of the method on a set of known trans-splicing regions (tenfold cross-validation of 214 EST mapped trans-splicing sites). Each test dataset had on average 21.5 known trans-splicing regions, of which on average 17 had predictions. Missing predictions indicate those sequences for which no high-confidence prediction was available or where the nearest prediction was more than 50 nucleotides away. The mean of all ten datasets is reported with the range of values in parentheses.
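The row labels of this table can be read as a classification of each region by the distance between its known site and the nearest high-confidence prediction. The sketch below assumes the rows are non-overlapping bins with inclusive upper limits (1-10, 11-25, 26-50 nucleotides); that reading is consistent with the row means summing to roughly 17, but it is not stated in the source. The function name and labels are hypothetical.

```python
def bin_by_distance(known_site, predicted_sites, limits=(10, 25, 50)):
    """Label a region by the distance from its known site to the nearest prediction."""
    if not predicted_sites:
        return "missing"                      # no high-confidence prediction at all
    nearest = min(abs(p - known_site) for p in predicted_sites)
    if nearest == 0:
        return "exact"
    for limit in limits:                      # limits are checked in ascending order
        if nearest <= limit:
            return f"within {limit}"
    return "missing"                          # nearest prediction is more than 50 nt away


print(bin_by_distance(100, [108, 300]))
```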
Figure 2. Log-normal transform of inter-AG lengths. Inter-AG distances after log-normal transform show a roughly normal curve. To better evaluate the inter-AG distances in known trans-splicing regions, we transformed the long-tailed distribution seen in Figure 1 using a log-transform. The result is a good approximation to a normal curve, allowing us to use the full panoply of statistical analyses available for manipulations of normally distributed data.
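The transform described in Figure 2 can be sketched as taking the logarithm of each inter-AG distance and then standardizing the result. The choice of natural logarithm and sample standard deviation below is an assumption, as is the function name `log_z_scores`; the source specifies neither.

```python
import math
from statistics import mean, stdev


def log_z_scores(distances):
    """Log-transform positive distances, then standardize to z scores.

    Requires at least two distances, all greater than zero.
    """
    logs = [math.log(d) for d in distances]   # natural log (assumed)
    mu, sigma = mean(logs), stdev(logs)       # sample standard deviation (assumed)
    return [(x - mu) / sigma for x in logs]


# Toy distances spaced evenly on a log scale, for illustration only.
print(log_z_scores([10, 100, 1000]))
```

In practice the mean and standard deviation would be estimated from the training distances and then applied to new segments, rather than recomputed per sequence as in this toy call.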
Figure 3. False-positive rate as a function of z score. False-positive rate as a function of z score can be used to measure the confidence of an individual prediction. The rate of false positives predicted by the method is shown as a function of the z scores used to evaluate inter-AG distances. False-positive rates were estimated for a range of z scores from -6 to +6 based on known splice sites in the training data. The dotted lines indicate that a z score of 0.6 or greater will yield a false positive rate of just 5%. In other words, inter-AG segments with a z score of 0.6 or greater will have a 95% confidence of being trans-splicing regions.
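A curve like the one in Figure 3 could be estimated by sweeping a z-score threshold over labelled training segments. The sketch below assumes that "false-positive rate" means the fraction of segments at or above the threshold that come from coding regions, which matches the "95% confidence" reading in the legend; the source does not give the formula. The function name and all z values are invented for illustration.

```python
def false_positive_fraction(z_splicing, z_coding, threshold):
    """Fraction of segments with z >= threshold that are coding (false positives)."""
    tp = sum(z >= threshold for z in z_splicing)   # trans-splicing segments called
    fp = sum(z >= threshold for z in z_coding)     # coding segments called in error
    called = tp + fp
    return fp / called if called else None         # undefined when nothing is called


# Made-up z scores; these are not data from the paper.
z_splicing = [0.7, 1.2, 0.9, -0.3, 2.0]
z_coding = [-1.5, -0.8, 0.1, 0.65, -2.2]

for threshold in (0.0, 0.6, 1.0):
    print(threshold, false_positive_fraction(z_splicing, z_coding, threshold))
```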
Predictions for chromosome 1
| Public annotation | Totals | High confidence | Low confidence | No prediction |
| Forward strand | ||||
| Protein function assigned | 22 | 20 | 2 | 0 |
| Conserved hypothetical | 18 | 16 | 0 | 2 |
| Hypothetical | 13 | 9 | 3 | 1 |
| Total | 53 | 45 | 5 | 3 |
| Reverse strand | ||||
| Protein function assigned | 9 | 8 | 1 | 0 |
| Conserved hypothetical | 18 | 14 | 4 | 0 |
| Hypothetical | 4 | 4 | 0 | 0 |
| Total | 31 | 26 | 5 | 0 |
Comparison of predicted splice sites with annotation of chromosome 1 of Leishmania major. A total of 84 genes have been annotated on chromosome 1 of L. major [12]. Of these, the method finds a splice site with a high-confidence score in all but 13 instances (85%). Only three genes were missed entirely by the method, with no prediction within the 400 nucleotide window upstream of the annotated start of the gene.
Predictions for chromosome 3
| Public annotation | Totals | High confidence | Low confidence | No prediction |
| Forward strand | ||||
| Protein function assigned | 18 | 14 | 4 | 0 |
| Conserved hypothetical | 5 | 5 | 0 | 0 |
| Hypothetical | 44 | 40 | 4 | 0 |
| Total | 67 | 59 | 8 | 0 |
| Reverse strand | ||||
| Protein function assigned | 7 | 6 | 1 | 0 |
| Conserved hypothetical | 2 | 2 | 0 | 0 |
| Hypothetical | 22 | 19 | 3 | 0 |
| Total | 31 | 27 | 4 | 0 |
Comparison of predicted splice sites with annotation of chromosome 3 of Leishmania major. A total of 98 genes have been annotated on chromosome 3 of L. major [13]. Of these, the method finds a splice site with a high confidence score in all but 12 instances (88%). A splice site was predicted for every gene annotated on this chromosome.
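The percentages quoted for chromosomes 1 and 3 follow directly from the strand totals in the two prediction tables. The short check below reproduces that arithmetic; the function name `high_confidence_percent` is hypothetical.

```python
def high_confidence_percent(forward, reverse):
    """Percent of annotated genes with a high-confidence prediction, over both strands.

    Each argument is a (total_genes, high_confidence_genes) pair for one strand.
    """
    total = forward[0] + reverse[0]
    high = forward[1] + reverse[1]
    return 100 * high / total


# (total, high confidence) per strand, from the "Total" rows of the tables above.
chr1 = high_confidence_percent(forward=(53, 45), reverse=(31, 26))
chr3 = high_confidence_percent(forward=(67, 59), reverse=(31, 27))
print(f"chromosome 1: {chr1:.1f}%  chromosome 3: {chr3:.1f}%")
```

The results, 71 of 84 genes and 86 of 98 genes, round to the 85% and 88% stated in the table legends.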