Eduardo Eyras, Alexandre Reymond, Robert Castelo, Jacqueline M Bye, Francisco Camara, Paul Flicek, Elizabeth J Huckle, Genis Parra, David D Shteynberg, Carine Wyss, Jane Rogers, Stylianos E Antonarakis, Ewan Birney, Roderic Guigo, Michael R Brent.
Abstract
BACKGROUND: Despite the continuous production of genome sequence for a number of organisms, reliable, comprehensive, and cost-effective gene prediction remains problematic. This is particularly true for genomes for which there is not a large collection of known gene sequences, such as the recently published chicken genome. We used the chicken sequence to test comparative and homology-based gene-finding methods followed by experimental validation as an effective genome annotation method.
Year: 2005 PMID: 15924626 PMCID: PMC1174864 DOI: 10.1186/1471-2105-6-131
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Venn diagram of the prediction sets. Venn diagram obtained from the comparison of the three prediction sets: Ensembl (E), SGP2 (S) and TWINSCAN (T). (A) Description of each subset in the Venn diagram. (B) Total number of intron assemblies (IAs) populating each subset. (C) Percentage of experimentally verified IAs for each subset (top) and number of assayed IAs (bottom). (D) Percentage of correctly predicted splice junctions (top) from the experimentally verified IAs (bottom).
Distribution of intron assemblies (IAs) for each of the 7 subsets of the Venn diagram of the three prediction sets: Ensembl (E), SGP2 (S) and TWINSCAN (T) (see also Figure 1). The number of transcripts from each prediction set participating in the intron assemblies is indicated on the right.
| Subsets | Number of IAs | Transcripts involved: E | Transcripts involved: S | Transcripts involved: T |
| E and S and T | 10650 | 10282 | 8974 | 9888 |
| (S and T) not E | 4769 | 0 | 3930 | 3924 |
| (E and S) not T | 4757 | 3636 | 3273 | 0 |
| (T and E) not S | 1748 | 1592 | 0 | 1507 |
| S not (E+T) | 25119 | 0 | 20740 | 0 |
| T not (S+E) | 27592 | 0 | 0 | 22239 |
| E not (T+S) | 13514 | 11014 | 0 | 0 |
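The seven subsets in the table above correspond to the regions of a three-way Venn diagram. As a minimal sketch of that partition, the Python below treats each IA as a hashable tuple of intron coordinates and uses plain set operations. This is a simplification: it matches IAs only by exact identity, whereas the comparison described in the source works with the longest non-redundant IAs shared between transcripts. The function name `venn_partition` and the toy coordinates are hypothetical and are not taken from the paper.

```python
def venn_partition(e, s, t):
    """Split three sets of intron assemblies (IAs) into the seven Venn regions."""
    return {
        "E and S and T": e & s & t,
        "(S and T) not E": (s & t) - e,
        "(E and S) not T": (e & s) - t,
        "(T and E) not S": (t & e) - s,
        "S not (E+T)": s - (e | t),
        "T not (S+E)": t - (s | e),
        "E not (T+S)": e - (t | s),
    }


# Hypothetical toy data: each IA is a tuple of (intron_start, intron_end) pairs.
ensembl = {((100, 200),), ((300, 400), (500, 600)), ((700, 800),)}
sgp2 = {((100, 200),), ((300, 400), (500, 600)), ((900, 950),)}
twinscan = {((100, 200),), ((700, 800),), ((1000, 1100),)}

regions = venn_partition(ensembl, sgp2, twinscan)
counts = {name: len(ias) for name, ias in regions.items()}

for name, n in counts.items():
    print(f"{name}: {n}")
```

Because the seven regions are disjoint and together cover the union of the three sets, the counts always sum to the number of distinct IAs.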
Figure 2. Comparison of two predictions. From the comparison of two predictions (a) we obtain three distinct sets of intron assemblies (IAs): the set of IAs that are identical in both transcripts ('A and B'), and two sets of IAs that are in one prediction but not in the other ('A not B' and 'B not A'). When two sets have the same intron with different outside boundaries for the flanking exons, these boundaries are taken from the intersection of the exons. Ensembl predictions (b) have in general more than one transcript per gene (two top yellow tracks). The intersecting intron assemblies (IAs) are therefore defined as the longest non-redundant IAs common between the transcripts from either prediction. For the novel IAs we take the longest non-redundant IAs in one set that are not present in the other set.
Experimental verification of intron assemblies (see also Figure 1)
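The rule for flanking exons in Figure 2 (same intron, different outer exon boundaries, keep the intersection) can be illustrated with a short interval sketch. It assumes closed coordinates given as `(start, end)` tuples on a single strand; the helper name `intersect_flanking_exons` and the example coordinates are hypothetical.

```python
def intersect_flanking_exons(exon_a, exon_b):
    """Return the overlap of two exons as (start, end), or None if they do not overlap."""
    start = max(exon_a[0], exon_b[0])
    end = min(exon_a[1], exon_b[1])
    return (start, end) if start <= end else None


# Hypothetical upstream flanking exons: both end at the same splice site (250)
# but start at different positions.
left_a = (100, 250)
left_b = (150, 250)
shared_left = intersect_flanking_exons(left_a, left_b)

# Hypothetical downstream flanking exons: same start (400), different ends.
right_a = (400, 520)
right_b = (400, 480)
shared_right = intersect_flanking_exons(right_a, right_b)

print(shared_left, shared_right)
```

Taking the intersection keeps only the exon sequence that both predictions support, so the shared IA never extends beyond what either transcript predicts.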
| Subsets | Total tested | No amplimer | Amplimer correctly predicted | Amplimer but junction not correctly predicted |
| S and T and E | 20 | 3 (15%) | 16 (80%) | 1 (5%) |
| (S and T) not E | 76 | 40 (53%) | 27 (35%) | 9 (12%) |
| (E and S) not T | 64 | 12 (19%) | 44 (69%) | 8 (12%) |
| (T and E) not S | 40 | 14 (35%) | 22 (55%) | 4 (10%) |
| S not (T + E) | 88 | 67 (76%) | 6 (7%) | 15 (17%) |
| T not (S + E) | 96 | 83 (86%) | 9 (9%) | 4 (5%) |
| E not (T + S) | 30 | 7 (23.3%) | 16 (53.3%) | 7 (23.3%) |
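The counts in the table above can be turned into the two rates described for Figure 1 (C) and (D). The sketch below assumes that "experimentally verified" means an amplimer was obtained, whether or not the junction was correctly predicted; that reading is an interpretation of the captions, not something stated explicitly in this record. The function name `summarize` is hypothetical.

```python
# Counts copied from the table: (total tested, no amplimer, correct, junction not correct)
results = {
    "S and T and E": (20, 3, 16, 1),
    "(S and T) not E": (76, 40, 27, 9),
    "(E and S) not T": (64, 12, 44, 8),
    "(T and E) not S": (40, 14, 22, 4),
    "S not (T + E)": (88, 67, 6, 15),
    "T not (S + E)": (96, 83, 9, 4),
    "E not (T + S)": (30, 7, 16, 7),
}


def summarize(total, no_amplimer, correct, wrong_junction):
    """Compute verification and junction-accuracy rates for one subset."""
    # Sanity check: the three outcome columns must add up to the total tested.
    assert no_amplimer + correct + wrong_junction == total
    verified = correct + wrong_junction  # assumption: any amplimer counts as verified
    return {
        "verified": verified,
        "verified_pct": 100 * verified / total,
        "junction_pct": 100 * correct / verified if verified else None,
    }


summary = {name: summarize(*row) for name, row in results.items()}

for name, stats in summary.items():
    print(f"{name}: {stats['verified_pct']:.1f}% verified, "
          f"{stats['junction_pct']:.1f}% of those with a correct junction")
```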
Figure 3. Ensembl extensions. Exon extensions to Ensembl predictions can be obtained from exons predicted by TWINSCAN and/or SGP2. These exons can either (a) be part of a transcript with exons in common with the Ensembl transcript (linked) or (b) be part of a close but non-overlapping transcript (unlinked).
Experimental verification of IAs corresponding to Ensembl 5' extensions. The extensions are separated according to whether the 5'-most Ensembl exon also existed in TWINSCAN and/or SGP2 (linked) or not (unlinked) (see Figure 3).
| Ensembl extensions | Total tested | No amplimer | Amplimer correctly predicted | Amplimer but junction not correctly predicted |
| linked | 60 | 36 (60%) | 11 (18%) | 13 (22%) |
| unlinked | 29 | 27 (93%) | 2 (7%) | 0 |
Figure 4. Classification of intron assemblies. We classified the intron assemblies that were novel with respect to a reference set according to their position relative to the other set against which we do the comparison. A novel IA can (a) fall between the genomic extent of two predictions (intergenic), (b1) bridge across two predictions (bridge), (c1) overlap the 5' or the 3' end of one prediction (external), and (d1) fall within one or more introns of another prediction (intronic). Additionally, novel IAs are labelled as complete when they are a complete ATG-to-STOP prediction: (b2), (c2) and (d2).
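The four positional classes in Figure 4 can be approximated with interval tests. The sketch below is one possible reading of the caption, not the authors' procedure: it assumes closed `(start, end)` coordinates, represents each prediction as a sorted list of exons, treats an IA as intronic only when it lies inside a single intron, and returns `"other"` for any case the caption does not describe. The function name `classify_novel_ia` and the coordinates are hypothetical, and the "complete" label (ATG-to-STOP) is not modelled.

```python
def classify_novel_ia(ia, predictions):
    """Classify a novel IA extent against a list of predictions.

    ia: (start, end) genomic extent of the novel intron assembly.
    predictions: list of predictions, each a sorted list of exon (start, end) tuples.
    """
    ia_start, ia_end = ia

    # Collect predictions whose genomic extent overlaps the IA.
    overlapping = []
    for exons in predictions:
        pred_start, pred_end = exons[0][0], exons[-1][1]
        if ia_start <= pred_end and pred_start <= ia_end:
            overlapping.append(exons)

    if not overlapping:
        return "intergenic"
    if len(overlapping) > 1:
        return "bridge"

    exons = overlapping[0]
    # Intronic: the IA lies strictly between two consecutive exons.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        if left_end < ia_start and ia_end < right_start:
            return "intronic"
    # External: the IA extends past either end of the prediction.
    if ia_start < exons[0][0] or ia_end > exons[-1][1]:
        return "external"
    return "other"


# Hypothetical reference predictions, each with two exons.
predictions = [
    [(100, 200), (400, 500)],
    [(1000, 1100), (1300, 1400)],
]

labels = {
    ia: classify_novel_ia(ia, predictions)
    for ia in [(600, 800), (450, 1050), (50, 150), (250, 350)]
}
print(labels)
```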