Danielle Thierry-Mieg, Jean Thierry-Mieg.
Abstract
BACKGROUND: Regions covering one percent of the genome, selected by ENCODE for extensive analysis, were annotated by the HAVANA/Gencode group with high-quality transcripts, thus defining a benchmark. The ENCODE Genome Annotation Assessment Project (EGASP) competition aimed at reproducing Gencode and finding new genes. The organizers evaluated the protein predictions in depth. We present a complementary analysis of the mRNAs, including alternative transcript variants.
Year: 2006 PMID: 16925834 PMCID: PMC1810549 DOI: 10.1186/gb-2006-7-s1-s12
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Statistics of the 25 selected tracks, arranged in the order of the UCSC genome browser
| UCSC track | Model with introns | Model with introns and CDS | Single exon model (some clipped) | Unique introns in mRNA | All introns in mRNA | Input or method |
| HAVANA Gencode (Sanger, UK) known + putative | 1,691 | | 70 | 3,618 | 9,693 | MEP,CA,H |
| EGASP model submissions | | | | | | |
| AceView (NCBI, US) | 1,630 | 1,460 | 24 | 3,530 | 9,597 | ME,(H) |
| UP Dogfish (Sanger, UK) | 204 | 204 | 15 | 1,679 | 1,679 | CA |
| Exogean (ENS, France) | 554 | 538 | 2 | 2,855 | 6,178 | MEP,CA |
| UP ExonHunter (U Waterloo, Canada) | 807 | 807 | 220 | 3,237 | 3,237 | MEP,CA |
| Fgenesh (U London, UK) | 462 | 458 | 97 | 2,610 | 3,241 | P,CA |
| UP GeneId (IMIM, Spain) | 267 | 267 | 51 | 1,905 | 1,905 | A |
| UP GeneMark (Georgia IT, US) | 551 | 551 | 81 | 2,185 | 2,185 | A |
| UP Jigsaw (TIGR, US) | 259 | 259 | 67 | 2,168 | 2,168 | MEP,CA |
| PairagonAny (Wash U, US) | 471 | 437 | 38 | 2,300 | 3,470 | MEP?,CA |
| UP SGP2 (IMIM, Spain) | 552 | 552 | 159 | 2,645 | 2,645 | P,CA |
| P Twinscan-MARS (Wash U, US) | 547 | 547 | 108 | 2,501 | 4,943 | CA |
| UP Augustus Any (U Göttingen, Germany) | 312 | 316 | 87 | 2,291 | 2,291 | MEP,CA |
| UP GeneZilla (TIGR, US) | 477 | 477 | 179 | 2,758 | 2,758 | A |
| UP Saga (UC Berkeley, US) | 331 | 331 | 47 | 1,737 | 1,737 | CA |
| UCSC gene tracks | | | | | | |
| *Known Gene (UCSC) | 501 | 477 | 53 | 2,264 | 4,427 | MP |
| *P CCDS | 201 | 201 | 14 | 1,296 | 1,508 | MP,H |
| *RefSeq (NCBI, US) | 342 | 325 | 41 | 2,082 | 2,922 | M(E)P,H |
| *MGC | 323 | 310 | 19 | 1,400 | 2,101 | M |
| *Ensembl (EBI, UK) | 427 | 418 | 58 | 2,429 | 3,548 | MEP,CA |
| *AceView (Aug 2005 NCBI) | 1,792 | 1,627 | 902 | 3,812 | 9,792 | ME,(H) |
| *ECgene (Korea) | 3,851 | 3,551 | 2,569 | 3,942 | 30,660 | ME,C |
| *U NscanEst (Wash U, US) | 282 | 252 | 27 | 2,292 | 2,292 | ME,CA |
| *UP GenScan (MIT, US) | 395 | 395 | 59 | 3,042 | 3,042 | A |
The number of models with or without introns (after clipping at region boundaries), the number of spliced coding models, and the number of unique and multiply used introns are given over the 31 ENCODE test regions. Coded information has been added in front of the track name: an asterisk distinguishes standard gene tracks, available genome-wide, from ENCODE-only tracks; a U track predicts a unique model per gene; a P track predicts protein-coding regions only. According to their documentation, the programs use different inputs or methods: M, E and P stand for human mRNA, EST, and protein sequences or alignments, respectively; C stands for conservation, or use of cDNA or protein evidence from other species; A stands for ab initio prediction; H stands for hand curation; and parenthesized letters stand for minimal use of the particular type. Notice the low proportion of Gencode mRNA models with an annotated CDS (in bold in the original table).
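The prefix coding described in this legend can be decoded mechanically. The following is a minimal sketch, assuming the labels are written exactly as in the table above; the helper name `decode_track` and the returned field names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical helper: split a track label such as "*UP GenScan (MIT, US)"
# into its coded prefix and its plain name. The code letters must be followed
# by whitespace, so a name like "PairagonAny" is not mistaken for a P track.
_LABEL = re.compile(r"^(\*?)\s*(?:([UP]{1,2})\s+)?(.+)$")

def decode_track(label):
    star, code, name = _LABEL.match(label.strip()).groups()
    code = code or ""
    return {
        "name": name,
        "genome_wide": star == "*",           # asterisk: standard genome-wide track
        "unique_model_per_gene": "U" in code,  # U: one model per gene
        "protein_coding_only": "P" in code,    # P: protein-coding regions only
    }

print(decode_track("*UP GenScan (MIT, US)"))
print(decode_track("PairagonAny (Wash U, US)"))
```

The input or method column (M, E, P, C, A, H) is left untouched here, since its letters are combined with commas, parentheses and, in one row, a question mark.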
Figure 1. Comparison of introns between the Gencode reference and the 24 tracks, ordered by decreasing sensitivity, over the 31 test regions. Gencode validates 3,618 unique introns and a total of 9,693 introns in its alternative transcripts. (a) Projected measure: each intron is counted only once per method. Introns with the same coordinates as Gencode introns are shown in green and novel introns in red. The Gencode introns missed in each track (false negatives) correspond to the distance between the 'true positive' bar and the Gencode reference, but are not explicitly represented. (b) Quantitative measure: all alternative variants are counted separately. Introns identical to Gencode introns but over-used relative to Gencode are counted separately (in yellow) from novel introns that are not known to Gencode.
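The two measures in this caption differ only in whether repeated uses of an intron are collapsed. The sketch below illustrates that distinction with a set for the projected measure and a multiset for the quantitative one. It is not the authors' code: the function name, the representation of an intron as a hashable tuple, and the reading of "over-used" as uses beyond the Gencode count are all assumptions.

```python
from collections import Counter

def compare_introns(reference, track):
    """Compare two intron lists; each list holds one entry per use in a transcript.

    An intron is any hashable key, for example (chrom, strand, start, end).
    """
    ref_use, trk_use = Counter(reference), Counter(track)
    ref_set, trk_set = set(ref_use), set(trk_use)

    # Projected measure: every distinct intron counts once per method.
    projected = {
        "shared": len(trk_set & ref_set),
        "novel": len(trk_set - ref_set),
        "missed": len(ref_set - trk_set),
    }

    # Quantitative measure: every use in every alternative variant counts.
    shared = overused = novel = 0
    for intron, n in trk_use.items():
        if intron in ref_use:
            shared += min(n, ref_use[intron])
            overused += max(0, n - ref_use[intron])  # assumed meaning of "over-used"
        else:
            novel += n
    quantitative = {"shared": shared, "overused": overused, "novel": novel}
    return projected, quantitative

# Toy data, not taken from the paper.
A, B, C, D = ("chr1", "+", 201, 299), ("chr1", "+", 401, 499), ("chr1", "+", 601, 699), ("chr1", "+", 801, 899)
print(compare_introns([A, A, B, C], [A, A, A, B, D, D]))
```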
Figure 2. Comparison of whole transcripts. (a) Strategy for selecting the best one-to-one matching pairs. (b) Comparison of whole transcripts through their intron signatures. The number of transcripts identical to Gencode, best-matching but different from Gencode, new transcripts in Gencode genes and new transcripts in new genes are represented.
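Comparing transcripts through their intron signatures means that two models count as identical when they use exactly the same ordered set of introns, whatever their 5' and 3' ends. The sketch below illustrates only that exact-match step. It assumes 1-based inclusive exon coordinates and uses hypothetical function names; it does not reproduce the best one-to-one pairing of panel (a), whose details are not given in this record.

```python
def intron_signature(chrom, strand, exons):
    """Return (chrom, strand, intron_1, intron_2, ...) for a list of (start, end) exons.

    Coordinates are assumed 1-based and inclusive, so the intron between two
    exons runs from the left exon end + 1 to the right exon start - 1.
    """
    exons = sorted(exons)
    introns = tuple(
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    )
    return (chrom, strand) + introns

def count_identical(reference, track):
    """Count spliced track models whose signature does or does not occur in the reference."""
    ref_signatures = {intron_signature(*t) for t in reference}
    identical = different = 0
    for transcript in track:
        signature = intron_signature(*transcript)
        if len(signature) == 2:  # single-exon model: no introns to compare
            continue
        if signature in ref_signatures:
            identical += 1
        else:
            different += 1
    return identical, different

# Toy data, not taken from the paper.
reference = [("chr1", "+", [(100, 200), (300, 400), (500, 600)])]
track = [
    ("chr1", "+", [(150, 200), (300, 400), (500, 650)]),  # same introns, different ends
    ("chr1", "+", [(100, 200), (500, 600)]),              # skips the middle exon
    ("chr1", "+", [(100, 600)]),                          # single exon, ignored
]
print(count_identical(reference, track))
```

Models with the same signature are each counted, so a track that reports one structure twice contributes two matches in this sketch.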
Figure 3. Consensus analysis. (a) Sensitivity and specificity at identifying 1,556 consensus transcripts from the pool of the following evidence-based tracks: RefSeq, Known Gene, Ensembl, Gencode, AceView, ECgene and ExonWalk. The sensitivity and specificity of all tracks at identifying these consensus models are plotted and listed in Table 2. (b) Closest neighbor consensus, evaluated by switching the track of reference. This figure shows the number of evidence-based models from CCDS, RefSeq, UCSC Known Genes, Gencode, AceView, ExonWalk and Ensembl whose intron-exon structure is exactly matched by the 25 tracks. Tracks are arranged in decreasing order of averaged detection sensitivity, defined here as the sum of all evidence-based models from these seven reference tracks detected exactly.
Table 2. Sensitivity and specificity of each method at detecting the 1,556 consensus transcripts
| Track | Number of models with introns | Consensual models (of 1,556 total) | Sensitivity | Specificity |
| *AceView | 1,792 | 1,302 | 84% | 73% |
| Gencode | 1,691 | 1,255 | 81% | 74% |
| *ECgene | 3,851 | 1,198 | 77% | 31% |
| AceView | 1,630 | 1,165 | 75% | 71% |
| *ExonWalk | 892 | 511 | 33% | 57% |
| *Known Gene | 501 | 432 | 28% | 86% |
| Exogean | 554 | 404 | 26% | 73% |
| *RefSeq | 342 | 332 | 21% | 97% |
| Pairagon | 471 | 310 | 20% | 66% |
| *Ensembl | 427 | 295 | 19% | 69% |
| *MGC | 323 | 217 | 14% | 67% |
| Fgenesh | 462 | 217 | 14% | 47% |
| *P CCDS | 201 | 152 | 10% | 76% |
| UP Jigsaw | 259 | 150 | 10% | 58% |
| *U NscanEst | 282 | 104 | 7% | 37% |
| UP Augustus | 312 | 100 | 6% | 32% |
| P Twinscan | 547 | 77 | 5% | 14% |
| UP GeneMark | 551 | 50 | 3% | 9% |
| UP SGP2 | 552 | 48 | 3% | 9% |
| UP GeneZilla | 477 | 47 | 3% | 10% |
| UP ExonHunter | 807 | 41 | 3% | 5% |
| UP GeneID | 267 | 38 | 2% | 14% |
| *UP GenScan | 395 | 37 | 2% | 9% |
| UP Dogfish | 204 | 33 | 2% | 16% |
| UP Saga | 331 | 18 | 1% | 5% |
Sensitivity and specificity of each method at detecting the 1,556 consensus transcripts across the pool of the following evidence-based tracks: RefSeq, Known Gene, Ensembl, Gencode, AceView, ECgene and ExonWalk, as in Figure 3a. Coded information has been added in front of the track name: an asterisk distinguishes standard gene tracks, available genome-wide, from ENCODE-only tracks; a U track predicts a unique model per gene; a P track predicts protein-coding regions only.
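The formulas behind the two percentage columns are not spelled out in this record, but the tabulated values are consistent with sensitivity being the consensual models divided by the 1,556 consensus transcripts, and specificity being the consensual models divided by the track's own number of models with introns. The sketch below recomputes a few rows under that inferred reading; the function name is hypothetical.

```python
CONSENSUS_TOTAL = 1556  # consensus transcripts, from the table title

def sensitivity_specificity(models_with_introns, consensual_models,
                            consensus_total=CONSENSUS_TOTAL):
    """Inferred definitions: fraction of consensus transcripts found by the track,
    and fraction of the track's spliced models that are consensual."""
    sensitivity = consensual_models / consensus_total
    specificity = consensual_models / models_with_introns
    return sensitivity, specificity

# (models with introns, consensual models), copied from four rows of the table.
rows = {
    "*AceView": (1792, 1302),
    "Gencode": (1691, 1255),
    "*RefSeq": (342, 332),
    "UP Saga": (331, 18),
}
for track, (n_models, n_consensual) in rows.items():
    sens, spec = sensitivity_specificity(n_models, n_consensual)
    print(f"{track}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Under this reading, specificity is the quantity often called precision: it depends only on how many of a track's own models are consensual, which is why RefSeq can score 97% specificity at 21% sensitivity.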