Estelle Proux-Wéra, David Armisén, Kevin P Byrne, Kenneth H Wolfe.
Abstract
BACKGROUND: Yeasts are a model system for exploring eukaryotic genome evolution. Next-generation sequencing technologies are poised to vastly increase the number of yeast genome sequences, both from resequencing projects (population studies) and from de novo sequencing projects (new species). However, the annotation of genomes presents a major bottleneck for de novo projects, because it still relies on a process that is largely manual.
Year: 2012 PMID: 22984983 PMCID: PMC3507789 DOI: 10.1186/1471-2105-13-237
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Classifying pairs of consecutive HSPs. In this hypothetical example, a TBLASTN search using a 100 amino acid query protein produces two HSPs that are close together in the genome. We classify the relationship between two consecutive HSPs into one of four categories: (i) Frameshift. Two consecutive HSPs are in different frames but the distance between them is similar in both the query and the subject. (ii) Region of low similarity. Two consecutive HSPs are in the same frame, separated by a similar distance in both the query and the subject, with no stop codon between them. (iii) Intron. Two consecutive HSPs for which query and subject coordinates are dissimilar. This possibility is only considered if an existing gene from the same pillar and species group contains an intron. (iv) Duplication. If all other possibilities have been excluded, two consecutive HSPs suggest a probable local gene duplication.
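The four categories in the Figure 1 caption can be read as a small decision procedure. The following Python sketch is illustrative only and is not taken from the YGAP source: the `HSP` record, the `classify_pair` function, the `stop_between` and `intron_in_pillar` flags, and in particular the `tolerance_nt` threshold used to decide whether two distances are "similar" are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    # Hypothetical record; coordinates are 1-based and inclusive.
    q_start: int  # query start (amino acids)
    q_end: int    # query end (amino acids)
    s_start: int  # subject start (nucleotides, on the gene's strand)
    s_end: int    # subject end (nucleotides)
    frame: int    # reading frame of the subject hit (1, 2 or 3)


def classify_pair(up, down, stop_between, intron_in_pillar, tolerance_nt=30):
    """Classify two consecutive HSPs (upstream, downstream).

    stop_between: True if a stop codon lies between the HSPs in their frame.
    intron_in_pillar: True if an existing gene from the same pillar and
        species group contains an intron.
    tolerance_nt: assumed cutoff for "similar" distances; the real
        threshold is not given in the caption.
    """
    # Express both gaps in nucleotides so they are comparable.
    q_gap_nt = 3 * (down.q_start - up.q_end - 1)
    s_gap_nt = down.s_start - up.s_end - 1
    similar = abs(q_gap_nt - s_gap_nt) <= tolerance_nt

    if similar and up.frame != down.frame:
        return "frameshift"
    if similar and up.frame == down.frame and not stop_between:
        return "low_similarity"
    if not similar and intron_in_pillar:
        return "intron"
    # All other possibilities excluded.
    return "duplication"


if __name__ == "__main__":
    up = HSP(1, 40, 1000, 1119, 1)
    down = HSP(45, 100, 1133, 1300, 2)
    print(classify_pair(up, down, stop_between=False, intron_in_pillar=False))
```

In this sketch, a same-frame pair with a stop codon between the HSPs falls through to the last category, which is one possible reading of "all other possibilities have been excluded".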
Figure 2. Method for defining start and stop codon coordinates. The thick black bar indicates the location of the original BLAST HSP, and the thick grey bar indicates the gene coordinates reported by YGAP. M and asterisk (*) represent the locations of all possible start (ATG) and stop (TAA/TAG/TGA) codons in the same frame as the HSP. The start codon is chosen by searching around the beginning of the HSP as follows: (A) If the HSP (or the upstream HSP, in the case where a pair of HSPs is being considered) begins with a methionine codon, no change is made to the starting coordinate. (B) If the HSP does not begin with methionine, the ORF is extended to the furthest upstream methionine. (C) If during extension a stop codon is encountered before reaching a methionine, the software instead searches for a leading methionine within the first 45 nucleotides of the HSP. (D) If no suitable starting methionine is found using these steps, the original coordinates of the HSP are kept and the gene is tagged for manual inspection. Stop codons are found by walking downstream from the HSP, unless there is a stop codon within the HSP (in which case the HSP is trimmed accordingly).
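Steps (A) to (D) and the stop-codon rule from the Figure 2 caption can be sketched as follows. This is a minimal illustration rather than the YGAP implementation: the function names `find_start` and `find_stop`, the use of 0-based coordinates on a forward-strand string, the handling of a sequence edge reached without finding either codon (treated like case C), and the choice to include the stop codon in the returned end coordinate are all assumptions made here.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_start(seq, hsp_start, hsp_end, window=45):
    """Return (start, needs_manual_check) for an in-frame HSP.

    seq is the forward-strand DNA; hsp_start/hsp_end are 0-based,
    end-exclusive, with hsp_start on a codon boundary (assumption).
    """
    # (A) The HSP already begins with a methionine codon.
    if seq[hsp_start:hsp_start + 3] == "ATG":
        return hsp_start, False

    # (B) Walk upstream codon by codon, remembering the furthest ATG
    # seen before any in-frame stop codon.
    furthest = None
    pos = hsp_start - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOPS:
            break
        if codon == "ATG":
            furthest = pos
        pos -= 3
    if furthest is not None:
        return furthest, False

    # (C) No upstream methionine: look inside the first `window`
    # nucleotides of the HSP instead.
    for pos in range(hsp_start + 3, min(hsp_start + window, hsp_end), 3):
        if seq[pos:pos + 3] == "ATG":
            return pos, False

    # (D) Keep the original coordinate and flag for manual inspection.
    return hsp_start, True


def find_stop(seq, start):
    """Return the end coordinate (exclusive, stop codon included) of the
    first in-frame stop at or after `start`, or None if there is none.
    A stop inside the HSP is found first, which trims the HSP."""
    for pos in range(start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOPS:
            return pos + 3
    return None


if __name__ == "__main__":
    seq = "TAAATGAAAGGGCCCTTTTGA"
    start, flagged = find_start(seq, hsp_start=9, hsp_end=18)
    print(start, flagged, find_stop(seq, start))
```

Scanning for the stop codon from the chosen start covers both situations in the caption: a stop within the HSP is met first and trims the gene, otherwise the scan continues downstream of the HSP.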
Figure 3. Screenshots from the YGAP website. (A) Upload screen. (B) Results page including links to several types of output files and gene lists. (C) Mini-YGOB browser showing the new annotated species (here, T. blattae, a post-WGD species, in yellow/orange), compared to genomes of E. gossypii (non-WGD species, in green), S. cerevisiae (post-WGD species, in blue), and the Ancestral genome (in pink).
Comparison of automatic reannotations of the S. cerevisiae genome by YGAP and AUGUSTUS, to the reference annotation
| Chr_1 | 75 | 76 | 7 | 5 | 1 | 1 | 2 | 3 | 8 | 9 |
| Chr_2 | 354 | 346 | 3 | 13 | 2 | 2 | 5 | 6 | 28 | 26 |
| Chr_3 | 135 | 126 | 1 | 9 | 6 | 0 | 6 | 4 | 9 | 16 |
| Chr_4 | 668 | 646 | 8 | 24 | 2 | 1 | 15 | 16 | 47 | 54 |
| Chr_5 | 236 | 232 | 1 | 6 | 7 | 0 | 6 | 10 | 18 | 24 |
| Chr_6 | 101 | 106 | 2 | 8 | 1 | 1 | 8 | 5 | 14 | 3 |
| Chr_7 | 457 | 446 | 1 | 9 | 6 | 1 | 7 | 14 | 37 | 45 |
| Chr_8 | 237 | 235 | 2 | 7 | 2 | 1 | 9 | 8 | 22 | 20 |
| Chr_9 | 180 | 176 | 0 | 7 | 5 | 1 | 4 | 4 | 20 | 21 |
| Chr_10 | 312 | 300 | 0 | 9 | 4 | 0 | 6 | 5 | 25 | 32 |
| Chr_11 | 284 | 275 | 1 | 9 | 5 | 0 | 1 | 0 | 18 | 24 |
| Chr_12 | 445 | 414 | 3 | 14 | 5 | 2 | 10 | 10 | 24 | 48 |
| Chr_13 | 405 | 387 | 2 | 10 | 7 | 0 | 5 | 9 | 22 | 39 |
| Chr_14 | 353 | 337 | 0 | 11 | 4 | 3 | 3 | 4 | 25 | 31 |
| Chr_15 | 465 | 451 | 3 | 14 | 4 | 0 | 5 | 12 | 34 | 42 |
| Chr_16 | 412 | 385 | 10 | 17 | 4 | 0 | 7 | 7 | 25 | 49 |
| Total | 5119 | 4938 | 44 | 172 | 65 | 13 | 99 | 117 | 376 | 483 |
Columns show the numbers of genes in each category, when compared to genes in the reference YGOB annotation of the S. cerevisiae genome (which is based on Saccharomyces Genome Database annotation).
Comparison of intron structure predictions in S. cerevisiae and N. castellii by YGAP and AUGUSTUS
| YGAP | 146 | 2 | 127 | 17 | 122 | 266 |
| AUGUSTUS | 221 | 90 | 87 | 44 | 121 | 252 |
| YGAP | 146 | 12 | 123 | 11 | 58 | 192 |
| AUGUSTUS | 251 | 173 | 58 | 20 | 94 | 172 |
Note that False Negatives in YGAP include not only those genes for which no intron was predicted by the software, but also those for which intron coordinates could not be defined and were tagged for manual curation. The total number of introns studied (rightmost column) differs between YGAP and AUGUSTUS because some genes were not predicted by both methods.
Comparison of automatic annotations of the N. castellii genome by YGAP and AUGUSTUS, to the reference (Scas) annotation
| scf7180000013410 (chr. 1) | 1427 | 1289 | 12 | 34 | 31 | 5 | 19 | 46 | 56 | 198 |
| scf7180000013411 (chr. 2) | 851 | 764 | 8 | 28 | 12 | 5 | 13 | 19 | 35 | 106 |
| scf7180000013405 (chr. 3) | 544 | 485 | 3 | 17 | 13 | 2 | 14 | 25 | 28 | 84 |
| scf7180000013408 (chr. 4) | 462 | 418 | 1 | 10 | 9 | 1 | 3 | 9 | 16 | 59 |
| scf7180000013414 (chr. 5) | 386 | 338 | 3 | 16 | 13 | 8 | 6 | 18 | 11 | 50 |
| scf7180000013415b (chr. 6) | 374 | 336 | 3 | 7 | 6 | 1 | 10 | 15 | 17 | 56 |
| scf7180000013412 (chr. 7) | 387 | 342 | 3 | 16 | 8 | 1 | 11 | 25 | 18 | 57 |
| scf7180000013409 (chr. 8) | 331 | 294 | 1 | 8 | 11 | 11 | 6 | 9 | 15 | 44 |
| scf7180000013415a (chr. 9) | 290 | 250 | 4 | 12 | 1 | 3 | 3 | 8 | 17 | 47 |
| scf7180000013407 (chr. 10) | 208 | 185 | 2 | 5 | 5 | 1 | 5 | 8 | 10 | 34 |
| Total | 5260 | 4701 | 40 | 153 | 109 | 38 | 90 | 182 | 223 | 735 |
Columns show the numbers of genes in each category, when compared to genes in the reference (Scas) manual annotation of N. castellii genes.
Numbers of annotated genes requiring frameshift corrections or manual attention in S. cerevisiae and N. castellii
| Automatically corrected (b) | - | - | 97 | 86 |
| Unable to correct (c) | 93 | 3 | 76 | 33 |
| Tagged for manual inspection (d) | 390 | 155 | 465 | 216 |
a Confirmed by comparison to the curated annotations of S. cerevisiae and N. castellii.
b Frameshifts corrected using the reads file.
c Either because the reads were not helpful or because there was no frameshift (e.g. genes in which natural ribosomal frameshifting occurs).
d These potential genes may contain undetected introns, untranslatable sequences due to inaccurate prediction of exon locations, or may begin or end in undefined (N) nucleotides.