G W Williams1, P A Davis, A S Rogers, T Bieri, P Ozersky, J Spieth.
Abstract
The Caenorhabditis elegans genome sequence was published over a decade ago; this was the first published genome of a multi-cellular organism, and the WormBase project now has a decade of experience in curating this genome's sequence and gene structures. In one of its roles as a central repository for nematode biology, WormBase continues to refine the gene structure annotations using sequence similarity and other computational methods, as well as information from the literature and community-submitted annotations. We describe the various methods of gene structure curation that have been tried by WormBase and the problems associated with each of them. We also describe the current strategy for gene structure curation, and introduce the WormBase 'curation tool', which integrates different data sources in order to identify new and correct gene structures. Database URL: http://www.wormbase.org/
Year: 2011 PMID: 21543339 PMCID: PMC3092607 DOI: 10.1093/database/baq039
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The number of curated CDSs and non-coding genes in C. elegans.
Figure 2. Screenshot of the curation system (right) in action with the ACeDB FMAP editor (left), displaying a simplified and annotated view of a typical anomaly of a curated CDS structure together with the structures predicted by Twinscan and mGene. There is evidence from the mGene prediction, an EST alignment and a weak C. brenneri protein homology for an extra exon at the 3′-end. The curation system has been set to find all the anomalies in the clone F53F8, and some of these can be seen in the list at the bottom. Many of these anomalies lie outside the current FMAP view, which is centred around the CDS F53F8.7.
Figure 3. Relationships of the various components of the curation tool and the genome database. The components of the ACeDB database are shown in yellow and the components of the curation tool are shown in brown. The curator interacts with both the curation tool GUI to find regions with anomalies and the ACeDB FMAP genome editor to correct those regions.
Types of sequence curation anomaly
| Name | Description of the anomaly | Score |
|---|---|---|
| UNMATCHED_RST5 | 5′ RACE tags that are not near the 5′-end of a CDS | 5 |
| UNMATCHED_TWINSCAN | Twinscan predicted exons that do not overlap any CDS exons | 1 |
| UNMATCHED_GENEFINDER | Genefinder predicted exons that do not overlap any CDS exons | 1 |
| JIGSAW_DIFFERS_FROM_CDS | Predicted jigsaw exons that differ from the CDS exons | 1 |
| CDS_DIFFERS_FROM_JIGSAW | CDS exons that do not overlap exons predicted by the program jigsaw | 1 |
| UNMATCHED_WABA | WABA well-conserved coding regions that do not match any CDS exons | Logarithm of the WABA score |
| OVERLAPPING_EXONS | CDS exons that overlap a CDS exon or any other sort of gene in the opposite sense | 5 |
| SHORT_EXONS | CDS exons shorter than 30 bases | 1 |
| LONG_EXONS | CDS exons longer than 20 000 bases | 1 |
| SHORT_INTRONS | CDS introns shorter than 25 bases | 1 |
| REPEAT_OVERLAPS_EXON | CDS exons that substantially overlap RepeatMasked regions | 1 |
| INTRONS_IN_UTR | UTRs which have three or more exons | 1 |
| SPLIT_GENE_BY_TWINSCAN | CDS that overlap two or more Twinscan predictions indicating they should be split | 1 |
| UNMATCHED_EST | EST alignments with no matching CDS exons or pseudogenes or transposons or repeats | 1 |
| UNMATCHED_MASS_SPEC_PEPTIDE | Mass spectrometry peptide positions that are no longer completely covered by a CDS exon or transposon | 10 |
| EST_OVERLAPS_INTRON | CDS introns (excluding ones from isoforms) that are completely covered by an aligned EST or other transcript alignment | 5 |
| UNMATCHED_EXPRESSION | Tiling array highly expressed regions that do not match a CDS | 10 |
| UNCONFIRMED_INTRON | Introns of EST/mRNA alignments that do not exactly match CDS introns and which do not overlap with pseudogenes, etc. | 10 |
| WEAK_INTRON_SPLICE_SITE | Splice sites of CDS introns that have weak scores | 1 |
| UNMATCHED_PROTEIN | BLASTX protein alignments to the genome which do not overlap CDS exons or pseudogenes or transposons, etc. | Logarithm of the BLASTX score |
| UNMATCHED_EST | EST/mRNA alignments with no matching CDS exons or pseudogenes or transposons | 3 |
| FRAMESHIFTED_PROTEIN | BLASTX protein alignments to the genome that indicate an apparent frameshift | Logarithm of the BLASTX score |
| MERGE_GENES_BY_PROTEIN | BLASTX protein alignments to the genome which overlap two genes indicating that the genes should be merged | Logarithm of the BLASTX score |
| NOT_PREDICTED_BY_MGENE | The curated CDS is not predicted by mGene | 2 |
| NOVEL_MGENE_PREDICTION | mGene predicts a CDS which does not overlap with a curated CDS | 2 |
| UNMATCHED_MGENE | mGene predicted exons that do not overlap any CDS exons | 2 |
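
Three of the anomaly types in the table (SHORT_EXONS, LONG_EXONS and SHORT_INTRONS) are defined by simple length thresholds, so they can be illustrated with a short sketch. The Python below is not the WormBase curation tool's code; it only shows how such checks might look. The thresholds and the score of 1 are taken from the table, while the `Anomaly` class, the `structural_anomalies` function and the coordinate convention (1-based, inclusive, forward-strand exon positions) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Thresholds and score taken from the anomaly table.
SHORT_EXON_LIMIT = 30       # SHORT_EXONS: exons shorter than 30 bases
LONG_EXON_LIMIT = 20_000    # LONG_EXONS: exons longer than 20 000 bases
SHORT_INTRON_LIMIT = 25     # SHORT_INTRONS: introns shorter than 25 bases
STRUCTURAL_SCORE = 1        # all three anomaly types score 1 in the table


@dataclass(frozen=True)
class Anomaly:
    """Hypothetical record for one flagged region (not a WormBase class)."""
    name: str
    start: int
    end: int
    score: int


def structural_anomalies(exons: List[Tuple[int, int]]) -> List[Anomaly]:
    """Flag short/long exons and short introns in one CDS.

    `exons` holds (start, end) pairs; 1-based inclusive coordinates on the
    forward strand are an assumption of this sketch.
    """
    ordered = sorted(exons)
    found: List[Anomaly] = []

    # Exon length checks.
    for start, end in ordered:
        length = end - start + 1
        if length < SHORT_EXON_LIMIT:
            found.append(Anomaly("SHORT_EXONS", start, end, STRUCTURAL_SCORE))
        elif length > LONG_EXON_LIMIT:
            found.append(Anomaly("LONG_EXONS", start, end, STRUCTURAL_SCORE))

    # Introns are the gaps between consecutive exons.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        intron_start, intron_end = prev_end + 1, next_start - 1
        length = intron_end - intron_start + 1
        # Skip abutting or overlapping exons, which leave no intron.
        if 0 < length < SHORT_INTRON_LIMIT:
            found.append(
                Anomaly("SHORT_INTRONS", intron_start, intron_end, STRUCTURAL_SCORE)
            )

    return found


if __name__ == "__main__":
    # Made-up coordinates, chosen only to trigger each check once.
    for anomaly in structural_anomalies([(100, 120), (140, 400), (500, 30000)]):
        print(anomaly)
```

The remaining anomaly types depend on external evidence (EST, BLASTX, mGene and similar alignments) and are not covered by this sketch.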
Figure 4. Numbers of changes to CDS structures, new protein-coding genes and new isoforms created in each WormBase release, showing a marked rise in curation activity from release 176 (marked by the arrow) onwards.