Rachel A Harte, Catherine M Farrell, Jane E Loveland, Marie-Marthe Suner, Laurens Wilming, Bronwen Aken, Daniel Barrell, Adam Frankish, Craig Wallin, Steve Searle, Mark Diekhans, Jennifer Harrow, Kim D Pruitt.
Abstract
The Consensus Coding Sequence (CCDS) collaboration involves curators at multiple centers with a goal of producing a conservative set of high quality, protein-coding region annotations for the human and mouse reference genome assemblies. The CCDS data set reflects a 'gold standard' definition of best supported protein annotations, and corresponding genes, which pass a standard series of quality assurance checks and are supported by manual curation. This data set supports use of genome annotation information by human and mouse researchers for effective experimental design, analysis and interpretation. The CCDS project consists of analysis of automated whole-genome annotation builds to identify identical CDS annotations, quality assurance testing and manual curation support. Identical CDS annotations are tracked with a CCDS identifier (ID) and any future change to the annotated CDS structure must be agreed upon by the collaborating members. CCDS curation guidelines were developed to address some aspects of curation in order to improve initial annotation consistency and to reduce time spent in discussing proposed annotation updates. Here, we present the current status of the CCDS database and details on our procedures to track and coordinate our efforts. We also present the relevant background and reasoning behind the curation standards that we have developed for CCDS database treatment of transcripts that are nonsense-mediated decay (NMD) candidates, for transcripts containing upstream open reading frames, for identifying the most likely translation start codons and for the annotation of readthrough transcripts. Examples are provided to illustrate the application of these guidelines.
Database URL: http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
Year: 2012 PMID: 22434842 PMCID: PMC3308164 DOI: 10.1093/database/bas008
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Status of current CCDS builds (as of 7 September 2011)
| Organism → | Human (Build 37.3) | Mouse (Build 37.2) |
|---|---|---|
| GeneIDs | 18 471 | 19 508 |
| CCDS IDs | 26 473 | 22 187 |
| Public CCDS IDs | 26 400 | 21 921 |
| Genes with >1 CCDS ID | 4999 | 1986 |
| Genes with >6 CCDS IDs | 76 | 15 |
a Public CCDS IDs are all those that are not currently under review or pending an update or withdrawal.
Types of CCDS QA tests performed prior to acceptance of CCDS candidates
| CCDS QA test | Test purpose |
|---|---|
| Subject to NMD | Checks for transcripts subject to NMD |
| Quality low | Checks for low coding propensity |
| Has non-consensus splice sites | Checks for non-canonical splice sites |
| Predicted pseudogene | Checks for genes that are predicted to be pseudogenes by UCSC |
| Ortholog not found/not conserved | Checks for genes that are not conserved (UCSC calculation) and/or are not in a HomoloGene cluster |
| Too short | Checks for transcripts or proteins that are unusually short, typically <100 amino acids |
| RefSeq is not an NP_ | Checks if the RefSeq has model (XP_) status; only NCBI matches with NP_ IDs are permitted as CCDS ID accessions |
| CDS start or stop not in alignment | Checks for a start or stop codon in the reference genome sequence |
| Internal stop | Checks for the presence of an internal stop codon in the genomic sequence; possibly a selenocysteine codon or ribosomal frameshift |
| Length mismatch versus genome | Checks if the protein encoded by the reference genome sequence is the same length as the matching annotation sequences |
| NCBI:Ensembl protein length different | Checks if the protein encoded by the NCBI RefSeq is the same length as the EBI/WTSI protein |
| Low percent identity versus genome | Checks for >99% overall identity between the matching annotations and the genomic-encoded protein |
| NCBI:Ensembl low percent identity | Checks for >99% overall identity between the NCBI and EBI/WTSI proteins |
| Accession dead | Checks if an associated RefSeq is no longer valid |
| GeneID changed | Checks if the GeneID has been changed |
| Gene discontinued | Checks if the GeneID is no longer valid |
| Not protein coding | Checks if the GeneID no longer has a protein-coding locus type |
| More than one GeneID represented | Checks for accessions associated with >1 GeneID; allowed only for readthrough genes that encode the same protein as an individual gene |
a All tests are performed following the annotation comparison step of each CCDS build and are independent of individual annotation group QA tests performed before the annotation comparison.
b Subject to NMD: when the stop codon occurs >50 nt upstream of the last splice site (7, 8).
c Non-consensus splice sites: splice donor-acceptor pairs other than GT-AG, GC-AG and AT-AC.
d Predicted pseudogene: predicted retrotransposed genes (9).
e HomoloGene: NCBI’s database for the automated detection of homologs (http://www.ncbi.nlm.nih.gov/homologene/).
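Three of the QA tests in the table have thresholds concrete enough to sketch in code: the NMD rule (stop codon >50 nt upstream of the last splice site), the splice-site check, and the "too short" check (<100 amino acids). The following is a minimal illustration, not the CCDS pipeline itself. The `Candidate` class, its fields and all function names are hypothetical. The distance is measured from the end of the stop codon to the last exon-exon junction, which is an assumption about how the 50 nt are counted, and the set of consensus donor-acceptor pairs (GT-AG, GC-AG, AT-AC) is likewise an assumption exposed as a constant.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Thresholds taken from the table and its footnotes.
NMD_DISTANCE_NT = 50          # stop codon >50 nt upstream of the last splice site
MIN_PROTEIN_LENGTH_AA = 100   # "Too short": typically <100 amino acids

# Assumption: donor-acceptor dinucleotide pairs treated as consensus.
CONSENSUS_SPLICE_PAIRS = frozenset({("GT", "AG"), ("GC", "AG"), ("AT", "AC")})


@dataclass
class Candidate:
    """Hypothetical, simplified view of a CCDS candidate transcript."""
    exon_lengths: List[int]               # exon lengths in transcript order (nt)
    stop_codon_start: int                 # 0-based transcript offset of the stop codon
    splice_pairs: List[Tuple[str, str]]   # (donor, acceptor) dinucleotides per intron
    protein_length: int                   # encoded protein length (amino acids)


def last_junction_offset(c: Candidate) -> Optional[int]:
    """Transcript offset of the last exon-exon junction; None for single-exon transcripts."""
    if len(c.exon_lengths) < 2:
        return None
    return sum(c.exon_lengths[:-1])


def is_nmd_candidate(c: Candidate) -> bool:
    """True if the stop codon ends more than NMD_DISTANCE_NT upstream of the last junction."""
    junction = last_junction_offset(c)
    if junction is None:
        return False  # no splice site, so the rule cannot apply
    stop_end = c.stop_codon_start + 3
    return junction - stop_end > NMD_DISTANCE_NT


def has_non_consensus_splice_site(c: Candidate) -> bool:
    """True if any intron uses a donor-acceptor pair outside the consensus set."""
    return any(
        (donor.upper(), acceptor.upper()) not in CONSENSUS_SPLICE_PAIRS
        for donor, acceptor in c.splice_pairs
    )


def is_too_short(c: Candidate) -> bool:
    """True if the encoded protein is shorter than the typical minimum length."""
    return c.protein_length < MIN_PROTEIN_LENGTH_AA


def failed_tests(c: Candidate) -> List[str]:
    """Names of the sketched QA tests that flag this candidate, in table order."""
    checks = [
        ("Subject to NMD", is_nmd_candidate),
        ("Has non-consensus splice sites", has_non_consensus_splice_site),
        ("Too short", is_too_short),
    ]
    return [name for name, check in checks if check(c)]


if __name__ == "__main__":
    # Stop codon ends at offset 253; the last junction is at 350 (97 nt downstream).
    example = Candidate(
        exon_lengths=[200, 150, 300],
        stop_codon_start=250,
        splice_pairs=[("GT", "AG"), ("GC", "AG")],
        protein_length=80,
    )
    print(failed_tests(example))  # ['Subject to NMD', 'Too short']
```

In this sketch a flagged candidate is only reported, not rejected, which mirrors the table's description of the tests as checks that precede acceptance and manual curation.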
Figure 1. The flowchart outlines the CCDS review process (light gray boxes). CCDS IDs undergo status changes during and following the review process, as indicated by the colored boxes, where light green indicates ‘Public’ status, red indicates an ongoing review that has not yet reached consensus, orange indicates a pending update or withdrawal that has reached consensus, and purple indicates ‘Withdrawn’ status.
Figure 2. UCSC Genome Browser view of the human KLHL35 (kelch-like 35) gene. CCDS8237.1 was based on AK091109.1 (mRNA track, blue). This CCDS ID has now been withdrawn because a retained intron introduces a premature termination codon, rendering the transcript an NMD candidate. CCDS44685.2 representing the completely processed full-length variant remains valid for this gene.
Figure 3. UCSC Genome Browser view of CCDS4929.1, which was updated to version 2, representing a variant of the human CRISP3 (cysteine-rich secretory protein 3) gene. The CDS was extended at the 5′-end. (a) Both the longer protein (258 amino acids) encoded by the update and the shorter protein (245 amino acids) have predicted signal peptides (SignalP v4.0) of 32 amino acids and 19 amino acids, respectively. (b and c) Base-level view. The upstream AUG start codon (b) has the weaker Kozak context (blue box) and is only conserved among primates (red box), whereas the downstream AUG (c) is conserved among more mammals (46-way alignment and conservation track).