Shashikant Pujar1, Nuala A O'Leary1, Catherine M Farrell1, Jane E Loveland2, Jonathan M Mudge2, Craig Wallin1, Carlos G Girón2, Mark Diekhans3, If Barnes2, Ruth Bennett2, Andrew E Berry2, Eric Cox1, Claire Davidson2, Tamara Goldfarb1, Jose M Gonzalez2, Toby Hunt2, John Jackson1, Vinita Joardar1, Mike P Kay2, Vamsi K Kodali1, Fergal J Martin2, Monica McAndrews4, Kelly M McGarvey1, Michael Murphy1, Bhanu Rajput1, Sanjida H Rangwala1, Lillian D Riddick1, Ruth L Seal5, Marie-Marthe Suner2, David Webb1, Sophia Zhu4, Bronwen L Aken2, Elspeth A Bruford5, Carol J Bult4, Adam Frankish2, Terence Murphy1, Kim D Pruitt1.
Abstract
The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.
Year: 2018 PMID: 29126148 PMCID: PMC5753299 DOI: 10.1093/nar/gkx1031
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Number of CCDS IDs and genes represented in the human (A) and mouse (B) CCDS releases. The X-axis indicates the year in which a CCDS dataset was made public. Details about CCDS releases are available on the CCDS Releases and Statistics web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=SHOW_STATISTICS).
Figure 2. Fraction of all genes in a CCDS release that are represented by at least two current CCDS IDs.
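The quantity plotted in Figure 2 can be illustrated with a minimal sketch. It assumes a release is available as (gene, CCDS ID) pairs restricted to current IDs. The function name `fraction_multi_ccds` and the toy records are hypothetical and are not taken from the CCDS data files.

```python
from collections import defaultdict


def fraction_multi_ccds(records):
    """Return the fraction of genes that have at least two distinct CCDS IDs.

    `records` is an iterable of (gene, ccds_id) pairs; it is assumed to
    contain only IDs that are current in the release being summarized.
    """
    ids_per_gene = defaultdict(set)
    for gene, ccds_id in records:
        ids_per_gene[gene].add(ccds_id)
    if not ids_per_gene:
        return 0.0
    multi = sum(1 for ids in ids_per_gene.values() if len(ids) >= 2)
    return multi / len(ids_per_gene)


# Made-up toy release: GENE_A and GENE_C each have two CCDS IDs.
toy_release = [
    ("GENE_A", "CCDS1.1"),
    ("GENE_A", "CCDS2.1"),
    ("GENE_B", "CCDS3.1"),
    ("GENE_C", "CCDS4.2"),
    ("GENE_C", "CCDS5.1"),
    ("GENE_D", "CCDS6.1"),
]
print(fraction_multi_ccds(toy_release))  # 2 of 4 genes
```

Counting distinct IDs per gene with a set keeps the result stable if a pair is listed more than once.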
Figure 3. Changes in the human (A) and mouse (B) datasets with every new CCDS release. ‘New’ = new CCDS IDs added; ‘dropped’ = CCDS ID present in the previous release but withdrawn in the subsequent release; ‘updated’ = CCDS IDs that have an incremented accession version compared to the previous release, indicating a sequence update in the coding region.
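The three change categories defined in the Figure 3 caption can be sketched as a comparison of two releases. This is an illustration under stated assumptions, not the collaboration's pipeline: each release is assumed to be a list of versioned accessions of the form `CCDS<number>.<version>`, and the helper names and toy IDs are hypothetical.

```python
def split_accession(ccds_id):
    """Split a versioned accession such as 'CCDS3542.1' into ('CCDS3542', 1)."""
    base, _, version = ccds_id.partition(".")
    return base, int(version)


def classify_changes(previous, current):
    """Classify CCDS IDs as new, dropped or updated between two releases."""
    prev = dict(split_accession(x) for x in previous)
    curr = dict(split_accession(x) for x in current)
    return {
        # Present now, absent from the previous release.
        "new": sorted(b for b in curr if b not in prev),
        # Present in the previous release, withdrawn now.
        "dropped": sorted(b for b in prev if b not in curr),
        # Present in both, with an incremented accession version.
        "updated": sorted(b for b in curr if b in prev and curr[b] > prev[b]),
    }


# Made-up toy releases.
previous_release = ["CCDS1.1", "CCDS2.1", "CCDS3.1"]
current_release = ["CCDS1.1", "CCDS2.2", "CCDS4.1"]
changes = classify_changes(previous_release, current_release)
print(changes)
```

Comparing on the unversioned accession keeps an ID whose version changed in the ‘updated’ group rather than counting it as one dropped and one new ID.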
Figure 4. A view of the graphical display accessed from the report page of CCDS3542.1 (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=ALLFIELDS&DATA=CCDS3542&ORGANISM=0&BUILDS=CURRENTBUILDS) using the purple ‘S’ icon. (A) Transcripts and proteins from NCBI Annotation Release 108. (B) Transcripts and proteins from Ensembl Release 85. The green bar indicates the gene; transcripts are shown in purple and proteins are shown in red color. Positioning the cursor over any of these objects (gene, transcript or protein) opens a tool tip which includes additional information and links. Proteins in the NCBI annotation display that are in the CCDS set include a link to the CCDS ID in the tool tip. The gray box to the right (indicated by vertical arrow) is the tool tip corresponding to the protein accession NP_002514.1. Differences between any two objects can also be revealed as vertical lines (indicated by horizontal arrows) when the objects (NM_002523.2 and ENST00000265634 in the figure) are selected using the ‘Control’ or ‘Command’ button on the keyboard.
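The report-page URL quoted in the Figure 4 caption can be assembled from its query parameters, as in the sketch below. The parameter names and values are copied from that one URL. Their meaning is not documented here, and it is an assumption that substituting another accession into `DATA` returns the corresponding report. The function name `ccds_report_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi"


def ccds_report_url(ccds_id):
    """Build a report-page URL following the pattern in the Figure 4 caption."""
    params = {
        "REQUEST": "ALLFIELDS",
        "DATA": ccds_id,             # unversioned accession, e.g. "CCDS3542"
        "ORGANISM": "0",             # value copied from the caption URL
        "BUILDS": "CURRENTBUILDS",   # value copied from the caption URL
    }
    return BASE_URL + "?" + urlencode(params)


print(ccds_report_url("CCDS3542"))
```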
Table 1. Description of CCDS ‘Review Status’ categories
| Review status category | Public description of category | Detailed description |
|---|---|---|
| Provisional | ‘this record has not been manually reviewed by the collaboration’ | The CCDS ID does not have a ‘validated’ or ‘reviewed’ RefSeq or a VEGA accession associated with it; nor was it reviewed by the CCDS collaboration |
| Reviewed | ‘by RefSeq and HAVANA’ | The CCDS ID is associated with at least one ‘validated’ or ‘reviewed’ RefSeq AND at least one VEGA accession, which are manually reviewed by curators at NCBI and Ensembl-HAVANA groups, respectively. |
| Reviewed | ‘by CCDS collaboration’ | The CCDS ID was reviewed by curators in the CCDS collaboration. |
| Reviewed | ‘by RefSeq, HAVANA and CCDS collaboration’ | The CCDS ID meets both ‘by RefSeq and HAVANA’ and ‘by CCDS collaboration’ review requirements. |
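The category definitions in the table above can be read as a small decision rule, sketched below. The three boolean inputs and the function name `review_status` are hypothetical. Treating every record that meets no ‘Reviewed’ sub-category as ‘Provisional’ (for example, one with a reviewed RefSeq but no VEGA accession) is an assumption of this sketch, since the table does not spell out that case.

```python
def review_status(has_reviewed_refseq, has_vega, reviewed_by_collaboration):
    """Return (category, sub-category) following the table's definitions.

    has_reviewed_refseq: at least one 'validated' or 'reviewed' RefSeq
    has_vega: at least one associated VEGA accession
    reviewed_by_collaboration: reviewed by CCDS collaboration curators
    """
    by_refseq_and_havana = has_reviewed_refseq and has_vega
    if by_refseq_and_havana and reviewed_by_collaboration:
        return ("Reviewed", "by RefSeq, HAVANA and CCDS collaboration")
    if reviewed_by_collaboration:
        return ("Reviewed", "by CCDS collaboration")
    if by_refseq_and_havana:
        return ("Reviewed", "by RefSeq and HAVANA")
    # Assumption: anything else falls back to Provisional.
    return ("Provisional", None)


print(review_status(True, True, False))
print(review_status(True, False, False))
```

The combined sub-category is tested first because it is defined as meeting both of the other two requirements.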
Figure 5. Distribution of human and mouse CCDS IDs by their ‘Review status’ in the current human (Release 20) and mouse (Release 21) CCDS releases at the time of data freeze. Details of the review status categories and sub-categories are provided in Table 1. Reviewed 1 = CCDS IDs reviewed ‘by RefSeq and HAVANA’, Reviewed 2 = CCDS IDs reviewed ‘by CCDS collaboration’, Reviewed 3 = CCDS IDs reviewed ‘by RefSeq, HAVANA and CCDS collaboration’.
Data types used in CCDS manual curation decisions
| Data type | Curation decisions |
|---|---|
| RNA-seq | Determination of transcript or gene structure or extent, inferred exon combination, splice variant existence |
| CAGE tags | Determination of transcription start sites, 5′ UTR extension |
| H3K4me3 methylation | Determination of general 5′ completeness of transcripts or genes |
| CpG islands | Determination of general 5′ completeness of transcripts or genes (in conjunction with other data) |
| Long read transcriptome data | Splice variants; especially useful for genes with poor INSDC transcript support |
| Proteomics | Determination of gene biotype, novel exons, novel protein termini |
| Ribosome profiling | Determination of translation start codons or the coding status of genes with questionable biotypes |
| Conservation in other species | Determination of gene biotype, annotation of proteins with little or no data about gene function, determination of translation start codon |
| Conserved protein domains | Determination of gene biotype, annotation of proteins with little or no data about gene function |
| PhyloCSF | Determination of gene biotype, annotation of uncharacterized proteins |
| polyA-seq | Determination of 3′ completeness |