Paul A Kitts, Deanna M Church, Françoise Thibaud-Nissen, Jinna Choi, Vichet Hem, Victor Sapojnikov, Robert G Smith, Tatiana Tatusova, Charlie Xiang, Andrey Zherikov, Michael DiCuccio, Terence D Murphy, Kim D Pruitt, Avi Kimchi.
Abstract
The NCBI Assembly database (www.ncbi.nlm.nih.gov/assembly/) provides stable accessioning and data tracking for genome assembly data. The model underlying the database can accommodate a range of assembly structures, including sets of unordered contig or scaffold sequences, bacterial genomes consisting of a single complete chromosome, or complex structures such as a human genome with modeled allelic variation. The database provides an assembly accession and version to unambiguously identify the set of sequences that make up a particular version of an assembly, and tracks changes to updated genome assemblies. The Assembly database reports metadata such as assembly names, simple statistical reports of the assembly (number of contigs and scaffolds, contiguity metrics such as contig N50, total sequence length and total gap length) as well as the assembly update history. The Assembly database also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Consortium (INSDC) and the assembly represented in the NCBI RefSeq project. Users can find assemblies of interest by querying the Assembly Resource directly or by browsing available assemblies for a particular organism. Links in the Assembly Resource allow users to easily download sequence and annotations for current versions of genome assemblies from the NCBI genomes FTP site.
Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Year: 2015 PMID: 26578580 PMCID: PMC4702866 DOI: 10.1093/nar/gkv1226
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
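The abstract lists contig N50 among the statistics that the Assembly database reports. As a reminder of what that metric measures, the following is a minimal sketch under the usual definition: sort the sequence lengths from longest to shortest and return the length at which the running total first reaches half of the summed length. The function name `n50` and the example lengths are illustrative assumptions and are not taken from the database or its tooling.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length
    return ordered[-1]  # not reached for non-negative lengths


# Hypothetical contig lengths, chosen only to make the arithmetic easy to follow.
contig_lengths = [100, 80, 50, 30, 20]
contig_n50 = n50(contig_lengths)
```

With the hypothetical lengths above, the total is 280, and the two longest contigs (100 + 80 = 180) are the first to cover at least half of it, so the N50 is the length of the second contig.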
Figure 1. Same name and different sequence content: the Zv7 UCSC and NCBI zebrafish assemblies. Panel A: part of chr21 in the Zv7 zebrafish assembly as displayed in the UCSC genome browser (http://genome.ucsc.edu). Panel B: the same span of chr21 of the Zv7 assembly as displayed in the NCBI Sequence Viewer. The UCSC Zv7 assembly has many Ensembl gene predictions in this region of chr21, whereas the same region in the RefSeq version of Zv7 chr21 at NCBI shows the rb1 and dub genes on the right but no other gene models. The reason for this discrepancy is that NCBI found that one component in this region matched sequences from mouse chromosome X and replaced this foreign component with a gap when they made the RefSeq version of chr21. Zv7 has since been replaced by newer versions of the zebrafish assembly that do not have the mouse contamination.
The number of assemblies in the Assembly database at each assembly level
| Assembly level | Number of assemblies (a) |
|---|---|
| Contig | 31 757 |
| Scaffold | 13 028 |
| Chromosome | 1 183 |
| Complete Genome | 9 187 |

(a) Counts taken on 30 August 2015.
Figure 2. The NCBI genome assembly model. The diagram depicts the assembly organization for a eukaryote with two nuclear chromosomes and a mitochondrial genome. The full assembly comprises a primary assembly-unit containing nuclear sequences, a non-nuclear assembly-unit containing mitochondrial sequences, and an alternate locus group assembly-unit containing scaffolds that have been aligned to chromosome 2 of the primary assembly.
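The organization described in the Figure 2 caption can be pictured as a small nested data structure. The sketch below is only an illustration of that hierarchy (assembly, then assembly-units, then sequences); the class names, field names, sequence names and the choice of two alternate locus scaffolds are all assumptions made for the example and do not reflect the database's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sequence:
    name: str                        # hypothetical sequence name
    role: str                        # e.g. "chromosome", "organelle", "alt-scaffold"
    placed_on: Optional[str] = None  # chromosome an alternate scaffold is aligned to


@dataclass
class AssemblyUnit:
    name: str
    sequences: List[Sequence] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit] = field(default_factory=list)

    def sequence_count(self) -> int:
        """Total number of sequences across all assembly-units."""
        return sum(len(unit.sequences) for unit in self.units)


# The eukaryote from the caption: two nuclear chromosomes, one mitochondrial
# genome, and alternate locus scaffolds aligned to chromosome 2.
example = Assembly(
    name="ExampleAssembly",  # hypothetical name
    units=[
        AssemblyUnit("Primary Assembly", [
            Sequence("chr1", "chromosome"),
            Sequence("chr2", "chromosome"),
        ]),
        AssemblyUnit("non-nuclear", [
            Sequence("MT", "organelle"),
        ]),
        AssemblyUnit("ALT_LOCI_1", [
            Sequence("alt_scaffold_1", "alt-scaffold", placed_on="chr2"),
            Sequence("alt_scaffold_2", "alt-scaffold", placed_on="chr2"),
        ]),
    ],
)
```

Keeping alternate locus scaffolds in their own assembly-unit, rather than mixing them into the primary unit, mirrors the separation the caption describes: the primary assembly-unit stays a single non-redundant representation, and the alternates refer back to it through their placement.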
Updates that change the assembly version
- Addition of a new sequence to the assembly.
- Removal of a sequence from the assembly.
- Use of a different version of any of the components in the assembly, for example a WGS contig or clone sequence.
- Changes to the joining of component sequences into contigs and scaffolds, for example a change in component order, orientation or span of component used, or a change in length.
- Changes to the arrangement of scaffolds on a chromosome (a), for example a change in scaffold order or orientation, or a change in gap length.
- Assignment of an unplaced scaffold to a chromosome, or any change to the chromosome assignment of an unlocalized scaffold.
- A different sequence accession (b) used for a contig, scaffold or chromosome.
- Re-assignment of a contig, scaffold or chromosome to a different assembly-unit.
- Changes to the placement or alignment of an alternate locus scaffold or patch scaffold.

(a) Chromosome in this table also includes linkage groups, organelle genomes and plasmids or other replicons.
(b) Feature locations are reported using sequence accessions.
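Several of the rules above reduce to comparing which sequences two releases contain and where each one sits. The following sketch illustrates that comparison for a subset of the rules (added or removed sequences, a changed assembly-unit, a changed chromosome assignment). It is an assumption-laden illustration, not NCBI's procedure: the `Placement` type, the function name, and the accession-like identifiers are hypothetical, and changes to joins, gap lengths or alignments are not modeled.

```python
from typing import Dict, List, NamedTuple


class Placement(NamedTuple):
    assembly_unit: str
    chromosome: str  # empty string for an unplaced sequence


def version_changing_differences(
    old: Dict[str, Placement], new: Dict[str, Placement]
) -> List[str]:
    """List differences that, under the rules above, call for a new assembly version.

    Keys are sequence identifiers in accession.version form, so a new version
    of a sequence shows up as one removal plus one addition.
    """
    reasons: List[str] = []
    for acc in sorted(new.keys() - old.keys()):
        reasons.append(f"added {acc}")
    for acc in sorted(old.keys() - new.keys()):
        reasons.append(f"removed {acc}")
    for acc in sorted(old.keys() & new.keys()):
        if old[acc].assembly_unit != new[acc].assembly_unit:
            reasons.append(f"{acc} re-assigned to a different assembly-unit")
        if old[acc].chromosome != new[acc].chromosome:
            reasons.append(f"{acc} chromosome assignment changed")
    return reasons


# Hypothetical identifiers; they are not real sequence accessions.
old_release = {
    "SCAF0001.1": Placement("Primary Assembly", "1"),
    "SCAF0002.1": Placement("Primary Assembly", ""),
}
new_release = {
    "SCAF0001.1": Placement("Primary Assembly", "1"),
    "SCAF0002.1": Placement("Primary Assembly", "2"),
}
reasons = version_changing_differences(old_release, new_release)
needs_new_version = bool(reasons)
```

In this example the only difference is that a previously unplaced scaffold has been assigned to a chromosome, which is one of the listed updates, so `needs_new_version` is true.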
The number of species and assemblies in the Assembly database by taxonomic group
| Taxonomic group | DDBJ/ENA/GenBank species (a) | DDBJ/ENA/GenBank assemblies (a) | RefSeq species (a) | RefSeq assemblies (a) |
|---|---|---|---|---|
| Archaea | 451 | 617 | 269 | 414 |
| Bacteria | 9 736 | 47 555 | 7 366 | 34 514 |
| Fungi | 598 | 1 198 | 164 | 167 |
| Invertebrates | 337 | 402 | 81 | 81 |
| Plants | 173 | 248 | 62 | 62 |
| Protozoa | 187 | 294 | 69 | 70 |
| Mammals | 110 | 180 | 88 | 94 |
| Other Vertebrates | 137 | 151 | 83 | 83 |
| Viruses & viroids | na | na | 4 782 | 4 905 |
| All | 11 729 | 50 645 | 12 964 | 40 390 |

(a) Counts taken on 30 August 2015.
Figure 3. An example of the Assembly details page. The figure shows the upper portion of the cat (Felis catus) genome assembly GCF_000181335.2 page, including the metadata section and global statistics table. This figure does not show the lower portion of the page that contains tables displaying the assembly contents and detailed statistics.
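The accession in the Figure 3 caption, GCF_000181335.2, shows the accession-plus-version form that the abstract describes. The sketch below splits such an identifier into its parts. It assumes the common layout of a `GCA` (GenBank) or `GCF` (RefSeq) prefix, an underscore, nine digits, a dot and an integer version; that layout is not spelled out in the text above, and the function and type names are hypothetical.

```python
import re
from typing import NamedTuple


class AssemblyAccession(NamedTuple):
    prefix: str   # "GCA" or "GCF" (assumed meaning: GenBank or RefSeq)
    number: str   # nine-digit identifier, kept as text to preserve leading zeros
    version: int  # increases when the assembly is updated


# Assumed layout: prefix, underscore, nine digits, dot, version number.
_ACCESSION_PATTERN = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_assembly_accession(text: str) -> AssemblyAccession:
    """Split an assembly accession.version string into its parts."""
    match = _ACCESSION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized assembly accession: {text!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(prefix, number, int(version))


cat_assembly = parse_assembly_accession("GCF_000181335.2")
```

Keeping the version as a separate integer makes it straightforward to compare two identifiers and tell whether they refer to the same assembly at different versions.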