Björn Hammesfahr, Florian Odronitz, Marcel Hellkamp, Martin Kollmar.
Abstract
BACKGROUND: With current next-generation sequencing methods, sequencing even the largest mammalian genomes has become a matter of days. It comes as no surprise that dozens of genome assemblies are now released per month. Since the number of next-generation sequencing machines is increasing worldwide and new major sequencing plans are being announced, a further increase in the rate at which genome assemblies are released is expected. Thus it becomes increasingly important to get an overview of, as well as detailed information about, the available sequenced genomes. The different sequencing and assembly methods have specific characteristics that need to be known in order to evaluate the various genome assemblies before performing subsequent analyses.
Year: 2011 PMID: 21906294 PMCID: PMC3180467 DOI: 10.1186/1756-0500-4-338
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. Schematic organisation of the database. The diagram shows the major tables of the database and their connections. Some of the content fields of the three main tables Species, Project, and Genome File are listed. Details on publications are obtained from NCBI via their API. The References table contains the major sequencing centres and species project web pages.
Figure 2. Contig distribution for three sample genome assemblies. A) Example of a low-coverage mammalian genome. B) Example of a high-coverage insect genome. C) Example of a chromosome assembly. All chromosomes are plotted as separate entries.
Figure 3. Screenshots of diArk's "Genome Files" search module and several result views. A) The new "Genome Files" search module of diArk allows a detailed search for species that were sequenced with a specific sequencing method, for certain assembly methods, for specific genome types, for the completeness of the assembly, for illegal characters (not a/A, t/T, g/G, c/C, n/N), and for genomes provided by diArk. Furthermore, the data can be filtered by the GC-content, by the sequence coverage, and by the release date of the genome assemblies. B) The "Genome Files" result view provides an overview of the different genome assemblies generated by the sequencing centres. Clicking on the symbols provides further details and the possibility to download the genome file. C) The "References" result view provides an overview of some of the data analysis options that the species project pages offer, like BLAST pages or access to genome browsers. D) The "Genome Stats" result view gives a species-based overview of several genome statistics, like the chromosome numbers and the GC-contents, with the species ordered according to their taxonomy so that closely related organisms can be compared.
Figure 4. Eukaryotes sequenced worldwide. A) The pie chart shows the sequenced species for which genome assemblies have been released, sorted by taxa. B) The graph shows the increase in the total number of sequenced eukaryotes, genome data as well as EST data, as a function of the year. Note that the numbers in the figures are lower than the numbers given in the text because the dates on which genomes were made available are not known for every genome assembly. C) The graph shows the sequenced eukaryotes separated into complete and incomplete (low-coverage) genome assemblies. In addition, publications of genome assemblies are plotted. D) The diagram shows the number of publications of genome assemblies separated into four major publishing groups: the Nature Journals, the PLoS Journals, Science, and the Proceedings of the National Academy of Sciences (PNAS).
Figure 5. Species sequenced in relation to taxa. A), B) The pie charts show the number of sequenced species ordered by several major taxa. Graphs were drawn separately for species A) whose genome was sequenced and B) for which transcriptome data are available. C) Species are plotted according to the year in which the first genome assembly was released. The species are grouped into the same taxa as in A) and B).
Figure 6. Number of species sequenced by a certain sequencing method per year. The diagram shows the number of species sequenced with different sequencing methods. For species that were sequenced using several methods (e.g. the whole genome library was sequenced with 454 and the BAC library was sequenced with Sanger), every method is counted.
Figure 7. Genome assembly characteristics. A) The graph shows the GC-content and the genome size of completed genome assemblies (thus excluding low-coverage genomes). For better visualisation the genome size is plotted logarithmically. B) The diagram shows the box plot of the genome sizes of some major taxa for which many completed genome assemblies are available. C) Same as B) but the genome sizes are plotted logarithmically to better visualise the sizes of the smaller genomes.
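The captions above refer to two per-assembly properties, the GC-content and the presence of illegal characters (anything other than a/A, t/T, g/G, c/C, n/N). The following is a minimal sketch of how such values could be derived from a raw sequence string. It is not diArk's implementation: the function name `sequence_stats`, the returned field names, and the choice to compute the GC fraction over A/C/G/T only (ignoring N) are assumptions made for illustration.

```python
from collections import Counter

# Characters treated as legal, following the Figure 3 caption
# (a/A, t/T, g/G, c/C, n/N); everything else counts as illegal.
VALID = set("ACGTN")


def sequence_stats(seq: str) -> dict:
    """Return basic statistics for one sequence (hypothetical helper)."""
    counts = Counter(seq.upper())
    length = sum(counts.values())
    gc = counts["G"] + counts["C"]
    acgt = gc + counts["A"] + counts["T"]
    illegal = sum(n for base, n in counts.items() if base not in VALID)
    return {
        "length": length,
        # Assumption: GC fraction is taken over unambiguous bases only.
        "gc_content": gc / acgt if acgt else 0.0,
        "n_count": counts["N"],
        "illegal_count": illegal,
    }


stats = sequence_stats("ACGTNNacgtRY")
```

Counting once with `Counter` keeps the sketch to a single pass over the sequence; for whole genome files the same counts could be accumulated record by record.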
diArk's content in comparison to other databases
| | diArk | GOLD | NHGRI | NCBI Genome | ISC |
|---|---|---|---|---|---|
| # species (unique/total) | 806 | 1876/2153 | 187/248 | 986/1090 | 287/360 |
| # mRNA sequencing projects | 562 | 350 (EST) | 11 (RNA) | - | 6 (cDNA) |
| # genome sequencing projects | 1499 | 1705 | 160 | 1078 | - |
| # genomes marked as "sequenced" 1) | 613 | 358 (completed) | 88 (completed) | 431 | 105 |
| # genomes marked as "published" 2) | 358 | 156 | - | 285 | - |
| taxonomy | full taxonomy | two major taxa | one major taxon | two major taxa | one major taxon |
| sequencing method | ✓ | - | - | - | - |
| assembly method | ✓ | - | - | - | - |
| GC-content (# species) | 589/613 | 142/1876 | - | - | - |
| genome size (# species) | 589/613 | 510/1876 | - | ✓ | - |
| assembly details | ✓ | - | - | - | - |
| genome assembly files analysed | 2109 | - | - | - | - |
| species common names | ✓ | ✓ | ✓ | - | ✓ |
| links to species pages | ✓ | ✓ | - | - | - |
| detailed info about species pages | ✓ | - | - | - | - |
| sequencing centre reference | ✓ | ✓ | ✓ | ✓ | ✓ |
| funding agency | - | ✓ | ✓ | - | ✓ |
| target (survey sequencing, draft, etc.) | - | - | ✓ | ✓ | ✓ |
| project status | - | ✓ | ✓ | ✓ | ✓ |
| database search options | ✓ | ✓ | - | limited | limited |
| database content view options | 7 result tabs | 1 table | 1 table | 1 table | 1 table |
| accessibility/speed | fast | slow | fast | fast | fast |
1) In this analysis, all genomes for which assemblies were announced are regarded as "sequenced", regardless of the status that the different databases assign (draft, completed, published) and regardless of the genome coverage.
2) The numbers of published genomes have been retrieved as follows: diArk: 1) Using the Search page, select Projects_Search_Module, select "Sequencing type" Genome, and "Select all references" All Projects; 2) Add Search_Module, select Publications_Search_Module, and select "Select all publications" All Publications. GOLD: The number of published genomes is given, separated by kingdoms, in the "Complete Published" list. NCBI Genome: The number of published genomes has been derived by counting the links to PubMed.
NHGRI: http://www.genome.gov/10002154 (acquisition of data: 2011-03-10)
NCBI Genome Projects: http://www.ncbi.nlm.nih.gov/genomeprj, http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi (acquisition of data: 2011-03-10)
ISC: http://www.intlgenome.org/viewDatabase.cfm (acquisition of data: 2011-03-10)