Brian D O'Connor, Barry Merriman, Stanley F Nelson.
Abstract
BACKGROUND: Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput, and the associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude to a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, together with a variety of powerful tools designed to process petascale datasets, provides a compelling solution to these ever-increasing demands.
Year: 2010 PMID: 21210981 PMCID: PMC3040528 DOI: 10.1186/1471-2105-11-S12-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. SeqWare Query Engine schema. The HBase database is a generic key-value, column-oriented database that pairs well with the inherently sparse-matrix nature of variant annotations. (a) The primary table stores multiple genomes' worth of generic features, variants, coverages, and variant consequences, using genomic location within a particular reference genome as the key. Each genome is represented by a particular column family label (such as “variant:genome7”). For locations with more than one called variant, the HBase timestamp is used to distinguish them. (b) Secondary indexing is accomplished using one secondary table per indexed genome. The key is the tag being indexed plus the ID of the object of interest; the value is the row key in the primary table. This makes lookup by secondary indexes (“tags”, for example) possible without iterating over the entire contents of the primary table.
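The two-table layout in Figure 1 can be illustrated with a small in-memory sketch. This is not the SeqWare implementation and does not use the HBase API; plain dictionaries stand in for the tables. The class name `VariantStore`, the row-key format (`hg18.<contig>.<zero-padded position>`), the `tag.id` index-key format, and the integer counter used in place of an HBase timestamp are all assumptions made for illustration.

```python
from collections import defaultdict


class VariantStore:
    """Hypothetical in-memory stand-in for the layout in Figure 1."""

    def __init__(self):
        # Primary table: row key -> column label -> timestamp -> cell value.
        self.primary = defaultdict(lambda: defaultdict(dict))
        # One secondary index table per genome: index key -> primary row key.
        self.secondary = defaultdict(dict)
        # Monotonic counter standing in for the HBase cell timestamp (assumption).
        self._clock = 0

    @staticmethod
    def row_key(contig, position):
        # Assumed key format; zero-padding keeps keys for one contig
        # in genomic order when sorted as strings.
        return f"hg18.{contig}.{position:015d}"

    def put_variant(self, genome, contig, position, variant_id, call, tags=()):
        self._clock += 1
        key = self.row_key(contig, position)
        column = f"variant:{genome}"  # label style taken from the caption
        # Several variants at one location differ only by timestamp.
        self.primary[key][column][self._clock] = {"id": variant_id, "call": call}
        for tag in tags:
            # Secondary key = tag + object ID; value = primary row key.
            self.secondary[genome][f"{tag}.{variant_id}"] = key
        return key

    def variants_at(self, genome, contig, position):
        row = self.primary.get(self.row_key(contig, position), {})
        cells = row.get(f"variant:{genome}", {})
        return [cells[ts] for ts in sorted(cells)]

    def find_by_tag(self, genome, tag):
        # Reads only the secondary table for this genome; the primary
        # table is never iterated. A real store would use a prefix scan here.
        prefix = f"{tag}."
        index = self.secondary.get(genome, {})
        return sorted({row for k, row in index.items() if k.startswith(prefix)})


store = VariantStore()
store.put_variant("genome7", "chr1", 12345, "v1", "A>G", tags=["nonsynonymous"])
store.put_variant("genome7", "chr1", 12345, "v2", "A>T", tags=["nonsynonymous"])
print(store.variants_at("genome7", "chr1", 12345))
print(store.find_by_tag("genome7", "nonsynonymous"))
```

In this sketch, a tag lookup touches only the per-genome index and returns primary row keys, which mirrors the caption's point that tagged variants can be found without scanning every genomic location.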
Datasets
| Dataset | Technology | SNVs & Indels | SV | Translocations | Reference |
|---|---|---|---|---|---|
| European-Venter | Sanger | Y | Y | N | Levy |
| European-Watson | 454 | Y | Y | N | Wheeler |
| European-Quake | Helicos | Y | Y | N | Pushkarev |
| Asian | Illumina | Y | Y | N | Wang |
| Yoruban 18507 | Illumina | Y | Y | N | Bentley |
| Yoruban 18507 | SOLiD | Y | Y | N | McKernan |
| Korean | Illumina | Y | Y | N | Ahn |
| Korean-AKI | Illumina | Y | Y | N | Kim |
| 3 human genomes | Complete Genomics | Y | Y | N | Drmanac |
| AML T/N | Illumina | Y | Y | N | Ley |
| AML genome | Illumina | Y | Y | N | Mardis |
| Melanoma | Illumina | Y | Y | N | Pleasance |
| Lung cancer | SOLiD | Y | Y | N | Pleasance |
| U87MG | SOLiD | Y | Y | Y | Clark |
Fourteen whole-genome datasets were loaded into the database, including the U87MG genome, with the March 2006 assembly of the human genome used as reference (NCBI36/hg18). The variant types loaded (SNVs, small/large indels, SVs, etc.) and the publication reference are noted for each dataset. This table was adapted from Snyder et al. 2010.
Figure 2. Load and query performance. Comparisons of load and query times between the HBase and BerkeleyDB backends. (a) Load times for the “1102 GBM” tumor/normal genomes were compared between HBase and BerkeleyDB. Both used a single-threaded approach to better compare relative performance. The two perform similarly, but over time the load times for BerkeleyDB increase faster than those for HBase. (b) Comparison of querying the 1102 genome database between BerkeleyDB, single-threaded HBase, and HBase using MapReduce. Beyond 3M variants, BerkeleyDB query times increase dramatically, while both HBase query types scale linearly, with MapReduce consistently exhibiting the best performance.