Xiaowei Zhan1, Dajiang J Liu2,3.
Abstract
Next-generation sequencing has enabled the study of a comprehensive catalogue of genetic variants for their impact on various complex diseases. Numerous consortium studies of complex traits have publicly released their summary association statistics, which have become an invaluable resource for learning the underlying biology, understanding the genetic architecture, and guiding clinical translation. There is great interest in the field in developing novel statistical methods for analyzing and interpreting results from these genotype-phenotype association studies. One popular platform for method development and data analysis is R. To enable these analyses in R, it is necessary to develop packages that can efficiently query files of summary association statistics, explore the linkage disequilibrium structure between variants, and integrate various bioinformatics databases. The complexity and scale of sequence datasets and databases pose significant computational challenges for method developers. To address these challenges and facilitate method development, we developed the R package SEQMINER for annotating and querying files of sequence variants (e.g., VCF/BCF files) and summary association statistics (e.g., METAL/RAREMETAL files), and for integrating bioinformatics databases. SEQMINER provides an infrastructure through which novel methods can be distributed and applied to the analysis of sequence datasets in practice. We illustrate the performance of SEQMINER using datasets from the 1000 Genomes Project. We show that SEQMINER is highly efficient and easy to use. It will greatly accelerate the process of applying statistical innovations to analyze and interpret sequence-based associations. The R package, its source code, and documentation are available from http://cran.r-project.org/web/packages/seqminer and http://seqminer.genomic.codes/.
Keywords: genome annotation; R; information retrieval; next-generation sequencing; statistical genetics
Year: 2015 PMID: 26394715 PMCID: PMC4794281 DOI: 10.1002/gepi.21918
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135
Comparison of querying files of summary association statistics
| Function | Task | Time | Memory |
|---|---|---|---|
| | Retrieve summary association statistics from 100 randomly chosen regions | 0.32 sec | 550 KB |
| | Retrieve summary association statistics from 100 randomly chosen regions. Also retrieve covariance matrix between these summary association statistics | 1.12 sec | 1.3 MB |
| | Read entire file of summary association statistics into memory | 7.8 min | 103 MB |
| | Read entire file of summary association statistics and their covariance matrix into memory | 34.6 hr | 263 GB |
We compared the performance of SEQMINER for querying files of summary association statistics and files of correlation coefficients between summary association statistics.
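One way to see why a region query can be far cheaper than reading a whole file is that, when records are sorted by position, only the records inside the requested region need to be touched. The minimal Python sketch below illustrates that idea with a binary search. It is not SEQMINER's implementation or API: the function `query_range` and the toy records are hypothetical and chosen purely for illustration.

```python
from bisect import bisect_left, bisect_right


def query_range(positions, records, start, end):
    """Return the records whose position lies in [start, end].

    Assumes `positions` is sorted and parallel to `records`, so the
    matching slice can be located by binary search instead of a full scan.
    """
    lo = bisect_left(positions, start)
    hi = bisect_right(positions, end)
    return records[lo:hi]


# Hypothetical toy summary statistics: (position, score statistic).
records = [(100, 0.5), (250, -1.2), (400, 2.1), (900, 0.3), (1500, -0.7)]
positions = [pos for pos, _ in records]

hits = query_range(positions, records, 200, 1000)
print(hits)
```

On real data the same principle applies at the file level: an index maps a genomic region to the part of the file that holds it, so the cost scales with the size of the region rather than the size of the file.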
Comparison of time and memory complexity for annotating sequence variants
| Tool | Chunk size | Time (seconds) | Memory (kilobytes) |
|---|---|---|---|
| SEQMINER | Entire chromosome | 8,371 | 63,072 |
| VariantAnnotation | 5,000 | 144,403 | 8,364,784 |
| VariantAnnotation | 10,000 | 125,748 | 16,236,324 |
| VariantAnnotation | 20,000 | 116,078 | 30,078,896 |
We benchmarked the performance of SEQMINER and VariantAnnotation for annotating sequence variants in chromosome 1 from the 1000 Genomes Project phase 1 datasets. Annotation by SEQMINER was performed with the annotateVCF function. VariantAnnotation cannot analyze chromosome 1 in one batch due to memory constraints, so we evaluated it by dividing the chromosome 1 dataset into chunks and annotating each chunk separately. Memory consumption was measured as peak memory usage, and time as the cumulative time for annotating the entire chromosome.
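The chunked strategy described above can be sketched in a few lines of Python. This is an illustration of the general pattern only, not the benchmark code: `iter_chunks`, `annotate_in_chunks`, and the placeholder annotator are hypothetical names, and the size of the largest chunk is used as a crude stand-in for peak memory.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield successive lists of at most `chunk_size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def annotate_in_chunks(records, chunk_size, annotate):
    """Annotate records one chunk at a time.

    Returns (annotated, n_chunks, largest_chunk). Results are collected in a
    list for brevity; a real pipeline would write each chunk to disk so that
    only one chunk is held in memory at a time.
    """
    annotated = []
    n_chunks = 0
    largest_chunk = 0
    for chunk in iter_chunks(records, chunk_size):
        n_chunks += 1
        largest_chunk = max(largest_chunk, len(chunk))
        annotated.extend(annotate(record) for record in chunk)
    return annotated, n_chunks, largest_chunk


# Hypothetical toy input: 11 variant positions and a placeholder annotator.
variants = range(1, 12)
annotated, n_chunks, largest_chunk = annotate_in_chunks(
    variants, chunk_size=5, annotate=lambda pos: (pos, "annotated")
)
print(n_chunks, largest_chunk, len(annotated))
```

The chunk size is the knob that trades memory for time. In the table, moving from chunks of 5,000 to 20,000 lowered VariantAnnotation's cumulative time from 144,403 to 116,078 seconds while raising its peak memory from 8,364,784 to 30,078,896 kilobytes.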
Comparison of time and memory complexity for querying selected genes/ranges
| Tool | Task | Time (seconds) | Memory (kilobytes) |
|---|---|---|---|
| SEQMINER | Extract 100 randomly selected ranges | 76 | 37,948 |
| VariantAnnotation | Extract 100 randomly selected ranges | 1,313 | 1,122,204 |
| SEQMINER | Extract 100 randomly selected genes | 462 | 59,736 |
| VariantAnnotation | Extract 100 randomly selected genes | 1,718 | 1,461,404 |
We compared the performance of SEQMINER and VariantAnnotation in extracting nonsynonymous variants from 100 randomly selected genes or ranges. Whole-genome datasets from the 1000 Genomes Project phase 1 were used. To extract randomly selected genes, we used the readVCFToListByGene function in SEQMINER. For VariantAnnotation, we first determined the genomic ranges for each gene and extracted the variants within these ranges. We then predicted the function of the retrieved variants and selected the subset that was nonsynonymous.
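The multi-step workflow described for VariantAnnotation (map genes to ranges, extract variants in those ranges, then keep the nonsynonymous ones) can be sketched as follows. This is a toy Python illustration, not code from either package. The gene names, coordinates, and variants are invented, and the functional prediction step is stubbed out by a precomputed `consequence` field, which is an assumption made to keep the example self-contained.

```python
def variants_in_genes(variants, gene_ranges, gene_names):
    """Return the variants that fall inside the ranges of the requested genes."""
    selected = []
    for name in gene_names:
        chrom, start, end = gene_ranges[name]
        selected.extend(
            v for v in variants
            if v["chrom"] == chrom and start <= v["pos"] <= end
        )
    return selected


def keep_nonsynonymous(variants):
    """Keep variants labelled nonsynonymous (prediction itself is stubbed)."""
    return [v for v in variants if v["consequence"] == "nonsynonymous"]


# Hypothetical gene ranges: name -> (chromosome, start, end).
gene_ranges = {"GENE_A": ("1", 100, 200), "GENE_B": ("2", 500, 800)}

# Hypothetical variants with a precomputed functional consequence.
variants = [
    {"chrom": "1", "pos": 150, "consequence": "nonsynonymous"},
    {"chrom": "1", "pos": 180, "consequence": "synonymous"},
    {"chrom": "1", "pos": 950, "consequence": "nonsynonymous"},
    {"chrom": "2", "pos": 600, "consequence": "nonsynonymous"},
]

in_genes = variants_in_genes(variants, gene_ranges, ["GENE_A", "GENE_B"])
result = keep_nonsynonymous(in_genes)
print([(v["chrom"], v["pos"]) for v in result])
```

Written this way, the three steps are separate passes over the data, whereas the text describes SEQMINER handling the gene-based extraction through a single call to readVCFToListByGene.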