Guilhem Sempéré, Adrien Pétel, Mathieu Rouard, Julien Frouin, Yann Hueber, Fabien De Bellis, Pierre Larmande.
Abstract
BACKGROUND: The study of genetic variations is the basis of many research domains in biology. From genome structure to population dynamics, many applications involve the use of genetic variants. The advent of next-generation sequencing technologies led to such a flood of data that the daily work of scientists is often more focused on data management than data analysis. This mass of genotyping data poses several computational challenges in terms of storage, search, sharing, analysis, and visualization. While existing tools try to solve these challenges, few of them offer a comprehensive and scalable solution.
Keywords: BrAPI; GA4GH; HapMap; MongoDB; NoSQL; PLINK; REST; SNP; VCF; genomic variations; indel; interoperability; web
Year: 2019 PMID: 31077313 PMCID: PMC6511067 DOI: 10.1093/gigascience/giz051
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: High-level diagram of Gigwa architecture and features.
Figure 2: Benchmark results.
Figure 3: Discriminating variants. A, Filtering parameters and variant distribution. B, A discriminated variant's genotype with complementary information (males in blue, females in yellow). C, Chromosome region as reported by Feulner et al. [27] showing the strongest association with sex determination.
Benchmarking test description
| Test No. | Aims | Methods |
|---|---|---|
| 1 | Assess evolution of tool speed performance. Involved Gigwa v1, Gigwa v2, VCFtools v0.1.13 (originally benchmarked) [ | Run on configuration 1 using dataset 1 (along with sub-sampled versions, so as to obtain 6 different databases), all with the same number of individuals (i.e., 3,000) but with various numbers of markers. The query was a MAF range between 10% and 30% applied to the first 2,000 individuals |
| 2 | (i) Assess performance of the latest versions of the tools (Gigwa v2 and VCFtools v0.1.16) when simultaneously querying on variant-level (indexed in Gigwa) and genotype-level (unindexed in Gigwa) fields. (ii) Estimate the benefit of migrating to high-performance hardware by monitoring differences in response times between tools | Run on configuration 2 using dataset 1 without its derivatives, sub-sampling now being performed on the fly by restricting the search to a varying list of chromosomes. The query was the same MAF range query as above |
| 3 | (i) Test Gigwa v2's suitability for working on very large datasets. (ii) Compare trends with those observed in a small dataset (Test 2) | Run on configuration 2 using dataset 2, sub-sampling being performed on the fly by restricting the search to a varying list of chromosomes. The query was the same MAF range query as above |
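To make the benchmark query concrete, the following minimal sketch shows what a MAF range filter restricted to the first individuals of a dataset computes. It is an illustration only, not the way Gigwa or VCFtools implement the query. The function names, the 0/1/2 alternate-allele-count encoding, the handling of missing calls, and the inclusive bounds are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence

# Assumed encoding: each diploid genotype is the count of alternate alleles
# (0, 1, or 2); None marks a missing call.
Genotype = Optional[int]


def minor_allele_frequency(genotypes: Sequence[Genotype]) -> Optional[float]:
    """Return the MAF of one biallelic variant, or None if nothing was called."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    # Two alleles per called individual.
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def maf_range_filter(
    variants: Dict[str, Sequence[Genotype]],
    low: float = 0.10,
    high: float = 0.30,
    n_individuals: int = 2000,
) -> List[str]:
    """Keep IDs of variants whose MAF over the first n_individuals is in [low, high]."""
    selected = []
    for variant_id, genotypes in variants.items():
        maf = minor_allele_frequency(genotypes[:n_individuals])
        if maf is not None and low <= maf <= high:
            selected.append(variant_id)
    return selected


# Toy data: four variants genotyped on four individuals (hypothetical values).
toy = {
    "snp1": [0, 0, 1, 0],      # alternate allele frequency 1/8
    "snp2": [1, 1, 2, 2],      # alternate allele frequency 6/8, so MAF is 2/8
    "snp3": [0, 0, 0, 0],      # monomorphic, MAF is 0
    "snp4": [1, 1, 1, None],   # one missing call, MAF is 3/6
}
print(maf_range_filter(toy, n_individuals=4))
```

On real data the per-variant genotype rows would come from a VCF file or a database rather than an in-memory dictionary. The sketch only illustrates why such a query touches genotype-level data for every candidate variant.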