Kristopher A Standish, Tristan M Carland, Glenn K Lockwood, Wayne Pfeiffer, Mahidhar Tatineni, C Chris Huang, Sarah Lamberth, Yauheniya Cherkas, Carrie Brodmerkel, Ed Jaeger, Lance Smith, Gunaretnam Rajagopal, Mark E Curran, Nicholas J Schork.
Abstract
MOTIVATION: Next-generation sequencing (NGS) technologies have become much more efficient, allowing whole human genomes to be sequenced faster and cheaper than ever before. However, processing the raw sequence reads associated with NGS technologies requires care and sophistication in order to draw compelling inferences about phenotypic consequences of variation in human genomes. It has been shown that different approaches to variant calling from NGS data can lead to different conclusions. Ensuring appropriate accuracy and quality in variant calling can come at a computational cost.
Year: 2015 PMID: 26395405 PMCID: PMC4580299 DOI: 10.1186/s12859-015-0736-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Technical summary of processing pipeline. (a) Storage requirement (GB) per sample for output file of each processing step. (b) Computational cost of mapping raw reads versus file size with 8 (green) or 16 (blue) computing threads. (c) Computational cost (SUs) per sample for each processing step. (d) Computational cost of PrintReads step versus file size with 8 (green) or 16 (blue) computing threads.
Fig. 2 Variant calling assessment. (a) Total computational cost (SUs) of calling variants on chromosome 21 in varying group sizes. Adjusted R² provided for linear (purple) and quadratic (green) fits. (b) Computational cost per sample (relative to individual variant calling approach) of calling variants on chromosome 21 in varying group sizes. Adjusted R² provided, assuming linear (purple) and quadratic (green) fits for total computational cost. (c, d) Sensitivity (c) and specificity (d) of variant calls on NA12878 versus estimated proportion of European admixture within a group (normalized by chromosome). (e) Admixture estimates of groups used for variant calling.
Fig. 3 Comparison of conventional and HC variant calls. (a) Average number of variants called per genome by HC (green), conventional (blue), or both (orange). (b) Distribution of percent concordance between conventional and HC calls for 437 samples. (c) Number of variant calls made by conventional pipeline versus number of variant calls made by HC for each patient, colored by self-reported race. (d) Number of pipeline-specific variant calls per 100 Kb in HLA region.
Fig. 4 Summary of pipeline-specific variant calls. Proportion of novel variant sites and Ti/Tv ratios for pipeline-specific calls. (a) Variants across the genome. (b) Variants in HLA region. (c) Non-HLA variants.
Fig. 5 Impact of variability between pipelines. (a) Number of intergenic and intronic pipeline-specific variants. (b) Pipeline-specific variants in exonic, UTR, and non-coding RNA elements. (c) Functional impact of pipeline-specific protein-coding variants (blue = Conventional; green = HC; light = Novel; dark = Known). (d) Principal components 1 and 2 calculated from genotypes (MAF > 1%) using variant calls from conventional (left) or HC (right) pipelines. Individuals coded by self-reported Race (red = American Indian/Alaska Native; orange = Asian; yellow = Multiple; green = White; blue = Black/African American; purple = Other) and Ethnicity (circle = Hispanic/Latino; triangle = Not Hispanic/Latino).
Recommended job packing approach for best practices pipeline. Assumes a node with 16 processors and 64 GB of memory.
| Step | Tool | Memory per command (GB) | Cores per command | Commands per node |
|---|---|---|---|---|
| Map | BWA | 32 | 8 | 2 |
| Bam | Samtools | 4 | 1 | 16 |
| Merge | Samtools | 4 | 1 | 16 |
| Sort | Samtools | 4 | 1 | 16 |
| MarkDuplicates | PicardTools | 7 | 2 | 8 |
| TargetCreator | GATK | 7 | 2 | 8 |
| IndelRealigner* | GATK | 12 | 3 | 5 |
| BaseRecalibrator | GATK | 30 | 8 | 2 |
| PrintReads* | GATK | 30 | 8 | 2 |
| HaplotypeCaller | GATK | 60 | 16 | 1 |
*Smaller memory allocation and more samples per node may prove more computationally efficient
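
The "Commands per node" column is consistent with packing as many commands onto a node as the scarcer resource allows, that is, the smaller of (node cores ÷ cores per command) and (node memory ÷ memory per command), each rounded down. The Python sketch below reproduces the column under that assumption; it is an illustration rather than the authors' scheduling code, and the names `Step`, `STEPS`, and `commands_per_node` are hypothetical.

```python
from dataclasses import dataclass

# Node size stated in the table caption.
NODE_CORES = 16
NODE_MEMORY_GB = 64


@dataclass(frozen=True)
class Step:
    name: str
    memory_gb: int  # memory per command (GB), from the table
    cores: int      # cores per command, from the table


# Per-command requirements copied from the table above.
STEPS = [
    Step("Map", 32, 8),
    Step("Bam", 4, 1),
    Step("Merge", 4, 1),
    Step("Sort", 4, 1),
    Step("MarkDuplicates", 7, 2),
    Step("TargetCreator", 7, 2),
    Step("IndelRealigner", 12, 3),
    Step("BaseRecalibrator", 30, 8),
    Step("PrintReads", 30, 8),
    Step("HaplotypeCaller", 60, 16),
]


def commands_per_node(step, node_cores=NODE_CORES, node_memory_gb=NODE_MEMORY_GB):
    """Concurrent commands that fit on one node (assumed packing rule)."""
    by_cores = node_cores // step.cores             # limit imposed by cores
    by_memory = node_memory_gb // step.memory_gb    # limit imposed by memory
    return min(by_cores, by_memory)                 # the scarcer resource wins


if __name__ == "__main__":
    for step in STEPS:
        print(f"{step.name:<18}{commands_per_node(step):>3}")
```

Passing different `node_cores` and `node_memory_gb` values gives a first estimate of packing on other node types. It does not capture the footnote's point that smaller memory allocations for IndelRealigner and PrintReads may pack more efficiently.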