Kyu-Baek Hwang, In-Hee Lee, Honglan Li, Dhong-Geon Won, Carles Hernandez-Ferrer, Jose Alberto Negron, Sek Won Kong.
Abstract
Comprehensive and accurate detection of variants from whole-genome sequencing (WGS) is a prerequisite for translational genomic medicine; however, low concordance between analytic pipelines remains an outstanding challenge. We processed one European and one African WGS sample with 70 analytic pipelines, formed by combining 7 short-read aligners with 10 variant calling algorithms (VCAs), and observed marked differences in the number of variants called by different pipelines (max/min ratio: 1.3 to 3.4). The similarity between variant call sets was determined more by the VCA than by the short-read aligner. Reported minor allele frequency had a substantial effect on concordance between pipelines (concordance rate ratio: 0.11 to 0.92; Wald tests, P < 0.001), with more discordant results for rare and novel variants. We compared the performance of analytic pipelines and pipeline ensembles using gold-standard variant call sets and the catalog of variants from the 1000 Genomes Project. Notably, a single pipeline using BWA-MEM and GATK-HaplotypeCaller performed comparably to the pipeline ensembles for 'callable' regions (~97%) of the human reference genome. While a single pipeline is capable of analyzing common variants in most genomic regions, our findings demonstrate the limitations and challenges in analyzing rare or novel variants, especially for non-European genomes.
Year: 2019 PMID: 30824715 PMCID: PMC6397176 DOI: 10.1038/s41598-019-39108-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Heatmaps visualizing dissimilarity between analytic pipelines. Jaccard distances were calculated between each pair of call sets, drawn from the analytic pipelines and the reference variant sets from the 1000 Genomes Project (1KGP) and the Garvan Institute (X-TENs D and J), and scaled into [0, 1], separately for (a) SNPs of NA12878, (b) indels of NA12878, (c) SNPs of NA19240, and (d) indels of NA19240.
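The Jaccard distance named in this caption can be illustrated with a minimal sketch. It assumes each call set is represented as a set of `(chrom, pos, ref, alt)` tuples; that representation, the function name `jaccard_distance`, and the toy variants are illustrative assumptions, not the authors' implementation. The additional scaling step mentioned in the caption is not specified in this record and is not attempted here.

```python
def jaccard_distance(calls_a, calls_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two variant call sets."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        # Two empty call sets are treated as identical (a convention chosen here).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical call sets; variants are keyed as (chrom, pos, ref, alt).
pipeline_x = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 500, "G", "A")}
pipeline_y = {("chr1", 1000, "A", "G"), ("chr2", 500, "G", "A"), ("chr3", 42, "T", "C")}

# Two shared variants out of four distinct ones.
print(jaccard_distance(pipeline_x, pipeline_y))  # 0.5
```

Exact tuple matching ignores differences in how indels are represented, so a real comparison would likely need variant normalization first.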
Figure 2. Effect size of factors related to call concordance between analytic pipelines. Negative binomial regression was performed using six factors (minor allele frequency (MAF) and predicted functional impact of variants, together with repetitive DNA elements, GC content, depth of coverage, and mapping quality (MAPQ) at variant loci) to predict call concordance between analytic pipelines, i.e., the number of pipelines that called a variant. Statistically significant associations (Wald tests, P < 0.001) are denoted by ‘*’.
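The response variable described in this caption, the number of pipelines that called each variant, can be tabulated as in the following sketch. The pipeline names, the `(chrom, pos, ref, alt)` variant keys, and the helper `call_concordance` are hypothetical; the regression itself is not reproduced here.

```python
from collections import Counter


def call_concordance(call_sets):
    """Map each variant to the number of pipelines that called it.

    call_sets: dict of pipeline name -> iterable of variant keys.
    """
    counts = Counter()
    for calls in call_sets.values():
        # set() ensures a pipeline is counted at most once per variant.
        counts.update(set(calls))
    return counts


# Hypothetical call sets from three pipelines.
call_sets = {
    "aligner1+vca1": [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")],
    "aligner1+vca2": [("chr1", 1000, "A", "G")],
    "aligner2+vca1": [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 7, "G", "C")],
}

concordance = call_concordance(call_sets)
for variant, n_pipelines in sorted(concordance.items()):
    print(variant, n_pipelines)
```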
Figure 3. Performance comparison of analytic pipelines and their ensembles. Performance was evaluated using variant call sets from the 1000 Genomes Project for (a) SNPs of NA12878, (b) indels of NA12878, (c) SNPs of NA19240, and (d) indels of NA19240. Analytical positive predictive value (PPV) and analytical sensitivity of each pipeline without variant filtering are presented. For the two ensemble methods, performance curves across cutoff values for variant filtering are shown. The inset plots are magnified views that show the high-performing pipelines more clearly.
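The two metrics in this caption can be sketched under the standard definitions: PPV is the fraction of called variants that appear in the reference set, and sensitivity is the fraction of reference variants that were called. The function name, the exact-match comparison on `(chrom, pos, ref, alt)` keys, and the toy data are assumptions for illustration and may differ from the evaluation procedure used in the paper.

```python
def ppv_and_sensitivity(called, reference):
    """Return (PPV, sensitivity) of a call set against a reference set."""
    called, reference = set(called), set(reference)
    true_positives = len(called & reference)
    # PPV = TP / (TP + FP): share of calls confirmed by the reference.
    ppv = true_positives / len(called) if called else float("nan")
    # Sensitivity = TP / (TP + FN): share of reference variants recovered.
    sensitivity = true_positives / len(reference) if reference else float("nan")
    return ppv, sensitivity


# Hypothetical data: 4 calls, 3 of which are among 5 reference variants.
reference = {("chr1", p, "A", "G") for p in (100, 200, 300, 400, 500)}
called = {("chr1", 100, "A", "G"), ("chr1", 200, "A", "G"),
          ("chr1", 300, "A", "G"), ("chr1", 999, "A", "G")}

ppv, sensitivity = ppv_and_sensitivity(called, reference)
print(f"PPV={ppv:.2f}, sensitivity={sensitivity:.2f}")  # PPV=0.75, sensitivity=0.60
```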