Sohyun Hwang, Eiru Kim, Insuk Lee, Edward M Marcotte.
Abstract
The success of clinical genomics using next generation sequencing (NGS) requires the accurate and consistent identification of personal genome variants. Many variant calling methods have been developed, but their calls show low concordance with one another. A systematic comparison of variant callers could therefore give important guidance to NGS-based clinical genomics. Recently, a set of high-confidence variant calls for one individual (NA12878) was published by the Genome in a Bottle (GIAB) consortium, enabling performance benchmarking of different variant calling pipelines. Using the gold standard reference variant calls from GIAB, we compared the performance of thirteen variant calling pipelines, testing combinations of three read aligners (BWA-MEM, Bowtie2, and Novoalign) and four variant callers (Genome Analysis Tool Kit HaplotypeCaller (GATK-HC), Samtools mpileup, Freebayes, and Ion Proton Variant Caller (TVC)) on twelve data sets for the NA12878 genome. The data sets were sequenced on different platforms, including Illumina HiSeq2000, Illumina HiSeq2500, and Ion Proton, with various exome capture systems and exome coverage. We observed that the different variant callers are biased toward specific types of SNP genotyping errors. The results of our study provide useful guidelines for reliable variant identification from deep sequencing of personal genomes.
Year: 2015 PMID: 26639839 PMCID: PMC4671096 DOI: 10.1038/srep17875
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. A flow diagram summarizing the performance comparison of thirteen variant calling pipelines.
Summary of data sets used in this study.
| Platform | Accession | Exome Capture | WGS/WES | Exome coverage |
|---|---|---|---|---|
| HiSeq2000 | SRR1611178 | SeqCap EZ Human Exome Lib v3.0 | WES | 79.93× |
| HiSeq2000 | SRR1611179 | SeqCap EZ Human Exome Lib v3.0 | WES | 79.84× |
| HiSeq2000 | SRR292250 | SeqCap EZ Exome SeqCap v2 | WES | 116.06× |
| HiSeq2000 | SRR515199 | SureSelect v4 | WES | 298.45× |
| HiSeq2000 | SRR098401 | SureSelect v2 | WES | 116.84× |
| HiSeq2500 | SRR1611183 | SeqCap EZ Human Exome Lib v3.0 | WES | 129.94× |
| HiSeq2500 | SRR1611184 | SeqCap EZ Human Exome Lib v3.0 | WES | 111.90× |
| HiSeq2000 | ERR194147 | UCSC Known gene | WGS | 45.68× |
| HiSeq2000 | SRX485062 | UCSC Known gene | WGS | 56.60× |
| HiSeq2500 | SRX515284 | UCSC Known gene | WGS | 56.87× |
| HiSeq2500 | SRX516752 | UCSC Known gene | WGS | 43.61× |
| IonProton | NA12878_combine | UCSC Known gene | WGS | 9.87× |
Figure 2. Summary of variant calling performances by thirteen pipelines.
Performance of the variant calling pipelines, measured by APR for (A) SNPs and (B) indels, is shown as box plots across the multiple Illumina data sets. For Ion Proton, the APR of the single data set is indicated for each caller. Ion Proton data were pre-aligned by Tmap.
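The caption does not define APR. Assuming it denotes the area under a precision-recall curve, the following minimal sketch shows one way such a score could be computed from quality-ranked calls checked against a gold standard. The function name `precision_recall_area`, its input layout, and the trapezoidal integration are illustrative assumptions, not the method used in the study.

```python
def precision_recall_area(calls, n_gold):
    """Approximate the area under a precision-recall curve.

    calls  : list of (quality, is_true_positive) pairs, one per called variant
             (hypothetical layout; is_true_positive means the call matches the
             gold standard set).
    n_gold : number of variants in the gold standard set.
    """
    if n_gold <= 0:
        raise ValueError("n_gold must be positive")

    # Walk through calls from most to least confident, lowering the
    # quality threshold one call at a time. Ties in quality are not
    # grouped in this sketch.
    ranked = sorted(calls, key=lambda c: c[0], reverse=True)

    tp = fp = 0
    area = 0.0
    prev_recall, prev_precision = 0.0, 1.0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / n_gold
        precision = tp / (tp + fp)
        # Trapezoid between the previous and current curve points.
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Toy example: three calls, two of which match a gold set of two variants.
score = precision_recall_area([(50, True), (40, False), (30, True)], n_gold=2)
```

A pipeline that ranks every true variant above every false call and recovers the whole gold set scores 1.0 under this definition.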
Figure 3. Venn diagrams summarizing called variants by different callers.
The mean percentage (with standard deviation) of confident variant calls, those with a quality score at or above the threshold of 20, is shown for (A) the Illumina data sets and (B) the Ion Proton data set.
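As a rough illustration of how the regions of such a Venn diagram could be tallied, the sketch below keeps only calls with a quality score of at least 20 and reports each region as a percentage of the union of calls. The variant key layout, the caller labels, and the function names `confident` and `venn_percentages` are hypothetical and chosen only for this example.

```python
def confident(calls, min_qual=20):
    """Keep variant keys whose quality score is at or above min_qual.

    calls: dict mapping a variant key, e.g. (chrom, pos, ref, alt),
           to its quality score (hypothetical layout).
    """
    return {key for key, qual in calls.items() if qual >= min_qual}


def venn_percentages(call_sets):
    """Percentage of the union of calls that falls in each Venn region.

    call_sets: dict mapping caller name -> set of variant keys.
    Returns a dict mapping frozenset(caller names) -> percentage.
    """
    union = set().union(*call_sets.values()) if call_sets else set()
    if not union:
        return {}
    counts = {}
    for variant in union:
        # A region is identified by exactly which callers report the variant.
        region = frozenset(n for n, s in call_sets.items() if variant in s)
        counts[region] = counts.get(region, 0) + 1
    return {region: 100.0 * n / len(union) for region, n in counts.items()}


# Toy data for three callers; the keys and scores are made up.
v1, v2, v3, v4 = ("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A"), ("2", 75, "T", "C")
raw = {
    "caller_a": {v1: 50, v2: 30, v3: 10},
    "caller_b": {v1: 25, v2: 5},
    "caller_c": {v1: 60, v4: 40},
}
regions = venn_percentages({name: confident(c) for name, c in raw.items()})
```

Averaging these percentages over several data sets, with their standard deviation, would give figures of the kind the caption describes.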
Examples of SNP calling errors.
| Error type | In a variant call set | In gold standard set |
|---|---|---|
| Ignoring the reference allele (IR) | AA | RA |
| Adding the reference allele (AR) | RA | AA |
| Other SNP FP calls | AA | AB |
| | AB | AA |
| | RA | RB |
R, reference allele; A and B, alternative SNP alleles.
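The error categories in the table above can be expressed as a small classifier over unphased genotypes. This is a minimal sketch: the function name `classify_snp_error` and the representation of a genotype as a pair of allele strings are assumptions made for illustration, and any mismatch that is neither IR nor AR is lumped into "other".

```python
def classify_snp_error(called, gold, ref):
    """Classify a called SNP genotype against the gold standard genotype.

    called, gold : pairs of alleles, e.g. ("C", "T"); order is ignored.
    ref          : the reference allele at this site.
    Returns "concordant", "IR", "AR", or "other".
    """
    called = tuple(sorted(called))
    gold = tuple(sorted(gold))
    if called == gold:
        return "concordant"

    def is_hom_alt(gt):
        # Homozygous for a non-reference allele (AA in the table).
        return gt[0] == gt[1] and gt[0] != ref

    def is_ref_het(gt):
        # Heterozygous with one reference allele (RA in the table).
        return gt[0] != gt[1] and ref in gt

    def alt_of(gt):
        # The non-reference allele of an RA genotype.
        return next(a for a in gt if a != ref)

    # IR: caller reports AA where the gold standard has RA (same A).
    if is_hom_alt(called) and is_ref_het(gold) and alt_of(gold) == called[0]:
        return "IR"
    # AR: caller reports RA where the gold standard has AA (same A).
    if is_ref_het(called) and is_hom_alt(gold) and alt_of(called) == gold[0]:
        return "AR"
    # Everything else, including the AA/AB, AB/AA, and RA/RB rows.
    return "other"
```

For example, with reference allele `"C"`, a call of `("T", "T")` against a gold genotype of `("C", "T")` falls in the IR category, and the reverse situation falls in AR.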
Figure 4. Probability of two types of SNP genotyping errors (IR and AR) for each caller as a function of different sequencing platforms.
Probabilities of IR or AR for twelve Illumina data sets are summarized as box-and-whisker plots.