Ming Yi, Yongmei Zhao, Li Jia, Mei He, Electron Kebebew, Robert M Stephens.
Abstract
To apply exome-seq-derived variants in the clinical setting, there is an urgent need to identify the best variant caller(s) from a large collection of available options. We have used an Illumina exome-seq dataset as a benchmark, with two validation scenarios (family pedigree information and SNP array data for the same samples) permitting global high-throughput cross-validation, to evaluate the quality of SNP calls derived from several popular variant discovery tools from both the open-source and commercial communities using a set of designated quality metrics. To the best of our knowledge, this is the first large-scale performance comparison of exome-seq variant discovery tools using high-throughput validation with both Mendelian inheritance checking and SNP array data, which allows us to gain insights into the accuracy of SNP calling through such high-throughput validation in an unprecedented way, whereas the previously reported comparison studies have only assessed concordance of these tools without directly assessing the quality of the derived SNPs. More importantly, the main purpose of our study was to establish a reusable procedure that applies high-throughput validation to compare the quality of SNP discovery tools with a focus on exome-seq, which can be used to compare any forthcoming tool(s) of interest. Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Year: 2014 PMID: 24831545 PMCID: PMC4081058 DOI: 10.1093/nar/gku392
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Summary of parameter settings used for SNP calling and filtering for the selected SNP tools
All tools used default settings or settings suggested through direct communication with their authors, technical support or forum discussion. Although 0.99 is the default setting for VQSR, two thresholds for the VQSR step of GATK (0.99 and 0.90) were used to assess the robustness of the tools; these runs are designated GATK0.99 and GATK0.90. samtools_group designates SAMtools calls made on pooled samples simultaneously, whereas samtools_individual designates SAMtools calls made on individual samples one at a time. UG: Unified Genotyper from GATK.
Summary of Mendelian inheritance error rates on SNPs derived from the new and old versions of GATK and different mappers
Summary of Mendelian inheritance error rates amongst the chosen family trio members (samples #9, #10 and #2; Supplementary Figure S1) for the SNPs detected with the new (V2.0 up to V2.2.4) or old (up to V1.6.7) versions of GATK and different mappers including Eland and BWA.
*Name designation for variations of GATK versions and mappers, in the following format:
GATK_version_NumSample(option)_MapperContigOptionsForBWA(Option)_Mapper_VQSRThreshold. Version: V1 or V2 of GATK. NumSample: optional; 5S for 5 samples with trio relations, 17S for 17 samples in the larger family. MapperContigOptionsForBWA: optional; NC for no contigs in the genome reference, C for contigs in the genome reference.
Mapper: BWA or Eland. VQSRThreshold: 099 for 0.99 or 090 for 0.90. Similar results were obtained for the other two trio sets (with sample #3 or #4 as child) available in the family (data not shown).
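The Mendelian inheritance error rate summarized in this table can be illustrated with a small sketch. It assumes biallelic SNPs with genotypes coded as alternate-allele counts (0, 1 or 2) and takes the rate as errors divided by all fully genotyped sites. Both choices, and all function names, are illustrative assumptions rather than the study's actual pipeline.

```python
from itertools import product

# Alt-allele count -> the two alleles carried (R = reference, A = alternate).
_ALLELES = {0: ("R", "R"), 1: ("R", "A"), 2: ("A", "A")}


def is_mendelian_consistent(mother, father, child):
    """True if the child's genotype can be formed from one allele per parent."""
    possible = {
        tuple(sorted(pair))
        for pair in product(_ALLELES[mother], _ALLELES[father])
    }
    return tuple(sorted(_ALLELES[child])) in possible


def mendelian_error_rate(trio_genotypes):
    """Fraction of fully genotyped sites that violate Mendelian inheritance.

    trio_genotypes: iterable of (mother, father, child) alt-allele counts;
    sites with a missing call (None) are skipped (an assumption).
    """
    checked = [g for g in trio_genotypes if None not in g]
    if not checked:
        return 0.0
    errors = sum(not is_mendelian_consistent(*g) for g in checked)
    return errors / len(checked)
```

For example, a site where both parents are homozygous reference and the child is heterozygous counts as an error, whereas a homozygous-variant child of two heterozygous parents does not.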
Summary of Ti/Tv ratios of SNP call sets derived from new and old versions of GATK and different mappers
Summary of Ti/Tv ratios of SNP call sets derived from the new (V2.0 up to V2.2.4) or old (up to V1.6.7) versions of GATK and different mappers including Eland and BWA.
*Name designation for variations of GATK versions and mappers, in the following format:
GATK_version_NumSample(option)_MapperContigOptionsForBWA(Option)_Mapper_VQSRThreshold. Version: V1 or V2 of GATK. NumSample: optional; 5S for 5 samples with trio relations, 17S for 17 samples in the larger family. MapperContigOptionsForBWA: optional; NC for no contigs in the genome reference, C for contigs in the genome reference. Mapper: BWA or Eland. VQSRThreshold: 099 for 0.99 or 090 for 0.90.
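The naming scheme above can be decoded mechanically. The sketch below assumes the fields are joined by underscores exactly as in the format string; the example names used with it (such as GATK_V2_5S_NC_BWA_099) are hypothetical and are not taken from the tables.

```python
def parse_designation(name):
    """Split a call-set name into its fields.

    Assumed layout: GATK_<version>[_<NumSample>][_<NC|C>]_<Mapper>_<threshold>
    """
    tokens = name.split("_")
    if len(tokens) < 4 or tokens[0] != "GATK":
        raise ValueError(f"unrecognized designation: {name!r}")

    fields = {
        "version": tokens[1],                       # V1 or V2
        "num_samples": None,                        # optional: 5 or 17
        "contigs": None,                            # optional: True for C, False for NC
        "mapper": tokens[-2],                       # BWA or Eland
        "vqsr_threshold": int(tokens[-1]) / 100,    # "099" -> 0.99
    }
    # Whatever sits between the version and the mapper is optional.
    for token in tokens[2:-2]:
        if token in ("5S", "17S"):
            fields["num_samples"] = int(token[:-1])
        elif token in ("NC", "C"):
            fields["contigs"] = token == "C"
        else:
            raise ValueError(f"unexpected field {token!r} in {name!r}")
    return fields
```

Under these assumptions, a name without the optional fields, such as the hypothetical GATK_V1_Eland_090, leaves num_samples and contigs as None.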
Summary of Mendelian inheritance error rates on SNPs detected from selected tools amongst the chosen family trio members
This table assesses the SNPs detected with the selected SNP detection tools for samples #9, #10 and #2 from a family trio (Supplementary Figure S1). GATK0.90 and GATK0.99: GATK with VQSR thresholds of 0.90 and 0.99, respectively. All other SNP tools called SNPs at their default settings or at settings suggested by their authors, as described in Table 1. Although the result reported here considers only SNPs on the array that fall within the target interval regions of the genome defined by the exome enrichment kit (exome-subset), a similar result was obtained when all SNPs on the array were used (data not shown). samtools_group: SAMtools calls with all available samples together. samtools_individual: SAMtools calls with each sample assessed individually. We included both because there was some debate within the SAMtools user community over whether multi-sample SNP calling enhances the power for calling SNPs shared between samples but reduces the power for singleton SNPs (communication from the SAMtools developer in a samtools-help forum discussion). Similar results were obtained for the other two trio sets (with sample #3 or #4 as child) available in the family (data for all of these family trio sets are shown in the top panels of the tables in Supplementary Table S4).
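Restricting array SNPs to the exome-subset amounts to an interval lookup. A minimal sketch follows, assuming target regions are given as (chromosome, start, end) with 1-based inclusive coordinates; the coordinate convention and the function names are assumptions, not details from the study.

```python
from bisect import bisect_right


def build_target_index(intervals):
    """Group target intervals by chromosome, sort them and merge overlaps."""
    index = {}
    for chrom, start, end in intervals:
        index.setdefault(chrom, []).append((start, end))
    for chrom, ivs in index.items():
        ivs.sort()
        merged = [ivs[0]]
        for start, end in ivs[1:]:
            if start <= merged[-1][1]:
                # Overlaps the previous interval: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def in_targets(index, chrom, pos):
    """True if pos lies inside any target interval on chrom."""
    ivs = index.get(chrom, [])
    # Last interval whose start is <= pos.
    i = bisect_right(ivs, (pos, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= pos <= ivs[i][1]


def exome_subset(snps, index):
    """Keep only (chrom, pos) SNPs that fall inside the target regions."""
    return [(c, p) for c, p in snps if in_targets(index, c, p)]
```

Merging overlapping intervals first keeps the binary search correct, because only the last interval starting at or before the position then needs to be checked.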
Figure 1. Distribution of SNP positions across the family trio for selected SNP callers. SNPs of the family trio composed of samples #9 (mother), #10 (father) and #2 (son) (Supplementary Figure S1), generated from GATK0.90 and GATK0.99 (VQSR at the 0.90 and 0.99 threshold levels), samtools (calling SNPs from samples either as a group or as individuals), VarScan, Partek, CLCBio, Illumina CASAVA and the SNP array, were subjected to Venn diagram analysis of their positions. Numbers in overlapping areas indicate SNVs shared between trio members, and numbers in unique areas indicate SNVs unique to that member. Numbers in black are the numbers of SNV positions passing Mendelian inheritance error checking (MIEC), whereas numbers in gray are the numbers of SNV positions failing MIEC. Similar results were obtained for the other two trio sets (with sample #3 or #4 as child) available in the family (data not shown).
Figure 2. Distribution of positions of the SNPs common to all members of the family trio, as detected by GATK0.99, samtools_individual and Illumina CASAVA. SNPs common to all members of the family trio (#9, #10 and #2), generated from GATK0.99 (VQSR at the 0.99 threshold level), samtools (calling SNPs from samples as individuals) and Illumina CASAVA (Figure 1), were subjected to further Venn diagram analysis of their positions in detail. Numbers in overlapping areas indicate variants shared between the tools, and numbers in unique areas indicate variants unique to each tool. (a) Numbers in black are the numbers of SNV positions passing MIEC, whereas numbers in gray are the numbers of variants designed for detection on the SNP array. (b) Percentage of SNVs passing MIEC that are also SNPs designed for detection on the SNP array (gray number divided by black number in each corresponding section of (a)). (c) Numbers in black are the numbers of NGS-detected SNVs passing MIEC that are also designed for detection on the array, whereas numbers in gray are the numbers of SNPs at those same positions that also passed MIEC within the array data. (d) Percentage of NGS-detected SNVs passing MIEC that were also array-detected SNPs passing MIEC (gray number divided by black number in each corresponding section of (c)). (e) Numbers in black indicate the numbers of SNVs passing MIEC, whereas numbers in gray indicate the numbers of SNVs failing MIEC. (f) Error rate of MIEC based on (e) (gray number divided by black number in each corresponding section of (e)).
Summary of Ti/Tv ratios of SNP call sets derived from the selected SNP detection tools
This table assesses the SNPs detected with the selected SNP detection tools for one family member, #2 (similar results were obtained for #9 and #10). GATK0.90 and GATK0.99: GATK with VQSR thresholds of 0.90 and 0.99, respectively. All other SNP tools called SNPs at their default settings or at settings suggested by their authors, as described in Table 1. GATK_NoFilter: raw SNP call set (using the Unified Genotyper of GATK). samtools_NoFilter: raw SNP call set from samtools_group (using SAMtools without filtering). Other SAMtools results used the -d 10 option to filter the raw SNP call set (similar results were obtained using the -D option for filtering). samtools_group: SAMtools calls using all samples together. samtools_individual: SAMtools calls using each sample individually. (Note: the six novel SNPs of the SNP array were probably caused by differences in SNP annotation between the SNP array and dbSNP.)
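The Ti/Tv ratio reported in this table is the number of transitions (A/G or C/T changes) divided by the number of transversions (all other single-base changes). A minimal sketch, assuming biallelic SNPs supplied as (ref, alt) base pairs; the function names are illustrative and not part of any tool evaluated here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def titv_ratio(snps):
    """Return transitions / transversions, or None if there are no transversions.

    snps: iterable of (ref, alt) single-base strings; anything that is not a
    substitution between two of A, C, G, T is ignored (an assumption).
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in BASES or alt not in BASES:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None
```

With three transitions and two transversions, for instance, the function returns 1.5.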
Figure 3. Head-to-head comparison of GATK and CASAVA on the subset of SNPs passing MIEC under one of the three best MIEC scenarios. The three best MIEC scenarios are as follows. (i) Both parents are homozygous variant and the child must be homozygous variant. (ii) Both parents are homozygous reference and the child must be homozygous reference. (iii) One parent is homozygous variant, the other is homozygous reference and the child must be heterozygous. SNVs derived from GATK raw calls (no VQSR filtering), GATK0.99 or CASAVA that meet MIEC scenario (i) were subjected to Venn diagram analysis. (a) Numbers in black indicate the numbers of SNVs derived from NGS data that passed the above MIEC scenario. Numbers in gray indicate the numbers of those SNVs that were also designated for detection on the SNP array. (b) Percentage of the gray numbers over the black numbers in each area of (a). (c) Numbers in black indicate the numbers of SNVs derived from NGS data that passed the above MIEC scenario and were also designated for detection on the SNP array. Numbers in gray are the numbers of SNPs at those same positions that also passed MIEC within the array data. (d) Percentage of the gray numbers over the black numbers in each area of (c). Similar observations were made for the other two scenarios (data not shown).
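The three scenarios listed in this caption are the trio configurations in which the parental genotypes allow exactly one child genotype. A minimal sketch of assigning a site to one of them, assuming genotypes coded as alternate-allele counts (0 = homozygous reference, 1 = heterozygous, 2 = homozygous variant); the coding and the function name are illustrative assumptions.

```python
def best_miec_scenario(mother, father, child):
    """Return "i", "ii" or "iii" if the trio matches a scenario, else None."""
    if mother == 2 and father == 2 and child == 2:
        return "i"      # both parents homozygous variant
    if mother == 0 and father == 0 and child == 0:
        return "ii"     # both parents homozygous reference
    if {mother, father} == {0, 2} and child == 1:
        return "iii"    # opposite homozygous parents, heterozygous child
    return None


def sites_in_scenario(trio_genotypes, scenario):
    """Select the sites (position -> (mother, father, child)) in one scenario."""
    return {
        pos: gts
        for pos, gts in trio_genotypes.items()
        if best_miec_scenario(*gts) == scenario
    }
```

Selecting scenario "i" from such a mapping gives the kind of subset that the caption describes as input to the Venn diagram analysis.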