Christine F Baes, Marlies A Dolezal, James E Koltes, Beat Bapst, Eric Fritz-Waters, Sandra Jansen, Christine Flury, Heidi Signer-Hasler, Christian Stricker, Rohan Fernando, Ruedi Fries, Juerg Moll, Dorian J Garrick, James M Reecy, Birgit Gredler.
Abstract
BACKGROUND: Advances in human genomics have allowed unprecedented productivity in terms of algorithms, software, and literature available for translating raw next-generation sequence data into high-quality information. The challenges of variant identification in organisms with lower quality reference genomes are less well documented. We explored the consequences of commonly recommended preparatory steps and the effects of single and multi sample variant identification methods using four publicly available software applications (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper) on whole genome sequence data of 65 key ancestors of Swiss dairy cattle populations. Accuracy of calling next-generation sequence variants was assessed by comparison to the same loci from medium and high-density single nucleotide variant (SNV) arrays.
Year: 2014 PMID: 25361890 PMCID: PMC4289218 DOI: 10.1186/1471-2164-15-948
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Comparison of variant genotypes between the array-based and sequence-based methods (a), and the measures of concordance, sensitivity and discrepancy between the two assays derived from it (b)
| a) NGS-based genotype | Array: Homozygous reference | Array: Heterozygous | Array: Homozygous alternative |
|---|---|---|---|
| Homozygous reference | a | b | c |
| Heterozygous | d | e | f |
| Homozygous alternative | g | h | i |
| Genotype not identified | k | l | m |

| b) Measures of concordance |
|---|
| SNP concordance |
| Genotype concordance |
| Non-reference sensitivity |
| Non-reference discrepancy |
Array-based information from the Illumina BovineHD BeadChip® (BovineSNP50 v1 DNA Analysis BeadChip® not shown) was considered the “gold standard” and compared to next-generation sequencing (NGS)-based variants obtained using an Illumina HiSeq2000 platform with various variant identification software, where genotypes are identified as:
a = homozygous reference in both NGS-based data and array-based data.
b = homozygous reference in NGS-based data, but heterozygous in array-based data.
c = homozygous reference in NGS-based data, but homozygous alternative in array-based data.
d = heterozygous in NGS-based data, but homozygous reference in array-based data.
e = heterozygous in both NGS-based data and array-based data.
f = heterozygous in NGS-based data, but homozygous alternative in array-based data.
g = homozygous alternative in NGS-based data, but homozygous reference in array-based data.
h = homozygous alternative in NGS-based data, but heterozygous in array-based data.
i = homozygous alternative in both NGS-based data and array-based data.
k = not found in NGS-based data, but homozygous reference in array-based data.
l = not found in NGS-based data, but heterozygous in array-based data.
m = not found in NGS-based data, but homozygous alternative in array-based data.
(Table adapted from DePristo et al. [7] and Jansen et al. [30].)
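The cell letters a to m can be turned into the measures listed in part (b) of the table. The sketch below assumes the definitions commonly used for genotype concordance, non-reference sensitivity and non-reference discrepancy in comparisons of this kind; the formulas are not stated in this record, so they are an assumption, and SNP concordance is left out because its definition is not given here. The function name and the example counts are hypothetical.

```python
import math
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def concordance_metrics(n: Dict[str, int]) -> Dict[str, float]:
    """Compute concordance measures from cell counts keyed by the letters a-i, k, l, m.

    Rows of the table are NGS genotypes, columns are array genotypes.
    The formulas are ASSUMED (commonly used definitions), not taken from this record.
    """
    a, b, c, d, e, f, g, h, i = (n[x] for x in "abcdefghi")
    l, m = n["l"], n["m"]  # k (array homozygous reference, no NGS call) is not used below

    # Share of sites called by both assays where the genotypes agree.
    genotype_concordance = _ratio(a + e + i, a + b + c + d + e + f + g + h + i)
    # Share of array non-reference sites that NGS also called as non-reference.
    non_reference_sensitivity = _ratio(e + f + h + i, b + c + e + f + h + i + l + m)
    # Share of disagreeing genotypes, ignoring sites where both assays say homozygous reference.
    non_reference_discrepancy = _ratio(b + c + d + f + g + h, b + c + d + e + f + g + h + i)

    return {
        "genotype_concordance": genotype_concordance,
        "non_reference_sensitivity": non_reference_sensitivity,
        "non_reference_discrepancy": non_reference_discrepancy,
    }


if __name__ == "__main__":
    # Hypothetical counts for one animal, for illustration only.
    counts = {"a": 900, "b": 5, "c": 1, "d": 4, "e": 60, "f": 2,
              "g": 0, "h": 3, "i": 20, "k": 10, "l": 4, "m": 1}
    for name, value in concordance_metrics(counts).items():
        print(f"{name}: {value:.4f}")
```

Excluding cell a from the discrepancy measure keeps the large number of homozygous reference agreements from masking errors at variant sites.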
Figure 1. Distributions of single nucleotide variant counts (a), insertion and deletion counts (b), and multi-allelic site counts (c) identified per animal (n = 65; BTA1-29, BTAX) using single sample variant detection with Platypus, Samtools, and the UnifiedGenotyper following three pre-calling approaches. For Platypus results, multi-nucleotide variants were split into allelic primitives for fair comparison between software.
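Splitting a multi-nucleotide variant into allelic primitives means reporting each differing base as its own single nucleotide variant, so that counts are comparable with callers that never emit multi-base records. The sketch below illustrates the idea for the simple case where the reference and alternative alleles have equal length; it is not the procedure the authors ran, and the function name `split_mnv` is hypothetical. Dedicated tools also decompose complex variants of unequal length, which this sketch passes through unchanged.

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (position, reference allele, alternative allele)


def split_mnv(pos: int, ref: str, alt: str) -> List[Variant]:
    """Split an equal-length multi-nucleotide variant into single-base substitutions.

    Records whose alleles differ in length (insertions, deletions, complex events)
    are returned unchanged, because decomposing them requires an alignment step.
    """
    if len(ref) != len(alt):
        return [(pos, ref, alt)]
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # matching bases inside the MNV are not variants
    ]


if __name__ == "__main__":
    # One MNV record covering three bases, of which two differ.
    print(split_mnv(100, "ACG", "GCT"))
```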
Total number of single nucleotide variants (SNVs), insertions and deletions (InDels), and Transition/Transversion Ratios found using single and multi sample calling methods with HaplotypeCaller (HC), Platypus (PL), Platypus results after multi-nucleotide variants were split into allelic primitives (PL_PRIM), Samtools (SAM), and the UnifiedGenotyper (UG) for 65 animals
| Calling method | SNVs (single sample) | SNVs (multi sample) | InDels (single sample) | InDels (multi sample) | Transition/Transversion ratio (single sample) | Transition/Transversion ratio (multi sample) |
|---|---|---|---|---|---|---|
| HC | - | 19,901,885 | - | 2,685,032 | - | 2.138 |
| PL | 17,709,672 | 16,894,054 | 2,973,025 | 2,890,066 | 2.178 | 2.165 |
| PL_PRIM | 20,869,015 | 19,759,134 | 2,864,147 | 2,890,412 | 2.105 | 2.058 |
| SAM | 20,647,891 | 18,767,273 | 2,682,094 | 1,997,791 | 2.176 | 2.240 |
| UG | 21,984,283 | 22,048,382 | 2,485,677 | 2,741,468 | 2.024 | 1.974 |
The combined results of all 65 single samples represent single sample calling results.
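The transition/transversion ratio in the table is the number of transitions (A↔G and C↔T substitutions) divided by the number of transversions (all other single-base substitutions). A minimal sketch of that calculation follows; the function names and the example variants are illustrative and not taken from the paper.

```python
from typing import Iterable, Tuple

BASES = {"A", "C", "G", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes."""
    pair = {ref.upper(), alt.upper()}
    return pair == {"A", "G"} or pair == {"C", "T"}


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Compute the transition/transversion ratio from (ref, alt) pairs.

    Records that are not single-base A/C/G/T substitutions are skipped.
    """
    transitions = transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in BASES or alt not in BASES:
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions found; ratio is undefined")
    return transitions / transversions


if __name__ == "__main__":
    # Three transitions and one transversion, for illustration only.
    example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print(ts_tv_ratio(example))
```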
Figure 2. Average transition/transversion ratios over all animals using single sample variant identification (a) and transition/transversion ratios for variant identification with single and multi sample detection methods, as well as combined over all multi sample detection methods (b). Average transition/transversion ratios for variant identification with single sample detection methods using Platypus, Platypus Primitives, Samtools, UnifiedGenotyper and HaplotypeCaller (n = 65 samples, BTA1-29) are shown in (a). Transition/transversion ratios for variant identification with single and multi sample detection methods using Platypus, Platypus Primitives, Samtools, UnifiedGenotyper and HaplotypeCaller (n = 65 samples, BTA1-29) and a consensus data set (variants called by Platypus Primitives + Samtools + UnifiedGenotyper + HaplotypeCaller) are shown in (b).
Figure 3. Consensus single nucleotide variants (a) and insertions and deletions (b) identified from whole genome sequencing data using four multi sample variant detection methods (Platypus Primitives, Samtools, UnifiedGenotyper and HaplotypeCaller).
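A consensus set of the kind described in the captions above can be thought of as the intersection of the variants reported by every caller. The sketch below assumes that two calls are the same variant when chromosome, position, reference allele and alternative allele all match; that matching rule, the function name and the example records are assumptions for illustration, not the authors' stated procedure.

```python
from typing import Dict, Iterable, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def consensus_variants(call_sets: Dict[str, Iterable[VariantKey]]) -> Set[VariantKey]:
    """Return the variants present in every caller's call set."""
    if not call_sets:
        return set()
    sets = [set(calls) for calls in call_sets.values()]
    return set.intersection(*sets)


if __name__ == "__main__":
    # Hypothetical call sets keyed by the abbreviations used in the table above.
    calls = {
        "PL_PRIM": [("1", 100, "A", "G"), ("1", 250, "C", "T"), ("2", 40, "G", "A")],
        "SAM": [("1", 100, "A", "G"), ("2", 40, "G", "A")],
        "UG": [("1", 100, "A", "G"), ("1", 250, "C", "T"), ("2", 40, "G", "A")],
        "HC": [("1", 100, "A", "G"), ("2", 40, "G", "A"), ("3", 7, "T", "C")],
    }
    print(sorted(consensus_variants(calls)))
```

Exact matching on all four fields is strict: the same insertion or deletion written with a different left or right alignment would not be counted as shared unless the records are normalized first.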
Figure 4. Average per-sample wall clock computation time required for the common preparatory steps InDel realignment and base quality score recalibration (n = 65 samples, chromosomal region 5 Mb in length).
Figure 5. Wall clock computation time required for variant identification using Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper on a chromosomal region 5 Mb in length with single (SS) or multi (MS) sample variant identification methods and varying numbers of samples (10, 20, 30, 40, 50, 60).
Figure 6. Average wall clock computation time required for multi sample variant identification with varying numbers of samples (10, 20, 30, 40, 50, 60) and different lengths of chromosomal regions (5 Mb and 10 Mb) using different software (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper).
Figure 7. Non-reference sensitivity (a) and non-reference discrepancy (b) for single nucleotide variants identified using Platypus Primitives, Samtools, UnifiedGenotyper and HaplotypeCaller (single vs. multi sample variant identification) using variants identified with the Illumina BovineHD BeadChip® as a gold standard. InDel realignment and base quality score recalibration were conducted for both single and multi sample calling results.
Figure 8. Single nucleotide variant concordance (a) and single nucleotide variant concordance by array genotype (b) between variants identified using Platypus Primitives, Samtools, UnifiedGenotyper and HaplotypeCaller (single vs. multi sample variant identification) and variants identified with the Illumina BovineHD BeadChip® as a gold standard. InDel realignment and base quality score recalibration were conducted for both single and multi sample calling results.
Figure 9. Genotype concordance between genotypes identified using Platypus Primitives, Samtools, UnifiedGenotyper and HaplotypeCaller (single vs. multi sample variant identification) and genotypes identified with the Illumina BovineHD BeadChip® as a gold standard. InDel realignment and base quality score recalibration were conducted for both single and multi sample calling results.
Figure 10. Genomic relationship between the 65 sequenced animals, estimated using array genotypes (autosomal SNPs with known position) filtered separately for Cluster 1 (Brown Swiss, Braunvieh, Original Braunvieh; lower left corner of heat map) and Cluster 2 (Simmental, Swiss Fleckvieh, Holstein, Red Holstein; upper right corner of heat map). After filtering, the merged data set consisted of 38,317 common SNPs. The off-diagonals reflect the estimated pairwise identities by descent.