Mary M Happ, Haichuan Wang, George L Graef, David L Hyten.
Abstract
Obtaining genome-wide genotype information for millions of SNPs in soybean [Glycine max (L.) Merr.] often involves completely resequencing a line at 5X or greater coverage. Currently, hundreds of soybean lines have been resequenced at high depth, with their data deposited in the NCBI Short Read Archive. This publicly available dataset may be leveraged as an imputation reference panel in combination with skim (low coverage) sequencing of new soybean genotypes to economically obtain high-density SNP information. Ninety-nine soybean lines resequenced at an average of 17.1X were used to generate a reference panel, with over 10 million SNPs called using GATK's HaplotypeCaller tool. Whole genome resequencing at approximately 1X depth was performed on 114 previously ungenotyped experimental soybean lines. Coverages down to 0.1X were analyzed by randomly subsetting raw reads from the original 1X sequence data. SNPs discovered in the reference panel were genotyped in the experimental lines after aligning to the soybean reference genome, and missing markers were imputed using Beagle 4.1. Sequencing depth of the experimental lines could be reduced to 0.3X while still retaining an accuracy of 97.8%. Accuracy was inversely related to minor allele frequency, and highly correlated with marker linkage disequilibrium. The high accuracy of skim sequencing combined with imputation provides a low-cost method for obtaining dense genotypic information that can be used for various genomics applications in soybean.
Keywords: high density SNP data; imputation; low cost genotyping; skim sequencing; soybean
Year: 2019 PMID: 31072870 PMCID: PMC6643887 DOI: 10.1534/g3.119.400093
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
The number of markers and the genotyping rate in each low coverage subset from 0.1X to 1X sequencing depth. As coverage decreases, both the total number of markers captured and the completeness of the SNP panel decrease.
| Mean Depth (X) | Genotyping Rate | Number of SNPs | Reads | Base Pairs |
|---|---|---|---|---|
| 1 | 32.44% | 1,288,463 | 6,327,889 | 949,183,385 |
| 0.9 | 30.41% | 1,240,823 | 5,695,100 | 854,265,047 |
| 0.8 | 27.77% | 1,174,619 | 5,062,311 | 759,346,708 |
| 0.7 | 24.91% | 1,097,843 | 4,429,522 | 664,428,370 |
| 0.6 | 21.80% | 1,005,880 | 3,796,734 | 569,510,031 |
| 0.5 | 18.47% | 895,596 | 3,163,945 | 474,591,693 |
| 0.4 | 14.98% | 760,167 | 2,531,156 | 379,673,354 |
| 0.3 | 11.40% | 590,786 | 1,898,367 | 284,755,016 |
| 0.2 | 7.85% | 375,343 | 1,265,578 | 189,836,677 |
| 0.1 | 4.74% | 133,747 | 632,789 | 94,918,339 |
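The read and base-pair counts in the table scale linearly with the target depth, which is what a proportional random subsample of the 1X reads would produce. The following is a minimal sketch of that arithmetic only; the function name `subset_targets` is hypothetical, and the tool actually used for subsetting is not stated here. The computed values agree with the table to within one read or one base pair.

```python
# Read and base-pair counts for the 1X row, copied from the table above.
READS_AT_1X = 6_327_889
BASES_AT_1X = 949_183_385


def subset_targets(depth_fraction, reads_full=READS_AT_1X, bases_full=BASES_AT_1X):
    """Return (reads, base pairs) expected after keeping `depth_fraction`
    of the 1X reads. Assumes a simple proportional subsample."""
    if not 0 < depth_fraction <= 1:
        raise ValueError("depth_fraction must be in (0, 1]")
    reads = round(reads_full * depth_fraction)
    bases = round(bases_full * depth_fraction)
    return reads, bases


# Mean read length implied by the 1X row (base pairs per read).
mean_read_length = BASES_AT_1X / READS_AT_1X

# Print the expected counts for each depth from 1.0X down to 0.1X.
for tenths in range(10, 0, -1):
    depth = tenths / 10
    reads, bases = subset_targets(depth)
    print(f"{depth:.1f}X  reads={reads:,}  bases={bases:,}")
```

The implied mean read length is very close to 150 bp, so the Base Pairs column is effectively the Reads column multiplied by the read length.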
Figure 1. Comparing density plots for the LD measures D' (A) and r² (B) demonstrates that whole genome sequencing with imputation results in a dataset in which a higher proportion of SNPs are in strong pairwise linkage with each other, represented by the heavier tails in red near D' and r² values of 1.
Figure 2. A) Overall accuracy of the filtered and raw imputed datasets was plotted across the evaluated depths. For all study panels, concordance rapidly erodes below a sequencing depth of ∼0.3X. B) Examining accuracy in the context of minor allele frequency (MAF) reveals that errors occur at higher rates as MAF approaches its maximum of 0.5.
Figure 3. Proportion of errors, categorized by whether a minor, major, or heterozygous call was misimputed. In over half of all errors, Beagle overimputes the minor allele when the major allele is the true genotype. Incorrect heterozygous imputations make up a minor proportion of the total error and would likely be filtered out in inbred panels.
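Accuracy (concordance) and the error categories described in the captions above can be illustrated with a short sketch. Everything here is an assumption made for illustration: genotypes are coded 0 (homozygous major), 1 (heterozygous), 2 (homozygous minor), `None` marks a missing call, and the function names, category labels, and toy data are hypothetical rather than taken from the study.

```python
from collections import Counter


def concordance(true_calls, imputed_calls):
    """Fraction of sites where the imputed genotype equals the true genotype.
    Sites with a missing call (None) on either side are skipped."""
    pairs = [(t, i) for t, i in zip(true_calls, imputed_calls)
             if t is not None and i is not None]
    if not pairs:
        return float("nan")
    return sum(t == i for t, i in pairs) / len(pairs)


def categorize_errors(true_calls, imputed_calls):
    """Count discordant sites by error type (labels are illustrative)."""
    counts = Counter()
    for t, i in zip(true_calls, imputed_calls):
        if t is None or i is None or t == i:
            continue
        if t == 1 or i == 1:
            counts["heterozygous"] += 1      # a heterozygous call is involved
        elif t == 0 and i == 2:
            counts["major_to_minor"] += 1    # minor allele overimputed
        else:
            counts["minor_to_major"] += 1    # true 2, imputed 0
    return counts


# Hypothetical toy data, not from the study.
truth = [0, 0, 2, 2, 0, None]
imputed = [0, 2, 2, 0, 1, 0]
print(concordance(truth, imputed))
print(dict(categorize_errors(truth, imputed)))
```

Grouping sites by MAF before calling `concordance` would give a per-bin accuracy of the kind summarized in Figure 2B.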
Figure 4. Comparing the smoothed frequency of errors made at individual SNP sites with the LD measures D' (A) and r² (B) demonstrates the strong influence of linkage disequilibrium on imputation accuracy.
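For readers unfamiliar with the two LD measures, the sketch below computes |D'| and r² for one pair of biallelic SNPs using the standard textbook definitions. It assumes phased haplotypes coded 0/1 at each site; the function name `ld_measures` and the toy haplotypes are hypothetical, and this is not necessarily how LD was computed in the study.

```python
def ld_measures(haplotypes):
    """Return (|D'|, r^2) for two biallelic SNPs.

    `haplotypes` is a list of (a, b) tuples holding the 0/1 allele carried
    at SNP A and SNP B on each haplotype."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b                             # raw disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r2 = d * d / denom

    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d_prime, r2


# Hypothetical example: only three of the four possible haplotypes occur,
# so |D'| is 1 while r^2 stays below 1 because allele frequencies differ.
haps = [(1, 1)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 6
print(ld_measures(haps))
```

The example highlights why the two panels can differ: D' reaches 1 whenever at most three haplotypes are present, whereas r² reaches 1 only when the two SNPs also have matching allele frequencies.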
Figure 5. A) The power to detect a moderate-effect QTL becomes increasingly sensitive to error, for both major-to-minor and minor-to-major errors, at intermediate MAFs. B) Comparing the power to detect the same QTL with 300 samples at a 0% genotyping error rate vs. 500 samples at a 5% error rate demonstrates that cost savings can be used to increase study sizes in order to recover the power lost to imputation error in both major and minor alleles.