Literature DB >> 31072870

Generating High Density, Low Cost Genotype Data in Soybean [Glycine max (L.) Merr.].

Mary M Happ¹, Haichuan Wang¹, George L Graef¹, David L Hyten².

Abstract

Obtaining genome-wide genotype information for millions of SNPs in soybean [Glycine max (L.) Merr.] often involves completely resequencing a line at 5X or greater coverage. Currently, hundreds of soybean lines have been resequenced at high depth levels with their data deposited in the NCBI Short Read Archive. This publicly available dataset may be leveraged as an imputation reference panel in combination with skim (low coverage) sequencing of new soybean genotypes to economically obtain high-density SNP information. Ninety-nine soybean lines resequenced at an average of 17.1X were used to generate a reference panel, with over 10 million SNPs called using GATK's Haplotype Caller tool. Whole genome resequencing at approximately 1X depth was performed on 114 previously ungenotyped experimental soybean lines. Coverages down to 0.1X were analyzed by randomly subsetting raw reads from the original 1X sequence data. SNPs discovered in the reference panel were genotyped in the experimental lines after aligning to the soybean reference genome, and missing markers imputed using Beagle 4.1. Sequencing depth of the experimental lines could be reduced to 0.3X while still retaining an accuracy of 97.8%. Accuracy was inversely related to minor allele frequency, and highly correlated with marker linkage disequilibrium. The high accuracy of skim sequencing combined with imputation provides a low cost method for obtaining dense genotypic information that can be used for various genomics applications in soybean.

Entities: Chemical Species

Keywords: high density SNP data; imputation; low cost genotyping; skim sequencing; soybean

Mesh：

Year: 2019 PMID： 31072870 PMCID： PMC6643887 DOI： 10.1534/g3.119.400093

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Genomics research has yielded a variety of tools which allow for more efficient and precise translation of genetic variation into crop improvements. Panels of single nucleotide polymorphisms (SNPs) obtained through SNP arrays or genotyping-by-sequencing (GBS) are the most common tool used to explore and make associations between genetic and phenotypic variation. Genomics-assisted crop breeding continues to demand increasing densities of genotype information to successfully dissect and predict genetically complex traits (Hamblin ; Lorenz ). Current approaches of directly ascertaining a high density of SNP genotype data on large populations are cost prohibitive or fall short of being able capture the maximum amount of genetic space. Fixed SNP arrays and GBS are popular options for SNP genotyping in crops. Panels ranging in densities of up to ∼600,000 variants are now common in several crop species (Rasheed ). However, recent genomics studies are utilizing datasets consisting of one million or more markers to answer complex, quantitative genetic questions. The need for this high density of markers is rendering current arrays and GBS approaches inadequate to generate the magnitude of data modern genomic studies require (Tian ; Patil ; Li ). High-depth whole genome sequencing can achieve these marker densities. One study utilizing high-depth whole genome sequencing in soybean found 9,107,000 high quality SNPs (Valliyodan ). Despite advances and the plummeting cost of next generation sequencing (NGS) data, this approach still presents a heavy financial burden, as several reads are required at each variant site to ensure data quality and completeness. Decreasing genome coverage in the interest of cost savings introduces missing data, which decreases power and can produce biased results. Imputation of missing data has the potential to allow the researcher to recover nearly all of the missing data points resulting from skim sequencing, drastically reducing genotyping expenses associated generating complete, high quality, high resolution SNP datasets. By predicting the unobserved genotypes based on the surrounding variants and their correlation to a complete reference panel, missing data can be amended to the correct allele genotype. This technique has been developed and extensively used in human genomic research, and is now commonly extended to other organisms (Pei ; Howie ; Howie ). Seen frequently in plants is the use of imputation to fill missing data points in GBS data (Chan ; Chung ). Specially designed populations such as bi-parental, nested, and multi-parent where the founders are genotyped to a high depth and used for the reference haplotypes has been shown to boost accuracy (Tian ; Swarts ; Huang ; Bayer ; Cericola ). Crop breeding programs working with inbred species and/or inbred lines are uniquely positioned to leverage imputation algorithms in an extremely accurate manner. Near complete homozygosity through inbreeding or double haploids allows calling of genotypes despite having sampled one allele at the site. Large haplotype blocks in historically inbred crops theoretically permit imputation accuracy to extend across large physical regions, where genotyped markers are sparse but in high correlation with each other. Success with such a combinatorial approach has been reported in rice, using ∼1X coverage sequence data of 517 individuals. Imputation of the missing genotypes in these individuals without a reference panel to produce a SNP panel of ∼3.6 million markers with >98% accuracy (Huang ). This was confirmed in a later study that also included simulations performed down to 0.1X depth. Falling below a depth of 0.5X resulted in steep accuracy consequences, with concordance falling to 76% at the 0.1X level. (Wang ). Incorporation of a reference panel has been shown to result in large accuracy improvements at sequencing coverage less than <1X in humans, where imputation at the 0.1X level was improved from less than 5% accuracy to ∼70% (Pasaniuc ). With the growing amount of sequence data present in public databases for many common crops, it is possible to generate an extensive reference panel that might improve accuracy at ultra-low sequence coverage and further cut per sample genotyping cost. In this study, we report on a low coverage whole genome sequencing with imputation approach in a naturally inbred crop, soybean, for producing a low cost, high quality, high density SNP dataset. A reference panel was generated using publicly available high-depth sequencing data for 106 lines, and employed for imputing the missing genotypes of 114 lines sequenced at ultra-low depth. Coverages from 0.1X – 1X depth at intervals of 0.1X were evaluated. The factors influencing error rates and extensibility within/outside soybean were investigated, and the consequences of error rates and types of error on a typical genome-wide association study (GWAS) were explored.

Materials & Methods

Reference Panel

The reference panel for genotype imputation was generated using publicly available sequence data deposited in the NCBI Short Read Archive from study number SRP062245 (Valliyodan ). This unfiltered, raw dataset consisted of 106 Glycine max lines sequenced at an average of 17.1X coverage (Supplementary Table 1). The raw reads were filtered for adapter sequence contamination, base quality, and truncated reads using Trimmomatic (Bolger ). Bowtie2 was used to map reads to the Glycine max Wm82.a2.v1 reference genome with the “very sensitive” option (Langmead and Salzberg 2012). Reads with a mapping quality score of less than 20 were discarded. SNPs were called using the GATK3.7 HaplotypeCaller tool for an initial panel of 13,052,759 SNPs across all lines (Poplin ). SNP calls with five or less reads supporting the call were filtered out, as well as calls with a confidence score of less than 20. To control for potential sample contamination/mixing, the inbreeding coefficient, also called the F statistic (Jain and Workman 1967), was calculated using the software Plink1.9 (Purcell ). As soybean is historically an inbred crop, one can expect F statistics close to one in Glycine max. Seven samples fell below a cutoff of 0.9 and were discarded from the final reference panel. All heterozygous calls in the remaining 99 lines were filtered, leaving only biallelic SNPs for consideration. The final reference panel spanned 10,803,148 biallelic homozygous SNPs in 99 lines compared to 10,417,285 SNPs found by Valliyodan et al. using the same data set.

Imputation Panel

To generate a low sequence coverage panel for imputation, 114 experimental lines selected from the University of Nebraska soybean breeding program (Supplementary Table 1) were sequenced to a depth of 1X or greater on an Illumina NextSeq 500 (Illumina Hayward, Hayward, CA) using the manufacturer’s protocol and 150 base pair paired end reads. DNA was isolated from lyophilized leaf tissue collected from twenty plants per genotype using a CTAB based extraction method (Keim 1988) scaled down for a 96 well plate by dividing all reagent volumes by 40. Extracted genomic DNA was fragmented using a Covaris S220 with the manufacturer’s recommended settings for generating ∼350 base pair length fragments (Covaris, Inc., Woburn, MA 01801). Double sided size selection was performed using KAPA Pure Beads to retain only fragments within the 250-450 base pair range using the manufacturer’s protocol and eluted in 40 μl of TE buffer (Roche Sequencing Solutions, Santa Clara, CA 95050). After testing DNA concentration, samples were standardized to 62.5 ng /μl. Libraries were prepared using a custom protocol adapted from literature to perform A-tailing and end-repair in one reaction, and avoid PCR after adapter ligation by extending the incubation time (Kozarewa and Turner 2011; Knapp ) . To perform end repair and A-tailing, 16 μl of fragmented genomic DNA for each sample was combined with 1 μl of T4 polynucleotide kinase (PNK) (10U/μl), 1 μl of T4 DNA polymerase (5U/μl), 1 μl of DreamTaq DNA Polymerase (5U/ μl), 2.7 μl of Cut Smart Buffer (10x), 2.2 μl of dATP (10mM), 0.8 μl of dNTP (10 mM), and 0.3 μl of ATP (10mM). Samples were incubated in a thermocycler for 30 min at 20°, and then immediately ramped to 65° and held at this temperature for 30 min. Samples then proceeded immediately to adapter ligation. To the 25 μl of end repaired and A-tailed product the following was added: 10 μl of T4 DNA Ligase Buffer, 3 μl of T4 DNA Ligase (2000U/μl), 3 μl of PEG 6000, 2 μl of PCR grade water, and 2 μl of uniquely barcoded adapters (30mM). Samples were incubated on a thermocycler for 45 min at 20°. After this time, samples were immediately cleaned using KAPA Pure Beads to retain fragments within the 350-550 base pair range and eluted in 20 μl of TE buffer. Multiplexing was performed by combining 5 μl of each individual library. Libraries were quantified using the KAPA Library Quantification Kit for Illumina platforms. To create subsets simulating depths from 0.1X to 1X at intervals of 0.1X, reads were randomly selected from the raw datasets based upon the total number of reads obtained for each genotype. Each dataset was trimmed for adapter contamination, base quality and truncated reads using Trimmomatic, and then mapped to the Glycine max Wm82.a2.v1 reference genome with Bowtie2 using the “very sensitive” option. Mapped reads below a quality score of 20 were filtered. The genotypes at all 10,803,148 SNP positions in the reference panel were called in the low coverage imputation panel using GATK3.7 Haplotype Caller. Genotyping SNPs from a single read has been found accurate in rice whole genome sequencing and maize GBS applications (Swarts ; Wang ) . Any heterozygous calls were discarded, as well as calls not matching the two allele options at that position. For each subset, a random 5% of calls were masked and considered “true” genotypes for evaluating imputation accuracy. To characterize how genetically distinct the experimental lines were from one another, a genomic relatedness matrix was constructed according to the van Raden metric using the R package “synbreed”. Prior to calculation, the imputed dataset was filtered to retain variants with a Beagle posterior genotype probability (GP) score above 0.9, pairwise r2 LD metric below 0.4, and variant site missing rate below 5% using Plink1.9 (Purcell ).

Imputation Concordance Evaluation

For the sake of computational efficiency, imputation was performed on a per chromosome basis using Beagle 4.1 (Browning and Browning 2016) with the low memory option. To assess accuracy, the imputed genotype calls were compared to the masked calls, and the percent of those in agreement constituted overall concordance using GATK 3.7’s Genotype Concordance tool (McKenna ). This accuracy assessment was performed across sequencing depths and minor allele frequencies. Three post imputation datasets were considered to quantify any accuracy improvement obtained by filtering poorly imputed sites. This included the raw imputed dataset, and two datasets filtered on GP. Values with GP scores under 0.45 and 0.9 were filtered for the latter two evaluation panels, respectively. VCFtools0.01.12a was used to bin by minor allele frequency, and Plink1.9 (Purcell ; Danecek ) was used to filter on GP score. GP score filtering thresholds were determined after examining their relationship to error rate (Supplementary Table 2)

Error and Linkage Disequilibrium

Error in relationship to linkage disequilibrium (LD) was examined as a potential metric of extensibility to other soybean population and crop species. D' and r2 statistics were calculated for all pairwise reference panel SNPs using Plink1.9 (Gaunt ; Purcell ). Proportion of errors made at each SNP site across was calculated by comparing the imputed values to the masked values across all subsets of depths. To reduce noise, data were smoothed through the application of a rolling average window with a width of 1500 SNPs after ordering by the respective pairwise LD metric. A second order polynomial was fit to describe the D' and error relationship, and a simple linear regression was fit to describe the relationship between r2 and error.

Relationship Between Samples & Reference Panel

Close relatedness between the sample and reference genotypes has been previously reported to increase imputation precision. Relatedness matrices were generated based on five different coefficients and averaged the top five scores from each sample genotype as a metric for gauging degree of relatedness to the reference panel. These measures were plotted against concordance scores from the imputed data filtered for GP scores above 0.9 and averaged across all depth levels. A simple linear regression model was fit to assess potential correlation. Relatedness matrices were calculated using the R package “synbreed”, using options corresponding to measures described by vanRaden, Astle and Balding, Reif, Hayes and Goddard, and Euclidean distances (Wimmer ).

Genome Representation

Genomic studies improve as the linkage between the genotyped polymorphism and underlying causative gene increases. The extent of LD between two markers therefore constitutes proxy for the correlation of the marker and underlying gene(s) of interest. To assess how well the panel represented variation across the genome, the distribution of LD in the imputed experimental dataset was compared to the SoySNP50k Array positions extracted from the imputed experimental dataset (Song ). SNPs with MAF below 0.05 were filtered out, a quality control step implemented in most genomic studies. Both D' and r2 were calculated using Plink1.9, and distributions plotted in R3.4 (Team 2017).

Error and Beagle Posterior Genotype Probability

To explore the possibility of using GP values as a post imputation filtering metric, proportion of error across depth subsets was plotted against GP. A rolling average window with a width of 500 SNPs was applied to the proportion error after ordering by GP, and a second degree polynomial was fit to describe the relationship in R3.4.

Error Type:

Allele frequencies exhibit some degree of influence on the results of many genomics studies. Therefore, how imputation error skews this metric is of significant interest. Masked and imputed datasets were coded according to the major allele in the reference dataset. Errors were binned into four categories, homozygous major to minor, homozygous minor to homozygous major, homozygous major to heterozygous, and homozygous minor to heterozygous, based on which allele was incorrectly imputed and which allele was true. Because all heterozygous calls were filtered in the initial data generation, no heterozygous to major, or heterozygous to minor category exists.

Power Analysis

In the interest of determining the potential cost of imputation error, a basic power calculation for minor to major and major to minor errors in a GWAS was performed. Using an R implementation of Purcell’s “Genetic Power Calculator” (Purcell ), power was calculated to detect a moderate effect QTL across minor allele frequency bins from <0.025 to 0.5. Simulations assumed an additive genetic model, 300 genotypes, LD between the QTL and marker of 0.8 D', a significance threshold that mirrored the Bonferroni correction for 1,716,234 SNPs (the final size of the SNP dataset after quality control filtering), and a QTL effect size of 1 standard deviation. Error rates from 1–10% were tested at intervals of 1%, with 100 iterations of the simulation performed at each error level. To investigate the possibility of including more genotypes to overcome power losses associated with imputation error, simulations were also performed for 150, 500, and 1000 genotypes for a 5% error rate at the same conditions as specified above.

Cost Analysis

Decreasing cost per sample allows a researcher to expand a study to overcome power loss introduced through the imputation error. To illustrate the impact of this, per sample sequencing costs were calculated using current Illumina NextSeq500 high throughput 300 cycle sequencing kit prices, cost analysis of a custom library prep protocol, and CTAB DNA extraction method (Supplementary Table 3). The retained cost per sample and average raw concordance were plotted as depth decreased.

Data Availability

Raw sequencing data directly generated by this project for use in creating the study panel has been deposited in the NCBI Short Read Archive under accession number PRJNA512147. The reference panel used for genotype imputation was generated using previously publicly available sequence data deposited in the NCBI Short Read Archive from study number SRP062245 (Valliyodan ). Supplementary figures and tables can be found in “Supplementary Figures and Tables.” The 114 × 114 relatedness matrix is available in Supplementary File 1. Supplemental material available at FigShare: https://doi.org/10.25387/g3.7975751.

Results

SNP Genotyping & Imputation

The reference panel for imputation was constructed using 106 Glycine max lines sequenced at an average of 17.1X coverage using publicly available sequencing data deposited in the NCBI Short Read Archive (Valliyodan ) (Supplementary Table 1). After quality control measures were applied to the raw and mapped sequence data (see Materials & Methods), a final reference panel of 10,803,148 biallelic homozygous SNPs across 99 lines was generated. SNPs discovered in the reference panel were used to genotype experimental lines in the study panel. This consisted of 114 lines that were sequenced to a depth of at least 1X. Coverages from 0.1X to 1X were analyzed by randomly subsetting reads from the raw sequence data. Of the 10,803,148 million markers discovered in the reference panel, the number of SNPs genotyped by this low coverage study panel subsets ranged from 133,747 to 1,288,463 markers. These subsets also ranged in missing data rates from 95.26 to 67.56% for those markers (Table 1). Using the reference panel, genotype values for all missing positions were imputed.

Table 1

The number of markers and genotyping rate in each low coverage subset from 0.1X to 1X sequencing depth. As coverage decreases, the total number of markers captured and completeness of the SNP panel decreases.

Mean Depth	Genotyping Rate	Number of SNPs	Reads	Base Pairs
1	32.44%	1,288,463	6,327,889	949,183,385
0.9	30.41%	1,240,823	5,695,100	854,265,047
0.8	27.77%	1,174,619	5,062,311	759,346,708
0.7	24.91%	1,097,843	4,429,522	664,428,370
0.6	21.80%	1,005,880	3,796,734	569,510,031
0.5	18.47%	895,596	3,163,945	474,591,693
0.4	14.98%	760,167	2,531,156	379,673,354
0.3	11.40%	590,786	1,898,367	284,755,016
0.2	7.85%	375,343	1,265,578	189,836,677
0.1	4.74%	133,747	632,789	94,918,339

Close relatedness between the lines in the experimental panel may bias the overall accuracy of the imputation results. To evaluate this, vanRaden relatedness scores were calculated using real and imputed genotypes. The resulting values ranged from -0.44858 to 0.91112, with a median value of -0.02851, mean of -0.00286, and standard deviation of 0.17650. Strong relationships are generally indicated by values over 0.4. Our experimental panel exhibits few strongly related lines, with only 2.7% of all possible pairwise combinations showing a relationship above this threshold. Therefore, we would conclude the majority of our experimental genotypes to be distally/non-related (Supplementary File 1). An alternative to this whole genome sequencing approach are fixed SNP arrays. However, this method provides less total SNPs for genomic studies and may not capture as much of the genome. High LD between SNPs can be extended to assume a strong correlation to other genomic variation between them. Plotting the density distributions of r2 and D' LD measures for the Soy50KSNP Array and imputed dataset demonstrated that whole genome sequencing with imputation had a greater concentration of values toward higher linkage values. Generally, a D' or r2 of over 0.8 between is considered “strong linkage”. The imputed dataset provided 1,716,234 SNPs after common quality control filters, with 36.00% and 85.66% of r2 and D' values above 0.8, respectively. This is in comparison to the 42,133 SNPs in the fixed array, where 24.20% and 80.00% of r2 and D' values are above 0.8 (Figure 1). If high LD indicates a better tagging of underlying variation, the imputed dataset captures the genome’s SNP variation better than the Soy50KSNP Array.

Figure 1

Comparing density plots for LD measures D' (A) and r2 (B) demonstrates that using whole genome sequencing with imputation results in a dataset that has a higher proportion of SNPs is strong pairwise linkage with each other, represented in the heavier tails in red near D' and r2 values of 1.

Imputation Accuracy

Prior to imputation, 5% of genotype calls from the skim sequencing data were withheld to assess accuracy. Overall imputation accuracy was consistent for raw and filtered datasets as sequencing depth decreased from 1X until 0.3X, where accuracy drops off by an average of 3.5% from 0.3X to 0.1X (Figure 2A, Supplementary Table 2). Assessing the error type of this study showed that 53.13% of the errors made were incorrect imputation of the minor allele when the major allele was true. Of the remaining errors, 35.10% were incorrect imputation of the major allele when the minor allele was true, and 11.77% were incorrect imputation of heterozygous calls. No heterozygous to major/minor errors exist as at heterozygous calls were filtered in the initial panels (Figure 3).

Figure 2

Figure 3

Proportion of errors made as categorized by whether the minor/major/heterozygous alleles was misimputed. In over half of all the errors made, Beagle overimputes the minor allele when the major allele is the true genotype. Incorrect heterozygous imputations make up a minor proportion of the total error and would likely be filtered out in inbred panels.

A) Overall accuracy of filtered and raw imputed datasets were plotted across the evaluated depths. For all study panels, concordance rapidly erodes below a sequencing depth of ∼0.3X. B) Examining accuracy in the context of minor allele frequency (MAF) reveals that error occurs at higher rates as MAF approaches a maximum of 0.5. Proportion of errors made as categorized by whether the minor/major/heterozygous alleles was misimputed. In over half of all the errors made, Beagle overimputes the minor allele when the major allele is the true genotype. Incorrect heterozygous imputations make up a minor proportion of the total error and would likely be filtered out in inbred panels. Filtering on Beagle’s posterior genotype probability (GP) to improve dataset quality was successful. When imputed positions with a GP score of less than 0.45 were discarded, accuracy improved by an average of 2.50% across sequencing depths. A more stringent filter that only kept positions with a GP score over 0.9 resulted in a 4.26% increase in accuracy (Supplementary Table 2). This practice did reintroduce some missing data, which varied across depth and filtering level. Data loss as a result of post imputation filtering was below 5% for all depths at a filtering level of GP > 0.45, but quickly inflates when filtering for imputation quality of GP > 0.9 to a missing data rate of 20.82% at 0.1X (Supplementary Figure 1). While filtering on Beagle’s posterior genotype probability may reduce falsely imputed genotypes, it must be balance with the reintroduction of missing data it causes. The error rate at individual marker loci may not be well captured by the overall concordance across all SNPs. Examining concordance in the context of minor allele frequency (MAF) reveals as MAF values approach a maximum of 0.5, concordance decreases. Application of post imputation filters of GP values increases overall accuracy through improved concordance at these increased MAFs (Figure 2B). This trend is uniform across all sequencing depths (Supplementary Figure 2). Through examining imputation accuracy in this manner, it is apparent that higher error rates are occurring at SNP positions at MAFs nearest 0.5 than is described by the average concordance measure. Error rates in imputation may be influenced by characteristics specific to the population and crop species to which it is applied. The correlation between variants is a cornerstone to the success of imputation. If the correlation between alleles is high then imputation accuracy should also be high and as the correlation between alleles decrease then the accuracy of imputation should also decrease. This correlation between alleles can be measured with LD. Soybean is a historically inbred crop with long ranging LD (Zhou ). As D' and r2 approach 1, where neighboring SNPs are in perfect linkage with each other, error rates are at their lowest. Both relationships demonstrate a very strong correlation with R2 values of 0.98 and 0.89 for r2 and D' respectively (Figure 4), indicating LD is an important factor to consider when applying this technique to other soybean populations or other crop species.

Figure 4

Comparing the smoothed frequency of errors made at individual SNP sites with LD measures D' (A) and r2 (B) demonstrates the strong influence of linkage disequilibrium on imputation accuracy.

Comparing the smoothed frequency of errors made at individual SNP sites with LD measures D' (A) and r2 (B) demonstrates the strong influence of linkage disequilibrium on imputation accuracy. Relationship of the study genotypes to the reference panel genotypes has been suggested as a strong influencer of imputation accuracy. Plotting calculated values for five unique kinship metrics against concordance for each genotype did not demonstrate any strong linear relationships. The maximum correlation for any of the measures was for Reif’s method, at an R2 of 0.26. Examination of the standard error, shows that the study population varies narrowly in terms of relatedness to the reference panel. Additionally, assessing the raw values suggests that the study population is weakly related to reference genotypes. This is best illustrated with the vanRaden and Astle & Balding measurements, where a “strong” relationship is usually indicated by values approximately >= 0.4. In both these cases, the largest measure does not exceed 0.18 and 0.16 (VanRaden 2008; Astle and Balding 2009). The combination of diminished values and narrow standard error indicates a weak relationship of the study panel to the reference panel (Supplementary Figure 3). The evidence of a weak relationship suggests that relationship was not a strong influencer of the high imputation accuracies obtained.

GWAS Power

Understanding the effect error rate has on genomic studies is important when selecting an appropriate genotyping technique. To determine the effect of the error rate of skim sequencing and imputation has on GWAS we performed power simulations of detecting a moderate effect QTL in a panel of 300 individuals. This power study showed significantly decreased power to detect QTL with increasing errors at MAF from 0.1-0.3. This was most pronounced when the minor allele was incorrectly imputed as the major allele. Above 0.3 MAF, power for QTL detection was minimally affected by error (Figure 5A). Studying the effect of three additional sample sizes, while assuming a 5% error rate demonstrates the potential for experimenters to recover power losses through inclusion of more genotypes. Including 500 individuals at this fixed error rate recovers and even slightly improves power at mid-range MAFs over studying 300 genotypes with no genotyping error (Figure 5B).

Figure 5

A) The power to detect a moderate effect QTL becomes increasingly sensitive to error for both major to minor and vice versa errors at intermediary MAFs. B) Comparing the power to detect the same QTL with 300 samples at a 0% genotyping error vs. 500 samples with a 5% error rate demonstrates that cost savings can be used to increase study sizes in order to recover power losses introduced by the imputation error of both major and minor alleles.

Discussion

This study illustrates the potential of low coverage sequencing with imputation as an economical approach to obtaining high density SNP genotype information in soybean. Accelerating improvement of complex phenotypes through genomics necessitates high quality, high resolution marker data. However, studies are often limited by the cost required to obtain this information through high coverage sequencing. The combination of low coverage sequencing with imputation presents an option that drastically cuts costs while retaining a high level of accuracy. Implementing a similar method in rice allowed researchers to generate a high quality, dense SNP dataset using 1X depth whole genome sequence (Huang ; Wang ). This analysis in soybean, which differs in the inclusion of a reference panel for imputation, determined sequencing depth could be reduced to 0.3X with no significant accuracy losses. Analogous results have been demonstrated in humans, where it was concluded that a reasonably accurate and dense dataset could be obtained from 0.2X coverage supplemented with imputation using a reference panel (Pasaniuc ). To our knowledge, this is the first work to examine using imputation with real sequence data at less than 1X coverage in the construction of a high quality, highly affordable SNP dataset in plants. The effect of imputation method and structure of the reference panel have not been specifically examined in the context of application to skim sequencing, providing future avenues for research and improvement. While SNP arrays and GBS are popular options for obtaining genotype information, high precision genomics demands markers to be in close linkage to the contributing genes. Regions of the genome with sparsely correlated markers may therefore contain overlooked causal variation (Hirschhorn and Daly 2005; Witte 2010). Skim sequencing with imputation, as investigated here, tags a significantly larger portion of the genome in tighter LD than the current soybean 50k array. This effect may be presumed to extend to GBS datasets of a similar size. Such a boost in resolution may therefore reveal QTL in regions of the genome that would not have been captured through smaller datasets. The accuracy and extensibility of this approach in other soybean populations, as well as other crops is based on several factors. To explore potential limitations in this method, population LD and the relatedness of reference panel to study lines were examined. Both of these factors have been implicated as strong influencers of imputation accuracy due to the innate reliance of the technique on the presence of sample haplotypes within the reference panel, as well as the extent of correlation between observed markers (Hickey ; He ). The strong inverse relationship observed between the proportion of SNPs incorrectly imputed at a given position and LD measurements suggests that for soybean populations and other crops with shorter range LD, imputation accuracy will likely decrease. There was no significant relationship detected between kinship measures and accuracy. However, the study genotypes exhibited little variation for any of the calculated metrics, which can be seen in the low standard deviations. Without a wide range of values to examine, identifying a clear trend is unlikely. The positive effect of relatedness on imputation accuracy is documented in other literature (Hickey ; Ma ; Boison ), and should therefore be a consideration in expanding this method to other soybean populations and crop species. The overall weak kinship between study and reference panels in this data may also be viewed as a positive, since high levels of imputation accuracy were achieved despite this populations being interpreted as distally related. The power to detect a QTL is partially dependent on the allele frequency at that loci (Ardlie ; Tabangin ). Therefore, the relationship between imputation accuracy, minor allele frequency (MAF), and statistical power may be considered particularly important. In agreement with an analysis performed with maize, the data showed steadily decreasing imputation accuracy as MAF increased with the exception of very rare alleles (MAF < 0.05) (Hickey ). An opposite tendency was observed with respect to statistical power losses across MAF, so it can be interpreted that at the loci a SNP dataset would display the highest imputation error rates, the GWAS is least affected by them. This trend has also been supported in human imputation analyses looking at sample size inflation factors under different imputation error types (Huang ). In both cases, power consequences were greater for incorrect imputation of the minor allele. It is unclear how a combination of error types at a SNP locus would influence genomic studies. Decisions on the level of decreased coverage that can be tolerated should consequently be made not on the overall average concordance, but by examining the concordance across minor allele frequencies in relation to the maximum allowable error to retain power. The cost savings associated with this method can be used to include more sample genotypes, not only recovering power losses at low minor allele frequencies, but potentially increasing total power. Similar results in humans have indicated sampling more genotypes with small error is more beneficial over fewer genotypes with perfect accuracy (Pasaniuc ). Comparing the raw accuracy along sequencing depths along with per sample costs, shows that at the previously identified critical threshold of 0.3X coverage, there is only a 0.85% loss of accuracy relative to using a 1X sequence, while costs decreased 57% (Supplementary Figure 4). Moreover, the use of public sequence data to construct a broad reference panel eliminates the cost and limitations of assembling special populations and sequencing the founders to a high coverage to serve as the reference haplotypes.

Conclusion

Here it is demonstrated that low coverage sequencing accompanied with imputation from a reference panel can be extended below 1X depth in soybean to capture high density, reasonably accurate SNP genotype information economically. The tremendous drop in per sample sequencing cost over high depth methods may allow researchers to expand the number of study genotypes in their investigations, while representing a larger portion of the genome than fixed SNP arrays and GBS data. The potential for success of this genotyping method within and outside of soybean is highly reliant on population LD. Furthermore, researchers should examine accuracy and power within the context of minor allele frequency to make informed decisions about sequencing depth tolerances. As genomics demands increasing SNP panel densities across a wide range of genotypes, skim sequencing with imputation constitutes a financially feasible and highly accurate way to meet these requirements.

39 in total

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. Genome-wide association studies of 14 agronomic traits in rice landraces.

Authors: Xuehui Huang; Xinghua Wei; Tao Sang; Qiang Zhao; Qi Feng; Yan Zhao; Canyang Li; Chuanrang Zhu; Tingting Lu; Zhiwu Zhang; Meng Li; Danlin Fan; Yunli Guo; Ahong Wang; Lu Wang; Liuwei Deng; Wenjun Li; Yiqi Lu; Qijun Weng; Kunyan Liu; Tao Huang; Taoying Zhou; Yufeng Jing; Wei Li; Zhang Lin; Edward S Buckler; Qian Qian; Qi-Fa Zhang; Jiayang Li; Bin Han
Journal: Nat Genet Date: 2010-10-24 Impact factor: 38.330

3. Genotype Imputation with Millions of Reference Samples.

Authors: Brian L Browning; Sharon R Browning
Journal: Am J Hum Genet Date: 2016-01-07 Impact factor: 11.025

4. Efficient methods to compute genomic predictions.

Authors: P M VanRaden
Journal: J Dairy Sci Date: 2008-11 Impact factor: 4.034

5. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies.

Authors: Bogdan Pasaniuc; Nadin Rohland; Paul J McLaren; Kiran Garimella; Noah Zaitlen; Heng Li; Namrata Gupta; Benjamin M Neale; Mark J Daly; Pamela Sklar; Patrick F Sullivan; Sarah Bergen; Jennifer L Moran; Christina M Hultman; Paul Lichtenstein; Patrik Magnusson; Shaun M Purcell; David W Haas; Liming Liang; Shamil Sunyaev; Nick Patterson; Paul I W de Bakker; David Reich; Alkes L Price
Journal: Nat Genet Date: 2012-05-20 Impact factor: 38.330

6. The effect of minor allele frequency on the likelihood of obtaining false positives.

Authors: Meredith E Tabangin; Jessica G Woo; Lisa J Martin
Journal: BMC Proc Date: 2009-12-15

7. Cubic exact solutions for the estimation of pairwise haplotype frequencies: implications for linkage disequilibrium analyses and a web tool 'CubeX'.

Authors: Tom R Gaunt; Santiago Rodríguez; Ian Nm Day
Journal: BMC Bioinformatics Date: 2007-11-02 Impact factor: 3.169

8. Genetic Basis Underlying Correlations Among Growth Duration and Yield Traits Revealed by GWAS in Rice (Oryza sativa L.).

Authors: Fengmei Li; Jianyin Xie; Xiaoyang Zhu; Xueqiang Wang; Yan Zhao; Xiaoqian Ma; Zhanying Zhang; Muhammad A R Rashid; Zhifang Zhang; Linran Zhi; Shuyang Zhang; Jinjie Li; Zichao Li; Hongliang Zhang
Journal: Front Plant Sci Date: 2018-05-22 Impact factor: 5.753

9. Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors: Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal: Bioinformatics Date: 2014-04-01 Impact factor: 6.937

10. Optimized Use of Low-Depth Genotyping-by-Sequencing for Genomic Prediction Among Multi-Parental Family Pools and Single Plants in Perennial Ryegrass (Lolium perenne L.).

Authors: Fabio Cericola; Ingo Lenk; Dario Fè; Stephen Byrne; Christian S Jensen; Morten G Pedersen; Torben Asp; Just Jensen; Luc Janss
Journal: Front Plant Sci Date: 2018-03-21 Impact factor: 5.753

4 in total

Review 1. Genetic and Genomic Resources for Soybean Breeding Research.

Authors: Jakob Petereit; Jacob I Marsh; Philipp E Bayer; Monica F Danilevicz; William J W Thomas; Jacqueline Batley; David Edwards
Journal: Plants (Basel) Date: 2022-04-27

2. Comparing a Mixed Model Approach to Traditional Stability Estimators for Mapping Genotype by Environment Interactions and Yield Stability in Soybean [Glycine max (L.) Merr.].

Authors: Mary M Happ; George L Graef; Haichuan Wang; Reka Howard; Luis Posadas; David L Hyten
Journal: Front Plant Sci Date: 2021-03-31 Impact factor: 5.753

3. Candidate Genes Modulating Reproductive Timing in Elite US Soybean Lines Identified in Soybean Alleles of Arabidopsis Flowering Orthologs With Divergent Latitude Distribution.

Authors: Nicholas Dietz; Yen On Chan; Andrew Scaboo; George Graef; David Hyten; Mary Happ; Brian Diers; Aaron Lorenz; Dechun Wang; Trupti Joshi; Kristin Bilyeu
Journal: Front Plant Sci Date: 2022-04-29 Impact factor: 5.753

Review 4. Robotic Technologies for High-Throughput Plant Phenotyping: Contemporary Reviews and Future Perspectives.

Authors: Abbas Atefi; Yufeng Ge; Santosh Pitla; James Schnable
Journal: Front Plant Sci Date: 2021-06-25 Impact factor: 5.753

4 in total