The precise detection of causal DNA mutations (deoxyribonucleic acid) is very crucial for forward genetic studies. Several sources of errors contribute to false-positive detections by current variant-calling algorithms, which impact associating phenotypes with genotypes. To improve the accuracy of mutation detection, we implemented a binning method for the accurate detection of likely ethyl methanesulfonate (EMS)-induced mutations in a sequenced mutant population. We also implemented a clustering algorithm for detecting likely false negatives with high accuracy. Sorghum bicolor is a very valuable crop species with tremendous potential for uncovering novel gene functions associated with highly desirable agronomical traits. We demonstrate the precision of the described approach in the detection of likely EMS-induced mutations from the publicly available low-cost sequencing of the M3 generation from 600 sorghum BTx623 mutants. The approach detected 3,274,606 single nucleotide polymorphisms (SNPs), of which 96% (3,141,908) were G/C to A/T DNA substitutions, as expected by EMS-mutagenesis mode of action. We demonstrated the general applicability of the described method and showed a high concordance, 94% (3,074,759) SNPs overlap between SAMtools-based and GATK-based variant-calling algorithms. Our clustering algorithm uncovered evidence for an additional 223,048 likely false-negative shared EMS-induced mutations. The final 3,497,654 SNPs represent an 87% increase in SNPs detected from the previous analysis of the mutant population, with an average of one SNP per 125 kb in the sorghum genome. Annotation of the final SNPs revealed 10,263 high-impact and 136,639 moderate-impact SNPs, including 7217 stop-gained mutations, which averages 12 stop-gained mutations per mutant, and four high- or medium-impact SNPs per sorghum gene. We have implemented a public search database for this new genetic resource of 30,285 distinct sorghum genes containing medium- or high-impact EMS-induced mutations. Seedstock for a select 486 of the 600 described mutants are publicly available in the Germplasm Resources Information Network (GRIN) database.
The precise detection of causal DNA mutations (deoxyribonucleic acid) is very crucial for forward genetic studies. Several sources of errors contribute to false-positive detections by current variant-calling algorithms, which impact associating phenotypes with genotypes. To improve the accuracy of mutation detection, we implemented a binning method for the accurate detection of likely ethyl methanesulfonate (EMS)-induced mutations in a sequenced mutant population. We also implemented a clustering algorithm for detecting likely false negatives with high accuracy. Sorghum bicolor is a very valuable crop species with tremendous potential for uncovering novel gene functions associated with highly desirable agronomical traits. We demonstrate the precision of the described approach in the detection of likely EMS-induced mutations from the publicly available low-cost sequencing of the M3 generation from 600 sorghum BTx623 mutants. The approach detected 3,274,606 single nucleotide polymorphisms (SNPs), of which 96% (3,141,908) were G/C to A/T DNA substitutions, as expected by EMS-mutagenesis mode of action. We demonstrated the general applicability of the described method and showed a high concordance, 94% (3,074,759) SNPs overlap between SAMtools-based and GATK-based variant-calling algorithms. Our clustering algorithm uncovered evidence for an additional 223,048 likely false-negative shared EMS-induced mutations. The final 3,497,654 SNPs represent an 87% increase in SNPs detected from the previous analysis of the mutant population, with an average of one SNP per 125 kb in the sorghum genome. Annotation of the final SNPs revealed 10,263 high-impact and 136,639 moderate-impact SNPs, including 7217 stop-gained mutations, which averages 12 stop-gained mutations per mutant, and four high- or medium-impact SNPs per sorghum gene. We have implemented a public search database for this new genetic resource of 30,285 distinct sorghum genes containing medium- or high-impact EMS-induced mutations. Seedstock for a select 486 of the 600 described mutants are publicly available in the Germplasm Resources Information Network (GRIN) database.
ethyl methanesulfonateN‐ethyl‐N‐nitrosoureaGermplasm Resources Information NetworkN‐methyl‐N‐nitrosoureanext‐generation sequencingsingle nucleotide polymorphismsSequence Read Archive
INTRODUCTION
Plant forward and reverse genetics studies play a vital role in harnessing the untapped potential of plants in the search for new sources of fuel energy, improving crop productivity, discovering desirable agronomic traits, and enhancing ecosystem services (Doebley et al., 2006; Fernie & Yan, 2019; Grierson et al., 2011; Hartwig et al., 2012; Li et al., 2001; McCallum et al., 2000; Meyer & Purugganan, 2013; Mickelbart et al., 2015; Olsen & Wendel, 2013; Schneeberger, 2014; Schneeberger & Weigel, 2011). Despite the tremendous progress in plant gene function studies, a significant number of genes found in plant genomes still lack experimentally determined function. The rapid advances in cost‐effective, high‐throughput sequencing techniques such as next‐generation sequencing (NGS) technologies (Goodwin et al., 2016; Metzker, 2010) and modern mutation breeding techniques such as TILLING (Targeting Induced Local Lesions in Genomes) (McCallum et al., 2000) had been pivotal in recent large‐scale mutant studies that have provided expansive resources for crop forward genetics (Addo‐Quaye et al., 2018; Bolon et al., 2011; Caldwell et al., 2004; Chia et al., 2012; Cooper et al., 2008; Henry et al., 2014; Jiao et al., 2016; Krasileva et al., 2017; Lethin et al., 2020; Li et al., 2017; Szurman‐Zubrzycka et al., 2018; Talamè et al., 2008; Till et al., 2004). A diverse collection of physical mutagens (including fast neutrons, carbon ion beam, X‐rays, gamma rays, and ultraviolet light) and chemical mutagens (including ethyl methanesulfonate [EMS], N‐ethyl‐N‐nitrosourea [ENU], and N‐methyl‐N‐nitrosourea [MNU]) are available for inducing mutations in living organisms (Giles, 1940; Loveless, 1958; Mikaelsen et al., 1968; Nagai & Locher, 1937; Sax, 1940; Waugh et al., 2006; Westergaard, 1957). A disadvantage of traditional mutagens is that the modifications occur at random genomic positions, which results in the uncertainty of genomic consequences. Although precise genome‐editing techniques are on the ascendency, their general utility is still limited (Anzalone et al., 2020; Biswas et al., 2021; Lemmon et al., 2018; van de Wiel et al., 2017), and hence, traditional mutagenesis techniques remain useful. Various variant‐calling algorithms are currently available for identifying induced mutations from next‐generation sequencing datasets (DePristo et al., 2011; Garrison & Marth, 2012; Koboldt et al., 2012; Li et al., 2009; McKenna et al., 2010). The detection accuracy of these in silico tools are impacted by the inherently higher sequencing error rate of NGS technologies (when compared with Sanger sequencing) and other sources of error such as reference genome misassembly, imprecise sequence alignments, and sample and library preparation errors (Robasky et al., 2014). To expose false positives, present in the initial variant‐calling, various practices such as the removal of shared variants, low phred‐scale variant‐quality‐score variants, extreme strand‐biased variants, and excessively high‐coverage depth variants have been proposed (Flibotte et al., 2010; Nordström et al., 2013; Sarin et al., 2010; Uchida et al., 2011; Van der Auwera et al., 2013). In the specific case of EMS mutagenesis, the well‐described EMS mutation spectrum can provide valuable insights in the selection of effective parameter threshold values for variant filtering purposes. For example, in a rice exome‐capture mutagenesis study, the mutation spectrum was vital in the selection of a pertinent read coverage threshold for variant discovery (Henry et al., 2014). Although these filtering approaches are very potent, a possible unintended consequence is the elimination of bona fide false‐negative variants. In the era of rapid advances in high‐throughput phenotyping (phenomics) for both in‐field and greenhouse experiments (Araus et al., 2018; Bolger et al., 2019; Furbank & Tester, 2011; Houle et al., 2010; Olsen & Wendel, 2013; Tardieu et al., 2017; Weischenfeldt et al., 2013; White et al., 2012), precisely detecting true variants while filtering false‐positive variants and recovering bona fide false‐negative variants would greatly improve the precision of mutation detection and greatly enhance phenotype‐to‐genotype association (Araus et al., 2018).S. bicolor is a grass species that is very closely related to maize and has a relatively compact genome size of around 730 million bases (Paterson, 2008; Paterson et al., 2009; Price et al., 2005; Swigoňová et al., 2004). Sorghum is globally cultivated for food (grain sorghum), feed (forage sorghum), beverage, and fuel (cellulosic sorghum) and is characterized by the ability to grow in nutrient‐poor soils (Boyles et al., 2019; Dillon et al., 2007; Griebel et al., 2019; Pedersen et al., 2005; Scully et al., 2015; Wu et al., 2013; Xin et al., 2009; Zheng et al., 2011). In our previous study of the M3 generation from 600 independently EMS‐mutagenized sorghum individuals, we described an effective approach for EMS‐induced mutation detection (Addo‐Quaye et al., 2018). The approach included the following SNP filtering steps: (1) removal SNP with variant quality score less than 20, (2) removal of SNPs characterized by extremely high coverage depths (greater than 100× coverage), (3) removal of strand‐bias SNP (lacking read coverage on both genome strands), and (4) removal of replicate SNPs detected in multiple individuals (or shared variants). In this article, we describe an improved approach to detect likely EMS‐induced mutations in the low‐coverage NGS sequencing from mutant populations. This generally applicable approach leverages the well‐described action of EMS mutagenesis, which results in predominantly G/C to A/T DNA transitions (Greene et al., 2003; Loveless, 1969). We also implemented a clustering algorithm for uncovering bona fide false‐negative shared variants in sequenced mutant populations. Applying the method, we describe a highly improved indexed mutant collection and genetic resources for the publicly available M3 generation from 600 EMS‐mutagenized sorghum individuals (Addo‐Quaye et al., 2018).
RESULTS
NGS mapping and initial variant‐calling
To improve our variant prediction, we implemented a variant‐calling and filtering pipeline to detect likely EMS‐induced mutations in a sequenced mutant population (Figure 1) and evaluated the new approach by reanalyzing the 600 sorghum EMS mutants in our previous study (Addo‐Quaye et al., 2018). A total of 30,027,070,132 NGS paired‐end reads were retrieved from the NCBI Sequence Read Archive (SRA) with an average of 50,045,117 reads per sample. Of these, 98.3% (29,521,006,672) mapped to the current sorghum reference genome (Version 3.1), and 28,132,018,400 were properly paired mapped reads (Table 1). Initial variant‐calling using SAMtools detected 85,940,118 variants including 82,645,413 single nucleotide polymorphisms (SNPs). The percentage of SNPs that were G/C to A/T substitutions was 36.4% (30,096,471 SNPS; Table 2). Similarly using GATK variant‐calling algorithm with single individual genotyping and GATK with joint genotyping, 89,338,788 and 106,488,567 variants were initially detected, respectively, and these included 84,193,419 and 95,488,475 SNPs, respectively. The corresponding percentages of GATK SNPs that were G/C to A/T substitutions were 37.3% (single individual genotyping) and 37.9% (joint genotyping), respectively (Table 2). From here, we simply refer to GATK variant‐calling for single mutant individual genotyping as GATK and GATK analysis with joint genotyping for all 600 mutant individuals as GATK‐Joint.
FIGURE 1
The proposed computational pipeline for the detection of likely EMS‐induced variants in a mutant population. The total number of retained SNPs shown at the different pipeline stages was derived from the SAMtools variant‐calling results for 600 sorghum EMS‐mutagenized individuals
TABLE 1
Illumina NGS paired‐end reads mapping statistics for the 600 EMS‐mutagenized sorghum BTx623 individuals
Total
Mean
All NGS reads
30,027,070,132
50,045,117
Mapped reads
29,521,006,672
49,201,678
Properly paired reads
28,132,018,400
46,886,697
Number of mutants
600
‐
TABLE 2
Summary variant‐filtering statistics for the 600 EMS‐mutagenized sorghum BTx623 individuals
SAMtools (Sb_v2.1)a
SAMtools
GATK‐Singleb
GATK‐Jointc
Total
GCAT (%)d
Total
GCAT (%)
Total
GCAT (%)
Total
GCAT (%)
Initial variants
65,384,440
33.7
85,940,118
36.4
89,338,788
37.3
106,488,567
37.9
Initial SNPs
60,407,348
33.7
82,645,413
36.4
84,193,419
37.3
95,488,475
37.9
Unique SNPse
6,313,576
65.9
6,373,258
65.4
3,886,456
87.8
3,750,600
89.5
Filtered SNPsf
3,289,616
96.7
3,274,606
95.9
3,514,942
92.4
3,435,789
93.5
SAMtools SNP‐calling using Sorghum BTx623 reference genome version 2.1.
GATK SNP‐calling with single individual genotyping.
GATK SNP‐calling with joint genotyping.
Percentage of SNPs that are G/C to A/T DNA substitutions.
Subset of SNPs detected only once in the entire mutant population.
SNPs retained after SNP quality‐score filtering.
The proposed computational pipeline for the detection of likely EMS‐induced variants in a mutant population. The total number of retained SNPs shown at the different pipeline stages was derived from the SAMtools variant‐calling results for 600 sorghum EMS‐mutagenized individualsIllumina NGS paired‐end reads mapping statistics for the 600 EMS‐mutagenized sorghum BTx623 individualsSummary variant‐filtering statistics for the 600 EMS‐mutagenized sorghum BTx623 individualsSAMtools SNP‐calling using Sorghum BTx623 reference genome version 2.1.GATK SNP‐calling with single individual genotyping.GATK SNP‐calling with joint genotyping.Percentage of SNPs that are G/C to A/T DNA substitutions.Subset of SNPs detected only once in the entire mutant population.SNPs retained after SNP quality‐score filtering.
SNP filtering
For independently mutagenized individuals, it is highly unlikely for an EMS‐induced mutation to be detected in two or more mutants. The subtraction of shared (likely false‐positive) variants retained 6,373,258 SNPs (for SAMtools), 3,886,456 SNPs (for GATK), and 3,750,600 SNPs (for GATK‐Joint). Interestingly for both GATK‐based approaches, 88% and 90% (3,412,446 and 3,356,788) of the retained unique (non‐shared) SNPs were G/C to A/T substitutions (Table 2), whereas the percentage for SAMtools was only 65.4% (4,166,924 SNPs). However, the variant quality scores for the predicted mutations ranged between 3.0 and 228.0 (for SAMtools), 10.0 and 214,334.8 (for GATK), and 10.0 and 65,576,475.2 (for GATK‐Joint). This shows the default minimum variant quality score for SAMtools is lower than the GATK equivalent (3.0 vs. 10.0). Hence excluding the 44% (2,810,705) of the above 6,373,258 SAMtools SNPs with variant quality scores less than 10.0, 91.5% (3,259,639) of the remaining 3,562,553 SNPs were G/C to A/T SNPs. This shows the approximate correction for the lower minimum SNP quality score for SAMtools, produces results comparable to the above GATK‐based approaches. We next assigned SNPs to quality‐score bin categories corresponding to their variant quality scores. The mutation spectrum for all quality‐score bin categories was then calculated. Visual inspection of the mutation spectra for all variant‐quality‐score bin categories (Figure 2, Figures S1 and S2) shows that in all three approaches, the percentage of SNPs that were G/C to A/T transitions increased with increasing SNP quality‐score values. As shown in Figure 2, for SAMtools, the G/C to A/T percentage showed a striking increase from 40.91% (at quality score = 11) to 63.73% (at quality score = 12). Similarly, for GATK and GATK‐Joint, the G/C to A/T percentages jumped to 64.04% (at quality score = 26) and 66.36% (at quality score = 28), respectively (Figures S1 and S2). Based on these observations, the following SNP‐quality‐score thresholds were selected for subsequent quality‐score‐based filtering: 12 (for SAMtools), 26 (for GATK), and 28 (for GATK‐Joint). Applying the selected SNP quality‐score thresholds, 96% (3,141,908 out of 3,274,606) of SAMtools SNPs, 92.4% (3,248,054 out of 3,514,943) of GATK SNPs, and 93.5% (3,211,794 out of 3,435,790) of GATK‐Joint SNPs retained were G/C to A/T transitions (Figure 3; Table 2), and these are most likely EMS‐induced mutations. Comparison of the detected likely EMS‐induced SNPs showed a consensus (intersection) of 3,074,759 SNPs detected by all three approaches (Figure 4). The high concordance of SNPs between the SAMtools‐ and GATK‐based algorithms suggests the filtering method is generally applicable. A total of 181,432 SNPs were SAMtools‐specific, whereas 77,957 and 14,981 SNPs were, respectively, GATK‐ and GATK‐Joint‐specific. A total of 344,930 SNPs were detected in both GATK‐based approaches but not by SAMtools (Figure 4). Examination of the GATK‐based SNPs that were not detected by SAMtools showed that 55% (232,234 out of 422,887) of GATK SNPs and 58% (207,491 out of 359,911 SNPs) of GATK‐Joint SNPs were G/C to A/T substitutions (Table S1, Files S1 and S2). Hence, it is presumable that the majority of these GATK‐specific SNPs may not be true EMS‐induced SNPs. In our previous analysis of the 600 mutants, in which the SAMtools SNP‐calling was based on version 2.1 of the sorghum reference genome, we detected 1,753,403 SNPs, of which 96.7% (1,695,973 SNPs) were G/C to A/T substitutions (Addo‐Quaye et al., 2018). The 3,274,606 SNPs detected using the currently proposed method represents an 87% (1,521,203 SNPs) increase in the number of SNPs predicted. However, it is possible that the increase in SNPs detected could be due to differences in reference genome versions used in the separate studies. As a control, we repeated the analysis using the current approach with the older reference genome version 2.1. The initial number of SAMtools SNPs predicted in the genome version 2.1 was 60,407,348, and these included 6,313,576 unique SNPs (Table 2). Similarly, at an SNP‐quality‐score threshold of 12, 96.7% (3,182,253 out of 3,289,616 SNPs) were G/C to A/T substitutions (Figure 3; Table 2). Hence, the increase in the number of SNPs predicted is the result of a superior approach and not due to an improvement in genome assembly. Finally, we investigated the mutation spectrum for SNPs associated with extreme strand‐biased reads mapping at the corresponding SNP position in the genome. Of the 307,434 strand‐biased SNPs where nearly all the mapped reads (95%) at the SNP position were derived entirely from only the forward or only the reverse strand, over 94% (290,110 SNPs) were G/C to A/T substitutions (Figure S3; File S3). These supposedly extreme strand‐biased SNPs are most likely EMS‐induced mutations lacking sufficient read depth, because of the low‐coverage (6×) sequencing of the mutant genomes.
FIGURE 2
The binning approach to determine the variant‐quality‐score filtering threshold for the initial SAMtools‐predicted mutations in the 600 sorghum BTx623 mutants. The initially predicted SNPs are assigned to bin categories corresponding to their variant quality score. The EMS mutation spectrum for all SNPs in each bin category is calculated. The transition from the variant quality score of 11 (red arrow) to the variant‐quality‐score threshold of 12 (green arrow) shows the percentages of non‐G/C to A/T DNA substitutions decrease drastically, whereas G/C to A/T mutations become the dominant mutation type
FIGURE 3
The mutation spectrums for the likely EMS‐induced SNPs detected in the 600 sorghum BTx623 mutants. Using the SAMtools variant‐caller and sorghum reference genome versions 2.1 (shown in above figure as Sb_v2.1) and 3.1, 96.7% and 96% of detected SNPs were G/C to A/T mutations, respectively. Using sorghum reference genome version 3.1 and GATK with single individual genotyping and joint genotyping, 92.4% and 93.5% of SNPs were G/C to A/T mutations, respectively
FIGURE 4
The high concordance of high‐quality SNPs detected by the different variant‐calling approaches in the 600 sorghum mutants. A total of 3,274,606 (SAMtools), 3,514,943 (GATK with single individual genotyping), and 3,435,790 (GATK with joint genotyping; GATK‐Joint) filtered SNPs were predicted, and 3,074,759 of these SNPs were found by all three variant‐calling approaches. The corresponding percentages of likely canonical EMS‐induced DNA substitutions, for each category in the Venn diagram, are shown in parenthesis
The binning approach to determine the variant‐quality‐score filtering threshold for the initial SAMtools‐predicted mutations in the 600 sorghum BTx623 mutants. The initially predicted SNPs are assigned to bin categories corresponding to their variant quality score. The EMS mutation spectrum for all SNPs in each bin category is calculated. The transition from the variant quality score of 11 (red arrow) to the variant‐quality‐score threshold of 12 (green arrow) shows the percentages of non‐G/C to A/T DNA substitutions decrease drastically, whereas G/C to A/T mutations become the dominant mutation typeThe mutation spectrums for the likely EMS‐induced SNPs detected in the 600 sorghum BTx623 mutants. Using the SAMtools variant‐caller and sorghum reference genome versions 2.1 (shown in above figure as Sb_v2.1) and 3.1, 96.7% and 96% of detected SNPs were G/C to A/T mutations, respectively. Using sorghum reference genome version 3.1 and GATK with single individual genotyping and joint genotyping, 92.4% and 93.5% of SNPs were G/C to A/T mutations, respectivelyThe high concordance of high‐quality SNPs detected by the different variant‐calling approaches in the 600 sorghum mutants. A total of 3,274,606 (SAMtools), 3,514,943 (GATK with single individual genotyping), and 3,435,790 (GATK with joint genotyping; GATK‐Joint) filtered SNPs were predicted, and 3,074,759 of these SNPs were found by all three variant‐calling approaches. The corresponding percentages of likely canonical EMS‐induced DNA substitutions, for each category in the Venn diagram, are shown in parenthesis
Discovery of likely false‐negative shared variants
For multiple independently mutagenized individuals in a population, the pragmatic and very effective technique of subtracting shared variants could result in the unintended consequence of removing false‐negative shared variants. The false‐negative shared variants could originate from pollen contamination, sample mislabeling, and other sources of sample or library preparation errors. Using the initial unfiltered SAMtools‐predicted SNPs, we categorized all SNPs based on the number of mutant individuals in which they were detected (Figure 5a) and then computed the mutation spectrum for each category. A comparison of the G/C to A/T transition percentages for the various SNPs categories indicated a nontrivial signal for likely EMS‐induced SNPs shared by two or three mutants (Figure 5b). We next analyzed the initially subtracted shared SNPs from the earlier SNP filtering step, for evidence of true EMS‐induced mutations. Using our implemented clustering algorithm, we distinguished between localized clusters of mutant individuals with shared mutations (likely false negatives), from very large clusters of mutant individuals with global or near‐global shared variants (likely false positives). Overall, we identified 930,352 non‐empty clusters with an average 78 mutant individuals per cluster and an average of 1.27 shared mutations per cluster (Figure 6a; File S4). Ranking the clusters by the number of shared mutations within a cluster, the Top 43 clusters (Figure 6b) contained an average of 2831 shared mutations, and the highest ranked cluster (clust‐39453_2) contained 6837 mutations shared by two mutant individuals. Most of the top‐ranking clusters (41 out of 43) were from member size two (i.e., contained mutations shared by only two mutant individuals). Of the remaining two clusters, one cluster (clust‐930352_600) contained 4487 mutations found (or globally shared) in all 600 mutants, whereas the final of the Top 43 clusters contained 1238 mutations shared by a specific group of three mutants (clust‐125781_3). Excluding the mutations from the global cluster clust‐930352_600, the mutation spectrum for each set of recovered shared mutations for the 77 impacted mutants showed G/C to A/T percentages ranging from 96.7% to 99.7% (with an average of 98.7%), which suggests these are likely false‐negative (shared) EMS‐induced mutations (Figure S3). On the contrary, only 17.61% (473,979 of 2,692,200) of the SNPs originating from the 4487 genomic positions detected in all 600 mutants were G/C to A/T substitutions. In total, we recovered a total of 235,736 likely false‐negative shared SNPs originating from 117,249 distinct genomic positions, and these include 223,048 SNPs (~95%) having an SNP quality score of at least 12 (File S5). Excluding the detected shared false‐negative SNPs, the mutation spectrum from SNPs shared by two mutants showed a substantial reduction in canonical EMS‐induced mutations from 68% to 51% (Figure S4). This shows that, of the initially subtracted shared SNPs, 35% (232,022 of 661,156) of SNPs found shared between pairs of mutants, and only 1% (3,714 of 291,966) shared SNPs among three mutants, are likely bona fide EMS‐induced mutations. The remaining shared SNPs across most or all mutants are unlikely to be bona fide mutations, but likely represent genomic positions that are heterozygous in the parental genome and are segregating in the mutant population. Closer examination showed 28 of the 77 impacted mutants occurred as 14 clusters (pairs of mutants) that have nearly identical names and most likely have shared lineage from a common mutant ancestor (Table S2). In addition, a subset of 19 mutants were highly heterozygous with percent heterozygosity greater than 89% (Figure 6C; File S6). Fourteen of these 19 mutants were a subgroup of the 77 mutants in the Top 42 clusters. The high degree of heterozygosity is likely an indication of outcrossing in subsequent generations. Including the recovered and likely false‐negative shared SNPs with variant quality scores of at least 12, we retained a total of 3,497,654 likely EMS‐induced SNPs for further downstream analysis. The final retained SNPs consisted of 2,157,409 homozygous and 1,340,245 heterozygous mutations. Overall, the final SAMtools‐predicted SNPs represent an average of 5829 SNPs per mutant and about one mutation per 125 kb in the sorghum BTx623 genome.
FIGURE 5
Distribution of the initially predicted SNPs and their categories. The SNP categories are based on the number of times an SNP is detected in the mutant population. For example, Category 1 and Category 2 refer to mutations detected once and twice, respectively, in the mutant population. (a) The number of SNP predicted for each SNP category. (b) The percentages of G/C to A/T DNA substitutions for each SNPs category. The percentage of G/C to A/T DNA substitutions for Category 2 is unexpectedly higher than Category 1
FIGURE 6
The detection of groups of mutants with shared variants using a clustering algorithm. (a) A total of 930,352 clusters were identified, and these contained an average 78 mutant individuals per cluster, and an average of 1.3 shared mutation positions per cluster. Clusters with a minimum of 300 shared mutations are shown as blue dots. (b) The top 43 clusters contained an average of 2831 shared mutation positions, with 41 of the clusters defined by only pairs of mutant individuals. The number of mutant individuals in a cluster is appended to the end of the cluster ID name. Cluster clust‐930352_600 (shown in red box) contained 4487 mutation positions shared by all 600 mutants. Cluster clust‐125781_3 (shown in red box) also contained 1238 mutations shared by three mutants. (c) The percent homozygosity for all 600 sorghum mutants. Fourteen out of 19 highly heterozygous mutants (>89% heterozygosity) were from the top 42 clusters containing likely shared EMS‐induced mutations
Distribution of the initially predicted SNPs and their categories. The SNP categories are based on the number of times an SNP is detected in the mutant population. For example, Category 1 and Category 2 refer to mutations detected once and twice, respectively, in the mutant population. (a) The number of SNP predicted for each SNP category. (b) The percentages of G/C to A/T DNA substitutions for each SNPs category. The percentage of G/C to A/T DNA substitutions for Category 2 is unexpectedly higher than Category 1The detection of groups of mutants with shared variants using a clustering algorithm. (a) A total of 930,352 clusters were identified, and these contained an average 78 mutant individuals per cluster, and an average of 1.3 shared mutation positions per cluster. Clusters with a minimum of 300 shared mutations are shown as blue dots. (b) The top 43 clusters contained an average of 2831 shared mutation positions, with 41 of the clusters defined by only pairs of mutant individuals. The number of mutant individuals in a cluster is appended to the end of the cluster ID name. Cluster clust‐930352_600 (shown in red box) contained 4487 mutation positions shared by all 600 mutants. Cluster clust‐125781_3 (shown in red box) also contained 1238 mutations shared by three mutants. (c) The percent homozygosity for all 600 sorghum mutants. Fourteen out of 19 highly heterozygous mutants (>89% heterozygosity) were from the top 42 clusters containing likely shared EMS‐induced mutations
SNP impact annotation
Using the snpEff software, we annotated 10,263 distinct high impact SNP positions affecting 7979 distinct sorghum genes. This includes 7217 stop‐gained (nonsense) mutations and represent averages of 17 high impact SNPs and 12 stop‐gained mutations per mutant (Table 3). A total of 136,639 distinct missense SNPs were predicted to have a moderate impact on gene function in 29,835 distinct sorghum genes, which represents an average of 228 missense mutations per mutant. A total of 83,166 SNPs were synonymous (silent) substitutions. Overall, 30,285 distinct sorghum genes were impacted by moderate‐ to high‐impact SNPs. The remaining SNP effect annotations include 293,332 and 2,884,020 SNPs derived from introns and intergenic regions, respectively (File S7). Using the SIFT program and the UNIREF90 protein database, analysis of the non‐distinct 291,043 missense SNPs (overlapping various isoforms of gene transcripts) showed 106,520 were likely deleterious, 90,892 were likely tolerated, and the remaining 93,631 were indeterminate (File S8). The detailed annotation of all the high‐ and medium‐impact SNPs, including the description of the impacted genes has been included in the analysis (File S9).
TABLE 3
Summary predicted impact of likely EMS‐induced SNPs on gene function for the 600 sorghum BTx623 mutants
SNP impact type
Total
Mean
High impact
10,263
17
Stop gained
7217
12
Splice site donor
1370
2
Splice site acceptor
1432
2
Stop lost
19
0
Start lost
226
0
Moderate impact
136,639
228
Missense
136,639
228
Low impact
98,623
164
Silent
83,166
139
Start gained
7491
12
Intron variant
293,332
489
Intergenic region
2,884,020
4,807
Modifier
3,417,516
5,696
Summary predicted impact of likely EMS‐induced SNPs on gene function for the 600 sorghum BTx623 mutants
SNP confirmation
In our previous study using the sorghum BTx623 reference genome version 2.1, we cataloged 1,753,403 likely EMS‐induced SNPs, of which 96.7% were G/C to A/T substitutions (Addo‐Quaye et al., 2018). Low‐coverage resequencing of the M5 generation from 29 randomly selected individuals in the mutation population confirmed 96% of the homozygous SNPs present in the resequencing. Including false‐negative shared SNPs, our current study predicted 3,526,144 likely EMS‐induced SNPs in the sorghum reference genome version 2.1. Comparing the two SNPs datasets, we were able to confirm 96% (1,674,668 out of 1,753,403) of the SNPs from the previous study (File S10). The remaining 4% (78,735) SNPs lacking confirmation in the current study could be attributed to differences in bioinformatic software versions used in the previous and current studies. Although the in silico confirmation supported our approach, it does not necessarily represent a truly independent confirmation of our results. An independent confirmation would require resequencing a select few mutant individuals using Sanger sequencing to confirm the predicted mutations.
DISCUSSION
Although several mutant resources are currently available for sorghum genetics (Addo‐Quaye et al., 2018; Jiao et al., 2016; Xin et al., 2008), gene function discovery in sorghum and other crop species would greatly benefit from improvements in computational techniques for the precise detection of DNA mutations (Boyles et al., 2019). Several techniques have been suggested for filtering false positives, and these include subtraction of shared variants, applying prescribed variant‐quality‐score thresholds, and the removal of variant calls associated with either extreme strand‐bias reads or excessive read coverage depth (Flibotte et al., 2010; James et al., 2013; Nordström et al., 2013; Robasky et al., 2014; Sarin et al., 2010; Uchida et al., 2011; Van der Auwera et al., 2013). The inappropriate application of these best practices could result in the exclusion of bona fide DNA variants as false negatives or the retention of false‐positive detections. As an example, in our previous studies, using the variant‐quality‐score threshold of 20 in the initial filtering of SNPs showed marginal improvement (Addo‐Quaye et al., 2018, 2017; Thapa et al., 2017), whereas in another study, a quality‐score threshold of 100 was effective in the detection bona fide fast neutron‐induced mutations in rice (Li et al., 2017). In the specific case of EMS mutagenesis, the likelihood of detecting the same mutation in two or more independently mutagenized individuals in a population is very low, because EMS action impacts genomic positions randomly. Also, in comparison with other mutagens such as fast neutrons, EMS mutagenesis results in predominantly G/C to A/T DNA transitions (Greene et al., 2003; Loveless, 1969) and hence is limited to a lower diversity in categories of genome target sites (Addo‐Quaye et al., 2018; Henry et al., 2014; Li et al., 2017). However, the well‐known EMS mutation spectrum can be leveraged in determining the suitability or effectiveness of any of the abovementioned false‐positive variant filtering techniques for specific studies. Guided by these well‐described characteristics, and the expected mutation spectrum induced by EMS, we proposed and implemented an improved binning‐based approach for detecting induced mutations in the low‐coverage NGS sequencing data for the M3 generation from 600 EMS‐mutagenized sorghum individuals. Using the proposed approach, we empirically determined the phred‐scale variant‐quality‐score threshold precisely when the retained SNPs in the mutant population were predominantly G/C to A/T substitutions. Our results also suggests that using a phred‐scale variant‐quality‐score threshold of 20 for SNPs filtering in the previous study (Addo‐Quaye et al., 2018) may have been too conservative and probably caused widespread false negatives. A previous benchmark study using high‐confidence variants showed ~92% concordance among three variant callers using Illumina exome sequencing datasets (Hwang et al., 2015). Applying our approach to both GATK‐ and SAMtools‐based variant‐calling algorithms, our analysis similarly revealed a high concordance with 94% of SAMtools SNPs overlapping with GATK SNPs at a whole‐genome mutant population scale. Despite the high similarity in results, there were some differences between the SAMtools and GATK algorithms. Firstly, the initial subtraction of shared variants had an immediate desirable impact on the GATK‐based approaches. This was due to a comparatively lower default minimum variant‐quality‐score setting of 3.0 in SAMtools, which resulted in the retention of numerous poor‐quality variants, whereas the default setting for GATK was relatively higher (10.0). Secondly, estimating the phred‐scale variant‐quality‐score threshold for SAMtools was relatively straightforward. Thirdly, the vast majority of GATK‐specific SNPs that were not predicted by SAMtools were less likely to be EMS‐induced mutations. These differences in characteristics reflect the underlying assumptions in the two variant‐calling algorithms (DePristo et al., 2011; Li, 2011). Although subtraction of shared variants is a very potent technique, it could inadvertently result in bona fide false‐negative mutations being removed. Pollen contamination can be a source of shared variants in a sequenced population. Our previous approach to uncovering false‐negative shared variants involved retaining shared G/C to A/T substitutions with high variant quality score that were detected in up to two mutants (Addo‐Quaye et al., 2018). Using our proposed clustering algorithm proved to be a more comprehensive approach to detecting clusters of false‐negative shared variants in the mutant population. Previous studies had shown SNPs associated with extreme strand‐biased reads are a potential source of false positives in Illumina NGS sequencing data (Guo et al., 2012). Next, we investigated the evidence of extreme strand‐bias impact in the filtered SNP dataset. Our results showed most of the strand‐biased SNPs were likely EMS‐induced mutations. This suggests that for low‐coverage whole‐genome NGS sequencing datasets, the lack of sufficient read depth might result in artificial extreme strand‐biased reads mapping at some SNP positions and can result in these SNPs being discarded as false positives. Previous studies revealed that known variants that alter phenotypes usually corroborate their preliminary prediction as deleterious mutations, and as such, computational predicted deleterious mutations may likely alter phenotypes (Kono et al., 2016). Our analysis predicted 106,520 missense SNPs to be likely deleterious and hence likely to be candidates with phenotype‐altering potential. In summary, as a contribution to the ever‐growing sorghum genetic resources (Gladman et al., 2022), we have created an improved large‐scale mutant resource for sorghum genetics and created a light‐weight web portal for querying the mutant database with seamless links for requesting seed stock for the collections in the Germplasm Resources Information Network (GRIN) database.
MATERIALS AND METHODS
NGS sequence retrieval and reads mapping
In our previous study, sorghum BTx623 seeds were mutagenized using EMS, and after multiple generations of self‐fertilization, DNA was extracted and sequenced from leaf tissues obtained from the M3 generation (Addo‐Quaye et al., 2018). The Illumina next‐generation sequencing data for the M3 generation from 600 EMS‐mutagenized sorghum mutants were retrieved from the NCBI Sequence Read Archive (SRA Accession Number SRP065118) by using the fastq‐dump program (Version 2.9.0) with non‐default parameters: "‐‐split‐3" (Kodama et al., 2012; Leinonen et al., 2010). We retrieved the reference genome sequences and annotations for the sorghum BTx623 genome assembly versions 2.1 and 3.1, from the Phytozome web portal (Goodstein et al., 2012). Prior to NGS reads mapping, the reference genome sequences were indexed using index command in the BWA short reads aligner (Version 0.7.15) (Li & Durbin, 2009). The NGS reads for each sequenced sample were mapped to the sorghum reference genome using the BWA mem command with non‐default parameter: "‐t 64".
SAMtools‐based variant‐calling
The SAMtools variant‐calling was performed using Version 1.8 of the software package (Li et al., 2009). Prior to variant‐calling, the sorghum reference genome sequences were indexed using the SAMtools faidx command with default parameters. The BWA‐mapped NGS reads were sorted using the SAMtools sort command with non‐default parameters: "‐@ 64 ‐m 2G ‐O bam ‐T". Duplicate NGS reads were identified by using GATK MarkDuplicates command (Version 4.0.5.0) with default parameters (McKenna et al., 2010). The resulting BAM‐formatted files were indexed using the index command with non‐default parameters: "‐b ‐@ 64". The genotype likelihoods were computed using the BCFtools mpileup command (Version 1.8) with non‐default parameter: "‐Ou". Final variant detection was performed using the BCFtools call command with non‐default parameters: "‐vmO z". Based on the SAMtools zygosity call encoding in the output variant call files, SNPs were characterized as either homozygous or heterozygous.
GATK‐based per‐sample single individual variant‐calling
For the GATK‐based variant‐calling, the sorghum reference genome sequences were indexed using the GATK CreateSequenceDictionary command (V4.0.5.0) with default parameters (DePristo et al., 2011). Using the above BWA‐sorted and indexed NGS reads, the initial GATK single sample variant‐calling for single individuals was performed by invoking the GATK HaplotypeCaller with non‐default parameters: '‐‐java‐options "‐Xmx4g" HaplotypeCaller'.
GATK‐based joint genotyping
For joint genotyping with GATK, the initial steps were similar to the above described GATK approach for per‐sample individual variant calling, with the exception of invoking the GATK HaplotypeCaller with non‐default parameters: '‐‐java‐options "‐Xmx4g" HaplotypeCaller ‐ERC GVCF'. Next, we created two additional input files: The first was the map file that contained the sequenced sample ID names and their corresponding individual genotyping call filenames. The second file contained the list of chromosome and super‐scaffold names extracted from the reference genome sequence assembly file. Specifying these two input files, we executed the GATK GenomicsDBImport command with non‐default parameters: '‐‐java‐options "‐Xmx4g" ‐‐batch‐size 100 ‐‐reader‐threads 50' to merge the individual sample variant call files, which were then subsequently imported into a GenomicsDB workspace. Joint genotype calling was performed using the GATK GenotypeGVCFs command with non‐default parameters: "‐G StandardAnnotation ‐new‐qual".
Variants filtering
For both SAMtools‐based and GATK‐based approaches, all the initially predicted SNPs from the 600 mutants were combined and sorted by their genomic positions. Using this information, all variants detected at the same genomic position in multiple mutant individuals were subtracted (Uchida et al., 2011). We calculated the mutation spectrum for the retained unique SNPs to check the effect of the likely false‐positive subtraction. We next empirically determined the phred‐scaled variant‐quality‐score threshold for further SNPs filtering by using a binning method as follows. All the retained SNPs from the shared variants subtraction step were sorted into bin categories corresponding to their respective variant quality scores, and SNPs with the same quality‐score were assigned to the same bin category. With emphasis on the percentage of mutations that were G/C to A/T substitutions, we plotted the mutation spectrum for each quality‐score bin category. Through visual inspection by starting from the lowest to the highest phred‐scaled quality‐score bin category, we determined the threshold variant quality score, at which the percentage of SNPs that were G/C to A/T transitions approached the expected results of EMS action. The subset of SNPs with variant quality score less than the threshold score were filtered. We separately estimated the SNPs quality‐score thresholds for the SAMtools‐ and GATK‐based variant‐calling results.
Detection of likely false‐negative variants
We implemented a clustering algorithm for detecting and recovering likely false‐negative shared variants that were subtracted in the initial SNPs filtering step. We hypothesize that if a set of mutants are genetically related, then a sufficiently large number of (shared) mutations would be detected exclusively in the specific set of mutants and absent in the remaining unrelated mutants in the sequenced population. The SAMtools output variant call files for all 600 mutants were combined into a single file and sorted using the genome co‐ordinates from the SNPs. Using the combined SNP datasets, we generated an SNP summary table containing the following information for each distinct SNP present in the merged file: the genome coordinate of the SNP, the number of occurrences in the mutation population, and a corresponding list of mutants containing the SNP. For example, the summary SNP table entry " <3> " refers to the predicted SNP at genome position 23,080,886 on chromosome 4, which was detected in three mutants with corresponding sample SRA ID names SRR2759335, SRR2759504, and SRR2759646. Hence, the shared SNP at genomic position Chr04:23080886 was detected (predicted) only in the above‐listed set of three mutants. Next, for each set of listed mutants associated with an SNP entry in the summary table, we search the summary table to determine the number of additional shared SNPs that were detected only within the same listed set of mutants. For example, in the case of the above‐listed set of three mutants, we examine the summary table entries for any additional shared SNP that was predicted exactly three times in the sequenced population and having SRR2759335, SRR2759504, and SRR2759646 as the corresponding list of mutants. Each set of mutants (e.g., SRR2759335, SRR2759504, and SRR2759646) represents a candidate cluster that is assigned a cluster ID number, and their corresponding number of shared SNP positions and the percentage of mutations that were G/C to A/T substitutions are recorded. Next, the detected clusters were sorted and ranked based on the number of shared mutations in each cluster. The top‐ranked clusters with markedly high number of shared mutation counts were retained, and their corresponding percentages of G/C to A/T substitution were used as benchmark evidence of likely EMS‐induced origins (i.e., likely false‐negative shared SNPs). The identified likely false negatives were then added to the unique (non‐shared) filtered SNPs. For each affected mutant individual, the merging of the likely false negatives and the likely true positives was accomplished by appending the shared variants file to their corresponding filtered unique variants file and resorting the resulting variant call file by using the BEDTools sort command (Version v2.26.0) with the non‐default parameter "‐faidx chromNames" (Quinlan & Hall, 2010).We used the snpEff variant annotation package version 4.3t (Cingolani et al., 2012) to predict the impact of the detected SNPs on gene function. We created a custom snpEff database for the sorghum reference genome version 3.1 as follows. We edited the snpEff configuration file to include an entry for sorghum genome version 3.1. Next, we added the sorghum genome reference sequences, genome annotation .gff files, and the protein sequences to the required subfolders in the snpEff package, and we run the snpEff build command with non‐default option "‐gff3 ‐v Sbicolor_454_v3.0.1". Using the variant call files as input, the variant impact annotation was performed by using the snpEff effcommand.
SIFT annotation of missense SNP effect
For the final set of retained SNPs, we selected all missense substitutions, including the genome position and the DNA base substitution information from their corresponding variant call files. Next, we retrieved the subset of sorghum protein sequences which contained the predicted missense mutations. We next downloaded the UNIREF90 set of protein sequences (109,653,977 sequences in release version 2020_02; 22‐Apr‐2020) from the UniProt web portal (Consortium, 2014). Using the impacted sorghum protein sequences and the UNIREF90 FASTA sequences as query and database options, respectively, we used the SIFT4G algorithm (Vaser et al., 2016) with non‐default parameter "‐t 64" to predict the likelihood of the missense substitutions having a tolerated or deleterious effect in the genome. Based on the SIFT analysis results, each missense variant was assigned one of the following designations: TOLERATED, DELETERIOUS, or UNKNOWN. For all SNPs with predicted moderate or high impact on gene function, we appended the SIFT predictions and the gene function description from the sorghum reference genome version 3.1.
CONFLICT OF INTEREST
The Authors did not report any conflict of interest.
AUTHOR CONTRIBUTIONS
C.A.‐Q, C.W., M.T., and E.M.B. designed the experiments. J.M.S., M.Y.B., A.T.S., D.A., and C.A.‐Q. performed the variant‐calling and annotation. T.C.H. implemented the webserver, and Y.K. implemented the databases. C.K., D.L., and C.A.‐Q. performed the clustering analysis. C.A.‐Q. drafted the manuscript. C.A.‐Q., E.M.B., C.W., and M.T. edited the manuscript.Figure S1. The binning approach to determine the variant quality‐score filtering threshold for the initial GATK‐predicted single individual genotyping mutations in the 600 sorghum BTx623 mutants. The initially predicted SNPs are assigned to bin categories corresponding to their variant‐quality score. The EMS mutation spectrum for all SNPs in each bin category is calculated. At a variant‐quality threshold of 26, the percentages of non‐G/C to A/T DNA substitutions decrease drastically, while G/C to A/T mutations become the dominant mutation type.Click here for additional data file.Figure S2. The binning approach to determine the variant quality‐score filtering threshold for the initial GATK‐predicted joint genotyping mutations in the 600 sorghum BTx623 mutants. The initially predicted SNPs are assigned to bin categories corresponding to their variant‐quality score. The EMS mutation spectrum for all SNPs in each bin category is calculated. At a variant‐quality threshold of 28, the percentages of non‐G/C to A/T DNA substitutions decrease drastically, while G/C to A/T mutations become the dominant mutation type.Click here for additional data file.Figure S3. The mutation spectrums for the SAMtools‐predicted extreme strand‐biased SNPs and the likely false‐negative shared SNPs detected by the clustering algorithm. Almost 99% (232,662) of the 235,736 likely false‐shared SNPs and 94% (290,110 SNPs) of the 307,434 extreme strand‐biased SNPs were G/C to A/T substitutions, respectively.Click here for additional data file.Figure S4. The impact of the exclusion of likely false‐negative shared SNPs on the percentages of G/C to A/T DNA substitutions for the categories of initially predicted. The percentage of G/C to A/T DNA substitutions for category 2 is lower than for category 1.Click here for additional data file.Table S1. The mutation spectrums for the final filtered GATK‐specific SNPs that were not detected by SAMtools.Table S2. List of clusters containing pairs of sorghum mutant individuals with nearly identical names.Click here for additional data file.File S1. The combined variant call files for the filtered SNPs detected in GATK single individuals genotyping and not detected by the SAMtools variant‐caller.File S2. The combined variant call files for the filtered SNPs detected in GATK joint genotyping and not detected by the SAMtools variant‐caller.File S3. The subset of SAMtools variant‐calling result files for the filtered extreme strand‐biased SNPs.File S4. The 930,352 identified clusters of mutant individuals with shared SNPs from SAMtools variant‐calling.File S5. The SAMtools variant call files containing only the likely false‐negative shared SNPs for the 77 impacted mutant individuals.File S6. The percent homozygosity for the 600 sorghum mutants.File S7. The snpEff‐annotated variant call files for the likely EMS‐induced mutations in all 600 sorghum mutants.File S8. The SIFT prediction results for the likely EMS‐induced nonsynonymous substitutions in sorghum genes of the 600 mutants.File S9. The final detailed SNP annotations for the likely EMS‐induced high‐impact and medium impact SNPs in the 600 sorghum mutants.File S10. In silico confirmation of SNPs detected in previous study.Click here for additional data file.
Authors: Thomas J Y Kono; Fengli Fu; Mohsen Mohammadi; Paul J Hoffman; Chaochih Liu; Robert M Stupar; Kevin P Smith; Peter Tiffin; Justin C Fay; Peter L Morrell Journal: Mol Biol Evol Date: 2016-06-14 Impact factor: 16.240
Authors: Stefanie Griebel; Richard P Westerman; Adedayo Adeyanju; Charles Addo-Quaye; Bruce A Craig; Clifford F Weil; Suzanne M Cunningham; Bhavesh Patel; Osvaldo H Campanella; Mitchell R Tuinstra Journal: Theor Appl Genet Date: 2019-10-17 Impact factor: 5.699
Authors: Jared M Simons; Tim C Herbert; Coleby Kauffman; Marc Y Batete; Andrew T Simpson; Yuka Katsuki; Dong Le; Danielle Amundson; Elizabeth M Buescher; Clifford Weil; Mitch Tuinstra; Charles Addo-Quaye Journal: Plant Direct Date: 2022-05-25