Glenn T Howe, Keith Jayawickrama, Scott E Kolpak, Jennifer Kling, Matt Trappe, Valerie Hipkins, Terrance Ye, Stephanie Guida, Richard Cronn, Samuel A Cushman, Susan McEvoy.
Abstract
BACKGROUND: In forest trees, genetic markers have been used to understand the genetic architecture of natural populations, identify quantitative trait loci, infer gene function, and enhance tree breeding. Recently, new, efficient technologies for genotyping thousands to millions of single nucleotide polymorphisms (SNPs) have finally made large-scale use of genetic markers widely available. These methods will be exceedingly valuable for improving tree breeding and understanding the ecological genetics of Douglas-fir, one of the most economically and ecologically important trees in the world.
Year: 2020 PMID: 31900111 PMCID: PMC6942338 DOI: 10.1186/s12864-019-6383-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Flow chart of steps used to select SNPs for the Axiom genotyping array. SNPs on the Axiom array were selected from the Oregon State University (OSU) dataset described by Howe et al. [17] and the University of Hohenheim (UH) dataset described by Müller et al. [32]. ‘Discovered SNPs’ are the starting SNPs and isotigs from each dataset. Isotigs are transcript variants assembled using the Newbler de novo assembler. ‘Novel SNPs’ are SNPs in novel UH transcripts, which are transcripts missing from the OSU transcriptome [17]. ‘High-confidence SNPs’ are OSU SNPs with a target SNP probability (PS) < 0.001 or UH SNPs detected by 2 or 3 SNP detection programs. ‘Infinium genotyped SNPs’ are OSU SNPs previously genotyped using an Infinium genotyping array [17]. ‘Evaluated SNPs’ are the SNPs evaluated for suitability of flanking sequences. ‘Buildable SNPs’ are SNPs with at least one 35-nt flanking sequence with no other (i.e., non-target) high-confidence SNPs or indels. ‘Total buildable SNPs’ are the combined OSU and UH SNPs that were ranked for inclusion on the Axiom array using the variables described in Table 2.
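The ‘Buildable SNPs’ criterion in the caption above (at least one 35-nt flank free of other high-confidence SNPs or indels) can be illustrated with a minimal sketch. The function names, the 0-based coordinate convention, and the rule that a flank running off the end of the transcript is unusable are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
def free_flanks(snp_pos, other_variants, seq_len, flank=35):
    """Report which flanks of a target SNP contain no other variants.

    Assumptions (not from the source): positions are 0-based indices on the
    transcript, and a flank that runs off either end of the sequence is
    treated as unusable.
    """
    others = set(other_variants) - {snp_pos}
    windows = {
        "left": range(snp_pos - flank, snp_pos),
        "right": range(snp_pos + 1, snp_pos + flank + 1),
    }
    result = {}
    for side, window in windows.items():
        in_bounds = window.start >= 0 and window.stop <= seq_len
        clean = not any(pos in others for pos in window)
        result[side] = in_bounds and clean
    return result


def is_buildable(snp_pos, other_variants, seq_len, flank=35):
    """A SNP is buildable if at least one flank is free of other variants."""
    return any(free_flanks(snp_pos, other_variants, seq_len, flank).values())


# Hypothetical example: a target SNP at position 100 on a 500-nt transcript,
# with other high-confidence variants at positions 80 and 140.
print(free_flanks(100, [80, 140], 500))
print(is_buildable(100, [80, 140], 500))
```

In this hypothetical case the left flank is blocked by the variant at position 80, while the right flank (positions 101 to 135) is clean, so the SNP would count as buildable under these assumptions.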
Transcript and probeset ranking variables versus genotyping success using an Axiom genotyping array
| Variable | No. of probesets | Category or mean | Percent or mean: Success | Percent or mean: Fail | Number: Success | Number: Fail |
|---|---|---|---|---|---|---|
| Transcript ranking variablesa | | | | | | |
| No. of hits to scaffoldsb (transcript mean) (v0.5) | 58,350 | 1 | 58.5 | 41.5 | 18,745 | 13,286 |
| | | > 1 | 41.5 | 58.5 | 9403 | 13,242 |
| | | 0 | 27.5 | 72.5 | 1011 | 2663 |
| Transcript confidence scoreb (absent for UH SNPs) | 54,625 | Higher | 55.8 | 44.2 | 13,987 | 11,087 |
| | | Lower | 49.6 | 50.4 | 14,663 | 14,888 |
| No. of SNPs per transcriptc | 58,350 | | | | | |
| | | Q3 | 56.2 | 43.8 | 9202 | 7173 |
| | | Q1 | 43.5 | 56.5 | 7375 | 9570 |
| Combined rankc (transcripts) | 58,350 | | | | | |
| | | Q1 | 52.5 | 47.5 | 7659 | 6930 |
| | | Q3 | 35.7 | 64.3 | 5214 | 9375 |
| Probeset-within-transcript ranking variables | | | | | | |
| Infinium successb,d | 6173 | SNP success | 74.5 | 25.5 | 4598 | 1575 |
| Probability of flanking SNPsb,e | 58,350 | Low | 50.8 | 49.2 | 27,732 | 26,844 |
| | | Moderate | 37.8 | 62.2 | 1427 | 2347 |
| No. of perfect allelesb (percent identity = 100%) (v0.5) | 58,350 | 1 | 53.5 | 46.5 | 23,916 | 20,799 |
| | | 0 | 39.2 | 60.8 | 5042 | 7810 |
| | | 2 | 25.7 | 74.3 | 201 | 582 |
| pConvertc | 57,381 | | | | | |
| | | Q3 | 57.7 | 42.3 | 8319 | 6087 |
| | | Q1 | 41.5 | 58.5 | 6429 | 9059 |
| Target SNP probabilityb,f (OSU SNPs) | 53,958 | | 55.0 | 45.0 | 24,600 | 20,138 |
| | | | 39.7 | 60.3 | 3658 | 5562 |
| Target SNP probabilityb (UH SNPs) | 3725 | 3 programs | 23.3 | 76.7 | 128 | 422 |
| | | 2 programs | 12.0 | 88.0 | 381 | 2794 |
| Final rankc,g (transcripts and probesets-within-transcripts) | 58,350 | | | | | |
| | | Q1 | 61.5 | 38.5 | 8966 | 5622 |
| | | Q3 | 46.6 | 53.4 | 6800 | 7788 |
| Other variables | | | | | | |
| Recommendationb,h | 57,295 | Recommended | 54.7 | 45.3 | 17,779 | 14,748 |
| | | Neutral | 43.2 | 56.8 | 10,691 | 14,078 |
a Transcripts refer to the Newbler isotigs [17] or putative transcripts [32] used for SNP discovery. v0.5 is version 0.5 of the Douglas-fir reference genome. UH SNPs were those detected by Müller et al. [32], whereas OSU SNPs were those detected by Howe et al. [17].
b For the categorical variables, percentages and numbers of probesets are reported for each category and means are absent. All differences among categories were highly significant (P < 0.0001) using a likelihood ratio chi-square test.
c For the ranks and continuous variables, means are reported in bold, and percentages and numbers of probesets are reported for the upper (Q3) and lower (Q1) quartiles. Categories are ranked by probeset success. Successful SNPs were those that had a call rate > 60% and were polymorphic. All differences between means were highly significant (P < 0.0001) using a t-test (non-rank variables) or a Wilcoxon rank test (Combined rank and Final rank variables).
d For SNPs successfully genotyped with the Infinium platform, Axiom probeset success (74.5%) was significantly greater than the overall probeset success rate of 50.0% (P < 0.0001).
e Low (rank = 1) or moderate (rank = 2) chance of having flanking SNPs or indels.
f The P < 0.001 category indicates that 0.0001 ≤ P < 0.001.
g The final probeset rank was based on the combined transcript rank plus the probeset-within-transcript variables.
h The Affymetrix Recommendation variable was not used to select probesets because it is a categorical variable derived from pConvert.
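As a rough illustration of the likelihood ratio chi-square test mentioned in the footnotes above, the sketch below computes the G statistic for the success/fail counts in the ‘No. of hits to scaffolds’ rows of the table. It is a plain-Python sketch and not the authors' analysis; the function names are hypothetical, and the closed-form p-value used here is valid only for a chi-square distribution with 2 degrees of freedom.

```python
import math


def g_statistic(table):
    """Likelihood ratio chi-square (G) statistic for an r x c table of counts.

    Returns (G, degrees of freedom). Cells with zero counts contribute zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                total += observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return 2.0 * total, df


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x / 2.0)


# Success and fail counts for probesets with 1, > 1, and 0 hits to scaffolds.
hits_to_scaffolds = [
    [18745, 13286],  # 1 hit
    [9403, 13242],   # > 1 hit
    [1011, 2663],    # 0 hits
]

g, df = g_statistic(hits_to_scaffolds)
print(f"G = {g:.1f}, df = {df}, P < 0.0001: {chi2_sf_df2(g) < 0.0001}")
```

For tables with other degrees of freedom, a statistics library would be needed for the p-value; the G statistic itself is computed the same way.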
Percentages of successful SNPs using an Axiom genotyping array in Douglas-fir
| SNP categoryb | Final SNP call rate thresholda: Default, 97% | Rescue, 90% | Rescue, 80% | Rescue, 70% | Rescue, 60% | Affymetrix abbreviation [ |
|---|---|---|---|---|---|---|
| Off-target variant | 1 | 1 | 1 | 1 | 1 | OTV |
| Other | 30 | 29 | 26 | 24 | 23 | Other |
| Call rate below threshold | 8 | 3 | 2 | 2 | 2 | CallRateBelowThreshold |
| Not Converted | 40 | 34 | 30 | 27 | 26 | OTV + Other + CallRateBelowThreshold |
| No minor homozygote | 13 | 13 | 13 | 13 | 13 | NoMinorHom |
| Monomorphic high resolution | 16 | 16 | 16 | 16 | 16 | MonoHighResolution |
| Polymorphic high resolution | 31 | 31 | 31 | 31 | 31 | PolyHighResolution |
| Rescued | – | 6 | 10 | 13 | 13 | Rescued from Other and CallRateBelowThreshold |
| Convertedc | 60 | 66 | 70 | 73 | 74 | PolyHighResolution + NoMinorHom + MonoHighResolution + Rescued |
| Percent successful (population ave) | 31.5 | 37.5 | 41.6 | 44.0 | 44.9 | PolyHighResolution + Rescued |
| Number successful (population ave) | 17,555 | 20,926 | 23,223 | 24,548 | 25,037 | PolyHighResolution + Rescued |
| Percent successful (population sum) | 37.1 | 42.9 | 46.9 | 49.5 | 50.4 | PolyHighResolution + Rescued |
| Number successful (population sum) | 20,669 | 23,917 | 26,180 | 27,616 | 28,094 | PolyHighResolution + Rescued |
a We applied QC thresholds in one or two phases of analysis. The Default protocol consisted of the default Affymetrix parameters, including a CR threshold of 97%. In the Rescue protocols, we used the Default thresholds for phase 1, but then applied lower CR thresholds (60–90%) to the Other and CallRateBelowThreshold categories in phase 2.
b SNPs (N = 55,766) were classified into six categories (OTV, Other, CallRateBelowThreshold, NoMinorHom, MonoHighResolution, PolyHighResolution) and one Rescued category. Successful SNPs were those that were polymorphic with a call rate (CR) exceeding the indicated CR threshold after one or two phases of analysis with alternative quality control (QC) thresholds. Table values are averages from two populations (C1/I1 and C2) that were analyzed separately, except for the ‘population sum’ rows, which are based on sums. The C1/I1 population consisted of coastal Douglas-fir (N = 1682) and interior Douglas-fir (N = 12) samples that passed QC thresholds and were analyzed together. The C2 population consisted of coastal Douglas-fir (N = 348) samples that passed QC thresholds and were analyzed independently.
c Converted SNPs were those that were successfully assayed using the Default or Rescue protocol, but not necessarily polymorphic.
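The ‘Percent successful’ rows in the table above can be recovered from the ‘Number successful’ rows and the total of 55,766 SNPs given in the footnotes. The short sketch below performs that arithmetic; the variable and function names are illustrative only.

```python
N_SNPS = 55_766  # total SNPs classified, from the table footnotes

# Number of successful SNPs at each final call rate threshold (percent).
number_successful = {
    "population ave": {97: 17_555, 90: 20_926, 80: 23_223, 70: 24_548, 60: 25_037},
    "population sum": {97: 20_669, 90: 23_917, 80: 26_180, 70: 27_616, 60: 28_094},
}


def percent_successful(n_successful, n_total=N_SNPS):
    """Percentage of SNPs that were successful, rounded to one decimal."""
    return round(100.0 * n_successful / n_total, 1)


for row, counts in number_successful.items():
    percents = {cr: percent_successful(n) for cr, n in counts.items()}
    print(row, percents)
```

Note that the whole-number category percentages in the upper rows of the table are rounded, so sums such as OTV + Other + CallRateBelowThreshold may differ from the reported ‘Not Converted’ value by one percentage point.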
Fig. 2 SNP performance and population genetic statistics versus SNP call rate threshold in Douglas-fir. Using all related and unrelated trees in the study, we identified polymorphic SNPs using SNP call rate (CR) thresholds of 60, 70, 80, 90, and 97%. These successful SNPs were then tested on two populations of unrelated trees (N = 112 for C1 and N = 283 for C2). The values in the figure are median values averaged across the two populations for SNPs that were polymorphic and in HWE (P ≥ 0.01). CR is the measured SNP call rate (percent/100), HETobs is observed heterozygosity, PIC is polymorphic information content, MAF is minor allele frequency, and SNPs are the numbers of polymorphic SNPs in HWE. The scale on the right vertical axis shows the number of SNPs (dashed line), whereas the scale on the left is for all other variables (solid lines).
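The per-SNP statistics named in the caption above can be computed from genotype counts. The following is a minimal sketch for a single biallelic SNP using the standard textbook definitions. The source does not say which HWE test was used, so the chi-square goodness-of-fit test with 1 degree of freedom is an assumption here, and the function name is hypothetical.

```python
import math


def snp_stats(n_aa, n_ab, n_bb, n_missing=0):
    """Call rate, MAF, observed heterozygosity, PIC, and an HWE p-value.

    Assumes a biallelic SNP with at least one called genotype. The HWE
    p-value comes from a 1-df chi-square goodness-of-fit test, which is an
    illustrative choice and not necessarily the test used in the study.
    """
    n_called = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n_called)  # frequency of allele A
    q = 1.0 - p                             # frequency of allele B
    # PIC for two alleles: 1 - sum(p_i^2) - 2 * p^2 * q^2
    pic = 1.0 - (p ** 2 + q ** 2) - 2.0 * p ** 2 * q ** 2
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n_called, 2 * p * q * n_called, q * q * n_called)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return {
        "call_rate": n_called / (n_called + n_missing),
        "maf": min(p, q),
        "het_obs": n_ab / n_called,
        "pic": pic,
        # Upper tail of a chi-square distribution with 1 degree of freedom.
        "hwe_p": math.erfc(math.sqrt(chi2 / 2.0)),
    }


# Hypothetical genotype counts for one SNP in a sample of 125 trees.
stats = snp_stats(n_aa=50, n_ab=40, n_bb=10, n_missing=25)
print(stats)
```

With these made-up counts, the SNP would pass a 60, 70, or 80% call rate threshold but fail the 90 and 97% thresholds, which is the kind of SNP the Rescue protocols recover.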
SNP ranking variables versus genotyping success using an Axiom genotyping array and stepwise logistic regression
| Variable | DF | Array design variables (ROC area = 0.6449)a: Step entered | Array design variables: Chi-square statistic | Array design variables: Chi-square probability | Final selected variables (ROC area = 0.6781)a: Step entered | Final selected variables: Chi-square statistic | Final selected variables: Chi-square probability |
|---|---|---|---|---|---|---|---|
| Scaffold PID (best-hit – second-best hit) (v1.0)b | 1 | – | – | – | 1 | 4557.23 | < 0.0001 |
| No. of hits to scaffolds (transcript mean) (v0.5)c,d | 2 | 1 | 1531.38 | < 0.0001 | – | – | – |
| Target SNP probability | 1 | 3 | 642.62 | < 0.0001 | 2 | 588.16 | < 0.0001 |
| pConvert | 1 | 2 | 730.04 | < 0.0001 | 3 | 291.26 | < 0.0001 |
| Number of perfect alleles (PID = 100%) (v0.5)c | 2 | 4 | 302.18 | < 0.0001 | – | – | – |
| Number of SNPs per transcriptd | 66 | 5 | 285.60 | < 0.0001 | – | – | – |
| Number of hits to singletons (v1.0)b | 2 | – | – | – | 4 | 141.07 | < 0.0001 |
| Number of hits to gene models (v1.0)b | 2 | – | – | – | 5 | 85.06 | < 0.0001 |
| Number of hits to scaffolds (v1.0)b | 2 | – | – | – | 6 | 31.73 | < 0.0001 |
| Probability of flanking SNPs | 1 | 6 | 43.55 | < 0.0001 | 7 | 20.08 | < 0.0001 |
| Scaffold second-best hit PID (v1.0)b | 1 | – | – | – | 8 | 21.18 | < 0.0001 |
| Transcript confidence score | 1 | 7 | 6.77 | 0.0093 | 9 | 12.91 | 0.0003 |
| No. of hits to reference transcripts (v1.0)b | 2 | – | – | – | 10 | 14.67 | 0.0007 |
a Array design variables included variables calculated using v0.5 of the Douglas-fir reference genome. After genotyping, alternative variables were calculated using v1.0 of the reference genome and included in the set of final selected variables. Successful SNPs were those that had a call rate > 60% and were polymorphic. ROC area is the area under the receiver operating characteristic curve using cross-validation.
b v1.0 variables are the number of BLAST hits or percent identities (PID) using v1.0 of the Douglas-fir reference genome (scaffolds, singletons, gene models, or transcripts) as the target and SNP sequences (71-mers) as the queries.
c v0.5 variables were calculated using BLAST, Douglas-fir reference scaffolds (v0.5) as the target, and SNP sequences (71-mers) as the queries.
d Except for ‘reference transcripts,’ ‘transcript’ refers to the Newbler isotigs used for SNP discovery by Howe et al. [17].
Fig. 3 Receiver operating characteristic (ROC) curves for two sets of variables used to predict SNP genotyping success in Douglas-fir. Panel a shows the predictive ability of variables used to design the Axiom array (Table 3). Some of these variables were calculated using an earlier version of the Douglas-fir reference genome (v0.5) [16]. Panel b shows the predictive ability of alternative design variables. We replaced some of the original design variables with new variables calculated using v1.0 of the Douglas-fir reference genome [16], resulting in the final selected variables described in Table 3. ROC curves are used to evaluate binary predictive models (e.g., predictions of SNP success versus failure). Successful SNPs were those that had a call rate > 60% and were polymorphic.
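The ROC area reported for each model is the probability that a randomly chosen successful SNP receives a higher predicted score than a randomly chosen failed SNP. The sketch below computes that quantity directly from pairwise comparisons. It is a generic illustration with made-up scores; it does not reproduce the authors' logistic regression or their cross-validation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for success, 0 for failure.
    scores: predicted probability (or any score) of success.
    Ties between a success and a failure count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for four SNPs (two successes, two failures).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))
```

An area of 0.5 corresponds to a model with no predictive ability and 1.0 to perfect separation, which gives context for the reported values of 0.6449 and 0.6781. The pairwise loop is quadratic in the number of SNPs, so a rank-based implementation would be preferable for tens of thousands of SNPs.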
SNP ranking variables versus SNP genotyping success using an Axiom genotyping array
| Variablea | No. of SNPs | Category | Percent: Success | Percent: Fail | Number: Success | Number: Fail |
|---|---|---|---|---|---|---|
| Percent identity (PID)b,c | | | | | | |
| Scaffold PID (best hit) | 55,766 | > 80 | 50.9 | 49.1 | 27,936 | 26,906 |
| | | ≤ 80 | 17.1 | 82.9 | 158 | 766 |
| Scaffold PID (second-best hit) | 55,766 | ≤ 80 | 59.9 | 40.1 | 22,775 | 15,218 |
| | | > 80 | 29.9 | 70.1 | 5319 | 12,454 |
| Scaffold PID (best-hit, second-best hit) | 55,766 | > 80, ≤ 80 | 61.0 | 39.0 | 22,617 | 14,452 |
| | | > 80, > 80 | 29.9 | 70.1 | 5319 | 12,454 |
| | | ≤ 80, ≤ 80 | 17.1 | 82.9 | 158 | 766 |
| Number of hitsb | | | | | | |
| Number of hits to scaffolds | 55,766 | 1 | 60.9 | 39.1 | 22,946 | 14,753 |
| | | > 1 | 29.1 | 70.9 | 4980 | 12,115 |
| | | 0 | 17.3 | 82.7 | 168 | 804 |
| Number of hits to singletons | 55,766 | 0 | 51.5 | 48.5 | 27,922 | 26,319 |
| | | 1 | 11.8 | 88.2 | 79 | 589 |
| | | > 1 | 10.9 | 89.1 | 93 | 764 |
| Number of hits to gene models | 55,766 | 1 | 55.8 | 44.2 | 10,760 | 8522 |
| | | 0 | 50.8 | 49.2 | 16,208 | 15,705 |
| | | > 1 | 24.6 | 75.4 | 1126 | 3445 |
| Number of hits to reference transcripts | 55,766 | 1 | 54.1 | 45.9 | 12,389 | 10,529 |
| | | > 1 | 47.9 | 52.1 | 3618 | 3943 |
| | | 0 | 47.8 | 52.2 | 12,087 | 13,200 |
a SNP variables are the numbers of BLAST hits or percent identities (PID) using v1.0 of the Douglas-fir reference genome (scaffolds, singletons, gene models, or transcripts) as the target and SNP sequences (71-mers) as the queries. Percentages and numbers of SNPs are reported for each category. Successful SNPs were those that had a call rate > 60% and were polymorphic.
b All differences among categories were highly significant (P < 0.0001) using a likelihood ratio chi-square test.
c SNP BLAST hits were categorized as either > 80% or ≤ 80% identity (PID).
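The combined ‘best-hit, second-best hit’ categories in the table above can be sketched as a small classification function, followed by the success rate for each category computed from the reported counts. Treating a missing hit as ≤ 80% identity is an assumption made for illustration, as are the function names; the source does not state how missing hits were handled.

```python
def pid_category(best_pid, second_pid, cutoff=80.0):
    """Label a SNP by the percent identity of its two best scaffold hits.

    best_pid and second_pid are percent identities, or None when there is
    no such hit. A missing hit is treated as '<= 80' (an assumption).
    """
    def side(pid):
        return "> 80" if pid is not None and pid > cutoff else "<= 80"

    return f"{side(best_pid)}, {side(second_pid)}"


def percent_success(n_success, n_fail):
    """Percentage of successful SNPs in a category, to one decimal."""
    return round(100.0 * n_success / (n_success + n_fail), 1)


# (success, fail) counts from the 'best-hit, second-best hit' rows.
counts = {
    "> 80, <= 80": (22_617, 14_452),
    "> 80, > 80": (5_319, 12_454),
    "<= 80, <= 80": (158, 766),
}

print(pid_category(98.6, 75.0))  # hypothetical SNP with one close match
for category, (ok, fail) in counts.items():
    print(category, percent_success(ok, fail))
```

The pattern in the counts is that SNPs whose flanking sequence has a single close match in the genome succeed about twice as often as SNPs with a second close match, which is consistent with paralogous sequences interfering with the assay.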
Fig. 4 Distributions of minor allele frequencies for successful Douglas-fir SNPs. Open bars represent successful SNPs, whereas solid bars represent successful SNPs that were in Hardy-Weinberg Equilibrium (HWE; P ≥ 0.01). Successful SNPs were SNPs that were polymorphic and had SNP call rates > 60%. Minor allele frequencies are averaged across two populations of unrelated trees (C1 = 112 trees and C2 = 283 trees).