Davoud Torkamaneh, Jérôme Laroche, Aurélie Tardivel, Louise O'Donoughue, Elroy Cober, Istvan Rajcan, François Belzile.
Abstract
Next-generation sequencing (NGS) and bioinformatics tools have greatly facilitated the characterization of nucleotide variation; nonetheless, an exhaustive description of both SNP haplotype diversity and of structural variation remains elusive in most species. In this study, we sequenced a representative set of 102 short-season soya beans and achieved an extensive coverage of both nucleotide diversity and structural variation (SV). We called close to 5M sequence variants (SNPs, MNPs and indels) and noticed that the number of unique haplotypes had plateaued within this set of germplasm (1.7M tag SNPs). This data set proved highly accurate (98.6%) based on a comparison of called genotypes at loci shared with a SNP array. We used this catalogue of SNPs as a reference panel to impute missing genotypes at untyped loci in data sets derived from lower density genotyping tools (150 K GBS-derived SNPs/530 samples). After imputation, 96.4% of the missing genotypes imputed in this fashion proved to be accurate. Using a combination of three bioinformatics pipelines, we uncovered ~92 K SVs (deletions, insertions, inversions, duplications, CNVs and translocations) and estimated that over 90% of these were accurate. Finally, we noticed that the duplication of certain genomic regions explained much of the residual heterozygosity at SNP loci in otherwise highly inbred soya bean accessions. This is the first time that a comprehensive description of both SNP haplotype diversity and SV has been achieved within a regionally relevant subset of a major crop.
Keywords: SVs; bioinformatics pipeline; genotype accuracy; heterozygosity; next-generation sequencing; sequence variants
Year: 2017 PMID: 28869792 PMCID: PMC5814582 DOI: 10.1111/pbi.12825
Source DB: PubMed Journal: Plant Biotechnol J ISSN: 1467-7644 Impact factor: 9.803
Number of detected variants using two different WGS variant‐calling pipelines (Fast‐WGS and SOAPsnp)
| Pipeline/Variants | SNPs | MNPs | Indels | Computing time |
|---|---|---|---|---|
| Fast‐WGS | 4 071 378 | 284 836 | 642 015 | 81 h |
| SOAPsnp | 4 124 216 | ND | 512 418 | 261 h |
Analysis was performed on a Linux server with 64 CPUs and 1 TB of RAM.
Accuracy of genotype calls made using two WGS variant‐calling pipelines (Fast‐WGS and SOAPsnp). WGS‐derived SNP genotypes were compared to the genotypes called at loci in common with the SoySNP50K array for the same samples
| Variants/Pipeline | Fast‐WGS | Concordance (%) | SOAPsnp | Concordance (%) |
|---|---|---|---|---|
| Shared genotypes | 674 139 | | 645 070 | |
| Homozygous | 668 672 | 99.7 | 641 215 | 97.1 |
| Heterozygous | 3842 | 98.6 | 2152 | 91.8 |
| Indels | 1625 | 96.1 | 1703 | 89.5 |
Shared genotypes with the SoySNP50K data set.
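The concordance figures in the table above come from comparing genotype calls at loci shared between two data sets. A minimal sketch of such a comparison follows. It is not the authors' pipeline: the dictionary layout, the two-letter genotype strings, the function names and the choice to classify each genotype by its WGS call are all assumptions made for illustration.

```python
from collections import Counter


def normalise(genotype):
    """Order the two alleles so that 'GA' and 'AG' compare equal."""
    return "".join(sorted(genotype))


def concordance(wgs_calls, array_calls):
    """Return {class: (n_shared, percent_concordant)} for shared genotypes.

    Both arguments are hypothetical dicts mapping (sample, locus) to a
    two-letter genotype such as 'AA' or 'AG'.
    """
    shared = Counter()
    matching = Counter()
    # Only genotypes called in both data sets can be compared.
    for key in wgs_calls.keys() & array_calls.keys():
        wgs_gt = normalise(wgs_calls[key])
        array_gt = normalise(array_calls[key])
        # Assumption: a genotype is classified by its WGS call.
        kind = "homozygous" if wgs_gt[0] == wgs_gt[1] else "heterozygous"
        shared[kind] += 1
        matching[kind] += wgs_gt == array_gt
    return {k: (shared[k], 100.0 * matching[k] / shared[k]) for k in shared}


# Toy data, invented for this example only.
wgs = {("s1", "L1"): "AA", ("s1", "L2"): "GA",
       ("s2", "L1"): "AA", ("s2", "L2"): "GG", ("s3", "L1"): "CC"}
array = {("s1", "L1"): "AA", ("s1", "L2"): "AG",
         ("s2", "L1"): "AC", ("s2", "L2"): "GG"}
result = concordance(wgs, array)
```

In the toy data, four genotypes are shared. The single heterozygous call matches once allele order is ignored, and two of the three homozygous calls match.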
Accuracy of imputed missing data in the WGS SNP data set. Imputed genotypes were compared to the genotypes called at loci in common with the SoySNP50K array for the same samples
| Variants | WGS data set | Imputation accuracy (%) |
|---|---|---|
| Number of homozygous genotypes | 594 | 98.8 |
| Number of heterozygous genotypes | 41 | 92.7 |
| Total | 635 | 98.6 |
Figure 1. (a) Minor allele frequency (MAF) of variants. (b) Location of variants within the genome.
Figure 2. Distribution of variants with different degrees of predicted functional impact based on mutant allele frequency.
Figure 3. Number of variants (blue) and tag SNPs (green) as a function of the number of samples.
List of structural variant types identified in short‐season soya beans and their characteristics
| SV type | Number of SV sites | SV size | Median size of SV (bp) | SV site breakpoint precision (bp) |
|---|---|---|---|---|
| Deletion | 63 556 | 10 bp–3 Mb | 106 | ±3 |
| Insertion | 16 442 | 32 bp–3 Mb | 144 | ±4 |
| Duplication (dispersed duplication) | 2865 | 66 bp–3 Mb | 2513 | ±15 |
| Inversion | 4221 | 33 bp–2.8 Mb | 116 | ±12 |
| CNV (tandem duplication) | 1435 | 500 bp–1.5 Mb | 5623 | – |
| Translocation (intrachromosomal) | 3011 | 30 bp–2 Mb | 112 | ±6 |
| Translocation (interchromosomal) | 302 | 100 bp–3 Mb | 4523 | ±35 |
Ascertained with split reads.
Estimated for tandem duplications.
Estimated for inversions with paired‐end support from both breakpoints.
Figure 4. Distribution of SNPs and SVs on chromosome Chr10.
Number of SVs located in genic regions based on their span or breakpoints
| SV type | Deletion | Insertion | Duplication | Inversion | CNV | Translocation |
|---|---|---|---|---|---|---|
| In gene | 15 365 | 3201 | 71 | 1949 | 71 | 164 |
| Upstream and gene | 1653 | 1652 | 513 | 147 | 213 | 35 |
| Downstream and gene | 1714 | 1579 | 617 | 175 | 267 | 32 |
| Whole gene | 692 | 329 | 821 | 15 | 443 | 15 |
| Total | 19 424 | 6762 | 2023 | 2286 | 995 | 246 |
| Percentage of all SVs affecting genes (%) | 30.6 | 41.1 | 70.6 | 54.2 | 69.3 | 8.2 |
Nontandem duplication.
Tandem duplication.
Intrachromosomal translocation.
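The percentage row in the table above can be reproduced by dividing each genic total by the number of SV sites of the same type listed in the table of SV types. The short sketch below does this with counts copied from the two tables. The variable and function names are illustrative, and the translocation denominator is assumed to be the intrachromosomal count, as the footnote indicates.

```python
# Number of SV sites per type, from the table of SV types.
# Translocation uses the intrachromosomal count (assumption based on the footnote).
sv_sites = {
    "Deletion": 63556,
    "Insertion": 16442,
    "Duplication": 2865,
    "Inversion": 4221,
    "CNV": 1435,
    "Translocation": 3011,
}

# SVs overlapping genes per type, from the "Total" row.
# Translocation: 246 = 164 + 35 + 32 + 15, the sum of the four rows above it.
genic_svs = {
    "Deletion": 19424,
    "Insertion": 6762,
    "Duplication": 2023,
    "Inversion": 2286,
    "CNV": 995,
    "Translocation": 246,
}


def genic_percentage(sv_type):
    """Percentage of SVs of one type that affect genes, to one decimal place."""
    return round(100.0 * genic_svs[sv_type] / sv_sites[sv_type], 1)


percentages = {sv_type: genic_percentage(sv_type) for sv_type in sv_sites}
for sv_type, pct in percentages.items():
    print(f"{sv_type}: {pct}%")
```

The computed values agree with the percentages reported in the table (30.6, 41.1, 70.6, 54.2, 69.3 and 8.2).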
Figure 5. Plot of mapped‐read depth and heterozygosity in a segment of chromosome Chr10 for which some lines exhibited clusters of heterozygous calls, while other lines were homozygous.