Marc-André Lemay, Jonas A Sibbesen, Davoud Torkamaneh, Jérémie Hamel, Roger C Levesque, François Belzile.
Abstract
BACKGROUND: Structural variants (SVs), including deletions, insertions, duplications, and inversions, are relatively long genomic variations implicated in a diverse range of processes from human disease to ecology and evolution. Given their complex signatures, tendency to occur in repeated regions, and large size, discovering SVs based on short reads is challenging compared to single-nucleotide variants. The increasing availability of long-read technologies has greatly facilitated SV discovery; however, these technologies remain too costly to apply routinely to population-level studies. Here, we combined short-read and long-read sequencing technologies to provide a comprehensive population-scale assessment of structural variation in a panel of Canadian soybean cultivars.
Keywords: Crop genomics; Oxford Nanopore sequencing; Population studies; Soybean genomics; Structural variant genotyping; Structural variation; Transposable elements
Year: 2022 PMID: 35197050 PMCID: PMC8867729 DOI: 10.1186/s12915-022-01255-w
Source DB: PubMed Journal: BMC Biol ISSN: 1741-7007 Impact factor: 7.431
Number of SVs called from Illumina data per calling tool, SV type, and size class
| | [50 bp–100 bp[ | | | | [100 bp–1 kb[ | | | | [1 kb–10 kb[ | | | | ≥ 10 kb (a) | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Calling program | DEL (b) | INS (c) | DUP (d) | INV (e) | DEL | INS | DUP | INV | DEL | INS | DUP | INV | DEL | DUP | INV |
| asmvar | 11,018 | 3575 | 0 | 0 | 14,877 | 2243 | 0 | 0 | 5748 | 1 | 0 | 0 | 4681 | 0 | 0 |
| manta | 9664 | 3358 | 453 | 0 | 12,378 | 1815 | 3114 | 0 | 11,463 | 0 | 4448 | 0 | 7325 | 5034 | 0 |
| smoove | 4168 | 0 | 22 | 45 | 6489 | 0 | 1208 | 149 | 4687 | 0 | 981 | 33 | 1794 | 975 | 47 |
| svaba | 7288 | 2284 | 673 | 21 | 6907 | 215 | 16,081 | 292 | 2969 | 0 | 1548 | 190 | 512 | 458 | 223 |
| merged (f) | 17,199 | 5023 | 656 | 61 | 22,980 | 3165 | 9810 | 296 | 13,007 | 1 | 4316 | 135 | 10,640 | 4696 | 178 |
a Insertions ≥ 10 kb are not shown because none were called
b DEL: deletions
c INS: insertions
d DUP: duplications
e INV: inversions
f merged: the dataset merged using SVmerge
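The half-open size classes in the table header can be made concrete with a short sketch. This is only an illustration of how an SV length could be assigned to one of the four classes; the function and constant names are hypothetical and are not taken from the study's code.

```python
from bisect import bisect_right
from typing import Optional

# Lower bounds (in bp) of the half-open size classes from the table header.
# Names are hypothetical; only the boundaries come from the table.
SIZE_CLASS_BOUNDS = [50, 100, 1_000, 10_000]
SIZE_CLASS_LABELS = ["[50 bp-100 bp[", "[100 bp-1 kb[", "[1 kb-10 kb[", ">= 10 kb"]


def size_class(sv_length_bp: int) -> Optional[str]:
    """Return the size-class label for an SV length, or None below 50 bp."""
    # bisect_right counts how many lower bounds are <= the length, so
    # subtracting 1 gives the index of the class the length falls into.
    idx = bisect_right(SIZE_CLASS_BOUNDS, sv_length_bp) - 1
    return SIZE_CLASS_LABELS[idx] if idx >= 0 else None


if __name__ == "__main__":
    for length in (49, 50, 99, 100, 9_999, 10_000):
        print(length, size_class(length))
```

Because each class includes its lower bound and excludes its upper bound, a 100-bp SV is counted in the second class rather than the first.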
Fig. 1 Genotyping sensitivity and precision of A deletions and B insertions discovered from the Illumina data. Sensitivity was defined as the fraction of SVs in the ground truth set (SVs called by Sniffles from Oxford Nanopore data) genotyped as the alternative allele (non-reference) from the Illumina data. Precision was defined as the fraction of SVs genotyped as the alternative allele from the Illumina data that were also observed in the truth set. Each line and color represents one of 17 samples. The different plots correspond to different SV lengths. The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call. The asterisks indicate a minimum number of supporting reads of 2; points to the left of these for a given line represent increasingly stringent filtering threshold values (i.e., a greater number of reads supporting a genotype call). Some of the threshold values for the minimum number of reads supporting a genotype call are shown for a single sample in the upper left plot of panel A
Fig. 2 Genotyping sensitivity and precision of A deletions and B insertions discovered from the Oxford Nanopore data. Sensitivity was defined as the fraction of SVs in the ground truth set (SVs called by Sniffles from Oxford Nanopore data) genotyped as the alternative allele (non-reference) from the Illumina data. Precision was defined as the fraction of SVs genotyped as the alternative allele from the Illumina data that were also observed in the truth set. Each line and color represents one of 17 samples. The different plots correspond to different SV lengths. The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call. The asterisks indicate a minimum number of supporting reads of 2; points to the left of these for a given line represent increasingly stringent filtering threshold values (i.e., a greater number of reads supporting a genotype call). Some of the threshold values for the minimum number of reads supporting a genotype call are shown for a single sample in the upper left plot of panel A
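The sensitivity and precision definitions used in these two captions can be sketched for one sample and one read-support threshold as follows. This is a minimal illustration, not the study's pipeline: SVs are matched between the truth set and the genotyped set by a shared identifier, which is an assumption made here for simplicity, and all names and toy data are hypothetical.

```python
from typing import Dict, Set, Tuple


def sensitivity_precision(
    truth_ids: Set[str],
    alt_support: Dict[str, int],
    min_reads: int = 2,
) -> Tuple[float, float]:
    """Compute (sensitivity, precision) for one sample at one threshold.

    truth_ids:   SVs in the truth set for this sample.
    alt_support: SV id -> number of short reads supporting a
                 non-reference genotype call (hypothetical input format).
    min_reads:   minimum supporting reads required to keep a call.
    """
    # Keep only non-reference calls that pass the read-support filter.
    called = {sv for sv, n in alt_support.items() if n >= min_reads}
    true_positives = len(called & truth_ids)
    # Sensitivity: fraction of truth-set SVs genotyped as non-reference.
    sensitivity = true_positives / len(truth_ids) if truth_ids else float("nan")
    # Precision: fraction of non-reference calls also present in the truth set.
    precision = true_positives / len(called) if called else float("nan")
    return sensitivity, precision


if __name__ == "__main__":
    truth = {"sv_a", "sv_b", "sv_c", "sv_d"}
    support = {"sv_a": 5, "sv_b": 2, "sv_c": 1, "sv_e": 3}
    for threshold in (2, 3):
        print(threshold, sensitivity_precision(truth, support, threshold))
```

In this toy data, raising the threshold from 2 to 3 lowers both sensitivity and precision; in real data the trade-off depends on how the supporting reads are distributed between true and false calls.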
Fig. 3 Circos plot of the distribution of various features within 3-Mb bins along the reference assembly version 4 of Williams82. Results shown are based on the population-scale (102 samples) genotyping of SVs discovered using both Illumina and Oxford Nanopore data. A Gene density. B Density of SNVs called by Platypus. C Number of deletions (blue) and insertions (red) discovered within each bin. The bins with the 10% highest SV density (insertions and deletions considered together) are highlighted in gray. D Number of reference (blue) and polymorphic (red) LTR Copia and LTR Gypsy elements (summed together). E Number of reference (blue) and polymorphic (red) DNA transposable elements. The gray highlights in tracks D and E show the bins with the 10% highest polymorphic/reference ratios
Fig. 4 Population structure computed on all 102 Canadian soybean cultivars using fastStructure with k = 5 on A SNVs called by Platypus from Illumina data and B SVs discovered from Illumina and Oxford Nanopore data, and subsequently genotyped with Illumina data using Paragraph. The proportion of ancestry attributed to each of five populations is shown along the y-axis for 102 cultivars displayed along the x-axis. The order of the cultivars and the color scheme are identical in both panels. The vertical dotted lines between panels denote the 16 cultivars for which the assigned population (i.e., the population with the highest ancestry for that cultivar) differs
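The "assigned population" rule in this caption (the population with the highest ancestry proportion) can be sketched as below. The sketch assumes that population indices already correspond between the SNV-based and SV-based runs, which clustering tools do not guarantee on their own; the cultivar names and ancestry values are made up for illustration.

```python
from typing import Dict, List, Sequence


def assigned_population(ancestry: Sequence[float]) -> int:
    """Index of the population with the highest ancestry proportion."""
    return max(range(len(ancestry)), key=lambda k: ancestry[k])


def differing_cultivars(
    snv_ancestry: Dict[str, Sequence[float]],
    sv_ancestry: Dict[str, Sequence[float]],
) -> List[str]:
    """Cultivars whose assigned population differs between the two analyses.

    Assumes population indices are already matched between the two inputs.
    """
    return [
        name
        for name in snv_ancestry
        if assigned_population(snv_ancestry[name])
        != assigned_population(sv_ancestry[name])
    ]


if __name__ == "__main__":
    # Hypothetical ancestry proportions for k = 5 populations.
    snv = {
        "cultivar_1": [0.70, 0.10, 0.10, 0.05, 0.05],
        "cultivar_2": [0.40, 0.45, 0.05, 0.05, 0.05],
    }
    sv = {
        "cultivar_1": [0.65, 0.15, 0.10, 0.05, 0.05],
        "cultivar_2": [0.50, 0.35, 0.05, 0.05, 0.05],
    }
    print(differing_cultivars(snv, sv))
```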
Fig. 5 Analysis of the overlap of SVs with gene models. A Distributions of the proportions of deletions and insertions overlapping various genic features as generated by a randomization test (5000 iterations). Observed proportions for each SV type and genic feature are indicated by a vertical dotted line. One-sided p-values are < 2 × 10⁻⁴ for all comparisons except for deletions overlapping genes, for which the p-value is 4 × 10⁻⁴. B Distribution of the allele frequencies of deletions and insertions depending on the genic features they overlap. Note the logarithmic scale on the y-axis. cds: SVs overlapping coding sequences; gene: SVs overlapping non-coding genic sequences; upstream5kb: SVs overlapping regions 5 kb upstream of genes, but not any genic sequences; intergenic: SVs that do not overlap any of the other features
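A one-sided p-value from a randomization test can be sketched as the fraction of randomized values at least as extreme as the observed one. The sketch below is an assumption about the general technique, not the study's exact procedure. It does show why 5000 iterations give a resolution of 1/5000 = 2 × 10⁻⁴: when no randomized value is as extreme as the observation, the result can only be reported as an upper bound of that size, and two such values out of 5000 give 4 × 10⁻⁴. The function name and the `alternative` parameter are hypothetical.

```python
from typing import Sequence, Tuple


def one_sided_p_value(
    observed: float,
    null_values: Sequence[float],
    alternative: str = "less",
) -> Tuple[float, bool]:
    """Empirical one-sided p-value from a randomization test.

    Returns (p, is_upper_bound). When no randomized value is as extreme
    as the observed one, p is 1/n and is_upper_bound is True, meaning the
    result should be read as "p < 1/n".
    """
    if alternative not in ("less", "greater"):
        raise ValueError("alternative must be 'less' or 'greater'")
    n = len(null_values)
    if n == 0:
        raise ValueError("null_values must not be empty")
    if alternative == "less":
        extreme = sum(1 for v in null_values if v <= observed)
    else:
        extreme = sum(1 for v in null_values if v >= observed)
    if extreme == 0:
        return 1.0 / n, True
    return extreme / n, False


if __name__ == "__main__":
    # Toy null distribution with 5000 values, standing in for 5000 iterations.
    null = [i / 5000 for i in range(5000)]
    print(one_sided_p_value(-1.0, null))     # no value as extreme: bound of 1/5000
    print(one_sided_p_value(null[1], null))  # two values as extreme: 2/5000
```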
Number and span of polymorphic and reference transposable elements of different types
| TE type | REF (a): N (d) (%) | REF: Mb (e) (%) | DEL (b): N (%) | DEL: kb (f) (%) | INS (c): N (%) | INS: kb (%) |
|---|---|---|---|---|---|---|
| Copia LTR retrotransposons | 91,241 (35.1) | 170 (43.0) | 1154 (44.6) | 5594 (43.8) | 1303 (54.5) | 6692 (63.1) |
| Gypsy LTR retrotransposons | 71,390 (27.5) | 139 (35.2) | 949 (36.7) | 5745 (45) | 718 (30) | 2949 (27.8) |
| Non-LTR retrotransposons | 8078 (3.1) | 10 (2.5) | 144 (5.6) | 449 (3.5) | 99 (4.1) | 307 (2.9) |
| DNA TE | 89,300 (34.3) | 76 (19.2) | 339 (13.1) | 989 (7.7) | 271 (11.3) | 654 (6.2) |
a REF: transposable elements ≥ 100 bp in the reference genome
b DEL: deletions relative to the reference that are annotated as TEs
c INS: insertions relative to the reference that are annotated as TEs
d N: number of reference elements, deletions or insertions matching given TE type
e Mb: total length of reference elements of a given type, in Mb
f kb: total length of polymorphic elements matching given TE type, in kb
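The percentages in parentheses appear to be each TE type's share of its column total across the four listed types. That reading is an assumption, but it reproduces the reported values for the reference-element counts, as this short check shows.

```python
# Counts of reference TEs per type, copied from the REF N column of the table.
ref_counts = {
    "Copia LTR retrotransposons": 91_241,
    "Gypsy LTR retrotransposons": 71_390,
    "Non-LTR retrotransposons": 8_078,
    "DNA TE": 89_300,
}

# Assumption: the percentage is the share of the column total.
total = sum(ref_counts.values())
shares = {te: round(100 * n / total, 1) for te, n in ref_counts.items()}

if __name__ == "__main__":
    print(total)  # 260009
    for te, pct in shares.items():
        print(f"{te}: {pct}%")  # 35.1, 27.5, 3.1, 34.3
```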
Fig. 6 Analysis of the polymorphic TEs found in this study. Comparison of the number of polymorphic TEs per A LTR family and B DNA TE type found in Tian et al. [38] and in this study. Differences in y- and x-scales are partly explained by the fact that counts for Tian et al. are summed over occurrences in all samples whereas our data counts each SV only once. Note that all scales are logarithmic. C Proportion of matching nucleotides between the two terminal repeats for TE sequences corresponding to 40 different SVs grouped by DNA TE superfamily and by the identifier of the TE sequence they matched in the SoyTEdb database. D Alternate allele frequencies of 156 SNVs located in a ~39-kb linkage disequilibrium block between positions Gm04:2,220,398 and Gm04:2,259,326. Frequencies were computed for three different groups of samples depending on their genotype at the TE insertion site (Gm04:2,257,090). absent: absence of the TE insertion, which corresponds to the reference allele (71 samples); present: presence of the 480-bp Stowaway MITE (9 samples); excised: presence of a 6-bp insertion at the insertion site, putatively left by excision of the TE insertion (14 samples). The locations of three SNVs whose frequency in the “present” and “excised” groups diverge are shown with dotted vertical lines