Readman Chiu, Indhu-Shree Rajan-Babu, Inanc Birol, Jan M Friedman.
Abstract
Detection of short tandem repeat (STR) expansions with standard short-read sequencing is challenging due to the difficulty in mapping multicopy repeat sequences. In this study, we explored how the long-range sequence information of barcode linked-read sequencing (BLRS) can be leveraged to improve repeat-read detection. We also devised a novel algorithm using BLRS barcodes for distance estimation and evaluated its application for STR genotyping. Both approaches were designed for genotyping large expansions (>1 kb) that cannot be sized accurately by existing methods. Using simulated and experimental data of genomes with STR expansions from multiple BLRS platforms, we validated the utility of barcode and phasing information in attaining better STR genotypes compared to standard short-read sequencing. Although the coverage bias of extremely GC-rich STRs is an important limitation of BLRS, BLRS is an effective strategy for genotyping many other STR loci.
Year: 2022 PMID: 35672336 PMCID: PMC9174224 DOI: 10.1038/s41598-022-13024-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. IRR extraction using barcodes in BLRS. (a) Steps in using barcodes for IRR extraction and STR size estimation. (b–d) IRR counts (left) and repeat count estimates (right) of the target loci within the three groups of datasets: (b) heterozygous ATTCT expansion in ATXN10 in the 10x and stLFR simulations; (c) homozygous larger-than-reference CCAT polymorphism in NA12878 from the 10x, stLFR, and TELL-Seq BLRS platforms; and (d) FXN GAA expansions in 10x data of four Coriell cell lines. The methods compared were EH without OTS (EH_noOTS, blue), EH with OTS (EH_OTS, olive), and barcode-based IRR extraction (barcode, red). For the FXN samples, EH results (with or without OTS) from standard Illumina data (EH_OTS(S), light blue and EH_noOTS(S), light olive) were also included. Only results of the expanded alleles in the samples were shown (therefore two separate tallies for the homozygous FXN GM15850 sample). “Expected” or “ground truth” IRR counts (for the simulations) and repeat counts were plotted as orange horizontal bars together with the exact numbers. Exact IRR or repeat counts were shown on top of each bar. Confidence intervals of the estimates reported by EH were shown as error bars. For the custom barcode-based method, the error bars reported for 10x data corresponded to the range of estimates calculated independently using each of the two read lengths (see the Methods section). “NA” indicates that results were unavailable for certain samples because of a segmentation fault in EH runs.
Figure 2. Size estimation of genomic intervals and STR loci using Jaccard index of barcode sharing in BLRS. (a) Inverse relationship between Jaccard index and genomic interval size observed in NA12878 for each of the three BLRS platforms. The colored bands correspond to the 95% confidence intervals for each platform. (b) Schematic of a hypothetical example illustrating the concept and terminology in computing the Jaccard index of barcode sharing for a given genomic interval. (c–f) Scatter plots of estimates (y-axis) vs. truths (x-axis) for ~700 arbitrary genomic intervals (black) and the target STR (red) in the simulation (c), NA12878 (d), FXN (e), and FMR1 (f) datasets. Only estimates of the expanded allele at the target loci were shown. Confidence intervals of the estimates of the target loci were shown as red error bars. Dotted red diagonal lines were added to help visualize the amount of deviation of the estimates from the true values. “Truths” (x-axis) for the ~700 genomic intervals in all plots were calculated based on hg38 genomic coordinates. “Truth” for the target locus is the size of the ATXN10 repeat with which we replaced the reference in the modified genome to generate the simulated datasets (c); the size of the CCAT allele we determined from the NA12878 assembly (d); and the sizes of the FXN (e) and FMR1 (f) repeats in the Coriell samples according to online information for the respective cell lines (Table S1).
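The idea behind Figure 2 (barcode sharing falls as the interval between two regions grows) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of barcode sets taken from two windows flanking an interval, and the linear interpolation over a calibration table of intervals of known size are all assumptions made here for illustration.

```python
from bisect import bisect_left


def jaccard_index(left_barcodes, right_barcodes):
    """Jaccard index of barcode sharing between two flanking windows.

    Hypothetical helper: the inputs are the barcodes of reads mapped to the
    window on each side of the interval of interest.
    """
    left, right = set(left_barcodes), set(right_barcodes)
    union = left | right
    if not union:
        return 0.0  # no barcodes observed on either side
    return len(left & right) / len(union)


def estimate_size(jaccard, calibration):
    """Estimate interval size (bp) from a Jaccard index.

    `calibration` is a list of (interval_size_bp, jaccard_index) pairs from
    intervals of known size. Linear interpolation is an assumption of this
    sketch; values outside the calibrated range are clamped to its ends.
    """
    # Sort by Jaccard index so we can search it; size decreases as it rises.
    points = sorted((j, size) for size, j in calibration)
    indices = [j for j, _ in points]
    if jaccard <= indices[0]:
        return points[0][1]
    if jaccard >= indices[-1]:
        return points[-1][1]
    i = bisect_left(indices, jaccard)
    (j0, s0), (j1, s1) = points[i - 1], points[i]
    return s0 + (s1 - s0) * (jaccard - j0) / (j1 - j0)


# Toy example with made-up barcodes and a made-up two-point calibration.
shared = jaccard_index(["A", "B", "C"], ["B", "C", "D"])
size_bp = estimate_size(shared, [(1000, 0.6), (3000, 0.4)])
```

In this toy example, two of the four distinct barcodes are shared, so the index falls midway between the two calibration points and the estimate falls midway between their sizes. A real calibration would be built per platform and sample from many intervals, such as the ~700 genomic intervals plotted in Figure 2.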