Kirill Grigorev, Jonathan Foox, Daniela Bezdan, Daniel Butler, Jared J Luxton, Jake Reed, Miles J McKenna, Lynn Taylor, Kerry A George, Cem Meydan, Susan M Bailey, Christopher E Mason.
Abstract
Telomeres are regions of repetitive nucleotide sequences capping the ends of eukaryotic chromosomes that protect against deterioration, and whose lengths can be correlated with age and adverse health risk factors. Yet, given their length and repetitive nature, telomeric regions are not easily reconstructed from short-read sequencing, thus making telomere sequencing, mapping, and variant resolution challenging problems. Recently, long-read sequencing, with read lengths measuring in hundreds of kilobase pairs, has made it possible to routinely read into telomeric regions and inspect their sequence structure. Here, we describe a framework for extracting telomeric reads from whole-genome single-molecule sequencing experiments, including de novo identification of telomere repeat motifs and repeat types, and also describe their sequence variation. We find that long, complex telomeric stretches and repeats can be accurately captured with long-read sequencing, observe extensive sequence heterogeneity of human telomeres, discover and localize noncanonical telomere sequence motifs (both previously reported, as well as novel), and validate them in short-read sequence data. These data reveal extensive intra- and inter-population diversity of repeats in telomeric haplotypes, reveal higher paternal inheritance of telomeric variants, and represent the first motif composition maps of multi-kilobase-pair human telomeric haplotypes across three distinct ancestries (Ashkenazi, Chinese, and Utah), which can aid in future studies of genetic variation, aging, and genome biology.
Year: 2021 PMID: 34162698 PMCID: PMC8256856 DOI: 10.1101/gr.274639.120
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Mapping of candidate telomeric reads, illustrated with reads from the HG002 data set aligning to Chromosome 12. The chromosome is displayed schematically, centered around the centromere. Vertical red dashed lines denote the position of the boundary of the annotated telomeric tract. Coordinates are given in kilobase pairs, relative to the positions of the telomeric tract boundaries. Statistics for all chromosomes of all seven data sets are provided in Supplemental Table S2.
Significantly enriched repeating motifs in telomeric regions of GIAB data sets HG001 through HG007
Figure 2. Densities of the top three enriched motifs at ends of chromosomal p arms (A) and q arms (B) of the HG002 data set. Background represents the remaining sequence content (nonrepeating sequence and not significantly enriched motifs). Reads are shown aligned to the contigs in the hg38ext reference set, and genomic coordinates are given in kilobase pairs. Vertical red dashed lines denote the position of the boundary of the annotated telomeric tract.
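As an illustration of what a per-read motif density could look like, the following is a minimal sketch, not the authors' implementation. It assumes density means the fraction of bases in each fixed-size window that are covered by exact occurrences of a motif; the function name motif_density, the window size, and the use of the canonical human repeat TTAGGG in the usage line are all assumptions made for the example.

```python
def motif_density(read, motif, window=100):
    """Return, for each consecutive window of `read`, the fraction of bases
    covered by exact occurrences of `motif` (hypothetical helper)."""
    covered = [False] * len(read)
    k = len(motif)
    # Mark every base that belongs to at least one exact motif occurrence.
    start = read.find(motif)
    while start != -1:
        for i in range(start, start + k):
            covered[i] = True
        start = read.find(motif, start + 1)
    # Summarize coverage per window; the last window may be shorter.
    densities = []
    for w in range(0, len(read), window):
        chunk = covered[w:w + window]
        densities.append(sum(chunk) / len(chunk))
    return densities


# Usage: a toy read with four canonical repeats followed by unrelated sequence.
toy_read = "TTAGGG" * 4 + "ACGT" * 3
print(motif_density(toy_read, "TTAGGG", window=12))
```

Exact matching is the simplest choice here; tolerating sequencing errors within a repeat would need approximate matching, which this sketch does not attempt.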
Figure 3. Distribution of motif entropies in 10-bp windows of candidate PacBio CCS reads aligning to the same chromosomal arms in GIAB data sets HG001 through HG007, with respect to per-window coverage, and the coverage-weighted percentiles of the entropy values.
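One way to read "motif entropy per window" is as the Shannon entropy of the motif labels that different reads carry at the same window position. The sketch below works under that assumption only; the input layout (one list of motif labels per read, one label per window) and the name window_entropies are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def window_entropies(labels_per_read):
    """For each window index, return (coverage, Shannon entropy in bits)
    of the motif labels observed across reads (hypothetical helper)."""
    n_windows = max((len(r) for r in labels_per_read), default=0)
    result = []
    for w in range(n_windows):
        # Only reads long enough to reach this window contribute to coverage.
        labels = [r[w] for r in labels_per_read if w < len(r)]
        total = len(labels)
        counts = Counter(labels)
        entropy = sum(
            (c / total) * math.log2(total / c) for c in counts.values()
        )
        result.append((total, entropy))
    return result


# Usage: three toy reads; the second window is covered by only two reads,
# which disagree on the motif.
reads = [
    ["TTAGGG", "TTAGGG"],
    ["TTAGGG", "TGAGGG"],
    ["TTAGGG"],
]
for coverage, entropy in window_entropies(reads):
    print(coverage, entropy)
```

Returning coverage alongside entropy keeps the information needed to weight entropy values by coverage afterwards.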
Measures of cophenetic correlation (Pearson's r and adjusted P-value) between the hierarchical clustering and the pairwise distance matrix on each chromosomal arm
Figure 4. Clustering of reads by relative pairwise Levenshtein distances (unitless measure) on each chromosomal p arm of data sets HG001 through HG007, as well as densities of the top enriched motifs along each read. Each horizontal line represents an individual read; genomic coordinates are given in kilobase pairs, relative to the positions of the telomeric tract boundaries. Only the chromosomal arms cumulatively covered by at least 25 reads are displayed.
Figure 5. Clustering of reads by relative pairwise Levenshtein distances (unitless measure) on each chromosomal q arm of data sets HG001 through HG007, and densities of the top enriched motifs along each read. Each horizontal line represents an individual read; genomic coordinates are given in kilobase pairs, relative to the positions of the telomeric tract boundaries. Only the chromosomal arms cumulatively covered by at least 25 reads are displayed.
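The relative pairwise Levenshtein distance used for clustering can be sketched as follows. This is an illustrative version rather than the authors' code: it assumes "relative" means the edit distance divided by the length of the longer sequence, which yields a unitless value between 0 and 1. The function names are hypothetical, and the clustering step itself is not shown.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def relative_distance(a, b):
    """Edit distance scaled by the longer length (normalization is an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def distance_matrix(seqs):
    """Symmetric matrix of relative distances, suitable as clustering input."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = relative_distance(seqs[i], seqs[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Usage: three toy repeat units.
for row in distance_matrix(["TTAGGG", "TGAGGG", "TTGGGG"]):
    print(row)
```

This quadratic-time routine is adequate for short toy sequences; multi-kilobase reads would call for an optimized library implementation.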
Adjusted P-values of the Wilcoxon signed-rank tests on relative Levenshtein distances