Eleanor Young, Heba Z Abid, Pui-Yan Kwok, Harold Riethman, Ming Xiao.
Abstract
Detailed, comprehensive knowledge of the structures of individual long-range telomere-terminal haplotypes is needed to understand their impact on telomere function, and to delineate the population structure and evolution of subtelomere regions. However, the abundance of large, evolutionarily recent segmental duplications and high levels of large structural variations have complicated both the mapping and sequence characterization of human subtelomere regions. Here, we use high-throughput optical mapping of large single DNA molecules in nanochannel arrays for 154 human genomes from 26 populations to present a comprehensive look at human subtelomere structure and variation. The results catalog many novel long-range subtelomere haplotypes and determine the frequencies and contexts of specific subtelomeric duplicons on each chromosome arm, helping to clarify the currently ambiguous nature of many specific subtelomere structures as represented in the current reference sequence (HG38). The organization and content of some duplicons in subtelomeres appear to show both chromosome arm-specific and population-specific trends. Based upon these trends, we estimate a timeline for the spread of these duplication blocks.
Year: 2020 PMID: 31986135 PMCID: PMC7004388 DOI: 10.1371/journal.pgen.1008347
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Fig 1. Extended haplotypes in subtelomeric regions of 3q in GM191025 supported by single molecule evidence.
Colored rectangles represent paralogy blocks defined in the subtelomere assemblies of Stong et al. [24]. The blue bar shows the hg38 reference, with Nt.BspQI nick sites as dark blue dashes along it. The yellow bar shows the consensus contig for this sample, with dark green marks indicating nick sites that match the reference and lighter green/blue marks indicating nick sites without a reference match. The colored rectangles above the yellow bar show the paralogy blocks that match the pattern seen in the extended region. The brown rows indicate single molecules, which extend well past the block regions and into the single-copy region. A teal arrow shows the distance, 70 kb, from the telomere as defined by the Bionano single-molecule maps to the end of the HG38 reference assembly. A black arrow represents 60 kb of unknown sequence currently in the HG38 reference as ‘N’, an estimate of gap size to the end of the chromosome. Dashed boxes on top of the molecules indicate portions of the extended region that match paralogy blocks 1–5 but are not in the current reference for 3q. A red T indicates the telomeric end of the 3q map.
Fig 2. Major haplotypes of highly variable subtelomere regions.
The Stong et al. assembly blocks are shown as colored rectangles above blue Bionano genome mapping bars. Yellow rows with green ticks below these show the haplotypes. A teal arrow indicates the size of additional extended regions not covered by the reference. A black arrow represents unknown sequence currently in the HG38 reference as ‘N’, an estimate of gap size to the end of the chromosome. A dashed black arrow signifies a region of unknown telomere-adjacent gap sequence that should be deleted. A red T indicates that the Stong 2014 assembly reached the telomere; its absence means that the assembly was unable to reach the telomere repeats. Highly variable arms 1p, 2q, 3q, 5q, 6p, and 6q are included here. Additional highly variable arms (7p, 7q, 8p, 9p, 9q, 11p, 14q, 15q, 16q, 17q, 19p, 20p) can be found in S1 Fig through S4 Fig.
Fig 3. Telomere labeling shows inaccurate sizing of telomere-adjacent gap segments in HG38 subtelomere regions.
Blue bars represent the nick sites in the hg38 reference. Yellow bars with green Nt.BspQI nick sites represent the main haplotype seen in the genomes. A black dashed arrow indicates the width of the telomere-adjacent gap sequence that should be deleted from the hg38 reference. An image below the haplotype shows a single telomere-labeled molecule confirming the end of the chromosome arm. These telomeres were labeled using CRISPR-Cas9 to tag the telomere repeat and incorporate a fluorophore [32]. None of the subtelomeric haplotypes for these arms extends past the telomere label shown here.
Table 1. Summary of chromosomes.
| Chr Arm | Samples Represented | Large-scale Var¹ | Hg38 Tel-adj Gap² (kb) | Map Extension³ (kb) | Chr Arm | Samples Represented | Large-scale Var¹ | Hg38 Tel-adj Gap² (kb) | Map Extension³ (kb) |
|---|---|---|---|---|---|---|---|---|---|
| 1p | 113 | High | inaccurate | N/A | 1q | 144 | Low | 10 | 0 |
| 2p | 130 | Low | 10 | 0 | 2q | 133 | High | 10 | 0–45 |
| 3p | 143 | Low | 10 | 0 | 3q | 110 | High | 60 | 70–140 |
| 4p | 124 | Low | 10 | 0 | 4q⁵ | 133 | N/A | 10 | N/A |
| 5p | 152 | Low | 10 | 0 | 5q | 150 | High | 60 | -60 |
| 6p | 119 | High | 60 | 0 | 6q | 146 | High | 60 | -60 |
| 7p | 104 | High | 10 | 100–175 | 7q | 142 | High | 10 | 0 |
| 8p | 153 | High | 60 | -60 | 8q | 151 | Low | 60 | -20 |
| 9p | 136 | High | 10 | 5–55 | 9q | 128 | High | 60 | -40 to 20 |
| 10p | 146 | Low | 10 | 0 | 10q⁵ | 151 | N/A | 10 | N/A |
| 11p | 121 | High | 60 | 0–80 | 11q | 137 | Low | 10 | -15 |
| 12p | 145 | Low | 10 | 0 | 12q | 147 | Low | 10 | -20 |
| 13p | N/A | N/A | N/A | N/A | 13q | 152 | Low | 10 | 0 |
| 14p | N/A | N/A | N/A | N/A | 14q | 95 | High | 160 | -160 |
| 15p | N/A | N/A | N/A | N/A | 15q | 140 | High | 10 | 0 |
| 16p⁴ | 153 | Low | 10 | -10 | 16q | 121 | High | 110 | -20 |
| 17p⁴ | 127 | Low | inaccurate | N/A | 17q | 124 | High | 10 | -20 |
| 18p | 144 | Low | 10 | 0 | 18q | 150 | Low | 110 | -110 |
| 19p | 119 | High | 60 | 10 | 19q⁴ | 150 | Low | 10 | 0 |
| 20p | 135 | High | 60 | 30–70 | 20q | 148 | Low | 110 | -70 |
| 21p | N/A | N/A | N/A | N/A | 21q | 115 | Low | 10 | 0 |
| 22p | N/A | N/A | N/A | N/A | 22q⁴ | 153 | Low | 10 | 0 |
| X/Yp | 33 | Low | 10 | N/A | X/Yq | 139 | Low | 10 | 0 |
Table 1 lists, for each arm, the number of contigs from the 154 genomes, the current amount of Ns padding the hg38 reference, the range of map extension lengths (if any), and the classification of the arm as High or Low variability. Variability is determined from the total number of genomes mapped for an arm and how many of them carried the majority haplotype versus a minor haplotype: if the minor haplotype accounted for less than 10% of the total, the arm is considered low variability. The acrocentric arms 13p, 14p, 15p, 21p, and 22p cannot be assessed because they lack a reference. 4q and 10q carry a known D4Z4 repeat in the subtelomeric region and are also excluded [39]. Differences between reference gap and contig length of less than 10 kb cannot be determined precisely and are reported as 0. A negative number indicates that the gap estimated in the reference is longer than that seen in the arm, and a positive number indicates that the arm is longer than the estimated gap. For 1p and 17p, the HG38 reference is very inaccurate.
1. If the minor haplotype was less than 10% of the total haplotypes for the arm, the arm is considered low variability.
2. The reference is inaccurate to the point where the size of the telomere-adjacent gap of the reference cannot be evaluated.
3. For gap differences of <10 kb, accuracy is unclear. These arms appear to be within the correct size range. N/A refers to arms that could not be judged accurately, including the acrocentric arms 13p, 14p, 15p, 21p, and 22p. In addition, 4q and 10q contain a known D4Z4 repeat, leading to widely variable ranges (35).
4. Contain INP sites, which may affect the data used to determine high vs. low variability.
5. Can be considered high variability with respect to high levels of large D4Z4 tandem repeat variability in the populations.
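The two rules stated in the Table 1 legend (the 10% minor-haplotype cutoff and the sign convention for map extensions) can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the authors' pipeline. The function names, the `observed_tail_kb` parameter (taken here to mean the distance in the consensus map from the last sequenced reference position to the mapped chromosome end), and the pooling of all non-majority haplotypes into one count are assumptions made for the example.

```python
# Illustrative sketch of the Table 1 conventions; names are hypothetical.

LOW_VARIABILITY_CUTOFF = 0.10  # minor-haplotype fraction stated in the legend
RESOLUTION_KB = 10.0           # differences below this are reported as 0


def classify_variability(majority_count: int, minor_count: int) -> str:
    """Return "Low" if minor haplotypes are under 10% of the total, else "High".

    Assumption: all non-majority haplotypes are pooled into minor_count.
    """
    total = majority_count + minor_count
    if total == 0:
        raise ValueError("no haplotypes observed for this arm")
    minor_fraction = minor_count / total
    return "Low" if minor_fraction < LOW_VARIABILITY_CUTOFF else "High"


def map_extension_kb(observed_tail_kb: float, reference_gap_kb: float) -> float:
    """Signed difference between the mapped chromosome end and the reference gap.

    Negative: the reference gap is longer than what the map supports.
    Positive: the map extends beyond the estimated gap.
    Differences smaller than the stated 10 kb resolution are reported as 0.
    """
    diff = observed_tail_kb - reference_gap_kb
    return 0.0 if abs(diff) < RESOLUTION_KB else diff
```

Under this reading, an arm whose map ends right at the last sequenced position while the reference carries a 60 kb gap gives `map_extension_kb(0, 60) == -60.0`, the pattern reported for 5q, 6q, and 8p.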
Fig 4. Distribution of paralogy block 5 in 15q, 16q and 9q.
The solid colored rectangles show the paralogy blocks defined in the subtelomeric assemblies of Stong et al. (2014). The narrow grey line segments to the right of the colored blocks show the single-copy DNA region. Blue rectangles with dark blue lines show the HG38 reference with Nt.BspQI nick sites. Paralogy block 5 is shown as a dashed blue rectangle on top of yellow rows representing consensus maps for particular genomes. Additional paralogy blocks are also shown as dashed colored rectangles. A teal arrow indicates the size of additional extended regions not covered by the reference. A black arrow represents unknown sequence currently in the HG38 reference as ‘N’, an estimate of gap size to the end of the chromosome. A dashed black arrow signifies a region of unknown telomere-adjacent gap sequence that should be deleted.
Table 2. Distribution of block 5 by population.
| Chr Arm | AFR with Block 5 / Total AFR maps | AMR with Block 5 / Total AMR maps | EAS with Block 5 / Total EAS maps | EUR with Block 5 / Total EUR maps | SAS with Block 5 / Total SAS maps |
|---|---|---|---|---|---|
| 2q | 0% | 0% | 0% | 0% | 4% |
| 3q | 54% | 61% | 76% | 65% | 35% |
| 5q | 88% | 90% | 97% | 92% | 82% |
| 6p | 56% | 41% | 21% | 22% | 32% |
| 6q | 83% | 77% | 62% | 78% | 80% |
| 7p | 7% | 0% | 0% | 3% | 7% |
| 8p | 81% | 81% | 67% | 88% | 78% |
| 9q | 62% | 30% | 48% | 32% | 15% |
| 11p | 49% | 42% | 21% | 43% | 35% |
| 15q | 93% | 97% | 97% | 92% | 93% |
| 16q | 14% | 0% | 0% | 0% | 4% |
| 19p | 10% | 27% | 47% | 18% | 19% |
Table 2 shows the frequency of block 5 on different chromosome arms (rows) for each superpopulation (columns). AFR stands for the African superpopulation, AMR for Admixed American, EAS for East Asian, EUR for European, and SAS for South Asian. Based on the Stong reference, block 5 had previously been found on 5q, 6p, 6q, 8p, 11p, 15q, and 19p. In our dataset it was also found on 2q, 3q, 7p, 9q, and 16q.
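The comparison in the Table 2 legend, between arms where block 5 was already reported and arms where it appears only in this dataset, amounts to a set difference over the table rows. The sketch below transcribes the Table 2 percentages and recovers the newly observed arms and the superpopulations in which each is seen. The variable and function names are hypothetical, and because the percentages are rounded, a 0% entry is treated as "not observed", which is an assumption.

```python
# Table 2 percentages, columns in the order AFR, AMR, EAS, EUR, SAS.
SUPERPOPS = ("AFR", "AMR", "EAS", "EUR", "SAS")

BLOCK5_FREQ = {
    "2q": (0, 0, 0, 0, 4),
    "3q": (54, 61, 76, 65, 35),
    "5q": (88, 90, 97, 92, 82),
    "6p": (56, 41, 21, 22, 32),
    "6q": (83, 77, 62, 78, 80),
    "7p": (7, 0, 0, 3, 7),
    "8p": (81, 81, 67, 88, 78),
    "9q": (62, 30, 48, 32, 15),
    "11p": (49, 42, 21, 43, 35),
    "15q": (93, 97, 97, 92, 93),
    "16q": (14, 0, 0, 0, 4),
    "19p": (10, 27, 47, 18, 19),
}

# Arms on which block 5 was found in the Stong reference, per the legend.
PREVIOUSLY_REPORTED = {"5q", "6p", "6q", "8p", "11p", "15q", "19p"}


def newly_observed_arms() -> list[str]:
    """Arms in Table 2 that are absent from the previously reported set (table order)."""
    return [arm for arm in BLOCK5_FREQ if arm not in PREVIOUSLY_REPORTED]


def superpops_with_block(arm: str) -> list[str]:
    """Superpopulations with a nonzero (rounded) block 5 frequency on this arm."""
    return [pop for pop, pct in zip(SUPERPOPS, BLOCK5_FREQ[arm]) if pct > 0]


if __name__ == "__main__":
    for arm in newly_observed_arms():
        print(arm, superpops_with_block(arm))
```

Run as a script, this prints each newly observed arm with the superpopulations carrying block 5 there, which makes the restricted distributions on 2q, 7p, and 16q easy to see alongside the widespread ones on 3q and 9q.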