L I Uralsky1,2, V A Shepelev1,2, A A Alexandrov1, Y B Yurov3, E I Rogaev2,4,5, I A Alexandrov2,3.
Abstract
In the latest hg38 human genome assembly, centromeric gaps have been filled in by alpha satellite (AS) reference models (RMs), which are statistical representations of the homogeneous higher-order repeat (HOR) arrays that make up the bulk of the centromeric regions. We analyzed these models to compose an atlas of human AS HORs in which each monomer of a HOR was represented by a number of its polymorphic sequence variants. We combined these data with the HMMER sequence analysis platform to annotate AS HORs in the assembly. This led to the discovery of a new type of low-copy-number, highly divergent HORs which were not represented by RMs. These were included in the dataset. The annotation can be viewed as a UCSC Genome Browser custom track (the HOR-track) and used together with our previous annotation of AS suprachromosomal families (SFs) in the same assembly, where each AS monomer can be viewed in its genomic context together with its classification into one of the 5 major SFs (the SF-track). To catalog the diversity of AS HORs in the human genome we introduced a new naming system. Each HOR received a name which shows its SF, chromosomal location and index number. Here we present the first installment of the HOR-track, covering only the 17 HORs that belong to SF1, which forms live functional centromeres in chromosomes 1, 3, 5, 6, 7, 10, 12, 16 and 19 and also a large number of minor dead HOR domains, both homogeneous and divergent. The monomer-by-monomer HOR annotation used for this dataset, as opposed to annotation of whole HOR repeats, allows mapping and quantification of various structural variants of AS HORs, which can be used to collect data on inter-individual polymorphism of AS.
Keywords: Alpha satellite; Centromeres; Higher-order repeats; Suprachromosomal families; hg38 human genome assembly
Year: 2019 PMID: 30989093 PMCID: PMC6447721 DOI: 10.1016/j.dib.2019.103708
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
List of SF1 AS HORs in hg38 human genome assembly.
| # | Chrom. | New HOR name | Old HOR name | HOR length (mon) | RM or sample contig | Genomic size (kb) | Homogeneous/Divergent | Age | Status | CENP-A reads | Ref. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1,5,19 | S1C1/5/19H1L | D1Z7/D5Z2/D19Z3 | 6 | GJ212201.1 | 2282 | homogeneous | modern | live | 43,226 | |
| 2 | 3 | S1C3H1L | D3Z1 | 17 | GJ211871.1 | 2102 | homogeneous | modern/archaic | live | 31,967 | |
| 3 | 3 | S1C3H2 | D3-2 | 10 | GJ211866.1 | 461 | homogeneous | archaic | pseudo | 484 | |
| 4 | 3 | S1C3H3d | – | 5 | ABBA01004655.1 | 217 | divergent | archaic | relic | – | this paper |
| 5 | 3,6,7,8,10,12,20 | S1CMH1d | – | 4 | ABBA01004652.1 | 251 | divergent | archaic | relic | – | this paper |
| 6 | 5p | S1C5pH2 | – | 16 | GJ211887.1 | 143 | homogeneous | modern | pseudo | 451 | RM |
| 7 | 6 | S1C6H1L | D6Z1 | 18 | GJ211907.1 | 1276 | homogeneous | archaic | live | 32,711 | |
| 8 | 7 | S1C7H1L | D7Z1 | 6 | GJ211908.1 | 2659 | homogeneous | modern | live | 32,278 | |
| 9 | 10 | S1C10H1L | D10Z1 | 8 | GJ211932.1 | 1561 | homogeneous | modern | live | 30,214 | |
| 10 | 10 | S1C10H1—B | – | 14 | GJ211933.1 | 48 | homogeneous | modern | pseudo | 1028 | RM |
| 11 | 10 | S1C10H1—C | – | 8 | GJ211936.1 | 48 | homogeneous | modern | pseudo | 116 | RM |
| 12 | 10 | S1C10H2 | – | 18 | GJ211930.1 | 249 | homogeneous | modern | pseudo | 340 | RM |
| 13 | 12 | S1C12H1L | D12Z3 | 8 | GJ211954.1 | 2350 | homogeneous | modern | live | 35,979 | |
| 14 | 12 | S1C12H2 | – | 18 | GJ211949.1 | 47 | homogeneous | modern | pseudo | 221 | RM |
| 15 | 12 | S1C12H3d | – | 8 | AEKP01211346.1 | 23 | divergent | archaic | relic | – | this paper |
| 16 | 10,12 | S1C10/12H1d | – | 2 | ABBA01049496.1 | 93 | divergent | modern | relic | – | this paper |
| 17 | 16 | S1C16H1L | D16Z2 | 10 | GJ212051.1 | 1928 | homogeneous | modern | live | 38,243 | |
An RM is supposed to be a model of a HOR cluster constructed under the assumption that all copies form a single array in one chromosome. For known special cases such as double or triple HOR domains (the same live HOR in two or three chromosomes; see Section SF1 in morpho-functional classification of AS used in this work (definitions and terminology)), the size of the RM is adjusted appropriately to represent only one chromosome. For divergent HORs not represented by RMs, only the names of sample contigs are provided. These contigs do not contain all copies of the respective HORs and their size does not reflect the genomic copy number.
Genomic size is the estimated length that a HOR occupies in the haploid genome. For homogeneous HORs represented by RMs it is simply the RM length. For the S1C1/5/19 RM the size reflects the length of the HOR array on one chromosome under the assumption that the arrays in all three chromosomes are of equal size. For divergent HORs it is calculated from the data in Table 2 (the sum of corrected copy numbers for all monomers of a HOR multiplied by the monomer length).
Pseudo and relic are the two kinds of dead HORs we discriminate here. As a rule, the live and dead HORs can be discriminated by CENP-A binding and the large size of the live arrays (see other columns in this Table). Dead pseudo HORs are homogeneous (divergence 1–3%) and dead relic HORs are divergent (9–15%).
The figures shown in this column are the numbers of 99 bp sequence reads corresponding to a given HOR out of a 1-million-read sample of the CENP-A ChIP-seq dataset (SRR1561921) obtained from a HuRef lymphoblastoid cell line [12]. The whole dataset (about 6 million reads split into portions of 1 million) was annotated by HumAS-HMMER, used in the same way as described in this paper. As the portions did not differ significantly, the data for only one of them are shown in the Table. For S1C1/5/19, the number is adjusted to represent the length of this HOR array on one chromosome under the assumption that the arrays in all three chromosomes are of equal size. One can see that the live HORs are the major CENP-A binding sites. The data are shown only for homogeneous HORs. The numbers for divergent HORs were slightly higher than for dead homogeneous HORs and much lower than for live homogeneous HORs. However, these numbers were not deemed reliable, because annotation of divergent HORs is admittedly less specific and false coverage has a greater effect on their quantification; both problems could be exacerbated by the short length of the monomer fragments in deep sequencing reads. Thus, CENP-A binding to these HORs was likely to be somewhat overestimated.
HOR S1C3H2 is also represented in the assembly by GJ211867.1 (length 14 kb) which upon thorough analysis was disqualified as a valid RM for this HOR (see Section Non-redundant list of SF1 RMs and Supplementary note 1). HOR S1C1/5/19H1L is also represented in the assembly by GJ212205.1 (length 340 bp) and GJ212206.1 (length 340 bp) which were disqualified as valid RMs for this HOR due to their short length (see Section Non-redundant list of SF1 RMs and Supplementary note 1).
In fact, S1C12H3 is not an 8-mer, but a variable size HOR 1-2-3-(4–5)n-6-7-8 based on 8 types of monomers (see Supplementary note 1).
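The naming system used in Table 1 (SF, chromosomal location, index number, plus the "L" and "d" marks for live and divergent HORs) can be illustrated with a small parser. This is only a sketch: the regular expression is inferred from the names that appear in Tables 1 and 2 and is not an official grammar, the function and field names are hypothetical, and reading "CM" as "multiple chromosomes" is an assumption.

```python
import re
from typing import List, NamedTuple, Optional

# Pattern inferred from the names in Tables 1 and 2 (an assumption, not a spec).
HOR_NAME = re.compile(
    r"^S(?P<sf>\d+)"                                     # suprachromosomal family, e.g. S1
    r"C(?P<chrom>M|[0-9XY]+[pq]?(?:/[0-9XY]+[pq]?)*)"    # chromosome(s); "M" assumed = multiple
    r"H(?P<index>\d+)"                                   # HOR index number
    r"(?P<flag>[Ld])?"                                   # L = live, d = divergent
    r"(?:[-\u2014](?P<variant>[A-Z]))?"                  # variant suffix (hyphen or long dash)
    r"(?:\.(?P<monomer>\d+))?$"                          # monomer number, as in Table 2
)


class HorName(NamedTuple):
    sf: int
    chromosomes: List[str]
    index: int
    mark: Optional[str]
    variant: Optional[str]
    monomer: Optional[int]


def parse_hor_name(name: str) -> HorName:
    """Split a HOR name such as 'S1C1/5/19H1L' into its components."""
    m = HOR_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognized HOR name: {name!r}")
    mark = {"L": "live", "d": "divergent", None: None}[m["flag"]]
    monomer = int(m["monomer"]) if m["monomer"] else None
    return HorName(int(m["sf"]), m["chrom"].split("/"), int(m["index"]),
                   mark, m["variant"], monomer)


if __name__ == "__main__":
    for example in ("S1C1/5/19H1L", "S1C5pH2", "S1C3H3d.4", "S1CMH1d"):
        print(example, parse_hor_name(example))
```

Names without a mark, such as S1C3H2, parse with `mark=None`; whether such HORs are pseudo or relic still has to be read from the Status column.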
Copy number estimate for divergent SF1 HORs.
| HOR name | Total monomers | Corrected total |
|---|---|---|
| S1C10/12H1d.1 | 358 | 259 |
| S1C10/12H1d.2 | 288 | 288 |
| S1C3H3d.1 | 251 | 248 |
| S1C3H3d.2 | 237 | 230 |
| S1C3H3d.3 | 318 | 310 |
| S1C3H3d.4 | 247 | 243 |
| S1C3H3d.5 | 259 | 239 |
| S1C12H3d.1 | 8 | 8 |
| S1C12H3d.2 | 9 | 9 |
| S1C12H3d.3 | 11 | 10 |
| S1C12H3d.4 | 50 | 36 |
| S1C12H3d.5 | 38 | 37 |
| S1C12H3d.6 | 11 | 11 |
| S1C12H3d.7 | 11 | 11 |
| S1C12H3d.8 | 14 | 14 |
| S1CMH1d.1 | 395 | 382 |
| S1CMH1d.2 | 512 | 505 |
| S1CMH1d.3 | 366 | 356 |
| S1CMH1d.4 | 228 | 226 |
The total number of copies of each monomer in each divergent SF1 HOR was calculated from the HOR-track using the Table Browser. It was corrected by subtracting the number of falsely recognized monomers calculated as the sum of the monomers of a given HOR which were SF2 (total of 8), SF3 (total of 1) or SF4+ (total of 180) according to PERCON (the SF track). It was not possible to correct for the false recognition of SF5 monomers as all SF1 divergent HORs except for S1C10/12H1 were archaic and PERCON recognized archaic SF1 as SF1/SF5 mix [1]. S1C10/12H1 was not significantly involved in false SF5 coverage (the total of false SF5 hits was 1).
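The genomic size estimate for divergent HORs (the sum of corrected copy numbers for all monomers of a HOR multiplied by the monomer length) can be sketched as follows. The corrected totals are copied from Table 2. The monomer length of 171 bp is an assumption (the typical AS monomer length), since the value used is not stated here, and the function name is hypothetical.

```python
# Assumed typical alpha satellite monomer length; not stated in the text.
MONOMER_BP = 171

# Corrected totals per monomer, copied from Table 2.
CORRECTED = {
    "S1C10/12H1d": [259, 288],
    "S1C3H3d": [248, 230, 310, 243, 239],
    "S1C12H3d": [8, 9, 10, 36, 37, 11, 11, 14],
    "S1CMH1d": [382, 505, 356, 226],
}


def genomic_size_kb(corrected_counts, monomer_bp=MONOMER_BP):
    """Sum of corrected monomer copy numbers times monomer length, in kb."""
    return sum(corrected_counts) * monomer_bp / 1000


if __name__ == "__main__":
    for name, counts in CORRECTED.items():
        print(f"{name}: {genomic_size_kb(counts):.1f} kb")
```

Under the 171 bp assumption, the results agree to within 1 kb with the genomic sizes listed for these four HORs in Table 1 (93, 217, 23 and 251 kb).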
Fig. 1. Divergence of selected SF1 HORs. This boxplot displays the difference in homogeneity between divergent and homogeneous HORs. It visualizes five summary statistics (the median, two hinges and two whiskers) and all "outlying" points individually, as described at http://ggplot2.tidyverse.org/reference/geom_boxplot.html. For divergent HORs (marked with the letter “d” in the name), divergence (%) was calculated using the sets of aligned monomers shown in Alignment file 1. The number of monomers and the divergence in each set are shown in Table S2. As the figures for different monomers of the same HOR were fairly consistent, the divergence values were pooled for each HOR and used for this generalized boxplot. Four different homogeneous live HORs (marked with the letter “L” in the name) were used for comparison. One of them was highly polymorphic (S1C10H1L) and three were non-polymorphic (S1C7H1L, S1C12H1L and S1C6H1L). For each, a region in the respective RM containing 200–300 copies of each monomer of the basic HOR was selected, and individual monomers with a length of 150 bp or longer were extracted with the Table Browser using their HumAS-HMMER assignments and aligned. Only the monomers of a basic HOR were used for comparisons. The hybrids were not included, although they were present in the sample regions, because we suspected that several different kinds of each hybrid could exist, resulting from independent deletions or other events; in that case the hybrids would not be expected to be homogeneous. For this reason, the sample region for the polymorphic HOR had to be much larger than for the non-polymorphic ones, as a large number of hybrids occurred in the former. One can see that there is a large gap in divergence between homogeneous and divergent HORs. See the note to Table S2 for more details.
Specifications
| Subject area | |
| More specific subject area | |
| Type of data | |
| How data was acquired | |
| Data format | |
| Experimental factors | |
| Experimental features | |
| Data source location | |
| Data accessibility | |
| Related research article | |
The dataset provides a detailed description of the HOR repeat structure of one major family of AS in human centromeric and pericentromeric chromatin as represented in the latest genome assembly. Monomer-by-monomer annotation allows collection of data on polymorphic HOR variants. The dataset can be viewed online as an easy-to-use and familiar UCSC Genome Browser custom track (the HOR-track). The HOR-track can be combined with our previously published SF-track, which provides complete information on every SF1 AS monomer and its genomic context. AS sequencing reads obtained from individual genomes, as well as from ChIP-seq, RNA-seq and other mapping experiments, can be classified by alignment to the annotated assembly.