V A Shepelev, L I Uralsky, A A Alexandrov, Y B Yurov, E I Rogaev, I A Alexandrov.
Abstract
Centromeric alpha satellite (AS) is composed of highly identical higher-order DNA repeats, which make the standard assembly process impossible. Because of this, AS repeats were severely underrepresented in previous versions of the human genome assembly, which showed large centromeric gaps. The latest hg38 assembly (GCA_000001405.15) employed a novel method of approximate representation of these sequences, using AS reference models to fill the gaps. As a result, much more assembled AS became available for genomic analysis. We used the PERCON program, which we described previously, to annotate the various suprachromosomal families (SFs) of AS in the hg38 assembly, and present the results of our primary analysis as an easy-to-read track for the UCSC Genome Browser. The monomeric classes characteristic of the five known SFs were color-coded, which allows quick visual assessment of AS composition in whole multi-megabase centromeres down to each individual AS monomer. Such comprehensive annotation of AS in the human genome assembly was performed for the first time. It showed the expected prevalence of the known major types of AS organization characteristic of the five established SFs. Some less common types of AS arrays were also identified, such as pure R2 domains in SF5, apparent J/R and D/R mixes in SF1 and SF2, and several different SF4 higher-order repeats among reference models and in regular contigs. No new SFs or large unclassed AS domains were discovered. The dataset reveals the architecture of human centromeres and allows classification of AS sequence reads by alignment to the annotated hg38 assembly. The data were deposited here: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&hgt.customText=https://dl.dropboxusercontent.com/u/22994534/AS-tracks/human-GRC-hg38-M1SFs.bed.bz2
Keywords: Alpha satellite; Centromeres; Higher-order repeats; Suprachromosomal families; hg38 human genome assembly
Year: 2015 PMID: 26167452 PMCID: PMC4496801 DOI: 10.1016/j.gdata.2015.05.035
Source DB: PubMed Journal: Genom Data ISSN: 2213-5960
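The abstract describes the annotation as a color-coded UCSC track in which each AS monomer is one feature. A minimal sketch of how such a track could be tallied per monomer class is given below. It assumes the deposited file follows the standard BED layout (chromosome, start, end, name in the first four columns) and that the name column holds the monomer class; neither is confirmed by the record above. The function names and the example rows are hypothetical.

```python
import bz2
from collections import defaultdict


def sum_bed_lengths(lines):
    """Sum feature lengths (end - start) per feature name from BED lines.

    Assumes standard BED columns: chrom, start, end, name, ...
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        # Skip blank lines and UCSC header/comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "unnamed"
        totals[name] += end - start
    return dict(totals)


def sum_bed_file(path):
    """Same tally, reading a bzip2-compressed BED file from disk."""
    with bz2.open(path, "rt") as handle:
        return sum_bed_lengths(handle)


# Hypothetical rows, for illustration only (not taken from the real track).
example = [
    'track name="AS-SFs" itemRgb="On"',
    "chr1\t100\t271\tJ1\t0\t+\t100\t271\t255,0,0",
    "chr1\t271\t442\tJ2\t0\t+\t271\t442\t255,0,0",
    "chr1\t442\t613\tJ1\t0\t+\t442\t613\t255,0,0",
    "chr1\t1000\t1170\tR1\t0\t+\t1000\t1170\t0,0,255",
]
totals = sum_bed_lengths(example)
print(totals)
```

Working on an iterable of lines rather than a file path keeps the tally independent of how the track is obtained, so the same function can be applied to a Table Browser export.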
Classification of monomeric types in live and dead AS layers.
| Functional | Old classification: SF | Old classification: monomer class (type) | New classification: SF/colored layer | New classification: monomer class (type) | Ancestral arrangement | Age group |
|---|---|---|---|---|---|---|
| Live SFs | SF1 | J1(A) | SF1 | J1(A) | Dimeric | New |
| | SF2 | D1(B) | SF2 | D1(B) | Dimeric | New |
| | SF3 | W1(B) | SF3 | W1(B) | Pentameric | New |
| Dead SFs/layers | SF5 | R1(B) | Blue (SF5) | R1(B) | Irregular | Old |
| | SF4 + | M1 + (A) | Yellow (SF4) | M1(A) | Monomeric | Old |
| | | | Yellow-striped (SF6) | V1(A) | Monomeric | Old |
| | | | Olive-green (SF?) | H1(A) | Dimeric | Ancient |
| | | | Red (SF?) | H3(A) | Monomeric | Ancient |
| | | | Gray (SF?) | H4(A) | Monomeric | Ancient |
The table summarizes the data reviewed in [3] and reported in [6].
Used in this paper.
In live domains ancestral arrangement can only be observed or deduced from monomer order within a HOR unit.
The list of unique AS reference models in hg38 assembly.
| # | Chrom | Name | Size (bp) | SF | State | HOR symbol |
|---|---|---|---|---|---|---|
| 1 | chr1 | GJ211836.1 | 198,076 | 3 | ||
| 2 | chr1 | GJ211837.1 | 278,512 | 3 | ||
| 3 | chr1 | GJ211855.1 | 63,597 | 3 | ||
| 4 | chr1 | GJ211857.1 | 83,495 | 3 | ||
| 5 | chr1, 5, 19 | GJ212202.1 | 2,282,185 | 1 | Live | D1Z7/D5Z2/D19Z3 |
| 6 | chr2 | GJ211860.1 | 1,902,412 | 2 | Live | D2Z1 |
| 7 | chr3 | GJ211866.1 | 461,128 | 1,5 | ||
| 8 | chr3 | GJ211867.1 | 13,936 | 1,5 | ||
| 9 | chr3 | GJ211871.1 | 2,102,155 | 1 | Live | D3Z1 |
| 10 | chr4 | GJ211881.1 | 2,031,890 | 2 | Live | D4Z1 |
| 11 | chr5 | GJ211882.1 | 83,162 | 5 | ||
| 12 | chr5 | GJ211883.1 | 227,563 | 5 | ||
| 13 | chr5 | GJ211884.1 | 264,463 | 5 | ||
| 14 | chr5 | GJ211886.1 | 46,345 | 5 | ||
| 15 | chr5 | GJ211887.1 | 142,630 | 1 | ||
| 16 | chr5, 19 | GJ211904.2 | 53,672 | 5 | ||
| 17 | chr5, 19 | GJ211906.2 | 338,504 | 5 | ||
| 18 | chr6 | GJ211907.1 | 1,276,046 | 1 | Live | D6Z1 |
| 19 | chr7 | GJ211908.1 | 2,658,581 | 1 | Live | D7Z1 |
| 20 | chr7 | GJ212194.1 | 150,232 | 5 | ||
| 21 | chr8 | GJ211909.1 | 1,843,521 | 2 | Live | D8Z2 |
| 22 | chr9 | GJ211929.1 | 2,128,923 | 2 | Live | D9Z4 |
| 23 | chr10 | GJ211930.1 | 249,218 | 1 | ||
| 24 | chr10 | GJ211932.1 | 1,561,440 | 1 | Live | D10Z1 |
| 25 | chr10 | GJ211933.1 | 48,180 | 1 | ||
| 26 | chr10 | GJ211936.1 | 47,701 | 1 | ||
| 27 | chr11 | GJ211938.1 | 11,969 | 5 | ||
| 28 | chr11 | GJ211943.1 | 3,251,982 | 3 | Live | D11Z1 |
| 29 | chr11 | GJ211948.1 | 82,575 | 3 | ||
| 30 | chr12 | GJ211949.1 | 47,204 | 1 | ||
| 31 | chr12 | GJ211954.1 | 2,349,957 | 1 | Live | D12Z3 |
| 32 | chr13, 14, 21, 22 | GJ211955.2 | 22,537 | 4 + | ||
| 33 | chr13, 14, 21, 22 | GJ211961.2 | 88,022 | 4 + | ||
| 34 | chr13, 14, 21, 22 | GJ211962.2 | 54,133 | 4 + | ||
| 35 | chr13, 14, 21, 22 | GJ211963.2 | 63,535 | 4 + | ||
| 36 | chr13, 14, 21, 22 | GJ211965.2 | 20,670 | 5 | ||
| 37 | chr13, 14, 21, 22 | GJ211967.2 | 6670 | 4 + | ||
| 38 | chr13, 14, 21, 22 | GJ211968.2 | 3245 | 4 + | ||
| 39 | chr13, 14, 21, 22 | GJ211969.2 | 22,561 | 4 + | ||
| 40 | chr13, 14, 21, 22 | GJ211972.2 | 1,134,211 | 2 | Live | D14Z9/D22Z? |
| 41 | chr13, 14, 21, 22 | GJ211986.2 | 1198 | 4 + | ||
| 42 | chr13, 14, 21, 22 | GJ211991.2 | 632,586 | 2 | Live | D13Z1/D21Z1 |
| 43 | chr13, 14, 21, 22 | GJ212205.1 | 340 | 1 | ||
| 44 | chr13, 14, 21, 22 | GJ212206.1 | 340 | 1 | ||
| 45 | chr15 | GJ212036.1 | 415,278 | 4 + | ||
| 46 | chr15 | GJ212042.1 | 855,957 | 4 + | ||
| 47 | chr15 | GJ212045.1 | 1,370,146 | 2 | Live | D15Z3 |
| 48 | chr16 | GJ212046.1 | 23,302 | 2 | ||
| 49 | chr16 | GJ212051.1 | 1,928,003 | 1 | Live | D16Z2 |
| 50 | chr17 | GJ212053.1 | 381,239 | 3 | ||
| 51 | chr17 | GJ212054.1 | 3,371,615 | 3 | Live | D17Z1 |
| 52 | chr17 | GJ212055.1 | 49,431 | 3 | ||
| 53 | chr18 | GJ212060.1 | 319,478 | 2 | ||
| 54 | chr18 | GJ212062.1 | 4,763,584 | 2 | Live | D18Z1 |
| 55 | chr18 | GJ212066.1 | 93,042 | 2 | ||
| 56 | chr18 | GJ212067.1 | 39,636 | 2 | ||
| 57 | chr18 | GJ212069.1 | 76,958 | 2 | ||
| 58 | chr18 | GJ212071.1 | 21,409 | 2 | ||
| 59 | chr20 | GJ212091.1 | 150,723 | 2 | ||
| 60 | chr20 | GJ212093.1 | 1,886,394 | 2 | Live | D20Z2 |
| 61 | chr20 | GJ212095.1 | 47,956 | 2,5 | ||
| 62 | chr20 | GJ212105.1 | 80,766 | 4 + | ||
| 63 | chr20 | GJ212107.1 | 78,875 | 4 + | ||
| 64 | chr20 | GJ212117.1 | 120,944 | 5 | ||
| 65 | chrX | GJ212192.1 | 3,806,963 | 3 | Live | DXZ1 |
| 66 | chrY | GJ212193.1 | 227,095 | 4 + | Live | DYZ3 |
The identity of reference models marked as “live” with the known live HORs of the respective chromosomes was verified by BLASTing the sequences in our HOR list in [3] against the first 10,000 bp of the respective reference model. In all cases, multiple hits of 93% or higher were obtained.
Only one representative member of each group of identical reference models is listed. For the complete list, see Supplementary Table S1.
Corrected versions of these reference models were obtained from K. Miga and used for analysis.
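The verification described in the footnotes above (HOR sequences BLASTed against the start of each reference model, keeping hits of 93% or higher) can be sketched as a filter over BLAST tabular output. This is an illustration under stated assumptions, not the authors' procedure: it assumes the standard 12-column tabular format (`-outfmt 6`), reads the 93% threshold as percent identity, and applies the 10,000 bp window to subject coordinates after the search rather than truncating the model beforehand. The function name and the example hits are hypothetical.

```python
def hits_in_model_start(blast_tab_lines, min_identity=93.0, window_bp=10_000):
    """Keep hits with identity >= min_identity lying within the first
    window_bp of the subject sequence.

    Expects BLAST tabular columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    kept = []
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident = float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        # Minus-strand hits have sstart > send, so test the larger coordinate.
        if pident >= min_identity and max(sstart, send) <= window_bp:
            kept.append((query, subject, pident))
    return kept


# Hypothetical hits, for illustration only.
example_hits = [
    "HOR_A\tmodel_1\t97.5\t1866\t40\t6\t1\t1866\t120\t1985\t0.0\t3100",
    "HOR_A\tmodel_1\t91.0\t1866\t160\t8\t1\t1866\t5000\t6865\t0.0\t2500",
    "HOR_A\tmodel_1\t98.0\t1866\t35\t2\t1\t1866\t12000\t13865\t0.0\t3200",
    "HOR_A\tmodel_1\t95.2\t1866\t85\t4\t1\t1866\t3851\t1986\t0.0\t2900",
]
confirmed = hits_in_model_start(example_hits)
print(len(confirmed), "hits pass the filter")
```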
Pure R2 regions in hg38 assembly.
| SF | Location | Position in hg38 | Contig | Size | R2% | B-box % | HORs on dot-matrix |
|---|---|---|---|---|---|---|---|
| SF5 | 6q11.1 | chr6:61,326,977–61,336,104 | AMYH02013791.1 | 9127 | 78 | 2 | No HOR |
| SF5 | 6q11.1 | chr6:61,428,794–61,437,937 | FP325349.3 | 9143 | 78 | 2 | No HOR |
| SF5 | 7p11.2 | chr7:57,939,175–57,953,728 | AC138789.1 | 10,294 | 94 | 0 | No HOR |
| SF5 | 7q11.21 | chr7:62,536,194–62,564,614 | AC019063.4 | 24,011 | 83 | 2 | No HOR |
| SF5 | 10p11.1 | chr10:39,432,620–39,442,102 | ABBA01020709.1 | 6981 | 70 | 0 | No HOR |
| SF5 | 11p11.12 | chr11:48,806,070–48,814,307 | AC127495.2 | 8237 | 91 | 0 | No HOR |
| SF5 | 12q11 | chr12:37,632,794–37,639,361 | AC119042.9 | 6567 | 75 | 0 | No HOR |
| SF5 | 16p11.1 | chr16:36,001,814–36,022,913 | AC109490.3 | 21,099 | 94 | 0 | No HOR, duplication 4.8 kb, identity 97.5% |
| SF5 | 16p11.1 | chr16:36,079,689–36,090,000 | FP325312.10 | 10,311 | 93 | 0 | No HOR |
| SF5 | 20q11.1 | chr20:29,908,640–30,038,347 | ABBA01018540.1, GJ212117.1 | 128,442 | 80 | 0 | HOR 1.4 kb |
| SF5 | 20q11.1 | chr20:30,088,752–30,140,826 | FP565326.9 | 51,870 | 83 | 0 | HOR 1.4 kb |
| SF5 | Xq11.1 | chrX:62,611,837–62,642,074 | BX544875.1 | 30,237 | 90 | 3 | No HOR |
Size has been corrected to exclude L1-repeats and gaps.
These contigs are partial segmental duplications of each other.
This HOR is about 97% identical to the one in GJ212117.
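Because the Size column is corrected to exclude L1 repeats and gaps, it can be smaller than the coordinate span given in the Position column. The sketch below parses a position string from the table and compares the raw span with the listed size. The function names are hypothetical, and the span is taken as end minus start, which matches the rows where no correction appears to apply (for example the first chr6 row).

```python
import re


def parse_position(position):
    """Split a string such as 'chr6:61,326,977–61,336,104' into
    (chrom, start, end); accepts an en dash or a hyphen between coordinates."""
    match = re.fullmatch(r"(chr\w+):([\d,]+)[–-]([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return chrom, start, end


def span_bp(position):
    """Raw coordinate span (end - start) of a position string."""
    _, start, end = parse_position(position)
    return end - start


def excluded_bp(position, corrected_size):
    """Bases removed by the size correction (L1 repeats and gaps)."""
    return span_bp(position) - corrected_size


# Two rows from the table above: one uncorrected, one corrected.
print(span_bp("chr6:61,326,977–61,336,104"))             # span equals listed size
print(excluded_bp("chr7:57,939,175–57,953,728", 10294))  # bases excluded
```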
Fig. 1. Comparison of AS SF profiles of the hg38 human genome assembly and the HuRef WGS dataset.
The figure plots the SF content of the two datasets in Mb per haploid genome (3 × 10⁹ bp). For the WGS dataset (1 million reads), the number of AS monomers identified by PERCON was multiplied by the average length of a monomer in this dataset (146 bp) and normalized to the genome size (shown as “HuRef raw”). The same amount of AS, divided in the proportions obtained in the same sample with bad ends trimmed and filtered for monomers of 140 bp or longer (average monomer length 168 bp), is shown as “HuRef corrected” (see Fig. S2 for details). For the assembly, the lengths of all monomers identified by PERCON in each category were summed directly from the PERCON track using the Table Browser. In both datasets, the real amounts are slightly underestimated in a similar manner, because the small gaps that PERCON often leaves between monomers due to imperfect alignment of the ends are not taken into account.
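The normalization described in the caption can be written out as a short calculation. This is a sketch under assumptions: the caption gives the average monomer length (146 bp) and the genome size, but not the monomer counts or the total number of bases in the read sample, so the values used below for those are hypothetical placeholders. The function names are also hypothetical.

```python
GENOME_SIZE_BP = 3e9  # haploid genome size stated in the caption


def as_mb_per_genome(n_monomers, avg_monomer_bp, sample_bp,
                     genome_bp=GENOME_SIZE_BP):
    """Scale the AS content of a read sample to Mb per haploid genome.

    n_monomers * avg_monomer_bp is the AS sequence found in the sample;
    dividing by sample_bp gives the AS fraction, which is then applied
    to the genome size.
    """
    as_fraction = n_monomers * avg_monomer_bp / sample_bp
    return as_fraction * genome_bp / 1e6


def redistribute(total_mb, corrected_counts):
    """Split a fixed AS total among categories in the proportions seen
    in the trimmed and filtered sample (the "HuRef corrected" step)."""
    total_count = sum(corrected_counts.values())
    return {sf: total_mb * n / total_count
            for sf, n in corrected_counts.items()}


# Hypothetical inputs: 20,000 monomers found in 1 million reads of 800 bp.
raw_mb = as_mb_per_genome(n_monomers=20_000, avg_monomer_bp=146,
                          sample_bp=1_000_000 * 800)
print(round(raw_mb, 2))

# Hypothetical corrected counts for two categories.
corrected = redistribute(10.0, {"SF1": 300, "SF2": 100})
print(corrected)
```

Keeping the total fixed and changing only the proportions mirrors the caption's description: the corrected profile reassigns the same amount of AS among categories rather than re-estimating it.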
SF1/SF5 and SF2/SF5 mixed AS regions in hg38 assembly.
| SF | Location | Position in hg38 | Contig | Size | SF1% | SF2% | SF5% | HORs on dot-matrix |
|---|---|---|---|---|---|---|---|---|
| SF2/SF5 | 2q21.2 | chr2:132,237,392–132,247,263 | AC097532.3 | 9871 | 0 | 22 | 58 | No HOR |
| SF1/SF5 | 3p11.1 | chr3:90,482,385–90,722,299 | ABBA01004652.1, AEKP01209350.1, ABBA01004653.1, AEKP01209353.1, ABBA01004654.1, ABBA01004655.1, ABBA01004656.1 | 229,441 | 40 | 0 | 41 | No HOR |
| SF1/SF5 | 3p11.1 3q11.1 | chr3:90,772,554–91,233,510 | GJ211866.1 | 460,956 | 48 | 0 | 35 | HOR 1.7 kb |
| SF1/SF5 | 3q11.1 | chr3:91,233,782–91,247,547 | GJ211867.1 | 13,765 | 57 | 0 | 31 | No HOR other than AB dimer, identity ~ 93% |
| SF1/SF5 | 3q11.1 | chr3:91,247,775–91,286,183 | ABBA01000927.1, ABBA01000928.1, ABBA01000929.1, ABBA01000930.1, ABBA01000931.1 | 17,430 | 50 | 0 | 32 | HOR 1.7 kb |
| SF1/SF5 | 3q11.1 | chr3:93,716,246–93,725,946 | ABBA01026974.1 | 9700 | 35 | 0 | 49 | No HOR |
| SF1/SF5 | 6q11.1 | chr6:60,230,028–60,241,613 | AC244258.2 | 11,401 | 28 | 0 | 53 | No HOR |
| SF1/SF5 | 6q11.1 | chr6:61,371,445–61,427,364 | AEKP01189806.1, AEKP01189805.1, AEKP01189804.1, AEKP01189803.1, AEKP01189802.1, FP325349.3 | 55,519 | 30 | 0 | 57 | No HOR |
| SF2/SF5 | 7q11.1 | chr7:61,096,433–61,103,082 | AC142121.2, AC017075.8 | 6649 | 0 | 38 | 46 | No HOR |
| SF1/SF5 | 8p11.1 | chr8:43,940,231–43,965,733 | AC127507.4, AC144576.3 | 22,886 | 15 | 0 | 74 | No HOR |
| SF1/SF5 | 8q11.1 | chr8:45,946,092–45,971,262 | AC118650.5 | 22,549 | 16 | 0 | 70 | No HOR |
| SF2/SF5 | 9p11.2 | chr9:40,556,928–40,565,104 | AMYH02020868.1, FP325318.4 | 7524 | 0 | 40 | 46 | No HOR |
| SF2/SF5 | 9p11.2 | chr9:40,862,745–40,873,147 | AL353626.5 | 10,402 | 0 | 28 | 50 | No HOR |
| SF1/SF5 | 10p11.1 | chr10:39,548,571–39,555,979 | ABBA01020707.1 | 7408 | 35 | 0 | 52 | No HOR |
| SF1/SF5 | 12p11.1 | chr12:34,686,342–34,715,037 | AC144535.4, AUXG01000432.1 | 28,658 | 60 | 0 | 30 | No HOR |
| SF2/SF5 | 16p11.2 | chr16:34,219,066–34,252,724 | AC136932.4, ABBA01017803.1 | 30,263 | 0 | 35 | 36 | No HOR |
| SF2/SF5 | 20q11.1 | chr20:28,509,094–28,556,877 | GJ212095.1 | 47,783 | 0 | 27 | 54 | HOR 1.9 kb |
| SF2/SF5 | 22q11.1 | chr22:15,965,313–15,972,300 | AC145543.3 | 6987 | 0 | 41 | 34 | No HOR |
Size has been corrected to exclude L1-repeats and gaps.
Same HOR as in GJ211866.
| Specifications | |
|---|---|
| Organism/cell line/tissue | |
| Sex | Both |
| Sequencer or array type | hg38 human genome assembly |
| Data format | Analyzed |
| Experimental factors | N/A |
| Experimental features | N/A |
| Consent | N/A |
| Sample source location | N/A |