Sergey Nurk1, Sergey Koren1, Arang Rhie1, Mikko Rautiainen1, Andrey V Bzikadze2, Alla Mikheenko3, Mitchell R Vollger4, Nicolas Altemose5, Lev Uralsky6,7, Ariel Gershman8, Sergey Aganezov9, Savannah J Hoyt10, Mark Diekhans11, Glennis A Logsdon4, Michael Alonge9, Stylianos E Antonarakis12, Matthew Borchers13, Gerard G Bouffard14, Shelise Y Brooks14, Gina V Caldas15, Nae-Chyun Chen9, Haoyu Cheng16,17, Chen-Shan Chin18, William Chow19, Leonardo G de Lima13, Philip C Dishuck4, Richard Durbin19,20, Tatiana Dvorkina3, Ian T Fiddes21, Giulio Formenti22,23, Robert S Fulton24, Arkarachai Fungtammasan18, Erik Garrison11,25, Patrick G S Grady10, Tina A Graves-Lindsay26, Ira M Hall27, Nancy F Hansen28, Gabrielle A Hartley10, Marina Haukness11, Kerstin Howe19, Michael W Hunkapiller29, Chirag Jain1,30, Miten Jain11, Erich D Jarvis22,23, Peter Kerpedjiev31, Melanie Kirsche9, Mikhail Kolmogorov32, Jonas Korlach29, Milinn Kremitzki26, Heng Li16,17, Valerie V Maduro33, Tobias Marschall34, Ann M McCartney1, Jennifer McDaniel35, Danny E Miller4,36, James C Mullikin14,28, Eugene W Myers37, Nathan D Olson35, Benedict Paten11, Paul Peluso29, Pavel A Pevzner32, David Porubsky4, Tamara Potapova13, Evgeny I Rogaev6,7,38,39, Jeffrey A Rosenfeld40, Steven L Salzberg9,41, Valerie A Schneider42, Fritz J Sedlazeck43, Kishwar Shafin11, Colin J Shew44, Alaina Shumate41, Ying Sims19, Arian F A Smit45, Daniela C Soto44, Ivan Sović29,46, Jessica M Storer45, Aaron Streets5,47, Beth A Sullivan48, Françoise Thibaud-Nissen42, James Torrance19, Justin Wagner35, Brian P Walenz1, Aaron Wenger29, Jonathan M D Wood19, Chunlin Xiao42, Stephanie M Yan49, Alice C Young14, Samantha Zarate9, Urvashi Surti50, Rajiv C McCoy49, Megan Y Dennis44, Ivan A Alexandrov3,7,51, Jennifer L Gerton13,52, Rachel J O'Neill10, Winston Timp8,41, Justin M Zook35, Michael C Schatz9,49, Evan E Eichler4,53, Karen H Miga11,54, Adam M Phillippy1.
Abstract
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.
Year: 2022 PMID: 35357919 PMCID: PMC9186530 DOI: 10.1126/science.abj6987
Source DB: PubMed Journal: Science ISSN: 0036-8075 Impact factor: 63.714
Fig. 1. Summary of the complete T2T-CHM13 human genome assembly.
(A) Ideogram of T2T-CHM13v1.1 assembly features. Bottom to top: gaps/issues in GRCh38 fixed by CHM13 overlaid with the density of genes exclusive to CHM13 in red; segmental duplications (SDs) (42) and centromeric satellites (CenSat) (30); and CHM13 ancestry predictions (EUR, European; SAS, South Asian; EAS, East Asian; AMR, Admixed American). (B) Additional (non-syntenic) bases in the CHM13 assembly relative to GRCh38 per chromosome, with the acrocentrics highlighted in black, and (C) by sequence type (note that the CenSat and SD annotations overlap). (D) Total non-gap bases in UCSC reference genome releases dating back to September 2000 (hg4) and ending with T2T-CHM13 in 2021.
Fig. 2. High-resolution assembly string graph of the CHM13 genome.
(A) Bandage (60) visualization, where nodes represent unambiguously assembled sequences scaled by length, and edges correspond to the overlaps between node sequences. Each chromosome is both colored and numbered on the short (p) arm. Long (q) arms are labeled where unclear. The five acrocentric chromosomes (bottom right) are connected due to similarity between their short arms, and the rDNA arrays form five dense tangles due to their high copy number. The graph is partially fragmented due to HiFi coverage dropout surrounding GA-rich sequence (black triangles). Centromeric satellites (30) are the source of most ambiguity in the graph (gray highlights). (B) The ONT-assisted graph traversal for the 2p11 locus is given by numerical order. Based on low depth-of-coverage, the unlabeled light gray node represents an artifact or heterozygous variant and was not used. (C) The multi-megabase tandem HSat3 duplication (9qh+) at 9q12 requires two traversals of the large loop structure (the size of the loop is exaggerated because graph edges are of constant size). Nodes used by the first traversal are in dark purple and the second traversal in light purple. Nodes used by both traversals typically have twice the sequencing coverage. (D) Enlargement of the distal short arms of the acrocentrics, showing the colored graph walks and edges between highly similar sequences in the distal junctions (DJs) adjacent to the rDNA arrays.
Fig. 3. Sequencing coverage and assembly validation.
(A) Uniform whole-genome coverage of mapped HiFi and ONT reads is shown with primary alignments in light shades and marker-assisted alignments overlaid in dark shades. Large HSat arrays (30) are noted by triangles, with inset regions marked by arrowheads and the location of the rDNA arrays marked with asterisks. Regions with low unique marker frequency (light green) correspond to drops in unique marker density, but are recovered by the lower-confidence primary alignments. Annotated assembly issues are compared for T2T-CHM13 and GRCh38. (B–D) Enlargements corresponding to regions of the genome featured in Fig. 2. Uniform coverage changes within certain satellites are reproducible and likely caused by sequencing bias. Identified heterozygous variants and assembly issues are marked below and typically correspond with low coverage of the primary allele (black) and elevated coverage of the secondary allele (red). The percentage of microsatellite repeats in every 128 bp window is shown at the bottom.
Comparison of GRCh38 and T2T-CHM13v1.1 human genome assemblies.
| Summary | GRCh38 | T2T-CHM13 | ±% |
|---|---|---|---|
| Assembled bases (Gbp) | 2.92 | 3.05 | +4.5% |
| Unplaced bases (Mbp) | 11.42 | 0 | −100.0% |
| Gap bases (Mbp) | 120.31 | 0 | −100.0% |
| # Contigs | 949 | 24 | −97.5% |
| Ctg NG50 (Mbp) | 56.41 | 154.26 | +173.5% |
| # Issues | 230 | 46 | −80.0% |
| Issues (Mbp) | 230.43 | 8.18 | −96.5% |
| | | | |
| # Genes | 60,090 | 63,494 | +5.7% |
| protein coding | 19,890 | 19,969 | +0.4% |
| # Exclusive genes | 263 | 3,604 | |
| protein coding | 63 | 140 | |
| # Transcripts | 228,597 | 233,615 | +2.2% |
| protein coding | 84,277 | 86,245 | +2.3% |
| # Exclusive transcripts | 1,708 | 6,693 | |
| protein coding | 829 | 2,780 | |
| | | | |
| % SDs | 5.00% | 6.61% | |
| SD bases (Mbp) | 151.71 | 201.93 | +33.1% |
| # SDs | 24,097 | 41,528 | +72.3% |
| | | | |
| % Repeats | 51.89% | 53.94% | |
| Repeat bases (Mbp) | 1,516.37 | 1,647.81 | +8.7% |
| LINE | 626.33 | 631.64 | +0.8% |
| SINE | 386.48 | 390.27 | +1.0% |
| LTR | 267.52 | 269.91 | +0.9% |
| Satellite | 76.51 | 150.42 | +96.6% |
| DNA | 108.53 | 109.35 | +0.8% |
| Simple repeat | 36.5 | 77.69 | +112.9% |
| Low complexity | 6.16 | 6.44 | +4.6% |
| Retroposon | 4.51 | 4.65 | +3.3% |
| rRNA | 0.21 | 1.71 | +730.4% |
GRCh38 summary statistics exclude “alts” (110 Mbp), patches (63 Mbp), and Chromosome Y (58 Mbp). Assembled bases: all non-N bases. Unplaced bases: not assigned or positioned within a chromosome. # Contigs: GRCh38 scaffolds were split at three consecutive Ns to obtain contigs. NG50: half of the 3.05 Gbp human genome size contained in contigs of this length or greater. # Exclusive genes/transcripts: for GRCh38, GENCODE genes/transcripts not found in CHM13; for CHM13, extra putative paralogs that are not in GENCODE. Segmental duplication analysis is from (42). RepeatMasker analysis is from (49).
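The contig-splitting rule and the NG50 definition in the note above can be illustrated with a minimal Python sketch. The function names `split_contigs` and `ng50` are hypothetical and are not taken from the paper's pipeline. The sketch assumes that "three consecutive Ns" means a run of three or more N characters, and that shorter N runs stay inside a contig.

```python
import re


def split_contigs(scaffold, min_gap=3):
    """Split a scaffold into contigs at runs of >= min_gap consecutive Ns.

    Assumption: runs shorter than min_gap are left inside the contig.
    """
    pieces = re.split(r"[Nn]{%d,}" % min_gap, scaffold)
    # Drop empty pieces produced by leading or trailing gaps.
    return [p for p in pieces if p]


def ng50(contig_lengths, genome_size):
    """Return the contig length L such that contigs of length >= L
    together contain at least half of genome_size.

    Returns None if the contigs never reach half the genome size.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return None


if __name__ == "__main__":
    toy_scaffold = "ACGTNNNACNNGT"
    contigs = split_contigs(toy_scaffold)
    print(contigs)
    print(ng50([len(c) for c in contigs], genome_size=12))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses a fixed genome size (3.05 Gbp in the table), so the values for GRCh38 and T2T-CHM13 are directly comparable.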
Fig. 4. Short arms of the acrocentric chromosomes.
Each short arm is shown along with annotated genes, percent of methylated CpGs (29), and a color-coded satellite repeat annotation (30). The rDNA arrays are represented by a directional arrow and copy number due to their high self-similarity, which prohibits ONT mapping. Percent identity heatmaps versus the other four arms were computed in 10 kbp windows and smoothed over 100 kbp intervals. Each position shows the maximum identity of that window to any window in the other chromosome. The distal short arms include conserved satellite structure and inverted repeats (thin arrows), while the proximal short arms show a diversity of structures. The proximal short arms of Chromosomes 13, 14, and 21 share a segmentally duplicated core, including small alpha satellite HOR arrays and a central, highly methylated, SST1 array (thin arrows with teal block). Yellow triangles indicate hypomethylated centromeric dip regions (CDRs), marking the sites of kinetochore assembly (29).
Fig. 5. Resolved FRG1 paralogs.
(A) Protein-coding gene FRG1 and its 23 paralogs in CHM13. Only 9 are found in GRCh38. Genes are drawn larger than their actual size and the “FRG1” prefix is omitted for brevity. All paralogs are found near satellite arrays. Most copies exhibit evidence of expression, including CpG islands present at the 5′ start site with varying degrees of methylation. (B) Reference (gray) and variant (colored) allele coverage is shown for four human HiFi samples mapped to the paralog FRG1DP. When mapped to GRCh38, the region shows excessive HiFi coverage and variants, indicating that reads from the missing paralogs are mis-mapped to FRG1DP (variants with >80% coverage shown). When mapped to CHM13, HiFi reads show the expected coverage and a typical heterozygous variation pattern for the three non-CHM13 samples (variants with >20% coverage shown). These non-reference alleles are also found in other populations from 1KGP ILMN data. (C) Mapped HiFi read coverage for other FRG1 paralogs, with an extended context shown for Chromosome 20. Coverage of HiFi reads that mapped to FRG1DP in GRCh38 is highlighted (dark gray), showing the paralogous copies they originate from (FRG1BP4–10, FRG1GP, FRG1GP2, and FRG1KP4). Background coverage is variable for some paralogs, suggesting copy number polymorphism in the population. (D) Methylation and expression profiles suggest transcription of FRG1DP in CHM13. In the copy number display (bottom), each length k sequence (k-mer) of the CHM13 assembly is painted with a color representing the copy number of that k-mer sequence in an SGDP sample. The CHM13 and GRCh38 tracks show the copy number of these same k-mers in the respective assemblies. CHM13 copy number resembles all samples from the SGDP, whereas GRCh38 underrepresents the true copy number.
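The k-mer copy number display in panel (D) can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' method: the names `kmer_counts` and `paint_copy_number` are hypothetical, raw k-mer occurrence counts stand in for copy number, and the depth normalization and canonical (strand-independent) k-mers that a real read-based analysis would need are omitted.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k substring across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def paint_copy_number(assembly, sample_counts, k):
    """For each k-mer position along the assembly, look up how often that
    k-mer occurs in the sample. K-mers absent from the sample get 0."""
    assembly = assembly.upper()
    return [sample_counts[assembly[i:i + k]]
            for i in range(len(assembly) - k + 1)]


if __name__ == "__main__":
    # Toy data: one "sample" sequence and a short "assembly" segment.
    sample = kmer_counts(["ACGACG"], k=3)
    print(paint_copy_number("ACGT", sample, k=3))
```

Running the same lookup with counts taken from an assembly instead of a sample gives the per-assembly tracks, so a region where the sample counts exceed the assembly counts points to copies missing from that assembly.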