| Literature DB >> 33957932 |
Richard W Tourdot1,2,3, Gregory J Brunette1,2, Ricardo A Pinto1,2,3, Cheng-Zhong Zhang4,5,6.
Abstract
Haplotype phase represents the collective genetic variation between homologous chromosomes and is an essential feature of non-haploid genomes. Here we describe a computational strategy to reliably determine complete whole-chromosome haplotypes using a combination of bulk long-range sequencing and Hi-C sequencing. We demonstrate that this strategy can resolve the haplotypes of parental chromosomes in diploid human genomes with high precision (>99%) and completeness (>98%) and assemble the syntenic structure of rearranged chromosomes in aneuploid cancer genomes at base pair level resolution. Our work enables direct interrogation of chromosome-specific alterations and chromatin reorganization using bulk DNA sequencing.Entities:
Keywords: Cancer genomics; Chromosome rearrangement; Haplotype
Mesh:
Year: 2021 PMID: 33957932 PMCID: PMC8101039 DOI: 10.1186/s13059-021-02330-1
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1A hierarchical strategy for determining whole-chromosome haplotypes. The haplotype phase of parental chromosomes (open and filled rectangles) are represented by the positions of alternate genotypes at heterozygous sites. Megabase-scale haplotype blocks are determined using linkage information (∼10kb range) from linked-reads or long-read sequencing and then concatenated using Hi-C contacts, first within each chromosome arm and then between the p- and q-arms
Sources of data for parental haplotype inference and benchmarking
| Sample | Data type | Data source | Read count | Mean | Contacts | Application |
|---|---|---|---|---|---|---|
| depth | (>1Mb) | |||||
| RPE-1 | Bulk WGS | [ | 228,708,769a | 13 × | Variant calling | |
| RPE-1 | Linked reads | New | 941,518,426b | 60 ×c | Variant calling | |
| & local phasing | ||||||
| RPE-1 | CCS long reads | New | 4,607,047d | 11 × | Local phasing | |
| RPE-1 | Hi-C | [ | 281,285,484e | 48,124,211 | Long-range phasing | |
| RPE-1 | Single cell with | New | hi-conf variants and | |||
| monosomies | reference haplotypes | |||||
| NA12878 | Linked reads v.1 | 10X Genomicsf | 422,179,395g | 35 ×c | Local phasing | |
| NA12878 | Linked reads v.2 | 10X Genomicsh | 423,854,243i | 35 ×c | Local phasing | |
| NA12878 | Hi-C | [ | 486,848,169j | 91,428,507 | Long-range phasing | |
| NA12878 | Phased VCF | GIABk | hi-conf variants and | |||
| reference haplotypes | ||||||
| NA12878 | Phased VCF | Diploid assemblyl | hi-conf variants and | |||
| reference haplotypes |
aSRR1778442: median insert 243; 208,151,992 fragments aligned in pair; 2 ×101bp reads; duplication rate 0.024.
bMean molecular length 24.8kb; median insert 551; 913,660,083 aligned in pair; 2 ×150bp reads; duplication rate 0.255.
cexcluding the GEMcode sequence and duplicated fragments
dMean read length 7.1kb; 4,606,654 aligned.
eSRS1045722: median insert 364; 279,027,892 aligned in pair; 2 ×150bp reads; duplication rate 0.067.
fhttps://support.10xgenomics.com/genome-exome/datasets/2.1.0/NA12878_WGS_210
gMean molecular length 68.7kb; median insert 349; 407,015,530 aligned in pair; 2 ×150bp reads; duplication rate 0.062.
hhttps://support.10xgenomics.com/genome-exome/datasets/2.2.1/NA12878_WGS_v2
iMean molecular length 85.6kb; median insert 370; 418,283,435 aligned in pair; 2 ×150bp reads; duplication rate 0.079.
jSRR1658572: median insert 377; 484,211,662 aligned in pair; 2 ×101bp reads; duplication rate 0.028.
khttps://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/latest/GRCh38/
lhttp://ftp.dfci.harvard.edu/pub/hli/hifiasm/NA12878-r253/. Phased variants were determined using dipcall (https://github.com/lh3/dipcall) on the sequences of parental chromosomes generated by diploid de novo assembly of the NA12878 genome using PacBio High-Fidelity long reads together with short reads of the parental genomes using hifiasm [40].
Fig. 2Statistical metrics of molecular haplotype linkage between variants at different genomic distance extracted from linked-reads and Hi-C sequencing data of RPE-1 (a,c,e) and NA12878 (b,d,f) cells. a, b Fraction of linked variants. c, d Average number of links between linked variants. e, f Fraction of links consistent with cis linkage. The residual linkage in the linked-reads data ∼50% accuracy (e and f) reflects tagging of unrelated DNA fragments from both parental homologs by the same molecular barcodes by chance
Fig. 3Linking haplotype blocks with Hi-C reads. a Average number of Hi-C links between two segments of 0.5Mb, 1Mb, and 2Mb at different genomic distance calculated using the RPE-1 Hi-C data. b A schematic illustration of haplotype block concatenation using Hi-C reads. The shown example assumes that at least two links are required to join two blocks. (i) Local haplotype blocks (open rectangles) and Hi-C links (curves); (ii) and (iii) merging of adjacent haplotype blocks based on Hi-C linkage; and (iv) joint inference of the haplotype phase of all blocks using all Hi-C links
Fig. 4a, b Haplotype blocks on Chr.5 derived from the NA12878 (a) and RPE-1 (b) linked-reads data. Each row of haplotype blocks is determined using a different switching-penalty cutoff from ΔE=5 to ΔE=10,000. Only blocks with 50 or more phased variants are shown. The accuracy of each block is estimated by the percentage of genotypes that are consistent with the majority haplotype of each block determined using the reference haplotype. Blocks with ≥98% accuracy are colored in gray; those with <98% accuracy are colored in red with brightness scaled by the accuracy (minimum 50%). Three examples of low-accuracy blocks each containing a single intra-block switching error are highlighted with red arrows in the NA12878 genome; these blocks are broken into two high-accuracy blocks at a higher cutoff. Shown below the haplotype blocks are three tracks of regional variant density measured by the number of total detected variants (blue), phased variants in the final haplotype solution (green), and phased variants in the reference data (purple) in 200 kb bins. (We have chosen the Genome-In-A-Bottle data as the NA12878 reference haplotype.) Bins with more than 20 variants (variant density more than 1 per 10 kb) are omitted. Black arrows highlight large regions with low variant density, including the spinal muscular atrophy (SMA) region on 5q13.2 consisting of large (∼200kb) segmental duplications with high sequence similarity (>98%) [47] that cannot be resolved by short reads. The other region in the RPE-1 genome reflects loss-of-heterozygosity. c. Average intra-block accuracy (weighted by the number of variants in each block) and the N50 length of all haplotype blocks in each sample generated using different switching-penalty cutoffs. The NA12878 dataset produces longer haplotype blocks due to having longer input molecules (Table 1)
Comparison between the final haplotype solution and the reference haplotype of NA12878
| All SNV sites | Phased from | Reference | Comparable | Agreed | Accuracy | Fraction of |
|---|---|---|---|---|---|---|
| bulk data | haplotype | sites | completion | |||
| 2,652,381 | 2,319,027a | 1,861,941b | 1,824,401 | 1,818,042 | 0.997 | 0.980 |
| 2,151,642c | 1,815,197 | 1,809,886 | 0.997 | 0.975 | ||
| 2,319,027a | 2,183,123d | 2,122,256 | 2,114,548 | 0.996 | 0.969 | |
| 2,151,642c | 2,096,982 | 2,091,821 | 0.998 | 0.958 | ||
| 9,285 | 7663e | 3618f | 3553 | 3535 | 0.995 | 0.982 |
| 4702c | 3183 | 3177 | 0.998 | 0.880 | ||
| 7663e | 4581g | 4478 | 4426 | 0.988 | 0.978 | |
| 4702c | 3835 | 3827 | 0.998 | 0.837 |
aAll phased variants without any filtering
bVariants detected in the linked-reads data that are also contained in the GIAB release. Total number of phased SNVs in the GIAB release, 1,867,590
cFiltered by haplotype linkage: ≥1 link connecting ref, alt, HapA, and HapB, and minor linkage ≤2 or minor linkage/total linkage ≤0.1
dPhased variants determined from phased de novo assembly of parental chromosomes that are also detected in the linked-reads data. Total number of phased variants from diploid de novo assembly, 2,312,059
eVariants phased by molecular linkage to phased SNVs in the scaffold haplotype solution
fIntersection with phased indel variants in the GIAB data with exactly matching variant genotypes. Total number of phased indels in the GIAB release, 4090
gIntersection with phased indels derived from de novo assembly of parental chromosomes with exactly matching variant genotypes. Total number of phased indels from diploid assembly, 7128
Comparison between the final haplotype solution and the reference haplotype of RPE-1
| Filter | Total variant | Phased from | Phased from | Agreed | Discordant | Fraction of |
|---|---|---|---|---|---|---|
| sites | bulk data | monosomies | discordance | |||
| None | 2,475,311 | 2,242,237 | 2,320,153 | 2,101,195 | 40,006 | 0.019 |
| Allele fractiona | 2,172,689 | 2,087,188 | 2,109,589 | 2,018,906 | 12,903 | 0.006 |
| Linkageb | 2,156,423 | 2,156,346 | 2,071,674 | 2,054,006 | 17,616 | 0.009 |
| Combinedc | 2,054,859 | 2,054,832 | 2,001,674 | 1,993,552 | 8,098 | 0.004 |
aFrom single-cell data: minor allele fraction ≥0.3 in disomic regions and in the [0.18,0.48] range in the trisomic region of Chr.10q
b ≥1 link connecting ref, alt, HapA, and HapB & minor linkage ≤2 or minor linkage/total linkage ≤0.1
cWith both the allele fraction (a) and the linkage (b) filter
Fig. 5Haplotype-resolved DNA copy number and rearrangement analysis of post-crisis RPE-1 cells from Ref. [41]. Schematic diagrams of alterations to each homolog and their clonal fractions are shown below the copy-number plots. Non-telomeric chromosomal fragments involved in complex rearrangements are outlined. a Haplotype-specific DNA copy number and rearrangement (black arcs) of Chr.8 in the X-33 sample. Both rearrangements and copy-number alterations are restricted to the red homolog; the non-integer copy-number states indicate that the altered Chr.8 is present in a subclonal population (∼45%). b Haplotype-specific DNA copy number and rearrangement of Chr.4 (black arcs: intrachromosomal; magenta vertical lines: interchromosomal to Chr.13) in the X-25 sample. Rearrangements and copy-number alterations affect both homologs: Segmental changes of both homologs (blue: gain, red: loss) appear to be subclonal (∼90%). c Haplotype-specific copy number of Chr.6 in the X-36 sample. The blue homolog shows non-constant copy number (1-1.3) near the q-terminus that contrasts with constant copy number of the red homolog (∼1) or the rest of the blue homolog (∼0.7). We interpret this pattern as reflecting extensive copy-number heterogeneity in the population
Fig. 6Haplotype-resolved synteny of rearranged chromosomes and aneuploid genomes. a Walking the translocated X chromosome in the RPE-1 genome using phased Hi-C links (dots) between different homologs of Chr.10 and Chr.X. A significant increase of Hi-C links is seen in only one haplotype combination reflecting cis links generated by the translocation between Chr.10 and Chr.X. The enrichment of Hi-C links throughout the entire X chromosome suggests the 10q segment is attached to an intact X chromosome with structure shown below the Hi-C contact map. Arrows denote the orientation of the two segments from the p-terminus to the q-terminus. b The cytogenetic K-562 karyotype reported in Ref. [42] (reprinted with permission from the publisher) with outlined structurally altered (marker) chromosomes resolved by sequencing data (shown in c). c The digital K-562 karyotype with haplotype assignment to both normal (left) and structurally altered (right) chromosomes determined from linked-reads and Hi-C sequencing data. The digital karyotype mostly agrees with the cytogenetic karyotype and the differences may be attributed to additional alterations during cell culture. Among all marker chromosomes listed in b, we are able to determine the syntenic structure of the following: mar5/6, mar6/6, mar3/10, mar12/21, mar9/17, mar1/18, mar1/6/20, and mar1/21, and resolve most rearrangement junctions at the base-pair level. Arrows represent the orientations of rearranged segments relative to the standard p-q arm orientation. Multiple junctions contained local fold-back rearrangements (inverted colored arrows in t(1A;18A), t(3A;18B), t(6A;1A;20A)) that are consistent with local DNA copy number gains; these events cannot be resolved by cytogenetic analyses. The mar18 described in Ref. [42] is probably related/similar to t(3A;18B). The BCR-ABL amplification is contained in a homogeneously staining region (hsr) in the marker chromosome t(22A;9-13-22hsr). We infer the structure of the amplicon from DNA copy number and rearrangements but cannot validate the inferred structure due to technical limitations. We are further able to partially resolve the structure of the altered Chr.7 and Chr.9 and completely resolve the structure of three additional marker chromosomes described in Ref. [43] but not in Ref. [42]: t(2A;22A), t(3A;10A;17A), t(9A;13A). Details of the analysis are presented in Additional file 5 and explained in Additional file 1:Determination of the K-562 karyotype by haplotype-specific genomic analysis