Literature DB >> 30864654

Haplotype-resolved and integrated genome analysis of the cancer cell line HepG2.

Bo Zhou1,2, Steve S Ho1,2, Stephanie U Greer3, Noah Spies2,4,5, John M Bell6, Xianglong Zhang1,2, Xiaowei Zhu1,2, Joseph G Arthur7, Seunggyu Byeon8, Reenal Pattni1,2, Ishan Saha2, Yiling Huang1,2, Giltae Song8, Dimitri Perrin9, Wing H Wong7,10, Hanlee P Ji3,6, Alexej Abyzov11, Alexander E Urban1,2,12.   

Abstract

HepG2 is one of the most widely used human cancer cell lines in biomedical research and one of the main cell lines of ENCODE. Although the functional genomic and epigenomic characteristics of HepG2 are extensively studied, its genome sequence has never been comprehensively analyzed and higher order genomic structural features are largely unknown. The high degree of aneuploidy in HepG2 renders traditional genome variant analysis methods challenging and partially ineffective. Correct and complete interpretation of the extensive functional genomics data from HepG2 requires an understanding of the cell line's genome sequence and genome structure. Using a variety of sequencing and analysis methods, we identified a wide spectrum of genome characteristics in HepG2: copy numbers of chromosomal segments at high resolution, SNVs and Indels (corrected for aneuploidy), regions with loss of heterozygosity, phased haplotypes extending to entire chromosome arms, retrotransposon insertions and structural variants (SVs) including complex and somatic genomic rearrangements. A large number of SVs were phased, sequence assembled and experimentally validated. We re-analyzed published HepG2 datasets for allele-specific expression and DNA methylation and assembled an allele-specific CRISPR/Cas9 targeting map. We demonstrate how deeper insights into genomic regulatory complexity are gained by adopting a genome-integrated framework.
© The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.

Entities:  

Mesh:

Substances:

Year:  2019        PMID: 30864654      PMCID: PMC6486628          DOI: 10.1093/nar/gkz169

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Genomic instability is a hallmark of cancer where critical genomic changes create gene fusions, the disruption of tumor-suppressor and the amplification of oncogenes (1–3). A comprehensive knowledge of the mutations and larger structural changes that underlie a cancer genome is critical not only for a deeper understanding of the biological processes that drive tumor progression and evolution, but also for the development of targeted cancer therapies. The HepG2 cell line is one of the most widely used cancer cell lines used in many areas of biomedical research due to its extreme versatility, contributing to over 23 000 publications to date, even more than K562. It is a hepatoblastoma cell line derived from a 15-year-old male of European ancestry (4,5). Representing the human endodermal lineage, HepG2 cells are widely used as models for human toxicology studies (6–10), including toxicogenomic screens using CRISPR-Cas9 (11), in addition to studies on drug metabolism (12), cancer (13), liver disease (14), gene regulatory mechanisms (15) and biomarker discovery (16). As one of the main cell lines of the ENCyclopedia Of DNA Elements Project (ENCODE), HepG2 has been used to generate close to 1000 datasets for ENCODE (17). The functional genomic and epigenomics aspects of HepG2 cells have been extensively studied with approximately 325 ChIP-Seq, 300 RNA-Seq and 180 eCLIP datasets available through ENCODE in addition to recent single-cell methylome and transcriptome datasets (18). However, the genome sequence and higher order genomic structural features of HepG2 have never been characterized in a comprehensive manner, even though the HepG2 cell line has been known to contain multiple chromosomal abnormalities (19,20). As a result, the extensive HepG2 functional genomics and epigenomics studies conducted to date were done without reliable genomic contexts for accurate interpretation. Here, we report the first global, integrated and haplotype-resolved whole-genome characterization of the HepG2 cancer genome that includes copy numbers (CN) of large chromosomal regions at high-resolution, single-nucleotide variants (SNVs, also including single-nucleotide polymorphisms, i.e. SNPs) and small insertions and deletions (indels) with allele-frequencies corrected by CN in aneuploid regions, loss of heterozygosity, mega-base-scale phased haplotypes and structural variants (SVs), many of which are haplotype-resolved (Figure 1 and Supplementary Figure S1). The datasets generated in this study form an integrated, phased, high-fidelity genomic resource that can provide the proper contexts for future experiments that rely on HepG2’s unique characteristics. We show how knowledge about HepG2’s genomic sequence and structural variations can enhance the interpretation of functional genomics and epigenomics data. For example, we integrated HepG2 RNA-Seq data and whole-genome bisulfite sequencing data with ploidy and phasing information and identified many cases of allele-specific gene expression and allele-specific DNA methylation. We also compiled a phased CRISPR map of loci suitable for allele specific-targeted genome editing or screening. Finally, we demonstrate the power of this resource by providing compelling insights into the mutational history of HepG2 and oncogene regulatory complexity derived from our datasets. The technical framework demonstrated in this study is also suitable for the study of other cancer cell lines and primary tumor samples.
Figure 1.

Comprehensive Overview of the HepG2 Genome. Circos visualization of HepG2 genome variants with the following tracks in concentric order starting with outermost ‘ring’: human genome reference track (hg19); large CN changes (colors correspond to different CN, see legend panel); in 1.5 Mb windows, merged SV density (deletions, duplications, inversions) called using BreakDancer, BreakSeq, PINDEL, LUMPY and Long Ranger; phased haplotype blocks (demarcated with four colors for clearer visualization); SNV density in 1 Mb windows; Indel density in 1 Mb windows; dominant zygosity (heterozygous or homozygous > 50%) in 1 Mb windows; regions with loss of heterozygosity; allele-specific expression; CpG islands exhibiting allele-specific DNA methylation; non-reference LINE1 and Alu insertions; allele-specific CRISPR target sites; large-scale SVs resolved by using Long Ranger (peach: intrachromosomal: dark maroon: interchromosomal); by using GROC-SVs (light-purple: intrachromsomal; dark-purple: interchromosomal).

Comprehensive Overview of the HepG2 Genome. Circos visualization of HepG2 genome variants with the following tracks in concentric order starting with outermost ‘ring’: human genome reference track (hg19); large CN changes (colors correspond to different CN, see legend panel); in 1.5 Mb windows, merged SV density (deletions, duplications, inversions) called using BreakDancer, BreakSeq, PINDEL, LUMPY and Long Ranger; phased haplotype blocks (demarcated with four colors for clearer visualization); SNV density in 1 Mb windows; Indel density in 1 Mb windows; dominant zygosity (heterozygous or homozygous > 50%) in 1 Mb windows; regions with loss of heterozygosity; allele-specific expression; CpG islands exhibiting allele-specific DNA methylation; non-reference LINE1 and Alu insertions; allele-specific CRISPR target sites; large-scale SVs resolved by using Long Ranger (peach: intrachromosomal: dark maroon: interchromosomal); by using GROC-SVs (light-purple: intrachromsomal; dark-purple: interchromosomal).

MATERIALS AND METHODS

HepG2 karyotyping and DNA extraction

HepG2 cells were acquired from the Stanford ENCODE Product Center for Mapping of Regulatory Regions (NHGRI Project 1U54HG006996-01). Karyotyping of HepG2 cells was conducted in the Cytogenetics Laboratory (cytogenetics.stanford.edu) at Stanford University Medical Center (Palo Alto, CA, USA), where 20 metaphase cells were analyzed by GTW banding. DNA extraction was performed using the Qiagen DNeasy Blood & Tissue Kit (Cat No. 69504), and concentration was measured using the Qubit dsDNA BR Assay Kit (Invitrogen, Waltham, MA, USA). Purity of DNA (OD260/280 > 1.8; OD260/230 > 1.5) was verified using NanoDrop (Thermo Scientific, Waltham, MA, USA). Using field-inversion gel electrophoresis on the Pippin Pulse System (Sage Science, Beverly, MA, USA), the extracted DNA was verified to be high molecular weight (mean > 35 kb).

CN of chromosome segments and allele frequencies of SNVs and Indels

Standard short-insert WGS (Supplementary Methods) coverage was calculated in 10-kb bins across the genome and plotted against the %GC content of each bin to verify the existence of discrete clusters corresponding to discrete CNs (Supplementary Figure S2). CN was assigned to a cluster based on the ratio of its mean coverage to that of the lowest cluster. For an example, the cluster with the lowest mean coverage was assigned CN1, and the cluster with twice as much mean coverage was assigned CN2 and so forth. The ratios for the five discrete clusters observed corresponded almost perfectly to CN1, CN2, CN3 and CN4. WGS coverage across the genome and across each chromosome was examined visually to assign CN for different chromosome segments or entire chromosomes based on the cluster analysis where adjacent chromosomal segments with different CNs could be identified by the clearly visible sharp and steep changes in sequencing coverage (Supplementary Figure S3 and Supplementary Data). For each chromosome segment, SNVs and Indels were called using by GATK Haplotypecaller (version 3.7) (21) by specifying the CN or ploidy of that chromosome segment (stand_emit_conf = 0.1, variant_index_type = LINEAR, variant_index_parameter = 128000, ploidy = {CN}). The resulting Haplotypecaller outputs from all chromosome segments were then concatenated, and variant quality scores were recalibrated using GATK VQSR with training datasets (dbSNP 138, HapMap 3.3, Omni 2.5 genotypes, 1000 Genomes Phase 1) as recommended by GATK Best Practices (22,23) and filtered with the setting tranche = 99.0. SNVs and Indels were annotated using dbSNP138 (24) followed by SnpEff (version 4.3; canonical transcripts) (25) and then filtered for protein altering variants using SnpSift (version 4.3; ‘HIGH’ and ‘MODERATE’ putative impact) (26). Protein-altering variants were intersected with the variants from the 1000 Genomes Project (27) and the Exome Sequencing Project (http://evs.gs.washington.edu/EVS/) where overlapping variants were removed using Bedtools (version 2.26) (28). The resulting PPA variant calls were overlapped against the Catalogue of Somatic Mutations in Cancer (29) and Sanger Cancer Gene Census (30).

Identification of LOH

A Hidden Markov Model (HMM) was used to identify genomic regions exhibiting LOH. The HMM is designed with two states: LOH present and LOH absent. We used SNVs that were recalibrated and ‘PASS’-filtered from GATK VQSR as well as overlapped 1000 Genomes Project variants (31). The genome was split into 40-kb bins; heterozygous and homozygous SNVs were tallied for each bin, and bins with <12 SNVs were removed. A bin was classified as heterozygous if ≥50% of the SNVs within the bin are heterozygous; otherwise it was classified as homozygous. This classification was used as the HMM emission sequence. The HMM was initialized with the same initiation and transition probabilities (Prob = 10E-8) (3), and the Viterbi algorithm was used to estimate a best path. Adjacent LOH intervals were merged.

Haplotype phasing and variant analysis using linked-reads

Paired-end linked-reads (median insert size 396 bp, duplication rate 7.68%, Q30 Read1 78.7%, Q30 Read2 63.1%) were aligned to hg19 (alignment rate 90.4%, mean coverage 67.0x, zero coverage 0.117%) and analyzed using the Long Ranger Software (version 2.1.5) from 10x Genomics (32,33) (Pleasanton, CA, USA). Segmental duplications, reference gaps, unplaced contigs, regions with assembly issues and highly polymorphic sites (http://cf.10xgenomics.com/supp/genome/hg19/sv_blacklist.bed, http://cf.10xgenomics.com/supp/genome/hg19/segdups.bedpe) were excluded from the analysis. ENSEMBL annotations (http://cf.10xgenomics.com/supp/genome/gene_annotations.gtf.gz) were used for genes and exons. Phasing was performed by specifying the set of pre-called and filtered HepG2 heterozygous SNVs and Indels from GATK (see above) and formatted using mkvcf from Long Ranger (version 2.1.5). Heterozygous SNVs and Indels with more than two types of alleles in ploidy>2 regions were excluded from analysis. Large (>30 kb) SVs and large-scale complex rearrangements were identified using both the Long Ranger wgs module with the ‘–somatic’ option, GROC-SVs (default settings with breakpoint assembly) (34) and gemtools (35). The ‘–somatic’ option increases the sensitivity of the large-scale SV caller for somatic SVs by allowing the detection of sub-haplotype events and does not affect small-scale variant calling. Variants from Long Ranger analysis indicated as ‘PASS’ were retained. SV breakpoints identified using GROC-SVs were also analyzed for supporting evidence from mate-pair reads (see below). By using the method described in (36), the HepG2 haplotype-blocks identified from Long Ranger were ‘stitched’ to mega-haplotype blocks by leveraging the SNV haplotype imbalance in aneuploid regions where NA12878 linked-read sequencing data (https://support.10xgenomics.com/genome-exome/datasets/2.2.1/NA12878_WGS_v2) were used as the matching control. Only SNVs present in both genomes (HepG2 and NA12878) were included in the mega-haplotype blocks (Supplementary Data and Table 1). For details on linked-read library construction and allele-specific RNA expression, DNA methylation and CRISPR analysis, see Supplementary Methods.
Table 1.

Summary of HepG2 small variant calls and phasing results

SNPsINDELsPhased WGS
All3 337 361892 019% phased heterozygous SNPs99
 Heterozygous/homozygous1 898 493/1 438 868598 882/293 137% phased INDELs78
 Protein altering11 460 (0.3%)1347 (0.2%)Longest phase block31 106 135
dbSNP1383 279 135 (98%)693 348 (78%)Number of phase blocks1628
 Heterozygous/homozygous1 845 345/1 433 790439 143/254 205N50 phase block6 792 324
Novel58 226 (2%)198 671 (22%)N50 Linked-reads per molecule61
 Heterozygous/homozygous53 148/5078159 739/38 932Barcodes detected1 532 287
1000 Genomes Project & Exome Sequencing Project Overlap (with protein altering variants)11 083 (97%)1092 (81%)Mean DNA per barcode (bp)633 889
Novel Protein Altering377255
COSMIC Overlap148 (39%)42 (16%)
Mega-haplotypes
Chromosome Start End Chromosome Arm % of arm covered P-value
221 88889 128 6282p98%2.20E-16
298 803 025243 046 5912q98%2.20E-16
62 69 21156 501 0366p96%8.70E-07
662 383 957170 631 0196q99%3.92E-13
1646 511 76290 230 34316q99%3.87E-05
1734 819 19180 982 38617q83%4.25E-06
Summary of HepG2 small variant calls and phasing results

SV identification from short-insert and mate-pair WGS

For short-insert WGS, we identified structural variants using BreakDancer (version 1.4.5) (37), Pindel (version 0.2.4t) (38) and BreakSeq (version 2.0) (39) with default settings to obtain per-filtered calls. All SV calls were required to be >50 bp. We filtered out BreakDancer calls with <2 supporting paired-end reads and with confidence scores <90. Pindel calls were filtered for SVs with quality scores >400. No further filtering was performed for BreakSeq calls. From the mate-pair sequencing (Supplementary Methods), SV calls were made using LUMPY (version 0.6.11) (40). Split-reads and discordantly mapped reads were first extracted and sorted from the processed alignment file as described in github.com/arq5x/lumpy-sv (40). The lumpyexpress command was issued to obtain pre-filtered SV calls. Segmental duplications and reference gaps (hg19) downloaded from the UCSC Genome Browser (41,42) were excluded from the analysis through the ‘-x’ option. SV calls <50 bp were filtered out. To select for high-confidence calls, only SVs that have ≥5 supporting reads as well as both discordant and split-read support were retained. For details regarding experimental validation of SVs, see Supplementary Methods.

RESULTS

Karyotyping

We obtained HepG2 cells from the Stanford ENCODE Production Center. The cells exhibit a hyperdiploid karyotype of 49 to 52 chromosomes (Figure 2A). All 20 metaphase HepG2 cells analyzed using GTW banding were abnormal; 15 cells were very complex, characterized by multiple structural and numerical abnormalities, and the other 5 show a doubling (or tetraploid expansion) of this abnormal cell line, as typical of tumors both in vitro and in vivo. These include translocation between the chromosome 1p and 21p, trisomies of chromosomes 2, 16 and 17, tetrasomy of chromosome 20, uncharacterized arrangements of chromosomes 16 and 17, and a variable number of marker chromosomes. Five cells demonstrated >100 chromosomes and represent a tetraploid expansion of the stemline described. This tetraploid expansion is consistent with previously published results (19) but also absent from other published cytogenetic analyses of HepG2 (20), suggesting the clonal evolution arose during tumorigenesis or early in the establishment of the HepG2 cell line. Although the ploidies of all chromosomes in our HepG2 cell line were supported by previous published karyotypes (19,20,43), variations do exist and also among the various published analyses especially for chromosomes 16 and 17, suggesting that karyotypic differences exist between different HepG2 cell lines (Supplementary Table S1).
Figure 2.

HepG2 Karyogram and Callset Overview. (A) Representative karyogram of HepG2 cells by GTW banding that shows multiple numerical and structural abnormalities including a translocation between the short arms of chromosomes 1 and 21, trisomies of chromosomes 12, 16 and 17, tetrasomy of chromosome 20, uncharacterized rearrangements of chromosomes 16 and 17 and a two marker chromosomes. ISCN 2013 description: 49∼52,XY,t(1:21)(p22;p11),+2,+16,add(16)(p13),?+17,?add(17)(p11.2),+20,+20,+1∼3mar[cp15]/101∼106,idemx2[cp5]. (B) CNs (by percentage) across the HepG2 genome. (C) Percentage of HepG2 SNVs and Indels that are novel and known (in dbSNP). (D) Violin plot with overlaid boxplot of phased haplotype block sizes, with N50 represented as a dashed line (N50 = 6 792 324 bp) with log-scaled Y-axis. (E) X-axis: chromosome coordinate (Mb). Y-axis: difference in unique linked-read barcode counts between major and minor haplotypes, normalized by SNV density. Haplotype blocks from of normal control sample (NA12878) in blue and from HepG2 in dark gray. Density plots on the right reflects the distribution of the differences in haplotype-specific barcode counts for control sample HepG2. Significant difference (one-sided t-test, P < 0.001) in haplotype-specific barcode counts indicates aneuploidy and haplotype imbalance. Haplotype blocks (with ≥100 phased SNVs) generated from Long Ranger (Dataset 2) for the major and minor haplotypes were then ‘stitched’ to mega-haplotypes encompassing the entire triploid chromosome arms of 2p and 2q.

HepG2 Karyogram and Callset Overview. (A) Representative karyogram of HepG2 cells by GTW banding that shows multiple numerical and structural abnormalities including a translocation between the short arms of chromosomes 1 and 21, trisomies of chromosomes 12, 16 and 17, tetrasomy of chromosome 20, uncharacterized rearrangements of chromosomes 16 and 17 and a two marker chromosomes. ISCN 2013 description: 49∼52,XY,t(1:21)(p22;p11),+2,+16,add(16)(p13),?+17,?add(17)(p11.2),+20,+20,+1∼3mar[cp15]/101∼106,idemx2[cp5]. (B) CNs (by percentage) across the HepG2 genome. (C) Percentage of HepG2 SNVs and Indels that are novel and known (in dbSNP). (D) Violin plot with overlaid boxplot of phased haplotype block sizes, with N50 represented as a dashed line (N50 = 6 792 324 bp) with log-scaled Y-axis. (E) X-axis: chromosome coordinate (Mb). Y-axis: difference in unique linked-read barcode counts between major and minor haplotypes, normalized by SNV density. Haplotype blocks from of normal control sample (NA12878) in blue and from HepG2 in dark gray. Density plots on the right reflects the distribution of the differences in haplotype-specific barcode counts for control sample HepG2. Significant difference (one-sided t-test, P < 0.001) in haplotype-specific barcode counts indicates aneuploidy and haplotype imbalance. Haplotype blocks (with ≥100 phased SNVs) generated from Long Ranger (Dataset 2) for the major and minor haplotypes were then ‘stitched’ to mega-haplotypes encompassing the entire triploid chromosome arms of 2p and 2q.

High-resolution ploidy changes in HepG2

To obtain a high-resolution aneuploid map i.e. large CN changes by chromosomal region in HepG2, WGS coverage across the genome was first calculated in 10-kb bins and plotted against percent GC content where four distinct clusters were clearly observed (44) (Supplementary Figure S2). CNs were assigned to each cluster based on the ratio between its mean coverage and that of the lowest cluster (CN = 1). These assigned large CN changes by chromosomal region confirm the hyperdiploid state of the HepG2 genome as identified by karyotyping (Figure 2A; Supplementary Figure S2 and Supplementary Table S2). We see that 74.1% of the HepG2 genome has a baseline copy number of two (consistent with karyotype), 15.5% copy number of three, 2.7% copy number of four, 0.7% has a copy number of five and 6.9% in a haploid state (Figure 2B and Supplementary Table S2). Furthermore, these high-resolution CN changes across the HepG2 genome were also confirmed by two independent replicates of Illumina Infinium Multi-Ethnic Global-8 arrays (MEGA) array data (Supplementary Figure S3A and Supplementary Data). We found increased CN (CN = 3) over the oncogene VEGFA (6p21.1), which was found to be recurrently duplicated in cases of hepatocellular carcinoma (45).

SNVs and indels

We identified SNVs and indels in HepG2 by taking into account the CN of the chromosomal regions in which they reside so that heterozygous allele frequencies can be assigned accordingly (e.g. 0.33 and 0.67 in triploid regions; 0.25, 0.50 and 0.75 in tetraploid regions). Using GATK Haplotypecaller (21), we identified a total of ∼3.34M SNVs (1.90M heterozygous, 1.44M homozygous) and 0.90M indels (0.60M heterozygous, 0.29M homozygous) (Table 1, Dataset 1). Interestingly, there are 12 375 heterozygous SNVs and indels that have more than two haplotypes in chromosomal regions with CN > 2 (Dataset 1). In addition, chromosome 22 and large continuous stretches of chromosomes 6, 11 and 14 show striking loss of heterozygosity (LOH) (Figure 1 and Supplementary Table S3). Since genomic data from healthy tissue that correspond to HepG2 cells is not available, we intersected these SNVs and indels with dbsnp138 (24) and found the overlap to be ∼3.28M (98%) and ∼0.69M (78%), respectively (Figure 2C and Table 1). We found that 377 SNVs and 255 indels are private protein-altering (PPA) after filtering out those that overlapped with The 1000 Genomes Project (27) or the Exome Sequencing Project (46) (Table 1 and Supplementary Table S4). Moreover, the intersection between the filtered PPA variants and the Catalogue of Somatic Mutations in Cancer (COSMIC) is 39% and 16% for SNVs and indels, respectively (Supplementary Table S5). The gene overlap between HepG2 PPA and the Sanger Cancer Gene Census is 19 (Supplementary Table S6). HepG2 PPA variants include oncogenes and tumor suppressors such as NRAS (47), STK11/LKB1 (48) and PREX2 (49,50) as well as other genes recently found to play critical roles in driving cancer such as CDK12 (51) and IKBKB (52,53). RP1L1, which was recently found to be significantly mutated in hepatocellular carcinoma (45), is also present among the PPA variants.

Resolving haplotypes

We phased the heterozygous SNVs and indels in the HepG2 genome by performing 10X Genomics Chromium linked-read library preparation and sequencing (32,33). Post sequencing quality control analysis shows that 1.49 ng or approximately 447 genomic equivalents of high molecular weight (HMW) genomic DNA fragments (mean = 68 kb, 96.1% >20 kb, 22.0% >100 kb) were partitioned into 1.53 million oil droplets and uniquely barcoded (16 bp). This library was sequenced (2 × 151 bp) to 67x genome coverage with half of all reads coming from HMW DNA molecules with at least 61 linked reads (N50 Linked-Reads per Molecule) (Table 1). We estimate the actual physical coverage (CF) to be 247×. Coverage of the mean insert by sequencing (CR) is 18 176 bp (284 bp × 64) or 30.8%, thus the overall sequencing coverage C = CR × CF = 67×. Distributed over 1628 haplotype blocks (Table 1, Dataset 2), 1.87M (98.7%) of heterozygous SNVs and 0.67M (77.9%) of indels in HepG2 were successfully phased. The longest phased haplotype block is 31.1 Mbp (N50 = 6.80 Mbp) (Figure 2D and Table 1, Dataset 2); however, haplotype block lengths vary widely across different chromosomes (Figure 1 and Supplementary Figure S4). Poorly phased regions correspond to regions exhibiting LOH (Supplementary Table S3 and Figure 1, Dataset 2).

Construction of mega-haplotypes of entire chromosome arms

We constructed mega-haplotypes of entire chromosome arms by leveraging the haplotype imbalance in aneuploid regions in the HepG2 genome where phased haplotype blocks derived from linked-reads were ‘stitched’ together (Table 1 and Figure 2E; Supplementary Data). Briefly, by using a recently developed method (36) specifically for cancer genomes, we counted linked-read barcodes for each phased heterozygous SNVs in haplotype blocks with ≥100 phased SNVs (Dataset 2). Because each barcode is specific for an HMW DNA molecule, the total number of unique barcodes is directly correlated with the number of individual HMW DNA molecules that were sequenced. The fractional representation of a particular genomic sequence (or locus) can be obtained by counting the total number of unique barcodes associated with that particular genomic sequence. Consequently, for each phased haplotype with CN > 2, major and minor haplotypes can be assigned according to the number of barcodes associated with each haplotype (Figure 2E), where the major haplotype is the haplotype with more associated unique barcodes. In genomic regions where CN = 2, the two haplotypes are expected to have similar numbers of unique barcodes. In this method (36), a matched control for comparison is required to confidently discriminate between the major and minor haplotypes. Here, we used NA12878 as normal control because no matching normal tissue sample is available for HepG2 (Figure 2E). After performing the normalization procedures and statistical tests described in (36) to verify haplotype imbalance or aneuploidy genomic regions in HepG2, we then ‘stitched’ together contiguous blocks of phased major and minor haplotypes, respectively. Using this approach, a total of six autosomal mega-haplotypes were constructed (Table 1 and Supplementary Data); four of which encompass entire (or >96%) chromosome arms: 2p, 2q, 6p and 16q (Figure 3). The largest mega-haplotype is approximately 144 Mb long (2q).
Figure 3.

Large SVs in HepG2 Resolved from Linked-Read Sequencing using Long Ranger. HepG2 SVs resolved by identifying identical linked-read barcodes in distant genomic regions with non-expected barcode overlap for identified using Long Ranger (32,33). (A) Disruption of FRK by translocation between chromosomes 6 and 16. (B) 2.47 Mb intra-chromosomal rearrangement between MALRD1 and MLLT10 on chromosome 10. (C) 127 kb duplication on chromosome 7 resulting in partial duplications of USP42 and PMS2. (D) 395 kb duplication within PRKG1 on chromosome 10. (E) 31.3 kb inversion within GUSBP1 on chromosome 5. (F) 60.4 kb inversion that disrupts PPL and SEC14L5.

Large SVs in HepG2 Resolved from Linked-Read Sequencing using Long Ranger. HepG2 SVs resolved by identifying identical linked-read barcodes in distant genomic regions with non-expected barcode overlap for identified using Long Ranger (32,33). (A) Disruption of FRK by translocation between chromosomes 6 and 16. (B) 2.47 Mb intra-chromosomal rearrangement between MALRD1 and MLLT10 on chromosome 10. (C) 127 kb duplication on chromosome 7 resulting in partial duplications of USP42 and PMS2. (D) 395 kb duplication within PRKG1 on chromosome 10. (E) 31.3 kb inversion within GUSBP1 on chromosome 5. (F) 60.4 kb inversion that disrupts PPL and SEC14L5.

Using linked-reads to identify and reconstruct large and complex SVs

From the linked reads, breakpoints of large-scale SVs can be identified by searching for distant genomic regions with linked-reads that have large numbers of overlapping barcodes. SVs can also be assigned to specific haplotypes if the breakpoint-supporting reads contain phased SNVs or indels (32,33). Using this approach (implemented by the Long Ranger software from 10X Genomics), we identified 97 large SVs >30 kb (99% phased) (Dataset 3) and 3473 deletions between 50 bp and 30 kb (78% phased) (Dataset 4). The large SVs include inter- and intra-chromosomal rearrangements (54) (Figure 3A and B), duplications (Figure 3C and D) and inversions (Figure 3E and F). A remarkable example is the haplotype-resolved translocation between chromosomes 16 and 6 (Figure 3A) resulting in the disruption of the non-receptor Fyn-related tyrosine kinase gene FRK, which has been identified as a tumor suppressor (55,56). Another example is the 127 kb tandem duplication on chromosome 7 (Figure 3C) that results in the partial duplication of genes PMS2, encoding a mismatch repair endonuclease, and USP42, encoding the ubiquitin-specific protease 42. An interesting large SV is the 395 kb duplication within PRKG1 (Figure 3D), which encodes the soluble l-α and l-β isoforms of cyclic GMP-dependent protein kinase. We also identified a 193 kb homozygous deletion in PDE4D for HepG2 using linked-read sequencing where six internal exons within the gene are deleted (Figure 3D). Furthermore, we also used the long-range information from the deep linked-reads sequencing dataset to identify, assemble and reconstruct the breakpoints of SVs in the HepG2 genome using a recently developed method called Genome-wide Reconstruction of Complex Structural Variants (GROC-SVs) (34). Here, HMW DNA fragments that span breakpoints are statistically inferred and refined by quantifying the barcode similarity of linked-reads between pairs of genomic regions similar to Long Ranger (32). Sequence reconstruction is then achieved by assembling the relevant linked reads around the identified breakpoints from which SVs are automatically reconstructed. Breakpoints that also have supporting evidence from the 3 kb-mate pair dataset (see ‘Materials and methods’ section) are indicated as high-confidence events. GROC-SVs called a total of 140 high-confidence breakpoints including four inter-chromosomal events (Figure 1, Dataset 5 and Figure 4A–D); 138 of the breakpoints were successfully sequence-assembled with nucleotide-level resolution of breakpoints as well the exact sequence in the cases where nucleotides have been added or deleted. We identified striking examples of inter-chromosomal rearrangements or translocations in HepG2 between chromosomes 1 and 4 (Figure 4A) and between chromosomes 6 and 17 (Figure 4B) as well as breakpoint-assembled large genomic deletions (Figure 4C, Dataset 5). This break-point assembled 335 kb heterozygous deletion is within the NEDD4L on chromosome 18. Finally, we identified a large (1.3 mb) intra-chromosomal rearrangement that deletes large portions of RBFOX1 and RP11420N32 in one haplotype on chromosome 16 using deep linked-read sequencing (Figure 4D, Dataset 3, Dataset 5).
Figure 4.

HepG2 SVs Reconstructed and Assembled Using GROC-SVs in HepG2. (A–D) Each line depicts a fragment inferred from 10X-Genomics data based on clustering of reads with identical barcodes (Y-axis) identified from GROC-SVs (34). Abrupt ending (dashed vertical line) of fragments indicates location of SV breakpoint. All breakpoints depicted are validated by 3 kb-mate-pair sequencing data. Fragments are phased locally with respect to surrounding SNVs (haplotype-specific) are in orange, and black when no informative SNVs are found nearby. Gray lines indicate portions of fragments that do not support the current breakpoint. (A) Translocation between chromosomes 1 and 4. Linked-read fragments containing overlapping barcodes that map to chromosome 1 end abruptly near 248.60 mb indicating a breakpoint, and then continues simultaneously near 168.75 mb on chromosome 4. (B) Translocation between chromosomes 6 and 17. Linked-read fragments containing overlapping barcodes that map to chromosome 17 end abruptly near 36.17 mb indicating a breakpoint and then continues simultaneously near 113.52 mb on chromosome 6. (C) Large (335 kb) heterozygous deletion within NEDD4L on chromosome 18. (D) Large (1.3 mb) intra-chromosomal rearrangement that deletes large portions of RBFOX1 and RP11420N32 on chromosome 16.

HepG2 SVs Reconstructed and Assembled Using GROC-SVs in HepG2. (A–D) Each line depicts a fragment inferred from 10X-Genomics data based on clustering of reads with identical barcodes (Y-axis) identified from GROC-SVs (34). Abrupt ending (dashed vertical line) of fragments indicates location of SV breakpoint. All breakpoints depicted are validated by 3 kb-mate-pair sequencing data. Fragments are phased locally with respect to surrounding SNVs (haplotype-specific) are in orange, and black when no informative SNVs are found nearby. Gray lines indicate portions of fragments that do not support the current breakpoint. (A) Translocation between chromosomes 1 and 4. Linked-read fragments containing overlapping barcodes that map to chromosome 1 end abruptly near 248.60 mb indicating a breakpoint, and then continues simultaneously near 168.75 mb on chromosome 4. (B) Translocation between chromosomes 6 and 17. Linked-read fragments containing overlapping barcodes that map to chromosome 17 end abruptly near 36.17 mb indicating a breakpoint and then continues simultaneously near 113.52 mb on chromosome 6. (C) Large (335 kb) heterozygous deletion within NEDD4L on chromosome 18. (D) Large (1.3 mb) intra-chromosomal rearrangement that deletes large portions of RBFOX1 and RP11420N32 on chromosome 16. We then employed ‘gemtools’ (35) to resolve and phase large and complex SVs in the HepG2 genome. We identified a complex SV on chromosome 8 that involves a small deletion downstream of ADAM2 that is also within a larger tandem duplication leading to the amplification of the oncogene IDO1 (57) and the first half of IDO2 (Figure 5). Two allele-specific deletions 700 and 200 kb, respectively, were identified in the PDE4D on chromosome 5 (Figure 5). Since chromosome 5 is triploid in HepG2 (Figures 1 and 2A; Supplementary Data), we see approximately twice as much linked-reads barcode representation for the allele harboring the 200 kb deletion, suggesting that this allele of PDE4D has two copies and the allele harboring the 700 kb deletion has one copy (Figure 5). Similarly, we also identified two allele-specific deletions, 290 and 160 kb respectively within AUTS2 on chromosome 7 (Figure 5). Interestingly, for the allele harboring the 160 kb deletion, the non-deleted reference allele is also present at much larger frequency as indicted by the total number of linked-read barcodes, suggesting that the allele harboring the 160 kb deletion within AUTS2 occurs in a fraction of HepG2 cells or sub-clonally (Figure 5). From the total number of linked-read barcodes associated with this 160 kb allele-specific deletion in AUTS2, we estimate that this deletion occurs in 10% of HepG2 cells. All breakpoints identified using ‘gemtools’ were individual polymerase chain reaction (PCR) and Sanger sequencing verified (Supplementary Table S7).
Figure 5.

Large and complex haplotype-resolved SVs using gemtools. Each SV is identified from linked-reads clustered by identical barcodes (i.e. SV-specific barcodes, Y-axis) indicative of single HMW DNA molecules (depicted by each row) that span SV breakpoints. Haplotype-specific SVs are represented in blue and red. X-axis: hg19 genomic coordinate. (Top) Complex SV on chromosome 8 involving a 4585 bp deletion downstream of ADAM2. This deletion is within a tandem duplication leading to the amplification of the IDO1 and the first half of IDO2. The presence of HMW molecules sharing the same linked-read barcodes spanning both breakpoints indicates a cis orientation and occurrence on only one allele of this locus. Schematic diagram of the rearranged structures drawn above the plot. (Middle) Two haplotype-resolved deletions 700 kb (blue) and 200 kb (red), respectively, occurring on two separate alleles within of PDE4D on chromosome 5—the spanning HMW molecules for each deletion do not share SV-specific barcodes, indicating that these deletions are in trans. Two haplotype-resolved deletions, 290 kb (red) and 160 kb (blue) respectively, within AUTS2 on chromosome 7. The reference allele of AUTS2 without the deletion (Haplotype 2) is also detected and resolved by linked-reads (blue, bottom panel). The 160 kb deletion on Haplotype 2 occurs sub-clonally.

Large and complex haplotype-resolved SVs using gemtools. Each SV is identified from linked-reads clustered by identical barcodes (i.e. SV-specific barcodes, Y-axis) indicative of single HMW DNA molecules (depicted by each row) that span SV breakpoints. Haplotype-specific SVs are represented in blue and red. X-axis: hg19 genomic coordinate. (Top) Complex SV on chromosome 8 involving a 4585 bp deletion downstream of ADAM2. This deletion is within a tandem duplication leading to the amplification of the IDO1 and the first half of IDO2. The presence of HMW molecules sharing the same linked-read barcodes spanning both breakpoints indicates a cis orientation and occurrence on only one allele of this locus. Schematic diagram of the rearranged structures drawn above the plot. (Middle) Two haplotype-resolved deletions 700 kb (blue) and 200 kb (red), respectively, occurring on two separate alleles within of PDE4D on chromosome 5—the spanning HMW molecules for each deletion do not share SV-specific barcodes, indicating that these deletions are in trans. Two haplotype-resolved deletions, 290 kb (red) and 160 kb (blue) respectively, within AUTS2 on chromosome 7. The reference allele of AUTS2 without the deletion (Haplotype 2) is also detected and resolved by linked-reads (blue, bottom panel). The 160 kb deletion on Haplotype 2 occurs sub-clonally.

SVs from mate-pair sequencing

To obtain increase sensitivity in the detection medium-sized (1–100 kb) SVs in HepG2, we prepared a 3 kb-mate pair library and sequenced (2 × 151 bp) it to a genome coverage of 7.9x after duplicate removal. The sequencing coverage of each 3 kb insert (CR) is 302 bp (or 10% of the insert size) that translates to a physical coverage (CF) of 79×. Deletions, inversions and tandem duplications from the mate-pair library were identified from analysis of discordant read pairs and split reads using LUMPY (40). Only SVs that are supported by both discordant read-pair and split-read supports were retained. Using this approach, we identified 122 deletions, 41 inversions and 133 tandem duplications (Dataset 6). Approximately 76% of these SVs are between 1 and 10 kb, 86% are between 1 and 100 kb (9% between 10 and 100 kb) and 3% are >100 kb (Dataset 6). Twenty SVs (16 deletions and 4 duplications) were randomly selected for experimental validation using PCR and Sanger sequencing in which 15/16 were successfully validated (93.8%) (Supplementary Table S7).

SVs identified from deep short-insert WGS

Deletions, inversions, insertions and tandem duplications were identified from the HepG2 WGS dataset using Pindel (38), BreakDancer (37) and BreakSeq (39). Since similar categories of SVs were also identified using mate-pair and linked-read sequencing, these SVs were combined with the SVs identified previously using Long Ranger and LUMPY where variations with support from multiple methods and with >50% reciprocal overlap were merged. In total, 6405 SVs were obtained from all methods that include 5226 deletions, 245 duplications, 428 inversions and 494 insertions (only BreakDancer (37) was designed to call insertions) (Supplementary Data). A set of deletion (n = 27) and tandem duplication calls (n = 4) was randomly selected to confirm by PCR and Sanger sequencing, and 30/32 (94%) events were successfully validated (Supplementary Table S7). Consistent with previous analysis (58), deletions show the highest concordance among the various methods of detection compared to duplication and inversion calls (Supplementary Figure S5). As expected, we detected a 520 bp deletion in exon 3 of the β-catenin (CTNNB1) gene (Dataset 4, Supplementary Data), which was previously documented to exist in HepG2 (59). Interestingly, we found no SVs or PPA mutations in the Wnt-pathway gene CAPRIN2 (60), which had been previously reported for hepatoblastoma (61).

Identification of non-reference Alu and LINE1 insertions

From our deep-coverage short-insert WGS data, we also analyzed the HepG2 genome for non-reference LINE1 and Alu retrotransposon insertions using RetroSeq (62) with some modifications. These insertions were identified from paired-end reads that have one of the pair mapping to hg19 uniquely and other mapping to an Alu or LINE1 consensus sequence in full or split fashion (see ‘Materials and methods’ section). Retrotransposon insertion events with greater than five supporting reads were categorized as high confidence and retained (Supplementary Table S8). We identified 1899 and 351 non-reference Alu and LINE1 insertions in the HepG2 genome, respectively (Figure 1). We randomly chose 8 Alu and 10 LINE1 insertions with split-read support for confirmation using PCR and Sanger sequencing where 87.5% and 100% were successfully validated, respectively (Supplementary Table S8).

Allele-specific gene expression

Due to the abundance of aneuploidy in the HepG2 genome, CN changes of genomic regions should be taken into account when analyzing for allele-specific gene expression in order to reduce false positives and false negatives. Using the heterozygous SNV allele frequencies in HepG2 (Dataset 1), we re-analyzed two replicates of HepG2 ENCODE RNA-Seq data. We identified 3189 and 3022 genes that show allele-specific expression (P < 0.05) in replicates one and two, respectively (Figure 1 and Supplementary Table S9). Furthermore, we also identified 862 and 911 genes that would have been falsely identified to have allele-specific expression (false positives), if the copy numbers of SNV allele frequencies were not taken into consideration as well as 446 and 407 genes that would not have been identified (false negatives) in replicates one and two, respectively (Supplementary Table S10).

Allele-specific DNA methylation

Using the phasing information for HepG2 SNVs (Dataset 2), we also identified 384 CpG islands (CGIs) that exhibit allele-specific DNA methylation (Figure 1 and Supplementary Table S11). We obtained two independent replicates of HepG2 whole-genome bisulfite sequencing data (2 × 125 bp, experiment ENCSR881XOU) from the ENCODE Portal (17). Read alignment to hg19 was performed using Bismark (63); 70.0% of reads were uniquely aligned, 44.7% of cytosines were methylated in a CpG context. We then phased methylated and unmethylated CpGs to their respective haplotypes by identifying reads that overlap both CpGs and phased heterozygous SNVs (Dataset 2). We grouped the phased individual CpGs into CGIs and totaled the number of reads that contain methylated and unmethylated cytosines for each CGI allele, normalizing by CN in cases of aneuploidy. Fisher’s exact test was used to evaluate allele-specific methylation, and significant results were selected using a target false discovery rate of 10% (64) (see ‘Materials and methods’ section). Ninety-eight CGIs reside within promoter regions (defined as 1 kb upstream of a gene); 277 are intragenic and 96 lie within 1 kb downstream of 348 different genes (Supplementary Table S11). The following 11 genes are within 1 kb of a differentially methylated CGI and also overlap with the Sanger Cancer Gene Census: FOXA1, GNAS, HOXD13, PDE4DIP, PRDM16, PRRX1, SALL4, STIL, TAL1 andZNF331. Twenty-seven unique CGIs with allele-specific methylation overlap with allele-specific RNA expression (Supplementary Table S9).

Allele-specific CRISPR targets

We identified 38 551 targets in the HepG2 genome suitable for allele-specific CRSIPR targeting (Figure 1 and Supplementary Table S12). Phased variant sequences (including reverse complement) that differ by >1 bp between the alleles were extracted to identify all possible CRISPR targets by pattern searching for [G, C or A]N20GG (see ‘Materials and methods’ section). Only conserved high-quality targets were retained by using a selection method previously described and validated (65). We took the high-quality target filtering process further by taking the gRNA function and structure into account. Targets with multiple exact matches, extreme GC fractions and those with TTTT sequence (which might disrupt the secondary structure of gRNA) were removed. Furthermore, we used the Vienna RNA-fold package (66) to identify gRNA secondary structure and eliminated targets for which the stem–loop structure for Cas9 recognition is not able to form (67). Finally, we calculated the off-target risk score using the tool as described for this purpose (68). A very strict off-target threshold score was chosen in which candidates with a score below 75 were rejected to ensure that all targets are as reliable and as specific as possible.

Genomic sequence and structural context provides insight into regulatory complexity

We show examples of how deeper insights into gene regulation and regulatory complexity can be obtained by integrating genomic sequence and structural contexts with functional genomics and epigenomics data (Figure 6A–D). One example is the allele-specific RNA expression and allele-specific DNA methylation in HepG2 at the PLK2 locus on chromosome 5 (Figure 6A). By incorporating the genomic context in which PLK2 is expressed in HepG2 cells, we see that PLK2 RNA is only expressed from Haplotype 1 (P = 4.66E-10) in which the CGI within the gene is completely unmethylated (P = 1.51E-66) in the expressed allele and completely methylated in the non-expressed allele (Figure 6A, C and D). The second example is allele-specific RNA expression and allele-specific DNA methylation of the TBX2 gene in HepG2 (Figure 6B). The TBX2 locus on chromosome 17 is triploid, and we see that TBX2 is preferentially expressed from Haplotype 1 that has one copy and lower expression is observed from the two copies of Haplotype 2 (P = 0.0179) (Figure 6B and C). We also observed highly preferential DNA methylation of the CGI in Haplotype 1 (P = 1.55E-32) (Figure 6B and D). In addition, there is also an allele-specific CRISPR targeting site for both haplotypes in the promoter region of TBX2 and inside CGI 22251 (1937 bp upstream of TBX2 gene and 2259 bp downstream of the 5′ end of CGI 22251) (Figure 6B).
Figure 6.

Genomic Sequence and Structural Context Provides Insight into Regulatory Complexity in HepG2. (A) Chr5:57,755,334-57,756,803 locus containing the serine/threonine-protein kinase gene PLK2 and CGI 6693 (1463 bp) where phased Haplotype 1 and Haplotype 2. Allele-specific transcription of PLK2 from Haplotype 2 only. CpGs in CGI 6693 are mostly unmethylated in Haplotype 2 (expressed) and highly methylated in Haplotype 1 (repressed). (B) Chr17:59,473,060-59,483,266 locus (triploid in HepG2) containing T-box transcription factor gene TBX2 and CpG Island (CGI) 22251 (10 206 bp) where phased Haplotype 2 has two copies and Haplotype 1 has one copy. Allele-specific transcription of TBX2 from Haplotype 2 only. CpGs in CGI 22251 are unmethylated in Haplotype 1 (repressed) and methylated in Haplotype 2 (expressed). Allele-specific CRISPR targeting site 1937 bp inside the 5′ region of TBX2 for both Haplotypes. (C) Number of allele-specific RNA-Seq reads in Haplotypes 1 and 2 for PLK2 and TBX2 where both genes exhibit allele-specific RNA expression (P = 0.4.66E-10 and P = 0.0179, respectively). (D) Number of methylated and unmethylated phased whole-genome bisulfite-sequencing reads for Haplotypes 1 and 2 in CGI 6693 and CGI 22251 where both CGIs exhibit allele-specific DNA methylation (P = 1.51E-66 and P = 1.55E-32, respectively).

Genomic Sequence and Structural Context Provides Insight into Regulatory Complexity in HepG2. (A) Chr5:57,755,334-57,756,803 locus containing the serine/threonine-protein kinase gene PLK2 and CGI 6693 (1463 bp) where phased Haplotype 1 and Haplotype 2. Allele-specific transcription of PLK2 from Haplotype 2 only. CpGs in CGI 6693 are mostly unmethylated in Haplotype 2 (expressed) and highly methylated in Haplotype 1 (repressed). (B) Chr17:59,473,060-59,483,266 locus (triploid in HepG2) containing T-box transcription factor gene TBX2 and CpG Island (CGI) 22251 (10 206 bp) where phased Haplotype 2 has two copies and Haplotype 1 has one copy. Allele-specific transcription of TBX2 from Haplotype 2 only. CpGs in CGI 22251 are unmethylated in Haplotype 1 (repressed) and methylated in Haplotype 2 (expressed). Allele-specific CRISPR targeting site 1937 bp inside the 5′ region of TBX2 for both Haplotypes. (C) Number of allele-specific RNA-Seq reads in Haplotypes 1 and 2 for PLK2 and TBX2 where both genes exhibit allele-specific RNA expression (P = 0.4.66E-10 and P = 0.0179, respectively). (D) Number of methylated and unmethylated phased whole-genome bisulfite-sequencing reads for Haplotypes 1 and 2 in CGI 6693 and CGI 22251 where both CGIs exhibit allele-specific DNA methylation (P = 1.51E-66 and P = 1.55E-32, respectively).

DISCUSSION

As one of the most widely used cell lines in biomedical research, HepG2’s genomic sequence and structural features have never been characterized in a comprehensive manner beyond its karyotype (19,20) and SNVs identified from ChIP-Seq data and 10× coverage WGS that do not take aneuploidy or CN into consideration (69,70). Here, in summary, we performed a comprehensive analysis of genomic structural features (Figure 1) for the HepG2 cell line that includes SNVs (Dataset 1), Indels (Dataset 1), large CN or ploidy changes across chromosomal regions at 10 kb resolution (Supplementary Table S2), phased haplotype blocks (Dataset 2), phased CRISPR targets (Supplementary Table S12), novel retrotransposon insertions (Supplementary Table S8) and SVs (Datasets 3–6) including deletions, duplications, inversions, translocations, and those that are the result of complex genomic rearrangements. Many of the HepG2 SVs are also phased, assembled and experimentally verified (Dataset 5, Supplementary Tables S7 and S8). We illustrate, using PLK2 and TBX2 (Figure 6A and B), examples where genomic context can enhance the interpretation of function genomics and epigenomics data to derive novel insights into the complexity of oncogene regulation. The Polo-like kinase gene PLK2 (SNK) is a transcriptional target of p53 and also a cancer biomarker (71,72). It has been studied in the contexts of many human cancers (71,73–75). Disruption of PLK2 has also been proposed to have therapeutic value in sensitizing chemo-resistant tumors. Its roles in Burkitt’s lymphoma (76), hepatocellular carcinoma (73) and epithelial ovarian cancer (74) are consistent with that of tumor suppressors while its role in colorectal cancer is consistent with that of an oncogene (75). Interestingly, promoter methylation and/or LOH were linked to the down-regulation of PLK2 in human hepatocellular carcinoma (73). Chemotherapy resistance of epithelial ovarian cancer can be conferred by the down-regulation of PLK2 at the transcriptional level via DNA methylation of the CpG island in the PLK2 promoter (74). Here we show that the down-regulation of PLK2 in HepG2 cancer cells could be achieved through what appears to be allele-specific transcriptional silencing via allele-specific DNA methylation of a large CGI within the gene body (Figure 6A). The T-box transcription factor TBX2 is a critical regulator of cell fate decisions, cell migration and morphogenesis in the development of many organs (77–80). It regulates cell cycle progression (81), and its overexpression has been demonstrated in promoting or maintaining the proliferation of many cancers including melanomas (82), nasopharyngeal cancer (83), breast cancer (84,85), prostate cancer (86) and gastric cancer (87). Here, we show that three copies of the TBX2 gene exist in HepG2 cancer cells as a result of duplication in Haplotype 2. However, it is preferentially expressed in Haplotype 1 possibly due to the highly allele-specific DNA methylation in the CGI that spans its promoter region and most of the gene body (Figure 6B). It is plausible that overexpression of TBX2 in other cancer types are caused by similar genomic rearrangements and/or epigenetic mechanisms where duplication of TBX2 may result in the overexpression and DNA methylation (possibly allele-specific) may contribute an additive effect to TBX2 overexpression or act as the sole contributor where TBX2 is not duplicated (see Supplementary Discussion for detailed discussion of other oncogenes, tumor-suppressors and other genes associated with cancer that are disrupted as a consequence of genomic variation in HepG2). Combining orthogonal methods and signals greatly improves SV-calling sensitivity and accuracy (40,88). We compared SVs identified from various methods. For deletion SVs, linked reads show the highest sensitivity. The linked-reads analysis software Long Ranger detects SVs (deletions, duplications and inversions) larger than 30 kb (Dataset 3) and deletions smaller than 30 kb (Dataset 4)—a wider size spectrum for deletions. Out of the total 4771 unique deletion calls in HepG2, 3364 (71%) can be detected using linked-reads alone; the lower-coverage mate-pair dataset (analyzed using LUMPY (40)) added another 31 calls, and the deep-coverage WGS dataset added the rest (Supplementary Figure S5A). However, for duplications and inversions, we see that many more calls were added by bringing in the mate-pair and deep short-insert whole-genome datasets and incorporating analyses of other mapping signals such as discordant-read-pair and split-reads (Supplementary Figure S5B and C). Overall, we see considerable overlap as well as variant calls specific to each method for deletions (Supplementary Figure S5A), and much less overlap for duplications and inversions (Supplementary Figure S5B and C). This is consistent with what has been shown previously (58) as inversions and duplications are more difficult in principle to accurately resolve. Experimental validation of specific SVs of interest should be conducted by individual laboratories prior to functional follow-up studies. This study is primarily focused on the utilization of Illumina sequencing to resolve SVs in HepG2. In the relatively near future, long-read technologies such as Oxford Nanopore or Pacific Biosciences can be expected to become of considerable utility in the analysis of complex cancer genomes. The eventual incorporation of long-reads can be expected to improve the ability to resolve challenging variants for example inside or flanked by repetitive regions i.e. segmental duplications, and also the sensitivity to detect nested SVs (89,90). Long-reads are accompanied with much higher error-rates compared to short Illumina sequencing, but it is foreseeable that the continuing development of computational tools for and SV detection will at least partially offset this challenge (89–92). All data and results generated from this global whole-genome analysis of HepG2 is publicly available through ENCODE (encodeproject.org) (17). This analysis serves as a valuable reference for further understanding the vast amount of existing HepG2 ENCODE data. Our results also guide future study designs that utilize HepG2 cells including CRISPR/Cas9 experiments where knowledge of the phased genomic variants can extend or modify the number of editing targets including those that are haplotype-specific (Supplementary Table S12) while knowledge of aberrant chromosomal CN changes will allow for more accurate interpretation of functional data in non-diploid regions. This study may serve as a technical archetype for advanced, integrated and global analysis of genomic sequence and structural variants for other widely cell lines with complex genomes. Since HepG2 has been passaged for decades and across many different laboratories, additional genome variation may be present in HepG2 cells that had been long separated from the ENCODE HepG2 production line. Many of the results we discuss here are supported by previous studies, for instance, karyotyping and mutation in CTNNB1 (59), but there are minor differences such as the lack of a mutation in CAPRIN2 (60). We expect that the vast majority of genomic variants that we describe here to be shared across the different versions of HepG2 cells, thus when taking into account for future global studies, substantial insights are expected to be gained. However, if specific loci is of interest for follow-up studies, a first step should always be to experimentally confirm the presence the particular working line of HepG2 as distinct lines may harbor slight variations. While the complexity of the HepG2 genome renders the design and interpretation of functional genomic and epigenomic studies more challenging, the results of this study enables researchers to continue to use HepG2 to investigate the effects of different types of genomic variations on the multiple layers of functionality and regulation for which ENCODE data are already available and continues to be produced.

DATA AVAILABILITY

All raw and processed data files are publicly released on the ENCODE portal (encodeproject.org) via accession ENCBS760ISV. Datasets 1–6 can individually be accessed via ENCODE accession numbers ENCFF336CFC, ENCFF853HHD, ENCFF467ETN, ENCFF717TPE, ENCFF330UFT, ENCFF241CEK, respectively. Click here for additional data file.
  92 in total

1.  Genome engineering using the CRISPR-Cas9 system.

Authors:  F Ann Ran; Patrick D Hsu; Jason Wright; Vineeta Agarwala; David A Scott; Feng Zhang
Journal:  Nat Protoc       Date:  2013-10-24       Impact factor: 13.491

Review 2.  RAS oncogenes: weaving a tumorigenic web.

Authors:  Yuliya Pylayeva-Gupta; Elda Grabocka; Dafna Bar-Sagi
Journal:  Nat Rev Cancer       Date:  2011-10-13       Impact factor: 60.716

3.  Decoding complex patterns of genomic rearrangement in hepatocellular carcinoma.

Authors:  Julio Fernandez-Banet; Nikki P Lee; Kin Tak Chan; Huan Gao; Xiao Liu; Wing-Kin Sung; Winnie Tan; Sheung Tat Fan; Ronnie T Poon; Shiyong Li; Keith Ching; Paul A Rejto; Mao Mao; Zhengyan Kan
Journal:  Genomics       Date:  2014-01-21       Impact factor: 5.736

4.  Mammalian Reverse Genetics without Crossing Reveals Nr3a as a Short-Sleeper Gene.

Authors:  Genshiro A Sunagawa; Kenta Sumiyama; Maki Ukai-Tadenuma; Dimitri Perrin; Hiroshi Fujishima; Hideki Ukai; Osamu Nishimura; Shoi Shi; Rei-Ichiro Ohno; Ryohei Narumi; Yoshihiro Shimizu; Daisuke Tone; Koji L Ode; Shigehiro Kuraku; Hiroki R Ueda
Journal:  Cell Rep       Date:  2016-01-07       Impact factor: 9.423

Review 5.  A census of human cancer genes.

Authors:  P Andrew Futreal; Lachlan Coin; Mhairi Marshall; Thomas Down; Timothy Hubbard; Richard Wooster; Nazneen Rahman; Michael R Stratton
Journal:  Nat Rev Cancer       Date:  2004-03       Impact factor: 60.716

6.  Transcriptional silencing of Polo-like kinase 2 (SNK/PLK2) is a frequent event in B-cell malignancies.

Authors:  Nelofer Syed; Paul Smith; Alexandra Sullivan; Lindsay C Spender; Martin Dyer; Lorraine Karran; Jenny O'Nions; Martin Allday; Ingrid Hoffmann; Dorothy Crawford; Beverley Griffin; Paul J Farrell; Tim Crook
Journal:  Blood       Date:  2005-09-13       Impact factor: 22.113

7.  BreakDancer: an algorithm for high-resolution mapping of genomic structural variation.

Authors:  Ken Chen; John W Wallis; Michael D McLellan; David E Larson; Joelle M Kalicki; Craig S Pohl; Sean D McGrath; Michael C Wendl; Qunyuan Zhang; Devin P Locke; Xiaoqi Shi; Robert S Fulton; Timothy J Ley; Richard K Wilson; Li Ding; Elaine R Mardis
Journal:  Nat Methods       Date:  2009-08-09       Impact factor: 28.547

8.  Melanoma genome sequencing reveals frequent PREX2 mutations.

Authors:  Michael F Berger; Eran Hodis; Timothy P Heffernan; Yonathan Lissanu Deribe; Michael S Lawrence; Alexei Protopopov; Elena Ivanova; Ian R Watson; Elizabeth Nickerson; Papia Ghosh; Hailei Zhang; Rhamy Zeid; Xiaojia Ren; Kristian Cibulskis; Andrey Y Sivachenko; Nikhil Wagle; Antje Sucker; Carrie Sougnez; Robert Onofrio; Lauren Ambrogio; Daniel Auclair; Timothy Fennell; Scott L Carter; Yotam Drier; Petar Stojanov; Meredith A Singer; Douglas Voet; Rui Jing; Gordon Saksena; Jordi Barretina; Alex H Ramos; Trevor J Pugh; Nicolas Stransky; Melissa Parkin; Wendy Winckler; Scott Mahan; Kristin Ardlie; Jennifer Baldwin; Jennifer Wargo; Dirk Schadendorf; Matthew Meyerson; Stacey B Gabriel; Todd R Golub; Stephan N Wagner; Eric S Lander; Gad Getz; Lynda Chin; Levi A Garraway
Journal:  Nature       Date:  2012-05-09       Impact factor: 49.962

9.  LUMPY: a probabilistic framework for structural variant discovery.

Authors:  Ryan M Layer; Colby Chiang; Aaron R Quinlan; Ira M Hall
Journal:  Genome Biol       Date:  2014-06-26       Impact factor: 13.583

10.  TBX2 over-expression promotes nasopharyngeal cancer cell proliferation and invasion.

Authors:  Yan Lv; Meng Si; Nannan Chen; Ya Li; Xingkai Ma; Huijun Yang; Ling Zhang; Hongyan Zhu; Guang-Yin Xu; Ge-Ping Wu; C Cao
Journal:  Oncotarget       Date:  2017-04-13
View more
  17 in total

1.  Structural variant analysis for linked-read sequencing data with gemtools.

Authors:  S U Greer; H P Ji
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

2.  Integrative analysis of liver-specific non-coding regulatory SNPs associated with the risk of coronary artery disease.

Authors:  Ilakya Selvarajan; Anu Toropainen; Kristina M Garske; Maykel López Rodríguez; Arthur Ko; Zong Miao; Dorota Kaminska; Kadri Õunap; Tiit Örd; Aarthi Ravindran; Oscar H Liu; Pierre R Moreau; Ashik Jawahar Deen; Ville Männistö; Calvin Pan; Anna-Liisa Levonen; Aldons J Lusis; Sami Heikkinen; Casey E Romanoski; Jussi Pihlajamäki; Päivi Pajukanta; Minna U Kaikkonen
Journal:  Am J Hum Genet       Date:  2021-02-23       Impact factor: 11.025

Review 3.  Structural variation in the sequencing era.

Authors:  Steve S Ho; Alexander E Urban; Ryan E Mills
Journal:  Nat Rev Genet       Date:  2019-11-15       Impact factor: 53.242

4.  DNA Analysis by Restriction Enzyme (DARE) enables concurrent genomic and epigenomic characterization of single cells.

Authors:  Ramya Viswanathan; Elsie Cheruba; Lih Feng Cheow
Journal:  Nucleic Acids Res       Date:  2019-11-04       Impact factor: 16.971

5.  High-throughput imaging of mRNA at the single-cell level in human primary immune cells.

Authors:  Manasi Gadkari; Jing Sun; Adrian Carcamo; Hugh Alessi; Zonghui Hu; Iain D C Fraser; Gianluca Pegoraro; Luis M Franco
Journal:  RNA       Date:  2022-06-28       Impact factor: 5.636

6.  Development and Characterization of a Factor V-Deficient CRISPR Cell Model for the Correction of Mutations.

Authors:  Luis Javier Serrano; Mariano Garcia-Arranz; Juan A De Pablo-Moreno; José Carlos Segovia; Rocío Olivera-Salazar; Damián Garcia-Olmo; Antonio Liras
Journal:  Int J Mol Sci       Date:  2022-05-22       Impact factor: 6.208

7.  Simultaneous brain cell type and lineage determined by scRNA-seq reveals stereotyped cortical development.

Authors:  Donovan J Anderson; Florian M Pauler; Aaron McKenna; Jay Shendure; Simon Hippenmeyer; Marshall S Horwitz
Journal:  Cell Syst       Date:  2022-04-21       Impact factor: 11.091

8.  Concentration-dependent toxicogenomic changes of silver nanoparticles in hepatocyte-like cells derived from human induced pluripotent stem cells.

Authors:  Xiugong Gao; Rong Li; Robert L Sprando; Jeffrey J Yourick
Journal:  Cell Biol Toxicol       Date:  2020-05-24       Impact factor: 6.691

9.  Refined detection and phasing of structural aberrations in pediatric acute lymphoblastic leukemia by linked-read whole-genome sequencing.

Authors:  Jessica Nordlund; Yanara Marincevic-Zuniga; Lucia Cavelier; Amanda Raine; Tom Martin; Anders Lundmark; Jonas Abrahamsson; Ulrika Norén-Nyström; Gudmar Lönnerholm; Ann-Christine Syvänen
Journal:  Sci Rep       Date:  2020-02-13       Impact factor: 4.379

10.  Structural variation and its potential impact on genome instability: Novel discoveries in the EGFR landscape by long-read sequencing.

Authors:  George W Cook; Michael G Benton; Wallace Akerley; George F Mayhew; Cynthia Moehlenkamp; Denise Raterman; Daniel L Burgess; William J Rowell; Christine Lambert; Kevin Eng; Jenny Gu; Primo Baybayan; John T Fussell; Heath D Herbold; John M O'Shea; Thomas K Varghese; Lyska L Emerson
Journal:  PLoS One       Date:  2020-01-15       Impact factor: 3.240

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.