Hidewaki Nakagawa1, Masashi Fujita1. 1. Laboratory for Genome Sequencing Analysis, RIKEN Center for Integrative Medical Sciences, Tokyo, Japan.
Abstract
Explosive advances in next-generation sequencer (NGS) and computational analyses have enabled exploration of somatic protein-altered mutations in most cancer types, with coding mutation data intensively accumulated. However, there is limited information on somatic mutations in non-coding regions, including introns, regulatory elements and non-coding RNA. Structural variants and pathogen in cancer genomes remain widely unexplored. Whole genome sequencing (WGS) approaches can be used to comprehensively explore all types of genomic alterations in cancer and help us to better understand the whole landscape of driver mutations and mutational signatures in cancer genomes and elucidate the functional or clinical implications of these unexplored genomic regions and mutational signatures. This review describes recently developed technical approaches for cancer WGS and the future direction of cancer WGS, and discusses its utility and limitations as an analysis platform and for mutation interpretation for cancer genomics and cancer precision medicine. Taking into account the diversity of cancer genomes and phenotypes, interpretation of abundant mutation information from WGS, especially non-coding and structure variants, requires the analysis of large-scale WGS data integrated with RNA-Seq, epigenomics, immuno-genomic and clinic-pathological information.
Explosive advances in next-generation sequencer (NGS) and computational analyses have enabled exploration of somatic protein-altered mutations in most cancer types, with coding mutation data intensively accumulated. However, there is limited information on somatic mutations in non-coding regions, including introns, regulatory elements and non-coding RNA. Structural variants and pathogen in cancer genomes remain widely unexplored. Whole genome sequencing (WGS) approaches can be used to comprehensively explore all types of genomic alterations in cancer and help us to better understand the whole landscape of driver mutations and mutational signatures in cancer genomes and elucidate the functional or clinical implications of these unexplored genomic regions and mutational signatures. This review describes recently developed technical approaches for cancer WGS and the future direction of cancer WGS, and discusses its utility and limitations as an analysis platform and for mutation interpretation for cancer genomics and cancer precision medicine. Taking into account the diversity of cancer genomes and phenotypes, interpretation of abundant mutation information from WGS, especially non-coding and structure variants, requires the analysis of large-scale WGS data integrated with RNA-Seq, epigenomics, immuno-genomic and clinic-pathological information.
Cancer is essentially a disease of the genome which evolves and progresses with accumulations of somatic mutations, including copy‐number alterations (CNA) and structural variants (SV), and epigenomic alterations with and without some hereditability (germline variants).1, 2 A number of familial cancer segregation studies and loss‐of‐heterozygosity (LOH) analyses on cancer tissues have identified germline and somatic mutations of several classical tumor suppressor genes, such as RB1, TP53 and APC,2, 3 and copy‐number analysis has found some oncogenes and underlying oncogenic activators, such as HER2/ERBB2 and MYC.1, 4 Some of these oncogenic mutations have been successfully targeted for molecular therapy, and specific and recurrent mutations of these oncogenes are now used to predict sensitivity to therapy, prognosis and residual disease.1, 2Explosive advances in next‐generation sequencer (NGS) and computational analyses handling massive data have enabled us to comprehensively analyze cancer genome profiles at research and clinical levels, such as targeted sequencing for hundreds of genes, whole exome sequencing (WES), RNA sequencing (RNA‐Seq) and whole genome sequencing (WGS).5, 6 So far, to explore cancer genomic alterations and their diversity, more than 50 000 cancer genomes have been sequenced and accumulated worldwide, including The Cancer Genome Atlas (TCGA)2, 7 and The International Cancer Genome Consortium (ICGC),8 and hundreds of millions of cancerpatients will have their genome sequenced by 2030. In these projects so far, WES is the main platform for cancer genome sequencing and vast amounts of mutational data in protein‐coding regions have been accumulated for all types of common and rare humantumors. These systematic studies of these cancer genome data have reveled scores of new cancer genes and pathways,2 and saturation analysis suggests that most driver genes with frequent mutation in cancer have been almost elucidated.9, 10 Researchers are now focusing on a “long tail” of rare mutated driver genes2, 11 and rare variants of the driver genes, in addition to validating these functional or clinical implications by integrating with clinical data and using functional assays. Pan‐cancer analysis on WES data demonstrated that carcinogen‐exposed cancer, such as melanoma and lung cancer, have far higher numbers of somatic mutations in coding regions among common cancers, while pediatric tumors and leukemia have much fewer mutations and only several protein‐altered mutations are present in their whole coding regions.12, 13 TCGA and ICGC provide comprehensive mutational data of coding regions in more than 20 000 cancers and the COSMIC database14 has been extensively curating mutations from targeted sequencing and WES, summarizing coding mutations for more than 1 000 000 cancer samples. However, there is limited information on somatic mutations in non‐coding regions spanning 98% of the human genome, which includes untranslated regions (UTR), introns, promoters, regulatory elements, non‐coding functional RNA, repetitive regions and mitochondrial genomes. Somatic structural variants (SV) including large deletions/insertions, inversion, duplication, translocation and pathogen (virus) integration in cancer genomes also remain widely unexplored (Figure 1A). WGS approaches can cover all of these unexplored mutations (Table 1) and help us to better understand the “whole” landscape of cancer genomes and elucidate the functions of these unexplored human genomic regions (Figure 1A). This approach combined with mathematical analysis and other omics analysis can clarify the underlying carcinogenesis and achieve molecular sub‐classification of cancer, which facilitates discovery of genomic biomarkers and personalized cancer medicine. This review describes recent technical approaches for cancer WGS and the future direction of cancer WGS, and discusses its utility and limitations as an analysis platform and for mutation interpretation for cancer genomics and cancer precision medicine.
Figure 1
A, Whole genome sequencing (WGS) by next‐generation sequencer (NGS) can detect non‐coding mutations, structural variants (SV), including copy number alterations (CNA), mitochondria mutations and pathogen detection, as well as protein‐coding mutations; B, A representative Circos plot of cancer genome structure from WGS analysis, which indicates SV and CNA in all human chromosomes (1‐22+XY). Chromothripsis was observed in chromosomes 1 and 14. SNV, single nucleotide variants
Table 1
Detectable mutations by each analysis platform for cancer genome
DNA chip
Target‐Seq
Exome
RNA‐Seq
WGS
Coding SNV
▵
○
▵
○
Coding indels
▵
○
▵
○
Splicing alteration
▵
○
○
Promoter mutation
○
Regulatory regions
○
Copy‐number alteration
○
▵
○
Structural variant
▵ (fusion)
○
Pathogen
○
○
Mitochondria
▵
▵
○
Mutational signature
▵
○
Neo‐antigen/HLA
○
○
○
Sequence (Gb)
—
0.5‐1
10
5‐10
90‐150
Assay cost ($)
100
200‐500
500
500
1000
SNV, single nucleotide variants; WGS, whole genome sequencing.
A, Whole genome sequencing (WGS) by next‐generation sequencer (NGS) can detect non‐coding mutations, structural variants (SV), including copy number alterations (CNA), mitochondria mutations and pathogen detection, as well as protein‐coding mutations; B, A representative Circos plot of cancer genome structure from WGS analysis, which indicates SV and CNA in all human chromosomes (1‐22+XY). Chromothripsis was observed in chromosomes 1 and 14. SNV, single nucleotide variantsDetectable mutations by each analysis platform for cancer genomeSNV, single nucleotide variants; WGS, whole genome sequencing.
NEXT‐GENERATION SEQUENCER TECHNOLOGY AND WHOLE GENOME SEQUENCING ANALYSIS
Detectable mutations by each of the genomic analysis platforms (DNA chip, target sequencing for 100 genes, WES, RNA‐Seq and WGS) and their performances are summarized in Table 1. WES analysis captures protein‐coding exons spanning approximately 50 Mb (1%‐2%) of the human genome by in‐solution hybridization,15 microarray capturing or PCR amplification, and usually sequences approximately 100 × sequence depth for each sample, which is more accurate than 30 × WGS, because the accuracy of mutation calling by NGS is primarily dependent on the sequencing depth. However, some capturing bias expected is, for example due to difficulty detecting complicated or repetitive genomic regions as well as non‐targeted regions. On the other hand, WGS is technically straightforward. DNA is randomly fragmented by physical shearing, and 30‐50× sequence depth (90‐150 Gb) of each human whole genome is usually sequenced for both cancer and normal genomes,5, 6 which can cover 99% of the entire human genome. Common NGS technology reads are 100‐150 bp for both ends of a 500‐600‐bp DNA fragment,16 but WGS by NGS is still dependent on PCR, with PCR bias, indicating that GC‐rich or AT‐rich regions are difficult to sequence. PCR‐free protocol was recently developed, which shows less GC bias and is more comprehensive than the PCR protocol, although some μg DNA is required as an input for library preparation. The largest limitation of the present 2nd NGS technology (Illumina SBS technology)16 is its short‐read length (100‐250 bp). Around 50% of the 3‐Gb human genome is occupied by repetitive regions and pseudogenes in its 50%, and when short sequence reads are aligned to the redundant reference genome, alignment errors can occur around these repetitive or complicated regions, leading to mutation calling errors.17 The 3rd generation NGS technologies such as PacBio SMART single molecule sequencing18 and nanopore sequencing19 can yield 10 kb and longer reads without PCR bias and are promising for analysis for human WGS; however, they currently suffer from a high error rate (5% and more) in each read and are still prohibitively expensive for WGS analysis, considering that the cost of human WGS was just below $1000 in 2017 (Table 1).
COMPUTATIONAL ANALYSIS FOR CANCER WHOLE GENOME SEQUENCING
One of the most challenging issues for cancer WGS is computational analysis. Cancer WGS is required to produce more than 90‐150 Gb ×2 (cancer and normal DNA) of sequence data, corresponding to approximately one terabyte for raw data. Large computer resources are required to handle WGS datasets and to perform alignment and variant calling promptly for thousands of cancer WGS. Academic genome centers are usually increasing their computer resources for WGS, but it would be not sufficient in these academia resources for analysis on tens of thousands of WGS dataset. Cloud computing systems can solve these problems and facilitate the sharing of genomic data globally, although there are technical problems with data transfer and ethical and legal issues in some areas.20A representative set of computational pipelines and the analysis workflow for cancer WGS are shown in Figure 2. As an initial step, raw sequence data (90‐150 Gb ×2) from NGS (FASTQ files) are aligned to the 3‐Gb human reference sequence (hg19 or new hg38) by BWAmem and other programs, producing BAM files, and PCR duplicates are removed from the BAM file (usually a few percent). Somatic mutations are called by several algorithms specific to somatic mutation types, such as single nucleotide variants (SNV), short indels, CNA and SV, which statistically compares variant allele fractions (VAF) in the cancer genome with those in the normal genome.21, 22, 23 Accuracy is primarily dependent on the sequence depth in each genomic region. The other important factor for accurate analysis is considering alignment or mapping error. Taking account of the complexity and redundancy of the human genome, especially non‐coding regions, alignment error can occur with high frequency when short reads are aligned to repetitive and redundant regions.17, 23 The most serious problem of WGS is that its result is dependent on these mutation call algorithm and each pipeline calls different somatic mutations, especially in low‐depth and complicated regions and somatic short indels.23 The ICGC working group23 performed an extensive benchmark exercise of more than 10 analysis pipelines all of the world and evaluated the consistency of the mutation calling methods. Somatic indel calling showed a high level of inconsistency, while SNV and SV calls showed comparative consistency among the pipelines, concluding that somatic mutation calling remains an unsolved problem. The working group proposed guidelines for computational analysis for cancer WGS.23 For germline variants which are related with cancer risk and hereditary cancer diagnosis, another calling pipeline is required because only normal genome sequencing data is analyzed and VAF is around 50% basically. HaplotyperCaller of GATK (https://software.broadinstitute.org/gatk/) is commonly used for germline variant calling, including SNV and indels from WGS.
Figure 2
A representative set of computational tools for cancer whole genome sequencing (WGS) analysis. As an initial step, raw sequence data (90‐150‐Gb ×2: FASTQ files) from next‐generation sequencer (NGS) of cancer genome and normal genome are aligned to the 3‐Gb human reference sequence (3 Gb), producing BAM files. PCR duplication is removed from the BAM file (usually a few percent). Somatic mutations are called by several types of algorithms specific to mutation types (SNV, short indels, CNA, SV and others), comparing variant allele numbers in cancer genomes with those in normal genomes by statistical analysis and creating a list of somatic mutations (VCF files). Germline variant call including SNV and indels is commonly performed from sequencing data of normal genomes using other software, HaplotypeCaller of GATK. SNV, single nucleotide variants; SV, structural variants
A representative set of computational tools for cancer whole genome sequencing (WGS) analysis. As an initial step, raw sequence data (90‐150‐Gb ×2: FASTQ files) from next‐generation sequencer (NGS) of cancer genome and normal genome are aligned to the 3‐Gb human reference sequence (3 Gb), producing BAM files. PCR duplication is removed from the BAM file (usually a few percent). Somatic mutations are called by several types of algorithms specific to mutation types (SNV, short indels, CNA, SV and others), comparing variant allele numbers in cancer genomes with those in normal genomes by statistical analysis and creating a list of somatic mutations (VCF files). Germline variant call including SNV and indels is commonly performed from sequencing data of normal genomes using other software, HaplotypeCaller of GATK. SNV, single nucleotide variants; SV, structural variants
MUTATIONS IN CODING REGIONS AND SPLICING SITES
Whole genome sequencing can detect somatic SNV and short indels (1‐10 bp) in coding regions and splicing sites near the exon‐intron junctions as well as WES. WGS can detect somatic mutations in intronic regions, whose impact is difficult to evaluate and interpret.2, 24 However, combined analysis with RNA‐Seq data can evaluate the impact of deep intronic and synonymous mutations and investigate the transcriptional or functional consequences of the genomic alterations (Figure 3).25, 26 In addition to mutations in exon‐intron junction sites (the GU‐AG consensus sites), mutations in deep intronic regions can generate new splice‐donor or acceptor sites, giving rise to new splicing forms. Synonymous mutations in coding regions and intronic regions can alter exonic motifs that regulate splicing and the functions of cancer‐related genes.27 Systematic combined analysis of WGS and RNA‐Seq is required to interpret these non‐coding mutations.26
Figure 3
Non‐coding mutations and gene expression in whole genome sequencing (WGS) and RNA‐Seq. Intronic mutations can affect splicing forms. Mutations in 5′UTR and promoter regions can alter transcriptional activity, and regulatory elements such as enhancers, silencers or insulators in intergenic regions can affect chromatin structure and transcriptional activity. Mutations in 3′UTR can alter RNA stability and protein translation through changes in miRNA binding and other mechanisms. Mutations in non‐coding RNA, especially miRNA and lincRNA, may change the interaction of coding RNA/proteins and regulatory elements, and alter chromatin structure
Non‐coding mutations and gene expression in whole genome sequencing (WGS) and RNA‐Seq. Intronic mutations can affect splicing forms. Mutations in 5′UTR and promoter regions can alter transcriptional activity, and regulatory elements such as enhancers, silencers or insulators in intergenic regions can affect chromatin structure and transcriptional activity. Mutations in 3′UTR can alter RNA stability and protein translation through changes in miRNA binding and other mechanisms. Mutations in non‐coding RNA, especially miRNA and lincRNA, may change the interaction of coding RNA/proteins and regulatory elements, and alter chromatin structure
MUTATIONS IN NON‐CODING REGIONS
Pre‐mRNA of protein‐coding genes usually contain extensive noncoding sequences, in the form of introns, 5′‐untranslated regions (5′‐UTR) and 3′‐untranslated regions (3′‐UTR) (Figure 3). They are involved with regulation of RNA transcription, splicing and protein translational processes.28 Mutations in 3′‐UTR tend to occur in cancer driver genes29 and are likely to control RNA stability and protein translation through miRNA binding (Figure 3). The human genome contains genes encoding approximately 20 000 non‐coding RNA (ncRNA), including tRNA, ribosomal RNA, microRNA and long non‐coding RNA (lincRNA).30 These functional ncRNA are expected to contribute to chromatin structures, transcription regulation, RNA splicing and the translational machinery.30, 31 Some studies32, 33 report that somatic mutations are accumulated in lincRNA NEAT1 and MALAT1, which are located nearby and are involved with cancer invasions. Furthermore, intergenic regions contain various regulatory element sequences that are crucial to regulate gene expression and associated chromatin structure. Genome‐wide association studies (GWAS) for cancers have identified several hundreds of cancer‐predisposing loci and germline variants, many of which are located in intergenic regions, expecting that they are involved with regulatory elements controlling gene expression around these loci.34 Extrapolations from the ENCODE project,35 Roadmap Epigenomics Consortium36 and FAMTOM30 suggest that 20%‐40% of the human genome could be regulatory elements. Efforts have shifted toward finding interactions between genomic variants and regulatory proteins by ChIP‐Seq, or open chromatin elements (structures), which can indicate where there are active regulatory sequences in a cell type‐specific or cancer‐specific manner.35, 36Whole genome sequencing analysis in melanoma samples revealed hotspot mutations in the TERT promoter, located −124 bp and −146 bp upstream from the translation start ATG site and conferred enhanced TERT promoter activity.37 These promoter mutations were frequently detected in glioblastoma, bladder cancer, thyroid cancer, liver cancer and melanoma, although the strength of association between these TERT promoter mutations and TERT expression is variable among cancer types.38 In a subset of T‐cell ALL, somatic mutations in non‐coding regions introduced binding motifs for the MYB transcriptional factor and created a super‐enhancer upstream of the TAL1 oncogene.39 Recent systematic or statistical analysis for non‐coding somatic mutations using WGS datasets from TCGA and ICGC indicated that several non‐coding regions are frequently mutated, such as the promoters or regulatory elements of PLEKHS1, WDR74, TFPI2 and BCL6.32, 40, 41 There are a lot of CTCF/cohesion‐binding sites (CBS) across the human genome, which function as insulators to regulate gene expression of nearby genes. Multiple cancer types accumulate CBS mutations42 and these mutations may be involved in the generation of double‐strand breaks and SV as well.43 To identify non‐coding mutations and to interpret their impacts and consequences, more systematic approaches are required by integrating many datasets targeting non‐coding regulatory regions, such as ENCODE,34 FANTOM,30 ChIP‐Seq datasets (epigenomic data) and gene expression datasets (RNA‐Seq).
COPY NUMBER ALTERATIONS
Copy number alterations (CNA), affecting large DNA segments (10 kb and more), are one of the most common landmarks of cancer genomes, and lead to activation of oncogenes and inactivation of tumor suppressor genes located in focal CNA. The CNA‐associated oncogenes and tumor suppressors include the focal amplifications of 8q24.21 (MYC), 11q13.3 (CCND1), 7p11.2 (EGFR), 17q12 (ERRB2 = HER2) and 7q31.2 (MET), whereas focal deletion involved 13q14.2 (RB1), 9p21.3 (CDKN2A) and 10q23.31 (PTEN).44, 45 Array CGH and DNA or SNP chip analyses efficiently detect gain or loss of CNA in cancer genomes and we should discuss CNA in a comprehensive cancer genome analysis. The challenge is to identify the oncogene and tumor suppressor gene targets of the driver CNA, which often encompass many genes and elucidate the functional roles of CNA.44, 45 CNA affect not only protein‐coding genes: copy number gains of noncoding regions harboring super‐enhancers near some oncogenes such as KLF5 and MYC are associated with overexpression of these cancer‐related genes.46 Computational tools for WES can detect CNA, but its resolution is not high and sometimes it is difficult to detect specific CNA because exon capturing does not cover many recurrent CNA regions and there is some bias. On the other hand, WGS can analyze CNA in a non‐biased way by counting reads mapped to specific genomic regions in cancer/normal DNA. Even low‐depth WGS (×0.1) can efficiently detect CNA in cancer genomes.47 The theological base of non‐invasive prenatal genetic testing (NIPT) is that CNA of a fetus are detected in plasma of pregnant women,48 and this method of low‐depth WGS is applied to cancerpatients as a liquid biopsy or ctDNA (circulating tumor DNA) analysis. Indeed, NIPT by WGS can detect CNA of cancer in low‐depth WGS of plasma from cancer‐bearing pregnant women.49
STRUCTURE VARIANTS
Distinct rearrangements or SV in leukemia and sarcoma lead to the activation of proto‐oncogene products or creation of cancer‐specific fusion genes, some of which are clinical diagnostic tools, such as STY‐SSX1 fusion in synovial sarcoma and EWS‐FLI1 fusion in Ewing sarcoma.50 Philadelphia chromosome in CML,51 translocation between chromosome 9q34 and 22q11 gives rise to the BCR‐ABL fusion gene, to which the ABL kinase inhibitor imatinib for first time successful targeted in CML. A small inversion on 2p creates the EML4‐ALK fusion gene, which is found in 1%‐2% of lung adenocarcinomas, and kinase inhibitors are targeting such a kinase fusion gene as ALK in lung cancer.52 SV involving ROS1 at 6q22 (mainly translocation) and RET1 at chromosome 10q11.2 (mainly inversion) were also identified in a small percentage of lung cancers with unique clinical and pathological features. They produce fusion kinases as driver genes and are molecular targets for lung cancer53, 54; 40%‐70% of prostate cancers were found to have SV involving ERG at 21q22 and multiple ETS family genes, producing TMPRSS2‐ERG and ETS family gene fusions.55 Recent analysis for medulloblastoma found recurrent SV activated GFIB proto‐oncogene by enhancer hijacking.56 In liver and kidney cancers, the promoter regions of TERT are frequently affected by SV, inducing TERT overexpression.32 SV affects CD274 (PD‐L1) genes in specific types of lymphoma,57 inducing the stability of PD‐L1 and associated with immune escape of cancer cell.
MUTATIONAL ANALYSIS IN REPEAT OR REPETITIVE REGIONS
Repetitive sequences comprise approximately 50% of the human genome. These sequences are highly variable and have been used for genome linkage mapping and diagnosis of cancer with DNA mismatch‐repair deficiency as a microsatellite instability (MSI) test. It is still difficult to analyze mutations and variants in such repetitive regions using WGS and NGS approaches because of alignment issues for short‐read sequences. A recent study analyzed MS mutations in approximately 1000 WGS data across 23 cancer types and identified genes in DNA repair and oncogenic pathways recurrently subject to MSI and uncovered non‐coding loci that frequently display MSI.58 Transposable genetic elements, which are an abundant component in the human genome, can replicate and insert copies of themselves at other locations.23 These transposons played a major role in a driving force for genomic evolution and diversity.59 Several studies have analyzed transposon‐mediated somatic mutations and SV by using cancer WGS data and have identified 4‐5 somatic retrotransposon insertions per tumor.59, 60 These somatic retrotransposon insertions tend to occur in genes that are commonly mutated in cancer and can change their expressions.
PATHOGEN DETECTION AND INTEGRATION
Viral and bacterial infection and following chronic inflammation are the strongest etiological factors of cancer development. Hepatitis B virus (HBV) or hepatitis C virus (HCV) infection are linked to liver cancer. Humanpapillomavirus (HPV) infection initiates and promotes carcinogenesis of the cervix. Helicobactor pylori and Epstein‐Barr virus (EBV) infection are involved with gastric cancer development. Hence, it is important in cancer genomes to detect DNA or RNA sequences derived from known and unknown pathogens (virus and bacteria) leading to chronic inflammation and WGS can detect genomic integrations of pathogens to the host human genome. Technically, unaligned reads to the human genome sequences are extracted and accumulated from WGS or RNA‐Seq data and they can be matched to known pathogen genome sequences with or without pre‐assembling. Especially for tumors in digestive organs, bacteria detection and metagenome analysis of gut flora are important for understand the genome‐environmental interaction in tumor development and therapy resistance, such as Fusobacterium in colorectal cancer61 and Gammaproteobacteria in pancreatic cancer.62 WGS of liver cancers detected several integration sites of the HBV DNA genome (3 kb), which preferentially integrated to the genomic regions of the TERT and MLL4 loci.32, 63 The HPV DNA genome and its integrations are detected by WGS for cervical cancer64 and head and neck cancer. Several rare cancers have a strong viral component, such as EBV in Burkitt lymphoma65 and nasopharyngeal carcinoma, and RNA retrovirus HTLV‐1 in adult T‐cell leukemia/lymphoma.66 Adeno‐associated virus (AAV) is also reported to be integrated in liver cancer genome,32 although its pathogenesis is unclear. These viral integrations/interactions are likely to lead to local genomic instability followed by copy number changes, overexpression around the integration sites, and human‐human or human‐virus gene fusion events, in addition to oncoproteins derived from viral genome.
MUTATIONS IN MITOCHONDRIA GENOME
Whole exome sequencing is not designed to capture the 16‐kb mitochondria genome, which includes 13 protein‐coding genes that are equipped with all the elements necessary for their own protein synthesis.67 The proteins encoded by mitochondrial DNA (mtDNA) genes work with other nuclear genes to form the respiratory chain complexes that are the main energy production system in cells. The involvement of mitochondria in carcinogenesis has long been suspected and altered energy metabolism is a common feature of cancer.68 Some studies have examined mtDNA copy numbers in individual cancer types68 or from a collection of WES data69 and demonstrated that there is selective pressure against deleterious coding mutations in mtDNA, supporting that functional mitochondria are required in tumor cells. These studies also observe a strong mutational strand bias, compatible with endogenous replication‐coupled errors as the major source of mutations. Transmission of mtDNA to the nuclear genome occurs in neoplastically transformed cells and mitochondrial‐nuclear genome fusions occur at a similar rate per base pair of DNA as interchromosomal nuclear SV.70
MUTATIONAL SIGNATURE
Somatic mutations in cancer are the consequence of multiple mutational processes, including the intrinsic infidelity of the DNA replication machinery, exogenous or endogenous mutagen exposures, enzymatic modification of DNA, and defective DNA repair. Different mutational processes generate unique combinations of mutation types, termed “mutational signatures.”12 Each of the mutational signature patterns is associated with each cancer etiology in a tissue‐specific manner. For example, C>A/G>T transversions such as R293S are the most frequent substitutions in the TP53 gene in liver cancer developed through aflatoxin exposure.71 Unlike WES, WGS detects thousands of somatic SNV in common cancers, and recent comprehensive mutational searches and non‐negative matrix factorization mathematical analysis have extracted more than 30 mutational signatures for cancer genomes, which are shown in the COSMIC database (http://cancer.sanger.ac.uk/cosmic/signatures). Researchers have attempted to implicate each of these mutational signatures in biological and epidemiological aspects.12 Among these established mutational signatures, some are established to be associated with specific mutational processes. Signature 1 represents a clock‐like mutational process (aging) and is observed in all types of cancer.72 Signature 24 represents the signature associated with aflatoxin,32, 71 Signature 22 with aristolochic acid (which is contained in Chinese herbal products),73 Signature 4 with smoking exposure,32, 74 Signature 3 with the defect of DNA double‐strand break repair associated with BRCA1/2 mutation, and Signature 6 with DNA‐mismatch repair defects (Figure 4).75 By observing genome‐wide somatic mutational signatures from cancer WGS data, we may presume the etiological factors for individual cancer development among the multiple internal (aging and intrinsic DNA repair) and external (environmental exposure) etiological steps in carcinogenesis.
Figure 4
Mutational signature and etiological factors in COSMIC database. The profile of each signature is displayed using the 6 substitution subtypes: C>A, C>G, C>T, T>A, T>C and T>G. Furthermore, each of the substitutions is examined by incorporating information on the bases immediately 5′ and 3′ to each mutated base, generating 96 possible mutation types. NMF analysis of cancer WGS in the COSMIC database (http://cancer.sanger.ac.uk/cosmic/signatures) demonstrates 30 mutational signatures at present, and 6 representative signatures are shown with their associated etiological factors for cancer development (aging, environmental exposures and defect of intrinsic DNA repair)
Mutational signature and etiological factors in COSMIC database. The profile of each signature is displayed using the 6 substitution subtypes: C>A, C>G, C>T, T>A, T>C and T>G. Furthermore, each of the substitutions is examined by incorporating information on the bases immediately 5′ and 3′ to each mutated base, generating 96 possible mutation types. NMF analysis of cancer WGS in the COSMIC database (http://cancer.sanger.ac.uk/cosmic/signatures) demonstrates 30 mutational signatures at present, and 6 representative signatures are shown with their associated etiological factors for cancer development (aging, environmental exposures and defect of intrinsic DNA repair)
GENOMIC INSTABILITY AND CELL‐OF‐ORIGIN PREDICTION BY MUTATIONAL SIGNATURE
Whole genome sequencing analysis of cancer revealed a distinct signature pattern, termed chromothripsis,76 in which, 1 or a few chromosomes in 1 cell produce dozens to hundreds of clustered SV (Figure 1b). The mechanism for such complicated SV is that at 1 or more carcinogenic stage, distinct chromosomes or genomic regions become fragmented into many segments which are then pieced together inaccurately by DNA repair mechanisms.76 Chromothriptic signatures were detected in cancers arising from patients with inherited p53 mutations,77 suggesting that this event is associated with the functions of p53 and genomic stability in various DNA damage response signaling pathways. WGS analysis for breast, ovarian and pancreatic cancers indicated that signatures were related with inactivation of DNA maintenance genes (BRCA1, BRCA2 and PALB2), and also related with high response to DNA‐damaging agents and PARP inhibitors.75, 78 Recent studies have demonstrated that somatic substitution, insertion/deletion and SV patterns, or “mutational signatures,” are associated with BRCA1/BRCA2 dysfunction. By combining this signature information from WGS, homologous recombination deficiency associated with BRCA1/2 deficiency was evaluated79. Polak et al80 predicted cell‐of‐origin (COO) from WGS data for cancer by comparing the genomic distribution and signature of somatic mutations to 424 epigenetic features that were measured by the Epigenome Roadmap consortium,36 which were derived from 106 different cell types. The genomic distribution of chromatin features corresponding to the tumor's cell type of origin is strongly associated with local mutation density, and they chose the tissue showing the most significant enrichment as the most likely tissue of origin for individual cancer whole genome.
IMMUNO‐GENOMIC ANALYSIS FROM CANCER GENOME SEQUENCING
Immunotherapy using immune checkpoint inhibitors and emerging new therapies have already shown great promise in some types of cancers. Genomic biomarkers have been extensively investigated by genome sequencing analysis on pre‐treated or recurrent cancer specimens. The overexpression or genetic alteration of PD‐L1 (CD274) is likely to be associated with response to anti‐PD‐1/PD‐L1 agents in some types of cancer, such as lymphoma.57, 81 Mutation load at whole genome level, which is associated with greater neo‐antigen presentation, is a good genomic marker in melanoma, lung cancer and MSI‐positive colorectal cancer. Genome‐wide CNA pattern, aneuploidy, was reported to be correlated with reduced infiltrating immune cells with tumor, which was evaluated by RNA‐Seq, and with reduced response to immunotherapy.82 Several mutations involved with IFN‐gamma pathway and HLA presentations such as HLA, B2M, JAK1/2
83, 84 are likely to be related to the resistance of immune check‐point inhibitors, but tumor immunology and mechanisms of immune check‐point inhibitors are quite complicated and diverse. To understand immuno‐genomics of cancer and to explore genomic markers to predict the response of immune therapy, comprehensive immune “signature” analysis, including the quantity and quality of immune cells and neoantigen signatures, is required from WGS and RNA‐Seq data on a number of pre‐treatment and post‐treated tumor specimens and immune cells.
CONCLUSION AND FUTURE DIRECTION OF CANCER WHOLE GENOME SEQUENCING
As sequencing costs continue to decrease and computer resources expand, WGS analysis for cancer genome research and clinical utilities will become more common and more sophisticated. Cancer WGS provides abundant information to understand the biology underlying the cancer genome and the function of unexplored non‐coding regions and SV in the human genome. There is much potential for transcriptional or functional consequences of SV and non‐coding mutations and they should be further explored by integrative analysis of RNA‐Seq and multi‐omics analysis with DNA methylation data,78 protein expression data, and chromatin structure85 or epigenome data to interpret mutational consequences and to understand the biology and immunology of cancer. Taking into account the diversity of cancer genomes and phenotypes, interpretation of the mutational data from cancer WGS will also require the analysis of much more WGS data and integration with multi‐omics data, functional data, immuno‐genomic data and clinic‐pathological data in a larger sample set.
CONFLICT OF INTEREST
The authors have no conflicts of interest to declare.
Authors: Franklin W Huang; Eran Hodis; Mary Jue Xu; Gregory V Kryukov; Lynda Chin; Levi A Garraway Journal: Science Date: 2013-01-24 Impact factor: 47.728
Authors: Alistair R R Forrest; Hideya Kawaji; Michael Rehli; J Kenneth Baillie; Michiel J L de Hoon; Vanja Haberle; Timo Lassmann; Ivan V Kulakovskiy; Marina Lizio; Masayoshi Itoh; Robin Andersson; Christopher J Mungall; Terrence F Meehan; Sebastian Schmeier; Nicolas Bertin; Mette Jørgensen; Emmanuel Dimont; Erik Arner; Christian Schmidl; Ulf Schaefer; Yulia A Medvedeva; Charles Plessy; Morana Vitezic; Jessica Severin; Colin A Semple; Yuri Ishizu; Robert S Young; Margherita Francescatto; Intikhab Alam; Davide Albanese; Gabriel M Altschuler; Takahiro Arakawa; John A C Archer; Peter Arner; Magda Babina; Sarah Rennie; Piotr J Balwierz; Anthony G Beckhouse; Swati Pradhan-Bhatt; Judith A Blake; Antje Blumenthal; Beatrice Bodega; Alessandro Bonetti; James Briggs; Frank Brombacher; A Maxwell Burroughs; Andrea Califano; Carlo V Cannistraci; Daniel Carbajo; Yun Chen; Marco Chierici; Yari Ciani; Hans C Clevers; Emiliano Dalla; Carrie A Davis; Michael Detmar; Alexander D Diehl; Taeko Dohi; Finn Drabløs; Albert S B Edge; Matthias Edinger; Karl Ekwall; Mitsuhiro Endoh; Hideki Enomoto; Michela Fagiolini; Lynsey Fairbairn; Hai Fang; Mary C Farach-Carson; Geoffrey J Faulkner; Alexander V Favorov; Malcolm E Fisher; Martin C Frith; Rie Fujita; Shiro Fukuda; Cesare Furlanello; Masaaki Furino; Jun-ichi Furusawa; Teunis B Geijtenbeek; Andrew P Gibson; Thomas Gingeras; Daniel Goldowitz; Julian Gough; Sven Guhl; Reto Guler; Stefano Gustincich; Thomas J Ha; Masahide Hamaguchi; Mitsuko Hara; Matthias Harbers; Jayson Harshbarger; Akira Hasegawa; Yuki Hasegawa; Takehiro Hashimoto; Meenhard Herlyn; Kelly J Hitchens; Shannan J Ho Sui; Oliver M Hofmann; Ilka Hoof; Furni Hori; Lukasz Huminiecki; Kei Iida; Tomokatsu Ikawa; Boris R Jankovic; Hui Jia; Anagha Joshi; Giuseppe Jurman; Bogumil Kaczkowski; Chieko Kai; Kaoru Kaida; Ai Kaiho; Kazuhiro Kajiyama; Mutsumi Kanamori-Katayama; Artem S Kasianov; Takeya Kasukawa; Shintaro Katayama; Sachi Kato; Shuji Kawaguchi; Hiroshi Kawamoto; Yuki I Kawamura; Tsugumi Kawashima; Judith S Kempfle; Tony J Kenna; Juha Kere; Levon M Khachigian; Toshio Kitamura; S Peter Klinken; Alan J Knox; Miki Kojima; Soichi Kojima; Naoto Kondo; Haruhiko Koseki; Shigeo Koyasu; Sarah Krampitz; Atsutaka Kubosaki; Andrew T Kwon; Jeroen F J Laros; Weonju Lee; Andreas Lennartsson; Kang Li; Berit Lilje; Leonard Lipovich; Alan Mackay-Sim; Ri-ichiroh Manabe; Jessica C Mar; Benoit Marchand; Anthony Mathelier; Niklas Mejhert; Alison Meynert; Yosuke Mizuno; David A de Lima Morais; Hiromasa Morikawa; Mitsuru Morimoto; Kazuyo Moro; Efthymios Motakis; Hozumi Motohashi; Christine L Mummery; Mitsuyoshi Murata; Sayaka Nagao-Sato; Yutaka Nakachi; Fumio Nakahara; Toshiyuki Nakamura; Yukio Nakamura; Kenichi Nakazato; Erik van Nimwegen; Noriko Ninomiya; Hiromi Nishiyori; Shohei Noma; Shohei Noma; Tadasuke Noazaki; Soichi Ogishima; Naganari Ohkura; Hiroko Ohimiya; Hiroshi Ohno; Mitsuhiro Ohshima; Mariko Okada-Hatakeyama; Yasushi Okazaki; Valerio Orlando; Dmitry A Ovchinnikov; Arnab Pain; Robert Passier; Margaret Patrikakis; Helena Persson; Silvano Piazza; James G D Prendergast; Owen J L Rackham; Jordan A Ramilowski; Mamoon Rashid; Timothy Ravasi; Patrizia Rizzu; Marco Roncador; Sugata Roy; Morten B Rye; Eri Saijyo; Antti Sajantila; Akiko Saka; Shimon Sakaguchi; Mizuho Sakai; Hiroki Sato; Suzana Savvi; Alka Saxena; Claudio Schneider; Erik A Schultes; Gundula G Schulze-Tanzil; Anita Schwegmann; Thierry Sengstag; Guojun Sheng; Hisashi Shimoji; Yishai Shimoni; Jay W Shin; Christophe Simon; Daisuke Sugiyama; Takaai Sugiyama; Masanori Suzuki; Naoko Suzuki; Rolf K Swoboda; Peter A C 't Hoen; Michihira Tagami; Naoko Takahashi; Jun Takai; Hiroshi Tanaka; Hideki Tatsukawa; Zuotian Tatum; Mark Thompson; Hiroo Toyodo; Tetsuro Toyoda; Elvind Valen; Marc van de Wetering; Linda M van den Berg; Roberto Verado; Dipti Vijayan; Ilya E Vorontsov; Wyeth W Wasserman; Shoko Watanabe; Christine A Wells; Louise N Winteringham; Ernst Wolvetang; Emily J Wood; Yoko Yamaguchi; Masayuki Yamamoto; Misako Yoneda; Yohei Yonekura; Shigehiro Yoshida; Susan E Zabierowski; Peter G Zhang; Xiaobei Zhao; Silvia Zucchelli; Kim M Summers; Harukazu Suzuki; Carsten O Daub; Jun Kawai; Peter Heutink; Winston Hide; Tom C Freeman; Boris Lenhard; Vladimir B Bajic; Martin S Taylor; Vsevolod J Makeev; Albin Sandelin; David A Hume; Piero Carninci; Yoshihide Hayashizaki Journal: Nature Date: 2014-03-27 Impact factor: 49.962
Authors: Hui Y Xiong; Babak Alipanahi; Leo J Lee; Hannes Bretschneider; Daniele Merico; Ryan K C Yuen; Yimin Hua; Serge Gueroussov; Hamed S Najafabadi; Timothy R Hughes; Quaid Morris; Yoseph Barash; Adrian R Krainer; Nebojsa Jojic; Stephen W Scherer; Benjamin J Blencowe; Brendan J Frey Journal: Science Date: 2014-12-18 Impact factor: 47.728
Authors: Ludmil B Alexandrov; Philip H Jones; David C Wedge; Julian E Sale; Peter J Campbell; Serena Nik-Zainal; Michael R Stratton Journal: Nat Genet Date: 2015-11-09 Impact factor: 38.330
Authors: Alessandro Petrini; Marco Mesiti; Max Schubach; Marco Frasca; Daniel Danis; Matteo Re; Giuliano Grossi; Luca Cappelletti; Tiziana Castrignanò; Peter N Robinson; Giorgio Valentini Journal: Gigascience Date: 2020-05-01 Impact factor: 6.524
Authors: Teri A Manolio; Robb Rowley; Marc S Williams; Dan Roden; Geoffrey S Ginsburg; Carol Bult; Rex L Chisholm; Patricia A Deverka; Howard L McLeod; George A Mensah; Mary V Relling; Laura Lyman Rodriguez; Cecelia Tamburro; Eric D Green Journal: Lancet Date: 2019-08-05 Impact factor: 79.321
Authors: Anna V Kudryavtseva; Elena N Lukyanova; Dmitry V Kalinin; Andrew R Zaretsky; Anatoly V Pokrovsky; Alexander L Golovyuk; Maria S Fedorova; Elena A Pudova; Sergey L Kharitonov; Vladislav S Pavlov; Anastasiya A Kobelyatskaya; Nataliya V Melnikova; Alexey A Dmitriev; Andrey P Polyakov; Boris Y Alekseev; Marina V Kiseleva; Andrey D Kaprin; George S Krasnov; Anastasiya V Snezhkina Journal: BMC Med Genomics Date: 2019-03-13 Impact factor: 3.063
Authors: Michael Zhong; Anneke van der Walt; Maria Pia Campagna; Jim Stankovich; Helmut Butzkueven; Vilija Jokubaitis Journal: Neurotherapeutics Date: 2020-10-14 Impact factor: 7.620
Authors: Dan M Roden; Howard L McLeod; Mary V Relling; Marc S Williams; George A Mensah; Josh F Peterson; Sara L Van Driest Journal: Lancet Date: 2019-08-05 Impact factor: 79.321