Ken Asada1,2, Syuzo Kaneko1,2, Ken Takasawa1,2, Hidenori Machino1,2, Satoshi Takahashi1,2, Norio Shinkai1,2,3, Ryo Shimoyama1,2, Masaaki Komatsu1,2, Ryuji Hamamoto1,2,3.
Abstract
With the completion of the International Human Genome Project, we have entered what is known as the post-genome era, and efforts to apply genomic information to medicine have become more active. In particular, since U.S. President Barack Obama announced the Precision Medicine Initiative in his State of the Union address at the beginning of 2015, "precision medicine," which aims to divide patients and potential patients into subgroups with respect to disease susceptibility, has become the focus of worldwide attention. The field of oncology is also actively adopting the precision oncology approach, which selects the appropriate treatment based on molecular profiling, such as genomic information. However, current precision oncology is dominated by the targeted-gene panel (TGP) method, which uses next-generation sequencing (NGS) to analyze a limited number of specific cancer-related genes and suggest optimal treatments; as a result, the number of patients who benefit from it is limited. To steadily develop precision oncology, it is necessary to integrate and analyze more detailed omics data, such as whole genome data and epigenome data. At the same time, with the advancement of analysis technologies such as NGS, the amount of data obtained by omics analysis has become enormous, and artificial intelligence (AI) technologies, mainly machine learning (ML) technologies, are being actively used to make more efficient and accurate predictions. In this review, we focus on whole genome sequencing (WGS) analysis and epigenome analysis, introduce the latest results of omics analysis using ML technologies for the development of precision oncology, and discuss future prospects.
Keywords: artificial intelligence; biomarker discovery; cancer diagnosis and treatment; epigenome analysis; machine learning; precision oncology; whole genome analysis
Year: 2021 PMID: 34055633 PMCID: PMC8149908 DOI: 10.3389/fonc.2021.666937
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 6.244
Figure 1. Summary of chromatin structure and epigenomic analysis methods. ChIP-seq, ATAC-seq, and Hi-C can be used to predict transcriptional activation or inactivation states and chromatin structure. Image credit: Shutterstock.com/ellepigrafica.
Overview of whole genome analysis using machine learning.
| Features | Pipeline name | Brief summary | Reference |
|---|---|---|---|
| Peak calling, mutational signature, or | HipSTR (Haplotype inference and phasing for short tandem repeats) | This method identifies | Nat. Methods (2017) |
| | BayesTyper | This method performs genotyping of all types of variation (including SNPs, indels and complex structural variants) based on an input set of variants and read k-mer counts. | Nature (2017) |
| | Genomiser | This method identifies pathogenic regulatory variants in non-coding regions. | Am. J. Hum. Genet (2016) |
| | DeepVariant | This is a universal SNP and small-indel variant caller using deep neural networks, highlighting the benefits of using automated and generalizable techniques for variant calling. | Nat. Biotechnol (2018) |
| | ARC (Artifact Removal by Classifier) | This is a supervised random forest model designed to distinguish true rare | Cell (2019) |
| | N/A | This method addresses the challenge of detecting the contribution of non-coding variants to disease using a deep learning-based framework that predicts the specific regulatory and detrimental effects of genetic variants. | Nat. Genet (2019) |
| | NeuSomatic | This is a convolutional neural network for somatic mutation detection. | Nat. Commun (2019) |
| Genome graph | Graphtyper | This is an algorithm and software for discovering and genotyping sequence variation. It realigns short-read sequence data to a pangenome, a variation-aware graph structure that encodes sequence variation in a population by representing possible haplotypes as graph paths. | Nat. Genet (2017) |
| | N/A | The results of the missing mutations are added to a structure that can be described as a mathematical graph, the genome graph. Compared to the existing reference genome map (GRCh38), the genome graph can significantly improve the percentage of reads that map uniquely and completely. | bioRxiv (2017) |
| | GenGraph | This provides a set of tools for generating graph-based representations of sets of sequences. | BMC Bioinformatics (2019) |
| | N/A | This is an SV caller that uses genome graphs; it was used to analyze cancer somatic DNA rearrangements and revealed three novel complex rearrangement phenomena. | Cell (2020) |
| Heterogeneity | PyClone | This is a Bayesian clustering method for grouping sets of deeply sequenced somatic mutations into putative clonal clusters while estimating their cellular prevalences and accounting for allelic imbalances introduced by segmental copy-number changes and normal-cell contamination. | Nat. Methods (2014) |
| | MOBSTER | This is an approach for model-based tumor subclonal reconstructions. Cancer genomic data are generated from bulk samples composed of mixtures of cancer subpopulations, as well as normal cells. Subclonal reconstruction methods based on machine learning aim to separate those subpopulations in a sample and infer their evolutionary history. | Nat. Genet (2020) |
| | DigiPico/MutLX | This method is a powerful framework for the identification of clone-specific variants with high accuracy. | eLife (2020) |
| Mutational signature | SigMA (signature multivariate analysis) | This provides an accurate identification of mutational signatures with a likelihood approach, even when the mutation count is very small. | Nat. Genet (2019) |
| | DeepMS (deep learning of mutational signature) | This is a regression-based model to estimate the correlation between signatures and clinical and demographical phenotypes in order to identify mutational signatures. | Oncogene (2020) |
| | SigLASSO | This method performs efficient cancer mutation signature analysis by accounting for sampling uncertainty, and also improves performance by allowing knowledge transfer through cooperative fitting of linear mixtures and maximizing sampling likelihood. | Nat. Commun (2020) |
| GWAS | COMBI | This is a two-step algorithm that trains a support vector machine to determine candidate SNPs and then performs hypothesis testing on these SNPs. | Sci Rep (2016) |
| | DeepWAS | This integrates regulatory effect predictions of single variants into a multivariate GWAS setting and provides evidence that DeepWAS results directly identify disease/trait-associated SNPs with a common effect on a specific chromatin feature. | PLoS Comput. Biol (2019) |
| | Promoter-CNN + | This is a DL-based approach for genotype-phenotype association studies to predict the occurrence of ALS from individual genotype data. A two-step approach is employed: (1) promoter regions that are likely associated with ALS are identified and (2) individuals are classified based on their genotype in the selected genomic regions. | Bioinformatics (2019) |
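
The PyClone entry in the table refers to estimating cellular prevalence while accounting for copy-number changes and normal-cell contamination. The sketch below illustrates only the point-estimate relationship commonly used for that correction; it is not PyClone's Bayesian clustering model. All function and parameter names (`expected_vaf`, `cancer_cell_fraction`, `purity`, `tumor_cn`, `multiplicity`, `normal_cn`) are hypothetical, and the sketch assumes a single copy-number state per locus and a diploid normal genome.

```python
def _check(purity, tumor_cn, multiplicity):
    """Validate inputs shared by both helpers (hypothetical names)."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if tumor_cn <= 0 or multiplicity <= 0:
        raise ValueError("tumor_cn and multiplicity must be positive")


def expected_vaf(ccf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Expected variant allele frequency of a somatic mutation.

    ccf          fraction of cancer cells carrying the mutation
    purity       fraction of cells in the sample that are cancer cells
    tumor_cn     total copy number at the locus in cancer cells
    multiplicity mutated copies per mutated cancer cell
    normal_cn    copy number in contaminating normal cells (assumed 2)
    """
    _check(purity, tumor_cn, multiplicity)
    mutant_alleles = purity * ccf * multiplicity
    total_alleles = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutant_alleles / total_alleles


def cancer_cell_fraction(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Invert expected_vaf to get a point estimate of the cancer cell fraction."""
    _check(purity, tumor_cn, multiplicity)
    total_alleles = purity * tumor_cn + (1.0 - purity) * normal_cn
    return vaf * total_alleles / (purity * multiplicity)


# A heterozygous clonal mutation in a diploid region of a 50%-pure sample
# is expected at VAF 0.25, which maps back to a cancer cell fraction of 1.0.
print(cancer_cell_fraction(0.25, purity=0.5, tumor_cn=2))
```

Noisy read counts can push this point estimate above 1, which is one reason probabilistic methods such as those in the table model the read counts directly rather than inverting the formula.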
Figure 2. Comparison between a typical enhancer and a super-enhancer. According to reference 87, super-enhancers are observed in the transcriptional regulatory regions of oncogenes such as MYC in cancer cells, but not in their counterparts in normal tissues. E, enhancer; TF, transcription factor; Med, Mediator complex; RNA pol II, RNA polymerase II. Image credit: Shutterstock.com/ellepigrafica.
Epigenetic analysis typically focusing on regulatory regions.
| Features | Pipeline name | Brief summary | Reference |
|---|---|---|---|
| Epigenomic Atlas (chromatin marks/chromatin states, DHSs, active enhancers) | N/A | Maps nine chromatin marks across nine cell types. Systematically characterizes regulatory elements, cell-type specificities, and functional interactions. Defines multicell activity profiles for chromatin state, gene expression, regulatory motif enrichment, and regulator expression. Assigns candidate regulatory functions to disease-associated variants from GWAS. | Nature (2011) |
| | N/A | Presents an extensive map of human DNase I hypersensitive sites (DHSs) identified through genome-wide profiling in 125 diverse cell and tissue types. The map shows relationships between chromatin accessibility, transcription, DNA methylation, and mutation rate in regulatory DNA. | Nature (2012) |
| | N/A | The bidirectional capped RNAs measured by cap analysis of gene expression (CAGE) are robust predictors of enhancer activity. Enhancers share properties with CpG-poor messenger RNA promoters but produce bidirectional, exosome-sensitive, relatively short unspliced RNAs. The generation of RNA is strongly related to enhancer activity. | Nature (2014) |
| Regulatory sequence/Network identify (enhancer/promoter/EPI, | ELMER (Enhancer Linking by Methylation/Expression Relationships) | This uses methylation and expression data to identify cancer-specific regulatory transcription factors, detect enhancer-gene promoter pairs, and correlate enhancer status with expression of neighboring genes. | Genome Biol (2015) |
| | JEME (joint effect of multiple enhancers) | This method infers enhancer-target networks in two steps: identifying enhancers that regulate transcription start sites (TSSs) across all samples, and detecting enhancers that regulate TSSs in a particular sample, to determine the target genes of transcriptional enhancers in a particular cell or tissue. | Nat. Genet (2017) |
| | FOCS (FDR-corrected OLS with Cross-validation and Shrinkage) | This method estimates the link between enhancers and promoters based on the correlation of activity patterns between samples and implements a leave-cell-type-out cross-validation (LCTO CV) procedure to avoid overfitting of the regression model to the training samples. The cross-validation scheme consists of training on a set of samples and evaluating on left-out samples from other cell types. This also provides extensive enhancer–promoter maps from ENCODE, Roadmap Epigenomics, FANTOM5, and a new compendium of GRO-seq samples. | Genome Biol (2018) |
| | SPEID (Sequence-based Promoter-Enhancer Interaction with Deep learning; pronounced “speed”) | This method predicts enhancer-promoter interactions (EPIs) using DL models from genomic sequences, using only the location of enhancers and promoters in specific cell types. Using a melanoma dataset, this shows that there is potential to identify somatic non-coding mutations that reduce or interrupt important EPIs. | Quant. Biol (2019) |
| | EP2vec | This method uses natural language processing to predict enhancer-promoter interactions. It extracts sequence-embedded features (fixed-length vector representations) using an unsupervised DL model, the paragraph vector, and uses the extracted features to train a classifier that predicts the interaction by supervised learning. This can also merge sequence-embedded features with experimental features for more accurate prediction. | BMC Genomics (2018) |
| Inference of the 3D structure of chromatin | Transcriptional decomposition | This separates RNA expression into a positionally dependent (PD) component and positionally independent (PI) effects using the transcriptional decomposition method, showing the predictability of fine-scale chromatin interactions, chromosomal positioning, and three-dimensional chromatin architecture. | Nat. Commun (2018) |
| | CHINN (Chromatin Interaction Neural Network) | This predicts chromatin interactions between open chromatin regions from DNA sequence and distance using a convolutional neural network. It also extracts sequence features and feeds them into classifiers. | bioRxiv (2019) |
| | HiC-Reg | This method uses one-dimensional regulatory signals (chromatin marks, architecture, transcription factor proteins, and chromatin accessibility) and the published Hi-C dataset as training count data to predict cell line-specific contact counts. A random forest regression model is used as the main prediction algorithm. | Nat. Commun (2019) |
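
The FOCS entry in the table describes a leave-cell-type-out cross-validation (LCTO CV) scheme: the model is fitted on samples from some cell types and evaluated on samples from a cell type it has never seen. The sketch below illustrates only that hold-out scheme, using a single-enhancer ordinary least squares fit as a stand-in model. It does not reproduce FOCS itself, which also applies FDR correction and shrinkage. The function names (`fit_ols`, `leave_cell_type_out_cv`) and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_ols(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # No variation in the predictor: fall back to predicting the mean.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_cell_type_out_cv(samples):
    """Hold out each cell type in turn.

    samples: iterable of (cell_type, enhancer_activity, promoter_activity).
    Returns {cell_type: mean squared error on that held-out cell type}.
    """
    by_type = defaultdict(list)
    for cell_type, x, y in samples:
        by_type[cell_type].append((x, y))
    if len(by_type) < 2:
        raise ValueError("need samples from at least two cell types")

    errors = {}
    for held_out, test in by_type.items():
        # Train only on samples whose cell type differs from the held-out one.
        train = [pt for ct, pts in by_type.items() if ct != held_out for pt in pts]
        intercept, slope = fit_ols([x for x, _ in train], [y for _, y in train])
        errors[held_out] = sum(
            (y - (intercept + slope * x)) ** 2 for x, y in test
        ) / len(test)
    return errors


# Toy data (hypothetical): promoter activity = 2 * enhancer activity + 1.
toy = [
    ("T cell", 0.0, 1.0), ("T cell", 1.0, 3.0),
    ("B cell", 2.0, 5.0), ("B cell", 3.0, 7.0),
    ("hepatocyte", 4.0, 9.0), ("hepatocyte", 5.0, 11.0),
]
print(leave_cell_type_out_cv(toy))
```

Splitting by cell type rather than by individual sample keeps replicates of the same cell type from appearing on both sides of the split, which would otherwise make the held-out error look better than it is for an unseen cell type.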