
Multi-omics Data Integration Model Based on UMAP Embedding and Convolutional Neural Network.

Bashier ElKarami¹, Abedalrhman Alkhateeb², Hazem Qattous², Lujain Alshomali², Behnam Shahrrava¹.

Abstract

Introduction: Multi-omics data integration provides a richer understanding than analyzing each omics data type separately. Various promising integrative approaches have been used to analyze multi-omics data for biomedical applications, including disease prediction, disease subtyping, biomarker prediction, and others.
Methods: In this paper, we introduce a multi-omics data integration method that is constructed using the combination of gene similarity network (GSN) based on uniform manifold approximation and projection (UMAP) and convolutional neural networks (CNNs). The method utilizes UMAP to embed gene expression, DNA methylation, and copy number alteration (CNA) to a lower dimension creating two-dimensional RGB images. Gene expression is used as a reference to construct the GSN and then integrate other omics data with the gene expression for better prediction. We used CNNs to predict the Gleason score levels of prostate cancer patients and the tumor stage in breast cancer patients.
Results: The proposed model achieved near-perfect performance, with accuracy above 99% and all other performance measurements at the same level. The proposed model outperformed the state-of-the-art iSOM-GSN model, which constructs the GSN map using a self-organizing map.
Conclusion: The results show that UMAP as an embedding technique can better integrate multi-omics maps into the prediction model than SOM. The proposed model can also be applied to build a multi-omics prediction model for other types of cancer.


Keywords: Multi-omics data integration; UMAP; cancer; data embedding; deep learning

Year: 2022 PMID: 36187912 PMCID: PMC9523837 DOI: 10.1177/11769351221124205

Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351

Introduction

Advances in next-generation sequencing (NGS) technology have exponentially increased the amount of available biological data. NGS methods cost-effectively generate enormous volumes of omics data, including genomic, transcriptomic, proteomic, and epigenomic data. These data deliver a more comprehensive understanding of the various properties of genes, proteins, and other biomolecules. Multi-omics data provide multiple views, with different feature sets, of the same patients. Therefore, it is necessary to develop new approaches that can integrate and analyze large-scale multi-omics data, and machine learning plays a vital role in this task.[1,2] Arjmand et al discussed the importance of machine learning for multi-omics data integration in the prognosis, diagnosis, and treatment of cancer. Cai et al listed the technologies used to measure each omic, the resulting data formats, and the corresponding analytical strategies. These strategies may integrate the data at an early stage (data concatenation), a middle stage, or a late stage of the prediction model. While concatenation treats the heterogeneous omics data as if they were alike, the late-stage strategy treats each omic independently. In the middle-stage strategy, the model first finds the relationships among the omic features and then integrates them into one prediction model that learns from the associations between the extracted relationships to mimic the actual biological associations. Some methods apply clustering based on a joint latent variable model to integrate multi-omics data and categorize cancer data into subtypes.[3,4] In this work, we propose using an embedding technique to merge various omics in the prediction model. The aim is to find the global associations among discriminative features from the various omics data that work together in the cause or progression of disease.
Data embedding techniques are used to map data into lower-dimensional spaces.[5-8] Kohonen proposed an artificial neural network named the self-organizing map (SOM) to find spatially organized internal representations of higher-dimensional features. The topology of a SOM must be determined in advance. Hinton and Roweis proposed stochastic neighbor embedding (SNE) to place the data points of a sample in a two- or three-dimensional map. SNE struggles to capture the local structure of the data in a map, a problem later addressed by the t-distributed variant known as t-SNE. UMAP is another dimensionality reduction and visualization method, built on mathematical foundations related to Laplacian eigenmaps. UMAP preserves more of the global structure than t-SNE, with superior run-time performance. Argelaguet et al proposed multi-omics factor analysis (MOFA), a Bayesian model that factorizes omics data to extract the principal sources of variation in multi-omics data sets. Koh et al introduced iOmicsPASS, a network-based multi-omics integration method that provides a supervised evaluation of quantitative multi-omics data to calculate biological interaction scores. A shrunken gene-centroid algorithm is applied to the scores to discover predictive sub-networks for phenotypic groups. Chalise and Fridley introduced a clustering integration approach named intNMF for multi-omics data integration. The approach uses non-negative matrix factorization (NMF) to classify disease subtypes in datasets consisting of DNA methylation, mRNA gene expression, and protein expression. It clusters multiple high-dimensional molecular data types jointly, without the need for distributional assumptions. Meng et al proposed a multivariate integration method called multiple co-inertia analysis (MCIA).
The method applies a covariance optimization criterion to detect interactions and variations between multiple datasets by projecting them into the same dimensional space. In another study, Lyu and Haque embedded high-dimensional RNA-Seq data into a two-dimensional map to classify tumor types with a convolutional neural network. A heatmap was created for all the genes to confirm the association of the top genes with tumor-specific pathways. Fatima and Rueda introduced iSOM-GSN, which integrates multi-omics data (gene expression, gene CNA, and gene methylation) by embedding the high-dimensional multi-omics data into a lower, two-dimensional grid. The embedding applies Kohonen’s self-organizing map (SOM) to gene expression data and merges the result with other genomic features to enhance visualization and performance. Then, a convolutional neural network is applied to classify disease types and status. In addition, Alkhateeb et al introduced a deep learning model that predicts 5-year interval survival of breast cancer InClust5 by integrating multi-omics data consisting of gene expression, copy number alteration (CNA), and clinical feature datasets. It expands the iSOM-GSN model by using a SOM to embed each omics data set into a lower, two-dimensional relational map instead of relying only on the gene expression map, as iSOM-GSN does. Three convolutional neural networks are used, one to classify each map. The outputs of the CNNs are fed to an integration layer that uses majority voting to produce the model’s prediction. Another SOM-based work, named SOMatic, was introduced by Jansen et al. SOMatic builds a gene regulatory network by integrating scRNA-seq and scATAC-seq data: it assembles a self-organizing map (SOM) for each dataset to distinguish genes and chromatin regions that might alter over time.
Then, k-means clustering groups the data of the 2 SOMs into meta-clusters to connect similar genes and their corresponding genomic regions. In another study, Zhou et al combined t-distributed stochastic neighbor embedding (t-SNE) with a residual neural network (ResNet) to integrate multi-omics data, including gene expression, copy number alteration (CNA), and mRNA, for Nottingham Prognostic Index (NPI) prediction in a cohort of breast cancer patients. t-SNE was applied separately to each omics data set, and the resulting maps were concatenated before being fed to the ResNet. This paper designs a GSN via UMAP to integrate multi-omics data for the prediction of disease states. First, we apply UMAP to the gene expression omics to embed it into a lower dimension and create a template map; the other omics data are then projected onto this template to enhance performance. Then, all feature maps are fed to a convolutional neural network for disease classification.

Materials and Methods

Datasets

In this work, 2 cancer data sets are investigated: TCGA Prostate Adenocarcinoma (PRCA), used to classify patients by Gleason score, and TCGA Breast Invasive Carcinoma (BRCA), used to classify tumor stage. Both data sets contain 3 omics: gene expression, DNA methylation, and copy number alteration (CNA). The total number of samples for PRCA and BRCA is 499 and 570, respectively. The PRCA samples are divided into 3 classes: 3 + 4, 4 + 3, and a combined class of 4 + 5 and 5 + 4, which are merged because of the low number of samples in these advanced classes. The BRCA samples are divided into 3 classes: 2A, 2B, and 3A. We only considered samples with all 3 omics, which reduced the number of samples to 387 and 392 patients for PRCA and BRCA, respectively. The distributions of samples in both datasets are listed in Table 1.

Table 1.

The distribution of samples in the 2 datasets: the Gleason score classes in the PRCA dataset and the tumor stage classes in the BRCA dataset.

The PRCA data set		The BRCA dataset
Number of samples	Gleason score class	Number of samples	Tumor stage
147	3 + 4	179	2A
101	4 + 3	129	2B
139	4 + 5 and 5 + 4	84	3A


Pre-processing

We adopted the preprocessing steps of Fatima and Rueda. First, the gene expression features were filtered to eliminate all those with less than 0.2% variance, which reduced the number of gene expression features from about 39 000 to 16 000. Then, all 3 omics data were normalized on an average scale, and genes not listed in HUGO format were eliminated. The last step was to identify significantly mutated genes with the MutsigCV algorithm, which calculates false-discovery rates (q-values); genes with q ⩽ 0.1 were identified as significantly mutated. From the MutsigCV output, 14 mutated genes were selected for this study. These genes are listed in Table 2.
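The two filtering steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the gene names and values are hypothetical toy data, the 0.2% variance cutoff is read here as an absolute variance of 0.002 (the text does not say how the percentage is defined), and the q-values are assumed to have been computed beforehand by MutsigCV.

```python
from statistics import pvariance

def filter_low_variance(features, min_variance):
    """Keep features whose population variance is at least min_variance.

    features maps a feature name to its list of values across samples.
    """
    return {name: values for name, values in features.items()
            if pvariance(values) >= min_variance}

def significantly_mutated(q_values, q_cutoff=0.1):
    """Return gene names with q <= q_cutoff, ordered by increasing q-value."""
    hits = [(q, gene) for gene, q in q_values.items() if q <= q_cutoff]
    return [gene for q, gene in sorted(hits)]

# Hypothetical toy data, for illustration only.
expression = {
    "GENE_A": [0.10, 0.10, 0.10, 0.10],  # constant across samples
    "GENE_B": [0.00, 1.00, 0.00, 1.00],  # clearly variable
    "GENE_C": [0.50, 0.52, 0.49, 0.51],  # nearly constant
}
# Assumption: "0.2% variance" is interpreted as a variance threshold of 0.002.
kept = filter_low_variance(expression, min_variance=0.002)

# Hypothetical q-values standing in for MutsigCV output.
q_vals = {"GENE_X": 0.03, "GENE_Y": 0.40, "GENE_Z": 0.10}
selected = significantly_mutated(q_vals)

print(sorted(kept), selected)
```

In this toy run only GENE_B survives the variance filter, and GENE_X and GENE_Z pass the q ⩽ 0.1 cutoff.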

Table 2.

The top 14 selected genes using MutsigCV in the PRCA dataset.

Gene	Gene Description
SPOP	SPOP (Speckle Type BTB/POZ Protein) is a Protein Coding gene.
FOXA1	FOXA1 (Forkhead Box A1) is a Protein Coding gene.
CTNNB1	CTNNB1 (Catenin Beta 1) is a Protein Coding gene.
CLPTM1L	CLPTM1L (Cleft Lip And Palate Transmembrane Protein 1-Like Protein) is a Protein Coding gene.
DPYSL2	DPYSL2 (Dihydropyrimidinase Like 2) is a Protein Coding gene.
NEIL1	NEIL1 (Nei Like DNA Glycosylase 1) is a Protein Coding gene.
PITPNM2	PITPNM2 (Phosphatidylinositol Transfer Protein Membrane Associated 2) is a Protein Coding gene.
ATM	ATM (ATM Serine/Threonine Kinase) is a Protein Coding gene on chromosome 11q; germline mutation of ATM causes ataxia-telangiectasia (A-T), a recessive disorder.
EMG1	EMG1 (EMG1 N1-Specific Pseudouridine Methyltransferase) is a Protein Coding gene.
ETV3	ETV3 (ETS Variant Transcription Factor 3) is a Protein Coding gene.
BRAF	BRAF (B-Raf Proto-Oncogene, Serine/Threonine Kinase) is a Protein Coding gene.
NKX3-1	NKX3-1 (NK3 Homeobox 1) is a Protein Coding gene.
ZMYM3	ZMYM3 (Zinc Finger MYM-Type Containing 3) is a Protein Coding gene.
SALL1	SALL1 (Spalt Like Transcription Factor 1) is a Protein Coding gene.


Proposed method

The workflow of our method is illustrated in Figure 1. It starts by generating a gene similarity network (GSN) via UMAP on the gene expression omics, which converts the high-dimensional gene expression data to a two-dimensional map and creates a feature template. Then, the template is used to integrate all omics data, so that each sample is depicted as a colored image filled in with all omics data. Finally, those images are fed to a CNN for classification.

Figure 1.

Workflow of the proposed method.

Uniform manifold approximation and projection (UMAP)

Uniform manifold approximation and projection is a dimension reduction method that employs Riemannian geometry and algebraic topology. UMAP uses the high-dimensional data to build a fuzzy weighted graph that describes the likelihood of a connection between each pair of data points. Then, this graph is mapped into a lower dimensionality, creating a fuzzy graph similar to the high-dimensional graph in order to preserve the local structure. UMAP assumes that the data are uniformly distributed on a Riemannian manifold, that the Riemannian metric is locally constant (or can be approximated as such), and that the manifold is locally connected. UMAP uses a radius to connect each data point with the neighbors that fall within that radius. The radius is a critical aspect of UMAP: a small radius might cause points to form small, isolated clusters, while a large radius might produce clusters that are too large. UMAP overcomes this issue by selecting a local radius based on the distance between each point and its nearest neighbor. UMAP operates by building the weighted k-neighborhood graph and then computing the low-dimensional layout of this graph.

Weighted K-neighborhood Graph

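Assume the input dataset $X = \{x_1, \ldots, x_N\}$ with dissimilarity metric $d(x_i, x_j)$ for each pair of points and an input hyperparameter $k$. For each $x_i$, we will compute $\rho_i$ and $\sigma_i$ as follows:

$$\rho_i = \min\{\, d(x_i, x_{i_j}) \mid 1 \le j \le k,\ d(x_i, x_{i_j}) > 0 \,\}$$

$\sigma_i$ is set to satisfy the following:

$$\sum_{j=1}^{k} \exp\left(\frac{-\max\left(0,\ d(x_i, x_{i_j}) - \rho_i\right)}{\sigma_i}\right) = \log_2(k)$$

Where: $\{x_{i_1}, \ldots, x_{i_k}\}$ is the set of $k$-nearest neighbors for each point $x_i$.

Now, we compute the weighted directed graph $\bar{D} = (V, E, w)$, where the vertices $V$ of $\bar{D}$ are the dataset $X$, by constructing the set of directed edges $E = \{(x_i, x_{i_j}) \mid 1 \le j \le k,\ 1 \le i \le N\}$. The weight function $w$ is computed as follows:

$$w\big((x_i, x_{i_j})\big) = \exp\left(\frac{-\max\left(0,\ d(x_i, x_{i_j}) - \rho_i\right)}{\sigma_i}\right)$$

The adjacency matrix $B$ of the undirected weighted graph can be computed as:

$$B = A + A^{\top} - A \circ A^{\top}$$

Where: $A$ is the weighted adjacency matrix of $\bar{D}$, and “∘” is the Hadamard product.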
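The graph construction described in this subsection can be sketched in plain Python as follows. This is a minimal illustration, not the reference UMAP implementation: it assumes Euclidean distance as the dissimilarity metric, finds each normalization factor by simple bisection so that the neighbor weights sum to log2(k), and uses five hypothetical 2D points in place of gene expression profiles. The function name `fuzzy_knn_graph` is made up for this example.

```python
import math

def fuzzy_knn_graph(points, k, n_iter=64):
    """Return (A, B): directed neighbor weights A and symmetrized weights B."""
    n = len(points)
    target = math.log2(k)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # k nearest neighbors of point i as (distance, index), nearest first.
        nbrs = sorted((math.dist(points[i], points[j]), j)
                      for j in range(n) if j != i)[:k]
        # rho: smallest positive distance from point i to one of its neighbors.
        rho = min(d for d, _ in nbrs if d > 0)
        # sigma: bisection so that the k neighbor weights sum to log2(k).
        lo, hi = 1e-12, 1e3
        for _ in range(n_iter):
            sigma = (lo + hi) / 2
            total = sum(math.exp(-max(0.0, d - rho) / sigma) for d, _ in nbrs)
            if total > target:
                hi = sigma  # weights too large, so shrink sigma
            else:
                lo = sigma
        sigma = (lo + hi) / 2
        for d, j in nbrs:
            A[i][j] = math.exp(-max(0.0, d - rho) / sigma)
    # Symmetrize: B = A + A^T - (A times A^T element-wise, the Hadamard product).
    B = [[A[i][j] + A[j][i] - A[i][j] * A[j][i] for j in range(n)]
         for i in range(n)]
    return A, B

# Hypothetical toy points; real inputs would be high-dimensional gene profiles.
points = [(0, 0), (1, 0), (0, 2), (3, 3), (5, 1)]
A, B = fuzzy_knn_graph(points, k=3)
for row in B:
    print(" ".join(f"{v:.3f}" for v in row))
```

Each row of A sums to log2(3), the nearest neighbor of every point receives weight 1, and B is symmetric with entries between 0 and 1, so it can be read as the probability that at least one of the two directed edges exists.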

Low-dimensional layout

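In practice, UMAP employs a force-directed graph layout algorithm in the low-dimensional space. The algorithm applies repulsive forces at vertices and gravitational (attractive) forces along edges. The gravitational force between 2 vertices $i$ and $j$ at coordinates $y_i$ and $y_j$ is computed as the following:

$$\frac{-2ab\,\lVert y_i - y_j \rVert_2^{2(b-1)}}{1 + \lVert y_i - y_j \rVert_2^{2}}\; w\big((x_i, x_j)\big)\,(y_i - y_j)$$

Where: $a$ and $b$ are hyper-parameters. The repulsive force is computed as the following:

$$\frac{2b}{\left(\epsilon + \lVert y_i - y_j \rVert_2^{2}\right)\left(1 + a\,\lVert y_i - y_j \rVert_2^{2b}\right)}\;\Big(1 - w\big((x_i, x_j)\big)\Big)\,(y_i - y_j)$$

Where: $\epsilon$ is a small constant that avoids division by zero.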

Gene similarity network and omics integration

We apply UMAP to the gene expression omics to build the GSN and visualize the genes on a two-dimensional map. The map positions the genes according to their similarity and shows the connections between linked genes. This two-dimensional map is used as a template to integrate all omics data. The integration is done by creating a circular zone of a chosen radius around each gene point, as depicted in Figure 2, and then filling those zones with colors that correspond to the omics type, as shown in Figure 3. A data sample contributes to the RGB coloring of a pixel only if the pixel falls within that radius of a gene point. The red channel (R) is designated for gene expression, the green channel (G) for DNA methylation, and the blue channel (B) for CNA.

Figure 2.

The template created by UMAP and gene expression.

Figure 3.

The integration of the 3 omics where the mixed (RGB) colors indicate the combination of their values.
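The coloring step described in this subsection can be illustrated with a short pure-Python sketch. Everything concrete here is an assumption made for the example: the image size, the radius, the pixel coordinates of the two gene points, the omics values (taken as already scaled to the range 0 to 1), and the rule of keeping the larger value per channel where zones overlap, which the text does not specify.

```python
def paint_sample(gene_xy, expr, meth, cna, size=32, radius=2):
    """Render one sample as a size x size image of [R, G, B] values.

    gene_xy maps a gene to its (x, y) pixel position on the template.
    expr, meth and cna map a gene to a value in [0, 1] for this sample.
    """
    image = [[[0.0, 0.0, 0.0] for _ in range(size)] for _ in range(size)]
    for gene, (cx, cy) in gene_xy.items():
        # R = gene expression, G = DNA methylation, B = CNA.
        color = (expr[gene], meth[gene], cna[gene])
        for y in range(max(0, cy - radius), min(size, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(size, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    for c in range(3):
                        # Assumed overlap rule: keep the larger channel value.
                        image[y][x][c] = max(image[y][x][c], color[c])
    return image

# Hypothetical template positions and sample values, for illustration only.
gene_xy = {"SPOP": (8, 8), "FOXA1": (20, 12)}
expr = {"SPOP": 0.9, "FOXA1": 0.1}
meth = {"SPOP": 0.2, "FOXA1": 0.7}
cna = {"SPOP": 0.5, "FOXA1": 0.3}

image = paint_sample(gene_xy, expr, meth, cna)
print(image[8][8], image[12][20], image[0][0])
```

Pixels inside a gene's circular zone carry that gene's three omics values as one mixed color, and pixels outside every zone stay black, which is what lets a CNN read the spatial layout of the template.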


Classification

CNNs are deep feed-forward neural networks that apply convolution operations to extract features from images. In addition to the convolutional layers, CNNs contain pooling layers, fully connected layers, and classification layers, and they can incorporate the dropout regularization technique to enhance their performance. The architecture of our CNN is as follows:

First convolutional layer

It consists of 32 convolutional filters of size 3 × 3 with a rectified linear unit (ReLU) activation, a max-pooling layer of 2 × 2 size and 1 × 1 stride, a normalization layer, and a dropout layer with a 20% ratio.

Second convolutional layer

Third convolutional layer

It consists of 32 convolutional filters of size 3 × 3 with a rectified linear unit (ReLU) activation, a max-pooling layer of 2 × 2 size and 2 × 2 stride, a normalization layer, and a dropout layer with a 50% ratio.

First fully connected layer

It consists of 128 neurons with a rectified linear unit (ReLU) activation, a normalization layer, and a dropout layer with a 10% ratio.

Second fully connected layer

It is the prediction layer, and it consists of 3 neurons that feed their output to a Softmax layer to predict the classes based on their probabilities.
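As a small illustration of the prediction layer, the sketch below shows how a softmax turns the 3 output neurons into class probabilities and how the predicted class is taken as the most probable one. The logit values are hypothetical, and the class labels are the 3 PRCA Gleason classes from Table 1; this is not the trained network, only the final step.

```python
import math

def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

CLASSES = ["3 + 4", "4 + 3", "4 + 5 and 5 + 4"]

# Hypothetical outputs of the 3 prediction neurons for one patient image.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]

print([round(p, 3) for p in probs], predicted)
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large scores.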

Experiments and Results

For this experiment, we applied the proposed model to both the PRCA and the BRCA data sets. We kept the default setting of UMAP’s number-of-neighbors parameter, which is 15. Using grid search, we set the learning rate to 0.07 and trained for 1000 epochs, which provided the best accuracy. The samples of each data set were divided into a 70% training pool and a 30% testing pool. We also ran the iSOM-GSN model on both data sets with its default parameters to compare it with the proposed method. The proposed method performed very well on the testing pool: as shown in Table 3, it achieved over 99% on every evaluation metric for the PRCA data set and on every metric except accuracy (97.66%) for the BRCA data set.

Table 3.

Performance evaluation of the proposed model and iSOM-GSN.

	The PRCA data set		The BRCA data set
Evaluation metric	Proposed model	iSOM-GSN	Proposed model	iSOM-GSN
Accuracy	99.37%	97.89%	97.66%	82.83%
Precision	99.69%	98.82%	99.59%	98.58%
Recall	99.83%	98.72%	99.70%	98.88%
F1-measure	99.75%	98.71%	99.63%	98.65%
AUC	0.9992	0.9984	0.9982	0.9676

We used the following performance measurements as evaluation metrics:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$F_1 = \frac{2 \times PPV \times TPR}{PPV + TPR}$$

Where: $PPV$ is the positive predictive value (precision), which is measured as:

$$PPV = \frac{TP}{TP + FP}$$

Sensitivity (recall) or true positive rate (TPR) is defined as:

$$TPR = \frac{TP}{TP + FN}$$

Specificity or true negative rate (TNR) is defined as:

$$TNR = \frac{TN}{TN + FP}$$

$TP$ is true positive, $FP$ is false positive, $TN$ is true negative, and $FN$ is false negative.

The results illustrate the robustness of our model. For both data sets, the model scored near perfection on almost every evaluation metric. For the PRCA data set, the accuracy of the proposed model is 99.37%, while it is 97.89% for the iSOM-GSN model. The area under the curve (AUC) is a robust overall performance measurement; it measures how well the prediction model can classify both positive and negative classes. The proposed model scored an AUC of 0.9992 compared to 0.9984 for iSOM-GSN. Similarly, for the BRCA data set, the accuracy of the proposed model is 97.66%, while it is 82.83% for the iSOM-GSN model. The proposed model scored an AUC of 0.9982 compared to 0.9676 for iSOM-GSN. In the remaining performance measurements (precision, recall, and F1-measure), the proposed model outperformed iSOM-GSN by roughly 1 percentage point on both data sets, as seen in Table 3.
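The confusion-matrix metrics named above (accuracy, positive predictive value, sensitivity, specificity, and F1-measure) can be computed with the short sketch below. The counts are hypothetical and describe a single class in a one-vs-rest setting; how the per-class values were averaged over the 3 classes in this study is not stated, so no averaging is shown.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts for one class."""
    precision = tp / (tp + fp)    # positive predictive value (PPV)
    recall = tp / (tp + fn)       # sensitivity, true positive rate (TPR)
    specificity = tn / (tn + fp)  # true negative rate (TNR)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts, for illustration only.
m = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is the harmonic mean of precision and recall, it equals 2TP / (2TP + FP + FN), which provides a quick consistency check on the computed values.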

Discussion

Many previous studies depend on either early concatenation of the omics data or late merging of independently analyzed omics data in the prediction model. Data embedding techniques instead extract meaningful relationships as visual maps and then merge those relationships in the CNN model to find global associations from the spatial representation of the omics. The model utilizes UMAP, which tries to capture both the global and the local structure of the relationships among the features and represent it on a two-dimensional map. The model outperformed the state-of-the-art iSOM-GSN in predicting cancer outcomes on 2 publicly available data sets: Gleason score levels in prostate cancer and tumor stage in breast cancer. Clinical reports may incorporate the multi-omics biomarkers to assist physicians in prescribing the proper treatment. As with iSOM-GSN, the main limitation of this work is that it can integrate only 3 omics, because it uses the RGB coloring system. Another limitation of current multi-omics data models is the lack of a large number of samples. Most publicly available data sets contain only a couple of hundred samples, which may lead to statistically insignificant results.

Conclusion

Cancer is heterogeneous in nature, so there is always a need to find biomarkers for different subtypes of cancer. In this model, the GSN map was created using UMAP to merge patients’ samples across 3 omics: gene expression, DNA methylation, and CNA. The maps are colored with the samples’ values using the RGB coloring system. The embedded patients’ maps are fed into a deep learning prediction model consisting of several CNN layers. UMAP extracts the discriminative relationships between the features by mapping them into Laplacian eigenmaps. The model was applied to the PRCA and BRCA data sets to predict cancer outcomes, and it outperformed iSOM-GSN, another embedding-based data integration model. The integration happens at the middle stage of the machine learning model. While the genes used to create the template were selected from the gene expression data, all omics were used in coloring the two-dimensional maps. Future work will investigate more types of cancer and other complex diseases, and will enhance the embedding techniques to improve molecular-based modeling.

References

1. Pattern discovery and cancer gene identification in integrated cancer genomic data.

Authors: Qianxing Mo; Sijian Wang; Venkatraman E Seshan; Adam B Olshen; Nikolaus Schultz; Chris Sander; R Scott Powers; Marc Ladanyi; Ronglai Shen
Journal: Proc Natl Acad Sci U S A Date: 2013-02-21

2. The meaning and use of the area under a receiver operating characteristic (ROC) curve.

Authors: J A Hanley; B J McNeil
Journal: Radiology Date: 1982-04

3. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis.

Authors: Ronglai Shen; Adam B Olshen; Marc Ladanyi
Journal: Bioinformatics Date: 2009-09-16

4. Integrative clustering of multi-level 'omic data based on non-negative matrix factorization algorithm.

Authors: Prabhakar Chalise; Brooke L Fridley
Journal: PLoS One Date: 2017-05-01

5. Building gene regulatory networks from scATAC-seq and scRNA-seq using Linked Self Organizing Maps.

Authors: Camden Jansen; Ricardo N Ramirez; Nicole C El-Ali; David Gomez-Cabrero; Jesper Tegner; Matthias Merkenschlager; Ana Conesa; Ali Mortazavi
Journal: PLoS Comput Biol Date: 2019-11-04

6. Classification of Breast Cancer Nottingham Prognostic Index Using High-Dimensional Embedding and Residual Neural Network.

Authors: Li Zhou; Maria Rueda; Abedalrhman Alkhateeb
Journal: Cancers (Basel) Date: 2022-02-13

7. Machine learning for multi-omics data integration in cancer.

Authors: Zhaoxiang Cai; Rebecca C Poulos; Jia Liu; Qing Zhong
Journal: iScience Date: 2022-01-22

8. Machine Learning: A New Prospect in Multi-Omics Data Analysis of Cancer.

Authors: Babak Arjmand; Shayesteh Kokabi Hamidpour; Akram Tayanloo-Beik; Parisa Goodarzi; Hamid Reza Aghayan; Hossein Adibi; Bagher Larijani
Journal: Front Genet Date: 2022-01-27

9. Mutational heterogeneity in cancer and the search for new cancer-associated genes.

Authors: Michael S Lawrence; Petar Stojanov; Paz Polak; Gregory V Kryukov; Kristian Cibulskis; Andrey Sivachenko; Scott L Carter; Chip Stewart; Craig H Mermel; Steven A Roberts; Adam Kiezun; Peter S Hammerman; Aaron McKenna; Yotam Drier; Lihua Zou; Alex H Ramos; Trevor J Pugh; Nicolas Stransky; Elena Helman; Jaegil Kim; Carrie Sougnez; Lauren Ambrogio; Elizabeth Nickerson; Erica Shefler; Maria L Cortés; Daniel Auclair; Gordon Saksena; Douglas Voet; Michael Noble; Daniel DiCara; Pei Lin; Lee Lichtenstein; David I Heiman; Timothy Fennell; Marcin Imielinski; Bryan Hernandez; Eran Hodis; Sylvan Baca; Austin M Dulak; Jens Lohr; Dan-Avi Landau; Catherine J Wu; Jorge Melendez-Zajgla; Alfredo Hidalgo-Miranda; Amnon Koren; Steven A McCarroll; Jaume Mora; Brian Crompton; Robert Onofrio; Melissa Parkin; Wendy Winckler; Kristin Ardlie; Stacey B Gabriel; Charles W M Roberts; Jaclyn A Biegel; Kimberly Stegmaier; Adam J Bass; Levi A Garraway; Matthew Meyerson; Todd R Golub; Dmitry A Gordenin; Shamil Sunyaev; Eric S Lander; Gad Getz
Journal: Nature Date: 2013-06-16

10. A multivariate approach to the integration of multi-omics datasets.

Authors: Chen Meng; Bernhard Kuster; Aedín C Culhane; Amin Moghaddas Gholami
Journal: BMC Bioinformatics Date: 2014-05-29