Hanyi Mo, Rainer Breitling, Chiara Francavilla, Jean-Marc Schwartz.
Abstract
Breast cancer is one of the most common cancers threatening women worldwide. A limited number of available treatment options, frequent recurrence, and drug resistance worsen the prognosis of breast cancer patients. Thus, there is an urgent need for methods to investigate novel treatment options, while taking into account the vast molecular heterogeneity of breast cancer. Recent advances in molecular profiling technologies, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, enable approaching breast cancer biology at multiple levels of omics interaction networks. Systems biology approaches, including computational inference from 'big data' and mechanistic modelling of specific pathways, are emerging to identify potential novel combinations of breast cancer subtype signatures and more diverse targeted therapies.
Keywords: Breast cancer; Deep learning; Multi-omics modelling; Network biology; Precision oncology
Year: 2022 PMID: 36034741 PMCID: PMC9402443 DOI: 10.1016/j.coemr.2022.100350
Source DB: PubMed Journal: Curr Opin Endocr Metab Res ISSN: 2451-9650
Figure 1. Selected omics/clinical data commonly used in bioinformatics analysis.
Selected data resources from molecular profiling technologies useful for breast cancer data analysis, grouped into three categories according to their purpose and usage. Resources under ‘data portals and databases’ host various cancer-specific data, including the projects listed under ‘ongoing data projects’, and also provide download options and bioinformatics tools for downstream analysis such as visualisation and pathway enrichment analysis. ‘General omics data sources’ lists four representative databases for gene expression, protein expression and compound information with a scope broader than cancer research. The International Cancer Genome Consortium (ICGC) [22], Genomic Data Commons (GDC) [25], cBioPortal for Cancer Genomics [57], Catalogue Of Somatic Mutations In Cancer (COSMIC) [58], Transcriptome Alterations in CanCer Omnibus (TACCO) [59], Genomics of Drug Sensitivity in Cancer (GDSC) [44], The Cancer Genome Atlas (TCGA) [23], Clinical Proteomic Tumor Analysis Consortium (CPTAC) [24], Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) [60,61], Gene Expression Omnibus (GEO) [62], PRoteomics IDEntifications (PRIDE) [63], MetaboLights [26]. Data types are described according to omics levels.
| Resource | Data types | Description | Highlight |
|---|---|---|---|
| Cancer-specific data portals or databases | |||
| ICGC | G | A comprehensive interactive database portal containing data from 84 cancer programs worldwide, 77 million somatic mutations and molecular data from over 24,000 donors. | ICGC encompasses various index search technologies to optimise computational performance for large-scale searches. |
| GDC | G | An information web-based database harmonising data from various cancer projects including TCGA and CPTAC (see below) for visualisation and downloading. | GDC aims at developing a holistic taxonomy of cancer types and providing state-of-the-art bioinformatics tools to enhance the interpretation of data. |
| cBioPortal for Cancer Genomics | G | A data portal hosting data from over 5000 tumour samples from 20 cancer studies, including the METABRIC (see below) project, enabling both web access and script libraries (e.g., MATLAB and R) to meet customised analysis requirements. | The cBioPortal provides unique functionality of interactive network analysis for studying the cancer of interest and supports the visualisation of mutations within Pfam protein domains. |
| COSMIC | G | A thorough data portal with a specialised focus on somatic mutations driving cancer development, consisting of 6 million coding mutations across 1.4 million tumour samples. | COSMIC provides an improved data visualisation and downloading portal that also hosts a 3-D protein structure exploration tool (COSMIC-3D) to link mutations to protein function. |
| TACCO | T | An easy-to-use interface for connecting transcriptome data (e.g., differentially expressed genes, DEGs and differentially expressed miRNAs, DEmiRNAs) and pathway dysregulations to clinical outcomes in pan-cancer studies. | TACCO allows users to either select DEGs/DEmiRNAs from pre-defined gene lists or upload genes of interest to perform downstream tasks (e.g., KEGG pathway/gene ontology enrichment analysis, multi-gene prognostic models). |
| GDSC | G | A pharmacogenomic data repository hosting information on anti-cancer drug sensitivity and molecular markers of drug responses, containing overall 518 compounds targeting 24 pathways. | GDSC differentiates data by the response to anti-cancer drugs and by pathways in a pan-cancer and pan-drug manner. It also allows browsing data by tissue-specific terms and incorporates TCGA cancer classifications and COSMIC mutation identities. |
| Ongoing cancer-specific data projects | |||
| TCGA | G | A long-term cancer genomic project launched in 2006, characterising more than 20,000 primary cancer samples and mapping them to 33 cancer types. | TCGA utilises numerous data generation platforms including RNA-seq, miRNA-seq, DNA-seq, array-based SNP, array-based DNA methylation sequencing, and reverse-phase protein array to provide a collection of omics data types for cancer studies. |
| CPTAC | P | A project emphasising mass spectrometry-based protein profiling of tumour samples in accordance with TCGA projects. | CPTAC incorporates the CPTAC Common Data Analysis Platform (CDAP) to diminish instrumentation variability among data and to better integrate with TCGA datasets. |
| METABRIC | G | A BC-specific data program for elucidating molecular drivers with an extensive focus on inherited copy number variations/acquired copy number alterations (CNVs/CNAs). | METABRIC identifies novel loci that contribute to breast carcinogenesis and discovers that somatic CNAs show more prognostic power in a long-term clinical context compared to germline CNVs. |
| General single-omics data sources | |||
| GEO | G | A public archive for researchers to submit array- or sequence-based functional genomic data, accessible by both web portal and R library interface. | GEO contains almost 147,000 breast cancer-related studies and 184 breast cancer-related datasets to date. |
| ArrayExpress | G | A public archive hosting data generated by a variety of profiling technologies, mostly DNA and RNA assays, with few on protein and metabolic profiling. | ArrayExpress contains over 4000 experiments regarding breast cancer including 788 DNA, 3314 RNA and 29 protein assays. |
| PRIDE | P | A proteome-focused repository mainly for depositing mass spectrometry-based proteomics data, including protein and post-translational modification expression data. | PRIDE hosts over 500 breast cancer proteomics datasets to date with details on sample preparation and data processing. |
| MetaboLights | M | A database hosting metabolomics experimental data and relevant information, serving as a central hub for metabolomics-related data and tools. | MetaboLights encompasses over 200 breast cancer-related compounds and around 134 case studies for breast cancer research. |
G, genomics; T, transcriptomics; E, epigenomics; P, proteomics; M, metabolomics; C, clinical data
Deep learning-based multi-omics integration approaches including case studies on breast cancer. Subtype-GAN [39], Denoising autoencoder for accurate CAncer Prognosis prediction (DCAP) [41], DeepProg [42], BRCA Multiomics [38], Multi-Omics Late Integration (MOLI) [43], Survival Analysis Learning with Multi-Omics neural Networks (SALMON) [37], DeepType [36], Concatenation AutoEncoder (ConcatAE) and Cross-modality AutoEncoder (CrossAE) [64], IntegrativeVAEs [65], Drug Response analysis Integrating Multi-omics (DRIM) [66].
| Software | Arch. | Purpose | Highlights |
|---|---|---|---|
| Subtype-GAN | GAN | To extract low-dimension features for predicting novel biomarkers and patient stratifications. | The first algorithm to explore the potential of generative adversarial network (GAN) architecture to improve the feature selection process by autoencoder (AE) methods. |
| DCAP | AE | To predict differentially expressed genes (DEGs) and to discriminate high- and low-risk groups of patients based on predicted DEGs. | Pan-cancer risk prediction system. It ranks the importance of omics data types by mRNA expression > miRNA expression > DNA methylation > copy number variations (CNVs). |
| DeepProg | AE | To predict patient survival subtypes using supervised machine learning algorithms on the reduced dimensions produced by the AE. | Trains on pan-cancer datasets so that well-established survival data from some cancer types can inform predictions for other, less-studied cancer types. Flexible in its input data types (e.g., mRNA expression). |
| BRCA Multiomics | MLP | To predict survival and drug responses at the same time by combining two multilayer perceptron (MLP) inferences using survival datasets from TCGA and drug response datasets from GDSC, respectively. | The tool focuses on breast cancer omics data and clinical outcomes and tries to build a connection between patient survival and treatment outcomes to predict if the treatment indeed improves the patient's condition. |
| MOLI | AE | To predict drug responses from selected features by training on each omics data type separately and then concatenating them into one representation. | MOLI employs a ‘late integration’ strategy and trains on drug response datasets targeting biological pathways rather than specific cancer types to hypothesise other non-traditional drugs for treating BC. |
| SALMON | MLP | To predict patient survival and characterise which data types are most pivotal predictors by incorporating omics data and clinical annotations (e.g., age). | SALMON groups patients by their ages at diagnosis (young: 26–50, middle: 51–70, elderly: 71–90). It identified that PR status is most predictive for the young group, ER status for the middle group and mRNA co-expression modules for the elderly group. |
| DeepType | MLP | To extend gene markers (218 DEGs) for breast cancer patient stratification by integrating omics data types and previous PAM50 subtypes. | The first deep learning-based method for patient stratification using mRNA expression only. The involvement of prior knowledge (PAM50 subtypes) addresses de novo clustering problems. |
| ConcatAE and CrossAE | AE | To ask which is more informative for patient survival prediction in multi-omics integration: the expression similarity or the difference between omics data types. | By comparing learning from the similarity and from the difference in expression between omics data types, the study reports that the expression difference is a stronger predictor. |
| IntegrativeVAEs | AE | To investigate the inner architectures of AE for feature selection for classifying patient data by clinical annotations (e.g., PAM50 labels and metastasis status). | Patient samples are labelled by distant relapse, and the co-effects of gene expression, CNA and clinical annotations are learned by different inner designs of AE for predicting the likelihood of relapse. |
| DRIM | AE | To model drug sensitivity from cancer cell lines and drug perturbation by selecting DEGs and analysing them according to pathway enrichment analysis. | DRIM provides a user-friendly website to select drug/cell line of interest for non-experts and allows users to customise the feature selection methods. |
Arch.: The deep learning architecture mainly used in these studies.
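The ‘late integration’ strategy described for MOLI in the table above (each omics data type is handled separately and the resulting features are then concatenated into one representation) can be illustrated with a minimal Python sketch. This is not MOLI's implementation: the function names, the toy patient profile and the fixed encoder weights are all hypothetical, and in a real model the per-omics encoders would be learned from data.

```python
def encode_block(features, weights):
    """Project one omics feature vector through its own encoder weights.

    `weights` is a hypothetical fixed matrix (one row per latent feature);
    a real model would learn it during training.
    """
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def late_integration(patient, encoders):
    """Encode each omics type separately, then concatenate the codes."""
    representation = []
    for omics_type in sorted(encoders):  # fixed order keeps the layout stable
        representation.extend(encode_block(patient[omics_type], encoders[omics_type]))
    return representation


# Toy patient profile with two omics types (values are made up).
patient = {"expression": [1.0, 2.0, 3.0], "cna": [0.0, 1.0]}
# One hypothetical encoder per omics type.
encoders = {
    "expression": [[1.0, 0.0, 1.0]],       # 3 features -> 1 latent feature
    "cna": [[0.5, 0.5], [1.0, -1.0]],      # 2 features -> 2 latent features
}
combined = late_integration(patient, encoders)
print(combined)  # → [0.5, -1.0, 4.0]
```

An early-integration design would instead concatenate the raw feature vectors first and pass them through a single shared encoder.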
Figure 2. Representation of three common deep learning architectures for multi-omics integration in cancer research. a) The autoencoder (AE) architecture is composed of an encoder and a decoder. Multi-omics data (inputs) are fed into the encoder to generate the low-dimensional latent space. The latent features are then decoded to reconstruct the original dimension space. The learning process is achieved by minimising the difference between inputs and outputs. b) The multilayer perceptron (MLP) architecture for a binary classifier using selected features from multi-omics data to predict a clinical outcome (e.g., metastasis or not). c) The generative adversarial network (GAN) adds random noise to latent features and compares generated samples (from the generator), based on the noise-perturbed features, with original samples (by the discriminator). The discriminator then continuously provides feedback to adjust variables in the latent space. In panels a and b, the nodes, also known as neurons, represent individual data dimensions/features in each layer. The edges connecting these nodes are analogous to the synapses between neurons in biological neural networks: they represent the (weighted) propagation of information between neurons.
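The autoencoder principle in panel a (encode the inputs into a low-dimensional latent space, decode them back, and learn by minimising the difference between inputs and outputs) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not any of the published tools: the model is purely linear, the data are a synthetic toy set, and the function names, learning rate and epoch count are hypothetical choices.

```python
import random


def encode(w_enc, x):
    """Encoder: project an input vector onto the low-dimensional latent space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]


def decode(w_dec, z):
    """Decoder: map a latent vector back to the original feature space."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in w_dec]


def train_autoencoder(samples, latent_dim, epochs=300, lr=0.02, seed=0):
    """Fit a linear autoencoder by stochastic gradient descent on 0.5 * squared error."""
    rng = random.Random(seed)
    n = len(samples[0])
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(latent_dim)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(n)]
    history = []  # mean reconstruction loss per epoch
    for _ in range(epochs):
        total = 0.0
        for x in samples:
            z = encode(w_enc, x)
            err = [xh - xi for xh, xi in zip(decode(w_dec, z), x)]
            total += 0.5 * sum(e * e for e in err)
            # Gradient with respect to the latent code, taken before the decoder update.
            grad_z = [sum(err[i] * w_dec[i][j] for i in range(n)) for j in range(latent_dim)]
            for i in range(n):
                for j in range(latent_dim):
                    w_dec[i][j] -= lr * err[i] * z[j]
            for j in range(latent_dim):
                for k in range(n):
                    w_enc[j][k] -= lr * grad_z[j] * x[k]
        history.append(total / len(samples))
    return w_enc, w_dec, history


# Toy profiles: four features all driven by one hidden factor t (synthetic data).
samples = [[t, 2 * t, -t, 0.5 * t] for t in (-1.0, -0.6, -0.2, 0.2, 0.6, 1.0)]
w_enc, w_dec, history = train_autoencoder(samples, latent_dim=1)
print(f"reconstruction loss: {history[0]:.4f} -> {history[-1]:.6f}")
```

Because the four toy features depend on a single hidden factor, one latent dimension is enough to reconstruct them. Real multi-omics autoencoders use non-linear layers and far higher input dimensions.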