Jose Liñares-Blanco, Alejandro Pazos, Carlos Fernandez-Lozano.
Abstract
In recent years, machine learning (ML) researchers have shifted their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have made omic data available for training these algorithms. To survey the state of the art, this review covers the main works that have used ML with TCGA data. First, the principal discoveries made by the TCGA consortium are presented. With this foundation established, we turn to the main objective of this study: the identification and discussion of works that have used TCGA data to train different ML approaches. After reviewing more than 100 papers, we classified them according to three pillars: the type of tumour, the type of algorithm and the biological problem predicted. One conclusion of this work is the high density of studies based on two major algorithms: Random Forest and Support Vector Machines. We also observe a rise in the use of deep artificial neural networks. The increase in integrative models for multi-omic data analysis is also worth emphasizing. The different biological conditions are a consequence of molecular homeostasis, driven by protein-coding regions, regulatory elements and the surrounding environment. Notably, a large number of works use gene expression data, which has been the data type preferred by researchers when training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour: a greater number of works have focused on the BRCA cohort, while works specific to survival, for example, centred on the GBM cohort owing to its large number of events. This review examines in depth the works and methodologies used to study TCGA cancer data. Finally, this work is intended to serve as a basis for future research in this field.
Keywords: BRCA; Cancer; Data integration; Machine learning; Multi-omics; Random Forest; Support Vector Machines; TCGA
Year: 2021 PMID: 34322589 PMCID: PMC8293929 DOI: 10.7717/peerj-cs.584
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Different types of data present in the TCGA repository.
| DNA Sequencing | Whole genome sequences |
|---|---|
| Whole exome sequences | |
| Sequence traces | |
| Mutations, including coding, splice site, germline and noncoding somatic variants | |
| mRNA sequences (calculated expression per gene, exon, splice junction and isoform) | |
| miRNA sequences (calculated expression per miRNA and isoform) | |
| Total RNA sequences (calculated expression per gene, exon, splice junction and isoform) | |
| Expression signals per gene, exon, splice junction, miRNA and isoform | |
| Arrays (raw, unnormalized, normalized) | |
| Low-pass DNA sequencing (whole-genome sequences, variants and coverage) | |
| Gene expression (raw, normalized and calls) | |
| Exon expression (raw, normalized and calls) | |
| miRNA expression (raw, normalized and calls) | |
| Array-based methylation (raw signal intensity, calculated beta values) | |
| Protein expression (high-resolution images of protein arrays, raw signals, normalized expression and mass spectrometry protein) | |
| Microsatellite instability (markers and classification) | |
| ATAC-seq (chromatin accessibility) | |
| Clinical information about patients (e.g., sex, race, ethnicity, drugs taken, metastasis status and response to treatment) | |
| Information about samples (e.g., the weight of a sample portion, days to collect and time of freezing) | |
| Images of the tumors | |
Figure 1. Quantification of the number of samples in the TCGA repository, classified by type of tumour and type of biotechnological analysis.
Clin, Clinical; SNP6, SNP6 CopyNum; DNAseq, LowPass DNASeq CopyNum; Mutat, Mutation Annotation File; Met, Methylation; rawMut, rawMutation Annotation File; Prot, Reverse Phase Protein Array.
Enumeration of the different cohorts presented by the TCGA repository, classified according to the tissue of origin of the tumour.
In addition, the original paper published by the TCGA consortium is cited.
| Cancer type | Acronym | Tissue | Citation |
|---|---|---|---|
| Breast Ductal/Lobular Carcinoma | BRCA | Breast | |
| Glioblastoma Multiforme | GBM | Central Nervous System | |
| Lower Grade Glioma | LGG | Central Nervous System | |
| Adrenocortical Carcinoma | ACC | Endocrine | |
| Papillary Thyroid Carcinoma | THCA | Endocrine | |
| Paraganglioma & Pheochromocytoma | PCPG | Endocrine | |
| Cholangiocarcinoma | CHOL | Gastrointestinal | |
| Colon Adenocarcinoma | COAD | Gastrointestinal | |
| Rectal Adenocarcinoma | READ | Gastrointestinal | |
| Esophageal Cancer | ESCA | Gastrointestinal | |
| Liver Hepatocellular Carcinoma | LIHC | Gastrointestinal | |
| Pancreatic Ductal Adenocarcinoma | PAAD | Gastrointestinal | |
| Stomach Cancer | STAD | Gastrointestinal | |
| Cervical Cancer | CESC | Gynecologic | |
| Ovarian Serous Cystadenocarcinoma | OV | Gynecologic | |
| Uterine Carcinosarcoma | UCS | Gynecologic | |
| Uterine Corpus Endometrial Carcinoma | UCEC | Gynecologic | |
| Head and Neck Squamous Cell Carcinoma | HNSC | Head and Neck | |
| Uveal Melanoma | UVM | Head and Neck | |
| Acute Myeloid Leukemia | AML | Hematologic | |
| Thymoma | THYM | Hematologic | |
| Cutaneous Melanoma | SKCM | Skin | |
| Sarcoma | SARC | Soft Tissue | |
| Lung Adenocarcinoma | LUAD | Thoracic | |
| Lung Squamous Cell Carcinoma | LUSC | Thoracic | |
| Mesothelioma | MESO | Thoracic | |
| Chromophobe Renal Cell Carcinoma | KICH | Urologic | |
| Clear Cell Kidney Carcinoma | KIRC | Urologic | |
| Papillary Kidney Carcinoma | KIRP | Urologic | |
| Prostate Adenocarcinoma | PRAD | Urologic | |
| Testicular Germ Cell Cancer | TGCT | Urologic | |
| Urothelial Bladder Carcinoma | BLCA | Urologic | |
| Diffuse Large B-cell Lymphoma | DLBC | Lymphatic tissue | |
Figure 2. (A) Number of papers that used each type of algorithm, and (B) relations between omics data used in each work.
Figure 3. Number of papers published with each of the TCGA cohorts.
Upset plot showing the number of works published with each tumor type and their combinations.
Figure 4. The proportion of published works with ML techniques according to the type of biological problem.