Suzan Arslanturk, Sorin Draghici, Tin Nguyen.
Abstract
Vast repositories of heterogeneous data from existing sources present unique opportunities. Taken individually, each of the datasets offers solutions to important domain and source-specific questions. Collectively, they represent complementary views of related data entities whose aggregate information value often well exceeds the sum of its parts. Integration of heterogeneous data is therefore paramount to i) obtain a more unified picture and comprehensive view of the relations, ii) achieve more robust results, iii) improve accuracy and integrity, and iv) illuminate the complex interactions among data features. In this paper, we propose a data integration methodology to identify subtypes of cancer using multiple data types (mRNA, methylation, microRNA and somatic variants) and different data scales that come from different platforms (microarray, sequencing, etc.). The Cancer Genome Atlas (TCGA) dataset is used to build the data integration and cancer subtyping framework. The proposed data integration and disease subtyping approach accurately identifies novel subgroups of patients with significantly different survival profiles. Given the current availability of vast genomic and variant data for cancer, the proposed data integration system will better differentiate cancer and patient subtypes for risk and outcome prediction and targeted treatment planning, without additional cost or loss of precious time.
Year: 2020 PMID: 31797627 PMCID: PMC6933742
Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928
Description of the five datasets from The Cancer Genome Atlas (TCGA)
| Data Set | Patients | Data Type | Components no. |
|---|---|---|---|
| | 124 | mRNA | 17,974 |
| | | Methylation | 23,265 |
| | | miRNA | 590 |
| | | Somatic Variant | 3,412 |
| | 273 | mRNA | 12,042 |
| | | Methylation | 22,833 |
| | | miRNA | 534 |
| | | Somatic Variant | 5,172 |
| | 158 | mRNA | 16,818 |
| | | Methylation | 22,833 |
| | | miRNA | 552 |
| | | Somatic Variant | 1,259 |
| | 172 | mRNA | 20,100 |
| | | Methylation | 22,533 |
| | | miRNA | 718 |
| | | Somatic Variant | 8,805 |
| | 145 | mRNA | 17,062 |
| | | Methylation | 24,454 |
| | | miRNA | 710 |
| | | Somatic Variant | 13,309 |
Figure 1. Framework of the proposed subtyping and data integration method
Comparison of the subtypes identified using the proposed method and state-of-the-art techniques. Cells highlighted in green have Cox P-values < 0.01. Cells highlighted in yellow have Cox P-values between 0.01 and 0.05.
| Name | Data type | k | Cox P | k | Cox P | k | Cox P | k | Cox P | k | Cox P | k | Cox P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mRNA | 2 | 0.408 | 2 | 0.408 | 5 | 0.281 | 2 | 0.992 | 10 | 0.056 | 2 | 0.408 | |
| Methylation | 2 | 10⁻⁴ | 2 | 10⁻⁴ | 6 | 0.001 | 2 | 0.017 | 10 | 0.003 | 3 | 10⁻⁴ | |
| miRNA | 3 | 0.051 | 4 | 0.086 | 6 | 0.526 | 2 | 0.401 | 10 | 0.09 | 2 | 0.276 | |
| Somatic Variant | 2 | 0.016 | - | - | - | - | 3 | 0.632 | 8 | 0.324 | - | - | |
| 7 | 0.039 | 4 | 0.162 | 5 | 0.156 | 2 | 0.408 | ||||||
| mRNA | 6 | 0.003 | 5 | 0.003 | 6 | 8×10⁻⁴ | 2 | 0.327 | 6 | 0.01 | 2 | 0.027 | |
| Methylation | 4 | 0.893 | 6 | 0.239 | 7 | 0.049 | 2 | 0.993 | 10 | 0.002 | 2 | 0.04 | |
| miRNA | 2 | 0.065 | 2 | 0.072 | 6 | 0.017 | 3 | 0.183 | - | - | 2 | 0.07 | |
| Somatic Variant | 6 | 0.469 | - | - | - | - | 3 | 0.532 | 5 | 0.324 | - | - | |
| 8 | 0.035 | 3 | 0.027 | 5 | 0.036 | 3 | 0.032 | ||||||
| mRNA | 2 | 0.902 | 2 | 0.902 | 8 | 0.114 | 2 | 0.969 | 9 | 0.101 | 2 | 0.902 | |
| Methylation | 4 | 0.048 | 4 | 0.048 | 8 | 0.578 | 5 | 0.878 | 10 | 0.083 | 2 | 0.702 | |
| miRNA | 3 | 0.218 | 3 | 0.218 | 5 | 0.142 | 2 | 0.105 | - | - | 2 | 0.093 | |
| Somatic Variant | 2 | 0.002 | - | - | - | - | 3 | 0.324 | 10 | 0.132 | - | - | |
| 7 | 7 | 0.667 | 2 | 0.398 | 10 | 0.402 | 2 | 0.902 | |||||
| mRNA | 2 | 0.109 | 2 | 0.113 | 8 | 0.048 | 2 | 0.148 | 6 | 0.29 | 2 | 0.113 | |
| Methylation | 2 | 0.719 | 2 | 0.741 | 8 | 0.034 | 2 | 0.389 | 10 | 0.194 | 2 | 0.741 | |
| miRNA | 4 | 0.468 | 4 | 0.452 | 7 | 0.318 | 3 | 0.131 | - | - | 2 | 0.801 | |
| Somatic Variant | 9 | 0.365 | - | - | - | - | 3 | 0.218 | 10 | 0.421 | - | - | |
| 5 | 0.225 | 2 | 0.246 | 10 | 0.319 | 2 | 0.113 | ||||||
| mRNA | 2 | 0.176 | 2 | 0.176 | 7 | 0.073 | 2 | 0.219 | 9 | 0.072 | 2 | 0.176 | |
| Methylation | 3 | 0.111 | 3 | 0.111 | 6 | 0.128 | 3 | 0.577 | 10 | 0.14 | 3 | 0.111 | |
| miRNA | 2 | 0.138 | 2 | 0.138 | 5 | 0.509 | 2 | 0.138 | - | - | 2 | 0.138 | |
| Somatic Variant | 2 | 0.076 | - | - | - | - | 3 | 0.124 | 9 | 0.348 | - | - | |
| 6 | 0.104 | 3 | 0.248 | 7 | 0.067 | 2 | 0.176 | ||||||
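The P-values in the table above measure how strongly survival differs between the discovered subtypes. As a rough illustration only, the sketch below implements a two-group log-rank test in plain Python. This is a related but simpler test than the Cox regression P-value reported in the table, and it is not taken from the paper. The function name `logrank_two_groups` and the toy inputs are hypothetical.

```python
import math

def logrank_two_groups(times, events, groups):
    """Two-group log-rank test (illustrative sketch, not the paper's procedure).

    times  : follow-up time for each patient
    events : 1 if death was observed, 0 if the patient was censored
    groups : subtype label for each patient, 0 or 1
    Returns (chi-square statistic, P-value with 1 degree of freedom).
    """
    observed_minus_expected = 0.0
    variance = 0.0
    # Distinct times at which at least one death occurred
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        n = sum(1 for x in times if x >= t)  # patients still at risk
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == 1)
        # Deaths in group 1 minus the number expected if both groups were alike
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group-1 death count at time t
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Toy data: group 1 dies early, group 0 dies late, no censoring
chi2, p = logrank_two_groups(
    times=[1, 2, 3, 4, 5, 6],
    events=[1, 1, 1, 1, 1, 1],
    groups=[1, 1, 1, 0, 0, 0],
)
print(f"chi2 = {chi2:.3f}, P = {p:.4f}")
```

With more than two subtypes (most rows of the table have k > 2), a multi-group log-rank test or a Cox model with subtype as a categorical covariate would be needed. An established survival-analysis library is a better choice than this sketch for real analyses.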
Figure 2. Kaplan-Meier survival curves of integrative genomic data clustering using the proposed approach (left), PINS (center), and CC (right).
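For readers unfamiliar with how such curves are built, here is a minimal sketch of the Kaplan-Meier product-limit estimator for a single group of patients. The function name `kaplan_meier` and the toy data are hypothetical and are not drawn from the paper or from TCGA.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate for one group (illustrative sketch).

    times  : follow-up time for each patient
    events : 1 if death was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        # Multiply by the conditional probability of surviving past time t
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Toy data: four patients, the second one censored at time 2
print(kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Computing one such curve per subtype and plotting them as step functions gives a figure of the kind described in the caption. Censored patients leave the risk set without producing a drop in the curve.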