Angela Serra, Michele Fratello, Vittorio Fortino, Giancarlo Raiconi, Roberto Tagliaferri, Dario Greco.
Abstract
BACKGROUND: Multiple high-throughput molecular profiles generated by omics technologies can be collected for the same individuals. Combining these data, rather than exploiting them separately, can significantly increase the power of clinically relevant patient subclassifications.
Year: 2015 PMID: 26283178 PMCID: PMC4539887 DOI: 10.1186/s12859-015-0680-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The proposed approach: The computational approach is composed of four steps. First, the data are pre-processed: in each view, features with low variance are filtered out, and the remaining features are clustered to reduce the input dimension. From each cluster, prototypes are extracted, and these prototypes are the only features used in the following steps (a). Second, the prototypes are ranked by patient class separability and the most significant ones are selected (b). Third, the patients are clustered and the membership matrices are obtained (c). Fourth, a late integration approach is used to integrate the clustering results (d).
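Step (a) of the caption can be illustrated with a minimal sketch. It assumes that a feature clustering is already available and that the prototype of a cluster is its medoid (the member closest to all the others); the caption does not state how prototypes are chosen, so the medoid rule, the function names and the toy values below are all hypothetical.

```python
from statistics import pvariance


def filter_low_variance(features, threshold):
    """Keep features (name -> values across patients) whose variance exceeds threshold."""
    return {name: vals for name, vals in features.items() if pvariance(vals) > threshold}


def squared_distance(a, b):
    """Squared Euclidean distance between two feature profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def extract_prototypes(features, cluster_of):
    """Return one prototype per feature cluster (assumption: the medoid)."""
    clusters = {}
    for name in features:
        clusters.setdefault(cluster_of[name], []).append(name)
    prototypes = {}
    for label, members in clusters.items():
        # The medoid minimizes the summed distance to the other cluster members.
        prototypes[label] = min(
            members,
            key=lambda m: sum(squared_distance(features[m], features[o]) for o in members),
        )
    return prototypes


# Hypothetical toy view: five features measured on four patients.
view = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [1.2, 2.2, 3.1, 4.1],
    "g4": [5.0, 5.0, 5.0, 5.0],  # constant, so removed by the variance filter
    "g5": [4.0, 1.0, 3.0, 0.0],
}
kept = filter_low_variance(view, threshold=0.01)
# Hypothetical feature clustering of the surviving features.
prototypes = extract_prototypes(kept, {"g1": 0, "g2": 0, "g3": 0, "g5": 1})
```

Only the prototypes would be passed on to the ranking, patient clustering and integration steps, which is what reduces the input dimension.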
Datasets: Description of the datasets used in this study
| Dataset | Response | N(0) | N(1) | N(2) | N(3) | Gene expression | RNASeq | microRNA expression | miRNASeq | Protein expression | Copy number | Clinical data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Breast Cancer from The Cancer Genome Atlas, N = 151 | | | | | | | | | | | | |
| TCGA.BRC | Pam50 (Her2, Basal, LumA, LumB) | 24 | 13 | 55 | 59 | | x | | x | | | |
| Breast Cancer from The Gene Expression Omnibus, N = 201 | | | | | | | | | | | | |
| OXF.BRC.1 | Pam50 (Her2, Basal, LumA, LumB) | 26 | 6 | 117 | 52 | x | | x | | | | |
| OXF.BRC.2 | Clinical (Level1, Level2, Level3, Level4) | 73 | 54 | 42 | 32 | x | | x | | | | |
| Prostate Cancer from Memorial Sloan-Kettering Cancer Center, N = 88 | | | | | | | | | | | | |
| MSKCC.PRCA | Tumor stages T1 vs. T2, T3, T4 | 53 | 35 | | | x | | x | | | x | x |
| Ovarian Cancer from The Cancer Genome Atlas, N = 398 | | | | | | | | | | | | |
| TCGA.OVG | Tumor stage I,II; Tumor stage III; Tumor stage IV | 33 | 315 | 50 | | x | | x | | x | | |
| Glioblastoma Multiforme from The Cancer Genome Atlas, N = 167 | | | | | | | | | | | | |
| TCGA.GBM | (Classical, Mesenchymal, Neural, Proneural) | 37 | 54 | 24 | 52 | x | | x | | | | |
“N” is the number of subjects in each dataset. N(i) is the number of samples in the i-th class. An x denotes that the view (column) is available for the dataset (row)
Validation Results: The mean classification error, normalized mutual information (NMI) and stability over all datasets are shown. Classification error and NMI measure the agreement between the clusters produced by an approach and the real patient classification
| | Feature | Integration | Algorithm | Error | NMI | Stability |
|---|---|---|---|---|---|---|
| Single View | All Feature | - | Ward | 30.08 % | 26 % | 86 % |
| | | - | Kmeans | 30.93 % | 25 % | 51 % |
| | | - | Pamk | 30.75 % | 24 % | 94 % |
| | Selected Prototype | - | Ward | 30.72 % | 26 % | 89 % |
| | | - | Kmeans | 30.36 % | 25 % | 52 % |
| | | - | Pamk | 30.78 % | 24 % | 96 % |
| Multi-View | All Feature | Early Integration | Tw-kmeans | 37.10 % | 24 % | 69 % |
| | All Feature | Intermediate Integration | SNF | 30.83 % | 22 % | 83 % |
| | All Feature in Cluster of Selected Prototype | Intermediate Integration | SNF | 31.31 % | 18 % | 82 % |
| | Selected Prototype | Late Integration unsupervised | MF/GLI | | | |
| | Selected Prototype | Late Integration semi-supervised | MF/GLI | | | |
Bold font in percentage indicates best performance in the experiments
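As an illustration of the NMI column, the sketch below computes normalized mutual information between a clustering and the known patient classes. The record does not say which normalization was used; dividing by the geometric mean of the two entropies is an assumption made here, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(clusters, classes):
    """NMI between cluster assignments and class labels.

    Assumption: normalization by sqrt(H(clusters) * H(classes)).
    """
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    n_cluster = Counter(clusters)
    n_class = Counter(classes)
    mutual_information = sum(
        (nij / n) * log(n * nij / (n_cluster[c] * n_class[y]))
        for (c, y), nij in joint.items()
    )
    denominator = sqrt(entropy(clusters) * entropy(classes))
    # If either labeling has a single group, the entropy is zero and NMI is set to 0.
    return mutual_information / denominator if denominator > 0 else 0.0


# Hypothetical toy labels: a clustering that matches the classes exactly,
# and one that is statistically independent of them.
classes = ["LumA", "LumA", "Basal", "Basal"]
matching = normalized_mutual_information([0, 0, 1, 1], classes)
independent = normalized_mutual_information([0, 1, 0, 1], classes)
```

Under this normalization the score lies between 0 (independent labelings) and 1 (identical groupings up to renaming), which is why higher percentages in the NMI column indicate better agreement.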
Fig. 2 Multi-View Clusters Statistics: For each cluster class label, the p-value and the view contribution are reported. For all six datasets, the results showed that the matrix factorization method gives lower classification error and better accuracy than the approach with general linear integration
Fig. 3 Cluster Impurity difference between single view and integration analysis: Cluster impurity was evaluated as the fraction of objects that were inconsistent with the label of their cluster. It was calculated using each data type alone and by integrating them. Errors decreased with the integration approach, in particular when the semi-supervised methodologies were used
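The impurity measure described in this caption can be sketched as follows. The sketch assumes that the label of a cluster is the majority class of its members, which the caption does not state explicitly; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_impurity(clusters, classes):
    """Per-cluster fraction of members whose class differs from the cluster label.

    Assumption: the cluster label is the majority class among its members.
    """
    members = {}
    for cluster, label in zip(clusters, classes):
        members.setdefault(cluster, []).append(label)
    impurity = {}
    for cluster, labels in members.items():
        majority_count = Counter(labels).most_common(1)[0][1]
        impurity[cluster] = 1.0 - majority_count / len(labels)
    return impurity


# Hypothetical toy example: five patients in two clusters.
impurity = cluster_impurity(
    clusters=[0, 0, 0, 1, 1],
    classes=["LumA", "LumA", "Basal", "Basal", "Basal"],
)
```

Computing this once per single view and once for the integrated clustering, then taking the difference, would give the kind of comparison the figure reports.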
Fig. 4 Difference between alternative integration methods: The mean cluster stability is reported, as calculated on four covariates represented by the type of experiment executed. Clustering stability was calculated by comparing the unsupervised and the semi-supervised mode, both using either all the features or only the selected prototypes
Best combination of methods for each step: Summary of the best combination of algorithms for each view used to obtain the best grouping of patients that identifies significant sub-classes
| Dataset | Views | (a) Feature clustering | (b) Feature selection | (c) Patients clustering | (d) Late integration |
|---|---|---|---|---|---|
| TCGA.BRCA | RNASeq | Pam | CAT-score | Kmeans | MF |
| | miRNASeq | Pam | CAT-score | Pam | |
| TCGA.OV | Gene Expression | Pam | Random Forest | DM | MF |
| | Protein Expression | Pam | - | DM | |
| | miRNA Expression | Pam | - | DM | |
| TCGA.GBM | Gene Expressions | Spectral | CAT-score | Kmeans | MF |
| | miRNA Expression | Ward | - | Kmeans | |
| OXF.BRCA.1 | Gene Expressions | Pam | Random Forest | Ward | GLI |
| | miRNA Expression | Pam | Random Forest | Kmeans | |
| OXF.BRCA.2 | Gene Expressions | Pvcluster | CAT-score | Kmeans | MF |
| | miRNA Expressions | Pam | Random Forest | Kmeans | |
| MSKCC | Gene Expressions | Pam | CAT-score | Kmeans | MF |
| | miRNA Expressions | Pam | - | Pam | |
| | CNV | Spectral | CAT-score | Kmeans | |
| | Clinical | - | - | Pam | |
In the feature selection column, the symbol (-) means that feature selection was not performed because the number of features was small. The symbol (DM) in the patients clustering column means that the same classification error was obtained with all the algorithms used
Fig. 5 Breast Cancer Gene Analysis: (a) The Venn diagram shows the number of relevant genes shared among the three datasets. The analysis highlights 45 genes common to the three lists. (b) The bubble plot displays the enriched GO terms found by using DAVID. A transparent bubble indicates a set of significant genes, and a dark bubble indicates a set of highly significant genes. The diameter of a bubble indicates the number of genes related to the same GO term
Oxford Dataset: Class definition by clinical data
| Class | Clinical information |
|---|---|
| Level1 | er = 1, node = 0, grade = 1–2 |
| | er = 1, node = 0, grade = 3–4 |
| Level2 | er = 1, node > 0, grade = 1–2 |
| | er = 1, node > 0, grade = 3–4 |
| Level3 | er = 0, node = 0, grade = 1–2 |
| | er = 0, node = 0, grade = 3–4 |
| Level4 | er = 0, node > 0, grade = 1–2 |
| | er = 0, node > 0, grade = 3–4 |
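Read as a rule, the table assigns a level from the er and node values alone, since both grade ranges appear under every level. A minimal sketch of that mapping follows; the function name and the input validation are hypothetical additions.

```python
def oxford_level(er, node):
    """Map er status (0 or 1) and node count (>= 0) to the class in the table.

    Grade is not an argument: in the table, grades 1-2 and 3-4 fall in the
    same level for every er/node combination.
    """
    if er not in (0, 1):
        raise ValueError("er must be 0 or 1")
    if node < 0:
        raise ValueError("node must be non-negative")
    if er == 1:
        return "Level1" if node == 0 else "Level2"
    return "Level3" if node == 0 else "Level4"


# Hypothetical patients given as (er, node) pairs.
levels = [oxford_level(er, node) for er, node in [(1, 0), (1, 3), (0, 0), (0, 2)]]
```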