Bisakha Ray, Wenke Liu, David Fenyö.
Abstract
The amounts and types of available multimodal tumor data are rapidly increasing, and their integration is critical for fully understanding the underlying cancer biology and personalizing treatment. However, the development of methods for effectively integrating multimodal data in a principled manner is lagging behind our ability to generate the data. In this article, we introduce an extension to a multiview nonnegative matrix factorization (NNMF) algorithm for dimensionality reduction and integration of heterogeneous data types and compare the predictive modeling performance of the method on unimodal and multimodal data. We also present a comparative evaluation of our novel multiview approach and current data integration methods. Our work provides an efficient way to extend an existing dimensionality reduction method. We report rigorous evaluation of the method on large-scale quantitative protein and phosphoprotein tumor data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) acquired using state-of-the-art liquid chromatography mass spectrometry. Exome sequencing and RNA-Seq data were also available from The Cancer Genome Atlas for the same tumors. For unimodal data, in the case of breast cancer, transcript levels were most predictive of estrogen and progesterone receptor status, and copy number variation was most predictive of human epidermal growth factor receptor 2 status. For ovarian cancer, phosphoprotein levels were most predictive of tumor grade and stage; for colon cancer, protein levels were most predictive of tumor stage and residual tumor. When multiview NNMF was applied to multimodal data to predict outcomes, the improvement in performance over unimodal data was not statistically significant overall, suggesting that proteomics data may contain more predictive information regarding tumor phenotypes than transcript levels, probably because proteins are the functional gene products and therefore a more direct measurement of the functional state of the tumor.
Here, we have applied our proposed approach to multimodal molecular data for tumors, but it is generally applicable to dimensionality reduction and joint analysis of any type of multimodal data.
Keywords: Multimodal data; dimensionality reduction; nonnegative matrix factorization; phenotype prediction; proteogenomics
Year: 2017 PMID: 28835735 PMCID: PMC5564898 DOI: 10.1177/1176935117725727
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
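The abstract describes NNMF as the dimensionality reduction step applied to each data modality. The following is a minimal sketch of a generic NNMF using the standard multiplicative-update rules for the Frobenius objective; it is not the authors' implementation, and the function names `nnmf` and `frobenius_error`, the toy matrix, and the choice of 4 components are assumptions made for illustration only.

```python
import numpy as np

def nnmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix X (samples x features) as W @ H, with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps   # reduced representation of the samples
    H = rng.random((k, m)) + eps   # basis (components x features)
    for _ in range(n_iter):
        # Multiplicative updates keep both factors nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def frobenius_error(X, W, H):
    """Reconstruction error ||X - W H||_F."""
    return float(np.linalg.norm(X - W @ H))

# Toy stand-in for one modality: 20 samples, 30 features, nonnegative, rank 4.
rng = np.random.default_rng(1)
X = rng.random((20, 4)) @ rng.random((4, 30))
W, H = nnmf(X, k=4)
```

The rows of `W` are the low-dimensional sample representations that a downstream classifier would consume; the number of components plays the role of the "top 50-60 components" mentioned in the table captions.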
Characteristics of data sets/tasks used in this study.
| Data set / task | N(0) | N(1) | Phosphoprotein | Protein level | Copy number | Transcript level |
|---|---|---|---|---|---|---|
| Breast cancer | | | | | | |
| PR status (negative vs positive) | 34 | 43 | X | X | X | X |
| ER status (negative vs positive) | 23 | 54 | X | X | X | X |
| HER2 status (negative vs positive) | 58 | 19 | X | X | X | X |
| Ovarian cancer | | | | | | |
| Tumor stage (IC, IIA, IIB, IIC, IIIA and IIIB) vs IIIC | 19 | 50 | X | X | X | X |
| Tumor grade (G1, G2) vs G3 | 57 | 12 | X | X | X | X |
| Survival ≥ 1 y | 12 | 57 | X | X | X | X |
| Survival ≥ 2 y | 22 | 47 | X | X | X | X |
| Survival ≥ 3 y | 36 | 33 | X | X | X | X |
| Survival ≥ 4 y | 49 | 20 | X | X | X | X |
| Survival ≥ 5 y | 55 | 14 | X | X | X | X |
| Colon cancer | | | | | | |
| Tumor stage (I, IIA, IIB) vs (IIIA, IIIB, IV) | 52 | 38 | | X | X | X |
| Residual tumor R0 vs (RX, R1, and R2) | 68 | 12 | | X | X | X |
| Survival ≥ 1 y | 45 | 45 | | X | X | X |
| Survival ≥ 2 y | 70 | 20 | | X | X | X |
| Survival ≥ 3 y | 79 | 11 | | X | X | X |
Abbreviations: ER, estrogen receptor; HER2, human epidermal growth factor receptor 2; PR, progesterone receptor.
N(0) and N(1) denote the number of subjects for classes 0 and 1, respectively. The encoding of classes is given in the first column.
Figure 1. Comparison of the area under the ROC curve for predictive models built with unimodal data and with multimodal data integration (uniform integration and Adaptive Multiview NNMF), averaged over all the phenotypes from each Clinical Proteomic Tumor Analysis Consortium data set. The average performance of the best unimodal data was overall comparable with the best models from uniform integration or Adaptive Multiview NNMF. AUC indicates area under ROC curve; NNMF, nonnegative matrix factorization algorithm; ROC, receiver operating characteristic.
Figure 5. Comparison of the best-performing unimodal modality with (A) uniform integration and (B) Adaptive Multiview NNMF for the different tasks. Predictivity is measured by the area under the receiver operating characteristic curve (AUC). The results in (A) are nominal comparisons of the AUC differences in individual data sets/tasks using uniform integration, whereas the results in (B) are the corresponding nominal comparisons using Adaptive Multiview NNMF. NNMF indicates nonnegative matrix factorization algorithm.
Figure 2. The AUCs for predictive models built with linear support vector machines on the Clinical Proteomic Tumor Analysis Consortium breast cancer data. Models built with transcript levels performed better than models built with other data modalities for PR status and ER status. For HER2 status, copy number was the most predictive modality. The error bars represent standard errors of the mean. AUC indicates area under receiver operating characteristic curve; ER, estrogen receptor; HER2, human epidermal growth factor receptor 2; PR, progesterone receptor.
AUC performance for the CPTAC breast cancer data using NNMF for unimodal data and Adaptive Multiview NNMF method for multimodal data (top 50-60 components and 5508 genes).
| CPTAC breast cancer | PR status | ER status | HER2 status |
|---|---|---|---|
| Phosphoprotein (PP) level | 0.82 (0.02) | 0.93 (0.02) | 0.83 (0.05) |
| Copy number (CN) | 0.71 (0.03) | 0.88 (0.02) | |
| Transcript (T) level | | | 0.92 (0.03) |
| Protein (P) level | 0.85 (0.04) | 0.94 (0.02) | 0.93 (0.04) |
| PP, CN | 0.78 (0.03) | 0.91 (0.03) | 0.97 (0.03) |
| PP, T | 0.86 (0.03) | 0.98 (0.02) | 0.87 (0.03) |
| PP, P | 0.85 (0.03) | 0.93 (0.03) | 0.91 (0.02) |
| CN, T | 0.82 (0.03) | 0.98 (0.03) | 0.97 (0.03) |
| CN, P | 0.75 (0.04) | 0.92 (0.04) | 0.97 (0.04) |
| T, P | 0.88 (0.03) | 0.99 (0.02) | 0.92 (0.04) |
| PP, CN, T | 0.84 (0.04) | 0.98 (0.02) | 0.86 (0.04) |
| PP, CN, P | 0.82 (0.02) | 0.94 (0.03) | 0.85 (0.03) |
| PP, T, P | 0.86 (0.03) | 0.98 (0.03) | 0.84 (0.02) |
| CN, T, P | 0.85 (0.03) | 0.97 (0.02) | 0.85 (0.04) |
| PP, CN, T, P | 0.87 (0.01) | 0.96 (0.01) | 0.88 (0.01) |
Abbreviations: AUC, area under receiver operating characteristic curve; CPTAC, Clinical Proteomic Tumor Analysis Consortium; ER, estrogen receptor; HER2, human epidermal growth factor receptor 2; NNMF, nonnegative matrix factorization algorithm; PR, progesterone receptor.
Bold values indicate the best unimodal performance. The numbers in parentheses indicate standard error.
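Each table entry is an AUC with its standard error in parentheses. As a hedged illustration of how such a pair can be computed, the sketch below evaluates the AUC as the fraction of positive-negative pairs ranked correctly and then summarizes AUCs from repeated resampling by their mean and standard error of the mean. The helper names and the toy numbers are assumptions, not values or code from the study.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the probability that a positive sample outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

def mean_and_sem(values):
    """Mean and standard error of the mean over resampling iterations."""
    values = np.asarray(values, dtype=float)
    sem = values.std(ddof=1) / np.sqrt(values.size)
    return float(values.mean()), float(sem)

# Toy example: one held-out fold, then a summary over three hypothetical folds.
fold_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
avg_auc, sem_auc = mean_and_sem([0.8, 0.9, 1.0])
```

In the toy fold, three of the four positive-negative pairs are ordered correctly, so `fold_auc` is 0.75.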
AUC for the CPTAC ovarian cancer data using NNMF for unimodal data and Adaptive Multiview NNMF method for multimodal data (top 50-60 components and 1441 genes).
| CPTAC ovarian cancer | Tumor stage | Tumor grade | ≥1 y | ≥2 y | ≥3 y | ≥4 y | ≥5 y |
|---|---|---|---|---|---|---|---|
| Phosphoprotein (PP) level | | | 0.79 (0.02) | | | 0.69 (0.01) | |
| Copy number (CN) | 0.71 (0.01) | 0.80 (0.01) | 0.77 (0.02) | 0.70 (0.01) | | 0.70 (0.01) | 0.74 (0.01) |
| Transcript (T) level | 0.72 (0.01) | 0.76 (0.01) | 0.75 (0.02) | | | 0.69 (0.01) | |
| Protein (P) level | 0.72 (0.01) | 0.70 (0.02) | | | 0.68 (0.03) | | 0.74 (0.02) |
| PP, CN | 0.70 (0.02) | 0.82 (0.01) | 0.79 (0.01) | 0.71 (0.02) | 0.69 (0.01) | 0.70 (0.02) | 0.75 (0.02) |
| PP, T | 0.71 (0.02) | 0.78 (0.01) | 0.76 (0.02) | 0.71 (0.02) | 0.68 (0.01) | 0.70 (0.02) | 0.72 (0.02) |
| PP, P | 0.71 (0.02) | 0.80 (0.02) | 0.84 (0.02) | 0.72 (0.02) | 0.67 (0.01) | 0.69 (0.02) | 0.75 (0.02) |
| CN, T | 0.74 (0.02) | 0.79 (0.02) | 0.75 (0.02) | 0.70 (0.02) | 0.70 (0.02) | 0.71 (0.02) | 0.74 (0.02) |
| CN, P | 0.69 (0.02) | 0.80 (0.02) | 0.79 (0.02) | 0.71 (0.02) | 0.68 (0.01) | 0.68 (0.02) | 0.76 (0.02) |
| T, P | 0.72 (0.02) | 0.73 (0.02) | 0.76 (0.02) | 0.71 (0.02) | 0.69 (0.02) | 0.71 (0.02) | 0.76 (0.02) |
| PP, CN, T | 0.72 (0.02) | 0.77 (0.02) | 0.77 (0.02) | 0.72 (0.01) | 0.70 (0.02) | 0.68 (0.02) | 0.74 (0.02) |
| PP, CN, P | 0.73 (0.02) | 0.81 (0.02) | 0.85 (0.02) | 0.71 (0.02) | 0.70 (0.02) | 0.70 (0.02) | 0.76 (0.02) |
| PP, T, P | 0.72 (0.02) | 0.76 (0.02) | 0.77 (0.02) | 0.74 (0.02) | 0.70 (0.02) | 0.71 (0.01) | 0.76 (0.02) |
| CN, T, P | 0.72 (0.02) | 0.76 (0.02) | 0.78 (0.01) | 0.71 (0.02) | 0.69 (0.02) | 0.69 (0.02) | 0.75 (0.02) |
| PP, CN, T, P | 0.73 (0.01) | 0.78 (0.01) | 0.77 (0.01) | 0.73 (0.01) | 0.70 (0.01) | 0.71 (0.01) | 0.76 (0.01) |
Abbreviations: AUC, area under receiver operating characteristic curve; CPTAC, Clinical Proteomic Tumor Analysis Consortium; NNMF, nonnegative matrix factorization algorithm.
Bold values indicate the best unimodal performance. The numbers in parentheses indicate standard error.
Figure 3. The AUCs for predictive models built with omics data and linear support vector machines on the Clinical Proteomic Tumor Analysis Consortium ovarian cancer data. The best performing models for tumor stage and tumor grade were based on phosphoprotein levels. For survival ≥2 years and beyond, all the modalities showed comparable performance. For survival ≥1 year, protein expression was the most predictive modality. The error bars represent standard errors of the mean. AUC indicates area under receiver operating characteristic curve.
AUC performance for the CPTAC colon cancer data using NNMF for unimodal data and Adaptive Multiview NNMF method for multimodal data (top 50-60 components and 3764 genes).
| CPTAC colon cancer | Tumor stage | Residual tumor | ≥1 y | ≥2 y | ≥3 y |
|---|---|---|---|---|---|
| Copy number (CN) | 0.67 (0.01) | 0.78 (0.02) | 0.67 (0.01) | | |
| Transcript (T) level | 0.67 (0.01) | 0.76 (0.03) | | | 0.78 (0.03) |
| Protein (P) level | | | 0.67 (0.02) | | |
| CN, T | 0.68 (0.02) | 0.66 (0.03) | 0.66 (0.01) | 0.69 (0.01) | 0.79 (0.02) |
| CN, P | 0.71 (0.02) | 0.72 (0.03) | 0.66 (0.01) | 0.69 (0.01) | 0.79 (0.03) |
| T, P | 0.71 (0.02) | 0.73 (0.03) | 0.67 (0.02) | 0.69 (0.02) | 0.79 (0.02) |
| CN, T, P | 0.71 (0.02) | 0.71 (0.03) | 0.66 (0.01) | 0.69 (0.02) | 0.76 (0.02) |
Abbreviations: AUC, area under receiver operating characteristic curve; CPTAC, Clinical Proteomic Tumor Analysis Consortium; NNMF, nonnegative matrix factorization algorithm.
Bold values indicate the best unimodal performance. The numbers in parentheses indicate standard error.
P < .05.
Figure 4. The AUCs for predictive models built with omics data with linear support vector machines on the Clinical Proteomic Tumor Analysis Consortium colon cancer data. The best performing models for tumor stage and residual tumor were based on protein levels. The error bars represent standard errors of the mean. AUC indicates area under receiver operating characteristic curve.
1. Repeat until maximum iterations:
   a. For each resampling iteration do:
      i. Hold out specific test samples
      ii. Initialize
      iii. Perform dimensionality reduction on
      iv. Train model on
      v. Test model on
2. Average cross-validation performances from
3. Scale AUC performance,
4. Repeat until maximum iterations:
   a. For each resampling iteration do:
      i. Hold out specific samples
      ii. Perform dimensionality reduction on
      iii. Train a support vector machine classifier on the concatenated, multimodal matrices
      iv. Project the test samples into the same space as the training data to obtain the reduced test data
      v. Test model on uniformly integrated matrices
5. Average cross-validation performance to obtain final AUC.
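The steps above can be sketched as follows. Several details are missing from the listing (the initialization, the exact AUC scaling, and the matrices each step operates on), so this is one possible reading rather than the authors' procedure. It assumes that the scaled unimodal AUCs act as per-modality weights normalized to sum to 1, that weighted modalities are concatenated column-wise before NNMF, and that held-out samples are projected by keeping the learned basis fixed. All function names and the toy data are hypothetical, and the classifier itself is omitted.

```python
import numpy as np

def auc_to_weights(aucs):
    """Assumed scaling: divide each unimodal AUC by their sum so weights add to 1."""
    aucs = np.asarray(aucs, dtype=float)
    return aucs / aucs.sum()

def integrate(views, weights):
    """Concatenate modality matrices column-wise, each scaled by its weight."""
    return np.hstack([w * V for w, V in zip(weights, views)])

def fit_nnmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NNMF: X (samples x features) is approximated by W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def project(X_new, H, n_iter=200, eps=1e-9, seed=0):
    """Reduce held-out samples in the training space: update W only, H stays fixed."""
    rng = np.random.default_rng(seed)
    W = rng.random((X_new.shape[0], H.shape[0])) + eps
    XHt, HHt = X_new @ H.T, H @ H.T
    for _ in range(n_iter):
        W *= XHt / (W @ HHt + eps)
    return W

# Toy data: two nonnegative modalities measured on the same 30 samples.
rng = np.random.default_rng(42)
views = [rng.random((30, 50)), rng.random((30, 40))]
unimodal_aucs = [0.9, 0.6]                 # hypothetical cross-validated AUCs
weights = auc_to_weights(unimodal_aucs)

X = integrate(views, weights)
train, test = np.arange(24), np.arange(24, 30)   # hold out the last 6 samples
W_train, H = fit_nnmf(X[train], k=5)             # reduced training data
W_test = project(X[test], H)                     # reduced test data
```

A linear support vector machine would then be trained on `W_train` and evaluated on `W_test`. Passing equal weights to `integrate` gives plain concatenation up to a constant factor, which is how uniform integration would look in this sketch.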