Nora K Speicher, Nico Pfeifer.
Abstract
Personalized treatment of patients based on tissue-specific cancer subtypes has strongly increased the efficacy of the chosen therapies. Even though the amount of data measured for cancer patients has increased over the last years, most cancer subtypes are still diagnosed based on individual data sources (e.g. gene expression data). We propose an unsupervised data integration method based on kernel principal component analysis. Principal component analysis is one of the most widely used techniques in data analysis. Unfortunately, the straightforward multiple kernel extension of this method leads to the use of only one of the input matrices, which does not fit the goal of gaining information from all data sources. Therefore, we present a scoring function to determine the impact of each input matrix. The approach enables visualizing the integrated data and subsequent clustering for cancer subtype identification. Due to the nature of the method, no hyperparameters have to be set. We apply the methodology to five different cancer data sets and demonstrate its advantages in terms of results and usability.
Keywords: Cancer subtyping; Dimensionality reduction; Integrative kernel principal component analysis; Multiple kernel learning; Patient clustering
Year: 2017 PMID: 28688226 PMCID: PMC6042822 DOI: 10.1515/jib-2017-0019
Source DB: PubMed Journal: J Integr Bioinform ISSN: 1613-4516
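The abstract's remark that the straightforward multiple kernel extension ends up using only one input matrix can be illustrated numerically. The sketch below is not the authors' code. It assumes the objective is the variance captured by the first p kernel principal components, that the kernels are combined with non-negative weights summing to one, and that each kernel is scaled to unit trace (a hypothetical normalisation chosen here for comparability). The data are synthetic. Because the sum of the p largest eigenvalues is a convex function of the matrix, the maximum over the weight simplex is attained at a corner, that is, at a single kernel.

```python
import numpy as np

def top_p_variance(K, p):
    """Sum of the p largest eigenvalues of the centered kernel matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    eigvals = np.linalg.eigvalsh(H @ K @ H)    # eigenvalues in ascending order
    return eigvals[-p:].sum()

rng = np.random.default_rng(0)
n, p = 30, 2

# Three toy linear kernels standing in for three data sources (synthetic data).
kernels = []
for d in (5, 8, 12):
    X = rng.normal(size=(n, d))
    K = X @ X.T
    kernels.append(K / np.trace(K))            # assumed unit-trace normalisation

# Evaluate the objective on a grid over the weight simplex (weights sum to 1).
steps = 20
results = []
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        beta = np.array([i, j, steps - i - j]) / steps
        K_mix = sum(b * K for b, K in zip(beta, kernels))
        results.append((top_p_variance(K_mix, p), beta))

best_value, best_beta = max(results, key=lambda r: r[0])
print(best_beta)   # all weight ends up on one kernel
```

This is why a plain variance-maximising weighting corresponds to the "max variance kPCA" baseline rather than to a genuine integration of the data sources.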
Survival analysis of the clustering results of kPCA used with an integrated kernel (gain function kPCA), with the kernel that has the largest variance in the first p dimensions (max variance kPCA), and with the average kernel (average kPCA).
| Cancer type | | Gain function kPCA | Max variance kPCA | Average kPCA |
|---|---|---|---|---|
| BIC | 3 | 0.59 (2) | | |
| COAD | 2 | 3.28E−2 (3) | | |
| GBM | 3 | 0.11 (5) | 0.11 (5) | 1.59E−2 (4) |
| KRCCC | 3 | 1.37E−2 (14) | 2.27E−2 (14) | 0.17 (8) |
| LSCC | 4 | | | |
The number of clusters determined by the silhouette value is given in brackets. Bold p-values refer to significant results with respect to the threshold α = 0.01.
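As a rough illustration of how a number of clusters can be chosen by the silhouette value after an average kPCA projection, the following sketch uses NumPy only. It is a minimal example under stated assumptions, not the authors' pipeline: the two "views" are synthetic, the kernels are linear, k-means is a plain Lloyd's algorithm with a farthest-point start, and the helper names `kernel_pca`, `kmeans`, and `mean_silhouette` are hypothetical.

```python
import numpy as np

def kernel_pca(K, p):
    """Project samples onto the first p kernel principal components of K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(H @ K @ H)           # ascending eigenvalues
    vals, vecs = vals[::-1][:p], vecs[:, ::-1][:, :p]
    return vecs * np.sqrt(np.maximum(vals, 0.0))     # n x p embedding

def kmeans(X, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = np.argmin(dist, axis=1)
        centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
    return labels

def mean_silhouette(X, labels):
    """Average silhouette value over all samples (Euclidean distance)."""
    D = np.sqrt(((X[:, None, :] - X[None]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)                       # convention for singletons
            continue
        a = D[i, same].sum() / (same.sum() - 1)      # mean distance to own cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Synthetic example: 3 groups of 20 samples, measured in two views.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], 20)
views = []
for dim in (4, 6):
    means = 6.0 * np.eye(3, dim)                     # one separated mean per group
    views.append(means[groups] + rng.normal(size=(60, dim)))

K_avg = sum(X @ X.T for X in views) / len(views)     # average of linear kernels
Z = kernel_pca(K_avg, p=2)
scores = {k: mean_silhouette(Z, kmeans(Z, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The bracketed cluster counts in the table play the role of `best_k` here; the survival p-values themselves would come from a separate test on patient outcome data, which this sketch does not cover.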