| Literature DB >> 31500324 |
Nicolas Sompairac, Petr V Nazarov, Urszula Czerwinska, Laura Cantini, Anne Biton, Askhat Molkenov, Zhaxybay Zhumadilov, Emmanuel Barillot, Francois Radvanyi, Alexander Gorban, Ulykbek Kairov, Andrei Zinovyev.
Abstract
Independent component analysis (ICA) is a matrix factorization approach where the signals captured by each individual matrix factor are optimized to become as mutually independent as possible. Initially suggested for solving blind source separation problems in various fields, ICA was shown to be successful in analyzing functional magnetic resonance imaging (fMRI) and other types of biomedical data. In the last twenty years, ICA has become a part of the standard machine learning toolbox, together with other matrix factorization methods such as principal component analysis (PCA) and non-negative matrix factorization (NMF). Here, we review a number of recent works where ICA was shown to be a useful tool for unraveling the complexity of cancer biology from the analysis of different types of omics data, mainly collected for tumoral samples. Such works highlight the use of ICA in dimensionality reduction, deconvolution, data pre-processing, meta-analysis, and other tasks, applied to different data types (transcriptome, methylome, proteome, single-cell data). We particularly focus on the technical aspects of ICA application in omics studies, such as using different protocols, determining the optimal number of components, assessing and improving the reproducibility of ICA results, and comparing ICA with other popular matrix factorization techniques. We discuss the emerging ICA applications to the integrative analysis of multi-level omics datasets and introduce a conceptual view of ICA as a tool for defining functional subsystems of a complex biological system and their interactions under various conditions. Our review is accompanied by a Jupyter notebook which illustrates the discussed concepts and provides a practical tool for applying ICA to the analysis of cancer omics datasets.
Keywords: cancer; data analysis; data integration; dimension reduction; independent component analysis; omics data
Year: 2019 PMID: 31500324 PMCID: PMC6771121 DOI: 10.3390/ijms20184414
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Independent component analysis (ICA) is a standard tool for reducing the complexity of omics datasets in cancer biology. (a) ICA belongs to the family of matrix factorization methods, approximating a 2D matrix by a product of two much smaller matrices, containing metagenes and metasamples, in the case of omics data. (b) ICA can be considered as a rotation of PCA axes, after data “whitening” (i.e., orienting the Gaussian ellipsoid along the coordinate axes and scaling them to unit variance). (c) The major types of applications of ICA in cancer biology. (d) The number of publications in PubMed mentioning ICA and the number of publications simultaneously mentioning ICA and “tumor” or “cancer”.
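The "whitening followed by rotation" view in panel (b) can be illustrated with a minimal NumPy sketch. It is not the protocol used in the review: the synthetic uniform sources, the mixing matrix `A`, and the brute-force angle scan that maximizes absolute excess kurtosis are all assumptions chosen for a two-dimensional toy case, and the function names are hypothetical.

```python
import numpy as np

def whiten(X):
    """Center X (samples x features), rotate onto its PCA axes, scale to unit variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return (Xc @ eigvecs) / np.sqrt(eigvals)

def excess_kurtosis(y):
    """Simple non-Gaussianity measure: close to zero for a Gaussian signal."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

def rotation(theta):
    """2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def most_non_gaussian_rotation(Z, n_angles=180):
    """Scan rotations of whitened 2D data; keep the one maximizing total |excess kurtosis|."""
    angles = np.linspace(0.0, np.pi / 2, n_angles, endpoint=False)

    def score(theta):
        Y = Z @ rotation(theta).T
        return abs(excess_kurtosis(Y[:, 0])) + abs(excess_kurtosis(Y[:, 1]))

    return rotation(max(angles, key=score))

# Toy data (assumption): two independent uniform sources, linearly mixed.
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(5000, 2))   # hidden independent sources
A = np.array([[2.0, 1.0], [1.0, 1.5]])       # hypothetical mixing matrix
X = S @ A.T                                  # observed mixtures

Z = whiten(X)                                # covariance of Z is the identity
Y = Z @ most_non_gaussian_rotation(Z).T      # rotated axes = estimated components

# Absolute correlation between estimated components (rows) and true sources (columns).
match = np.abs(np.corrcoef(Y.T, S.T)[:2, 2:])
```

After whitening, only a rotation remains to be found, which is why a one-parameter scan suffices in two dimensions. Practical ICA implementations optimize a comparable non-Gaussianity criterion iteratively in many dimensions, and the recovered components are defined only up to sign and order.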
Figure 2. Features of ICA applied to a synthetic dataset (a) and two real-life datasets (b,c): the breast cancer transcriptomic datasets of The Cancer Genome Atlas (TCGA) and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). (a) ICA is able to disentangle (or deconvolute) two intersecting Gaussian distributions with coinciding means whose principal axes form an acute angle; (b) ICA decomposition of order 100 of the TCGA and METABRIC datasets. Each component, represented as a metagene, was correlated to either immune infiltration-related or proliferation-related meta-metagenes derived from Reference [33]. This analysis shows that only one of the components was strongly correlated to the cell cycle, while several can be associated with the immune infiltration-related ICA-derived signature (this probably reflects the ability of ICA to deconvolute the major immune cell types in an unsupervised manner; see Reference [42]); (c) correlation matrix between the metagenes of independent components extracted from the TCGA and METABRIC datasets separately. It shows that, for some components computed for different datasets, there exists a strong and unique association between them, indicating the high reproducibility of the ICA results (e.g., see Reference [38]).
Figure 3. Interpretation of ICA components using histopathology imaging of bladder tumor cross-sections. Each metasample produced by ICA defined a ranking, which was used to sort the images. Visual inspection reveals a clear trend in the images towards an increase in certain elements (presence of smooth muscle cells, myofibroblasts (cancer-associated fibroblasts), dividing cells). Two example images per component, selected from the top and the bottom of the rankings, are shown here. Green rhombuses designate normal samples. Black circles designate cells of interest: muscle cell (left), myofibroblast (middle), cells in mitosis (right). The figure is reproduced from the Supplementary Materials of Reference [33] with permission.
Figure 4. Use of ICA components in meta-analysis of multiple omics datasets. (a) Pairwise comparison of two sets of ICA metagenes yields an asymmetric correlation matrix (same as in Figure 2c), which can be converted to a graph by applying a threshold and selecting the maximal correlations. If two components are maximally correlated with each other, then such a correlation defines a reciprocal best hit (RBH). (b) Graph of maximal correlations (reciprocal and not) exceeding a certain threshold among components computed for 22 cancer transcriptomic datasets. Each node is a component, and an edge denotes a correlation. Color reflects the cancer type (e.g., red is bladder cancer). Communities in this graph define highly reproducible cancer type-specific and universal latent factors. The figure is reproduced with permission from Reference [33].
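The reciprocal best hit rule in panel (a) can be sketched as follows. This is an illustrative reading of the caption, not the code behind the figure: the function name, the threshold value of 0.3, and the toy metagenes are assumptions. Absolute correlations are used because the sign of an ICA component is arbitrary.

```python
import numpy as np

def reciprocal_best_hits(M1, M2, threshold=0.3):
    """Find reciprocal best hits between two sets of metagenes.

    M1: (k1, n_genes) and M2: (k2, n_genes), rows are components defined over
    the same genes in the same order. Returns a list of (i, j, |r|) where
    component i of M1 and component j of M2 are each other's best match and
    the absolute correlation is at least `threshold` (hypothetical default).
    """
    k1 = M1.shape[0]
    # Rectangular block of the joint correlation matrix: M1 rows vs M2 rows.
    C = np.abs(np.corrcoef(M1, M2)[:k1, k1:])
    best_in_2 = C.argmax(axis=1)   # for each M1 component, its best M2 partner
    best_in_1 = C.argmax(axis=0)   # for each M2 component, its best M1 partner
    return [
        (i, int(j), float(C[i, j]))
        for i, j in enumerate(best_in_2)
        if best_in_1[j] == i and C[i, j] >= threshold
    ]

# Toy metagenes (assumption): three components shared between the two sets,
# one of them with flipped sign, plus one unrelated component in each set.
rng = np.random.default_rng(0)
M1 = rng.normal(size=(4, 200))
M2 = np.vstack([-M1[2], M1[0], M1[3], rng.normal(size=200)])
M2 = M2 + 0.3 * rng.normal(size=M2.shape)   # dataset-specific noise

hits = reciprocal_best_hits(M1, M2)
pairs = {(i, j) for i, j, _ in hits}
```

Each returned pair would become an edge in a graph such as the one in panel (b). Keeping non-reciprocal maximal correlations as well, as the caption describes, would only require dropping the `best_in_1[j] == i` condition.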
Figure 5. Examples of the utility of ICA for unsupervised deconvolution of cell types. (a) Application of ICA to the Sequencing Quality Control consortium (SEQC) dataset [76], containing measurements of two reference transcriptomic profiles of cell lines and their mixtures at known proportions. The first two ICs identify the types and the effect of the platform. (b) Correlation graph among selected components from ICA applied to six non-redundant breast cancer transcriptomic datasets. Three cliques formed in the graph correspond to major immune cell types. The thickness of the edges reflects the absolute correlation value. The “Immune” meta-metagene was defined in Reference [33] as the one associated with the presence of immune infiltrate in a tumor. This figure was reproduced with permission from Reference [42].
Figure 6. Application of ICA in single-cell data analysis of tumors (study of glioblastoma from Reference [79]). (a) t-distributed stochastic neighbor embedding (t-SNE) visualization of the data reveals a strong batch effect. Grey and red/blue dots represent cells from the same cell line, analyzed in two batches (batch 1: grey dots; batch 2: red and blue dots). The green dots show a cell population from a different cell line, added to the dataset for comparison. (b) t-SNE visualization of the data after eliminating the signal contained in one IC associated with the batch effect. (c) In ICA decompositions of scRNA-Seq data from cancer studies, there usually exist two components associated with phases of the cell cycle (G1/S, DNA replication, and G2/M, mitosis). Here the loadings of two such components are visualized. Black arrows show the regions where the labeled genes are highly expressed. Yellow arrows show the assumed direction of progression through the cell cycle.
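Eliminating the signal of a single component, as in panel (b), amounts to subtracting that component's rank-one contribution from the data matrix. The sketch below is a generic illustration and not the procedure of Reference [79]: it assumes a factorization `X ≈ A @ S` with cells in rows, and the matrix shapes, the Laplace-distributed loadings, and the choice of component 0 as the batch component are all hypothetical.

```python
import numpy as np

def remove_component(X, A, S, idx):
    """Subtract the rank-one contribution of component `idx` from X.

    Assumes X is approximated by A @ S, with
    A: (n_cells, k) per-cell component weights,
    S: (k, n_genes) per-gene component loadings.
    """
    return X - np.outer(A[:, idx], S[idx])

# Toy factorization (assumption): component 0 acts as a batch indicator.
rng = np.random.default_rng(1)
n_cells, n_genes, k = 60, 40, 3
A = rng.normal(size=(n_cells, k))
A[:, 0] = np.repeat([0.0, 1.0], n_cells // 2)   # 0 for batch 1, 1 for batch 2
S = rng.laplace(size=(k, n_genes))              # sparse-ish gene loadings
X = A @ S                                       # noise-free data for clarity

X_clean = remove_component(X, A, S, idx=0)      # data without the batch signal
```

In this noise-free toy case, `X_clean` equals the product of the remaining components exactly. With real data the factorization is only approximate, so the subtraction removes the modeled batch signal while leaving the residual unchanged.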