Chen Meng, Oana A Zeleznik, Gerhard G Thallinger, Bernhard Kuster, Amin M Gholami, Aedín C Culhane.
Abstract
State-of-the-art next-generation sequencing, transcriptomics, proteomics and other high-throughput 'omics' technologies enable the efficient generation of large experimental data sets. These data may yield unprecedented knowledge about molecular pathways in cells and their role in disease. Dimension reduction approaches have been widely used in exploratory analysis of single omics data sets. This review will focus on dimension reduction approaches for simultaneous exploratory analyses of multiple data sets. These methods extract the linear relationships that best explain the correlated structure across data sets, the variability both within and between variables (or observations) and may highlight data issues such as batch effects or outliers. We explore dimension reduction techniques as one of the emerging approaches for data integration, and how these can be applied to increase our understanding of biological systems in normal physiological function and disease.
Keywords: dimension reduction; exploratory data analysis; integrative genomics; multi-assay; multi-omics data integration; multivariate analysis
Year: 2016 PMID: 26969681 PMCID: PMC4945831 DOI: 10.1093/bib/bbv108
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Glossary
| Term | Definition |
|---|---|
| Variance | The variance of a random variable measures the spread (variability) of its realizations (values of the random variable). The variance is never negative. If the variance is small, the values of the random variable are close to the mean of the random variable (the spread of the data is low). A high variance corresponds to widely spread values of the random variable. See [ |
| Standard deviation | The standard deviation of a random variable measures the spread (variability) of its realizations (values of the random variable). It is defined as the square root of the variance. The standard deviation will have the same units as the random variable, in contrast to the variance. See [ |
| Covariance | The covariance is an unstandardized measure of the tendency of two random variables to vary together. See [ |
| Correlation | The correlation of two random variables is defined as the covariance of the two random variables normalized by the product of their standard deviations. It measures the linear relationship between the two random variables. The correlation coefficient ranges between −1 and +1. See [ |
| Inertia | Inertia is a measure for the variability of the data. The inertia of a set of points relative to one point P is defined by the weighted sum of the squared distances between each considered point and the point P. Correspondingly, the inertia of a centered matrix (mean is equal to zero) is simply the sum of the squared matrix elements. The inertia of the matrix |
| Co-inertia | The co-inertia is a global measure for the co-variability of two data sets (for example, two high-dimensional random variables). If the data sets are centered, the co-inertia is the sum of squared covariances. When coupling a pair of data sets, the co-inertia between two matrices, |
| Orthogonal | Two vectors are called orthogonal if they form an angle that measures 90 degrees. Generally, two vectors are orthogonal if their inner product is equal to zero. Two nonzero orthogonal vectors are always linearly independent. See [ |
| Independent | In linear algebra, two vectors are called linearly independent if their linear combination is equal to zero only when all constants of the linear combination are equal to zero. See [ |
| Eigenvector, eigenvalue | An eigenvector of a matrix is a vector that does not change its direction after a linear transformation. The vector |
| Linear combination | Mathematical expression calculated through the multiplication of variables with constants and adding the individual multiplication results. A linear combination of the variables |
| Omics | The study of biological molecules in a comprehensive fashion. Examples of omics data types include genomics, transcriptomics, proteomics, metabolomics and epigenomics [ |
| Dimension reduction | Dimension reduction is the mapping of data to a lower dimensional space such that redundant variance in the data is reduced or discarded, enabling a lower-dimensional representation without significant loss of information. See [ |
| Exploratory data analysis | EDA is the application of statistical techniques that summarize the main characteristics of data, often with visual methods. In contrast to statistical hypothesis testing (confirmatory data analysis), EDA can help to generate hypotheses. See [ |
| Sparse vector | A sparse vector is a vector in which most elements are zero. A sparse loadings matrix in PCA or related methods reduces the number of features contributing to a PC. The variables with nonzero entries (features) are the 'selected features'. See [ |
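
The glossary definitions of variance, standard deviation, covariance, correlation, inertia and co-inertia can be made concrete with a small sketch. The Python code below is an illustration only, not code from the review: the function names and the toy matrices are hypothetical, the sample covariance uses an n − 1 divisor (some packages divide by n), and all observations and variables are given equal weight.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with an n - 1 divisor (an assumption of this sketch)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def variance(xs):
    """Variance is the covariance of a variable with itself."""
    return covariance(xs, xs)

def std_dev(xs):
    """Standard deviation: square root of the variance, in the units of xs."""
    return math.sqrt(variance(xs))

def correlation(xs, ys):
    """Covariance normalized by the product of the standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

def center_columns(matrix):
    """Subtract each column mean; rows are observations, columns are variables."""
    col_means = [mean(col) for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, col_means)] for row in matrix]

def inertia(centered):
    """Inertia of a centered matrix: the sum of its squared elements."""
    return sum(v * v for row in centered for v in row)

def co_inertia(x, y):
    """Sum of squared covariances between each column of x and each column of y."""
    return sum(covariance(cx, cy) ** 2 for cx in zip(*x) for cy in zip(*y))

# Hypothetical toy data: the same 4 observations measured in two data sets.
x = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [[2.0, 1.0], [4.0, 0.0], [6.0, -1.0], [8.0, -2.0]]

x1, x2 = zip(*x)   # columns (variables) of x
y1, y2 = zip(*y)   # columns (variables) of y

print(variance(x1), std_dev(x1))
print(correlation(x1, x2), correlation(x1, y2))
print(inertia(center_columns(x)), co_inertia(x, y))
```

In this toy example the second column of `x` is twice the first, so their correlation is +1, while the second column of `y` decreases as the first column of `x` increases, giving a correlation of −1.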
Dimension reduction methods for one data set
| Method | Description | Name of R function {R package} |
|---|---|---|
| PCA | Principal component analysis | prcomp{stats}, princomp{stats}, dudi.pca{ade4}, pca{vegan}, PCA{FactoMineR}, principal{psych} |
| CA, COA | Correspondence analysis | ca{ca}, CA{FactoMineR}, dudi.coa{ade4} |
| NSC | Nonsymmetric correspondence analysis | dudi.nsc{ade4} |
| PCoA, MDS | Principal co-ordinate analysis/multidimensional scaling | cmdscale{stats}, dudi.pco{ade4}, pcoa{ape} |
| NMF | Nonnegative matrix factorization | nmf{NMF} |
| nmMDS | Nonmetric multidimensional scaling | metaMDS{vegan} |
| sPCA, nsPCA, pPCA | Sparse PCA, nonnegative sparse PCA, penalized PCA. (PCA with feature selection) | SPC{PMA}, spca{mixOmics}, nsprcomp{nsprcomp}, PMD{PMA} |
| NIPALS PCA | Nonlinear iterative partial least squares analysis (PCA on data with missing values) | nipals{ade4}, pca{pcaMethods} |
| pPCA, bPCA | Probabilistic PCA, Bayesian PCA | pca{pcaMethods} |
| MCA | Multiple correspondence analysis | dudi.acm{ade4}, mca{MASS} |
| ICA | Independent component analysis | fastICA{fastICA} |
| sIPCA | Sparse independent PCA (combines sPCA and ICA) | sipca{mixOmics}, ipca{mixOmics} |
| plots | Graphical resources | R packages including scatterplot3d, ggord |
a Available in Bioconductor.
b On GitHub: devtools::install_github("fawda123/ggord").
c On GitHub: devtools::install_github("ggbiplot", "vqv").
d On GitHub: devtools::install_github("ropensci/plotly").
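
To illustrate what the PCA entries in the table above compute, the following Python sketch extracts the first principal component of a centered and scaled data matrix. It is a minimal illustration under stated assumptions, not the implementation used by any of the listed R functions: it uses power iteration on the covariance matrix (R's prcomp, for example, uses a singular value decomposition instead), the function names and toy data are hypothetical, and the fixed starting vector would fail in the unlikely case that it is orthogonal to the leading component.

```python
import math

def center_and_scale(matrix):
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    n = len(matrix)
    cols = []
    for col in zip(*matrix):
        m = sum(col) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        cols.append([(v - m) / sd for v in col])
    return [list(row) for row in zip(*cols)]

def covariance_matrix(z):
    """Covariance matrix of an already centered matrix (rows are observations)."""
    n, p = len(z), len(z[0])
    return [[sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]

def first_principal_component(cov, iterations=500):
    """Power iteration: returns (eigenvalue, unit loading vector) of the leading PC."""
    p = len(cov)
    v = [float(a + 1) for a in range(p)]  # arbitrary fixed starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the variance captured along v.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    return eigenvalue, v

# Hypothetical toy data: 5 observations, 3 variables. Variable 2 is a linear
# function of variable 1; variable 3 is uncorrelated with both.
data = [[-2.0, -1.0, 2.0],
        [-1.0, 2.0, -1.0],
        [0.0, 5.0, -2.0],
        [1.0, 8.0, -1.0],
        [2.0, 11.0, 2.0]]

z = center_and_scale(data)
cov = covariance_matrix(z)
eigenvalue, loadings = first_principal_component(cov)

# Proportion of total variance explained by PC1 (one bar of a variance barplot).
explained = eigenvalue / sum(cov[a][a] for a in range(len(cov)))

# Scores: coordinates of each observation on PC1.
scores = [sum(zi * li for zi, li in zip(row, loadings)) for row in z]

print(loadings, explained, scores)
```

Because the first two variables carry the same information after scaling, the leading component weights them equally and ignores the third, which is the kind of redundancy that dimension reduction is meant to remove.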
Dimension reduction methods for pairs of data sets
| Method | Description | Feature selection | R Function {package} |
|---|---|---|---|
| CCA | Canonical correlation analysis. Limited to | No | cc{CCA}, CCorA{vegan}, cancor{stats} |
| CCA | Canonical correspondence analysis is a constrained correspondence analysis, which is popular in ecology | No | cca{ade4}, cca{vegan} |
| RDA | Redundancy analysis is a constrained PCA. Popular in ecology | No | rda{vegan} |
| Procrustes | Procrustes rotation rotates a matrix to maximum similarity with a target matrix by minimizing the sum of squared differences | No | procrustes{vegan}, procuste{ade4} |
| rCCA | Regularized canonical correlation | No | rcc{CCA} |
| sCCA | Sparse CCA | Yes | CCA{PMA} |
| pCCA | Penalized CCA | Yes | spCCA{spCCA} (supervised version) |
| WAPLS | Weighted averaging PLS regression | No | WAPLS{rioja}, wapls{paltran} |
| PLS | Partial least squares of K-tables (multi-block PLS) | No | mbpls{ade4}, plsda{caret} |
| sPLS, pPLS | Sparse PLS, penalized PLS | Yes | spls{spls}, spls{mixOmics}, ppls{ppls} |
| sPLS-DA | Sparse PLS-discriminant analysis | Yes | splsda{mixOmics}, splsda{caret} |
| cPCA | Consensus PCA | No | cpca{mogsa} |
| CIA | Coinertia analysis | No | coinertia{ade4}, cia{made4} |
a A source of confusion: CCA is widely used as an acronym for both canonical 'correspondence' analysis and canonical 'correlation' analysis. Throughout this article we use CCA for canonical 'correlation' analysis. Both methods search for the multivariate relationships between two data sets. Canonical 'correspondence' analysis is an extension and constrained form of 'correspondence' analysis [22]. Both canonical 'correlation' analysis and RDA assume a linear model; however, RDA is a constrained PCA (and assumes that one matrix is the dependent variable and the other is independent), whereas canonical correlation analysis treats both matrices equally. See [23] for more explanation.
Dimension reduction methods for multiple (more than two) data sets
| Method | Description | Feature selection | Matched cases | R Function {package} |
|---|---|---|---|---|
| MCIA | Multiple coinertia analysis | No | No | mcia{omicade4}, mcoa{ade4} |
| gCCA | Generalized CCA | No | No | regCCA{dmt} |
| rGCCA | Regularized generalized CCA | No | No | regCCA{dmt}, rgcca{RGCCA}, wrapper.rgcca{mixOmics} |
| sGCCA | Sparse generalized canonical correlation analysis | Yes | No | sgcca{RGCCA}, wrapper.sgcca{mixOmics} |
| STATIS | Structuration des Tableaux à Trois Indices de la Statistique (STATIS). Family of methods that includes X-statis | No | No | statis{ade4} |
| CANDECOMP/PARAFAC/Tucker3 | Higher-order generalizations of SVD and PCA. Require matched variables and cases. | No | Yes | CP{ThreeWay}, T3{ThreeWay}, PCAn{PTAk}, CANDPARA{PTAk} |
| PTA | Partial triadic analysis | No | Yes | pta{ade4} |
| statico | Statis and CIA (find structure between two pairs of K-tables) | No | No | statico{ade4} |
Figure 1. Results of a PCA of mRNA gene expression data of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines from the NCI-60 cell line panel. All variables were centered and scaled. Results show (A) a biplot where observations (cell lines) are points and gene expression profiles are arrows; (B) a heatmap showing the gene expression of the same 20 genes in the cell lines; the red-to-blue scale represents high to low gene expression (light to dark gray represents high to low gene expression in the black-and-white figure); (C) a correlation circle; (D) a variance barplot of the first ten PCs. To improve the readability of the biplot, some labels of the variables (genes) in (A) have been moved slightly. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org.
Figure 2. MCIA of mRNA, miRNA and proteomics profiles of melanoma (ME), leukemia (LE) and central nervous system (CNS) cell lines. (A) shows a plot of the first two components in sample space (sample 'type' is coded by the point shape: circles for mRNAs, triangles for proteins and squares for miRNAs). Each sample (cell line) is represented by a “star”, where the three omics data for each cell line are connected by lines to a center point, which is the global score (F) for that cell line; the shorter the line, the higher the level of concordance between the data types and the global structure. (B) shows the variable space of MCIA. A variable that is highly expressed in a cell line will be projected with a high weight (far from the origin) in the direction of that cell line. Some miRNAs with a large distance from the origin are labeled, as these miRNAs are strongly associated with cancer tissue of origin. (C) shows the correlation coefficients of the proteome profile of SR with those of the other cell lines. The proteome profile of the SR cell line is more strongly correlated with those of the melanoma cell lines, so there may be a technical issue with the LE.SR proteomics data. (D) shows a scree plot of the eigenvalues and (E) a plot of the data weighting space. A colour version of this figure is available online at BIB online: http://bib.oxfordjournals.org.