Ching Siang Tan, Wai Soon Ting, Mohd Saberi Mohamad, Weng Howe Chan, Safaai Deris, Zuraini Ali Shah.
Abstract
When gene expression data are too large to be processed directly, they are transformed into a reduced representation set of genes. This transformation of large-scale gene expression data into a smaller set of genes is called feature extraction. If the extracted genes are carefully chosen, the gene set captures the relevant information from the large-scale data, so that further analysis can use the reduced representation instead of the full-size data. In this paper, we review numerous software applications that can be used for feature extraction. The software reviewed is mainly for Principal Component Analysis (PCA), Independent Component Analysis (ICA), Partial Least Squares (PLS), and Local Linear Embedding (LLE). A summary and sources of the software are provided in the last section for each feature extraction method.
Year: 2014 PMID: 25250315 PMCID: PMC4164313 DOI: 10.1155/2014/213656
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1: Plot of genes.
A summary for PCA software.
| Number | Software | Author/year | Language | Features |
|---|---|---|---|---|
| 1 | FactoMineR | Lê et al. | R | (i) Various dimension reduction methods such as PCA, CA, and MCA |
| 2 | ExPosition | Beaton et al. | R | (i) Numerous multivariate analysis methods such as PCA and Generalized Principal Component Analysis (GPCA) |
| 3 | amap | Lucas | R | (i) Different types of PCA are provided: PCA, Generalized PCA, and Robust PCA |
| 4 | ADE-4 | Thioulouse et al. | R | A variety of methods such as PCA, CA, Principal Analysis Regression, PLS, and others are offered |
| 5 | MADE4 | Culhane et al. | R | (i) Functions provided by ADE-4 |
| 6 | XLMiner | Witten and Frank | Implemented in Excel | (i) Provision of data reduction methods such as PCA |
| 7 | ViSta | Young et al. | C++, Fortran, XLisp, and ViDAL | (i) Multivariate analysis methods are offered such as PCA, Interactive Cluster Analysis, and Parallel Boxplots |
| 8 | imDEV | Grapov and Newman | Visual Basic and R | (i) Data preprocessing: missing values imputation and data transformations |
| 9 | Statistics Toolbox | The MathWorks | MATLAB | (i) Multivariate statistics such as PCA, clustering, and others |
| 10 | Weka | Hall et al. | Java | A variety of machine learning algorithms are provided such as feature selection, data preprocessing, regression, dimension reduction, classification, and clustering methods |
| 11 | NAG Library | NAG Toolbox for MATLAB | Fortran and C | (i) Provision of more than 1700 mathematical and statistical algorithms |
Sources of PCA software.
| Number | Software | Sources |
|---|---|---|
| 1 | FactoMineR | |
| 2 | ExPosition | |
| 3 | amap | |
| 4 | ADE-4 | |
| 5 | MADE4 | |
| 6 | XLMiner | |
| 7 | ViSta | |
| 8 | imDEV | |
| 9 | Statistics Toolbox | |
| 10 | Weka | |
| 11 | NAG Library | |
Related work.
| Software | Author | Motivation | Advantage |
|---|---|---|---|
| FactoMineR | Lê et al. (2009) | (i) Providing a multivariate data analytic technique for applications in biological systems | (i) It provides a geometrical point of view and a lot of graphical outputs |
| MADE4 | Culhane et al. | To provide a simple-to-use tool for multivariate analysis of microarray data | (i) Accepts a wide variety of gene-expression data input formats |
| Statistics Toolbox | The MathWorks | High-dimensional and complex microarray data need automatic/computer-aided tools for analysis | Elegant matrix support; visualization |
| imDEV | Grapov and Newman (2012) | Omics experiments generate complex high-dimensional data requiring multivariate analyses | (i) User-friendly graphical interface |
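To make concrete what the PCA packages above compute, the following is a minimal NumPy sketch of PCA by singular value decomposition. It is not the implementation of any package listed in the tables; the function name `pca_scores`, the toy expression matrix, and all sizes are assumptions chosen only for illustration.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project samples (rows of X) onto the leading principal components."""
    Xc = X - X.mean(axis=0)                      # center every gene (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates
    explained = (s ** 2) / np.sum(s ** 2)             # variance ratio per component
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical toy data: 20 samples x 500 genes driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 2))
loadings = rng.normal(size=(2, 500))
X = latent @ loadings + 0.05 * rng.normal(size=(20, 500))

scores, components, explained = pca_scores(X, n_components=2)
print(scores.shape)        # reduced representation: one row per sample
print(explained.sum())     # share of total variance kept by 2 components
```

Each sample is then described by two scores instead of 500 expression values, which is the reduced representation the abstract refers to.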
Figure 2: Correlation-based graph.
Summary of ICA software.
| Number | Software | Author/year | Language | Features |
|---|---|---|---|---|
| 1 | FastICA | Marchini et al. | R and MATLAB | Provides an ICA algorithm for carrying out the analysis |
| 2 | JADE | Nordhausen et al. | R | (i) JADE algorithm is provided for ICA |
| 3 | HiPerSAT | Keith et al. | C++, MATLAB, and EEGLAB | (i) Integration of FastICA, Infomax, and SOBI algorithms |
| 4 | MineICA | Biton et al. | R | (i) Storage and visualization of ICA results |
| 5 | Pearson ICA | Karvanen | R | Extraction of the independent components using the minimization of mutual information from the Pearson system |
| 6 | Maximum Likelihood ICA | Teschendorff | R | Implementation of ICA using maximum likelihood and a fixed-point algorithm |
Sources of ICA software.
| Number | Software | Sources |
|---|---|---|
| 1 | FastICA | R: |
| | | MATLAB: |
| 2 | JADE | |
| 3 | HiPerSAT | |
| 4 | MineICA | |
| 5 | Pearson ICA | |
| 6 | Maximum Likelihood ICA | |
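As a rough illustration of what an ICA tool does, the sketch below implements a symmetric fixed-point iteration in the style of FastICA with a tanh nonlinearity, using NumPy only. It is a simplified stand-in, not the code of the FastICA, JADE, or other packages listed above; the function names, the two synthetic sources, and the mixing matrix are all assumptions made for the example.

```python
import numpy as np

def _sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W so that the rows of W are orthonormal."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W

def fastica_sketch(X, n_iter=200, tol=1e-6, seed=0):
    """Estimate independent components from mixed signals (rows = samples)."""
    n, m = X.shape
    Xc = X - X.mean(axis=0)
    # Whitening: rotate and rescale so the covariance becomes the identity.
    d, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
    Z = Xc @ (E / np.sqrt(d))
    rng = np.random.default_rng(seed)
    W = _sym_decorrelate(rng.normal(size=(m, m)))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)                       # nonlinearity g(w^T z)
        g_prime = (1.0 - G ** 2).mean(axis=0)      # mean of g'(w^T z)
        W_new = _sym_decorrelate(G.T @ Z / n - g_prime[:, None] * W)
        change = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0))
        W = W_new
        if change < tol:
            break
    return Z @ W.T                                 # estimated sources

# Hypothetical example: two non-Gaussian sources mixed linearly.
rng = np.random.default_rng(0)
n = 2000
t = np.linspace(0, 8, n)
S = np.column_stack([np.sign(np.sin(3 * t)), rng.laplace(size=n)])
A = np.array([[1.0, 0.5], [0.4, 1.0]])             # assumed mixing matrix
X = S @ A.T

S_est = fastica_sketch(X)
# ICA recovers sources only up to order, sign, and scale, so compare by |correlation|.
match = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:]).max(axis=1)
print(match)
```

Because ICA cannot determine the order, sign, or scale of the components, the comparison uses absolute correlations between the true and estimated sources.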
Figure 3: (a, b, and c) Heatmaps showing the original and corrected expression levels for the first 1000 genes in the Golub data. (a) Heatmap for the first 1000 genes in the original Golub expression data. (b) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by use of the R package ber. (c) Heatmap for the first 1000 genes in the adjusted Golub expression data obtained by the use of our R package svapls.
A summary of PLS software.
| Number | Software | Author/year | Language | Features |
|---|---|---|---|---|
| 1 | PLS Discriminant Analysis | Barker and Rayens | C/C++, Visual Basic | PLS for discriminant analysis |
| 2 | Least Squares–PLS | Jørgensen et al. | R | Implementation combining PLS and ordinary least squares |
| 3 | Powered PLS Discriminant Analysis | Liland and Indahl | R | Extraction of information for multivariate classification problems |
| 4 | Penalized PLS | Kr | R | Extension of PLS regression using penalization technique |
| 5 | SlimPLS | Gutkin et al. | R | Multivariate feature extraction method which incorporates feature dependencies |
| 6 | Sparse PLS Discriminant Analysis, Sparse Generalized PLS | Chung and Keles | R | Sparse version techniques employing feature extraction and dimension reduction simultaneously |
| 7 | PLS Degrees of Freedom | Kramer and Sugiyama | R | Using an unbiased estimation of the degrees of freedom for PLS regression |
| 8 | Surrogate Variable Analysis PLS | Chakraborty and Datta | R | Extraction of the informative features with hidden confounders which are unaccounted for |
| 9 | PLS Path Modelling | Sanchez and Trinchera | R | A multivariate feature extraction analysis technique based on the cause-effect relationships of the unobserved and observed features |
| 10 | PLS Regression for Generalized Linear Models | Bertrand et al. (2013) | R | PLS regression is used to extract the predictive features from the generalized linear models |
Sources of PLS software.
| Number | Software | Sources |
|---|---|---|
| 1 | PLS Discriminant Analysis | |
| 2 | Least Squares–PLS | |
| 3 | Powered PLS Discriminant Analysis | |
| 4 | Penalized PLS | |
| 5 | SlimPLS | |
| 6 | Sparse PLS Discriminant Analysis, Sparse Generalized PLS | |
| 7 | PLS Degrees of Freedom | |
| 8 | Surrogate Variable Analysis PLS | |
| 9 | PLS Path Modelling | |
| 10 | PLS Regression for Generalized Linear Models | |
Related work.
| Software | Author | Motivation | Advantage |
|---|---|---|---|
| plsRglm (R package) | Bertrand et al. (2010) | (i) To deal with incomplete datasets using cross-validation | (i) Provides formula support |
| SVA-PLS | Chakraborty and Datta | (i) To identify the genes that are differentially expressed between the samples from two different tissue types | (i) Relatively better at discovering a higher proportion of the truly significant genes |
| SlimPLS | Gutkin et al. | To obtain a low dimensional approximation of a matrix that is “as close as possible” to a given vector | (i) Focuses solely on feature selection |
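The idea shared by the PLS tools above, extracting a few latent components of the expression matrix that are guided by a response, can be sketched with the classical NIPALS procedure for a single response (PLS1). This is a generic textbook-style sketch in NumPy, not the algorithm of any specific package in the tables; the function name `pls1_scores` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def pls1_scores(X, y, n_components):
    """Extract PLS score vectors for a single response via NIPALS deflation."""
    Xk = X - X.mean(axis=0)          # centered predictors (genes)
    yk = y - y.mean()                # centered response
    T = np.zeros((X.shape[0], n_components))
    for k in range(n_components):
        w = Xk.T @ yk                # weight: covariance of each gene with y
        w /= np.linalg.norm(w)
        t = Xk @ w                   # score vector (the extracted feature)
        tt = t @ t
        p = Xk.T @ t / tt            # X loading
        q = yk @ t / tt              # y loading
        Xk = Xk - np.outer(t, p)     # deflate X so the next score is orthogonal
        yk = yk - q * t              # deflate y
        T[:, k] = t
    return T

# Hypothetical toy data: 30 samples x 200 genes, response driven by 5 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=30)

T = pls1_scores(X, y, n_components=2)
r = np.corrcoef(T[:, 0], y)[0, 1]    # how strongly the first score tracks y
print(T.shape, r)
```

Unlike PCA, the weights here depend on the response `y`, so the extracted components are chosen for their covariance with the outcome and not only for the variance they explain in the expression data.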
Figure 4: Plot of dimension versus residual variance.
Figure 5: Two-dimensional embedding of the Golub et al. [69] leukemia dataset (top: Isomap; bottom: LLE).
A summary of LLE software.
| Number | Software | Author/year | Language | Features |
|---|---|---|---|---|
| 1 | lle | Diedrich and Abel | R | (i) LLE algorithm is provided for transforming high-dimensional data into low-dimensional data |
| 2 | RDRToolbox | Bartenhagen | R | (i) LLE and Isomap for feature extraction |
| 3 | Scikit-learn | Pedregosa et al. | Python | (i) Classification, manifold learning, feature extraction, clustering, and other methods are offered |
Sources of LLE software.
| Number | Software | Sources |
|---|---|---|
| 1 | lle | |
| 2 | RDRToolbox | |
| 3 | Scikit-learn | |
Related work.
| Software | Author | Motivation | Advantage |
|---|---|---|---|
| RDRToolbox | Bartenhagen | (i) To reduce high dimensionality microarray data | (i) Combines information from all features |
| Scikit-learn | Pedregosa et al. | To calculate activity index parameters through clustering | (i) Easy-to-use interface |
| lle | Diedrich and Abel | Currently available data dimension reduction methods are either supervised, where data need to be labeled, or computationally complex | (i) Fast |
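Since Scikit-learn is one of the listed LLE tools, a short usage sketch may help show how such an embedding is obtained in practice. The example below uses `sklearn.manifold.LocallyLinearEmbedding`; the synthetic data (a curved two-dimensional sheet observed through 50 simulated "genes"), the neighbourhood size, and the target dimension are assumptions chosen for illustration and are not taken from the reviewed paper.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Hypothetical toy data: 60 samples near a curved 2-D sheet, seen in 50 dimensions.
rng = np.random.default_rng(0)
u = rng.uniform(0, 3, size=60)
v = rng.uniform(0, 1, size=60)
sheet = np.column_stack([u * np.cos(u), v, u * np.sin(u)])
X = sheet @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(60, 50))

# Each sample is reconstructed from its 10 nearest neighbours, and the
# low-dimensional coordinates are chosen to preserve those local weights.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Y = lle.fit_transform(X)

print(Y.shape)                      # one 2-D coordinate pair per sample
print(lle.reconstruction_error_)    # error of the fitted embedding
```

The number of neighbours is the main tuning choice: too few can fragment the neighbourhood graph, while too many can smooth over the curvature that LLE is meant to preserve.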