Sungin Park, Soo-Yong Shin, Kyu-Baek Hwang.
Abstract
BACKGROUND: Multidimensional scaling (MDS) is a widely used approach to dimensionality reduction. It has been applied to feature selection and visualization in various areas. Among diverse MDS methods, the classical MDS is a simple and theoretically sound solution for projecting data objects onto a low-dimensional space while preserving the original distances among them as much as possible. However, it is not trivial to apply it to genome-scale data (e.g., microarray gene expression profiles) on regular desktop computers, because of its high computational complexity.
Year: 2012 PMID: 23282007 PMCID: PMC3521231 DOI: 10.1186/1471-2105-13-S17-S23
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Process of divide-and-conquer mode. First, a dissimilarity matrix is randomly decomposed into p submatrices along the diagonal, D1, ..., Dp. Second, s objects are sampled from each of the submatrices. Then, the sampled objects are merged to construct a new dissimilarity submatrix M. The one-shot MDS method is applied to D1, ..., Dp as well as M. The resulting coordinates are dMDS1, ..., dMDSp and mMDS, respectively. After that, the objects sampled from each of D1, ..., Dp are extracted from the resulting coordinate matrices, comprising subdMDS1, ..., subdMDSp as well as mMDS1, ..., mMDSp. For each pair, subdMDSi and mMDSi (i = 1, 2, ..., p), a linear transformation matrix Ai is obtained by minimizing ||Ai subdMDSi - mMDSi||, where || · || denotes the L2 norm. The linearly transformed objects newdMDSi on a reduced dimension are obtained by Ai dMDSi. Finally, newdMDS1, ..., newdMDSp are combined to produce the MDS result for the entire set of objects.
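The alignment step in the Figure 1 caption (finding a linear transformation that maps a block's sampled coordinates onto the coordinates of the same objects in the merged run) can be illustrated with a small least-squares fit. The following is a minimal sketch, not the CFMDS implementation: it assumes a 2-D target space, stores each object as an `(x, y)` tuple, and solves the normal equations directly in pure Python. The function names `fit_linear_map` and `apply_map` are hypothetical.

```python
def fit_linear_map(sub_coords, merged_coords):
    """Return the 2x2 matrix A minimizing sum_j ||A x_j - y_j||^2.

    sub_coords:    2-D coordinates of the sampled objects from one block
                   (the role of subdMDS_i in the caption)
    merged_coords: 2-D coordinates of the same objects in the merged run
                   (the role of mMDS_i in the caption)
    Assumption: the target dimension is 2; this is an illustrative sketch.
    """
    g00 = g01 = g11 = 0.0            # entries of G = sum_j x_j x_j^T (symmetric)
    c = [[0.0, 0.0], [0.0, 0.0]]     # C = sum_j y_j x_j^T
    for (x0, x1), (y0, y1) in zip(sub_coords, merged_coords):
        g00 += x0 * x0
        g01 += x0 * x1
        g11 += x1 * x1
        c[0][0] += y0 * x0
        c[0][1] += y0 * x1
        c[1][0] += y1 * x0
        c[1][1] += y1 * x1
    det = g00 * g11 - g01 * g01
    if abs(det) < 1e-12:
        raise ValueError("sampled points do not span the plane")
    # Closed-form inverse of the symmetric 2x2 matrix G.
    g_inv = [[g11 / det, -g01 / det], [-g01 / det, g00 / det]]
    # Normal equations: A G = C, hence A = C G^{-1}.
    return [[sum(c[r][k] * g_inv[k][col] for k in range(2)) for col in range(2)]
            for r in range(2)]


def apply_map(a, coords):
    """Apply the 2x2 matrix `a` to every (x, y) point in `coords`."""
    return [(a[0][0] * x0 + a[0][1] * x1, a[1][0] * x0 + a[1][1] * x1)
            for x0, x1 in coords]


# Toy example: the merged run sees the block rotated by 90 degrees and scaled by 2.
true_map = [[0.0, -2.0], [2.0, 0.0]]
block_samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 2.0)]
merged_samples = apply_map(true_map, block_samples)
fitted = fit_linear_map(block_samples, merged_samples)
# Every object of the block (not only the samples) is then mapped with `fitted`.
aligned_block = apply_map(fitted, block_samples + [(3.0, -1.0)])
```

Only the s sampled objects enter the fit, but the fitted matrix is applied to all objects of the block, which is what lets the p independent embeddings share one coordinate system.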
Benchmark datasets
| Dataset | Source | Number of Attributes | Number of Instances | Pearson's Median Skewness Coefficient | Coefficient of Variation |
|---|---|---|---|---|---|
| IRIS | UCI ML Repository | 4 | 150 | 0.34 | 0.64 |
| Dermatology | UCI ML Repository | 33 | 366 | -0.61 | 0.42 |
| M. musculus Microarray | GEO | 4,000 | 2,000 | 0.94 | 1.08 |
| S. cerevisiae Microarray | GEO | 1,000 | 9,300 | 0.73 | 0.56 |
| MNIST | MNIST | 784 | 10,000 | -0.13 | 0.14 |
UCI ML Repository is UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets.html. GEO is Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/. MNIST is the MNIST Database of handwritten digits http://yann.lecun.com/exdb/mnist/. M. musculus Microarray is a modified dataset from Mus musculus microarrays in GEO and S. cerevisiae Microarray is a modified dataset from Saccharomyces cerevisiae microarrays in GEO. MNIST dataset is from scanned handwritten digit images of 28 × 28 pixels.
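The two distribution statistics in the table have standard definitions: Pearson's median skewness coefficient is 3 × (mean − median) / standard deviation, and the coefficient of variation is standard deviation / mean. The sketch below computes both for a flat list of numbers. It is only an illustration: the source does not state which values the statistics were computed over (raw attributes or pairwise dissimilarities) or whether the population or sample standard deviation was used, so the choice of `statistics.pstdev` here is an assumption, and the function names are hypothetical.

```python
import statistics


def pearson_median_skewness(values):
    """Pearson's median (second) skewness: 3 * (mean - median) / std."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std = statistics.pstdev(values)   # assumption: population standard deviation
    return 3.0 * (mean - median) / std


def coefficient_of_variation(values):
    """Standard deviation relative to the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


# Toy right-skewed sample: the single large value pulls the mean above the median.
toy_values = [1.0, 2.0, 3.0, 4.0, 10.0]
toy_skewness = pearson_median_skewness(toy_values)
toy_cv = coefficient_of_variation(toy_values)
```

A positive skewness, as in the two microarray rows, indicates a long right tail; a negative value, as for Dermatology and MNIST, indicates a long left tail.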
Experimental setting
| Dataset | Size of Dissimilarity Matrix | No. of Submatrices | No. of Samples |
|---|---|---|---|
| IRIS | 150 × 150 | 3 | 20 |
| Dermatology | 366 × 366 | 3 | 60 |
| M. musculus Microarray | 2,000 × 2,000 | 10 | 100 |
| S. cerevisiae Microarray | 9,300 × 9,300 | 10 | 150 |
| MNIST | 10,000 × 10,000 | 10 | 150 |
These parameters were set for comparison experiments of the divide-and-conquer mode of CFMDS. In fact, the CFMDS application automatically detects the available memory size and these parameters are subsequently determined. For IRIS, Dermatology, and M. musculus Microarray datasets, these parameters were set arbitrarily, because they can be processed by the one-shot mode of CFMDS.
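To make the role of the two parameters concrete, the sketch below randomly splits n object indices into p blocks and draws s samples from each block, using the IRIS row (150 objects, 3 submatrices, 20 samples). It is a minimal illustration under stated assumptions, not the CFMDS code: the function name `partition_and_sample`, the near-equal block sizes, and the fixed seed are all hypothetical choices.

```python
import random


def partition_and_sample(n_objects, n_blocks, n_samples, seed=0):
    """Randomly split object indices into blocks and sample from each block.

    Returns (blocks, samples, merged):
      blocks  - list of n_blocks index lists, one per diagonal submatrix
      samples - list of n_samples indices drawn from each block
      merged  - all sampled indices, defining the merged submatrix
    """
    rng = random.Random(seed)          # fixed seed only for reproducibility
    order = list(range(n_objects))
    rng.shuffle(order)
    # Deal the shuffled indices round-robin into blocks of near-equal size.
    blocks = [order[i::n_blocks] for i in range(n_blocks)]
    samples = [rng.sample(block, n_samples) for block in blocks]
    merged = [idx for block_samples in samples for idx in block_samples]
    return blocks, samples, merged


# IRIS setting from the table: 150 objects, 3 submatrices, 20 samples each.
iris_blocks, iris_samples, iris_merged = partition_and_sample(150, 3, 20)
block_sizes = [len(b) for b in iris_blocks]   # size of each diagonal submatrix
merged_size = len(iris_merged)                # side length of the merged submatrix
```

With p blocks and s samples per block, the merged submatrix has p × s rows and columns, so for the 9,300 × 9,300 and 10,000 × 10,000 settings it is 1,500 × 1,500, far smaller than the full dissimilarity matrix.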
Figure 2. Comparison results of execution time. Average running time in seconds is shown. The y-axis is in log scale. Random (MaxMin) means the divide-and-conquer mode of CFMDS with Random (MaxMin) sampling. One-shot MDS represents CFMDS without divide-and-conquer. Conventional MDS represents the classical MDS implemented using C# or MATLAB in serial computing environments. "0.00" denotes "not applicable." For S. cerevisiae and MNIST datasets, we were not able to apply the one-shot mode of CFMDS due to the memory limitation in our graphics card.
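The caption names MaxMin sampling without defining it. A common meaning of the term is greedy farthest-point selection: each new sample is the object whose smallest dissimilarity to the already chosen samples is largest. The sketch below implements that reading; whether CFMDS uses exactly this rule is an assumption, and the function name `maxmin_sample` and the `first` parameter are hypothetical.

```python
def maxmin_sample(dissim, n_samples, first=0):
    """Greedy farthest-point (maxmin) selection on a dissimilarity matrix.

    dissim:    square list-of-lists with zeros on the diagonal
    n_samples: number of objects to select
    first:     index of the starting object (arbitrary; often chosen at random)
    """
    n = len(dissim)
    chosen = [first]
    # min_dist[j] = dissimilarity from object j to its nearest chosen object.
    min_dist = list(dissim[first])
    while len(chosen) < n_samples:
        # Pick the object that is farthest from everything chosen so far.
        nxt = max(range(n), key=lambda j: min_dist[j])
        chosen.append(nxt)
        for j in range(n):
            min_dist[j] = min(min_dist[j], dissim[nxt][j])
    return chosen


# Toy example: four objects on a line at positions 0, 1, 2 and 10.
positions = [0.0, 1.0, 2.0, 10.0]
toy_dissim = [[abs(a - b) for b in positions] for a in positions]
picked = maxmin_sample(toy_dissim, 3, first=0)
```

Compared with random sampling, this rule spreads the samples across the data, at the cost of one pass over the block per selected sample.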
Figure 3. Comparison results of accuracy. Pearson's correlation coefficient was used as accuracy. The mean value and standard deviation from 100 independent simulation results are shown. Random (MaxMin) means the divide-and-conquer mode of CFMDS with Random (MaxMin) sampling.
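The caption does not say which two quantities are correlated. A natural reading, assumed here, is the correlation between the original pairwise dissimilarities and the Euclidean distances between the same pairs of objects in the reduced space. The sketch below computes that score in pure Python; the function names `pearson` and `embedding_accuracy` are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def embedding_accuracy(dissim, coords):
    """Correlate original dissimilarities with distances in the embedding.

    Assumption: accuracy is measured over all unordered pairs of objects.
    """
    pairs = list(combinations(range(len(coords)), 2))
    original = [dissim[i][j] for i, j in pairs]
    embedded = [math.dist(coords[i], coords[j]) for i, j in pairs]
    return pearson(original, embedded)


# Toy check: an embedding that reproduces the dissimilarities exactly.
toy_coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
toy_dissim = [[math.dist(a, b) for b in toy_coords] for a in toy_coords]
toy_accuracy = embedding_accuracy(toy_dissim, toy_coords)
```

Because Pearson's correlation is invariant to positive rescaling, an embedding that preserves the distances only up to a constant positive factor scores the same as one that preserves them exactly.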