Aditya Dubey, Akhtar Rasool.
Abstract
Most statistical methods in bioinformatics, particularly those for gene expression data classification, prognosis, and prediction, require a complete dataset. Gene sample values can be missing because of hardware failure, software failure, or manual mistakes. Missing data in gene expression research dramatically affects the analysis of the collected data, so an efficient imputation algorithm is needed. This paper proposes a technique that exploits the local similarity structure of the data, predicting missing values with clustering and a top K nearest neighbor approach. A similarity-based spectral clustering approach is combined with K-means. The spectral clustering parameters, cluster size, and weighting factors are optimized, and the missing values are then predicted. To impute the missing values in each cluster, the top K nearest neighbor approach uses a weighted distance. The evaluation is carried out on numerous datasets from a variety of biological areas, with experimentally inserted missing values varying from 5 to 25%. The experimental results show that the proposed imputation technique makes more accurate predictions than other imputation procedures. The imputation experiments use microarray gene expression datasets covering different cancers and tumors. The main contribution of this research is to show that local similarity-based techniques can be used for imputation even when the dataset has varying dimensionality and characteristics.
Year: 2021 PMID: 34934107 PMCID: PMC8692342 DOI: 10.1038/s41598-021-03438-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Missing value imputation procedure.
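The figure is not reproduced here, but the per-cluster step named in the abstract (top K nearest neighbors with a weighted distance) can be illustrated with a minimal sketch. Everything specific below is an assumption rather than the authors' implementation: the function name `weighted_knn_impute`, the use of `None` for missing entries, the Euclidean distance over shared observed columns, and the inverse-distance weights.

```python
import math


def weighted_knn_impute(rows, k=3):
    """Return a copy of `rows` with None entries filled where possible.

    Assumptions (not taken from the paper): all rows belong to one
    cluster, distance is the root-mean-square difference over columns
    observed in both rows, and neighbors are weighted by inverse distance.
    """
    filled = [list(r) for r in rows]
    for i, target in enumerate(rows):
        for j, value in enumerate(target):
            if value is not None:
                continue
            candidates = []
            for other_idx, other in enumerate(rows):
                # A neighbor must be a different row that observes column j.
                if other_idx == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(target, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Averaging over shared columns keeps distances comparable
                # when rows overlap on different numbers of columns.
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # nothing to borrow from; leave the entry as None
            candidates.sort(key=lambda c: c[0])
            top = candidates[:k]
            # Small constant avoids division by zero for identical rows.
            weights = [1.0 / (d + 1e-9) for d, _ in top]
            filled[i][j] = (sum(w * v for w, (_, v) in zip(weights, top))
                            / sum(weights))
    return filled


if __name__ == "__main__":
    cluster = [
        [1.0, 2.0, None],
        [1.0, 2.0, 3.0],
        [10.0, 20.0, 30.0],
    ]
    print(weighted_knn_impute(cluster, k=2))
```

In this sketch, closer rows dominate the estimate, which is the general idea behind weighting neighbors by distance; the paper's actual weighting factors are optimized and may differ.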
Symbols and definitions.
| Symbols | Definition and description |
|---|---|
| G | Initial dataset |
| N | Number of data instances |
| M | Number of attributes |
| | Similarity matrix for the N data instances |
| | Similarity between data instance |
| K | Desired number of clusters |
| D | Degree matrix |
| W | Adjacency matrix |
| L | Graph Laplacian |
| C | Set of clusters |
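The symbols D, W, and L in the table are related by the standard graph Laplacian construction. As a reminder, and assuming the unnormalized form (this record does not state whether the paper uses a normalized variant), with $W_{ij}$ denoting the edge weight between data instances $i$ and $j$:

```latex
D_{ii} = \sum_{j=1}^{N} W_{ij}, \qquad
D_{ij} = 0 \quad (i \neq j), \qquad
L = D - W
```

In standard spectral clustering, the eigenvectors of L belonging to the K smallest eigenvalues form an N-by-K matrix whose rows are then grouped with K-means, which is consistent with the abstract's pairing of spectral clustering with K-means.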
Figure 2. Block diagram of proposed method.
Summary of the datasets.
| Datasets | Original matrix ( | Complete matrix ( | Missing rate (%) | References |
|---|---|---|---|---|
| GDS1761 | | | 0.15 | [ |
| GDS5232 | | | 48.05 | [ |
| GDS2735 | | | 7.66 | [ |
| GDS1210 | | | 0.02 | [ |
Figure 3. Similarity graphs for the four gene expression datasets: (a) GDS1761, (b) GDS5232, (c) GDS2735, (d) GDS1210.
Figure 4. RMSE vs. the number of neighbors on the four gene expression datasets: (a) GDS1761, (b) GDS5232, (c) GDS2735, (d) GDS1210.
Figure 5. RMSE comparison of the proposed technique with other techniques at different missing percentages on the four gene expression datasets: (a) GDS1761, (b) GDS5232, (c) GDS2735, (d) GDS1210.
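The RMSE comparisons in Figures 4 and 5 rest on the protocol described in the abstract: hide a known fraction of entries (5 to 25%) in a complete matrix, impute them, and measure the error on the hidden cells only. The following is a minimal sketch of that protocol, not the authors' code. The helper names `mask_entries`, `column_mean_impute`, and `rmse` are hypothetical, the toy matrix is made up for illustration, and column-mean imputation is only a stand-in baseline, not the proposed method.

```python
import math
import random


def mask_entries(matrix, rate, seed=0):
    """Hide a fraction `rate` of cells; return (masked copy, hidden cells)."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = rng.sample(cells, round(rate * len(cells)))
    masked = [list(row) for row in matrix]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def column_mean_impute(matrix):
    """Stand-in baseline: fill each None with the mean of its column."""
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def rmse(truth, imputed, hidden):
    """Root mean squared error over the hidden cells only."""
    if not hidden:
        raise ValueError("no hidden cells to evaluate")
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Toy complete matrix, invented for illustration only.
    truth = [[float(i + j) for j in range(5)] for i in range(8)]
    for rate in (0.05, 0.10, 0.15, 0.20, 0.25):
        masked, hidden = mask_entries(truth, rate, seed=1)
        print(rate, rmse(truth, column_mean_impute(masked), hidden))
```

Scoring only the hidden cells keeps the metric from being diluted by entries that were never missing, and fixing the seed lets different imputation methods be compared on the same masked cells.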