Akanksha Farswan, Anubha Gupta, Ritu Gupta, Gurvinder Kaur.
Abstract
Purpose: Gene expression data generated by microarray technology are often analyzed for disease diagnostics and treatment. However, such data suffer from missing values, which may lead to inaccurate findings. Because data capture is expensive and time-consuming and requires collection from subjects, it is worthwhile to recover missing values rather than re-collect the data. This paper proposes a novel but simple method, DSNN (Doubly Sparse DCT domain with Nuclear Norm minimization), for imputing missing values in microarray data. Extensive experiments, including pathway enrichment analysis, have been carried out on four blood cancer datasets to validate the method and to establish the significance of imputation.
Keywords: AML; CLL; MM; blood cancer; compressive sensing; gene enrichment analysis; machine learning; matrix imputation
Year: 2020 PMID: 31970084 PMCID: PMC6960109 DOI: 10.3389/fonc.2019.01442
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 6.244
Review of existing methods for missing value imputation in gene expression data.
| Method | Imputes missing values by first estimating the local correlation among the group of genes that are highly correlated with the gene containing missing values, then using that local correlation to calculate the missing value | Imputes missing values by utilizing the global correlation among the genes in the complete gene expression matrix | Exploits both the global and local correlation among genes to calculate missing values in gene expression data | Imputes missing values by integrating existing domain knowledge into imputation methods; information about the biological process in the microarray experiment is one example of such domain knowledge |
| Advantages | Performs optimally when the data are heterogeneous, i.e., genes exhibit a dominant local similarity structure | Performs optimally when the data have high global covariance in the expression matrix | Performs optimally regardless of the type of covariance present in the gene expression data | Improves the accuracy of missing value imputation and performs optimally in the presence of noisy data |
| Limitations | Performs poorly when the data lack local similarity structure | Fails to perform well when the data are heterogeneous | Performs suboptimally when the data are noisy and have high missing rates | Performs suboptimally when the data have high missing rates |
| Examples | Gaussian mixture clustering imputation (GMCimpute); least square imputation (LSimpute); Bayesian gene selection (BGSregress) | Bayesian Principal Component Analysis (BPCA); SVDimpute (Singular Value Decomposition) | LinCmb; HPM-MI (Hybrid Prediction Model with Missing value Imputation); Tri-imputation | GOimpute; HAimpute (Imputation using Histone Acetylation information) |
Figure 1. Workflow of the proposed analysis.
Figure 2. Each curve represents the DCT coefficients of a few randomly chosen columns and rows of the gene expression matrices of the CLL dataset.
Figure 3. Semi-log plots show NMSE after imputation on the (A) CLL, (B) AML, (C) MM-Spanish, and (D) MM-Indian datasets using Stage-1 only, Stage-2 only, and the proposed DSNN method (Stage-1 + Stage-2).
Figure 4. Semi-log plots comparing the proposed DSNN method with the three state-of-the-art methods in terms of NMSE for the (A) CLL, (B) AML, (C) MM-Spanish, and (D) MM-Indian datasets.
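The NMSE comparisons in Figures 3 and 4 can be illustrated with a small sketch. This record does not state the exact NMSE formula the authors used, so the definition below (squared Frobenius error divided by the squared Frobenius norm of the original matrix) is an assumption, as are the helper names `nmse` and `random_mask` and the mean-fill baseline.

```python
import numpy as np

def nmse(original, imputed):
    """Assumed definition: ||X - X_hat||_F^2 / ||X||_F^2."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return np.sum((original - imputed) ** 2) / np.sum(original ** 2)

def random_mask(shape, sampling_ratio, seed=0):
    """Boolean mask that is True where an entry is kept (observed)."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) < sampling_ratio

# Toy stand-in for a gene expression matrix (hypothetical values).
X = np.arange(1.0, 13.0).reshape(3, 4)
mask = random_mask(X.shape, sampling_ratio=0.7)

# Naive baseline: replace every missing entry with the mean of the observed ones.
X_hat = np.where(mask, X, X[mask].mean())
print("NMSE of mean-fill baseline:", nmse(X, X_hat))
```

Any imputation method can be scored the same way by swapping its output in for `X_hat`; a lower NMSE at a given sampling ratio indicates a better reconstruction.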
Classification accuracy and F1 score for CLL dataset at varying sampling ratios (FR, feature reduction; SR, sampling ratio; Obs., observed; Rec., recovered using DSNN method).
| 10% | 0.71 | 0.87 | 0.73 | 0.77 | 0.84 | 0.96 | 0.85 | 0.97 | 0.86 | 0.96 | 0.89 | 0.98 |
| 20% | 0.71 | 0.87 | 0.75 | 0.78 | 0.80 | 0.97 | 0.84 | 0.98 | 0.86 | 0.98 | 0.87 | 0.99 |
| 30% | 0.79 | 0.89 | 0.77 | 0.81 | 0.85 | 0.97 | 0.84 | 0.98 | 0.85 | 0.99 | 0.87 | 0.99 |
| 40% | 0.79 | 0.89 | 0.81 | 0.91 | 0.85 | 0.98 | 0.85 | 0.98 | 0.86 | 0.99 | 0.88 | 0.99 |
| 50% | 0.80 | 0.89 | 0.85 | 0.97 | 0.86 | 0.99 | 0.85 | 0.98 | 0.88 | 0.99 | 0.90 | 0.99 |
| 60% | 0.78 | 0.92 | 0.87 | 0.97 | 0.85 | 0.99 | 0.85 | 0.98 | 0.90 | 0.99 | 0.92 | 0.99 |
| 70% | 0.83 | 0.90 | 0.90 | 0.97 | 0.86 | 0.99 | 0.86 | 0.98 | 0.93 | 0.99 | 0.96 | 0.99 |
| 80% | 0.83 | 0.91 | 0.96 | 0.98 | 0.86 | 0.99 | 0.87 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 |
| 90% | 0.85 | 0.91 | 0.97 | 0.97 | 0.87 | 0.98 | 0.91 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 |
| 10% | 0.72 | 0.86 | 0.72 | 0.72 | 0.78 | 0.96 | 0.79 | 0.96 | 0.79 | 0.95 | 0.85 | 0.98 |
| 20% | 0.72 | 0.85 | 0.74 | 0.72 | 0.77 | 0.97 | 0.78 | 0.98 | 0.79 | 0.98 | 0.81 | 0.98 |
| 30% | 0.78 | 0.88 | 0.77 | 0.77 | 0.79 | 0.97 | 0.78 | 0.97 | 0.79 | 0.99 | 0.82 | 0.99 |
| 40% | 0.78 | 0.86 | 0.80 | 0.90 | 0.79 | 0.98 | 0.79 | 0.98 | 0.79 | 0.99 | 0.85 | 0.99 |
| 50% | 0.80 | 0.88 | 0.84 | 0.96 | 0.80 | 0.99 | 0.79 | 0.98 | 0.84 | 0.99 | 0.87 | 0.99 |
| 60% | 0.78 | 0.90 | 0.86 | 0.96 | 0.79 | 0.99 | 0.79 | 0.98 | 0.88 | 0.99 | 0.90 | 0.99 |
| 70% | 0.82 | 0.89 | 0.90 | 0.97 | 0.80 | 0.99 | 0.80 | 0.98 | 0.92 | 0.99 | 0.96 | 0.99 |
| 80% | 0.82 | 0.90 | 0.96 | 0.98 | 0.80 | 0.98 | 0.82 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 |
| 90% | 0.84 | 0.91 | 0.97 | 0.97 | 0.82 | 0.98 | 0.89 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 |
Classification accuracy and F1 score for AML dataset at varying sampling ratios (FR, feature reduction; SR, sampling ratio; Obs., observed; Rec., recovered using DSNN method).
| 10% | 0.55 | 0.84 | 0.54 | 0.83 | 0.60 | 0.86 | 0.86 | 0.96 | 0.76 | 0.91 | 0.96 | 0.98 |
| 20% | 0.50 | 0.98 | 0.50 | 0.98 | 0.97 | 0.97 | 0.98 | 0.98 | 0.73 | 0.99 | 0.91 | 0.99 |
| 30% | 0.45 | 0.99 | 0.45 | 0.99 | 0.97 | 0.97 | 0.98 | 0.98 | 0.76 | 1.0 | 0.91 | 0.99 |
| 40% | 0.53 | 0.99 | 0.59 | 0.99 | 0.95 | 0.99 | 0.99 | 1.0 | 0.71 | 1.0 | 0.86 | 1.0 |
| 50% | 0.54 | 0.98 | 0.56 | 0.99 | 0.96 | 0.96 | 0.99 | 0.99 | 0.77 | 1.0 | 0.83 | 0.99 |
| 60% | 0.63 | 0.98 | 0.70 | 0.99 | 0.98 | 1.0 | 0.99 | 1.0 | 0.75 | 1.0 | 0.93 | 1.0 |
| 70% | 0.63 | 0.96 | 0.67 | 0.99 | 0.98 | 0.98 | 0.99 | 1.0 | 0.82 | 1.0 | 0.96 | 0.99 |
| 80% | 0.75 | 0.96 | 0.77 | 0.99 | 0.99 | 0.99 | 0.96 | 1.0 | 0.87 | 0.98 | 0.96 | 1.0 |
| 90% | 0.80 | 0.94 | 0.87 | 0.99 | 0.99 | 0.99 | 0.96 | 0.99 | 0.94 | 0.99 | 0.97 | 0.99 |
| 10% | 0.53 | 0.83 | 0.54 | 0.83 | 0.48 | 0.85 | 0.86 | 0.95 | 0.76 | 0.91 | 0.96 | 0.98 |
| 20% | 0.49 | 0.98 | 0.50 | 0.98 | 0.97 | 0.97 | 0.98 | 0.98 | 0.73 | 1.0 | 0.91 | 0.99 |
| 30% | 0.45 | 1.0 | 0.46 | 0.99 | 0.97 | 0.97 | 0.98 | 0.99 | 0.76 | 1.0 | 0.91 | 0.99 |
| 40% | 0.52 | 0.99 | 0.60 | 1.0 | 0.96 | 0.99 | 1.0 | 1.0 | 0.72 | 1.0 | 0.86 | 1.0 |
| 50% | 0.53 | 0.98 | 0.57 | 0.99 | 0.96 | 0.96 | 0.99 | 0.99 | 0.78 | 1.0 | 0.82 | 0.99 |
| 60% | 0.64 | 0.97 | 0.70 | 0.99 | 0.98 | 1.0 | 0.99 | 1.0 | 0.75 | 1.0 | 0.93 | 1.0 |
| 70% | 0.64 | 0.97 | 0.68 | 1.0 | 0.98 | 0.98 | 0.99 | 1.0 | 0.82 | 1.0 | 0.96 | 0.99 |
| 80% | 0.73 | 0.96 | 0.77 | 0.99 | 0.98 | 0.99 | 0.96 | 1.0 | 0.87 | 0.98 | 0.96 | 1.0 |
| 90% | 0.77 | 0.93 | 0.87 | 0.98 | 0.99 | 0.99 | 0.96 | 0.99 | 0.94 | 0.99 | 0.97 | 0.99 |
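For readers unfamiliar with the two metrics reported in the tables above, the following is a minimal sketch of how accuracy and F1 score are computed from predicted labels. It assumes a binary classification task with a designated positive class; the function name `accuracy_and_f1` and the toy labels are illustrative and do not come from the paper.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Hypothetical labels: one positive sample is missed by the classifier.
acc, f1 = accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 0])
print(acc, f1)
```

Reporting F1 alongside accuracy matters when classes are imbalanced, since accuracy alone can look high while one class is rarely predicted correctly.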
Figure 5. Comparison of different methods in terms of classification accuracy and F1 score at varying sampling ratios on the CLL dataset.
Figure 6. Comparison of different methods in terms of classification accuracy and F1 score at varying sampling ratios on the AML dataset.
Figure 7. A few important KEGG pathways at 70% observed and imputed data for the CLL data. Adjusted p-values are shown in brackets.
Figure 8. A few important KEGG pathways at 70% observed and imputed data for the AML data. Adjusted p-values are shown in brackets.
Figure 9. A few important KEGG pathways at 70% observed and imputed data for the MM-Spanish data. Adjusted p-values are shown in brackets.
Figure 10. A few important KEGG pathways at 70% observed and imputed data for the MM-Indian data. Adjusted p-values are shown in brackets.
Proposed DSNN Method
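The algorithm listing for DSNN is not reproduced in this record. As a rough illustration of the nuclear-norm idea named in the abstract, the sketch below uses a generic iterative singular-value soft-thresholding scheme (in the style of soft-impute). It is not the authors' DSNN algorithm: the DCT-domain sparsity stage is not modeled, and the function name `soft_impute` and the parameters `lam` and `n_iter` are assumptions made for this example.

```python
import numpy as np

def soft_impute(X_obs, mask, lam=0.1, n_iter=200):
    """Fill entries where `mask` is False using nuclear-norm regularization.

    Generic sketch, not the DSNN method: each iteration keeps the observed
    entries, fills the rest with the current low-rank estimate, and then
    soft-thresholds the singular values by `lam`.
    """
    Z = np.where(mask, X_obs, 0.0)  # start with zeros in the missing positions
    for _ in range(n_iter):
        filled = np.where(mask, X_obs, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)  # shrink singular values toward zero
        Z = (U * s) @ Vt
    # Observed entries are returned unchanged; only missing ones are estimated.
    return np.where(mask, X_obs, Z)

# Hypothetical rank-1 matrix with one hidden entry.
X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
X_obs = X.copy()
X_obs[1, 1] = np.nan
mask = ~np.isnan(X_obs)

X_rec = soft_impute(X_obs, mask)
print("true value:", X[1, 1], "estimate:", X_rec[1, 1])
```

The shrinkage by `lam` biases estimates slightly toward zero, so smaller values of `lam` fit the observed data more closely at the cost of slower convergence.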