| Literature DB >> 33193667 |
Meng Song, Jonathan Greenbaum, Joseph Luttrell, Weihua Zhou, Chong Wu, Hui Shen, Ping Gong, Chaoyang Zhang, Hong-Wen Deng.
Abstract
Multi-omics studies, which explore the interactions between multiple types of biological factors, have significant advantages over single-omics analysis: they provide a more holistic view of biological processes, uncover the causal and functional mechanisms of complex diseases, and facilitate new discoveries in precision medicine. However, omics datasets often contain missing values, and in multi-omics study designs it is common for individuals to be represented in some omics layers but not all. Since most statistical analyses cannot be applied directly to incomplete datasets, imputation is typically performed to infer the missing values. Integrative imputation techniques that make use of the correlations and shared information among multi-omics datasets are expected to outperform approaches that rely on single-omics information alone, resulting in more accurate results for the subsequent downstream analyses. In this review, we provide an overview of the currently available imputation methods for handling missing values in bioinformatics data, with an emphasis on multi-omics imputation. In addition, we provide a perspective on how deep learning methods might be developed for the integrative imputation of multi-omics datasets.
Keywords: autoencoders; deep learning; integrative imputation; machine learning; multi-omics imputation; multi-view matrix factorization; single-omics imputation; transfer learning
Year: 2020 PMID: 33193667 PMCID: PMC7594632 DOI: 10.3389/fgene.2020.570255
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Genotype imputation methods.
| Category | Method | Remarks | Strengths | Limitations |
| Reference-based | fastPHASE | Haplotype clustering and HMM | Handles samples from multiple subpopulations | Does not estimate recombination rates |
| | IMPUTE2 | MCMC and HMM | First tool to use pre-phasing | Computational complexity |
| | IMPUTE4 | Improvement of IMPUTE2 | Faster and more memory-efficient than IMPUTE2 | |
| | BEAGLE 5.0 | Graphical model | Handles multi-allelic markers | Computational complexity |
| | MACH | HMM | | Computational complexity |
| | FISH | Segmental HMM | No pre-phasing and lower computational complexity | |
| | Minimac3 | Improvement of MACH | Engine for web-based imputation servers | |
| | TUNA, PLINK, UNPHASED, SNPMStat | SNP-tagging approaches | Simpler and faster than HMM-based methods | Only consider local LD structure |
| Reference-free | SVD, Mean, RF, KNN | Statistical techniques | Easy to implement | Do not model linkage patterns, recombination hotspots, mutations, or genotyping errors |
| | SCDA | Sparse convolutional denoising autoencoder | Deep learning | Hard to interpret the prediction mechanisms |
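The reference-free approaches in the last two rows need no haplotype panel. As a hedged illustration (not code from the review; the function name, the `-1` missing code, and the majority-vote rule are assumptions of this sketch), a minimal KNN genotype imputer can fill a missing call by majority vote among the samples most similar at jointly observed sites:

```python
import numpy as np

def knn_impute_genotypes(G, k=3, missing=-1):
    """Fill missing genotype calls (0/1/2, with `missing` as the sentinel)
    by majority vote among the k samples with the smallest Hamming
    distance over jointly observed sites."""
    G = np.asarray(G)
    imputed = G.copy()
    n = G.shape[0]
    for i in range(n):
        miss = np.where(G[i] == missing)[0]
        if miss.size == 0:
            continue
        # Hamming distance to every other sample on jointly observed sites
        dists = []
        for j in range(n):
            if j == i:
                continue
            obs = (G[i] != missing) & (G[j] != missing)
            d = np.mean(G[i, obs] != G[j, obs]) if obs.any() else 1.0
            dists.append((d, j))
        neighbors = [j for _, j in sorted(dists)[:k]]
        for s in miss:
            calls = [G[j, s] for j in neighbors if G[j, s] != missing]
            if calls:
                # majority genotype among the nearest neighbors at this site
                vals, counts = np.unique(calls, return_counts=True)
                imputed[i, s] = vals[np.argmax(counts)]
    return imputed
```

Unlike the HMM-based reference tools above, this sketch ignores linkage structure entirely, which is exactly the limitation the table notes for statistical reference-free techniques.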
Gene expression data imputation methods.
| Category | Type | Method | Remarks | Strengths | Limitations |
| Bulk RNA-seq | Statistical methods | Mean | Row average | Simple | Low accuracy |
| | | KNNimpute | Hot deck imputation | Simple | Difficult to determine K |
| | | GMCimpute | Gaussian mixture clustering with model averaging | Suited to both cross-sectional and time-series data | Same as KNNimpute |
| | | SEQimpute | Multiple imputation | | Vulnerable to outliers |
| | | GOKNN/GOLLS | Cold deck imputation with gene ontology | Incorporates prior knowledge | |
| scRNA-seq | Classic ML methods | MAGIC | Neighborhood-based Markov-affinity matrix | Can recover gene-gene relationships | May introduce bias for true zeros |
| | | DrImpute | Clustering based | | Ignores gene-level correlation |
| | | scImpute | Gamma-Normal mixture model | Learns gene dropout probabilities | |
| | | SAVER | Bayesian model | Quantifies estimation uncertainty | May introduce bias for true zeros |
| | | SAVER-X | Bayesian model and autoencoder | Web-based imputation tool | |
| | | VIPER | Weighted penalized regression model | Free of tuning parameters | No uncertainty quantification |
| | | EnImpute | Ensemble learning | Combines eight approaches | |
| | Deep learning-based methods | SAUCIE | Multi-task deep autoencoder | | Difficult to evaluate accuracy |
| | | AutoImpute | Autoencoder-based | | |
| | | DCA | Autoencoder with the ZINB loss function | | Overfitting |
| | | scVI | Stochastic optimization and VAE | High scalability | |
| | | DeepImpute | Deep neural network | Constructs sub-neural networks | |
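To make the MAGIC row concrete, here is a hedged toy sketch of the Markov-affinity idea (the function name, kernel bandwidth, and diffusion depth are assumptions of this sketch, not details from MAGIC itself): build a Gaussian affinity matrix over cells, row-normalize it into a Markov transition matrix, and diffuse expression a few steps so dropout zeros borrow signal from similar cells.

```python
import numpy as np

def magic_like_smooth(X, t=3, sigma=1.0):
    """Markov-affinity diffusion in the spirit of MAGIC.
    X: cells x genes expression matrix. Returns the smoothed matrix."""
    X = np.asarray(X, dtype=float)
    # pairwise squared Euclidean distances between cells
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))        # Gaussian kernel affinities
    M = A / A.sum(axis=1, keepdims=True)      # row-normalize -> Markov matrix
    return np.linalg.matrix_power(M, t) @ X   # diffuse expression t steps
```

The table's caveat is visible directly in the sketch: diffusion raises every zero toward its neighbors' values, so biologically true zeros are biased upward along with dropouts.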
Epigenomic data imputation methods.
| Category | Method | Remarks | Strengths | Limitations |
| Statistical methods | ChromImpute | Ensemble of regression trees | | Does not incorporate genetic variation as an input |
| | Melissa | Bayesian hierarchical method | Considers local correlations from neighboring CpGs and information across similar cells | No consideration of heterogeneity at the single-gene level |
| | PREDICTD | PARAFAC | 3D tensor decomposition and cloud computing | Does not learn non-linear relationships |
| Deep learning-based methods | Avocado | Tensor factorization and deep neural network | 3D tensor decomposition, with a DNN to learn non-linear relationships | Hyperparameter settings may influence precision and recall |
| | SCALE | VAE and GMM | | |
| | DeepCpG | Deep learning-based joint model | Uses associations between neighboring CpGs as well as between DNA sequence patterns and methylation states | Does not integrate multi-omics data profiled in the same cell |
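The local-correlation idea that Melissa and DeepCpG exploit, namely that nearby CpGs tend to share methylation state, can be sketched in a few lines. This is a hedged toy example, not any published method's algorithm; the function name, the `-1` missing code, and the Gaussian distance weighting are assumptions of this sketch.

```python
import numpy as np

def impute_cpg(positions, meth, missing=-1, sigma=100.0):
    """Fill missing CpG methylation values with a distance-weighted
    average of observed neighboring CpGs on the same chromosome.
    positions: genomic coordinates; meth: methylation levels in [0, 1]."""
    positions = np.asarray(positions, dtype=float)
    meth = np.asarray(meth, dtype=float)
    out = meth.copy()
    obs = meth != missing
    for i in np.where(~obs)[0]:
        # Gaussian weight decays with genomic distance to each observed CpG
        w = np.exp(-((positions[obs] - positions[i]) ** 2) / (2 * sigma ** 2))
        out[i] = np.sum(w * meth[obs]) / np.sum(w)
    return out
```

The published methods go well beyond this: Melissa additionally shares information across similar cells, and DeepCpG learns sequence patterns, which is precisely what a purely positional average cannot capture.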
Proteomic data imputation methods.
| Category | Method | Remarks | Strengths | Limitations |
| Single-value replacement | LOD1 | Half of the global minimal intensity among peptides | Simple; good performance for largely left-censored missing values | Poor classification accuracy at the peptide and protein levels |
| | LOD2 | Half of the minimal intensity of the individual peptide | Same as LOD1 | Same as LOD1 |
| | RTI | Random draw from a truncated normal distribution | Same as LOD1/LOD2 | Same as LOD1/LOD2 |
| Local methods | KNN | Weighted average intensity of the K most similar peptides | Simple | Difficult to determine K |
| | LLS | Least-squares regression model | Automatically estimates the K most similar peptides | |
| | LSA | Weighted LLS | | May need to remove features with high missing rates before imputation |
| | REM | Regularized EM model | | May lead to biased estimators and convergence issues |
| | MBI | ANOVA model | | |
| Global methods | PPCA | PCA and EM | | |
| | BPCA | PCA, Bayesian estimation, and EM | Model parameters automatically determined | Assumes a global covariance structure, which may introduce bias |
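The LOD1/LOD2 rows are simple enough to state in code. The sketch below is a hedged illustration of half-minimum replacement for left-censored proteomics intensities (the function name, the NaN convention, and the `per_peptide` flag are assumptions of this sketch):

```python
import numpy as np

def lod_impute(X, per_peptide=False):
    """Left-censored replacement for proteomics intensities: missing values
    (NaN) are set to half the minimal observed intensity, either globally
    (LOD1-style) or per peptide row (LOD2-style)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    if per_peptide:
        for i in range(X.shape[0]):
            row = X[i]
            # skip fully observed rows and rows with no observed values
            if np.isnan(row).any() and not np.isnan(row).all():
                out[i, np.isnan(row)] = np.nanmin(row) / 2.0
    else:
        out[np.isnan(out)] = np.nanmin(X) / 2.0
    return out
```

As the table notes, such single-value replacement is only defensible when missingness is dominated by left-censoring (intensities below the detection limit); for values missing at random it distorts downstream classification.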
FIGURE 1. The genetic information flow from DNA to protein via RNA, and the interactions between transcriptomics and genomics, epigenomics, or proteomics in multi-omics data imputation. The top diagram shows the central dogma of molecular biology, in which DNA is transcribed to RNA and then translated to protein. The bottom diagram shows the integrative relationships among the genomics, epigenomics, transcriptomics, and proteomics datasets on which multi-omics imputation methods are based. These methods combine different omics datasets to perform integrative imputation of missing values, indicated here by arrows numbered according to the type of data combination they represent. For example, arrow 1 shows that imputation can be facilitated by leveraging the correlation between genomics data (such as SNPs) and transcriptomics data (such as gene expression).
Integrative imputation methods for multi-omics datasets.
| Category | Approach | Method | Remarks | References |
| Genomics and transcriptomics | ML-based regression model | PrediXcan | ENet | |
| | | S-PrediXcan | GWAS summary statistics | |
| | | FUSION | BSLMM | |
| | | TIGAR | DPR | |
| | | CoMM | EM | |
| Epigenomics and transcriptomics | ML-based regression model | Lin | Ensemble learning | |
| | | EpiXcan | WENet | |
| | | TOBMI | KNN | |
| | Transfer learning | TDimpute | Transfer learning and DNN | |
| Transcriptomics and proteomics | Transfer learning | cTP-net | SAVER-X and MB-DNN | |
| | | Seurat v3 | Anchor-based transfer learning | |
| Tri-omics | Multi-view matrix factorization | MI-MFA | STATIS | |
| | | LF-IMVC | Multi-view clustering | |
| | | MOFA | Unsupervised factorization | |
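The multi-view matrix factorization row rests on one idea: a sample-level factor matrix shared across omics views, so structure learned from observed views transfers to entries missing in another. The following is a hedged toy sketch of that shared-factor idea (the function name, plain gradient descent on observed entries, and the toy shapes are assumptions of this sketch, not the algorithm of MOFA, MI-MFA, or LF-IMVC):

```python
import numpy as np

def joint_mf_impute(views, masks, rank=1, lr=0.02, steps=5000, seed=0):
    """Toy multi-view matrix factorization: each omics view X_v
    (samples x features_v) is modeled as U @ V_v.T with the sample factor
    matrix U shared across views, so entries missing in one view are
    imputed from structure learned in the others."""
    rng = np.random.default_rng(seed)
    n = views[0].shape[0]
    U = rng.normal(scale=0.1, size=(n, rank))
    Vs = [rng.normal(scale=0.1, size=(X.shape[1], rank)) for X in views]
    for _ in range(steps):
        for X, M, V in zip(views, masks, Vs):
            E = np.where(M, U @ V.T - X, 0.0)  # residual on observed cells only
            U -= lr * E @ V                    # gradient step on shared factors
            V -= lr * E.T @ U                  # gradient step on view loadings
    return [U @ V.T for V in Vs]               # read imputed values off these
```

Because U is fit jointly, a sample's factor scores are constrained by every view in which it is observed, which is the mechanism that lets, for example, transcriptomics and epigenomics jointly inform a missing proteomics block.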