Yeping Lina Qiu [1,2], Hong Zheng [1], Olivier Gevaert [1,3]
[1] Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA 94305, USA.
[2] Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
[3] Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA.
Abstract
BACKGROUND: Missing values are frequent in genomic data, so practical methods for handling them are necessary for downstream analyses that require complete data sets. State-of-the-art imputation techniques, including methods based on singular value decomposition and K-nearest neighbors, can be computationally expensive for large data sets, and these algorithms are difficult to adapt to certain missing-not-at-random cases.
RESULTS: In this work, we use a deep-learning framework based on the variational auto-encoder (VAE) for genomic missing value imputation and demonstrate its effectiveness in transcriptome and methylome data analysis. We show that in the vast majority of our testing scenarios, VAE achieves similar or better performance than the most widely used imputation standards, while having a computational advantage at evaluation time. When dealing with data missing not at random (e.g., few values are missing), we develop simple yet effective methodologies to leverage prior knowledge about the missing data. Furthermore, we investigate the effect of varying the latent space regularization strength in VAE on imputation performance and, in this context, show why VAE has a better imputation capacity than a regular deterministic auto-encoder.
CONCLUSIONS: We describe a deep-learning imputation framework for transcriptome and methylome data using a VAE and show that it can be a preferable alternative to traditional methods for data imputation, especially for large-scale data and certain missing-not-at-random scenarios.
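The "latent space regularization strength" mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common VAE formulation: a Gaussian encoder with diagonal covariance, a squared-error reconstruction term, and a weight `beta` on the KL term. The abstract does not spell out the exact loss or evaluation metric, so the function names `vae_loss` and `masked_rmse`, the `beta` parameter, and the choice of RMSE over held-out entries are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Mean per-sample loss: squared-error reconstruction + beta * KL.

    Assumed shapes (hypothetical): x and x_hat are (n_samples, n_features);
    mu and log_var are (n_samples, n_latent).
    """
    # Reconstruction term: sum of squared errors over features.
    recon = np.sum((x - x_hat) ** 2, axis=1)
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=1)
    # beta scales the regularization; beta = 0 drops the KL term entirely,
    # leaving only the reconstruction error.
    return float(np.mean(recon + beta * kl))


def masked_rmse(x_true, x_imputed, missing_mask):
    """RMSE restricted to entries flagged True in the boolean missing_mask."""
    diff = (x_true - x_imputed)[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Under these assumptions, sweeping `beta` and recording `masked_rmse` on artificially masked entries is one way to probe how regularization strength affects imputation quality.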