Yishu Wang1, Lin Wang1, Dejie Yang2, Minghua Deng3. 1. Center for Quantitative Biology, Peking University, Beijing 100871, China. 2. Institute of Computing Technology, Chinese Academy of Science, Beijing 100190, China. 3. Center for Quantitative Biology, Peking University, Beijing 100871, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Statistical Sciences, Peking University, Beijing 100871, China. Electronic address: dengmh@math.pku.edu.cn.
Abstract
BACKGROUND: Epistatic Miniarray Profiles (EMAP) enable the research of genetic interaction as an important method to construct large-scale genetic interaction networks. However, a high proportion of missing values frequently poses problems in EMAP data analysis since such missing values hinder downstream analysis. While some imputation approaches have been available to EMAP data, we adopted an improved SVD modeling procedure to impute the missing values in EMAP data which has resulted in a higher accuracy rate compared with existing methods. RESULTS: The improved SVD imputation method adopts an effective soft-threshold to the SVD approach which has been shown to be the best model to impute genetic interaction data when compared with a number of advanced imputation methods. Imputation methods also improve the clustering results of EMAP datasets. Thus, after applying our imputation method on the EMAP dataset, more meaningful modules, known pathways and protein complexes could be detected. CONCLUSION: While the phenomenon of missing data unavoidably complicates EMAP data, our results showed that we could complete the original dataset by the Soft-SVD approach to accurately recover genetic interactions.
BACKGROUND: Epistatic Miniarray Profiles (EMAP) enable the research of genetic interaction as an important method to construct large-scale genetic interaction networks. However, a high proportion of missing values frequently poses problems in EMAP data analysis since such missing values hinder downstream analysis. While some imputation approaches have been available to EMAP data, we adopted an improved SVD modeling procedure to impute the missing values in EMAP data which has resulted in a higher accuracy rate compared with existing methods. RESULTS: The improved SVD imputation method adopts an effective soft-threshold to the SVD approach which has been shown to be the best model to impute genetic interaction data when compared with a number of advanced imputation methods. Imputation methods also improve the clustering results of EMAP datasets. Thus, after applying our imputation method on the EMAP dataset, more meaningful modules, known pathways and protein complexes could be detected. CONCLUSION: While the phenomenon of missing data unavoidably complicates EMAP data, our results showed that we could complete the original dataset by the Soft-SVD approach to accurately recover genetic interactions.