Ki-Yeol Kim (1), Byoung-Jin Kim, Gwan-Su Yi. (1) School of Engineering, Information and Communications University, 103-6 Munji-dong, Yusung-gu, Daejon 305-714, South Korea. kky1004@icu.ac.kr
Abstract
BACKGROUND: Imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analyses require a complete data set. A few imputation methods for DNA microarray data have been introduced, but their efficiency was low and the validity of the values they imputed had not been fully checked. RESULTS: We developed a new cluster-based imputation method called the sequential K-nearest neighbor (SKNN) method. It imputes missing values sequentially, starting from the gene with the fewest missing values, and reuses the imputed values in later imputations. Although it relies on imputed values, the new method greatly improves on the conventional KNN-based method and on other methods based on maximum likelihood estimation in both accuracy and computational complexity. SKNN outperformed other imputation methods particularly for data with high missing rates and a large number of experiments. Applying Expectation Maximization (EM) to the SKNN method improved the accuracy, but increased computation time in proportion to the number of iterations. The Multiple Imputation (MI) method, which is well known but had not previously been applied to microarray data, showed accuracy similarly high to that of the SKNN method, with a slightly higher dependency on the type of data set. CONCLUSIONS: Sequential reuse of imputed data in KNN-based imputation greatly increases the efficiency of imputation. The SKNN method should be practically useful for salvaging data from microarray experiments that have many missing entries. The SKNN method generates reliable imputed values that can be used for further cluster-based analysis of microarray data.
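The sequential procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the distance measure or how neighbor values are combined, so the Euclidean distance over observed experiments, the inverse-distance weighted average, the default `k=10`, and the function name `sknn_impute` are all assumptions made for illustration. Rows are taken to be genes, columns experiments, and `NaN` marks a missing value.

```python
import numpy as np

def sknn_impute(data, k=10):
    """Sequential KNN imputation sketch (assumed details, see lead-in)."""
    x = np.array(data, dtype=float)
    missing = np.isnan(x)
    n_missing = missing.sum(axis=1)

    # Genes with no missing values form the initial reference set.
    complete = np.flatnonzero(n_missing == 0)
    if complete.size == 0:
        raise ValueError("need at least one gene without missing values")
    reference = [x[i] for i in complete]

    # Process incomplete genes from fewest to most missing values.
    incomplete = np.flatnonzero(n_missing > 0)
    order = incomplete[np.argsort(n_missing[incomplete], kind="stable")]

    for i in order:
        obs = ~missing[i]
        ref = np.vstack(reference)
        # Assumption: Euclidean distance over the observed experiments only.
        dist = np.sqrt(((ref[:, obs] - x[i, obs]) ** 2).sum(axis=1))
        nearest = np.argsort(dist, kind="stable")[:k]
        # Assumption: inverse-distance weighted average of the neighbors.
        weights = 1.0 / (dist[nearest] + 1e-12)
        x[i, ~obs] = weights @ ref[nearest][:, ~obs] / weights.sum()
        # Sequential step: the imputed gene joins the reference set.
        reference.append(x[i].copy())
    return x
```

The last line of the loop is what distinguishes this from conventional KNN imputation: each gene, once completed, becomes a candidate neighbor for the genes that follow, so the reference set grows as the imputation proceeds.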