Eunjee Lee1,2,3, Seungyeul Yoo1,2, Wenhui Wang1,2, Zhidong Tu1,2, Jun Zhu1,2,3,4. 1. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA. 2. Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA. 3. Sema4, a Mount Sinai venture, 333 Ludlow Street, Stamford, CT 06902, USA. 4. The Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA.
Abstract
BACKGROUND: Data errors, including sample swapping and mislabeling, are inevitable in large-scale omics data generation. They need to be identified and corrected before integrative analyses, in which different types of data are merged on the basis of their annotated labels. Labeling errors dampen true biological signals and, more importantly, can lead to wrong scientific conclusions. We developed a robust probabilistic multi-omics data matching procedure, proMODMatcher, to curate data by identifying and correcting annotation errors in large databases. RESULTS: Application to simulated datasets suggests that proMODMatcher achieved robust statistical power even when the number of cis-associations was small and/or the number of samples was large. Applying proMODMatcher to multi-omics datasets in The Cancer Genome Atlas and the International Cancer Genome Consortium identified sample errors in multiple cancer datasets. The procedure was able not only to identify sample-labeling errors but also to unambiguously identify the source of the errors. CONCLUSIONS: Our results indicate that sample-labeling errors were common in large multi-omics datasets. These errors should be identified and corrected before integrative analysis.
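The abstract does not spell out how the matching works, so the following is only a simplified illustration of the general idea of using cis-associated feature pairs to check sample labels across two data types. It is not the proMODMatcher procedure: it is not probabilistic, and the function names, the standardization step, and the "best match by highest similarity" rule are all assumptions made for this sketch.

```python
import numpy as np


def zscore_rows(x):
    """Standardize each feature (row) across samples."""
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    return (x - mu) / np.where(sd == 0, 1.0, sd)


def flag_mismatches(data_a, data_b):
    """Flag samples whose best cross-omics match is not their annotated label.

    data_a, data_b: arrays of shape (n_pairs, n_samples). Row i of each array
    holds the two features of one cis-associated pair (hypothetical input
    layout); column j of each array carries the same annotated sample label.
    Returns a list of (annotated_index, best_match_index) for flagged samples.
    """
    za = zscore_rows(data_a)
    zb = zscore_rows(data_b)
    # Per-pair Pearson correlation under the annotated labels; its sign is
    # used so that negatively associated pairs contribute in the same way.
    r = (za * zb).mean(axis=1)
    zb = zb * np.sign(r)[:, None]
    # sim[p, q]: agreement between sample p in data_a and sample q in data_b,
    # averaged over all cis pairs.
    sim = za.T @ zb / za.shape[0]
    best = sim.argmax(axis=1)
    return [(p, int(q)) for p, q in enumerate(best) if p != q]
```

In this sketch, a sample is flagged when its profile in one data type agrees best with a differently labeled sample in the other data type. A pair of flags pointing at each other, such as (3, 7) and (7, 3), is what a simple swap between two samples would look like.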