Vered Madar¹, Sandra Batista². ¹Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709, USA; ²Department of Computer Science, Princeton University, Princeton, NJ 08540, USA.
Abstract
MOTIVATION: We address a common problem in large-scale data analysis, particularly in genetics: the huge-scale testing problem, in which millions to billions of hypotheses are tested together, making it computationally challenging to control the false discovery rate. As a solution, we propose an alternative algorithm for the well-known Linear Step Up procedure of Benjamini and Hochberg. RESULTS: Our algorithm runs in linear time and does not require the P-values to be ordered. It permits a huge-scale testing problem to be split arbitrarily into computationally feasible sets, or chunks. The algorithm combines the results from the chunks to produce the same results as the controlling procedure applied to the entire set of tests, thus controlling the global false discovery rate even when the P-values are arbitrarily divided. Practical memory usage can likewise be set according to the amount of available memory. AVAILABILITY AND IMPLEMENTATION: R code is provided in the supplementary material. CONTACT: sbatista@cs.princeton.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
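To make the idea of a sort-free, chunk-wise step-up concrete, the following Python sketch shows one possible counting-based approach. It is an illustration only and is not the authors' R implementation, whose details the abstract does not give. It relies on the fact that the Benjamini-Hochberg cutoff is the largest k such that at least k P-values are at most k·alpha/m, and that such counts can be accumulated chunk by chunk. All function names (`bh_reference`, `bin_index`, `add_chunk`, `bh_from_counts`) are hypothetical, and the sketch assumes the total number of tests m is known in advance.

```python
import math


def bh_reference(pvalues, alpha):
    """Textbook step-up: sort, then find the largest k with p_(k) <= k*alpha/m."""
    m = len(pvalues)
    k_star = 0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * alpha / m:
            k_star = k
    return k_star


def bin_index(p, m, alpha):
    """Smallest integer j >= 1 such that p <= j*alpha/m."""
    j = max(1, math.ceil(p * m / alpha))
    # Correct for floating-point rounding so the bin agrees exactly
    # with the comparison p <= j*alpha/m.
    while j > 1 and p <= (j - 1) * alpha / m:
        j -= 1
    while p > j * alpha / m:
        j += 1
    return j


def add_chunk(counts, chunk, m, alpha):
    """Add one chunk of P-values to the shared counts; no sorting is needed."""
    for p in chunk:
        j = bin_index(p, m, alpha)
        if j <= m:  # P-values above alpha can never be rejected
            counts[j] += 1


def bh_from_counts(counts, m):
    """Largest k such that at least k P-values are <= k*alpha/m."""
    k_star = 0
    cumulative = 0
    for k in range(1, m + 1):
        cumulative += counts[k]
        if cumulative >= k:
            k_star = k
    return k_star


if __name__ == "__main__":
    pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
               0.06, 0.074, 0.205, 0.212, 0.216]
    m, alpha = len(pvalues), 0.05
    counts = [0] * (m + 1)
    for start in range(0, m, 4):  # process in arbitrary chunks of 4
        add_chunk(counts, pvalues[start:start + 4], m, alpha)
    k = bh_from_counts(counts, m)
    print("rejections:", k, "threshold:", k * alpha / m)
```

Every P-value at or below the printed threshold is rejected. Note that this sketch keeps a count array of length m + 1, so it illustrates the linear-time, order-free, chunk-combinable aspects but not the bounded-memory property claimed in the abstract.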