BACKGROUND: Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has recently been shown that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem.

OBJECTIVES: The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities.

METHODS: Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians.

RESULTS: Simulations demonstrate the validity of the methods. In the real-data applications, we show the accuracy and practicality of this approach. We provide sample code from R packages in which the probability estimation is already available, so all calculations can be performed using existing software.

CONCLUSIONS: Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Free implementations are available in R and may be used for applications.
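The idea of a probability machine can be illustrated with a small sketch: a nearest neighbor regression is fitted to the 0/1 response, and the mean response among the k nearest training points is read as an estimate of the individual probability. The example below is in plain Python rather than the R packages the abstract refers to. The function names `knn_probability` and `simulate`, the toy data model (one feature with true probability equal to the feature value), and the choice of k are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def knn_probability(train_x, train_y, query, k):
    """Estimate P(y = 1 | x = query) as the mean 0/1 response
    of the k training points nearest to the query."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    # Averaging the 0/1 labels is regression on a binary response,
    # so the result is a probability estimate rather than a class label.
    return sum(train_y[i] for i in nearest) / k


def simulate(n, rng):
    """Hypothetical toy data: one feature on [0, 1], true P(y = 1 | x) = x."""
    xs = [(rng.random(),) for _ in range(n)]
    ys = [1 if rng.random() < x[0] else 0 for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = simulate(2000, rng)
for x0 in (0.2, 0.5, 0.8):
    p_hat = knn_probability(train_x, train_y, (x0,), k=200)
    print(f"x = {x0:.1f}  true p = {x0:.2f}  estimated p = {p_hat:.3f}")
```

In this toy setting the estimates should lie close to the true probabilities. The consistency results mentioned in the abstract concern the large-sample limit, in which k grows with the sample size while remaining a vanishing fraction of it.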