Fei Deng1, Haijun Zhou2, Yong Lin3, John A Heim4, Lanlan Shen5, Yuan Li6, Lanjing Zhang7.
1. School of Electrical and Electronic Engineering, Shanghai Institute of Technology, China.
2. Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Houston, TX, USA.
3. Rutgers Cancer Institute of New Jersey, Rutgers, New Brunswick, NJ, USA; Department of Biostatistics, Rutgers School of Public Health, Piscataway, NJ, USA.
4. Princeton Medical Center, Plainsboro, NJ, USA.
5. Department of Pediatrics, Baylor College of Medicine, USDA/ARS Children's Nutrition Research Center, Houston, TX, USA.
6. Department of Pathology, Fudan University Shanghai Cancer Center, Shanghai, China; Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China. Electronic address: lumoxuan2009@163.com.
7. Rutgers Cancer Institute of New Jersey, Rutgers, New Brunswick, NJ, USA; Princeton Medical Center, Plainsboro, NJ, USA; Department of Biological Sciences, Rutgers University, Newark, NJ, USA; Department of Chemical Biology, Rutgers Ernest Mario School of Pharmacy, Rutgers University, Piscataway, Newark, USA. Electronic address: lanjing.zhang@rutgers.edu.
Abstract
BACKGROUND: Random forest (RF) is a widely used machine-learning algorithm that outperforms many other machine-learning algorithms in prediction accuracy, but it is rarely used to predict causes of death (COD) in cancer patients. Moreover, multicategory COD are difficult to classify in lung cancer patients, largely because the outcome has multiple labels rather than binary labels. METHODS: We tuned RF algorithms to classify 5-category COD among lung cancer patients in the Surveillance, Epidemiology, and End Results-18 (SEER-18) database whose lung cancers were diagnosed in 2004, a year chosen for the completeness of follow-up. The patients were randomly divided into training and validation sets (1:1 and 4:1 sample splits). We compared the prediction accuracy of the tuned RF and multinomial logistic regression (MLR) models. RESULTS: We included 42,257 qualified lung cancers in the database. The COD were lung cancer (72.41%), other causes or alive (14.43%), non-lung cancer (6.85%), cardiovascular disease (5.35%), and infection (0.96%). The tuned RF model with 300 iterations and 10 variables outperformed the MLR model (accuracy = 69.8% vs 64.6%, 1:1 sample split), while the 4:1 sample split produced lower prediction accuracy than the 1:1 sample split. The top-10 important factors in the RF model included sex, chemotherapy status, age (65+ vs < 65 years), radiotherapy status, nodal status, T category, histology type, and laterality, all of which except T category and laterality were also important in the MLR model. CONCLUSION: We tuned RF models to predict 5-category COD in lung cancer patients and showed that RF outperformed MLR in prediction accuracy. We also identified the factors associated with these COD.
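The sample splits and the accuracy metric described in METHODS can be illustrated with a minimal sketch. It uses only the Python standard library and makes assumptions the abstract does not confirm: the helper names `split_indices` and `accuracy`, the fixed seed, and the use of a simple unstratified shuffle are all hypothetical, and the abstract does not state how the authors implemented their split. Only the cohort size (42,257) and the 1:1 and 4:1 ratios come from the text.

```python
import random


def split_indices(n_patients, train_parts, valid_parts, seed=0):
    """Shuffle patient indices and split them train:valid in the given ratio.

    Hypothetical helper: an unstratified random split, assumed for illustration.
    """
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic keeps the training-set size exact and reproducible.
    n_train = n_patients * train_parts // (train_parts + valid_parts)
    return indices[:n_train], indices[n_train:]


def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted COD category matches the recorded one."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


N_PATIENTS = 42257  # cohort size reported in the abstract

train_1to1, valid_1to1 = split_indices(N_PATIENTS, 1, 1)
train_4to1, valid_4to1 = split_indices(N_PATIENTS, 4, 1)

print(len(train_1to1), len(valid_1to1))  # 21128 21129
print(len(train_4to1), len(valid_4to1))  # 33805 8452
```

With these sizes, the 4:1 split leaves 8,452 patients for validation versus 21,129 under the 1:1 split, so accuracy under the two splits is estimated on validation sets of quite different size.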
Authors: Jasjit S Suri; Mrinalini Bhagawati; Sudip Paul; Athanasios D Protogerou; Petros P Sfikakis; George D Kitas; Narendra N Khanna; Zoltan Ruzsa; Aditya M Sharma; Sanjay Saxena; Gavino Faa; John R Laird; Amer M Johri; Manudeep K Kalra; Kosmas I Paraskevas; Luca Saba Journal: Diagnostics (Basel) Date: 2022-03-16