You Won Lee (1), Jae Woo Choi (2), Eun-Hee Shin (3).
1. Department of Tropical Medicine and Parasitology, Seoul National University College of Medicine and Institute of Endemic Diseases, Seoul, 03080, Republic of Korea.
2. Department of Pharmacology, Yonsei University College of Medicine, Seoul, 03722, Republic of Korea; Severance Biomedical Science Institute, Yonsei University College of Medicine, Seoul, 03722, Republic of Korea.
3. Department of Tropical Medicine and Parasitology, Seoul National University College of Medicine and Institute of Endemic Diseases, Seoul, 03080, Republic of Korea; Seoul National University Bundang Hospital, Seongnam, 13620, Republic of Korea. Electronic address: ehshin@snu.ac.kr.
Abstract
BACKGROUND: Rapid diagnosis is crucial for controlling malaria. Various studies have aimed to develop machine learning models that diagnose malaria from blood smear images; however, this approach has many limitations. This study developed a machine learning model for malaria diagnosis using patient information. METHODS: To construct the datasets, we extracted patient information from PubMed abstracts published from 1956 to 2019. We used two datasets: a dataset containing only parasitic diseases, and a total dataset that adds information about other diseases. We compared six machine learning models: support vector machine, random forest (RF), multilayer perceptron, AdaBoost, gradient boosting (GB), and CatBoost. In addition, the synthetic minority oversampling technique (SMOTE) was employed to address the data imbalance problem. RESULTS: On the parasitic disease dataset, RF was the best model whether or not SMOTE was used. On the total dataset, GB was the best model without SMOTE, whereas RF performed best after SMOTE was applied. On the imbalanced data, nationality was the most important feature for malaria prediction. On the data balanced with SMOTE, the most important feature was symptom. CONCLUSIONS: The results demonstrate that machine learning techniques can be successfully applied to predict malaria using patient information.