
A Bayesian classification model for discriminating common infectious diseases in Zhejiang province, China.

Fudong Li¹, Yi Shen², Duo Lv³, Junfen Lin¹, Biyao Liu¹, Fan He¹, Zhen Wang¹.

Abstract

To develop a classification model for accurately discriminating common infectious diseases in Zhejiang province, China. Symptoms and signs, abnormal lab test results, epidemiological features, as well as the incidence rates were treated as predictors, and were collected from the published literature and a national surveillance system of infectious disease. A classification model was established using a naïve Bayesian classifier. A dataset from historical outbreaks was used for model validation, and sensitivity, specificity, accuracy, area under the receiver operating characteristic curve (AUC) and M-index were presented. A total of 146 predictors were included in the classification model, for discriminating 25 common infectious diseases. The sensitivity ranged from 44.44% for hepatitis E to 96.67% for measles. The specificity varied from 96.36% for dengue fever to 100% for 5 diseases. The median of total accuracy was 97.41% (range: 93.85%-99.04%). The AUCs exceeded 0.98 in 11 of 12 diseases, the exception being dengue fever (0.613). The M-index was 0.960 (95% CI 0.941-0.978). A novel classification model was constructed based on a Bayesian approach to discriminate common infectious diseases in Zhejiang province, China. After entering symptoms and signs, abnormal lab test results, epidemiological features and city of disease origin, an output list of possible diseases ranked according to the calculated probabilities can be provided. The discrimination performance was reasonably good, making it useful in epidemiological applications.

Year: 2020 PMID: 32080115 PMCID: PMC7034623 DOI: 10.1097/MD.0000000000019218

Source DB: PubMed Journal: Medicine (Baltimore) ISSN: 0025-7974 Impact factor: 1.817

Introduction

Zhejiang province is located on the southeast coast of China and has high incidence rates of many infectious diseases. In recent years, outbreaks of infectious diseases have remained common in many counties. Therefore, the prevention and control of infectious diseases is a public health priority in Zhejiang. Early detection and efficient control are very important for preventing further spread of disease. As a consequence, epidemiological field investigators must determine the cause of an outbreak as soon as possible. However, this can be quite challenging because the available information is always limited in the early phase of an outbreak. In many cases, epidemiological field investigators make a diagnosis based on personal experience and understanding of various diseases, with a possibility of misdiagnosis. Timely and accurate diagnoses are vital to ensuring that the proper measures for disease control are administered and new cases are minimized.

Recently, data mining techniques, including Bayesian classifiers, decision tree classifiers and neural network classifiers, have been widely utilized for discriminating or diagnosing diseases. The infectious disease diagnosis module within the well-known Global Infectious Disease and Epidemiology Network (GIDEON) was developed based on the Bayesian formula. A decision tree was applied to psychiatric diagnosis. Cualing et al and Shaw also used this technique to assist with diagnosing various diseases. Moreover, a neural network technique was used to establish a system for diagnosing acute myocardial infarction. These studies supported the utility of disease classification. However, many of them were based on clinical data of individual cases, and the models in these studies are typically designed to assist medical professionals with clinical diagnoses. Limited research is available within the public health field to assist in determining the causative disease or pathogen of an epidemic or outbreak.

Last, although the incidence rates of infectious diseases are relatively high in Zhejiang, no such study has been performed for this region. Given that the Bayesian classifier possesses high predictive accuracy and is suitable for large databases, this study utilized the Bayesian method to construct a classification model for discriminating common infectious diseases in Zhejiang province. The model was designed to provide epidemiological field investigators with an artificial intelligence (AI)-based, efficient method for discriminating various infectious diseases in the early phases of an epidemic or outbreak. It was expected to promote timely implementation of appropriate control measures and to provide potential clues for narrowing the range of laboratory pathogen screening.

Methods

Classification algorithm

A naïve Bayes algorithm was used in this study. There is a group of diseases that contains j types of disease: D = {D1, D2 … Dj}. The prior probabilities of these diseases are P(D1), P(D2) … P(Dj). Furthermore, there are k attributes or predictors, including symptoms and signs, abnormal lab test results and epidemiological features, that form a set of attributes: S = {S1, S2 … Sk}. The conditional probabilities of these attributes when a certain disease exists are P(S1|Dj), P(S2|Dj) … P(Sk|Dj). When a patient presents n attributes, which form a set of present attributes S = {S1, S2 … Sn}, the posterior probability of a disease for this patient, according to the Bayesian formula and the naïve assumption that attributes are conditionally independent given the disease, would be:

$$P(D_f \mid S) = \frac{P(D_f)\,\prod_{i=1}^{n} P(S_i \mid D_f)}{\sum_{m=1}^{j} P(D_m)\,\prod_{i=1}^{n} P(S_i \mid D_m)}, \qquad f = 1, 2, \ldots, j.$$

P(Df|S) is the probability of the fth type of disease given the presence of attribute set S. The probability of the disease depends on the value of P(Df|S). That is, if the value of P(Dg|S) is the highest of all j posterior probabilities, then the likelihood of the gth type of disease is the highest given attribute set S, and Dg is the maximum likelihood diagnosis. Finally, all possible diseases are ranked by posterior probability from highest to lowest and presented on the output list. A flow chart of the algorithm is shown in Figure 1. The model was implemented in SAS (V.9.3, SAS Institute).

Figure 1

Flow chart of the algorithm.
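The ranking step described in the classification algorithm can be illustrated with a short Python sketch. This is not the authors' SAS implementation; the function name `rank_diseases`, the `floor` value used for attributes with no recorded frequency, and all numbers in the toy example are assumptions made only for illustration.

```python
from math import exp, log


def rank_diseases(priors, cond_probs, present, floor=1e-6):
    """Rank diseases by naive Bayes posterior given the attributes present.

    priors:     {disease: P(D)}
    cond_probs: {disease: {attribute: P(S | D)}}
    present:    attributes observed in the patient
    floor:      stand-in probability for attributes with no recorded
                frequency (an assumption of this sketch, not from the paper)
    """
    log_scores = {}
    for disease, prior in priors.items():
        score = log(prior)
        for attr in present:
            score += log(max(cond_probs[disease].get(attr, floor), floor))
        log_scores[disease] = score

    # Normalize in log space so long products do not underflow.
    peak = max(log_scores.values())
    weights = {d: exp(s - peak) for d, s in log_scores.items()}
    total = sum(weights.values())
    posteriors = {d: w / total for d, w in weights.items()}
    return sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)


# Toy numbers, invented for illustration only.
priors = {"measles": 0.2, "dengue fever": 0.1, "bacillary dysentery": 0.7}
cond_probs = {
    "measles": {"fever": 0.9, "rash": 0.9, "diarrhea": 0.1},
    "dengue fever": {"fever": 0.9, "rash": 0.5, "diarrhea": 0.1},
    "bacillary dysentery": {"fever": 0.6, "rash": 0.01, "diarrhea": 0.95},
}
ranking = rank_diseases(priors, cond_probs, ["fever", "rash"])
for disease, posterior in ranking:
    print(f"{disease}: {posterior:.3f}")
```

In this toy case the less common measles outranks the more common bacillary dysentery because the observed rash is far more likely under measles, which is the trade-off between prior and conditional probabilities that the formula encodes.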

Data collection

The prior probability, P(Dj), was estimated according to the incidence rates of all included infectious diseases in every city of Zhejiang province. The incidence data was collected from the China Information System for Disease Control and Prevention (CISDCP), a national surveillance system to which medical institutions report infectious diseases in real time. The conditional probability, P(Sk|Dj), was estimated based on the frequency of the corresponding attribute among individuals with each disease. Epidemiological features were collected from CISDCP. The symptoms and signs, as well as abnormal lab test results, for each specific disease were derived from the epidemiology literature. The data, including the total number of patients, the number of patients with each symptom or sign, and the number of patients with each abnormal lab test result, were abstracted from each article to calculate weighted frequencies.

We systematically searched the following Chinese databases: the China Knowledge Resource Integrated Database (www.cnki.net), Wanfang Data (wanfangdata.com.cn), VIP Journal Integration Platform (www.cqvip.com) and China Biology Medicine disc (www.sinomed.ac.cn). The search terms included “epidemic investigations” OR “outbreak investigation” AND the name of each infectious disease. The references in published articles were also searched. Initially, titles and abstracts were screened to exclude ineligible studies. Then the full texts of all remaining studies were reviewed. The literature screening procedure is presented in Figure 2.

Figure 2

Flow diagram of literature screening procedure. CNKI = China Knowledge Resource Integrated Database; VIP = VIP Journal Integration Platform; CBMdisc = China Biology Medicine disc.
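The estimation of conditional probabilities from aggregated literature counts can be sketched as follows. The text states only that weighted frequencies were calculated; weighting each study by its number of patients (that is, pooling the counts) is an assumption of this sketch, as are the function name `pooled_frequency`, the record layout, and the toy numbers.

```python
def pooled_frequency(studies, attribute):
    """Sample-size-weighted frequency of an attribute across studies.

    Each study is a dict with 'n_patients' and 'counts'
    ({attribute: number of patients showing it}).
    Studies that do not report the attribute are skipped, which is
    an assumption of this sketch.
    """
    cases = 0
    patients = 0
    for study in studies:
        if attribute in study["counts"]:
            cases += study["counts"][attribute]
            patients += study["n_patients"]
    if patients == 0:
        return None  # attribute never reported for this disease
    return cases / patients


# Two hypothetical outbreak reports for one disease.
studies = [
    {"n_patients": 40, "counts": {"fever": 36, "rash": 30}},
    {"n_patients": 60, "counts": {"fever": 48}},
]
fever = pooled_frequency(studies, "fever")  # (36 + 48) / (40 + 60)
rash = pooled_frequency(studies, "rash")    # 30 / 40
print(fever, rash)
```

Pooling counts is equivalent to averaging the per-study frequencies with weights proportional to study size, so larger outbreak reports contribute more to the estimate of P(Sk|Dj).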

Model validation

A dataset from historical outbreaks was used to validate the model. The data was collected from several outbreak investigations of infectious diseases in Zhejiang province. During each investigation, epidemiological and clinical data was collected for each patient by trained investigators using a standard questionnaire. Most symptoms and signs were recorded in a dichotomized way (yes/no). General laboratory results were documented as continuous variables, with a threshold of normality where available. Infectious etiology was determined according to strict case definitions from the Chinese Guideline of Diagnosis and Treatment for the corresponding diseases, issued by the National Health Commission of the People's Republic of China.

The sensitivity, specificity, total accuracy, and area under the receiver operating characteristic curve (AUC) have been widely used as criteria for evaluating a diagnosis model. The following terms are fundamental to understanding them. True positive (TP): the patient has a disease and the prediction is positive. False positive (FP): the patient does not have a disease but the prediction is positive. True negative (TN): the patient does not have a disease and the prediction is negative. False negative (FN): the patient has a disease but the prediction is negative.

The sensitivity of a diagnosis model refers to the ability of the model to correctly identify those patients with the disease:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

The specificity of a diagnosis model refers to the ability of the model to correctly identify those patients without the disease:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

The accuracy of a diagnosis model refers to the ability of the model to correctly identify patients both with and without the disease:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The receiver operating characteristic (ROC) plot expresses the relationship between sensitivity and 1 - specificity. The closer the ROC curve is to the upper-left corner, the better the model.
The AUC can take any value between 0 and 1 and is a good indicator of the goodness of the model. We primarily employed these four parameters to assess the discrimination performance of the model. These parameters are usually applied to binary outcomes. Because the model was designed to discriminate among various diseases (polytomous outcomes), we obtained these parameters for category i by comparing category i with all other categories combined (1-vs-rest measure). Additionally, we evaluated the M-index, a pairwise approach that averages all pairwise AUCs, where each pairwise AUC measures the discrimination between two categories. The M-index has been suggested to be independent of category prevalence, with values of 0.5 and 1 representing random and perfect discrimination, respectively. All results were presented as point estimates with 95% confidence intervals (CIs).
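The two validation measures can be sketched in Python as below. The 1-vs-rest function follows the TP/FP/TN/FN definitions directly. For the M-index, the sketch assumes the common Hand and Till style formulation, in which each pairwise AUC is the average of the two directional AUCs for that pair; the text does not spell out the exact formula, so this is an assumption, and all function names and toy data are hypothetical. Confidence intervals are not computed here.

```python
from itertools import combinations


def one_vs_rest(y_true, y_pred, category):
    """Sensitivity, specificity and accuracy of `category` vs all others.

    Assumes the category and the rest group both occur in y_true.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == category and pred == category:
            tp += 1
        elif truth != category and pred == category:
            fp += 1
        elif truth == category:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties = 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def m_index(y_true, probs):
    """Mean over all category pairs of the averaged directional AUCs.

    probs: one {category: posterior probability} dict per patient.
    """
    pair_aucs = []
    for a, b in combinations(sorted(set(y_true)), 2):
        in_a = [p for t, p in zip(y_true, probs) if t == a]
        in_b = [p for t, p in zip(y_true, probs) if t == b]
        a_vs_b = auc([p[a] for p in in_a], [p[a] for p in in_b])
        b_vs_a = auc([p[b] for p in in_b], [p[b] for p in in_a])
        pair_aucs.append((a_vs_b + b_vs_a) / 2)
    return sum(pair_aucs) / len(pair_aucs)


# Toy validation set with three categories (invented numbers).
y_true = ["A", "A", "B", "B", "C", "C"]
probs = [
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.4, "B": 0.5, "C": 0.1},
    {"A": 0.2, "B": 0.6, "C": 0.2},
    {"A": 0.3, "B": 0.5, "C": 0.2},
    {"A": 0.1, "B": 0.2, "C": 0.7},
    {"A": 0.2, "B": 0.2, "C": 0.6},
]
y_pred = [max(p, key=p.get) for p in probs]  # 1st-ranked disease per patient
metrics_a = one_vs_rest(y_true, y_pred, "A")
m = m_index(y_true, probs)
print(metrics_a, round(m, 4))
```

Because each pairwise AUC only compares patients from the two categories involved, the average is not driven by whichever categories happen to be most common in the validation set, which is the motivation the text gives for reporting the M-index alongside the 1-vs-rest measures.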

Results

The initial search identified 2963 potentially relevant articles on 25 infectious diseases. After screening for duplicate records and reviewing titles, abstracts and full texts, 2400 articles were excluded. Finally, 563 articles were included for further data extraction. Table 1 shows the 25 included infectious diseases together with information on the included literature. Frequencies of symptoms and signs and of abnormal lab test results were derived for estimation of conditional probabilities (see supplementary material). Meanwhile, data on incidence rates and epidemiological features (age and gender) for each disease was collected to establish the database for model construction.

Table 1

Infectious diseases included in the model.

A dataset from historical outbreaks in Zhejiang province involving 12 diseases was used for model validation. Patient characteristics of the validation dataset are summarized in Table 2. A total of 520 cases were included in the validation dataset. The sample sizes of the 12 diseases ranged from 13 to 93. The mean age of all patients was 22.37 years, with the highest mean age (71.06 years) in those diagnosed with hepatitis E. Overall, 66.92% (348/520) were male, giving a male-to-female ratio of 2.02:1. The calendar year of the outbreaks varied from 2005 to 2012. For the majority of diseases, the data came from one outbreak.

Table 2

Patient characteristics of the validation dataset.

The validation results are presented in Table 3. The highest sensitivity of the model was achieved for measles (96.67%), and the lowest for hepatitis E (44.44%). The specificity varied from 96.36% for dengue fever to 100% for 5 diseases: leptospirosis, acute hemorrhagic conjunctivitis, epidemic cerebrospinal meningitis, hepatitis E, and epidemic hemorrhagic fever. The median of total accuracy was 97.41% (range: 93.85% for dengue fever to 99.04% for bacillary dysentery). The AUCs exceeded 0.98 in 11 of the 12 diseases; the exception was dengue fever (0.613). The M-index (0.960, 95% CI 0.941–0.978) was very close to 1, which also indicated high discrimination performance of the model.

Table 3

Validation results of the classification model.

Discussion

A novel classification model was established for discriminating common infectious diseases in this study. The model can diagnose 25 common infectious diseases in Zhejiang province based on symptoms and signs, abnormal lab test results, epidemiological features, and incidence rates. Using standard validation methods, we confirmed that the model had good discrimination performance.

The Bayesian approach is widely adopted in epidemiological and clinical studies that develop discrimination or diagnostic models, owing to its capability in classifying multiple categories and its suitability for large databases. The infectious disease diagnosis module contained in the well-known Global Infectious Disease and Epidemiology Network (GIDEON) was developed based on the Bayesian formula. An evaluation study of GIDEON showed an accuracy of 64% among 129 fevers with infectious etiology. Another study indicated that the correct diagnosis ranked first in 52% of cases when diagnosing febrile illnesses in Japanese returning travelers. Although the accuracy of GIDEON is acceptable, better predictive performance is needed. Furthermore, the data on symptoms and signs in GIDEON does not perfectly match those of the Chinese population, limiting its application in China. Therefore, some researchers tried to construct classification models specifically for the Chinese population. Unfortunately, the predictive accuracies were unsatisfactory. Moreover, the models designed in those studies were not validated appropriately. Most of them did not conduct ROC-AUC analysis and provided poor statistical description (e.g., lack of confidence intervals). In our study, the model was established based on data from the Chinese population, which allows it to be used in field investigations of infectious disease outbreaks in China. Under standard validation methods, the model showed strong discrimination performance.
The major problem in developing an infectious disease diagnosis program is the difficulty of obtaining reliable and accurate individual-level training data. For the majority of statistical approaches, it is essential to acquire an adequate sample of individual-level data within each category or disease. This is unrealistic, particularly when many outcome categories or diseases are considered in the modeling process. Fortunately, the naïve Bayesian algorithm can overcome this challenge, because aggregated data rather than individual-level data is sufficient for modeling. In our study, the conditional probabilities of symptoms and signs and of abnormal lab test results were estimated from the frequencies of the corresponding predictors reported in the literature. Each report of an epidemic or outbreak investigation covered a number of patients. Consequently, it could be assumed that a large enough sample size was achieved for each disease (see Table 1).

Selecting the most likely diagnoses among the multiplicity of possible diseases is another major challenge for modeling, because some symptoms and signs are quite similar among diseases that affect the same organ systems. The majority of diseases could be correctly discriminated by our model. However, the sensitivity was below 50% for hepatitis E and dengue fever, although sensitivity alone cannot entirely reflect discrimination performance. Patients with hepatitis E may present with few clinical features. In our validation dataset, more than one third of patients with hepatitis E (7/18) were asymptomatic, and the model ranked pulmonary tuberculosis first for them. It is worth noting that the correct diagnosis of hepatitis E remained in the top 3 ranking for all these patients. Furthermore, the symptoms and signs of dengue fever are partially nonspecific, so that 6 other diseases were incorrectly ranked first.
The correct diagnosis of dengue fever appeared in top 3 ranking for 59.26% (15/28). Based on our results, we consider the ranked output list of diseases helpful for users. Many previous studies used the appearance of the correct diagnosis on the differential diagnosis list or in the top 5 ranking as arbitrary indicators for evaluation. The validation results for 1st ranking performance in our model seem somewhat better than those obtained with more tolerant indicators in previous studies. Besides the sensitivity, the other validation parameters demonstrated good discriminative capability of the model.

There are several advantages in this study. Quantitative results are provided on the output list of the model. All possible diseases can be listed and ranked from the highest probability to the lowest. Existing medical decision-support programs are often inadequate in achieving a match to the most likely diagnosis. It has been suggested that a given disease is usually retained in the top 5 ranking when its probability exceeds 1%. As a consequence, the list of predicted diagnoses is valuable in reminding users of alternative diseases that might otherwise have been ignored. We further assessed the top 3 ranking performance of our model, in which a correct diagnosis within the top 3 ranking was treated as correct discrimination. The sensitivity reached 100% in 7 of the 12 diseases, and also increased in the remaining 5 diseases compared with using the 1st ranking as correct discrimination. Various types of information were utilized as predictors to discriminate the causative disease, including the incidence rates and hundreds of symptoms and signs, abnormal lab test results, and epidemiological features. The Bayesian algorithm used in our study is suitable for such a large database with many predictors.
Meanwhile, by incorporating prior information on disease incidence, Bayesian classifiers have the potential to estimate disease probability better than other common machine-learning methods. Data on incidence rates and epidemiological features was collected from a national surveillance system of infectious disease, which supports the data quality. In addition, the method for obtaining conditional probabilities ensured a sufficient sample size for modeling. Standard statistical methods were utilized to validate the discrimination performance of the model, encompassing sensitivity, specificity, total accuracy, AUC and M-index. Because the 1-vs-rest measure used to calculate the former four parameters may be dominated by highly prevalent categories in the rest group, the M-index was calculated as an alternative measure in this study. Both measures demonstrated that our model achieved a notably high level of discriminative ability across multiple infectious diseases.

Several methodological issues and limitations need to be mentioned. First, the validation dataset involved cases of only 12 diseases, owing to the limited individual-level data we were able to collect. Therefore, the discrimination performance could not be evaluated for the other diseases, although the validation results were satisfactory for the current 12 diseases. More validation data on other diseases is needed. Second, the model can discriminate only the infectious diseases already included in the database. Data on other diseases can be added in future to extend the application range of the model. Third, real-time updates of incidence rates should be carried out in future uses of the model. Meanwhile, data on conditional probabilities also needs to be updated regularly from the latest literature. Fourth, the conditional probabilities of symptoms and signs as well as abnormal lab results of each disease differ between countries.
Since the model was designed for application in the Chinese population, only data reported in the Chinese literature was used. Moreover, incidence rates for Zhejiang province were used for modeling; the model could be applied nationwide if national incidence rates were used instead. Fifth, the prior probabilities (incidence rates) may vary during different stages of an outbreak, while our model assumed that the prior probability was constant within a specific event. Last, the data on conditional probabilities was collected from different sources, and the quality of the literature may not always be perfect for all diseases. Nevertheless, the publishing process of the included literature at least provides some assurance that the data quality is adequate for model construction.
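The top 3 ranking evaluation mentioned in this discussion, in which a prediction counts as correct when the true disease appears among the first three entries of the output list, can be sketched as follows. The function name `top_k_sensitivity` and the toy rankings are hypothetical and do not reproduce the study's data.

```python
def top_k_sensitivity(y_true, rankings, disease, k=3):
    """Share of patients with `disease` whose true diagnosis is in the top k.

    rankings: one list of disease names per patient, ordered from the
    highest to the lowest posterior probability.
    Assumes at least one patient in y_true has the disease.
    """
    hits = 0
    cases = 0
    for truth, ranked in zip(y_true, rankings):
        if truth == disease:
            cases += 1
            if disease in ranked[:k]:
                hits += 1
    return hits / cases


# Toy output lists for four patients (invented for illustration).
y_true = ["hepatitis E", "hepatitis E", "measles", "hepatitis E"]
rankings = [
    ["hepatitis E", "measles", "dengue fever", "mumps"],
    ["pulmonary tuberculosis", "measles", "hepatitis E", "mumps"],
    ["measles", "rubella", "dengue fever", "mumps"],
    ["pulmonary tuberculosis", "mumps", "measles", "hepatitis E"],
]
top1 = top_k_sensitivity(y_true, rankings, "hepatitis E", k=1)
top3 = top_k_sensitivity(y_true, rankings, "hepatitis E", k=3)
print(top1, top3)
```

Setting `k=1` recovers the 1st ranking sensitivity, so the same function shows how much sensitivity is gained by relaxing the criterion to the top 3.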

Conclusion

In this study, we constructed a classification model based on a Bayesian classifier to discriminate common infectious diseases within Zhejiang province. After entering symptoms and signs, abnormal lab test results, epidemiological features and city of disease origin, the probabilities of diseases can be calculated and an output list of possible diseases ranked from the highest to the lowest probability can be provided. The model showed good discrimination performance in validation and is expected to help epidemiological field investigators determine the cause of an outbreak and to provide clues for laboratory pathogen screening.

Ethical approval and consent to participate

The data on incidence rates and epidemiologic features was collected from CISDCP. It was exempt from the requirement for ethical approval and informed consent according to the Law of the People's Republic of China on Prevention and Treatment of Infectious Diseases. The data on symptoms and signs and abnormal lab test results was collected from previously published literature; thus, no ethical approval or informed consent was required. The data for model validation was collected during investigations in response to public health emergencies. As such, it was also exempt from the requirement for ethical approval and informed consent under the same law. Personal details of patients were anonymized and de-identified prior to analysis.

Acknowledgments

We wish to thank the investigators from Hangzhou Municipal Center for Disease Control and Prevention, Shaoxing Municipal Center for Disease Control and Prevention, and Jinhua Municipal Center for Disease Control and Prevention for their investigation and data collection.

Author contributions

Conceptualization: Yi Shen, Fan He, Zhen Wang. Data curation: Biyao Liu. Formal analysis: Fudong Li, Duo Lv. Funding acquisition: Fudong Li, Zhen Wang. Investigation: Junfen Lin, Biyao Liu. Methodology: Fudong Li, Yi Shen. Project administration: Zhen Wang. Resources: Zhen Wang. Software: Duo Lv. Supervision: Yi Shen, Fan He, Zhen Wang. Validation: Junfen Lin. Visualization: Biyao Liu. Writing – original draft: Fudong Li. Writing – review & editing: Yi Shen, Junfen Lin, Zhen Wang.

15 in total

1. Global Infectious Diseases and Epidemiology Network (GIDEON): a world wide Web-based program for diagnosis and informatics in infectious diseases.

Authors: Stephen C Edberg
Journal: Clin Infect Dis Date: 2004-12-06 Impact factor: 9.079

2. Use of the computer program GIDEON at an inpatient infectious diseases consultation service.

3. Evaluation of the GIDEON expert computer program for the diagnosis of imported febrile illnesses.

4. Assessing the discriminative ability of risk models for more than two outcome categories.

5. The epidemiologic field investigation: science and judgment in public health practice.

6. The reliability of a decision tree technique applied to psychiatric diagnosis.

7. A decision tree approach to psychodiagnosis. The diagnosis of abnormal behaviour.

8. Diagnosis of febrile illnesses in returned travelers using the PC software GIDEON.

9. Use of an artificial neural network for the diagnosis of myocardial infarction.

10. Emergence and control of infectious diseases in China.

Review 1. Modern Machine-Learning Predictive Models for Diagnosing Infectious Diseases.