Behrouz Ehsani-Moghaddam1, John A Queenan1, Jennifer MacKenzie2, Richard V Birtwhistle1.
Abstract
Identifying patients with rare diseases associated with common symptoms is challenging. Hunter syndrome, or mucopolysaccharidosis type II (MPS II), is a progressive rare disease caused by a deficiency in the activity of the lysosomal enzyme iduronate 2-sulphatase. It is inherited in an X-linked manner, so males are significantly affected. Expression in females varies; the majority are unaffected, although symptoms may emerge over time. We developed a Naïve Bayes classification (NBC) algorithm that uses the clinical diagnoses and symptoms of patients contained in their de-identified, unstructured electronic medical records (EMR) extracted by the Canadian Primary Care Sentinel Surveillance Network (CPCSSN). To do so, we created a training dataset from published results in the scientific literature and from all MPS II symptoms, then applied the training dataset and its independent features to compute the conditional posterior probabilities of having MPS II disease, treated as a categorical dependent variable, for 506497 male patients. The classifier identified 125 patients with the highest likelihood of having the disease, and 18 features were selected as necessary for forecasting. Next, a Recursive Backward Feature Elimination algorithm was employed to find the optimal input features of the NBC model, using k-fold cross-validation with 3 replicates. The accuracy of the final model was estimated by the Validation Set Approach technique and bootstrap resampling. We also investigated whether the NBC is as accurate as three other Bayesian networks. The Naïve Bayes Classifier appears to be an efficient algorithm for assisting physicians with the diagnosis of Hunter syndrome, allowing optimal patient management.
Year: 2018 PMID: 30566525 PMCID: PMC6300265 DOI: 10.1371/journal.pone.0209018
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
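The posterior computation described in the abstract can be illustrated with a minimal Bernoulli Naïve Bayes sketch over dichotomous symptom indicators. This is not the authors' implementation: the function names, the Laplace smoothing parameter `alpha`, and the toy data are all assumptions made for illustration.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and P(feature = 1 | class) with Laplace smoothing.

    alpha is a hypothetical smoothing choice, not taken from the paper.
    """
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Smoothed probability that each symptom is present within class c.
        theta = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, theta)
    return model

def posterior(model, x):
    """Return P(class | x) for every class, computed in log space."""
    log_joint = {}
    for c, (prior, theta) in model.items():
        lp = math.log(prior)
        for xj, tj in zip(x, theta):
            # Features are assumed conditionally independent given the class.
            lp += math.log(tj if xj else 1.0 - tj)
        log_joint[c] = lp
    # Normalise with the log-sum-exp trick to avoid underflow.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return {c: math.exp(v - m) / total for c, v in log_joint.items()}

# Hypothetical toy data: columns are (hearing loss, hernia, otitis);
# label 1 = MPS II, label 0 = no MPS II.
X_train = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 0, 0],
           [0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 0, 0]]
y_train = [1, 1, 1, 0, 0, 0, 0, 0]

model = fit_bernoulli_nb(X_train, y_train)
p_all_symptoms = posterior(model, [1, 1, 1])
p_no_symptoms = posterior(model, [0, 0, 0])
```

In this toy setting a patient with all three symptoms receives a higher posterior for the disease class than a patient with none, which is the ranking behaviour used to shortlist candidates.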
Fig 1. A screenshot of the billing table from SQL server containing unstructured data from patients.
Fig 2. A screenshot of the MPS II dataset containing all symptoms from patients with dichotomous observations.
Fig 3. Normal Q-Q plot of MPS II index from patients 21 years old or younger.
Red line represents a distribution reference line with μo equal to the sample mean for a normal distribution.
Fig 4. Normal Q-Q plot of MPS II index from patients older than 21.
Red line represents a distribution reference line with μo equal to the sample mean for a normal distribution.
Fig 5. The importance of features for MPS II disease forecasting by the NBC algorithm, estimated using a ROC curve analysis conducted for each attribute.
Symptom combinations for potential patients identified as having MPS II disease by the NBC algorithm.
Only combinations with an incidence of 1.6% or higher are presented here.
| Hearing | Otitis | COPD | Hernia | Cardiac | Respiratory | Diarrhea | Apnea | Carpal | Spinal Injury | Skin | Hepatosplenomegaly | Seizure | Joint | Stature | % | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | x | x | x | 6.4 | ||||||||||||
| x | x | x | x | 2.4 | ||||||||||||
| x | x | x | x | x | 2.4 | |||||||||||
| x | x | x | x | x | 2.4 | |||||||||||
| x | x | x | x | 1.6 | ||||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| x | x | x | x | 1.6 | ||||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| x | x | x | x | 1.6 | ||||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| x | x | x | x | x | 1.6 | |||||||||||
| 58 | 58 | 47 | 45 | 45 | 42 | 39 | 38 | 35 | 33 | 19 | 13 | 11 | 4 | 2 | ||
| 11.9 | 11.9 | 9.6 | 9.2 | 9.2 | 8.6 | 8.0 | 7.8 | 7.2 | 6.7 | 3.9 | 2.7 | 2.2 | 0.8 | 0.4 | 100 |
Features and their associated symptoms in MPS II disease.
The features retained in the final NBC model are shown in bold.
| Feature name | Symptom/Description |
|---|---|
| Short stature, contracture, coarse facial features, congenital Musculoskeletal | |
| Joint pain, joint stiffness | |
| Sleep apnea | |
| COPD, airway obstruction | |
| Progressive hearing loss | |
| Spinal cord injury, spinal stenosis, compression, dysostosis, congenital musculoskeletal | |
| Umbilical hernia, inguinal hernia | |
| Chronic ear infections, AOM, otitis | |
| Respiratory infection | |
| Carpal tunnel syndrome | |
| Cardiac disease, heart valve problem, cardiac problem, ventricular hypertrophy | |
| Hepatosplenomegaly, hepatomegaly, enlarged liver, splenomegaly, enlarged spleen | |
| Pebbly skin lesion, thickened skin | |
| Seizure | |
| Diarrhea | |
| Patient’s age (1 if younger than 21, 0 otherwise) | |
| Macrocephaly, enlarged head | |
| Vision problem, reduced vision or visual problems | |
| Pneumonia | Recurrent pneumonia |
| Bladder | Bladder obstruction |
Accuracy and Kappa values of features in the NBC model, derived from the Recursive Backward Feature Elimination algorithm, and their positive predictive values (PPV).
| Variables | Accuracy | Kappa | Accuracy SD | Kappa SD | PPV |
|---|---|---|---|---|---|
| 0.9847 | 0.8872 | 0.01 | 0.08 | 0.93 | |
| 0.9863 | 0.8883 | 0.01 | 0.09 | 0.49 | |
| 0.9898 | 0.9113 | 0.01 | 0.08 | 0.51 | |
| 0.9904 | 0.9179 | 0.01 | 0.06 | 0.92 | |
| 0.9904 | 0.9179 | 0.01 | 0.06 | 0.15 | |
| 0.9904 | 0.9179 | 0.01 | 0.06 | 0.78 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.84 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.84 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.84 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.49 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.62 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.70 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.86 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.27 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.15 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.15 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.52 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.29 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.12 | |
| 0.9885 | 0.8986 | 0.01 | 0.09 | 0.01 |
* Positive predictive value
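For readers unfamiliar with the Kappa and PPV columns, the following sketch shows how both are usually computed from a 2 × 2 confusion matrix. The function name and the counts are hypothetical and are not taken from the study; the paper does not state how its values were derived beyond the table itself.

```python
def kappa_and_ppv(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa, PPV) for a 2 x 2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    # Kappa is undefined when chance agreement is already perfect.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")
    # PPV: share of predicted positives that are true positives.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return observed, kappa, ppv

# Hypothetical counts, for illustration only.
accuracy, kappa, ppv = kappa_and_ppv(tp=40, fp=10, fn=5, tn=45)
```

Kappa discounts the agreement that would occur by chance, which matters for a rare disease because a classifier that always predicts "no" already achieves very high accuracy.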
The NBC model performance.
The 2 × 2 contingency table displays the performance evaluation using bootstrapped resampling (n = 1000) and the Validation Set Approach technique on the test dataset. The optimal model was selected as the one with the highest accuracy.
| Actual | |||
|---|---|---|---|
| Predicted | No | Yes | Row Total |
| 357251 | 7815 | 365066 | |
| 0 | 18997 | 18997 | |
| 357251 | 26992 | 384063 | |
| 0.99 | |||
| 0.91 | |||
| 0.84 | |||
| 1.0 | |||
* Estimated by the Validation Set Approach technique
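Bootstrap resampling of accuracy, as named in the caption above, can be sketched as follows. This is a generic illustration under stated assumptions: the function name, the seed, and the labels are hypothetical, and only the number of resamples (1000) comes from the caption.

```python
import random
import statistics

def bootstrap_accuracy(y_true, y_pred, n_resamples=1000, seed=0):
    """Return the mean and standard deviation of accuracy over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    # 1 where the prediction matches the true label, 0 otherwise.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    estimates = []
    for _ in range(n_resamples):
        # Draw n cases with replacement and score the resample.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    return statistics.mean(estimates), statistics.stdev(estimates)

# Hypothetical labels: 20 positives and 80 negatives, 93 of 100 predicted correctly.
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 15 + [0] * 5 + [0] * 78 + [1] * 2

mean_acc, sd_acc = bootstrap_accuracy(y_true, y_pred)
```

The spread of the resampled accuracies gives a rough measure of how stable the point estimate is, which a single hold-out split cannot provide on its own.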
Performance comparison of Bayesian network classifiers using validation dataset.
| Classifier | FN | TN | FP | TP | Accuracy | Sensitivity | Specificity | MR | ASE | SSE | ROC Index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 147714 | 4198 | 38 | 0.972 | 1.000 | 0.972 | 0.028 | 0.022 | 6600.840 | 1.000 | |
| 0 | 148099 | 3813 | 38 | 0.975 | 1.000 | 0.975 | 0.025 | 0.019 | 5829.130 | 1.000 | |
| 0 | 148587 | 3325 | 38 | 0.978 | 1.000 | 0.978 | 0.022 | 0.016 | 4836.650 | 1.000 | |
| 21 | 151890 | 22 | 17 | 1.000 | 0.447 | 1.000 | 0.000 | 0.000 | 62.720 | 0.999 |
NBC = Naïve Bayes Classifier; TAN = Tree augmented Naïve-Bayes network; BAN = Bayesian network augmented Naïve-Bayes network; MBN = Markov blanket Bayesian Network; FN = false negative; TN = true negative; FP = false positive; TP = true positive; MR = misclassification rate; ASE = average squared error; SSE = sum of squared error.
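The accuracy, sensitivity, specificity, and misclassification rate columns follow directly from the FN, TN, FP, and TP counts. The short sketch below recomputes them for the four rows of the table, in the order given; the function name is an assumption, and ASE, SSE, and the ROC index are not reproduced because they require the predicted probabilities.

```python
def classifier_metrics(fn, tn, fp, tp):
    """Derive summary metrics from the four confusion-matrix counts."""
    n = fn + tn + fp + tp
    accuracy = (tp + tn) / n
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "misclassification_rate": 1 - accuracy,
    }

# (FN, TN, FP, TP) for each row of the table, in table order.
rows = [
    (0, 147714, 4198, 38),
    (0, 148099, 3813, 38),
    (0, 148587, 3325, 38),
    (21, 151890, 22, 17),
]

results = [
    {name: round(value, 3) for name, value in classifier_metrics(*row).items()}
    for row in rows
]
```

Rounded to three decimals, these values agree with the tabulated figures, including the sensitivity of 0.447 in the last row, where 21 of the 38 true cases are missed.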
Fig 6. Bayesian network classifiers.
Top left: NBC = Naïve Bayes classifier; top right: TAN = Tree augmented Naïve-Bayes network; bottom left: BAN = Bayesian network augmented Naïve-Bayes network; bottom right: MBN = Markov blanket Bayesian network. Red circles are the target variable (MPS II disease) and dark blue circles are features.