Brian Connolly1, K Bretonnel Cohen2, Daniel Santel1, Ulya Bayram1, John Pestian3.
Abstract
BACKGROUND: Probabilistic assessments of clinical care are essential for quality care. Yet machine learning (ML), which supports this care process, has been limited to categorical results. To maximize its usefulness, it is important to find novel approaches that calibrate the ML output with a likelihood scale. Current state-of-the-art calibration methods are generally accurate and applicable to many ML models, but improved granularity and accuracy of such methods would increase the information available for clinical decision making. A novel non-parametric Bayesian calibration approach is demonstrated on a variety of data sets, including simulated classifier outputs, biomedical data sets from the University of California, Irvine (UCI) Machine Learning Repository, and a clinical data set built to determine suicide risk from the language of emergency department patients.
Keywords: Bayesian; Calibration; Machine learning; Nonparametric; Statistics
Year: 2017 PMID: 28784111 PMCID: PMC5545857 DOI: 10.1186/s12859-017-1736-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Construction of a Polya tree distribution. Adapted from Ferguson [54]
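As a rough illustration of the construction named in this caption, the sketch below draws one random distribution from a Polya tree on a dyadic partition of the unit interval. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sample_polya_tree`, the concentration constant `c`, and the common textbook choice of Beta(c·m², c·m²) branch probabilities at level m are all assumptions that the source does not confirm.

```python
import random


def sample_polya_tree(depth, c=1.0, seed=0):
    """Draw bin masses for one random distribution from a Polya tree.

    The unit interval is split in half repeatedly. At level m each parent
    interval passes a Beta(c*m^2, c*m^2) share of its mass to its left
    child and the remainder to its right child (assumed parameterization).
    Returns the masses of the 2**depth intervals at the deepest level.
    """
    rng = random.Random(seed)
    masses = [1.0]  # level 0: the whole interval carries all the mass
    for level in range(1, depth + 1):
        a = c * level ** 2  # assumed Beta parameter at this level
        next_masses = []
        for mass in masses:
            left_share = rng.betavariate(a, a)
            next_masses.append(mass * left_share)
            next_masses.append(mass * (1.0 - left_share))
        masses = next_masses
    return masses


bin_masses = sample_polya_tree(depth=3)
```

Each call with a different seed gives a different random distribution; the masses at any level always sum to one because every split only redistributes the parent's mass.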
Description of the data sets obtained from the University of California, Irvine Machine Learning repository, including a brief description and the number of cases and controls in the training and testing sets used to demonstrate the proposed method
| Data set | Description | Number of training Cases/Controls | Number of test Cases/Controls | Number of features | Citations |
|---|---|---|---|---|---|
| Lung Cancer | Clinical data, X-ray data, etc. used to predict 3 pathological types of lung cancer. The instances are divided into three classes of 9, 10, and 13 observations. For purposes here, the first two classes are aggregated into a single class. | 8/8 | 11/5 | 54 integer clinical features | [ |
| SPECT | Instances of normal and abnormal cardiac diagnoses. | 40/40 | 172/15 | 22 binary features indicating partial diagnoses | [ |
| Parkinsons | Biomedical voice measurements from 31 people, including 23 with Parkinson’s disease. | 72/25 | 75/23 | 22 real features | [ |
| Arcene | Mass-spectrometric data that can be used to distinguish patients with cancer versus healthy subjects. | 44/56 | 44/56 | The data set contains 10,000 integer features; a Kolmogorov-Smirnov test [ | [ |
| Arrhythmia | Normal and “abnormal” instances of demographic and electrocardiogram features. | 127/99 | 118/108 | 278 categorical, integer and real demographic and electrocardiogram features. A Kolmogorov-Smirnov test [ | [ |
| Breast Cancer | This data set contains features from digitized images of fine needle aspirates (FNA) of breast masses, which describe characteristics of the cell nuclei present in the images. The data set contains benign and malignant instances of real-valued features. | 130/219 | 111/239 | 8 | [ |
| Contraception | This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey, which sampled married women who were either not pregnant or did not know whether they were at the time of interview. The subset contains information for 1473 women, who are sub-divided based on their contraceptive use: no use (629), long-term methods (333), or short-term methods (511). The binary classifier constructed in this work aims to predict whether or not a woman uses contraception based on her categorical and integer-valued demographic and socio-economic characteristics. | 423/313 | 421/316 | 8 | [ |
Fig. 2 The averaged χ² p-values from the fit of the calibration to the diagonal in the reliability diagrams (top), the average number of calibration points (middle), and the average range in calibrated probabilities (bottom) for the proposed method (red) and the BBQ method (black)
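The fit of a calibration to the diagonal of a reliability diagram can be sketched as follows. This is a hedged illustration only: the equal-width binning, the binomial variance estimate, and the helper names `reliability_points` and `chi2_to_diagonal` are assumptions, and the source does not specify how its χ² statistic is defined. The code returns the χ² statistic; a p-value would follow from the χ² distribution with a suitable number of degrees of freedom (for example via `scipy.stats.chi2.sf`).

```python
def reliability_points(probs, labels, n_bins=10):
    """Bin calibrated probabilities into equal-width bins.

    Returns one (mean predicted probability, observed positive fraction,
    count) tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    points = []
    for members in bins:
        if not members:
            continue
        n = len(members)
        mean_p = sum(p for p, _ in members) / n
        frac = sum(y for _, y in members) / n
        points.append((mean_p, frac, n))
    return points


def chi2_to_diagonal(points):
    """Chi-square distance of the reliability points from the diagonal.

    A perfectly calibrated bin has observed fraction == mean prediction.
    The binomial variance p(1-p)/n is an assumed error model.
    """
    chi2 = 0.0
    for mean_p, frac, n in points:
        var = mean_p * (1.0 - mean_p) / n
        if var > 0:
            chi2 += (frac - mean_p) ** 2 / var
    return chi2


# Toy example: two groups of four cases each.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
points = reliability_points(probs, labels)
chi2 = chi2_to_diagonal(points)
```

In the toy example each bin deviates from the diagonal by exactly one standard deviation under the assumed error model, so the two bins contribute one unit each to the statistic.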
Fig. 3 Histograms of SVM scores from the training set for the two classes, represented as black and red distributions (top row); reliability diagrams for the BBQ method (middle row), and for the proposed method (bottom row). For comparison, the training distributions are generated using both LOO (blue) and 10-fold cross validation (green). Those data sets with large overlaps between the predicted values from the two classes are boxed for emphasis. Note the larger granularity in the (boxed) data set with a larger overlap in the ML scores
Fig. 4 Histograms of k-means scores from the training set for the two classes, represented as black and red distributions (top row); reliability diagrams for the BBQ method (middle row), and for the proposed method (bottom row). For comparison, the training distributions are generated using both LOO (blue) and 10-fold cross validation (green). Those data sets with large overlaps between the predicted values from the two classes are boxed for emphasis. Note the systematically larger granularity in those (boxed) data sets with larger overlaps in the ML scores
The χ² p-values for the fit to the diagonal in the reliability diagram, number of calibrated points, and difference between the maximum and minimum calibrated probabilities (range) for the SVM classifier presented in Fig. 3
| Data set | BBQ: χ² p-value | BBQ: calibrated points | BBQ: range | Proposed method: χ² p-value | Proposed method: calibrated points | Proposed method: range |
|---|---|---|---|---|---|---|
| Lung Cancer | <0.001 | 2 | 0.82 | 0.001 | 4 | 0.90 |
| SPECT | <0.001 | 5 | 0.75 | <0.001 | 7 | 0.92 |
| Parkinsons | 0.01 | 8 | 1.0 | 0.651 | 6 | 0.95 |
| Arcene | 0.387 | 9 | 0.96 | 0.841 | 8 | 0.94 |
| Suicide | 0.048 | 9 | 0.94 | 0.013 | 8 | 0.90 |
| Arrhythmia | 0.521 | 5 | 0.66 | 0.001 | 9 | 0.87 |
| Breast Cancer | 0.003 | 8 | 1.0 | 0.001 | 7 | 1.0 |
| **Contraception** | | | | | | |
The (Contraception) data set with a large overlap in the score distributions is emphasized in boldface. When compared with the other data sets, the proposed method produces a larger number of calibrated points, indicating a finer granularity in the calibrated probabilities
The χ² p-values for the fit to the diagonal in the reliability diagram, number of calibrated points, and difference between the maximum and minimum calibrated probabilities (range) for the k-means classifier presented in Fig. 4
| Data set | BBQ: χ² p-value | BBQ: calibrated points | BBQ: range | Proposed method: χ² p-value | Proposed method: calibrated points | Proposed method: range |
|---|---|---|---|---|---|---|
| **Lung Cancer** | | | | | | |
| **SPECT** | < | | | < | | |
| **Parkinsons** | | | | | | |
| **Arcene** | | | | | | |
| **Suicide** | | | | | | |
| **Arrhythmia** | | | | | | |
| Breast Cancer | <0.001 | 3 | 0.96 | <0.001 | 8 | 0.98 |
| **Contraception** | | | | | | |
The data sets with large overlaps in the score distributions are emphasized in boldface. The proposed method consistently achieves a larger number and more dynamic range of calibrated points. Note the Contraception data set has one calibration point on the reliability diagram, but a finite range. This is due to the number of calibration points being calculated from the number of (binned) points in the reliability diagram
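The distinction drawn in this note, one binned calibration point but a non-zero range, can be reproduced with a small sketch. It assumes ten equal-width bins for the reliability diagram; the bin count and the helper name `calibration_summary` are hypothetical and not taken from the source.

```python
def calibration_summary(calibrated_probs, n_bins=10):
    """Return (number of occupied reliability-diagram bins, range).

    The point count is the number of distinct equal-width bins that hold
    at least one calibrated probability; the range is max minus min of
    the unbinned calibrated probabilities.
    """
    occupied = {min(int(p * n_bins), n_bins - 1) for p in calibrated_probs}
    prob_range = max(calibrated_probs) - min(calibrated_probs)
    return len(occupied), prob_range


# All four calibrated probabilities fall inside the [0.4, 0.5) bin,
# so the diagram shows a single point even though the values differ.
n_points, prob_range = calibration_summary([0.41, 0.43, 0.47, 0.49])
```

Here the count is 1 while the range is about 0.08, which matches the situation the note describes for the Contraception data set.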
The χ² p-values for the fit to the diagonal in the reliability diagram, number of calibrated points, and difference between the maximum and minimum calibrated probabilities (range) for various BBQ parameters Fig. 3
| Classifier | Scoring function | Threshold | Binning parameter | χ² p-value | Calibration points | Range |
|---|---|---|---|---|---|---|
| SVM | BDeu | 2 | 0.0001 | 0.187 | 7 | 0.95 |
| SVM | BDeu | 4 | 0.0001 | 0.13 | 8 | 0.97 |
| SVM | BDeu2 | N/A | 0.0001 | 0.023 | 9 | 0.97 |
| SVM | BDeu | 2 | 0.001 | 0.187 | 7 | 0.95 |
| SVM | BDeu | 4 | 0.001 | 0.13 | 8 | 0.97 |
| **SVM** | **BDeu2** | **N/A** | **0.001** | | | |
| SVM | BDeu | 2 | 0.01 | 0.187 | 7 | 0.95 |
| SVM | BDeu | 4 | 0.01 | 0.13 | 8 | 0.97 |
| SVM | BDeu2 | N/A | 0.01 | 0.066 | 9 | 0.94 |
| k-means | BDeu | 2 | 0.0001 | 0.502 | 2 | 0.05 |
| k-means | BDeu | 4 | 0.0001 | 0.558 | 2 | 0.06 |
| k-means | BDeu2 | N/A | 0.0001 | 0.497 | 2 | 0.05 |
| k-means | BDeu | 2 | 0.001 | 0.502 | 2 | 0.05 |
| k-means | BDeu | 4 | 0.001 | 0.558 | 2 | 0.06 |
| **k-means** | **BDeu2** | **N/A** | **0.001** | | | |
| k-means | BDeu | 2 | 0.01 | 0.502 | 2 | 0.05 |
| k-means | BDeu | 4 | 0.01 | 0.558 | 2 | 0.06 |
| k-means | BDeu2 | N/A | 0.01 | 0.496 | 2 | 0.05 |
The BBQ default parameters used in the comparisons above are highlighted in boldface