Eric Cawi1, Patricio S La Rosa2, Arye Nehorai1.
Abstract
In this paper we define the concept of the Machine Learning Morphism (MLM) as a fundamental building block to express operations performed in machine learning, such as data preprocessing, feature extraction, and model training. Inspired by statistical learning, MLMs are morphisms whose parameters are chosen by minimizing a risk function. We explore operations such as composition of MLMs, and the conditions under which sets of MLMs form a vector space. These operations are used to build a machine learning workflow from data preprocessing to final task completion. We examine the Mapper algorithm from Topological Data Analysis as an MLM, and build several workflows for binary classification incorporating Mapper on Hospital Readmissions and Credit Evaluation datasets. The advantage of this framework lies in the ability to easily build, organize, and compare multiple workflows, and to jointly optimize parameters across multiple steps in an application.
Year: 2019 PMID: 31790458 PMCID: PMC6886815 DOI: 10.1371/journal.pone.0225577
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
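The abstract's central idea, a parameterized map fitted by minimizing an empirical risk and then composed with other such maps into a workflow, can be illustrated with a minimal Python sketch. The class names `Standardize`, `LeastSquaresLine`, and `Compose` are hypothetical and do not come from the paper, and the steps are fitted one after another here, which is a simplification of the joint optimization the abstract mentions.

```python
class Standardize:
    """Map x -> (x - mean) / std; the parameters (mean, std) are fitted from data."""

    def fit(self, xs, ys=None):
        n = len(xs)
        self.mean = sum(xs) / n
        var = sum((x - self.mean) ** 2 for x in xs) / n
        self.std = var ** 0.5 or 1.0  # fall back to 1.0 for constant input
        return self

    def __call__(self, x):
        return (x - self.mean) / self.std


class LeastSquaresLine:
    """Map x -> a * x + b; (a, b) minimize the mean squared error."""

    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.a = sxy / sxx
        self.b = my - self.a * mx
        return self

    def __call__(self, x):
        return self.a * x + self.b


class Compose:
    """Composition of steps: each step is fitted on the output of the previous one."""

    def __init__(self, *steps):
        self.steps = steps

    def fit(self, xs, ys):
        for step in self.steps:
            step.fit(xs, ys)
            xs = [step(x) for x in xs]
        return self

    def __call__(self, x):
        for step in self.steps:
            x = step(x)
        return x

    def risk(self, xs, ys):
        """Empirical risk of the whole workflow (mean squared error)."""
        return sum((self(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data generated from y = 2x + 1 (illustrative only, not from the paper).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
workflow = Compose(Standardize(), LeastSquaresLine()).fit(xs, ys)
prediction = workflow(5.0)
```

Because the composed object exposes the same `fit` and call interface as its parts, alternative workflows can be built and compared by swapping steps in and out.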
Notation for sets, spaces, functions, etc., used throughout the paper.
| Notation | Meaning | Example |
|---|---|---|
| Script capital | Space | |
| Bold, non-italic | Set | |
| Bold, italic, lower case, index | Array | |
| Capital italic | Function | |
| Italic, lower case | Scalar | |
Common data operations expressed as MLMs.
| Operation | Input Space | Output Space | Parameters | Morphism | Empirical Risk Function |
|---|---|---|---|---|---|
| Data Encoding | Abstract Space | | embedding parameters | Injective map: | trivial (one-hot encoding) |
| PCA | | | | | |
| Linear Regression | | | | | |
| Logistic Regression | | [0, 1] | | | Maximum Likelihood |
| SVM | | {−1, 1} | | ∥ | |
| Decision Tree | Set | { | splitting criterion | Tree | Gini Impurity |
| Standardization | | | | ( | |
| AdaBoost | | | parameters associated | | Exponential Loss [ |
| Neural Networks | | | Weights in | Loss functions, | |
| Model Evaluation | Collection | | Evaluation parameters | Performance Metric | Complexity Criterion |
Fig 1. Block diagram of Eq 29, showing how a workflow is created for each node.
The first step one-hot encodes the data to embed it into a numerical vector space. The next step computes the Mapper graph of the data. Models are then trained on each node and summed. Finally, a decision function outputs the final class prediction.
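The four steps in the Fig 1 caption can be sketched end to end in plain Python. This is a heavily simplified illustration under stated assumptions, not the paper's implementation: the filter function (last coordinate), the interval cover, the neighbourhood clustering, the `eps` radius, and the per-node "model" (just the node's positive rate, standing in for a trained classifier) are all hypothetical choices, as is the toy data.

```python
import math


def one_hot(value, categories):
    """Injective encoding of a categorical value as a 0/1 vector."""
    return [1.0 if value == c else 0.0 for c in categories]


def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def cover(lo, hi, n_intervals, overlap):
    """Overlapping intervals covering [lo, hi]."""
    width = (hi - lo) / n_intervals
    pad = width * overlap
    return [(lo + k * width - pad, lo + (k + 1) * width + pad)
            for k in range(n_intervals)]


def components(points, idx, eps):
    """Connected components of the eps-neighbourhood graph restricted to idx."""
    remaining, comps = set(idx), []
    while remaining:
        stack, comp = [remaining.pop()], []
        while stack:
            i = stack.pop()
            comp.append(i)
            near = {j for j in remaining if dist(points[i], points[j]) <= eps}
            remaining -= near
            stack.extend(near)
        comps.append(sorted(comp))
    return comps


def mapper_nodes(points, filt, n_intervals=2, overlap=0.25, eps=1.5):
    """Simplified Mapper: cluster the preimage of each cover interval."""
    values = [filt(p) for p in points]
    nodes = []
    for a, b in cover(min(values), max(values), n_intervals, overlap):
        idx = [i for i, v in enumerate(values) if a <= v <= b]
        nodes += [((a, b), comp) for comp in components(points, idx, eps)]
    return nodes


def fit_node_models(nodes, labels):
    """One stand-in model per node: the node's positive rate."""
    return [sum(labels[i] for i in members) / len(members)
            for _, members in nodes]


def predict(x, points, nodes, rates, filt, eps=1.5, threshold=0.5):
    """Average the node models that x falls into, then apply a decision threshold."""
    v = filt(x)
    active = [r for ((a, b), members), r in zip(nodes, rates)
              if a <= v <= b
              and min(dist(x, points[i]) for i in members) <= eps]
    if not active:
        return 0  # assumption: default to the negative class
    return int(sum(active) / len(active) >= threshold)


# Toy records: (category, numeric value); labels are illustrative only.
categories = ["A", "B"]
raw = [("A", 1.0), ("A", 1.5), ("A", 2.0), ("B", 8.0), ("B", 8.5), ("B", 9.0)]
labels = [0, 0, 0, 1, 1, 1]
points = [one_hot(c, categories) + [x] for c, x in raw]
filt = lambda p: p[-1]

nodes = mapper_nodes(points, filt)
rates = fit_node_models(nodes, labels)
pred_b = predict(one_hot("B", categories) + [8.2], points, nodes, rates, filt)
pred_a = predict(one_hot("A", categories) + [1.2], points, nodes, rates, filt)
```

In a fuller version each node's positive rate would be replaced by a classifier trained on that node's members, and the equal-weight average by whatever weighting the workflow specifies.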
Descriptive table of patient data from Barnes Jewish Hospital.
| | Total Cohort | Number Readmitted |
|---|---|---|
| Congestive Heart Failure (CHF) | 378 | 71 |
| Chronic Obstructive Pulmonary Disease (COPD) | 88 | 9 |
| Acute Myocardial Infarction (AMI) | 198 | 31 |
| Pneumonia (PNA) | 113 | 23 |
| Male | 420 | 74 |
| White | 472 | 76 |
| Has Diabetes | 142 | 32 |
| 65-70 years old (y.o.) | 209 | 41 |
| 70-75 y.o. | 174 | 30 |
| 75-80 y.o. | 135 | 18 |
| 85+ y.o. | 258 | 45 |
| Discharged Home | 296 | 50 |
| Disch. to | 147 | 25 |
| Disch. with Home Health | 270 | 52 |
| Low LACE Score (< 5) | 30 | 3 |
| Medium LACE Score (5-10) | 231 | 27 |
| High LACE Score (> 10) | 515 | 104 |
| Average Length of Stay (Days) | 7.0 | 8.25 |
Results for different workflows of logistic regression on hospital readmissions data, with standard deviations over n = 10 runs.
| LR Workflow | ROC AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| No Transformation | 0.49 (0.023) | 0.58 (0.031) | 0.48 (0.021) | 0.49 (0.022) |
| No Transformation, SMOTE | 0.64 (0.033) | 0.62 (0.029) | 0.67 (0.039) | 0.66 (0.037) |
| No Transformation, ROSE | 0.53 (0.041) | 0.54 (0.044) | 0.51 (0.045) | 0.52 (0.045) |
| PCA | 0.58 (0.017) | 0.68 (0.022) | 0.45 (0.029) | 0.49 (0.028) |
| PCA, SMOTE | 0.49 (0.037) | 0.64 (0.035) | 0.44 (0.034) | 0.47 (0.034) |
| PCA, ROSE | 0.45 (0.061) | 0.50 (0.059) | 0.55 (0.065) | 0.54 (0.063) |
| Mapper, No Transformation | 0.61 (0.048) | 0.62 (0.052) | 0.53 (0.049) | 0.55 (0.050) |
| Mapper, No Transformation, SMOTE | 0.67 (0.066) | 0.60 (0.055) | 0.60 (0.064) | 0.60 (0.062) |
| Mapper, No Transformation, ROSE | 0.62 (0.073) | 0.69 (0.076) | 0.59 (0.078) | 0.61 (0.078) |
| Mapper, Node PCA | 0.55 (0.065) | 0.62 (0.058) | 0.50 (0.059) | 0.52 (0.058) |
| Mapper, Node PCA, SMOTE | 0.69 (0.071) | 0.62 (0.069) | 0.78 (0.065) | 0.75 (0.066) |
| Mapper, Node PCA, ROSE | 0.61 (0.084) | 0.58 (0.082) | 0.63 (0.087) | 0.62 (0.086) |
Results for different workflows of AdaBoost classifiers, with standard deviations over n = 10 runs.
| AdaBoost workflow | ROC AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| No Transformation | 0.50 (0.043) | 0.54 (0.058) | 0.49 (0.049) | 0.50 (0.051) |
| No Transformation, SMOTE | 0.62 (0.056) | 0.65 (0.072) | 0.53 (0.070) | 0.55 (0.071) |
| No Transformation, ROSE | 0.5 (0) | 0 (0) | 1 (0) | 0.827 (0) |
| PCA | 0.48 (0.038) | 0.54 (0.044) | 0.53 (0.049) | 0.53 (0.048) |
| PCA, SMOTE | 0.53 (0.051) | 0.50 (0.053) | 0.58 (0.057) | 0.57 (0.056) |
| PCA, ROSE | 0.69 (0.073) | 0.46 (0.078) | 0.74 (0.064) | 0.69 (0.068) |
| Mapper, No Transformation | 0.56 (0.079) | 0.49 (0.082) | 0.75 (0.086) | 0.71 (0.085) |
| Mapper, No Transformation, SMOTE | 0.63 (0.083) | 0.73 (0.077) | 0.63 (0.083) | 0.65 (0.082) |
| Mapper, No Transformation, ROSE | 0.54 (0.098) | 0.42 (0.131) | 0.67 (0.110) | 0.63 (0.119) |
| Mapper, Node PCA | 0.63 (0.066) | 0.69 (0.075) | 0.58 (0.082) | 0.60 (0.081) |
| Mapper, Node PCA, SMOTE | 0.58 (0.088) | 0.65 (0.084) | 0.54 (0.091) | 0.56 (0.089) |
| Mapper, Node PCA, ROSE | 0.44 (0.141) | 0.58 (0.109) | 0.51 (0.092) | 0.52 (0.095) |
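The "No Transformation, ROSE" row in the table above (sensitivity 0, specificity 1, accuracy 0.827, ROC AUC 0.5, all with zero standard deviation) is the pattern one would expect from a classifier that predicts the negative class for every sample. The short sketch below reproduces that arithmetic. The split of 173 positives and 827 negatives is a hypothetical choice made only so that the negative-class share matches the reported accuracy; the actual test-set size is not given here.

```python
def binary_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy


# Hypothetical test set: 17.3% positives, and a model that always predicts 0.
y_true = [1] * 173 + [0] * 827
y_pred = [0] * len(y_true)
sens, spec, acc = binary_metrics(y_true, y_pred)
```

With a constant prediction, accuracy simply equals the share of the majority class, which is why accuracy alone can look strong on imbalanced data even when no positive case is detected.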
Fig 2. Typical Mapper graph generated from hospital readmissions data.
The nodes are colored showing level of readmissions, and larger node size indicates a higher number of patients in that node.
Descriptive table of German Credit Dataset from UCI Repository, monetary values in units of Deutsche Marks.
| | Total Cohort | Number with Bad Credit |
|---|---|---|
| Checking Status <0 | 274 | 135 |
| Checking Status 0-200 | 269 | 105 |
| Checking Status >200 | 63 | 14 |
| Checking Status no account | 394 | 46 |
| Credit History—no history | 40 | 25 |
| Credit History—all paid | 49 | 28 |
| Credit History—existing, paid | 530 | 169 |
| Credit History—Delayed Previously | 60 | 28 |
| Credit History—critical/other debt | 293 | 50 |
| Buying New Car | 234 | 89 |
| Buying Used Car | 103 | 17 |
| Buying furniture/equipment | 181 | 58 |
| Buying Radio/TV | 280 | 62 |
| No Savings | 183 | 32 |
| Savings < 100 | 603 | 217 |
| Savings > 100 | 214 | 51 |
| Rent Housing | 179 | 70 |
| Own Housing | 713 | 186 |
| Free Housing | 64 | 44 |
| Unemployed/unskilled non resident | 22 | 7 |
| Unskilled Resident | 200 | 56 |
| Skilled | 630 | 186 |
| Highly Skilled/Self Employed/Management | 148 | 51 |
Fig 3. Typical Mapper graph generated from first principal component of German Credit Data.
Nodes are colored to show the levels of bad credit, and sized by number of data points.
Results for different workflows of logistic regression classifiers on German Credit Data, with standard deviations over n = 10 runs.
| Model | ROC AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| No Transformation | 0.76 (0.043) | 0.71 (0.033) | 0.73 (0.039) | 0.72 (0.038) |
| No Transformation, SMOTE | 0.76 (0.027) | 0.72 (0.046) | 0.7 (0.041) | 0.71 (0.043) |
| PCA | 0.76 (0.031) | 0.71 (0.041) | 0.71 (0.035) | 0.71 (0.037) |
| PCA, SMOTE | 0.76 (0.049) | 0.7 (0.046) | 0.74 (0.048) | 0.73 (0.048) |
| Mapper, 1 model | 0.68 (0.054) | 0.66 (0.063) | 0.69 (0.057) | 0.68 (0.059) |
| Mapper, SMOTE, 1 model | 0.69 (0.068) | 0.66 (0.060) | 0.72 (0.061) | 0.70 (0.061) |
| Mapper, 2 models, equal weight | 0.73 (0.072) | 0.71 (0.065) | 0.69 (0.068) | 0.70 (0.067) |
| Mapper, SMOTE 2 models, equal weight | 0.72 (0.079) | 0.71 (0.073) | 0.7 (0.066) | 0.70 (0.068) |
| Mapper, 2 models, AUC weight | 0.72 (0.075) | 0.71 (0.068) | 0.71 (0.063) | 0.71 (0.064) |
| Mapper, SMOTE 2 models, AUC weight | 0.72 (0.078) | 0.7 (0.077) | 0.67 (0.063) | 0.68 (0.067) |
| Mapper, Node PCA, 1 model | 0.74 (0.056) | 0.73 (0.063) | 0.67 (0.061) | 0.69 (0.061) |
| Mapper, Node PCA, SMOTE, 1 model | 0.71 (0.067) | 0.69 (0.054) | 0.68 (0.055) | 0.683 (0.055) |
| Mapper, Node PCA, 2 models, equal weight | 0.74 (0.057) | 0.78 (0.063) | 0.67 (0.069) | 0.703 (0.068) |
| Mapper, Node PCA, SMOTE 2 models, equal weight | 0.72 (0.073) | 0.7 (0.076) | 0.66 (0.067) | 0.67 (0.069) |
| Mapper, Node PCA, 2 models, AUC weight | 0.74 (0.072) | 0.75 (0.059) | 0.67 (0.063) | 0.69 (0.062) |
| Mapper, Node PCA, SMOTE 2 models, AUC weight | 0.74 (0.079) | 0.7 (0.062) | 0.69 (0.071) | 0.69 (0.067) |
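The "equal weight" and "AUC weight" rows above combine two models' outputs. How the weights are computed is not stated in this record, so the sketch below assumes one plausible reading: each model's score is weighted in proportion to its ROC AUC, and the weights are normalized to sum to one. The function names and the toy scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def combine(score_lists, weights):
    """Weighted average of several models' scores, with normalized weights."""
    total = sum(weights)
    n = len(score_lists[0])
    return [sum(w * s[i] for w, s in zip(weights, score_lists)) / total
            for i in range(n)]


# Toy validation labels and scores from two hypothetical models.
y = [0, 0, 1, 1]
model_a = [0.10, 0.40, 0.35, 0.80]
model_b = [0.30, 0.20, 0.60, 0.70]

auc_a = roc_auc(y, model_a)
auc_b = roc_auc(y, model_b)

equal_weighted = combine([model_a, model_b], [1.0, 1.0])
auc_weighted = combine([model_a, model_b], [auc_a, auc_b])
```

In practice the AUCs used as weights would be estimated on held-out data rather than on the samples being scored, to avoid rewarding a model for overfitting.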
Results for different workflows of random forest classifiers on German Credit Data, with standard deviations over n = 10 runs.
| Model | ROC AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| No Transformation | 0.75 (0.029) | 0.72 (0.027) | 0.71 (0.038) | 0.71 (0.034) |
| No Transformation, SMOTE | 0.76 (0.033) | 0.7 (0.036) | 0.73 (0.031) | 0.72 (0.033) |
| PCA | 0.75 (0.025) | 0.69 (0.038) | 0.71 (0.029) | 0.7 (0.032) |
| PCA, SMOTE | 0.75 (0.046) | 0.71 (0.049) | 0.72 (0.053) | 0.72 (0.051) |
| Mapper, 1 model | 0.7 (0.053) | 0.75 (0.066) | 0.62 (0.054) | 0.66 (0.056) |
| Mapper, SMOTE, 1 model | 0.71 (0.064) | 0.7 (0.067) | 0.67 (0.075) | 0.68 (0.072) |
| Mapper, 2 models, equal weight | 0.71 (0.073) | 0.75 (0.077) | 0.65 (0.073) | 0.68 (0.074) |
| Mapper, SMOTE 2 models, equal weight | 0.69 (0.070) | 0.68 (0.071) | 0.67 (0.087) | 0.68 (0.083) |
| Mapper, 2 models, AUC weight | 0.71 (0.061) | 0.77 (0.078) | 0.64 (0.074) | 0.68 (0.075) |
| Mapper, SMOTE 2 models, AUC weight | 0.69 (0.086) | 0.67 (0.062) | 0.7 (0.073) | 0.68 (0.070) |
| Mapper, Node PCA, 1 model | 0.67 (0.045) | 0.64 (0.067) | 0.65 (0.071) | 0.65 (0.069) |
| Mapper, Node PCA, SMOTE, 1 model | 0.67 (0.051) | 0.65 (0.062) | 0.66 (0.066) | 0.66 (0.064) |
| Mapper, Node PCA, 2 models, equal weight | 0.66 (0.068) | 0.67 (0.059) | 0.64 (0.064) | 0.65 (0.062) |
| Mapper, Node PCA, SMOTE 2 models, equal weight | 0.67 (0.078) | 0.67 (0.071) | 0.67 (0.068) | 0.67 (0.070) |
| Mapper, Node PCA, 2 models, AUC weight | 0.67 (0.075) | 0.64 (0.060) | 0.68 (0.058) | 0.65 (0.059) |
| Mapper, Node PCA, SMOTE 2 models, AUC weight | 0.67 (0.073) | 0.68 (0.067) | 0.68 (0.072) | 0.68 (0.071) |
Results for different workflows of SVMs for hospital readmissions data, with standard deviations over n = 10 runs.
| SVM Workflow | ROC AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| No Transformation | 0.63 (0.033) | 0.65 (0.037) | 0.6 (0.035) | 0.61 (0.036) |
| No Transformation, SMOTE | 0.59 (0.042) | 0.65 (0.040) | 0.49 (0.046) | 0.52 (0.044) |
| No Transformation, ROSE | 0.55 (0.083) | 0.80 (0.087) | 0.44 (0.091) | 0.50 (0.090) |
| PCA | 0.64 (0.039) | 0.69 (0.044) | 0.58 (0.038) | 0.60 (0.039) |
| PCA, SMOTE | 0.61 (0.047) | 0.58 (0.043) | 0.58 (0.048) | 0.58 (0.046) |
| PCA, ROSE | 0.62 (0.058) | 0.62 (0.049) | 0.62 (0.054) | 0.62 (0.053) |
| Mapper, No Transformation | 0.53 (0.057) | 0.58 (0.075) | 0.48 (0.068) | 0.49 (0.070) |
| Mapper, No Transformation, SMOTE | 0.57 (0.079) | 0.54 (0.072) | 0.64 (0.076) | 0.62 (0.075) |
| Mapper, No Transformation, ROSE | 0.53 (0.086) | 0.50 (0.081) | 0.64 (0.088) | 0.61 (0.086) |
| Mapper, Node PCA | 0.50 (0.065) | 0.62 (0.073) | 0.49 (0.072) | 0.51 (0.072) |
| Mapper, Node PCA, SMOTE | 0.61 (0.077) | 0.73 (0.083) | 0.53 (0.089) | 0.56 (0.088) |
| Mapper, Node PCA, ROSE | 0.67 (0.092) | 0.77 (0.095) | 0.60 (0.088) | 0.63 (0.089) |
Results for different workflows of random forests for hospital readmissions data, with standard deviations over n = 10 runs.
| RF Workflow | ROC AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| No Transformation | 0.60 (0.047) | 0.58 (0.042) | 0.57 (0.049) | 0.57 (0.046) |
| No Transformation, SMOTE | 0.52 (0.053) | 0.46 (0.052) | 0.75 (0.055) | 0.70 (0.055) |
| No Transformation, ROSE | 0.5 (0) | 0 (0) | 1 (0) | 0.827 (0) |
| PCA | 0.56 (0.051) | 0.50 (0.066) | 0.63 (0.068) | 0.61 (0.068) |
| PCA, SMOTE | 0.57 (0.053) | 0.62 (0.051) | 0.60 (0.058) | 0.60 (0.056) |
| PCA, ROSE | 0.53 (0.072) | 0.54 (0.071) | 0.56 (0.076) | 0.56 (0.075) |
| Mapper, No Transformation | 0.49 (0.078) | 0.46 (0.084) | 0.60 (0.081) | 0.58 (0.082) |
| Mapper, No Transformation, SMOTE | 0.55 (0.087) | 0.58 (0.075) | 0.54 (0.082) | 0.55 (0.080) |
| Mapper, No Transformation, ROSE | 0.51 (0.093) | 0.54 (0.116) | 0.51 (0.143) | 0.52 (0.137) |
| Mapper, Node PCA | 0.57 (0.069) | 0.62 (0.076) | 0.62 (0.086) | 0.62 (0.083) |
| Mapper, Node PCA, SMOTE | 0.57 (0.084) | 0.46 (0.095) | 0.71 (0.091) | 0.67 (0.092) |
| Mapper, Node PCA, ROSE | 0.64 (0.110) | 0.65 (0.099) | 0.61 (0.091) | 0.62 (0.097) |
Results for different workflows of SVM classifiers on German Credit Data, with standard deviations over n = 10 runs.
| Model | ROC AUC | Sensitivity | Specificity | Accuracy |
|---|---|---|---|---|
| No Transformation | 0.77 (0.023) | 0.74 (0.035) | 0.71 (0.027) | 0.72 (0.029) |
| No Transformation, SMOTE | 0.71 (0.038) | 0.71 (0.049) | 0.67 (0.042) | 0.68 (0.044) |
| PCA | 0.76 (0.028) | 0.68 (0.051) | 0.72 (0.038) | 0.71 (0.041) |
| PCA, SMOTE | 0.76 (0.036) | 0.69 (0.031) | 0.74 (0.043) | 0.73 (0.040) |
| Mapper, 1 model | 0.53 (0.051) | 0.54 (0.055) | 0.56 (0.047) | 0.55 (0.049) |
| Mapper, SMOTE, 1 model | 0.54 (0.066) | 0.57 (0.048) | 0.55 (0.068) | 0.56 (0.062) |
| Mapper, 2 models, equal weight | 0.52 (0.062) | 0.54 (0.068) | 0.56 (0.061) | 0.55 (0.063) |
| Mapper, SMOTE 2 models, equal weight | 0.54 (0.070) | 0.54 (0.059) | 0.59 (0.061) | 0.58 (0.061) |
| Mapper, 2 models, AUC weight | 0.52 (0.064) | 0.55 (0.054) | 0.56 (0.072) | 0.56 (0.065) |
| Mapper, SMOTE 2 models, AUC weight | 0.53 (0.077) | 0.59 (0.079) | 0.52 (0.071) | 0.54 (0.074) |
| Mapper, Node PCA, 1 model | 0.72 (0.041) | 0.73 (0.066) | 0.67 (0.059) | 0.69 (0.061) |
| Mapper, Node PCA, SMOTE, 1 model | 0.76 (0.058) | 0.75 (0.052) | 0.67 (0.047) | 0.69 (0.048) |
| Mapper, Node PCA, 2 models, equal weight | 0.75 (0.067) | 0.7 (0.075) | 0.69 (0.071) | 0.7 (0.073) |
| Mapper, Node PCA, SMOTE 2 models, equal weight | 0.76 (0.072) | 0.72 (0.063) | 0.71 (0.069) | 0.71 (0.067) |
| Mapper, Node PCA, 2 models, AUC weight | 0.75 (0.056) | 0.71 (0.059) | 0.7 (0.067) | 0.71 (0.064) |
| Mapper, Node PCA, SMOTE 2 models, AUC weight | 0.76 (0.081) | 0.73 (0.073) | 0.71 (0.065) | 0.72 (0.068) |