MennattAllah Hassan Attia, Marwa A Kholief, Nancy M Zaghloul, Ivana Kružić, Šimun Anđelinović, Željana Bašić, Ivan Jerković.
Abstract
The adjusted binary classification (ABC) approach was proposed to ensure that a binary classification model reaches a particular accuracy level. The present study evaluated the ABC for osteometric sex classification using multiple machine learning (ML) techniques: linear discriminant analysis (LDA), boosted generalized linear model (GLMB), support vector machine (SVM), and logistic regression (LR). We used 13 femoral measurements of 300 individuals from a modern Turkish population sample and split the data into two sets: training (n = 240) and testing (n = 60). Then, the five best-performing measurements were selected for training univariate models, while pools of these variables were used for the multivariate models. ML classifier type did not affect the performance of unadjusted models. The accuracy of univariate models was 82-87%, while that of multivariate models was 89-90%. After applying ABC to the cross-validation set, the accuracy and the positive and negative predictive values for uni- and multivariate models were ≥95%. Sex could be estimated for 28-75% of individuals using univariate models but with an obvious sexing bias, likely caused by different degrees of sexual dimorphism and between-group overlap. However, using multivariate models, we minimized the bias and properly classified 81-87% of individuals. A similar performance was also noted in the testing sample (except for FEB), with accuracies of 96-100%, and a proportion of classified individuals between 30% and 82% in univariate models, and between 90% and 91% in multivariate models. When considering different training sample sizes, we demonstrated that LR was the most sensitive with limited sample sizes (n < 150), while GLMB was the most stable classifier.
Keywords: adjusted binary classification; machine learning algorithms; optimal training sample size; osteometric sex estimation
Year: 2022 PMID: 35741437 PMCID: PMC9220275 DOI: 10.3390/biology11060917
Source DB: PubMed Journal: Biology (Basel) ISSN: 2079-7737
Descriptive statistics with t-test results and overlapping percentages in the training sample.
| Variables | Males: Mean ± SD (mm) | Males: Range (mm) | Females: Mean ± SD (mm) | Females: Range (mm) | t | p | Overlapping (%) |
|---|---|---|---|---|---|---|---|
| FML | 443.78 ± 25.59 | 384.70–502.30 | 406.51 ± 21.20 | 359.24–453.13 | 12.289 | <0.001 * | 27.34 |
| FBL | 441.23 ± 25.76 | 382.12–501.03 | 404.73 ± 21.92 | 358.61–453.17 | 11.821 | <0.001 * | 28.52 |
| FTL | 425.65 ± 23.77 | 371.24–477.58 | 391.19 ± 21.05 | 342.70–439.45 | 11.890 | <0.001 * | 26.55 |
| MTD | 29.33 ± 2.20 | 22–34.14 | 28.09 ± 2.37 | 21.88–37.10 | 4.205 | <0.001 * | 63.39 |
| VHD | 49.19 ± 3.02 | 38.89–55.77 | 43.19 ± 2.89 | 36.48–50.84 | 15.715 | <0.001 * | 19.57 |
| FVDN | 36.75 ± 2.69 | 27.85–43.25 | 31.98 ± 2.24 | 25.77–37.47 | 14.916 | <0.001 * | 17.56 |
| FNAL | 102.20 ± 6.27 | 87.23–122.38 | 91.14 ± 5.36 | 79.92–103.27 | 14.682 | <0.001 * | 22.51 |
| FBP | 91.28 ± 5.88 | 71.49–108.42 | 81.61 ± 5.08 | 72.06–94.30 | 17.908 | <0.001 * | 23.00 |
| MLD | 32.95 ± 2.44 | 26.32–39.51 | 31.07 ± 2.24 | 26.16–37.41 | 6.231 | <0.001 * | 45.78 |
| FBCB | 74.71 ± 4.31 | 65.10–89.03 | 66.61 ± 4.19 | 58.78–76.38 | 14.777 | <0.001 * | 23.70 |
| FEB | 85.72 ± 4.42 | 76.46–98.06 | 76.70 ± 3.58 | 67.86–89.58 | 17.908 | <0.001 * | 17.15 |
| APDLC | 64.23 ± 3.67 | 53.58–73.84 | 58.32 ± 3.22 | 51.38–67.91 | 13.267 | <0.001 * | 23.82 |
| APDMC | 63.54 ± 3.70 | 53.97–73.21 | 57.79 ± 3.54 | 49.84–67.40 | 12.304 | <0.001 * | 28.69 |
* p < 0.05 was considered significant.
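As a rough illustration of how the t values in the table relate to the tabulated means and standard deviations, the sketch below computes a pooled-variance two-sample t statistic from summary statistics alone. The function name `t_from_summary` is hypothetical, and the group sizes of 120 males and 120 females are an assumption: the source gives only the total training size (n = 240), not the per-sex counts.

```python
import math


def t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Pooled-variance (Student) two-sample t statistic from summary statistics."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error


# FML row of the table above. ASSUMPTION: 120 individuals per sex;
# the per-sex group sizes are not stated in the source.
t_fml = t_from_summary(443.78, 25.59, 120, 406.51, 21.20, 120)
print(round(t_fml, 2))
```

Under that assumed split, the FML row yields a value of about 12.29, in line with the tabulated statistic; with different group sizes the result would shift accordingly.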
LGOCV classification results without employing the ABC approach in the training sample.
| Variables | Accuracy (%) | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | c-Index |
|---|---|---|---|---|---|---|
| Logistic regression | ||||||
| FEB | 84.44 | 85.89 | 83 | 83.48 | 85.47 | 0.944 |
| VHD | 84.36 | 85.06 | 83.67 | 83.89 | 84.85 | 0.920 |
| FVDN | 86.33 | 85.44 | 87.22 | 86.99 | 85.70 | 0.914 |
| FBCB | 83 | 81.22 | 84.78 | 84.22 | 81.87 | 0.904 |
| FNAL | 82.25 | 83.17 | 81.33 | 81.67 | 82.85 | 0.903 |
| LR1 | 90.08 | 87.67 | 92.50 | 92.12 | 88.24 | 0.968 |
| Discriminant analysis | ||||||
| FEB | 84.89 | 85.28 | 84.50 | 84.62 | 85.16 | 0.944 |
| VHD | 84.72 | 86.44 | 83 | 83.57 | 85.96 | 0.920 |
| FVDN | 86.42 | 85.44 | 87.39 | 87.14 | 85.72 | 0.915 |
| FBCB | 83.06 | 81 | 85.11 | 84.47 | 81.75 | 0.905 |
| FNAL | 82.19 | 81.78 | 82.61 | 82.46 | 81.93 | 0.904 |
| DF1 | 89.58 | 86.22 | 92.94 | 82.44 | 87.09 | 0.959 |
| Boosted glm | ||||||
| FEB | 84.89 | 85.28 | 84.50 | 84.62 | 85.16 | 0.945 |
| VHD | 84.69 | 86.64 | 82.94 | 83.52 | 85.95 | 0.920 |
| FVDN | 86.42 | 85.44 | 87.39 | 87.14 | 85.72 | 0.914 |
| FBCB | 83.94 | 82.89 | 85 | 84.68 | 83.24 | 0.906 |
| FNAL | 82.19 | 81.78 | 82.61 | 82.46 | 81.93 | 0.904 |
| GLMB1 | 89.17 | 86 | 92.33 | 91.81 | 86.83 | 0.961 |
| SVM linear | ||||||
| FEB | 84.50 | 85.83 | 83.17 | 83.60 | 85.45 | 0.944 |
| VHD | 84.42 | 85.28 | 83.56 | 83.83 | 85.02 | 0.919 |
| FVDN | 86.50 | 85.61 | 87.39 | 87.16 | 85.86 | 0.914 |
| FBCB | 83.06 | 81.33 | 84.78 | 84.23 | 81.95 | 0.904 |
| FNAL | 82.22 | 83.17 | 81.28 | 81.62 | 82.84 | 0.904 |
| SVM1 | 89.78 | 88 | 91.56 | 91.24 | 88.41 | 0.958 |
Multivariate feature selection: LR: FVDN + FEB + FNAL + MLD; LDA: FEB + VHD + FVDN + FBCB + FNAL; boosted GLM: FEB + FVDN + VHD + MLD + MTD + FNAL + FBCB + APDLC + FML + FBP + APDMC; SVM: FEB + VHD + FVDN + FBCB + FNAL.
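For readers less familiar with the performance columns above, the following minimal sketch derives accuracy, sensitivity, specificity, PPV, and NPV from confusion-matrix counts. It assumes that males are treated as the positive class, which the source does not state explicitly, and the counts in the example are invented for illustration; they are not taken from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary metrics from confusion-matrix counts.

    tp/fn: positives (assumed here: males) classified correctly/incorrectly.
    tn/fp: negatives (assumed here: females) classified correctly/incorrectly.
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts for a balanced sample of 240 (illustrative only).
metrics = classification_metrics(tp=105, fn=15, tn=111, fp=9)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

The c-index column is not covered here, since it requires the full set of posterior probabilities rather than counts.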
LGOCV classification results with ABC adjustment in the training sample.
| Variables | Posterior Probability Cutoff (Males) | Posterior Probability Cutoff (Females) | Classified Males (%) | Classified Females (%) | Classified Overall (%) | Accuracy (%) | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | c-Index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic regression | |||||||||||
| FEB | 0.691 | 0.883 | 82.05 | 65.94 | 71.94 | 95.08 | 96.14 | 93.77 | 95.05 | 95.13 | 0.978 |
| VHD | 0.741 | 0.994 | 68.28 | 9.50 | 38.89 | 95.21 | 99.76 | 62.57 | 95.04 | 97.27 | 0.898 |
| FVDN | 0.837 | 0.999 | 55.83 | 3 | 28 | 95.09 | 100 | 3.70 | 95 | 100 | 0.872 |
| FBCB | 0.930 | 0.854 | 33.78 | 51.56 | 42.58 | 95.17 | 92.40 | 96.98 | 95.23 | 95.14 | 0.984 |
| FNAL | 0.832 | 0.866 | 48.50 | 53.44 | 50.97 | 95.03 | 94.50 | 95.52 | 95.05 | 95.02 | 0.981 |
| LR1 | 0.734 | 0.778 | 86.94 | 86.22 | 86.58 | 95.03 | 95.08 | 94.97 | 95.02 | 95.04 | 0.980 |
| Discriminant analysis | |||||||||||
| FEB | 0.669 | 0.899 | 82.5 | 67.89 | 75.18 | 95.05 | 96.03 | 93.86 | 95.00 | 95.11 | 0.977 |
| VHD | 0.794 | 0.991 | 68.61 | 11.78 | 40.19 | 95.09 | 99.43 | 69.81 | 95.05 | 95.48 | 0.920 |
| FVDN | 0.838 | 0.999 | 56.28 | 3 | 29.64 | 95.13 | 100 | 3.70 | 95.11 | 100 | 0.894 |
| FBCB | 0.928 | 0.854 | 34.17 | 52.22 | 43.19 | 95.31 | 92.36 | 97.23 | 95.62 | 95.11 | 0.984 |
| FNAL | 0.880 | 0.870 | 48.72 | 53.89 | 51.31 | 95.13 | 94.53 | 95.67 | 95.18 | 95.08 | 0.981 |
| DF1 | 0.726 | 0.867 | 84.39 | 79.94 | 82.17 | 95.03 | 95.33 | 94.72 | 95.01 | 95.05 | 0.977 |
| Boosted glm | |||||||||||
| FEB | 0.625 | 0.830 | 82.39 | 67.83 | 75.11 | 95.04 | 95.95 | 93.93 | 95.06 | 95.02 | 0.978 |
| VHD | 0.705 | 0.975 | 68.67 | 12.67 | 40.67 | 95.22 | 99.51 | 71.93 | 95.05 | 96.47 | 0.926 |
| FVDN | 0.794 | 0.993 | 56.5 | 4.11 | 30.31 | 95.14 | 100 | 28.38 | 95.05 | 100 | 0.918 |
| FBCB | 0.885 | 0.806 | 39.11 | 53.50 | 46.31 | 95.20 | 93.32 | 96.57 | 95.22 | 95.19 | 0.985 |
| FNAL | 0.848 | 0.839 | 48.44 | 52.44 | 50.44 | 95.15 | 94.84 | 95.44 | 95.06 | 95.24 | 0.982 |
| BGLM1 | 0.639 | 0.815 | 85.89 | 78 | 81.94 | 95.05 | 95.54 | 94.52 | 95.05 | 95.06 | 0.979 |
| SVM linear | |||||||||||
| FEB | 0.672 | 0.864 | 81.89 | 66.06 | 73.97 | 95.08 | 96.07 | 93.86 | 95.10 | 95.06 | 0.977 |
| VHD | 0.723 | 0.990 | 68.22 | 10.06 | 39.14 | 95.03 | 99.51 | 64.64 | 95.02 | 95.12 | 0.908 |
| FVDN | 0.830 | 0.998 | 54.33 | 2.94 | 28.64 | 95.34 | 100 | 9 | 95.32 | 100 | 0.903 |
| FBCB | 0.915 | 0.839 | 34.22 | 52.17 | 43.19 | 95.11 | 92.37 | 96.91 | 95.15 | 95.09 | 0.983 |
| FNAL | 0.871 | 0.849 | 48.78 | 52.39 | 50.58 | 95.17 | 94.76 | 95.55 | 95.19 | 95.14 | 0.981 |
| SVM1 | 0.719 | 0.783 | 82.55 | 78.78 | 80.67 | 95.04 | 95.29 | 94.78 | 95.03 | 95.05 | 0.975 |
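The sex-specific cutoffs in the table above can be thought of as the lowest posterior probability at which the retained predictions reach the target of 95%. The sketch below shows one plausible way to search for such a cutoff; it is not the authors' procedure, which the source does not spell out. The function `find_cutoff` and the toy data are hypothetical, and the search would be run separately for cases predicted male and cases predicted female.

```python
def find_cutoff(probs, correct, target=0.95):
    """Smallest cutoff at which the retained predictions reach the target.

    probs   -- posterior probability of the predicted class, one per case
    correct -- True where the prediction matched the known sex
    Returns (cutoff, share_of_cases_retained), or None if no cutoff works.
    """
    pairs = sorted(zip(probs, correct))  # ascending by probability
    total = len(pairs)
    for i, (cutoff, _) in enumerate(pairs):
        # Evaluate each distinct probability once so ties stay together.
        if i > 0 and pairs[i - 1][0] == cutoff:
            continue
        kept = pairs[i:]  # cases with probability >= cutoff
        share_correct = sum(ok for _, ok in kept) / len(kept)
        if share_correct >= target:
            return cutoff, len(kept) / total
    return None


# Toy example (invented values): two of the low-probability calls are wrong.
toy_probs = [0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
toy_correct = [False, True, False, True, True, True]
result = find_cutoff(toy_probs, toy_correct)
print(result)
```

The trade-off visible in the table follows directly from this kind of search: the more two groups overlap on a measurement, the higher the cutoff has to be and the fewer cases remain classified.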
Classification results in the testing sample with ABC adjustment (n = 60).
| Variables | Classified Males (%) | Classified Females (%) | Classified Overall (%) | Accuracy (%) | Sensitivity (%) | Specificity (%) | PPV (%) | NPV (%) | c-Index |
|---|---|---|---|---|---|---|---|---|---|
| Logistic regression | |||||||||
| FEB | 80 | 83.33 | 81.67 | 91.84 | 95.83 | 88.00 | 88.46 | 95.65 | 0.987 |
| VHD | 63.33 | 13.33 | 38.33 | 100 | 100 | 100 | 100 | 100 | 1 |
| FVDN | 60 | 0 | 30 | 100 | 100 | / | / | / | / |
| FBCB | 40 | 53.33 | 46.67 | 96.43 | 91.67 | 100 | 100 | 94.12 | 0.953 |
| FNAL | 36.66 | 66.67 | 53 | 100 | 100 | 100 | 100 | 100 | 1 |
| LR1 | 90 | 90 | 90 | 98.15 | 96.30 | 100 | 100 | 96.43 | 0.989 |
| Discriminant analysis | |||||||||
| FEB | 80 | 83.33 | 81.67 | 91.84 | 95.83 | 88.00 | 88.46 | 95.65 | 0.987 |
| VHD | 66.33 | 13.33 | 38.33 | 100 | 100 | 100 | 100 | 100 | 1 |
| FVDN | 60 | 0 | 30 | 100 | 100 | / | / | / | / |
| FBCB | 40 | 53.33 | 46.67 | 96.43 | 91.67 | 100 | 100 | 94.12 | 0.953 |
| FNAL | 40 | 66.67 | 53.33 | 100 | 100 | 100 | 100 | 100 | 1 |
| DF1 | 86.67 | 93.33 | 90 | 98.15 | 96.15 | 100 | 100 | 96.55 | 0.990 |
| Boosted glm | |||||||||
| FEB | 80 | 83.33 | 81.67 | 91.84 | 95.83 | 88 | 88.46 | 95.65 | 0.987 |
| VHD | 63.33 | 13.33 | 38.33 | 100 | 100 | 100 | 100 | 100 | 1 |
| FVDN | 63.33 | 0 | 31.67 | 100 | 100 | / | / | / | / |
| FBCB | 40 | 60 | 50 | 96.67 | 91.67 | 100 | 100 | 94.74 | 0.951 |
| FNAL | 40 | 66.67 | 53.33 | 100 | 100 | 100 | 100 | 100 | 1 |
| BGLM1 | 90 | 90 | 90 | 98.15 | 96.30 | 100 | 100 | 96.43 | 0.989 |
| SVM Linear | |||||||||
| FEB | 80 | 83.33 | 81.67 | 91.84 | 95.83 | 88 | 88.46 | 95.65 | 0.987 |
| VHD | 63.33 | 13.33 | 38.33 | 100 | 100 | 100 | 100 | 100 | 1 |
| FVDN | 60 | 0 | 30 | 100 | 100 | / | / | / | / |
| FBCB | 40 | 53.33 | 46.67 | 96.43 | 91.67 | 100 | 100 | 94.12 | 0.953 |
| FNAL | 43.33 | 66.67 | 55 | 96.67 | 92.31 | 100 | 100 | 95.24 | 1 |
| SVM1 | 90 | 93.33 | 91.67 | 98.18 | 96.30 | 100 | 100 | 96.55 | 0.991 |
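Applying previously fixed cutoffs to new cases, as in the testing sample above, amounts to a three-way decision: male, female, or left unclassified. The sketch below illustrates this with the LR1 cutoffs reported for the training sample (0.734 for males, 0.778 for females). The function name and the posterior probabilities are invented for illustration, and the assumption that the female posterior equals one minus the male posterior holds for a two-class model but is not stated in the source.

```python
def classify_with_cutoffs(p_male, male_cutoff, female_cutoff):
    """Return 'male', 'female', or None when neither cutoff is reached."""
    if p_male >= male_cutoff:
        return "male"
    if (1 - p_male) >= female_cutoff:  # assumed two-class posterior
        return "female"
    return None


# Hypothetical posterior probabilities of being male for four new cases.
new_cases = [0.95, 0.70, 0.40, 0.10]
calls = [classify_with_cutoffs(p, 0.734, 0.778) for p in new_cases]
share_classified = sum(call is not None for call in calls) / len(calls)
print(calls, share_classified)
```

Cases that fall between the two cutoffs stay unclassified, which is why the tables report the percentage of classified cases alongside accuracy.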
Figure 1. Correlation plot between the sample size and its effect on pp cutoff values and the percentage of classified males and females, as well as the overall classified individuals. Correlation coefficients are shown for statistically significant values only (p < 0.05).
Figure 2. Influence of the training sample size on the proportion of classified specimens in the testing sample. Several sample sizes of the training dataset are plotted against the proportion of classified individuals (overall and sex-specific rates) in the test dataset (n = 60) using different ML models; each algorithm has its own panel.
Figure 3. Influence of the training sample size on the classification performance parameters in the testing sample (n = 60). It should be noted that multivariate models were computed using the variable selection procedure. The dashed line marks the desired level of 95%.