Yiming Li, Wei-Wen Hsu.
Abstract
Imbalanced classification has drawn considerable attention in the statistics and machine learning literature. Traditional classification methods often perform poorly when the class distribution is severely skewed, and even more so under a high-dimensional longitudinal data structure. Given the ubiquity of big data in modern health research, imbalanced classification in disease diagnosis is expected to face an additional level of difficulty imposed by such a complex data structure. In this article, we propose a nonparametric classification approach for imbalanced data in longitudinal and high-dimensional settings. Functional principal component analysis is first applied for feature extraction under the longitudinal structure. The univariate exponential loss function coupled with a group LASSO penalty is then adopted in the classification procedure for high-dimensional settings. Along with improved imbalanced classification, our approach provides meaningful feature selection for interpretation while enjoying a remarkably lower computational complexity. The proposed method is illustrated with a real data application to early detection of Alzheimer's disease, and its finite-sample empirical performance is extensively evaluated by simulations.
Keywords: AUC; Alzheimer's disease; brain imaging data; class imbalance; group LASSO
Year: 2022 PMID: 35603639 PMCID: PMC9541048 DOI: 10.1002/sim.9442
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.497
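As a rough illustration of the two-step procedure described in the abstract, one common way to write an exponential-loss classifier with a group LASSO penalty on functional principal component scores is sketched below. The notation ($y_i$, $\xi_{ij}$, $\beta_j$, $p_j$, $\lambda$) is assumed here for illustration and is not taken from the paper, whose exact formulation may differ.

$$
(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_J) = \arg\min_{\beta_0, \beta_1, \dots, \beta_J} \; \frac{1}{n} \sum_{i=1}^{n} \exp\Big\{ -y_i \Big( \beta_0 + \sum_{j=1}^{J} \xi_{ij}^{\top} \beta_j \Big) \Big\} + \lambda \sum_{j=1}^{J} \sqrt{p_j} \, \lVert \beta_j \rVert_2
$$

In this sketch $y_i \in \{-1, +1\}$ is the class label of subject $i$, $\xi_{ij} \in \mathbb{R}^{p_j}$ collects the leading $p_j$ principal component scores of the $j$-th longitudinal feature, and $\lambda \ge 0$ is a tuning parameter. Because the penalty acts on each block $\beta_j$ as a whole, a block shrunk to zero removes the corresponding feature, which is how a group penalty of this kind yields feature selection.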
Demographic characteristics of selected subjects
| Group | n | Age mean (years) | Age SD (years) | Male (%) | Female (%) |
|---|---|---|---|---|---|
| CN | 237 | 74.5 | 5.6 | 52.7 | 47.3 |
| AD | 30 | 75.4 | 3.9 | 40.0 | 60.0 |
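To make the degree of imbalance in this sample concrete, the short sketch below computes what a trivial rule that labels every subject as CN would achieve on the 237 CN and 30 AD subjects listed above. The function name `majority_baseline` is hypothetical and used only for illustration.

```python
def majority_baseline(n_majority, n_minority):
    """Metrics of a rule that always predicts the majority (negative) class."""
    total = n_majority + n_minority
    return {
        "imbalance_ratio": n_majority / n_minority,
        "accuracy": n_majority / total,  # every majority subject is correct
        "sensitivity": 0.0,              # no minority subject is ever detected
        "specificity": 1.0,              # no majority subject is misclassified
    }


# Counts taken from the demographic table: 237 CN and 30 AD subjects.
baseline = majority_baseline(n_majority=237, n_minority=30)
print(baseline)
```

With these counts the trivial rule reaches an accuracy of about 0.89 while detecting no AD patient, which illustrates why accuracy alone can be misleading under class imbalance.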
Distribution of number of visits
| Visits | CN subjects | AD subjects |
|---|---|---|
| 3 | 68 | 2 |
| 4 | 100 | 4 |
| 5 | 13 | 3 |
| 6 | 10 | 5 |
| 7 | 13 | 2 |
| 8 | 14 | 4 |
| 9 | 10 | 7 |
| 10 | 9 | 3 |
| Total | 237 | 30 |
FIGURE 1 Longitudinal trajectories of ADAS‐Cog 13 for cognitively normal subjects and AD patients
FIGURE 2 Clinical diagnosis of a CN subject or an AD patient over time. The red box represents the data used for model training. The blue box represents the final diagnosis used as the membership outcome
Classification results (S.E.) for ADNI data with logistic, linear SVM and the proposed method based on 500 Monte Carlo replicates
| Set | Metric | Logistic | Linear SVM | Proposed method |
|---|---|---|---|---|
| Training set | Sensitivity | | | |
| | Specificity | | | |
| | Accuracy | | | |
| | AUC | | | |
| Test set | Sensitivity | | | |
| | Specificity | | | |
| | Accuracy | | | |
| | AUC | | | |
Abbreviations: logistic, logistic regression with penalty; Linear SVM, support vector machine with linear kernel; (), number of subjects in the CN and AD groups respectively.
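The four metrics reported in these tables can be computed from true labels and classifier scores as in the minimal sketch below. It assumes labels coded 1 for the disease group and 0 for the healthy group, a decision threshold of 0 on the score, and a pairwise (Mann-Whitney) estimate of the AUC; the function name `classification_metrics` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np


def classification_metrics(y_true, scores, threshold=0.0):
    """Sensitivity, specificity, accuracy and AUC for binary labels (1 = disease)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = y_true == 1
    neg = ~pos
    pred_pos = scores > threshold

    tp = np.sum(pred_pos & pos)    # diseased subjects flagged as diseased
    tn = np.sum(~pred_pos & neg)   # healthy subjects flagged as healthy

    # AUC as the fraction of (diseased, healthy) pairs ranked correctly,
    # counting ties as one half.
    diff = scores[pos][:, None] - scores[neg][None, :]
    auc = np.mean((diff > 0) + 0.5 * (diff == 0))

    return {
        "sensitivity": float(tp / pos.sum()),
        "specificity": float(tn / neg.sum()),
        "accuracy": float((tp + tn) / y_true.size),
        "auc": float(auc),
    }


# Toy example: 2 diseased and 4 healthy subjects.
y = [1, 1, 0, 0, 0, 0]
s = [0.9, -0.2, 0.1, -0.5, -0.7, -0.9]
metrics = classification_metrics(y, s)
print(metrics)
```

Unlike the other three metrics, the AUC does not depend on the threshold, which makes it a convenient summary when the classes are imbalanced and the cut-off is open to choice.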
Classification results (S.E.) of logistic regression, linear SVM and the proposed method at various imbalance ratios in low‐dimensional setting based on 500 Monte Carlo replicates
| Imbalance ratio | | Logistic | SVM | Proposed | Logistic | SVM | Proposed | Logistic | SVM | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| Training | Sensitivity | 0.689 (0.061) | 0.681 (0.071) | 0.873 (0.035) | 0.618 (0.081) | 0.594 (0.100) | 0.884 (0.039) | 0.548 (0.108) | 0.483 (0.156) | 0.896 (0.041) |
| | Specificity | 0.941 (0.013) | 0.945 (0.015) | 0.856 (0.034) | 0.965 (0.009) | 0.970 (0.012) | 0.867 (0.034) | 0.981 (0.007) | 0.987 (0.008) | 0.882 (0.038) |
| | Accuracy | 0.880 (0.019) | 0.882 (0.019) | 0.860 (0.025) | 0.910 (0.016) | 0.910 (0.016) | 0.870 (0.029) | 0.938 (0.013) | 0.937 (0.013) | 0.883 (0.034) |
| | AUC | 0.933 (0.016) | 0.932 (0.016) | 0.932 (0.016) | 0.940 (0.017) | 0.937 (0.018) | 0.937 (0.017) | 0.948 (0.017) | 0.945 (0.019) | 0.945 (0.018) |
| Test | Sensitivity | 0.672 (0.065) | 0.658 (0.071) | 0.833 (0.059) | 0.594 (0.087) | 0.564 (0.097) | 0.828 (0.074) | 0.507 (0.111) | 0.435 (0.147) | 0.819 (0.087) |
| | Specificity | 0.933 (0.021) | 0.935 (0.021) | 0.840 (0.039) | 0.959 (0.015) | 0.964 (0.016) | 0.857 (0.037) | 0.975 (0.012) | 0.981 (0.013) | 0.871 (0.040) |
| | Accuracy | 0.870 (0.018) | 0.869 (0.018) | 0.838 (0.026) | 0.900 (0.016) | 0.899 (0.016) | 0.852 (0.028) | 0.927 (0.015) | 0.925 (0.015) | 0.866 (0.032) |
| | AUC | 0.923 (0.017) | 0.922 (0.018) | 0.921 (0.018) | 0.929 (0.019) | 0.927 (0.019) | 0.927 (0.020) | 0.934 (0.020) | 0.932 (0.020) | 0.931 (0.021) |
Note: .
Abbreviation: , number of subjects in health and disease groups respectively.
Classification results (S.E.) of logistic regression, linear SVM and the proposed method at various imbalance ratios in high‐dimensional setting based on 500 Monte Carlo replicates
| Imbalance ratio | | Logistic | SVM | Proposed | Logistic | SVM | Proposed | Logistic | SVM | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| Training | Sensitivity | 0.705 (0.221) | 0.999 (0.001) | 0.900 (0.035) | 0.659 (0.293) | 0.999 (0.001) | 0.905 (0.052) | 0.549 (0.360) | 0.999 (0.001) | 0.899 (0.062) |
| | Specificity | 0.997 (0.005) | 0.999 (0.001) | 0.894 (0.037) | 0.999 (0.002) | 0.999 (0.001) | 0.891 (0.050) | 0.999 (0.001) | 0.999 (0.001) | 0.888 (0.061) |
| | Accuracy | 0.928 (0.053) | 0.999 (0.001) | 0.896 (0.032) | 0.942 (0.049) | 0.999 (0.001) | 0.893 (0.047) | 0.938 (0.049) | 0.999 (0.001) | 0.889 (0.058) |
| | AUC | 0.982 (0.022) | 0.999 (0.001) | 0.957 (0.020) | 0.982 (0.051) | 0.999 (0.001) | 0.952 (0.034) | 0.944 (0.135) | 0.999 (0.001) | 0.946 (0.044) |
| Test | Sensitivity | 0.412 (0.112) | 0.221 (0.058) | 0.791 (0.078) | 0.262 (0.122) | 0.109 (0.056) | 0.724 (0.122) | 0.174 (0.125) | 0.063 (0.043) | 0.686 (0.137) |
| | Specificity | 0.968 (0.021) | 0.901 (0.026) | 0.856 (0.039) | 0.982 (0.016) | 0.958 (0.017) | 0.860 (0.048) | 0.989 (0.013) | 0.977 (0.013) | 0.860 (0.057) |
| | Accuracy | 0.836 (0.025) | 0.740 (0.023) | 0.841 (0.031) | 0.859 (0.021) | 0.813 (0.020) | 0.837 (0.037) | 0.874 (0.019) | 0.848 (0.018) | 0.835 (0.044) |
| | AUC | 0.892 (0.029) | 0.645 (0.038) | 0.913 (0.028) | 0.876 (0.050) | 0.640 (0.044) | 0.889 (0.043) | 0.842 (0.108) | 0.635 (0.050) | 0.877 (0.047) |
Note: .
Abbreviation: : number of subjects in the health and disease groups respectively.
Classification results (S.E.) with various sample sizes of health and disease groups in low‐dimensional setting based on 500 Monte Carlo replicates
| Sample size | | Logistic | SVM | Proposed | Logistic | SVM | Proposed | Logistic | SVM | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| Training | Sensitivity | 0.774 (0.097) | 0.767 (0.108) | 0.895 (0.049) | 0.596 (0.205) | 0.555 (0.250) | 0.873 (0.071) | 0.454 (0.226) | 0.379 (0.278) | 0.852 (0.086) |
| | Specificity | 0.954 (0.014) | 0.957 (0.015) | 0.894 (0.045) | 0.976 (0.008) | 0.982 (0.010) | 0.864 (0.074) | 0.985 (0.007) | 0.991 (0.008) | 0.846 (0.080) |
| | Accuracy | 0.909 (0.032) | 0.909 (0.032) | 0.895 (0.042) | 0.922 (0.029) | 0.921 (0.032) | 0.866 (0.070) | 0.932 (0.021) | 0.930 (0.023) | 0.846 (0.079) |
| | AUC | 0.953 (0.033) | 0.952 (0.035) | 0.951 (0.034) | 0.927 (0.064) | 0.919 (0.081) | 0.925 (0.064) | 0.906 (0.076) | 0.885 (0.110) | 0.902 (0.077) |
| Test | Sensitivity | 0.751 (0.099) | 0.743 (0.108) | 0.865 (0.059) | 0.555 (0.198) | 0.511 (0.235) | 0.818 (0.095) | 0.416 (0.214) | 0.335 (0.254) | 0.780 (0.119) |
| | Specificity | 0.949 (0.017) | 0.951 (0.016) | 0.882 (0.048) | 0.970 (0.012) | 0.976 (0.014) | 0.853 (0.075) | 0.981 (0.010) | 0.988 (0.011) | 0.837 (0.080) |
| | Accuracy | 0.899 (0.033) | 0.898 (0.032) | 0.878 (0.044) | 0.911 (0.029) | 0.909 (0.030) | 0.848 (0.071) | 0.925 (0.019) | 0.922 (0.020) | 0.832 (0.077) |
| | AUC | 0.945 (0.037) | 0.944 (0.038) | 0.945 (0.037) | 0.912 (0.066) | 0.907 (0.079) | 0.912 (0.068) | 0.889 (0.079) | 0.872 (0.109) | 0.889 (0.080) |
Note: .
Abbreviation: : number of subjects in the health and disease groups respectively.
Classification results (S.E.) with various sample sizes of health and disease groups in high‐dimensional setting based on 500 Monte Carlo replicates
| Sample size | | Logistic | SVM | Proposed | Logistic | SVM | Proposed | Logistic | SVM | Proposed |
|---|---|---|---|---|---|---|---|---|---|---|
| Training | Sensitivity | 0.825 (0.121) | 0.999 (0.001) | 0.927 (0.033) | 0.782 (0.234) | 0.999 (0.001) | 0.921 (0.039) | 0.395 (0.369) | 0.999 (0.001) | 0.911 (0.053) |
| | Specificity | 0.992 (0.007) | 0.999 (0.001) | 0.924 (0.032) | 0.999 (0.001) | 0.999 (0.001) | 0.918 (0.035) | 0.999 (0.001) | 0.999 (0.001) | 0.904 (0.053) |
| | Accuracy | 0.950 (0.034) | 0.999 (0.001) | 0.924 (0.028) | 0.968 (0.034) | 0.999 (0.001) | 0.918 (0.032) | 0.939 (0.036) | 0.999 (0.001) | 0.905 (0.051) |
| | AUC | 0.987 (0.013) | 0.999 (0.001) | 0.972 (0.015) | 0.993 (0.024) | 0.999 (0.001) | 0.968 (0.019) | 0.842 (0.224) | 0.999 (0.001) | 0.956 (0.034) |
| Test | Sensitivity | 0.657 (0.063) | 0.457 (0.056) | 0.861 (0.053) | 0.389 (0.113) | 0.192 (0.056) | 0.793 (0.076) | 0.131 (0.121) | 0.064 (0.047) | 0.709 (0.113) |
| | Specificity | 0.967 (0.015) | 0.930 (0.016) | 0.892 (0.031) | 0.987 (0.009) | 0.981 (0.008) | 0.895 (0.037) | 0.997 (0.004) | 0.994 (0.005) | 0.885 (0.054) |
| | Accuracy | 0.890 (0.018) | 0.812 (0.020) | 0.884 (0.024) | 0.901 (0.015) | 0.868 (0.011) | 0.881 (0.031) | 0.910 (0.010) | 0.901 (0.006) | 0.868 (0.046) |
| | AUC | 0.947 (0.016) | 0.832 (0.029) | 0.951 (0.017) | 0.922 (0.032) | 0.807 (0.037) | 0.931 (0.026) | 0.777 (0.183) | 0.762 (0.049) | 0.897 (0.042) |
Note: .
Abbreviation: : number of subjects in the health and disease groups respectively.