Xi Hang Cao1, Ivan Stojkovic1,2, Zoran Obradovic3.
Abstract
BACKGROUND: Machine learning models have been adopted in biomedical research and practice for knowledge discovery and decision support. While mainstream biomedical informatics research focuses on developing more accurate models, data preprocessing receives less attention. We propose the Generalized Logistic (GL) algorithm, which scales data uniformly to an appropriate interval by learning a generalized logistic function that fits the empirical cumulative distribution function (ECDF) of the data. The GL algorithm is simple yet effective. It is intrinsically robust to outliers, so it is particularly suitable for diagnostic/classification models in clinical/medical applications, where the number of samples is usually small. It also scales the data in a nonlinear fashion, which can improve accuracy.
Keywords: Classification model; Data normalization; Data scaling; Empirical cumulative distribution function; Generalized logistic function; Outlier
Year: 2016 PMID: 27612635 PMCID: PMC5016890 DOI: 10.1186/s12859-016-1236-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Fitting of the ECDF using the GL algorithm. An example showing the approximation of an ECDF using a generalized logistic (GL) function
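For reference, the two objects named in this caption can be written out as follows. The ECDF definition is standard. The parameterization of the generalized logistic function (the symbols $Q$, $B$, $M$, $\nu$) follows the usual Richards-curve form, and the least-squares objective is an assumption; neither is stated in the text above, so treat this as a sketch rather than the authors' exact formulation.

```latex
\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\},
\qquad
L(x) = \frac{1}{\left(1 + Q\, e^{-B (x - M)}\right)^{1/\nu}},
\qquad
(Q, B, M, \nu) \in \arg\min_{Q, B, M, \nu} \sum_{i=1}^{n} \left( L(x_i) - \hat{F}_n(x_i) \right)^2 .
```

Under this reading, a feature value $x$ is scaled to $L(x) \in (0, 1)$, so values in the dense part of the distribution are spread out while extreme values saturate near 0 or 1.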
Fig. 2 Behavior of data scaling algorithms with/without outliers. Top panels a–c: when there is no outlier in the data, the Min-max, Z-score and GL algorithms behave very similarly. Bottom panels d–f: when there is an outlier in the data, the Min-max and Z-score algorithms are significantly affected, but the impact of the outlier on the GL algorithm is negligible
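The effect described in this caption can be reproduced on a toy feature. The sketch below is not the GL algorithm itself: as a simplified, assumed stand-in it uses a plain two-parameter logistic curve matched to the ECDF at its quartiles (the helper `logistic_quartile_scale` and the data are hypothetical), which is enough to show why an ECDF-based nonlinear scaling is less sensitive to a single outlier than Min-max or Z-score.

```python
import math
import statistics


def min_max(xs):
    """Linearly map values onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    """Center on the mean and divide by the population standard deviation."""
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_quartile_scale(xs):
    """Assumed stand-in for GL: logistic curve matched to the ECDF quartiles.

    For F(x) = 1 / (1 + exp(-B (x - M))), the median is M and the
    interquartile range is 2 ln(3) / B, so both follow from the quartiles.
    """
    q1, med, q3 = statistics.quantiles(xs, n=4)
    b = 2.0 * math.log(3.0) / (q3 - q1)
    return [sigmoid(b * (x - med)) for x in xs]


def spread(values):
    """Range occupied by a list of scaled values."""
    return max(values) - min(values)


# Hypothetical feature: nine well-behaved values plus one extreme outlier.
inliers = [float(v) for v in range(1, 10)]
with_outlier = inliers + [1000.0]

for name, scale in [("Min-max", min_max),
                    ("Z-score", z_score),
                    ("Logistic", logistic_quartile_scale)]:
    scaled = scale(with_outlier)
    # The outlier is the last element, so scaled[:-1] holds the inliers.
    print(f"{name:>8}: inlier spread = {spread(scaled[:-1]):.3f}")
```

With Min-max and Z-score the nine inliers collapse into a narrow band because the outlier dominates the range and the standard deviation, whereas the quartile-matched logistic keeps them spread over most of (0, 1) and pushes the outlier toward 1.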
Fig. 3 A 2D illustration of how the GL algorithm can affect classification accuracy. a raw data without scaling; b data scaled by the Min-max algorithm; c data scaled by the Z-score algorithm; d data scaled by the GL algorithm
Summary of datasets used in experiments (sorted by the no. of subjects in ascending order)
| Dataset | No. of subjects (pos/neg) | Var. type | No. of var. | Task |
|---|---|---|---|---|
| GSE27899IL | 10/10 | DNA methylation | 27578 | diagnose ulcerative colitis |
| Prostate Cancer | 14/9 | microarray gene expression | 15009 | diagnose prostate cancer |
| Colon Cancer | 15/11 | microarray gene expression | 15009 | diagnose colon cancer |
| Lung Cancer | 20/7 | microarray gene expression | 15009 | diagnose lung cancer |
| Breast Cancer | 17/15 | microarray gene expression | 15009 | diagnose breast cancer |
| Leukemia | 11/27 | microarray gene expression | 7129 | diagnose leukemia |
| GSE29490 | 20/7 | DNA methylation | 26916 | diagnose colorectal carcinoma |
| GSE25869 | 14/9 | DNA methylation | 27570 | diagnose gastric cancer |
| Breast tissue | 21/85 | impedance measurements | 9 | diagnose breast tumor |
| LSVT | 42/84 | wavelet and frequency based measurements | 310 | assessment of treatments in Parkinson's disease |
| DLBCL | 88/72 | microarray gene expression | 715 | diagnose DLBCL |
| Myeloma | 137/36 | microarray gene expression | 12625 | diagnose bone lesions |
| Parkinsons | 147/48 | vocal based measurements | 22 | diagnose Parkinson's disease |
| Wdbc | 212/357 | nuclear features from images | 30 | diagnose breast tumor |
| Indian Liver | 414/165 | biochemistry based measurements | 9 | diagnose liver disease |
| Pima Indians Diabetes | 268/500 | clinical measurements | 8 | diagnose diabetes |
Results of 16 datasets (16 binary classification tasks) using different data scaling algorithms and classification models
| Dataset | Method | None | Minmax | Zscore | GL |
|---|---|---|---|---|---|
| GSE27899IL | LR | NA ± NA | NA ± NA | NA ± NA | NA ± NA |
| | SVM | 0.768 ± 0.104 | 0.814 ± 0.084 | 0.814 ± 0.074 | |
| Prostate Cancer | LR | 0.464 ± 0.000 | 0.749 ± 0.130 | 0.689 ± 0.156 | |
| | SVM | 0.573 ± 0.198 | 0.725 ± 0.232 | 0.713 ± 0.244 | |
| Colon Cancer | LR | 0.500 ± 0.000 | 0.895 ± 0.092 | 0.892 ± 0.082 | |
| | SVM | 0.670 ± 0.184 | 0.940 ± 0.058 | 0.937 ± 0.050 | |
| Lung Cancer | LR | 0.450 ± 0.000 | 0.839 ± 0.096 | 0.834 ± 0.108 | |
| | SVM | 0.397 ± 0.274 | 0.716 ± 0.136 | 0.710 ± 0.152 | |
| Breast Cancer | LR | 0.324 ± 0.000 | 0.809 ± 0.038 | | 0.819 ± 0.022 |
| | SVM | 0.708 ± 0.158 | 0.793 ± 0.052 | 0.795 ± 0.042 | |
| Leukemia | LR | 0.500 ± 0.000 | 0.988 ± 0.014 | 0.990 ± 0.006 | |
| | SVM | 0.935 ± 0.034 | 0.992 ± 0.010 | 0.991 ± 0.008 | |
| GSE29490 | LR | NA ± NA | NA ± NA | NA ± NA | NA ± NA |
| | SVM | 0.983 ± 0.012 | 0.984 ± 0.034 | 0.985 ± 0.034 | |
| GSE25869 | LR | NA ± NA | NA ± NA | NA ± NA | NA ± NA |
| | SVM | 0.935 ± 0.024 | 0.937 ± 0.020 | 0.938 ± 0.016 | |
| Breast tissue | LR | 0.520 ± 0.006 | 0.961 ± 0.032 | | 0.940 ± 0.054 |
| | SVM | 0.713 ± 0.108 | 0.968 ± 0.006 | 0.970 ± 0.014 | |
| LSVT | LR | 0.500 ± 0.000 | 0.875 ± 0.008 | 0.846 ± 0.022 | |
| | SVM | 0.500 ± 0.000 | 0.879 ± 0.012 | 0.863 ± 0.014 | |
| DLBCL | LR | 0.601 ± 0.038 | 0.608 ± 0.038 | 0.610 ± 0.048 | |
| | SVM | 0.616 ± 0.050 | 0.622 ± 0.052 | 0.619 ± 0.052 | |
| Myeloma | LR | 0.500 ± 0.000 | 0.729 ± 0.044 | 0.739 ± 0.072 | |
| | SVM | 0.573 ± 0.098 | 0.748 ± 0.052 | 0.747 ± 0.054 | |
| Parkinsons | LR | 0.875 ± 0.012 | 0.896 ± 0.054 | 0.893 ± 0.058 | |
| | SVM | 0.882 ± 0.010 | 0.875 ± 0.010 | 0.885 ± 0.024 | |
| Wdbc | LR | 0.942 ± 0.002 | 0.982 ± 0.004 | 0.978 ± 0.006 | |
| | SVM | 0.990 ± 0.002 | 0.994 ± 0.002 | 0.993 ± 0.004 | |
| Indian Liver | LR | 0.680 ± 0.002 | 0.743 ± 0.008 | 0.742 ± 0.008 | |
| | SVM | 0.636 ± 0.068 | 0.696 ± 0.008 | 0.692 ± 0.034 | |
| Pima Indians Diabetes | LR | 0.604 ± 0.004 | 0.827 ± 0.004 | 0.827 ± 0.004 | |
| | SVM | 0.826 ± 0.004 | 0.828 ± 0.006 | 0.828 ± 0.006 | |
Performance is measured by the average area under the ROC curve (AUC) in 5-fold cross-validation. The means and 95 % confidence intervals are included. Column names: None - no data scaling; Minmax - Min-max algorithm; Zscore - Z-score algorithm; GL - GL algorithm. Best performances are emphasized in bold
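As a rough illustration of how a "mean ± interval" entry of this kind can be produced, the sketch below computes AUC from its rank interpretation and summarizes per-fold scores. How the intervals in the table were actually computed is not stated here; the normal-approximation half-width (`z = 1.96`), the helper names, and the fold scores are all assumptions made for the example.

```python
import math
import statistics


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_with_ci(values, z=1.96):
    """Mean and half-width of an assumed normal-approximation 95 % interval."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical per-fold AUCs from one 5-fold cross-validation run.
fold_aucs = [0.80, 0.90, 0.85, 0.95, 0.75]
mean, half_width = mean_with_ci(fold_aucs)
print(f"{mean:.3f} ± {half_width:.3f}")
```

With only five folds, a t-based interval would be wider than this normal approximation, which is one reason to read the reported intervals as indicative rather than exact.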
Results of 16 datasets (16 binary classification tasks) using different data scaling algorithms and classification models
| Dataset | Method | None | Minmax | Zscore | GL |
|---|---|---|---|---|---|
| GSE27899IL | LR | NA ± NA | NA ± NA | NA ± NA | NA ± NA |
| | SVM | 0.770 ± 0.054 | | | |
| Prostate Cancer | LR | 0.609 ± 0.000 | 0.757 ± 0.132 | 0.722 ± 0.078 | |
| | SVM | 0.635 ± 0.078 | 0.748 ± 0.156 | 0.748 ± 0.156 | |
| Colon Cancer | LR | 0.577 ± 0.000 | 0.877 ± 0.064 | 0.877 ± 0.064 | |
| | SVM | 0.677 ± 0.178 | 0.900 ± 0.042 | 0.915 ± 0.034 | |
| Lung Cancer | LR | 0.741 ± 0.000 | 0.859 ± 0.062 | 0.852 ± 0.052 | |
| | SVM | 0.778 ± 0.052 | 0.859 ± 0.034 | 0.859 ± 0.034 | |
| Breast Cancer | LR | 0.773 ± 0.000 | 0.918 ± 0.040 | | |
| | SVM | 0.827 ± 0.040 | 0.909 ± 0.000 | 0.909 ± 0.000 | |
| Leukemia | LR | 0.710 ± 0.000 | 0.956 ± 0.030 | 0.965 ± 0.030 | |
| | SVM | 0.939 ± 0.030 | 0.965 ± 0.030 | 0.965 ± 0.030 | |
| GSE29490 | LR | NA ± NA | NA ± NA | NA ± NA | NA ± NA |
| | SVM | 0.942 ± 0.034 | 0.954 ± 0.034 | 0.958 ± 0.034 | |
| GSE25869 | LR | NA ± NA | NA ± NA | NA ± NA | NA ± NA |
| | SVM | 0.891 ± 0.038 | 0.891 ± 0.044 | 0.894 ± 0.034 | |
| Breast tissue | LR | 0.778 ± 0.016 | 0.930 ± 0.010 | | 0.927 ± 0.016 |
| | SVM | 0.681 ± 0.220 | 0.932 ± 0.024 | 0.926 ± 0.020 | |
| LSVT | LR | 0.500 ± 0.000 | 0.870 ± 0.012 | 0.824 ± 0.038 | |
| | SVM | 0.500 ± 0.000 | 0.873 ± 0.036 | 0.858 ± 0.036 | |
| DLBCL | LR | 0.567 ± 0.014 | 0.571 ± 0.014 | 0.579 ± 0.032 | |
| | SVM | 0.594 ± 0.082 | 0.592 ± 0.064 | 0.585 ± 0.044 | |
| Myeloma | LR | 0.792 ± 0.000 | 0.805 ± 0.020 | 0.804 ± 0.018 | |
| | SVM | 0.794 ± 0.006 | 0.809 ± 0.014 | 0.807 ± 0.026 | |
| Parkinsons | LR | 0.865 ± 0.006 | | 0.891 ± 0.016 | 0.868 ± 0.006 |
| | SVM | 0.880 ± 0.016 | | 0.877 ± 0.020 | 0.868 ± 0.016 |
| Wdbc | LR | 0.878 ± 0.002 | 0.965 ± 0.010 | 0.963 ± 0.012 | |
| | SVM | 0.960 ± 0.010 | 0.979 ± 0.004 | 0.976 ± 0.002 | |
| Indian Liver | LR | 0.716 ± 0.002 | 0.727 ± 0.014 | 0.733 ± 0.008 | |
| | SVM | 0.719 ± 0.006 | 0.720 ± 0.014 | 0.718 ± 0.010 | |
| Pima Indians Diabetes | LR | 0.490 ± 0.070 | 0.738 ± 0.010 | 0.738 ± 0.010 | |
| | SVM | 0.734 ± 0.052 | | 0.753 ± 0.040 | 0.748 ± 0.034 |
Performance is measured by the average proportion of correct classifications (accuracy) in 5-fold cross-validation. The means and 95 % confidence intervals are included. Column names: None - no data scaling; Minmax - Min-max algorithm; Zscore - Z-score algorithm; GL - GL algorithm. Best performances are emphasized in bold