Filippo Biscarini, Nelson Nazzicari, Chiara Broccanello, Piergiorgio Stevanato, Simone Marini.
Abstract
BACKGROUND: Noise (errors) in scientific data is endemic and may have a detrimental effect on statistical analyses and experimental results. The effects of noisy data have been assessed in genome-wide association studies for case-control experiments in human medicine. Little is known, however, about the impact of noisy data on genomic predictions, a widely used statistical application in plant and animal breeding.
Keywords: Binomial phenotype; Classification; Genomic predictions; K-nearest neighbours (KNN); Noisy data; Random forest (RF); Ridge logistic regression; Robustness to errors; Sugar beet; Support vector machines (SVM)
Year: 2016 PMID: 27437026 PMCID: PMC4949885 DOI: 10.1186/s13007-016-0136-4
Source DB: PubMed Journal: Plant Methods ISSN: 1746-4811 Impact factor: 4.993
Total classification error (TER), false negative (FNR) and false positive (FPR) rates, and area under the ROC curve (AUC) for increasing proportions of mislabeled observations with the five tested classification models
| misLabels (%) | minFreq | errType | KNN | LR | RF | SVM-Lin | SVM-Rbf |
|---|---|---|---|---|---|---|---|
| 0 | 0.1950 | TER | 0.0000 | 0.0000 | 0.0085 | 0.0000 | 0.0001 |
| | | FNR | 0.0000 | 0.0000 | 0.0067 | 0.0000 | 0.0001 |
| | | FPR | 0.0020 | 0.0020 | 0.0054 | 0.0020 | 0.0036 |
| | | AUC | | 0.9980 | 0.9946 | 0.9980 | 0.9961 |
| 1 | 0.1870 | TER | 0.0039 | 0.0153 | 0.0092 | 0.0077 | 0.0008 |
| | | FNR | 0.0042 | 0.0153 | 0.0076 | 0.0078 | 0.0007 |
| | | FPR | 0.0038 | 0.0046 | 0.0036 | 0.0095 | 0.0044 |
| | | AUC | 0.9961 | 0.9954 | | 0.9905 | 0.9955 |
| 2.5 | 0.1870 | TER | 0.0045 | 0.0291 | 0.0102 | 0.0145 | 0.0004 |
| | | FNR | 0.0049 | 0.0283 | 0.0094 | 0.0139 | 0.0004 |
| | | FPR | 0.0041 | 0.0094 | 0.0032 | 0.0174 | 0.0023 |
| | | AUC | 0.9959 | 0.9905 | 0.9968 | 0.9825 | |
| 5 | 0.2114 | TER | 0.0088 | 0.0897 | 0.0236 | 0.0471 | 0.0047 |
| | | FNR | 0.0096 | 0.0864 | 0.0213 | 0.0466 | 0.0043 |
| | | FPR | 0.0052 | 0.0484 | 0.0052 | 0.0496 | 0.0073 |
| | | AUC | 0.9948 | 0.9516 | | 0.9503 | 0.9918 |
| 7.5 | 0.2520 | TER | 0.0160 | 0.1431 | 0.0342 | 0.0708 | 0.0087 |
| | | FNR | 0.0159 | 0.1386 | 0.0307 | 0.0688 | 0.0077 |
| | | FPR | 0.0071 | 0.0920 | 0.0077 | 0.0748 | 0.0116 |
| | | AUC | | 0.9080 | 0.9921 | 0.9251 | 0.9882 |
| 10 | 0.2439 | TER | 0.0292 | 0.2011 | 0.0553 | 0.1111 | 0.0205 |
| | | FNR | 0.0294 | 0.1963 | 0.0521 | 0.1105 | 0.0188 |
| | | FPR | 0.0100 | 0.1462 | 0.0173 | 0.1134 | 0.0242 |
| | | AUC | | 0.8538 | 0.9827 | 0.8866 | 0.9754 |
| 12.5 | 0.2846 | TER | 0.0396 | 0.2286 | 0.0679 | 0.1275 | 0.0328 |
| | | FNR | 0.0393 | 0.2247 | 0.0625 | 0.1297 | 0.0285 |
| | | FPR | 0.0139 | 0.1680 | 0.0214 | 0.1277 | 0.0381 |
| | | AUC | | 0.8320 | 0.9786 | 0.8723 | 0.9614 |
| 15 | 0.2927 | TER | 0.0536 | 0.2714 | 0.0924 | 0.1687 | 0.0484 |
| | | FNR | 0.0533 | 0.2637 | 0.0867 | 0.1687 | 0.0439 |
| | | FPR | 0.0254 | 0.2237 | 0.0358 | 0.1705 | 0.0535 |
| | | AUC | | 0.7763 | 0.9642 | 0.8292 | 0.9460 |
| 17.5 | 0.2764 | TER | 0.0691 | 0.2903 | 0.1098 | 0.1887 | 0.0635 |
| | | FNR | 0.0692 | 0.2867 | 0.1017 | 0.1889 | 0.0595 |
| | | FPR | 0.0323 | 0.2425 | 0.0549 | 0.1903 | 0.0686 |
| | | AUC | | 0.7575 | 0.9451 | 0.8097 | 0.9091 |
| 20 | 0.2846 | TER | 0.0924 | 0.3095 | 0.1258 | 0.2166 | 0.0835 |
| | | FNR | 0.0948 | 0.3068 | 0.1207 | 0.2212 | 0.0767 |
| | | FPR | 0.0402 | 0.2608 | 0.0594 | 0.2149 | 0.0906 |
| | | AUC | | 0.7391 | 0.9406 | 0.7851 | 0.9081 |
| 25 | 0.3984 | TER | 0.1334 | 0.3415 | 0.1947 | 0.2550 | 0.1377 |
| | | FNR | 0.1325 | 0.3344 | 0.1829 | 0.2582 | 0.1141 |
| | | FPR | 0.0800 | 0.2976 | 0.1320 | 0.2559 | 0.1454 |
| | | AUC | | 0.7024 | 0.8680 | 0.7441 | 0.8532 |
| 30 | 0.3659 | TER | 0.2073 | 0.3693 | 0.2522 | 0.3079 | 0.1989 |
| | | FNR | 0.2079 | 0.3700 | 0.2477 | 0.3156 | 0.1745 |
| | | FPR | 0.1518 | 0.3439 | 0.1930 | 0.3067 | 0.2099 |
| | | AUC | | 0.6561 | 0.8069 | 0.6933 | 0.7901 |
| 40 | 0.4309 | TER | 0.3681 | 0.4382 | 0.3884 | 0.4044 | 0.3551 |
| | | FNR | 0.3723 | 0.4376 | 0.3916 | 0.4087 | 0.3088 |
| | | FPR | 0.3254 | 0.4223 | 0.3546 | 0.4051 | 0.3639 |
| | | AUC | | 0.5777 | 0.6453 | 0.5949 | 0.6351 |
| 50 | 0.5203 | TER | 0.5111 | 0.5134 | 0.5194 | 0.5130 | 0.5116 |
| | | FNR | 0.5214 | 0.5120 | 0.5199 | 0.5161 | 0.5238 |
| | | FPR | 0.5208 | 0.5165 | 0.5170 | 0.5147 | 0.5137 |
| | | AUC | 0.4792 | 0.4834 | 0.4830 | 0.4853 | |
Reported values of classification performance are average validation results from a 5-fold cross-validation scheme repeated 100 times (per model, per mislabel proportion). MinFreq is the frequency of the minority class (low root vigor). The best-performing method (in terms of AUC) for each percentage of noisy labels is shown in italics.
KNN: K-nearest neighbours; LR: ridge logistic regression; RF: random forest; SVM-Lin: SVM with linear kernel; SVM-Rbf: SVM with radial basis function kernel
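As a rough illustration of the quantities in the table, the following Python sketch flips a chosen proportion of binary labels and computes TER, FNR and FPR from the resulting confusion counts. It is not the authors' code: the function names (`flip_labels`, `error_rates`), the coding of the positive class as 1, the uniform random choice of which labels to flip, and the toy data are all assumptions made for illustration, and the formulas are the standard confusion-matrix definitions, which may differ in detail from those used in the paper.

```python
import random


def flip_labels(labels, proportion, rng):
    """Return a copy of binary (0/1) labels with a proportion of them flipped.

    Assumption: observations to mislabel are drawn uniformly at random,
    regardless of class.
    """
    n_flip = round(len(labels) * proportion)
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def error_rates(y_true, y_pred):
    """Standard TER, FNR and FPR, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "TER": (fp + fn) / len(pairs),                # share of all errors
        "FNR": fn / (fn + tp) if fn + tp else 0.0,    # missed positives
        "FPR": fp / (fp + tn) if fp + tn else 0.0,    # false alarms
    }


# Toy data (hypothetical): 20 positive and 80 negative observations.
clean = [1] * 20 + [0] * 80
noisy = flip_labels(clean, 0.10, random.Random(0))

# Comparing noisy labels with clean ones: exactly 10% disagree.
rates = error_rates(clean, noisy)
n_changed = sum(c != n for c, n in zip(clean, noisy))
print(n_changed, rates)
```

Because flipping changes class membership, the frequency of each class in the noisy labels shifts with the mislabel proportion, which is one plausible reading of why MinFreq varies across rows of the table.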
Fig. 1. Average area under the ROC curve (AUC) for the five tested classification models as a function of increasing proportions of mislabeled observations. Averages are for validation values from 5-fold cross-validation repeated 100 times. KNN red line, RF green line, SVM-Lin blue line, LR black line, SVM-Rbf violet line
Fig. 2. Distribution of TPR (red) and TNR (blue) in the validation set. TPR and TNR as a function of mislabeled observations, from a 5-fold cross-validation repeated 100 times. Results are presented per method
Fig. 3. Average area under the ROC curve (AUC) using all 175 SNPs and subsets with the 50 and 30 most informative SNPs. Averages are for validation values from 5-fold cross-validation repeated 100 times. KNN red lines, RF green lines; all 175 SNPs solid lines, 50 most informative SNPs dashed lines, 30 most informative SNPs dotted lines
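The table and figures all report averages over a 5-fold cross-validation repeated 100 times. A minimal sketch of how such a resampling scheme could be generated is given below. The generator name `repeated_kfold`, the sample size of 20, and the fixed seed are hypothetical choices; whether the original analysis stratified folds by class is not stated in this record, so this sketch uses plain (unstratified) shuffling.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, valid_idx) for repeated k-fold CV.

    Each repeat reshuffles the sample indices, then deals them into
    n_folds validation sets; the remaining indices form the training set.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_folds):
            valid = order[fold::n_folds]          # every n_folds-th index
            valid_set = set(valid)
            train = [i for i in order if i not in valid_set]
            yield repeat, fold, train, valid


# Hypothetical sample size; 5 folds x 100 repeats gives 500 train/validation splits.
splits = list(repeated_kfold(n_samples=20, n_folds=5, n_repeats=100))
print(len(splits))
```

In use, a model would be fitted on `train` and scored on `valid` for each split, and the per-split validation metrics averaged for each model and mislabel proportion.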