Dantong Wang, Simon Fong, Raymond K Wong, Sabah Mohammed, Jinan Fiaidhi, Kelvin K L Wong.
Abstract
Outlier detection in bioinformatics data stream mining has received significant attention from research communities in recent years. Two problems pose a dilemma: how to distinguish noise from a genuine exception, and whether to discard such a value or to devise an extra decision path to accommodate it. In this paper, we propose a novel algorithm called ODR with incrementally Optimized Very Fast Decision Tree (ODR-ioVFDT) for handling outliers during continuous data learning. An adaptive interquartile-range-based identification method sets a tolerance threshold, which is then used to judge whether a data point of exceptional value should be included in training. This differs from traditional approaches, in which outlier detection and removal are two separate passes over the data. The proposed algorithm is tested on datasets from five bioinformatics scenarios, comparing the performance of our model against counterparts without ODR. The results show that ODR-ioVFDT performs better in classification accuracy, kappa statistics, and time consumption. Applied to bioinformatics streaming data for detecting and quantifying information about life phenomena, states, characters, variables and components of an organism, ODR-ioVFDT can help diagnose and treat disease more effectively.
Year: 2017 PMID: 28230161 PMCID: PMC5322330 DOI: 10.1038/srep43167
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1 ODR-A-ioVFDT Model.
Figure 2 ODR Processing Model.
Figure 3 ODR-R-ioVFDT Model.
Figure 4 ODR-ioVFDT Model.
The time-series data is loaded through a sliding window. Outliers are picked out by the ODR model and collected into the misclassified database, while clean data is passed to the ioVFDT classifier for decision-tree building. The prediction error is calculated to evaluate classifier efficiency. When performance falls below expectation, feedback is sent to both the outlier and classifier models, and a model update is triggered.
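The sliding-window loading step can be sketched as follows; the function name and parameters are illustrative, not taken from the paper's implementation:

```python
def sliding_windows(series, window_len, step):
    """Yield successive windows over a finite series.

    In the streaming model described above, each window would first be
    handed to the ODR filter, and only the clean portion forwarded to
    the ioVFDT learner.
    """
    for start in range(0, len(series) - window_len + 1, step):
        yield series[start:start + window_len]
```

For example, a series of 10 points with `window_len=4` and `step=2` produces four overlapping windows starting at indices 0, 2, 4 and 6.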
Input & Output Parameters.
| Input: | |
| | A set of raw data stream |
| | The number of attributes the dataset contains |
| | Sliding window length |
| | The window moving (step) length |
| | The restriction on the accuracy value |
| | The heuristic estimation function |
| | The practical solution for tree building |
| Output: | |
| | Outlier dataset detected from dataset R |
| | A decision tree |
| | The accuracy of the tree learning process |
The Procedure of ODR.
| 1: | The continuous dataset |
| 2: | |
| 3: | |
| 4: | Compute |
| | |
| 5: | Detect outlier |
| 6: | //based on function (3) & (4) |
| 7: | Collect |
| 8: | Remove |
| 9: | Call function ioVFDT( |
| 10: | //Send
| 11: | |
| 12: | |
| 13: | Call function iOVFDT( |
| 14: | |
| 15: | |
| 16: | |
| 17: | End |
| 18: | End |
| 19: | End |
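Because the symbols in the extracted pseudocode were lost, a minimal sketch of the interquartile-range test it describes may help. The name `odr_split` and the default `beta=1.5` (the classic Tukey fence) are illustrative assumptions, not the paper's adaptive threshold:

```python
def odr_split(window, beta=1.5):
    """Split a window of numeric values into (clean, outliers).

    Values outside [Q1 - beta*IQR, Q3 + beta*IQR] are flagged as
    outliers; beta plays the role of the tolerance threshold. Quartiles
    are taken by simple index selection here; production code would
    interpolate.
    """
    data = sorted(window)
    n = len(data)
    q1 = data[n // 4]
    q3 = data[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - beta * iqr, q3 + beta * iqr
    clean = [x for x in window if lo <= x <= hi]
    outliers = [x for x in window if x < lo or x > hi]
    return clean, outliers
```

For instance, `odr_split([1, 2, 3, 4, 5, 100])` flags 100 as an outlier and keeps the rest as clean data for the classifier.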
ioVFDT Building.
| 1: | Let iOVFDT be a tree with a single leaf (root) |
| 2: | |
| 3: | Sort example into leaf using iOVFDT |
| 4: | |
| 5: | Compute |
| 6: | Let |
| 7: | Let |
| 8: | Compute Hoeffding bounds |
| 9: | |
| 10: | If |
| 11: | |
| 12: | Add a new leaf with initialized sufficient statistics |
| 13: | End |
| 14: | |
| 15: | End |
| 16: | End |
| 17: | Compute |
| 18: | |
| 19: | Return |
| 20: | End |
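The Hoeffding bound computed in step 8 is the standard VFDT-family quantity ε = sqrt(R² ln(1/δ) / 2n). The sketch below (function names are assumptions, not the authors' code) shows how it gates a leaf-split decision:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability at
    least 1 - delta, the true mean of a variable with range R lies
    within epsilon of the mean observed over n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, value_range, delta, n):
    """Split a leaf when the gap between the best and second-best
    attribute's heuristic gain exceeds the Hoeffding bound, so the
    best attribute is the true best with probability >= 1 - delta."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)
```

With R = 1, δ = 1e-7 and n = 1000 examples at a leaf, ε ≈ 0.09, so a gain gap of 0.2 triggers a split while a gap of 0.01 does not.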
Dataset Description.
| Name | Abbreviation | Sample size | No. of attributes | No. of classes |
|---|---|---|---|---|
| sEMG for Basic Hand Movements | EM | 1,800 | 3,000 | 6 |
| Lymphoblastic leukemia | LL | 1,962 | 12,559 | 6 |
| Thyroid Disease Databases | TD | 7,200 | 21 | 3 |
| EEG Eye State | EE | 14,980 | 15 | 2 |
| Diabetes data set | DD | 100,000 | 9 | 2 |
Outliers Collected in the Misclassified Database.
| Dataset | | | | | |
|---|---|---|---|---|---|
| EM | 303 | 114 | 58 | 33 | 16 |
| LL | 294 | 210 | 138 | 84 | 48 |
| TD | 850 | 279 | 129 | 53 | 10 |
| EE | 999 | 499 | 272 | 170 | 55 |
| DD | 3030 | 3030 | 3030 | 3030 | 3030 |
Figure 5 Classification Accuracy & Kappa Statistics for Five Algorithms.
ODR Preprocessing for Five Datasets by Five Algorithms.
[Table (a): classification accuracy (%) of KNN, NB, SVM and ioVFDT on the EM, LL, TD, EE and DD datasets; most numeric cells were lost or garbled in extraction, so only the algorithm and dataset labels are recoverable.]
Figure 6 ODR-ioVFDT classification accuracy as the outlier threshold β changes across the five datasets.
Kappa statistics.
| EM | LL | TD | EE | DD | |
|---|---|---|---|---|---|
| ODR-ioVFDT | 37.9 | 4.3 | 65.65 | 82.7 | 100 |
| ioVFDT | 35.81 | 3.7 | 64.57 | 80.98 | 100 |
| KNN | 22.1 | 1.55 | 14.72 | 65.51 | 69.19 |
| NB | 25.5 | 0 | 43.33 | 0 | 7.44 |
| SVM | 25.9 | 55.83 | 0 | 98.21 | 2.39 |
Reference scale for interpreting kappa:
| Reference | Remarks |
|---|---|
| 0.0~20.0% | Slight |
| 21.0~40.0% | Fair |
| 41.0~60.0% | Moderate |
| 61.0~80.0% | Substantial |
| 81.0~100.0% | Almost perfect |
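For reference, the kappa statistic reported above is Cohen's kappa, which can be computed from a confusion matrix as follows (a generic sketch, not the authors' code):

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual
    class, columns = predicted class): (p_o - p_e) / (1 - p_e), where
    p_o is the observed agreement and p_e the chance agreement implied
    by the row and column marginals."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(k)) / total
    p_e = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(k)
    ) / (total * total)
    return (p_o - p_e) / (1 - p_e)
```

For a balanced binary matrix `[[45, 5], [5, 45]]` this gives kappa = 0.8, i.e. the boundary between "substantial" and "almost perfect" on the scale above.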
Time Elapsed (s).
| EM | LL | TD | EE | DD | |
|---|---|---|---|---|---|
| ODR-ioVFDT | 7.1 | 44.28 | 1.4 | 0.45 | 2.46 |
| ioVFDT | 7.63 | 51.4 | 1.42 | 0.51 | 3.63 |
| KNN | 294.02 | 1672.48 | 3.13 | 7.55 | 19.63 |
| NB | 4.69 | 20.1 | 0.11 | 0.17 | 1.95 |
| SVM | 2.34 | 7.53 | 0.06 | 0.11 | 1.73 |
Figure 7 Time consumption and tree size comparison between ODR-ioVFDT and ioVFDT.
Tree Size for Five Datasets.
| EM | LL | TD | EE | DD | |
|---|---|---|---|---|---|
| ODR-ioVFDT | 76 | 135 | 8 | 4 | 48 |
| ioVFDT | 83 | 156 | 9 | 5 | 44 |