| Literature DB >> 34202616 |
Maria-Elena Mihailescu, Darius Mihai, Mihai Carabas, Mikołaj Komisarek, Marek Pawlicki, Witold Hołubowicz, Rafał Kozik.
Abstract
Cybersecurity is an arms race in which defenders and adversaries attempt to outsmart one another, coming up with new attacks, new ways to defend against those attacks, and, in turn, new ways to circumvent those defences. This situation creates a constant need for novel, realistic cybersecurity datasets. This paper presents the effects of using machine-learning-based intrusion detection methods on network traffic coming from a real-life architecture. The main contribution of this work is a dataset coming from a real-world, academic network. Real-life traffic was collected and, after performing a series of attacks, a dataset was assembled. The dataset contains 44 network features and an unbalanced distribution of classes. In this work, the capability of the dataset for formulating machine-learning-based models was experimentally evaluated. To investigate the stability of the obtained models, cross-validation was performed, and an array of detection metrics was reported. The gathered dataset is part of an effort to bring security against novel cyberthreats and was completed in the SIMARGL project.
Keywords: dataset; machine learning; network intrusion detection
Year: 2021 PMID: 34202616 PMCID: PMC8272217 DOI: 10.3390/s21134319
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Distribution of network traffic in the CTU-13 dataset for each scenario.
| # | Normal Flows | Background Flows | C&C Flows | Botnet Flows |
|---|---|---|---|---|
| 1 | 30,387 | 2,753,290 | 1026 | 39,993 |
| 2 | 9120 | 1,778,061 | 2102 | 18,839 |
| 3 | 116,887 | 4,566,929 | 63 | 26,759 |
| 4 | 25,268 | 1,094,040 | 49 | 1719 |
| 5 | 4679 | 124,252 | 206 | 695 |
| 6 | 7494 | 546,795 | 199 | 4431 |
| 7 | 1677 | 112,337 | 26 | 37 |
| 8 | 72,822 | 2,875,282 | 1074 | 5052 |
| 9 | 43,340 | 2,525,565 | 5099 | 179,880 |
| 10 | 15,847 | 1,187,592 | 37 | 106,315 |
| 11 | 2718 | 96,369 | 3 | 8161 |
| 12 | 7628 | 315,675 | 25 | 2143 |
| 13 | 31,791 | 1,853,217 | 1202 | 38,791 |
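The class imbalance visible in the table above can be quantified directly from the flow counts. The following is a minimal sketch, using only the scenario 1 row; the helper name `class_shares` is hypothetical and introduced here for illustration.

```python
# Flow counts for CTU-13 scenario 1, copied from the table above.
scenario_1 = {
    "normal": 30_387,
    "background": 2_753_290,
    "c&c": 1_026,
    "botnet": 39_993,
}


def class_shares(counts):
    """Return each class's share of the total flow count as a fraction."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = class_shares(scenario_1)
for name, share in shares.items():
    # Print each share as a percentage with four decimal places.
    print(f"{name:>10}: {share:.4%}")
```

In this scenario background traffic accounts for well over 90% of the flows, while C&C flows make up a small fraction of a percent, which is the kind of skew a detection model trained on such data has to cope with.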
CTU-13 list of features.
| Feature Name | Description |
|---|---|
| StartTime | Flow start time |
| SrcAddr | Source IP address |
| Sport | Source port |
| DstAddr | Destination IP address |
| Dport | Destination port |
| Proto | Protocol |
| Dir | Direction of communication |
| Dur | Flow total duration |
| State | Protocol state |
| sTos | Source Type of Service |
| dTos | Destination Type of Service |
| TotPkts | Total number of packets exchanged |
| TotBytes | Total number of bytes exchanged |
| SrcBytes | Number of bytes from source |
| Label | Name of the attack type |
Representation of the traffic content of the IoT-23 dataset by executed attacks.
| Attack Name | Flows |
|---|---|
| Part-Of-A-Horizontal-PortScan | 213,852,924 |
| Okiru | 47,381,241 |
| Okiru-Attack | 13,609,479 |
| DDoS | 19,538,713 |
| C&C-Heart Beat | 33,673 |
| C&C | 21,995 |
| Attack | 9398 |
| C&C- | 888 |
| C&C-Heart Beat Attack | 883 |
| C&C-File download | 53 |
| C&C-Tori | 30 |
| File download | 18 |
| C&C-Heart Beat File Download | 11 |
| Part-Of-A-Horizontal-PortScan Attack | 5 |
| C&C-Mirai | 2 |
IoT-23 features.
| Feature Name | Description |
|---|---|
| fields-ts | Flow start time |
| uid | Unique ID |
| id.orig-h | Source IP address |
| id.orig-p | Source port |
| id.resp-h | Destination IP address |
| id.resp-p | Destination port |
| proto | Protocol |
| service | Type of Service (http, dns, etc.) |
| duration | Flow total duration |
| orig-bytes | Source—destination transaction bytes |
| resp-bytes | Destination—source transaction bytes |
| conn-state | Connection state |
| local-orig | Source local address |
| local-resp | Destination local address |
| resp-pkts | Destination packets |
| orig-ip-bytes | Flow of source bytes |
| history orig-pkts | History of source packets |
| missed-bytes | Missing bytes during transaction |
| tunnel-parents | Traffic tunnel |
| resp-ip-bytes | Flow of destination bytes |
| label | Name of the attack type |
Representation of the traffic content of the LITNET-2020 dataset by executed attacks.
| Attack Name | Number of Samples | Attacks |
|---|---|---|
| Packet Fragmentation attack | 1,244,866 | 477 |
| Scanning/spread | 6687 | 6232 |
| Reaper worm | 4,377,656 | 1176 |
| Spam bot’s detection | 1,153,020 | 747 |
| Code red worm | 5,082,952 | 1,255,702 |
| Blaster worm | 2,858,573 | 24,291 |
| LAND attack | 3,569,838 | 52,417 |
| HTTP-flood | 3,963,168 | 22,959 |
| TCP SYN-flood | 14,608,678 | 3,725,838 |
| UDP-flood | 606,814 | 59,479 |
| ICMP-flood | 3,863,655 | 11,628 |
| Smurf | 3,994,426 | 59,479 |
Figure 1. Network architecture for the “Reconnaissance and Denial-of-Service attacks”.
The list of the collected network features.
| Feature | Description |
|---|---|
| BIFLOW_DIRECTION | Determines who initiated the flow |
| DIRECTION | Determines the direction of flow |
| DST_TO_SRC_SECOND_BYTES | An indicator that determines the flow of bytes per second (dst to src) |
| FIREWALL_EVENT | Information flag from the firewall |
| FIRST_SWITCHED | Time of appearance of the first flow |
| FLOW_ACTIVE_TIMEOUT | Network traffic activity timeout |
| FLOW_DURATION_MICROSECONDS | Duration of flow expressed in microseconds |
| FLOW_DURATION_MILLISECONDS | Duration of flow expressed in milliseconds |
| FLOW_END_MILLISECONDS | Flow end timestamp expressed in milliseconds |
| FLOW_END_SEC | Flow end timestamp expressed in seconds |
| FLOW_ID | Unique ID |
| FLOW_INACTIVE_TIMEOUT | Inactivity time of the flow |
| FLOW_START_MILLISECONDS | Flow start timestamp expressed in milliseconds |
| FLOW_START_SEC | Flow start timestamp expressed in seconds |
| FRAME_LENGTH | Frame length |
| IN_BYTES | Number of incoming bytes |
| IN_PKTS | Number of incoming packets |
| IPV4_DST_ADDR | Destination IP V4 address |
| IPV4_SRC_ADDR | Source IP V4 address |
| L4_DST_PORT | Destination Port |
| L4_SRC_PORT | Source Port |
| LAST_SWITCHED | Time of the last packet |
| MAX_IP_PKT_LEN | The largest length of the observed packet |
| MIN_IP_PKT_LEN | The smallest length of the observed packet |
| OOORDER_IN_PKTS | Number of incoming packets that were out of order |
| OOORDER_OUT_PKTS | Number of outgoing packets that were out of order |
| OUT_BYTES | Outgoing bytes |
| OUT_PKTS | Outgoing packets |
| PROTOCOL | Protocol flag |
| PROTOCOL_MAP | Name of protocol |
| RETRANSMITTED_IN_BYTES | Number of incoming bytes repeated |
| RETRANSMITTED_IN_PKTS | Number of incoming packets repeated |
| RETRANSMITTED_OUT_BYTES | Number of outgoing bytes repeated |
| RETRANSMITTED_OUT_PKTS | Number of outgoing packets repeated |
| SRC_TO_DST_SECOND_BYTES | An indicator that determines the flow of bytes per second (src to dst) |
| TCP_WIN_MAX_IN | Maximum incoming TCP window size |
| TCP_WIN_MAX_OUT | Maximum outgoing TCP window size |
| TCP_WIN_MIN_IN | Minimum incoming TCP window size |
| TCP_WIN_MIN_OUT | Minimum outgoing TCP window size |
| TCP_WIN_MSS_IN | Incoming TCP segment size |
| TCP_WIN_MSS_OUT | Outgoing TCP segment size |
| TCP_WIN_SCALE_IN | Incoming TCP scale size |
| SRC_TOS | Sets the service type byte on entry to the incoming interface |
| L7_PROTO_NAME | Name of the layer 7 protocol |
| TOTAL_FLOWS_EXP | Total number of exported flows |
Figure 2. Architecture: Detection Engine.
Figure 3. Process of preparing the collected data.
Figure 4. Distribution plots of the 15 most important features.
Figure 5. Result of feature selection.
Summary of the results for the deep neural network.
| # | Accuracy | Precision | Recall | F1 | BACC | MCC | AUC_ROC |
|---|---|---|---|---|---|---|---|
| 1 | 0.99 | 0.99 | 0.99 | 0.99 | 0.9904 | 0.9873 | 0.9987 |
| 2 | 0.99 | 0.99 | 0.99 | 0.99 | 0.9935 | 0.9914 | 0.9989 |
| 3 | 0.99 | 0.99 | 0.99 | 0.99 | 0.9930 | 0.9908 | 0.9966 |
| 4 | 0.99 | 0.99 | 0.99 | 0.99 | 0.9932 | 0.9910 | 0.9986 |
| 5 | 1 | 1 | 1 | 1 | 0.9955 | 0.9910 | 0.9983 |
| 6 | 0.99 | 0.99 | 0.99 | 0.99 | 0.9949 | 0.9932 | 0.9987 |
| 7 | 1 | 1 | 1 | 1 | 0.9953 | 0.9938 | 0.9982 |
| 8 | 1 | 1 | 1 | 1 | 0.9969 | 0.9959 | 0.9985 |
| 9 | 0.99 | 0.99 | 0.99 | 0.99 | 0.9906 | 0.9874 | 0.9975 |
| 10 | 0.99 | 0.99 | 0.99 | 0.99 | 0.9930 | 0.9907 | 0.9985 |
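For readers less familiar with the metric columns, the sketch below shows how accuracy, precision, recall, F1, balanced accuracy (BACC) and the Matthews correlation coefficient (MCC) follow from a confusion matrix. It assumes a binary problem for simplicity, and the counts are made up for illustration. The record does not state how the reported scores were averaged across classes, so this is not a reproduction of the paper's computation.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common detection metrics from binary confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    the model predicts both classes at least once).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    bacc = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "bacc": bacc,
        "mcc": mcc,
    }


# Hypothetical counts: 95 attack flows and 905 benign flows.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=895)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

On unbalanced data, BACC and MCC are usually more informative than plain accuracy, because a model that labels everything as the majority class can still score a high accuracy.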
Summary of the results for the Random Forest Classifier.
| # | Accuracy | Precision | Recall | F1 | BACC | MCC | AUC_ROC |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 0.9999 | 0.9999 | 0.9999 |
| 2 | 1 | 1 | 1 | 1 | 0.9999 | 0.9999 | 0.9999 |
| 3 | 1 | 1 | 1 | 1 | 0.9999 | 0.9999 | 0.9999 |
| 4 | 1 | 1 | 1 | 1 | 0.9999 | 0.9999 | 0.9999 |
| 5 | 0.85 | 0.91 | 0.85 | 0.84 | 0.8537 | 0.8290 | 0.9025 |
| 6 | 1 | 1 | 1 | 1 | 0.9999 | 0.9999 | 0.9999 |
| 7 | 0.87 | 0.91 | 0.87 | 0.86 | 0.8709 | 0.8469 | 0.9139 |
| 8 | 1 | 1 | 1 | 1 | 0.9999 | 0.9999 | 0.9997 |
| 9 | 0.85 | 0.91 | 0.85 | 0.84 | 0.8524 | 0.8276 | 0.9016 |
| 10 | 0.78 | 0.88 | 0.78 | 0.72 | 0.7772 | 0.7546 | 0.8514 |
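One simple way to read the per-fold rows as a stability check is to summarise each model's accuracy column by its mean and standard deviation. The sketch below does this for the deep neural network and Random Forest tables above; the values are copied from the accuracy columns, and the variable names are hypothetical.

```python
import statistics

# Per-fold accuracy, copied from the two result tables above.
dnn_accuracy = [0.99, 0.99, 0.99, 0.99, 1, 0.99, 1, 1, 0.99, 0.99]
rf_accuracy = [1, 1, 1, 1, 0.85, 1, 0.87, 1, 0.85, 0.78]

summary = {}
for model, scores in {"DNN": dnn_accuracy, "Random Forest": rf_accuracy}.items():
    summary[model] = {
        "mean": statistics.mean(scores),
        # Sample standard deviation across the ten folds.
        "stdev": statistics.stdev(scores),
    }
    print(f"{model:>13}: mean={summary[model]['mean']:.3f} "
          f"stdev={summary[model]['stdev']:.3f}")
```

On these numbers the Random Forest has a lower mean accuracy and a much larger spread across folds than the deep neural network, driven by folds 5, 7, 9 and 10.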
Summary of the results for the Gradient Boosting Classifier.
| # | Accuracy | Precision | Recall | F1 | BACC | MCC | AUC_ROC |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 0.9991 | 0.9988 | 0.9991 |
| 2 | 1 | 1 | 1 | 1 | 0.9990 | 0.9987 | 0.9993 |
| 3 | 1 | 1 | 1 | 1 | 0.9993 | 0.9991 | 0.9993 |
| 4 | 1 | 0.99 | 1 | 1 | 0.9985 | 0.9980 | 0.9993 |
| 5 | 0.97 | 0.97 | 0.97 | 0.97 | 0.9660 | 0.9560 | 0.9992 |
| 6 | 0.98 | 0.99 | 0.98 | 0.98 | 0.9849 | 0.9801 | 0.9993 |
| 7 | 0.99 | 0.99 | 0.99 | 0.99 | 0.9927 | 0.9904 | 0.9994 |
| 8 | 1 | 1 | 1 | 1 | 0.9978 | 0.9971 | 0.9995 |
| 9 | 0.95 | 0.96 | 0.95 | 0.95 | 0.9520 | 0.9388 | 0.9991 |
| 10 | 0.98 | 0.98 | 0.98 | 0.98 | 0.9841 | 0.9791 | 0.9993 |
Summary of the results for the AdaBoost Classifier.
| # | Accuracy | Precision | Recall | F1 | BACC | MCC | AUC_ROC |
|---|---|---|---|---|---|---|---|
| 1 | 0.52 | 0.42 | 0.53 | 0.46 | 0.5214 | 0.3919 | 0.7092 |
| 2 | 0.53 | 0.41 | 0.52 | 0.45 | 0.5313 | 0.4051 | 0.7093 |
| 3 | 0.53 | 0.54 | 0.52 | 0.69 | 0.5332 | 0.4070 | 0.7093 |
| 4 | 0.55 | 0.42 | 0.53 | 0.46 | 0.5459 | 0.4239 | 0.7095 |
| 5 | 0.54 | 0.43 | 0.54 | 0.47 | 0.5313 | 0.4051 | 0.7096 |
| 6 | 0.53 | 0.42 | 0.53 | 0.46 | 0.5332 | 0.4070 | 0.7092 |
| 7 | 0.55 | 0.43 | 0.55 | 0.48 | 0.5459 | 0.4239 | 0.7097 |
| 8 | 0.54 | 0.43 | 0.54 | 0.47 | 0.5423 | 0.4190 | 0.7095 |
| 9 | 0.52 | 0.40 | 0.52 | 0.44 | 0.5244 | 0.3965 | 0.7093 |
| 10 | 0.78 | 0.88 | 0.78 | 0.72 | 0.7762 | 0.7535 | 0.7091 |
Comparison of the models used and their prediction results on test data with SMOTE.
| Model | Accuracy | Precision | Recall | F1 | BACC | MCC | AUC_ROC |
|---|---|---|---|---|---|---|---|
| Random Forest | 1.00 | 1.00 | 1.00 | 1.00 | 0.9998 | 0.9996 | 0.9998 |
| AdaBoost | 0.56 | 0.43 | 0.56 | 0.49 | 0.5642 | 0.4456 | 0.7095 |
| GBT | 1.00 | 1.00 | 1.00 | 1.00 | 0.9987 | 0.9980 | 0.9991 |
| DNN | 1.00 | 0.99 | 1.00 | 0.99 | 0.9975 | 0.9936 | 0.9981 |
Comparison of the models used and their prediction results on test data without SMOTE.
| Model | Accuracy | Precision | Recall | F1 | BACC | MCC | AUC_ROC |
|---|---|---|---|---|---|---|---|
| Random Forest | 0.99 | 0.99 | 0.99 | 0.99 | 0.9904 | 0.9873 | 0.9987 |
| AdaBoost | 0.54 | 0.43 | 0.54 | 0.47 | 0.5423 | 0.4190 | 0.7095 |
| GBT | 1.00 | 1.00 | 1.00 | 1.00 | 0.9979 | 0.9977 | 0.9990 |
| DNN | 1.00 | 0.99 | 1.00 | 0.99 | 0.9975 | 0.9936 | 0.9981 |
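The two tables above compare results with and without SMOTE (Synthetic Minority Over-sampling Technique). The record does not say which implementation or parameters were used, so the following is only a minimal pure-Python sketch of the core idea: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy data are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Assumes `minority` holds at least two points, each a tuple of floats.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features per sample.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
for point in new_points:
    print(point)
```

Because each synthetic point is an interpolation between two real minority samples, it stays inside the region already covered by the minority class. Oversampling of this kind is normally applied to the training split only, so that the test data keeps its original class distribution.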
Statistical analysis of the classifiers by model accuracy, based on the paired Wilcoxon test at a significance level of 0.05.
| Classifiers | p-Value | Z-Value | W-Value | Comparison |
|---|---|---|---|---|
| RandomForest/GBT | 0.0151 | −2.4303 | 21 | Significant |
| RandomForest/AdaBoost | 0.00001 | −4.7821 | 0 | Significant |
| RandomForest/DNN | 0.31732 | −1.0032 | 136 | Not significant |
| AdaBoost/DNN | 0.00001 | −4.7821 | 0 | Significant |
| AdaBoost/GBT | 0.00001 | −4.7821 | 0 | Significant |
| DNN/GBT | 0.71884 | −0.362 | 61 | Not significant |
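The z-values and the accompanying p-values in the rows above are related through the normal approximation used by the Wilcoxon signed-rank test. As a hedged illustration, the sketch below converts a z-value into a two-sided p-value with the standard normal distribution and compares it with the 0.05 threshold; the helper name `two_sided_p_from_z` is hypothetical, and the z-values are copied from three of the rows.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value of a standard normal test statistic z."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


# z-values copied from the table above.
z_values = {
    "RandomForest/GBT": -2.4303,
    "RandomForest/DNN": -1.0032,
    "DNN/GBT": -0.362,
}

p_values = {pair: two_sided_p_from_z(z) for pair, z in z_values.items()}
for pair, p in p_values.items():
    verdict = "significant" if p <= 0.05 else "not significant"
    print(f"{pair:>17}: p={p:.4f} ({verdict})")
```

The p-values recovered this way agree with the tabulated ones to within a few thousandths, and they lead to the same significant or not significant verdicts at the 0.05 level.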