Rajesh Kumar, Geetha Subbiah.
Abstract
Software products from all vendors contain vulnerabilities that raise security concerns, and malware is the prime tool used to exploit them. Machine learning (ML) methods are efficient at detecting malware and represent the state of the art. The effectiveness of ML models can be further improved by reducing false negatives and false positives. In this paper, the performance of bagging and boosting machine learning models is enhanced by reducing misclassification. Shapley values faithfully represent the amount each feature contributes and help identify the top features behind any prediction of an ML model. The Shapley values are transformed to a probability scale so that they correlate with the prediction value of the ML model and reveal the top features for any prediction made by a trained model. Trends in the top features derived from false-negative and false-positive predictions of a trained ML model can then be used to formulate inductive rules. In this work, the best-performing bagging and boosting model is determined by accuracy and confusion matrix on three malware datasets from three different periods. The best-performing ML model is used to build effective inductive rules using waterfall plots based on the probability scale of features. This work helps improve cyber security by the effective detection of false-negative zero-day malware.
Keywords: Shapley value; artificial intelligence; bagging; boosting; computer security; cyber security; machine learning; zero-day malware detection; zero-day vulnerability
Year: 2022 PMID: 35408413 PMCID: PMC9002855 DOI: 10.3390/s22072798
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
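The paper's central device is rescaling SHAP values from the model's log-odds output to a probability scale so that per-feature contributions read directly in probability units. The exact transform the authors use is not spelled out in this record; a minimal sketch, assuming the probability change is allocated across features in proportion to their log-odds contributions, is:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function mapping log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-z))

def shap_to_probability_scale(base_value, shap_values):
    """Rescale per-feature SHAP contributions (log-odds units) so that
    they sum to the change in predicted probability over the base rate.

    base_value  -- explainer's expected value, in log-odds
    shap_values -- per-feature SHAP contributions, in log-odds
    """
    logit_total = sum(shap_values)
    p_pred = sigmoid(base_value + logit_total)  # model's predicted probability
    p_base = sigmoid(base_value)                # background probability
    delta_p = p_pred - p_base                   # total probability change to allocate
    if logit_total == 0:
        return [0.0 for _ in shap_values]
    # Allocate delta_p proportionally to each feature's log-odds share,
    # preserving the sign of each contribution.
    return [v / logit_total * delta_p for v in shap_values]
```

By construction the rescaled contributions sum exactly to the difference between the predicted probability and the base rate, which is what lets a waterfall plot over them read in probability units.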
Top ten software producers with vulnerabilities.
| Sl. No. | Vendor Name | Count of Products | Count of Vulnerabilities | Vulnerabilities per Product |
|---|---|---|---|---|
| 1 | Cisco | 5623 | 4159 | 1 |
| 2 | IBM | 1335 | 5378 | 4 |
| 3 | Oracle | 971 | 8270 | 9 |
| 4 | Microsoft | 665 | 8391 | 13 |
| 5 | Redhat | 430 | 4058 | 9 |
| 6 | Apple | 140 | 5467 | 39 |
| 7 | | 128 | 6916 | 54 |
| 8 | Debian | 109 | 6022 | 55 |
| 9 | Canonical | 49 | 3180 | 65 |
| 10 | Fedora Project | 21 | 2885 | 137 |
Top operating systems with vulnerabilities.
| Sl. No. | Product Name | Vendor Name | Count of Vulnerabilities |
|---|---|---|---|
| 1 | Debian Linux | Debian | 5572 |
| 2 | Android | | 3875 |
| 3 | Ubuntu Linux | Canonical | 3036 |
| 4 | Mac OS X | Apple | 2911 |
| 5 | Linux Kernel | Linux | 2722 |
| 6 | Fedora | Fedora Project | 2538 |
| 7 | iPhone OS | Apple | 2522 |
| 8 | Windows 10 | Microsoft | 2459 |
| 9 | Windows Server 2016 | Microsoft | 2233 |
| 10 | Windows 7 | Microsoft | 1954 |
Figure 1. File header of a sample with details of the components in the PE header.
Figure 2. Derivation of the D1, D2, and D3 datasets from the malware dataset.
Figure 3. Block diagram of the method.
Details of D1, D2, D3 datasets.
| Sl. No. | Short Name | Counts | Label | Period |
|---|---|---|---|---|
| 1 | D1 | 28,606 | Unidentified | January 2017 |
| 2 | D1 | 17,180 | Benign | January 2017 |
| 3 | D1 | 32,761 | Malware | January 2017 |
| 4 | D2 | 31,394 | Unidentified | February 2017 |
| 5 | D2 | 32,820 | Benign | February 2017 |
| 6 | D2 | 27,239 | Malware | February 2017 |
| 7 | D3 | 15,656 | Unidentified | March 2017 |
| 8 | D3 | 25,261 | Benign | March 2017 |
| 9 | D3 | 12,692 | Malware | March 2017 |
Feature names in the dataset used.
| Description | Count | Feature Names |
|---|---|---|
| Histogram | 256 | H1–H256 |
| Byte entropy | 256 | BEn1–BEn256 |
| String extractor | 104 | Str1–Str104 |
| General file info | 10 | size, vsize, has_debug, exports, imports, has_relocations, has_resources, has_signature, has_tls, symbols |
| Header file info | 62 | Timestamp (1); Machine1–Machine10 (10, H/W type hashed); C_char1–C_char10 (10, characteristics hashed); subsystem1–subsystem10 (10, subsystems hashed); dll_c1–dll_c10 (10, DLL characteristics hashed); magic1–magic10 (10, magic); major_i_ver, minor_i_ver (1 each, image version); major_linker_ver, minor_linker_ver (1 each, linker version); major_os_ver, minor_os_ver (1 each, OS version); major_ss_ver, minor_ss_ver (1 each, subsystem version); sizeof_code (1, code size) |
| Section info | 255 | num_of_sec (1, number of sections); num_of_sec_morethan0 (1, number of sections with size > 0); num_sec_noname (1, count of sections without a name); RX_sec_num (1, count of sections with read and execute permission); W_sec_num (1, number of sections with write permission); sect_size_1–sect_size_50 (50, section size hashed); sect_entropy_1–sect_entropy_50 (50, section entropy hashed); sect_vsize1–sect_vsize50 (50, section vsize hashed); entry_name1–entry_name50 (50, section name hashed); sect_char1–sect_char50 (50, section characteristics hashed) |
| Imports | 1280 | Imp1–Imp1280 |
| Exports info | 128 | exp1–exp128 |
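The histogram block (H1–H256) in the table above is the distribution of byte values over the raw file. A minimal sketch in pure Python, assuming normalization by file length (the paper does not state its normalization choice):

```python
def byte_histogram(data: bytes) -> list:
    """Normalized frequency of each byte value 0..255 (features H1-H256)."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    n = len(data) or 1  # avoid division by zero on empty input
    return [c / n for c in counts]
```

Each of the 256 entries then lands in [0, 1] and the vector sums to one, which keeps the feature independent of file size.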
Results of the XGBoost ML model.
| Dataset | Accuracy | TP | FP | FN | TN | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|---|---|---|---|
| | 98.501 | 5528 | 142 | 105 | 10,706 | 0.99 | 0.99 | 0.99 | 10,811 |
| | 97.875 | 31,906 | 914 | 362 | 26,877 | 0.97 | 0.99 | 0.98 | 27,239 |
| | 97.507 | 24,455 | 806 | 140 | 12,552 | 0.94 | 0.99 | 0.96 | 12,692 |
| | 98.703 | 10,686 | 100 | 157 | 8877 | 0.99 | 0.98 | 0.99 | 9034 |
| | 98.081 | 16,926 | 254 | 704 | 32,057 | 0.99 | 0.98 | 0.99 | 32,761 |
| | 98.492 | 24,936 | 325 | 247 | 12,445 | 0.97 | 0.98 | 0.98 | 12,692 |
Results of the LightGBM ML model.
| Dataset | Accuracy | TP | FP | FN | TN | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|---|---|---|---|
| | 98.483 | 5520 | 150 | 100 | 10,711 | 0.99 | 0.99 | 0.99 | 10,811 |
| | 97.640 | 31,831 | 989 | 428 | 26,811 | 0.96 | 0.98 | 0.97 | 27,239 |
| | 97.201 | 24,341 | 920 | 142 | 12,550 | 0.93 | 0.99 | 0.96 | 12,692 |
| | 98.526 | 10,658 | 128 | 164 | 8870 | 0.99 | 0.98 | 0.98 | 9034 |
| | 97.865 | 16,876 | 304 | 762 | 31,999 | 0.99 | 0.98 | 0.98 | 32,761 |
| | 98.224 | 24,847 | 414 | 260 | 12,432 | 0.97 | 0.98 | 0.97 | 12,692 |
Results of the Random Forest ML model.
| Dataset | Accuracy | TP | FP | FN | TN | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|---|---|---|---|
| | 97.761 | 5484 | 186 | 183 | 10,628 | 0.98 | 0.98 | 0.98 | 10,811 |
| | 96.728 | 31,537 | 1283 | 682 | 26,557 | 0.95 | 0.97 | 0.96 | 27,239 |
| | 96.134 | 24,033 | 1228 | 239 | 12,453 | 0.91 | 0.98 | 0.94 | 12,692 |
| | 97.815 | 10,642 | 144 | 289 | 8745 | 0.98 | 0.97 | 0.98 | 9034 |
| | 96.776 | 16,869 | 311 | 1299 | 31,462 | 0.99 | 0.96 | 0.98 | 32,761 |
| | 97.660 | 24,778 | 483 | 405 | 12,287 | 0.96 | 0.97 | 0.97 | 12,692 |
Results of the Extratree ML model.
| Dataset | Accuracy | TP | FP | FN | TN | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|---|---|---|---|
| | 97.967 | 5504 | 166 | 169 | 10,642 | 0.98 | 0.98 | 0.98 | 10,811 |
| | 97.352 | 31,798 | 1022 | 568 | 26,671 | 0.96 | 0.98 | 0.97 | 27,239 |
| | 96.732 | 24,202 | 1059 | 181 | 12,511 | 0.92 | 0.99 | 0.95 | 12,692 |
| | 98.219 | 10,680 | 106 | 247 | 8787 | 0.99 | 0.97 | 0.98 | 9034 |
| | 97.138 | 16,933 | 247 | 1182 | 31,579 | 0.99 | 0.96 | 0.98 | 32,761 |
| | 98.232 | 24,920 | 341 | 330 | 12,362 | 0.97 | 0.97 | 0.97 | 12,692 |
Comparison of accuracy for LightGBM, XGBoost, Random Forest, and Extratree.
| Dataset | LightGBM | XGBoost | RF | Extratree |
|---|---|---|---|---|
| | 98.483 | 98.501 | 97.761 | 97.967 |
| | 97.640 | 97.875 | 96.728 | 97.352 |
| | 97.201 | 97.507 | 96.134 | 96.732 |
| | 98.526 | 98.703 | 97.815 | 98.219 |
| | 97.865 | 98.081 | 96.776 | 97.138 |
| | 98.224 | 98.492 | 97.660 | 98.232 |
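The accuracy figures in the tables above follow directly from the confusion-matrix counts. A small helper with the standard definitions (the paper's precision/recall columns appear to be computed per class, so only accuracy is checked against a table row here):

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int):
    """Accuracy (percent), precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = 100.0 * (tp + tn) / total        # percent correct
    precision = tp / (tp + fp)                  # of predicted positives, how many real
    recall = tp / (tp + fn)                     # of real positives, how many found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

Plugging in the first XGBoost row (TP = 5528, FP = 142, FN = 105, TN = 10,706) reproduces the reported 98.501% accuracy.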
Comparison of false positives and false negatives for LightGBM, XGBoost, Random Forest, and Extratree.
| Dataset | FP (LG) | FP (XG) | FP (RF) | FP (ET) | FN (LG) | FN (XG) | FN (RF) | FN (ET) |
|---|---|---|---|---|---|---|---|---|
| | 150 | 142 | 186 | 166 | 100 | 105 | 183 | 169 |
| | 989 | 914 | 1283 | 1022 | 428 | 362 | 682 | 568 |
| | 920 | 806 | 1228 | 1059 | 142 | 140 | 239 | 181 |
| | 128 | 100 | 144 | 106 | 164 | 157 | 289 | 247 |
| | 304 | 254 | 311 | 247 | 762 | 704 | 1299 | 1182 |
| | 414 | 325 | 483 | 341 | 260 | 247 | 405 | 330 |
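The table above supports selecting the model with the fewest total misclassifications. A one-line sketch (the model keys and the example counts are taken from the first row of the table):

```python
def fewest_misclassifications(results: dict) -> str:
    """Return the model whose FP + FN total is smallest.
    `results` maps model name -> (false_positives, false_negatives)."""
    return min(results, key=lambda m: sum(results[m]))

# First row of the FP/FN comparison table above
row1 = {"LG": (150, 100), "XG": (142, 105), "RF": (186, 183), "ET": (166, 169)}
```

On this row XGBoost wins (142 + 105 = 247 versus 250 for LightGBM), consistent with the accuracy comparison.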
Figure 4. Bar plots of an FP sample (a), an FN sample (b), a TP sample (c), and a TN sample (d) from the D2 dataset in SHAP values.
Figure 5. Bar plots of an FP sample (a), an FN sample (b), a TP sample (c), and a TN sample (d) from the D2 dataset on the probability scale.
Figure 6. Waterfall plots of an FP sample (a), an FN sample (b), a TP sample (c), and a TN sample (d) from the D2 dataset in SHAP values.
Figure 7. Waterfall plots of an FP sample (a), an FN sample (b), a TP sample (c), and a TN sample (d) from the D2 dataset on the probability scale.
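The waterfall plots rank features by contribution; extracting the top-k features from the probability-scale contributions is the step that feeds the inductive rules. A minimal sketch (the feature names and values below are illustrative, drawn from the dataset's naming scheme and the FP column of the comparison table):

```python
def top_features(names, contribs, k=3):
    """Return the k features with the largest absolute probability-scale
    contribution, preserving sign so the rule direction stays visible."""
    ranked = sorted(zip(names, contribs), key=lambda nc: abs(nc[1]), reverse=True)
    return ranked[:k]

# Illustrative probability-scale contributions for one sample
names = ["C_char1", "H33", "Str43", "Imp321"]
contribs = [0.16, 0.15, -0.13, 0.01]
```

Repeating this over all FP and FN samples and tallying which features recur at the top is what yields the rule candidates described in the abstract.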
Comparison of features and their probability-scale values in all categories predicted by the XGBoost model. Bold marks the topmost feature.
| Sl. No. | Feature | False Positive (FP) | False Negative (FN) | True Positive (TP) | True Negative (TN) |
|---|---|---|---|---|---|
| 1 | Imp321 | | N | N | N |
| 2 | C_char1 | P, 0.16 | P, 0.15 | | P, 0.08 |
| 3 | C_char4 | P, 0.04 | N | P, 0.01 | P, 0.07 |
| 4 | H33 | P, 0.15 | N | P, 0.03 | N |
| 5 | Subsystem9 | 0.04 | N | N | P, |
| 6 | Rx_sec_num | −0.07 | P, | N | +0.05 |
| 7 | Str43 | P, −0.13 | N | P, 0.01 | P, −0.04 |
| 8 | Str104 | P, −0.1 | P, 0.07 | 0.02 | −0.05 |
| 9 | Imports | N | P, −0.13 | N | N |
| 10 | Str1 | N | P, −0.13 | N | Str78. Str47 |
| 11 | Sizeof_code | N | P, −0.10 | N | N |
| 12 | Imp604 | N | N | P, −0.02 | N |
| 13 | BEn253 | BEn50 | N | BEn151 | −0.07 |
| 14 | Other 2342 features | P, 0.16 | P, −0.60 | P, | |
Comparison with five zero-day malware projects.
| Project | Methodology | Details of Dataset | Result/Accuracy | Consideration for Zero-Day Malware |
|---|---|---|---|---|
| This work | Boosting algorithms: LightGBM, XGBoost | Three datasets D1, D2, D3 from January, February, and March 2017 | 98.49% | Train with the January 2017 dataset and predict malware from February and March 2017. |
| Yousefi-Azar et al. (2018) | Natural language processing and term frequency | Android: | 97.33% | Dataset from 2016 to build the model; dataset from 2017 as zero-day malware to test against the model. |
| Venkatraman and Alazab (2018) | 1a. Kernel function | 52 k samples | 98.6% | Unknown malware |
| Jung and Kim (2015) | 1. Extract API call sequence features from both static and dynamic analysis. | Benign: 333 (.swf files) | 51% to 100% | 2007–2014 dataset for training; 2015 dataset for testing |
| Alazab et al. (2010) | k-Nearest Neighbor (kNN), Naïve Bayes (NB), and Sequential Minimal Optimization (SMO) with Normalized PolyKernel, PolyKernel, Puk, and RBF kernels | Benign software: 15,480 | 98.6% | 1. Unknown malware |
| Shafiq et al. (2015) | Ripper and SVM-SMO classifiers | Benign software: 1447 | 99.2% AUC | Malware whose signature is not in the database / unknown malware |