Abstract
Malware compromises the confidentiality and integrity of information, causing material and moral damage to institutions and individuals. This study proposes a malware detection model based on API-call graphs and uses a Graph Variational Autoencoder (GVAE) to reduce the dimensionality of the graph node features extracted from Android APK files. The GVAE-reduced embeddings were fed to a linear-based model (SVM) and an ensemble-based model (LightGBM) to complete the detection process. To validate the effectiveness of the GVAE-reduced features, recursive feature elimination (RFE) and Fisher score (FS) were applied to the raw node features to select informative feature sets of the same sizes as the GVAE-reduced embeddings. Among the RFE and FS selections, LightGBM with 50 RFE-selected features achieved the highest accuracy (0.907) and F-measure (0.852). Using GVAE-reduced embeddings for classification increased the accuracy of both models by approximately 4%. The F-measure increased similarly, which indicates an improvement in the models' discriminative power. The final experiment, which combined RFE selection with GVAE, improved performance over GVAE-reduced embeddings alone: LightGBM reached an accuracy of 0.967 with 30 relevant features selected by RFE from the combination of all GVAE embeddings. ©2022 Gunduz.
Keywords: API-call graphs; Graph Variational Autoencoder; Graph embeddings; Malware detection; Recursive Feature Elimination
Year: 2022 PMID: 35634097 PMCID: PMC9137949 DOI: 10.7717/peerj-cs.988
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. Graphical representation of the proposed framework.
Information about datasets.
| Dataset | #instances | Type |
|---|---|---|
| CICMalDroid | 1929 | benign |
| ISCX-AndroidBot-2015 | 1843 | malware |
Confusion matrix for two-class classification.
| Actual/Predicted as | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | TN |
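
The accuracy and F-measure values reported in the result tables follow from the four confusion-matrix counts: true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). The sketch below uses the standard definitions; the counts in the usage lines are hypothetical and are not taken from the study, and treating malware as the positive class is an assumption.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all samples that were classified correctly."""
    total = tp + fn + fp + tn
    return (tp + tn) / total if total else 0.0


def f_measure(tp, fn, fp):
    """Harmonic mean of precision and recall (the F1 score)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the study).
tp, fn, fp, tn = 90, 10, 5, 95
print(round(accuracy(tp, fn, fp, tn), 3))
print(round(f_measure(tp, fn, fp), 3))
```

Because the F-measure ignores true negatives, it can fall well below accuracy when the positive class is harder to detect, which is consistent with the gap between the two columns in the result tables.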
Parameters of proposed GVAE model.
| Parameter | Value |
|---|---|
| #epoch | 100 |
| #hidden units in GCN | {20, 30, 40, 50} |
| #GCN layers | 2 |
| Dropout rate | 0.2 |
| L2-regularization rate | 0.01 |
| Optimizer | Adam |
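
To make the encoder side of these settings concrete, the following is a minimal, untrained forward pass of a two-layer GCN encoder in the style of a graph variational autoencoder, written in plain Python. It is a shape-level sketch under stated assumptions, not the study's implementation: the function names, the Glorot-style initialization, and setting the latent size equal to the number of hidden units are all assumptions, and dropout, L2 regularization, the decoder, and Adam training are omitted.

```python
import math
import random


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense, non-negative adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # self-loops guarantee deg >= 1
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def init_weights(rows, cols, rng):
    """Glorot-style uniform initialization (an assumption, not from the source)."""
    scale = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def gvae_encode(adj, features, hidden_units, latent_dim, seed=0):
    """Two GCN layers: a shared hidden layer, then mean and log-variance heads."""
    rng = random.Random(seed)
    a_norm = normalize_adjacency(adj)
    w0 = init_weights(len(features[0]), hidden_units, rng)
    w_mu = init_weights(hidden_units, latent_dim, rng)
    w_logvar = init_weights(hidden_units, latent_dim, rng)
    h = relu(matmul(a_norm, matmul(features, w0)))   # GCN layer 1
    mu = matmul(a_norm, matmul(h, w_mu))             # GCN layer 2: means
    logvar = matmul(a_norm, matmul(h, w_logvar))     # GCN layer 2: log-variances
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
    z = [[m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(m_row, lv_row)]
         for m_row, lv_row in zip(mu, logvar)]
    return mu, logvar, z


# Toy 3-node path graph with 4 raw features per node (hypothetical data).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 0.0, 3.0], [2.0, 1.0, 1.0, 0.0]]
mu, logvar, z = gvae_encode(adj, feats, hidden_units=20, latent_dim=20)
print(len(z), len(z[0]))  # one embedding row per node
```

In a trained model the weights would be learned by minimizing a reconstruction loss plus a KL term; here they are random, so only the shapes and the propagation rule are meaningful.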
Parameter space of LightGBM.
| Parameters | Value |
|---|---|
| number of learners | {100,200,500,1000} |
| learning rate | {0.1,0.01} |
| L2-regularizer | {0.001,0.0001} |
| max_depth | {7,9,11} |
Parameter space of SVM.
| Parameters | Value |
|---|---|
| Kernel Type | {rbf,poly} |
| Regularization (C) | {0.5,0.1,1,2,4,8} |
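
The two parameter spaces above can be enumerated as candidate configurations. The sketch below assumes an exhaustive grid search, which the source does not confirm. The keys follow the scikit-learn-style names used by `lightgbm.LGBMClassifier` and `sklearn.svm.SVC`; mapping "number of learners" to `n_estimators` and "L2-regularizer" to `reg_lambda` is an assumption.

```python
from itertools import product

# Values copied from the parameter tables; the key names are assumptions.
LIGHTGBM_SPACE = {
    "n_estimators": [100, 200, 500, 1000],
    "learning_rate": [0.1, 0.01],
    "reg_lambda": [0.001, 0.0001],
    "max_depth": [7, 9, 11],
}

SVM_SPACE = {
    "kernel": ["rbf", "poly"],
    "C": [0.5, 0.1, 1, 2, 4, 8],
}


def expand_grid(space):
    """Return every combination of the listed values as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


lightgbm_grid = expand_grid(LIGHTGBM_SPACE)
svm_grid = expand_grid(SVM_SPACE)
print(len(lightgbm_grid), len(svm_grid))  # 4*2*2*3 = 48 and 2*6 = 12 candidates
```

Each dict can be unpacked into the corresponding estimator's constructor, or the two spaces can be passed directly as `param_grid` to a search utility such as scikit-learn's `GridSearchCV`.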
Classification results with GVAE-reduced node embeddings.
| #Features | LightGBM Accuracy | LightGBM F-Measure | SVM Accuracy | SVM F-Measure |
|---|---|---|---|---|
| | 0.917 | 0.875 | 0.909 | 0.865 |
| | 0.943 | 0.909 | 0.927 | 0.892 |
| | | | | |
| | 0.937 | 0.901 | 0.927 | 0.889 |
Classification results with raw node features (RFE-selected).
| #Features | LightGBM Accuracy | LightGBM F-Measure | SVM Accuracy | SVM F-Measure |
|---|---|---|---|---|
| | 0.873 | 0.802 | 0.851 | 0.779 |
| | 0.892 | 0.831 | 0.870 | 0.808 |
| | 0.901 | 0.846 | 0.883 | 0.823 |
| | | | | |
Classification results with raw node features (FS-selected).
| #Features | LightGBM Accuracy | LightGBM F-Measure | SVM Accuracy | SVM F-Measure |
|---|---|---|---|---|
| | 0.861 | 0.791 | 0.841 | 0.767 |
| | 0.883 | 0.820 | 0.862 | 0.798 |
| | 0.889 | 0.834 | 0.872 | 0.811 |
| | | | | |
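
Fisher-score selection ranks each feature by how far apart the class means are relative to the spread within each class. The sketch below uses one common formulation (between-class scatter divided by within-class scatter, computed per feature); the source does not state which variant was used, and the function names and toy data are hypothetical.

```python
def fisher_scores(X, y):
    """Per-feature Fisher score: between-class scatter over within-class scatter."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between, within = 0.0, 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            n_c = len(values)
            mean_c = sum(values) / n_c
            var_c = sum((v - mean_c) ** 2 for v in values) / n_c
            between += n_c * (mean_c - overall_mean) ** 2
            within += n_c * var_c
        if within > 0:
            scores.append(between / within)
        elif between > 0:
            scores.append(float("inf"))  # perfectly separated, zero spread
        else:
            scores.append(0.0)           # constant feature
    return scores


def top_k_features(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [1.1, 0.9]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
print(top_k_features(scores, 1))
```

Unlike RFE, this ranking scores each feature independently and never retrains a classifier, so it is cheaper but cannot account for interactions between features.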
Classification results with the combination of GVAE-reduced embedding and RFE selection.
| Model | #Features | Accuracy | F-Measure |
|---|---|---|---|
| | | | |
| | | 0.955 | 0.924 |
Classification results of recent malware studies.
| Models | Feature Set | Accuracy | F-Measure |
|---|---|---|---|
| | Permissions, API-calls | 0.975 | NA |
| | n-opcode of classes.dex | 0.965 | NA |
| | Permissions | 0.930 | NA |
| | APIs’ permissions | NA | 0.985 |
| | API-calls | 0.900 | NA |
| | Risky permissions | 0.974 | 0.974 |
| | API-call sequences | 0.972 | 0.982 |
| | API-call graphs | 0.988 | 0.986 |
| | APIs’ permissions | | |
| | API-call graphs | 0.961 | 0.948 |
| | API-call graphs | | |