Xuejin Zhu, Jie Huang, Bin Wang, Chunyang Qi.
Abstract
The determination of malware family homology has become a research hotspot as the number of malware variants is on the rise. However, existing studies on malware visualization determine homology based only on the global structural features of the executable, so variants that their creators intentionally give the same structure are misclassified as belonging to the same family. We sought to develop a homology determination method that fuses global structural features with local fine-grained features, based on malware visualization. Specifically, the global structural information of the malware executable file was converted into a bytecode image, and the opcode semantic information of the code segment was extracted by the n-gram feature model to generate an opcode image. We also propose a dual-branch convolutional neural network, which takes features of the opcode image and the bytecode image as the basis for the final family classification. Our results demonstrate that the accuracy and F-measure of family homology classification based on the proposed scheme are 99.05% and 98.52%, respectively, which is better than the results from a single image feature or other major schemes.
Keywords: Computer security; Homology determination; Machine learning; Malware visualization
Year: 2021 PMID: 33977134 PMCID: PMC8056249 DOI: 10.7717/peerj-cs.494
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. Overview of the proposed method.
Figure 2. Construction process of bytecode image.
Bytecode grayscale image size.
| PE file size | Image width |
|---|---|
| <1,024 kb | 512 |
| 1,024–4,096 kb | 1,024 |
| >4,096 kb | 2,048 |
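The width rule in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, "kb" is assumed to mean 1,024 bytes, and zero-padding of the last partial row is an assumption the record does not confirm.

```python
from typing import List


def image_width_for_size(num_bytes: int) -> int:
    """Pick the image width from the PE file size, following the table above.

    Assumption: one "kb" is 1,024 bytes.
    """
    size_kb = num_bytes / 1024
    if size_kb < 1024:
        return 512
    if size_kb <= 4096:
        return 1024
    return 2048


def bytes_to_grayscale_rows(data: bytes, width: int) -> List[List[int]]:
    """Lay the bytes out row by row; each byte (0-255) becomes one gray pixel.

    Assumption: a trailing partial row is padded with zeros.
    """
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        row.extend([0] * (width - len(row)))  # pad the final row if needed
        rows.append(row)
    return rows
```

For a real sample one would call `bytes_to_grayscale_rows(data, image_width_for_size(len(data)))`, so the image height follows from the file size.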
Figure 3. Extraction of the opcode stream from malware assembly file.
Opcode image matrix.
Opcode image construction (14-step algorithm; step 4: "Initialize an …").
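The steps of the construction algorithm are not legible in this record, so the following is only a rough sketch of one way an n-gram opcode image could be built, not the authors' procedure. It assumes n = 2 (bigrams), a fixed opcode vocabulary supplied by the caller, and linear scaling of counts to the 0–255 gray range. All of these choices, and the function name, are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def opcode_bigram_image(opcodes: Sequence[str], vocab: Sequence[str]) -> List[List[int]]:
    """Build a len(vocab) x len(vocab) gray image from opcode bigram counts.

    Pixel (i, j) reflects how often vocab[i] is immediately followed by
    vocab[j] in the opcode stream. Opcodes outside the vocabulary are skipped.
    """
    index = {op: i for i, op in enumerate(vocab)}
    counts: Counter = Counter()
    for first, second in zip(opcodes, opcodes[1:]):
        if first in index and second in index:
            counts[(index[first], index[second])] += 1

    size = len(vocab)
    image = [[0] * size for _ in range(size)]
    peak = max(counts.values(), default=0)
    for (i, j), count in counts.items():
        # Assumed scaling: the most frequent bigram maps to gray level 255.
        image[i][j] = 255 * count // peak
    return image
```

With `vocab = ["mov", "push", "call"]` and the stream `["push", "mov", "mov", "call", "push", "mov"]`, the bigram `push mov` occurs twice and therefore gets the brightest pixel.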
Figure 4. Opcode images of different families.
(A) Lollipop family. (B) Kelihos_ver3 family. (C) Kelihos_ver1 family. (D) Gatak family.
Figure 5. Reshape the bytecode image to a fixed size square image.
Figure 6. Architecture of the proposed dual-branch CNN model.
The list of dual-branch CNN structure parameters.
| Network layer type | Size | Output dimension |
|---|---|---|
| Input layer | – | (2, 1, 64, 64) |
| Convolutional layer | 2 × (16 @ 3 × 3) filter | (2, 16, 64, 64) |
| Max pooling layer | 2 × 2, stride 2 | (2, 16, 32, 32) |
| Convolutional layer | 2 × (32 @ 3 × 3) filter | (2, 32, 32, 32) |
| Max pooling layer | 2 × 2, stride 2 | (2, 32, 16, 16) |
| Fully connected layer | 512 nodes | (512,1) |
| Output layer | – | 1 |
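The output dimensions in the table above can be checked with simple shape arithmetic. The sketch below traces one branch. It assumes stride-1 convolutions with "same" padding (inferred from the unchanged 64 × 64 size after each convolution) and reads the leading 2 in each tuple as the two branches. Whether the branches are concatenated before the fully connected layer is also an assumption. The helper names are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width) for one branch


def conv3x3_same(shape: Shape, out_channels: int) -> Shape:
    """3x3 convolution, stride 1, 'same' padding: spatial size is unchanged."""
    _, height, width = shape
    return (out_channels, height, width)


def max_pool(shape: Shape, kernel: int = 2, stride: int = 2) -> Shape:
    """Max pooling without padding."""
    channels, height, width = shape
    return (channels, (height - kernel) // stride + 1, (width - kernel) // stride + 1)


def branch_shapes(input_shape: Shape = (1, 64, 64)) -> List[Shape]:
    """Shapes after each layer of one branch, in table order."""
    shapes = [input_shape]
    shapes.append(conv3x3_same(shapes[-1], 16))
    shapes.append(max_pool(shapes[-1]))
    shapes.append(conv3x3_same(shapes[-1], 32))
    shapes.append(max_pool(shapes[-1]))
    return shapes


shapes = branch_shapes()
channels, height, width = shapes[-1]
features_per_branch = channels * height * width
# Assumption: both branches are flattened and concatenated before the 512-node layer.
features_both_branches = 2 * features_per_branch
```

Under these assumptions each branch contributes 8,192 features, so the 512-node fully connected layer would receive 16,384 inputs.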
The sample distribution of the dataset.
| Family | Sample number | Family ID |
|---|---|---|
| Ramnit | 1,541 | 1 |
| Lollipop | 2,478 | 2 |
| Kelihos_ver3 | 2,942 | 3 |
| Vundo | 475 | 4 |
| Simda | 42 | 5 |
| Tracur | 751 | 6 |
| Kelihos_ver1 | 398 | 7 |
| Obfuscator.ACY | 1,228 | 8 |
| Gatak | 1,013 | 9 |
Hyper-parameters to tune the dual-branch CNN model.
| Hyper-parameter | Search space | Best value |
|---|---|---|
| Epoch | 10–50 | 20 |
| Batch training samples | 64–512 | 128 |
| Optimization algorithm | SGD, Adadelta, RMSprop, Adam, Adagrad | Adam |
| Initial learning rate | 0.0001–0.01 | 0.0013 |
| Decay of learning rate | 0.001–0.1 | 0.02 |
| Dropout probability | 0.1–0.9 | 0.5 |
| Loss function | – | categorical_crossentropy |
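The best values in the table above can be collected into a small configuration object, sketched below. The record does not say how the learning-rate decay is applied. The helper assumes inverse-time decay, `lr0 / (1 + decay * step)`, which is one common convention. Both the formula and the names `TrainingConfig` and `decayed_learning_rate` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Best hyper-parameter values as reported in the table above."""
    epochs: int = 20
    batch_size: int = 128
    optimizer: str = "Adam"
    initial_learning_rate: float = 0.0013
    learning_rate_decay: float = 0.02
    dropout: float = 0.5
    loss: str = "categorical_crossentropy"


def decayed_learning_rate(cfg: TrainingConfig, step: int) -> float:
    """Assumed inverse-time decay: lr0 / (1 + decay * step).

    Whether `step` counts epochs or batches is not stated in the record.
    """
    return cfg.initial_learning_rate / (1.0 + cfg.learning_rate_decay * step)
```

Under this assumed schedule the learning rate halves after 50 steps, since 1 + 0.02 × 50 = 2.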
Effects of image size on classification performance.
| Image size | Accuracy | Precision | Recall | F1-measure | Time (ms) |
|---|---|---|---|---|---|
| 16 | 0.9154 | 0.9037 | 0.8934 | 0.8985 | 12 |
| 32 | 0.9641 | 0.9630 | 0.9672 | 0.9651 | 30 |
| 64 | 0.9905 | 0.9871 | 0.9833 | 0.9852 | 42 |
| 128 | 0.9907 | 0.9846 | 0.9812 | 0.9829 | 180 |
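As a consistency check, the F1 column above matches the harmonic mean of the reported precision and recall in every row. The short sketch below recomputes it. The function name is hypothetical, and the record does not state how the metrics were averaged across families, so the check applies to this table only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# image size -> (precision, recall, reported F1), copied from the table above
reported = {
    16: (0.9037, 0.8934, 0.8985),
    32: (0.9630, 0.9672, 0.9651),
    64: (0.9871, 0.9833, 0.9852),
    128: (0.9846, 0.9812, 0.9829),
}

recomputed = {size: f1_score(p, r) for size, (p, r, _) in reported.items()}
```

Each recomputed value agrees with the reported one to four decimal places. For the 64 × 64 setting this is the 98.52% F-measure quoted in the abstract.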
Figure 7. Accuracy of malware classification with different image sizes.
Impact of different feature descriptions on classification performance.
| Image type | Feature | Accuracy | Precision | Recall | F-measure | Time (ms) |
|---|---|---|---|---|---|---|
| bytecode | structural | 0.9201 | 0.9212 | 0.9224 | 0.9218 | 30 |
| opcode | semantic | 0.9617 | 0.9534 | 0.9591 | 0.9562 | 38 |
| bytecode + opcode | structural + semantic | 0.9905 | 0.9871 | 0.9833 | 0.9852 | 42 |
Figure 8. Accuracy of malware classification with different feature descriptions.
Figure 9. Confusion matrix of nine families for different feature descriptions.
(A) Bytecode image feature. (B) Opcode image feature. (C) Bytecode+Opcode image feature.
Proposed model compared with other methods.
| Method | Accuracy | Precision | Recall | F-measure |
|---|---|---|---|---|
| Gist+KNN (image) | 0.8897 | 0.9150 | 0.9122 | 0.9081 |
| LBP+KNN (image) | 0.9110 | 0.9198 | 0.9563 | 0.9137 |
| CNN (image) | 0.9760 | 0.9310 | 0.8871 | 0.9085 |
| Hybrid of CNN and GRU (image) | 0.9651 | 0.9517 | 0.9439 | 0.9478 |
| N-gram + KNN (text) | 0.8908 | 0.8891 | 0.9197 | 0.9119 |
| Proposed method | 0.9905 | 0.9871 | 0.9833 | 0.9852 |