Di Xue, Jingmei Li, Weifei Wu, Qiao Tian, JiaXiang Wang.
Abstract
With the exponential increase in malware, homology analysis has become a hot research topic in the malware detection field. This paper proposes MHAS, a malware homology analysis system based on ensemble learning and multiple features. MHAS generates grayscale images from malware binary files and then uses the disassembly tool IDA Pro to extract opcode sequences and system call graphs, from which RGB images and M-images are generated via an image matrix. MHAS then uses convolutional neural networks (CNNs) as base learners and performs bagging ensemble learning to learn features from the grayscale images, RGB images and M-images. Next, MHAS integrates the nine base learners by voting, learning and selective ensemble (in that order) and maps the integration results to a result matrix. Finally, the result matrix is integrated again with the learning method to obtain the final malware classification result. To verify the accuracy of MHAS, we performed a malware family classification experiment that included samples from 10 malware families. The results showed that MHAS can reach an accuracy of 99.17%, meaning that it can effectively analyze and identify malware families.
Year: 2019 PMID: 31449533 PMCID: PMC6709908 DOI: 10.1371/journal.pone.0211373
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of MHAS.
Fig 2. Grayscale image conversion process.
Fig 3. Partition of opcode sequences.
Fig 4. Generation of pixel points using opcode sequences.
Fig 5. Processing flow of the system call subgraph.
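The byte-to-pixel conversion behind Fig 2 can be sketched as follows. This is a minimal sketch, not the paper's exact procedure: the helper name `binary_to_grayscale` and the default row width of 256 are our assumptions (the paper chooses the width from the file size).

```python
def binary_to_grayscale(data, width=256):
    """Map each byte of a malware binary to one 8-bit grayscale pixel
    and wrap the byte stream into rows of `width` pixels.
    `width` is an assumed parameter; the paper derives it from file size."""
    rows = [list(data[i:i + width]) for i in range(0, len(data), width)]
    if rows and len(rows[-1]) < width:
        # Zero-pad the final partial row so the image is rectangular.
        rows[-1] += [0] * (width - len(rows[-1]))
    return rows
```

Each row of the returned matrix is one scanline of the grayscale image; stacking the rows gives the image fed to GRAY-CNN-X.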
Parameter list of the three CNNs.
| Network layer type | Size | Output size (GRAY-CNN-X) | Output size (RGB-CNN-X) | Output size (SYS-CNN-X) |
|---|---|---|---|---|
| Input layer | – | [1, (r, c)] | [3, (256, 256)] | [1, (L, L)] |
| Convolutional layer | 8 3×3 convolution kernels | [8, (r, c)] | [8, (256, 256)] | [8, (L, L)] |
| Max pooling layer | 2×2, stride 2 | [8, (r/2, c/2)] | [8, (128, 128)] | [8, (L/2, L/2)] |
| Dropout layer | – | [8, (r/2, c/2)] | [8, (128, 128)] | [8, (L/2, L/2)] |
| Convolutional layer | 16 3×3 convolution kernels | [16, (r/2, c/2)] | [16, (128, 128)] | [16, (L/2, L/2)] |
| Max pooling layer | 2×2, stride 2 | [16, (r/4, c/4)] | [16, (64, 64)] | [16, (L/4, L/4)] |
| Dropout layer | – | [16, (r/4, c/4)] | [16, (64, 64)] | [16, (L/4, L/4)] |
| Convolutional layer | 32 3×3 convolution kernels | [32, (r/4, c/4)] | [32, (64, 64)] | [32, (L/4, L/4)] |
| Convolutional layer | 32 3×3 convolution kernels | [32, (r/4, c/4)] | [32, (64, 64)] | [32, (L/4, L/4)] |
| Max pooling layer | 2×2, stride 2 | [32, (r/8, c/8)] | [32, (32, 32)] | [32, (L/8, L/8)] |
| Dropout layer | – | [32, (r/8, c/8)] | [32, (32, 32)] | [32, (L/8, L/8)] |
| Convolutional layer | 64 3×3 convolution kernels | [64, (r/8, c/8)] | [64, (32, 32)] | [64, (L/8, L/8)] |
| Convolutional layer | 64 3×3 convolution kernels | [64, (r/8, c/8)] | [64, (32, 32)] | [64, (L/8, L/8)] |
| Max pooling layer | 2×2, stride 2 | [64, (r/16, c/16)] | [64, (16, 16)] | [64, (L/16, L/16)] |
| Dropout layer | – | [64, (r/16, c/16)] | [64, (16, 16)] | [64, (L/16, L/16)] |
| Convolutional layer | 64 3×3 convolution kernels | [64, (r/16, c/16)] | [64, (16, 16)] | [64, (L/16, L/16)] |
| Convolutional layer | 64 3×3 convolution kernels | [64, (r/16, c/16)] | [64, (16, 16)] | [64, (L/16, L/16)] |
| SPP layer | 3-layer pyramid | 1344-D vector | – | – |
| Max pooling layer | 2×2, stride 2 | – | [64, (8, 8)] | [64, (L/32, L/32)] |
| Dropout layer | – | 1344-D vector | [64, (8, 8)] | [64, (L/32, L/32)] |
| Fully connected layer | 512 neurons | [512, (1, 1)] | [512, (1, 1)] | [512, (1, 1)] |
| Fully connected layer | 512 neurons | [512, (1, 1)] | [512, (1, 1)] | [512, (1, 1)] |
| Fully connected layer | 512 neurons | [512, (1, 1)] | [512, (1, 1)] | [512, (1, 1)] |
| Softmax layer | R classes | [R, (1, 1)] | [R, (1, 1)] | [R, (1, 1)] |
* These layers are common to GRAY-CNN-X, RGB-CNN-X and SYS-CNN-X (X = 1, 2, 3).
** This layer belongs only to GRAY-CNN-X.
*** This layer is common to both RGB-CNN-X and SYS-CNN-X.
g This output size belongs to GRAY-CNN-X.
r This output size belongs to RGB-CNN-X.
s This output size belongs to SYS-CNN-X.
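As a sanity check on the output sizes in the table, a minimal sketch (helper names are ours) traces the spatial dimension of RGB-CNN-X through its five 2×2, stride-2 max-pooling layers. The fact that the 3×3 convolutions preserve spatial size in the table implies padded convolutions, which the sketch assumes.

```python
def pool(size):
    """One 2x2 max pooling with stride 2 halves each spatial dimension."""
    return size // 2

def spatial_trace(size, num_pools=5):
    """Follow the spatial size through the pooling layers of the table;
    padded 3x3 convolutions and dropout layers leave it unchanged."""
    sizes = [size]
    for _ in range(num_pools):
        sizes.append(pool(sizes[-1]))
    return sizes
```

For the 256×256 RGB image this reproduces the table's column: 256 → 128 → 64 → 32 → 16 → 8.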
Fig 6. Learning process of MHAS.
Malware sample families.
| Number | Malware family | Quantity |
|---|---|---|
| No.1 | Backdoor.Win32.Bifrose | 30 |
| No.2 | Backdoor.Win32.SdBot | 30 |
| No.3 | Backdoor.Win32.IRCBot | 30 |
| No.4 | Trojan-Downloader.Win32.Lemmy | 30 |
| No.5 | Trojan-Downloader.Win32.IstBar | 30 |
| No.6 | Trojan-Dropper.Win32.Tab | 30 |
| No.7 | Trojan-Dropper.Win32.Eva | 30 |
| No.8 | Virus.Win32.HLLP.Semisoft | 30 |
| No.9 | Virus.Win32.Gpcode | 30 |
| No.10 | Worm.Win32.Deborm | 30 |
Fig 7. The influence of the number of key subgraph vertices on the result.
Fig 8. The influence of ensemble strategies on the result (multifeatures).
Fig 9. The influence of feature type and quantity on the result.
Fig 10. Confusion matrix of malware family classification by MHAS.
The confusion matrix values consist of the true positive rate and the false negative rate of malware family classification by MHAS. The values on the main diagonal represent the true positive rate, and the off-diagonal values indicate false negative rates. Both rates are averaged over 10-fold cross-validation.
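The diagonal reading described above can be made concrete with a small sketch; the 3×3 count matrix below is invented for illustration and is not the paper's data.

```python
def true_positive_rates(cm):
    """Per-family true positive rate from a confusion matrix of raw counts,
    with rows = true family and columns = predicted family. The diagonal
    entry of each row divided by the row sum is that family's TPR."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

# Hypothetical counts for three families of 30 samples each.
cm = [[29, 1, 0],
      [0, 30, 0],
      [2, 0, 28]]
```

Here `true_positive_rates(cm)` gives roughly 0.967, 1.0 and 0.933; averaging such per-fold matrices yields the rates plotted in Fig 10.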
The results of the different malware homology analysis methods.
| Analytical method | Training time (s) | Accuracy rate | Standard deviation |
|---|---|---|---|
| GIST | 379.265 | 85.77% | 9.61 |
| CNN & system call sequences | 55.349 | 89.4% | 10.21 |
| CNN & API sequences | 30.632 | 96.7% | 4.37 |
| DNN & API sequences | 50.343 | 97.06% | 5.25 |
| SNN & grayscale images and opcode sequences | 150.447 | 98.9% | 4.14 |
Fig 11. The results of six malware homology analysis methods.
Algorithm 1. Partition of a disassembled code area into control-flow opcode sequences: at each split point, set flag←false and update the code area; collect the phrase array and the opcode array, and output the control-flow opcode sequence array.
In static analysis, the disassembled instruction sequence contains no statements that the control flow cannot reach; therefore, Algorithm 1 can traverse all the code. The x86 assembly instructions "jmp", "jz", "ja", "ret", "retn" and others are unconditional or conditional transfer statements, and the statement following each of them is a split point. The "call" statement executes a call from which control flow returns after the call completes, so it can be treated as ordinary sequential flow. For example, Fig 3 shows a subsequence division diagram of disassembled malware in which the first division point is the first instruction, because it contains the address referenced by the virtual address "140001584+33". The second division point is the instruction "xor r13d, r13d", because it follows the conditional jump instruction "jz short loc_140001589". By analogy, the subsequence partitioning method in Fig 3 extracts four opcode sequences: "MOVXORMOVMOVCALLMOVTESTJZ", "XORTESTJLE", "MOVMOVCALLADDMOVTESTJ" and "MOVMOVMOVMOVMOVCALLMOVMOVMOVCALLINCCMPJL".
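The partition rule described above can be sketched as follows; it is a minimal sketch in which the mnemonic prefix set stands in for the paper's full list of transfer instructions, and `partition_opcodes` is our own helper name.

```python
# Assumed subset of x86 control-transfer mnemonic prefixes; the paper's
# full instruction list may differ.
TRANSFER_PREFIXES = ("jmp", "jz", "jnz", "ja", "jl", "jle", "ret", "retn")

def partition_opcodes(opcodes):
    """Split a disassembled opcode stream into control-flow subsequences.
    A transfer instruction ends the current subsequence (the following
    instruction is a split point); "call" is treated as sequential flow."""
    sequences, current = [], []
    for op in opcodes:
        current.append(op.upper())
        if op.lower().startswith(TRANSFER_PREFIXES):
            sequences.append("".join(current))
            current = []
    if current:  # trailing instructions with no final transfer
        sequences.append("".join(current))
    return sequences
```

Feeding the first two basic blocks of the Fig 3 example through this sketch reproduces the first two sequences quoted above.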
Algorithm 2. MHAS classification: convert the input binary file; extract the feature views using the three feature extraction methods; classify each feature view with its base learner; integrate the base-learner outputs (calculate the sum of the elements, query the element with the maximum value, and randomly select one of the tied elements when the maximum is not unique); output the final classification result.
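The voting fragment of the integration step (sum the votes, query the maximum, pick randomly among ties) can be sketched as follows; `majority_vote` is a hypothetical helper name, not the paper's implementation.

```python
import random

def majority_vote(predictions, rng=random):
    """Plurality vote over base-learner labels; when several labels tie
    for the maximum vote count, randomly select one of them."""
    counts = {}
    for label in predictions:
        counts[label] = counts.get(label, 0) + 1
    best = max(counts.values())
    tied = sorted(label for label, c in counts.items() if c == best)
    return rng.choice(tied)
```

With nine base learners, `majority_vote` would be applied to the nine predicted family labels before the result is mapped into the result matrix.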