Young-Seob Jeong, Jiyoung Woo, SangMin Lee, Ah Reum Kang.
Abstract
Malware detection in non-executable files has recently drawn much attention because ordinary users are vulnerable to such malware. Hangul Word Processor (HWP) is software for editing non-executable text files and is widely used in South Korea. New malware targeting HWP files continues to appear because of the circumstances between South Korea and North Korea. Various studies have addressed this problem, but most are limited because they require substantial effort to define features based on expert knowledge. In this study, we designed a convolutional neural network to detect malware within HWP files. Our proposed model takes a raw byte stream as input and predicts whether it contains malicious actions. To accommodate the highly variable lengths of HWP byte streams, we propose a new padding method and a spatial pyramid average pooling layer. We experimentally demonstrate that our model is both effective and efficient.
Keywords: HWP; Hangul Word Processor; convolutional neural network; malware detection; spatial pyramid average pooling; spatial pyramid pooling; stretch padding
Year: 2020 PMID: 32942607 PMCID: PMC7570816 DOI: 10.3390/s20185265
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Structure of HWP files.
| Type | Name (*Storage or Stream) | Length |
|---|---|---|
| File recognition information | FileHeader | fixed |
| Document information | DocInfo | fixed |
| | *BodyText | |
| Main document | Section 0 | unfixed |
| Main document | Section 1 | unfixed |
| Document summary | HWPSummaryInformation | fixed |
| | *BinData | |
| Binary data | BinaryData0 | unfixed |
| Binary data | BinaryData1 | unfixed |
| Preview text | PrvText | fixed |
| Preview image | PrvImage | unfixed |
| | *DocOptions | |
| Document options | LinkDoc | unfixed |
| Document options | DrmLicense | unfixed |
| | *Scripts | |
| Script | DefaultJScript | unfixed |
| Script | JScriptVersion | unfixed |
| | *XML Template | |
| XML template | Schema | unfixed |
| XML template | Instance | unfixed |
| | *DocHistory | |
| Document history | VersionLog0 | unfixed |
| Document history | VersionLog1 | unfixed |
Figure 1. Malicious/benign byte streams within HWP files.
Figure 2. Graphical structure of the proposed model.
Figure 3. Comparison between the SPP and the SPAP. (left) The SPP working on a two-dimensional image; (right) the SPAP working on a stream of embedding vectors.
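The idea behind the SPAP, turning a variable-length stream of vectors into a fixed-length representation, can be illustrated with a minimal sketch. The function name `spatial_pyramid_average_pool` and the pyramid levels `(1, 2, 4)` are assumptions made for illustration; the record above does not state the bin configuration used in the paper, and this is not the authors' implementation.

```python
import numpy as np


def spatial_pyramid_average_pool(features, levels=(1, 2, 4)):
    """Pool a (length, dim) array into a vector of size sum(levels) * dim.

    `levels` is an assumed pyramid configuration: at level n the sequence
    is cut into n contiguous bins and each bin is averaged over time.
    """
    features = np.asarray(features, dtype=float)
    length, _ = features.shape
    if length < max(levels):
        raise ValueError("sequence is shorter than the finest pyramid level")

    pooled = []
    for n_bins in levels:
        # Integer bin edges; every bin is non-empty because length >= n_bins.
        edges = [(i * length) // n_bins for i in range(n_bins + 1)]
        for start, stop in zip(edges[:-1], edges[1:]):
            pooled.append(features[start:stop].mean(axis=0))
    return np.concatenate(pooled)


# Streams of different lengths map to vectors of the same size.
short_stream = np.random.rand(13, 3)
long_stream = np.random.rand(500, 3)
print(spatial_pyramid_average_pool(short_stream).shape)  # (21,)
print(spatial_pyramid_average_pool(long_stream).shape)   # (21,)
```

Because the output size depends only on the number of bins and the embedding dimension, a dense classification layer can follow regardless of the input length.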
Figure 4. Comparison between two padding methods, where the shaded squares represent padding tokens. (left) Conventional padding. (right) Stretch padding.
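The contrast between the two padding methods can be sketched as follows. This is one possible reading of the figure, not the paper's definition: it assumes that conventional padding appends padding tokens at the tail, while stretch padding spreads the original bytes evenly over the target length and fills the gaps with padding tokens. The names `tail_pad`, `stretch_pad`, and the token id `PAD = 256` are hypothetical.

```python
PAD = 256  # hypothetical padding token id, outside the byte range 0..255


def tail_pad(stream, target_len, pad=PAD):
    """Conventional padding: append padding tokens after the data."""
    if len(stream) > target_len:
        raise ValueError("stream is longer than the target length")
    return list(stream) + [pad] * (target_len - len(stream))


def stretch_pad(stream, target_len, pad=PAD):
    """Assumed stretch padding: spread the data evenly, pad the gaps."""
    n = len(stream)
    if n > target_len:
        raise ValueError("stream is longer than the target length")
    out = [pad] * target_len
    for i, token in enumerate(stream):
        # Positions are strictly increasing because target_len >= n.
        out[(i * target_len) // n] = token
    return out


print(tail_pad([1, 2, 3], 6))     # [1, 2, 3, 256, 256, 256]
print(stretch_pad([1, 2, 3], 6))  # [1, 256, 2, 256, 3, 256]
```

Under this reading, the original bytes keep their relative order in both schemes; only the placement of the padding tokens differs.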
Statistics of sampled data.
| | Total | Malicious | Benign |
|---|---|---|---|
| Train + Test | 6520 | 3668 | 2852 |
| Train | 5868 | 3265 | 2603 |
| Test | 652 | 403 | 249 |
Experimental results on efficiency, where #Params indicates the number of trainable parameters, FLOPS stands for floating-point operations per second, and Runtime is the time spent running the model on the test set, in seconds.
| Model | #Params | FLOPS | Runtime (s) |
|---|---|---|---|
| Cons-Conv | 2,056,354 | 4,112,896 | 15.8662 |
| Mal-Conv | 1,043,074 | 2,085,384 | 0.8572 |
| SPAP-Conv | 70,274 | 143,453 | 0.9831 |
Experimental results on effectiveness, where the two values b/m in each cell correspond to benign and malicious cases, respectively.
| Padding | Model | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| stretch | Cons-Conv | 81.32/89.99 | 89.81/85.65 | 74.30/94.79 |
| stretch | Mal-Conv | 89.05/92.06 | 81.61/98.58 | 97.99/86.35 |
| stretch | SPAP-Conv | 92.86/95.08 | 87.28/99.46 | 99.20/91.07 |
| tail | Cons-Conv | 80.00/88.54 | 84.07/86.15 | 76.31/91.07 |
| tail | Mal-Conv | 86.83/90.72 | 80.69/95.86 | 93.98/86.10 |
| tail | SPAP-Conv | 86.96/92.33 | 89.74/90.67 | 84.34/94.04 |
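The F1 values in this table follow from the reported precision and recall through the harmonic mean, F1 = 2PR / (P + R). A short check on the SPAP-Conv rows with stretch padding, using only numbers from the table (the helper name `f1_score` is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for SPAP-Conv with stretch padding.
reported = {
    "benign": (87.28, 99.20, 92.86),
    "malicious": (99.46, 91.07, 95.08),
}

for label, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{label}: computed {f1:.2f}, reported {f1_reported:.2f}")
```

The computed values agree with the reported ones to two decimal places.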
Effectiveness of the SPAP-Conv with max pooling, where the two values b/m in each cell correspond to benign and malicious cases, respectively.
| Model | F1 (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| SPAP-Conv with max-pooling | 84.67/92.27 | 98.40/86.21 | 74.30/99.26 |