Lejun Zhang (1,2,3), Jinlong Wang (1), Weizheng Wang (4), Zilong Jin (5), Chunhui Zhao (6), Zhennao Cai (7), Huiling Chen (7).
Abstract
Blockchain presents an opportunity to address the security and privacy issues of the Internet of Things; however, blockchain itself has certain security issues. Accurately identifying smart contract vulnerabilities is one of the key issues at hand. Most existing methods require large-scale data support to avoid overfitting; machine learning (ML) models trained on small-scale vulnerability data often fail to produce satisfactory results in smart contract vulnerability prediction. In the real world, however, collecting contract vulnerability data incurs substantial human and time costs. To alleviate these problems, this paper proposes an ensemble learning (EL)-based contract vulnerability prediction method, which combines seven different neural networks trained on contract vulnerability data for contract-level vulnerability detection. Seven neural network (NN) models were first pretrained using an information graph (IG) consisting of source datasets and were then integrated into an ensemble model called SCVDIE (Smart Contract Vulnerability Detection based on Information Graph and Ensemble Learning). The effectiveness of the SCVDIE model was verified using a target dataset composed of an IG, and its performance was compared with static tools and seven independent data-driven methods. The verification and comparison results show that the proposed SCVDIE method has higher accuracy and robustness than the other data-driven methods in the target task of predicting smart contract vulnerabilities.
Keywords: ensemble learning; blockchain security; information graph; operation flow; smart contract; vulnerability detection
Year: 2022 PMID: 35591270 PMCID: PMC9105394 DOI: 10.3390/s22093581
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
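The ensemble step described in the abstract, combining the predictions of seven pretrained NN models into one contract-level decision, can be sketched as weighted soft voting over each model's predicted probabilities. This is a minimal illustration only; the model outputs, uniform default weighting, and 0.5 threshold here are assumptions, not details taken from the paper:

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Combine per-model vulnerability probabilities by (weighted) soft voting.

    prob_list: one array of shape (n_samples,) per base model, each entry the
               predicted probability that a contract is vulnerable.
    weights:   optional per-model weights; defaults to a uniform average.
    """
    probs = np.stack(prob_list)            # shape (n_models, n_samples)
    if weights is None:
        weights = np.ones(len(prob_list))  # uniform voting by default
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalize weights to sum to 1
    combined = weights @ probs             # weighted mean score per sample
    return (combined >= 0.5).astype(int), combined

# Three toy "models" scoring four contracts (probability of being vulnerable)
labels, scores = ensemble_predict([
    np.array([0.9, 0.2, 0.6, 0.4]),
    np.array([0.8, 0.1, 0.4, 0.6]),
    np.array([0.7, 0.3, 0.8, 0.3]),
])
print(labels)  # [1 0 1 0]
```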
Figure 1. SCVDIE overview.
Symbols and corresponding descriptions.

| Symbol | Description | Symbol | Description |
|---|---|---|---|
| – | IG of a contractual composition, not specific. | – | The hidden dimension. |
| – | The set of nodes of the IG. | – | The linear layer weight. |
| – | The set of edges of the IG. | – | The linear layer bias. |
| – | Dataset. | – | Probability of results. |
| – | The … | – | Predictive labeling of the sample. |
| – | Data labels corresponding to … | – | The probability of the … |
| – | The original number of samples. | – | The number of sub-paths. |
| – | The number of nodes of … | – | Vector of the … |
| – | The node embedding method. | – | Vector of the … |
| – | The … | – | Integral vector of sub-paths. |
| – | Embedded vectors corresponding to … | – | Vector of the … |
| – | Value domain. | – | Vector of the overall path on Transformer. |
| – | Embedding dimensions. | – | The combined score of sub-paths. |
| – | The set of … | – | The score of the … |
| – | Data labels corresponding to … | – | Sub-path result weights on CNN. |
| – | Overall node information corresponding to … | – | Overall path composite score. |
| – | The node semantic embedding. | – | Final prediction results for the sample. |
Figure 2. Co-occurrence frequency.
Figure 3. SCVDIE-Ensemble.
Distribution of raw data of smart contracts.
| Version | Sol Files | Number | Version | Sol Files | Number |
|---|---|---|---|---|---|
| 0.4.0 | 92 | 238 | 0.4.11 | 737 | 4375 |
| 0.4.1 | 7 | 14 | 0.4.12 | 44 | 291 |
| 0.4.2 | 110 | 438 | 0.4.13 | 404 | 2348 |
| 0.4.3 | 3 | 30 | 0.4.14 | 31 | 130 |
| 0.4.4 | 634 | 2028 | 0.4.15 | 270 | 1652 |
| 0.4.5 | 4 | 4 | 0.4.16 | 650 | 1913 |
| 0.4.6 | 97 | 182 | 0.4.17 | 169 | 796 |
| 0.4.7 | 18 | 34 | 0.4.18 | 1393 | 7910 |
| 0.4.8 | 255 | 1425 | 0.4.19 | 423 | 2845 |
| 0.4.9 | 83 | 196 | 0.4.20 | 1097 | 5143 |
| 0.4.10 | 90 | 276 | 0.4.21 | 188 | 919 |
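Counts like those in the table above can be tallied from each contract's `pragma solidity` directive. A minimal sketch; the regex and the sample sources are illustrative, and real contracts may declare version ranges that this simple pattern would miss:

```python
import re
from collections import Counter

# Matches e.g. "pragma solidity ^0.4.18;" and captures the version number
PRAGMA_RE = re.compile(r'pragma\s+solidity\s*[\^>=<]*\s*(\d+\.\d+\.\d+)')

def count_versions(sources):
    """Tally Solidity sources by the compiler version in their pragma line."""
    counts = Counter()
    for src in sources:
        m = PRAGMA_RE.search(src)
        if m:
            counts[m.group(1)] += 1
    return counts

demo = [
    "pragma solidity ^0.4.18;\ncontract A {}",
    "pragma solidity ^0.4.18;\ncontract B {}",
    "pragma solidity ^0.4.20;\ncontract C {}",
]
print(dict(count_versions(demo)))  # {'0.4.18': 2, '0.4.20': 1}
```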
Figure 4. Distribution of each type of vulnerability as a percentage of the overall dataset.
Figure 5. The final number of vulnerability contracts collected.
Figure 6. Opcode extraction process.
Operation code category division.
| Type | Instructions |
|---|---|
| Calldata&Codedata | callcode, calldatacopy, callvalue, calldataload, calldatasize, codecopy, codesize, extcodecopy |
| Jump&Stop | stop, jump, jumpi, pc, returndatacopy, return, returndatasize, revert, invalid, selfdestruct |
| Memory | mload, mstore, msize, sstore, call, create, delegatecall, staticcall |
| Compute | add (x, y), addmod (x, y, m), div (x, y), exp (x, y), mod (x, y), mul (x, y), mulmod (x, y, m), sdiv (x, y), signextend (i, x), smod (x, y) |
| Compare | eq (x, y), iszero (x), gt (x, y), lt (x, y), sgt (x, y), slt (x, y) |
| Block | gasprice, gaslimit, difficulty, number, timestamp, coinbase, blockhash (b), keccak256 |
| Transaction | caller, gas, origin, address, balance |
| Bitoperation | and (x, y), byte (n, x), not (x), or (x, y), shl (x, y), shr (x, y), sar (x, y), xor (x, y) |
| Stack | dup, log, pop, push, swap |
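A simple lookup table over mnemonics is enough to map an extracted opcode sequence onto these categories. A hedged sketch, listing only a few representative opcodes per category rather than the full mapping from the table above:

```python
# Partial, illustrative mapping from EVM opcode mnemonic to category;
# a full implementation would cover every opcode in the table above.
OPCODE_CATEGORIES = {
    "calldatacopy": "Calldata&Codedata", "codecopy": "Calldata&Codedata",
    "stop": "Jump&Stop", "jump": "Jump&Stop", "jumpi": "Jump&Stop",
    "mload": "Memory", "mstore": "Memory", "sstore": "Memory",
    "add": "Compute", "mul": "Compute", "div": "Compute",
    "eq": "Compare", "gt": "Compare", "lt": "Compare",
    "timestamp": "Block", "blockhash": "Block",
    "caller": "Transaction", "origin": "Transaction",
    "and": "Bitoperation", "xor": "Bitoperation",
    "push": "Stack", "pop": "Stack", "dup": "Stack", "swap": "Stack",
}

def categorize(opcodes):
    """Map opcode mnemonics to category labels; unknown opcodes -> 'Other'."""
    return [OPCODE_CATEGORIES.get(op.lower(), "Other") for op in opcodes]

print(categorize(["PUSH", "MSTORE", "CALLER", "JUMPI"]))
# ['Stack', 'Memory', 'Transaction', 'Jump&Stop']
```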
Figure 7. Illustration of splitting the source dataset into folds for pretraining the NN models.
List of parameter values used in the NN pretraining and SCVDIE-Ensemble.
| Parameter | Pretraining | Fine Tune |
|---|---|---|
| Initial learning rate | 0.0001 | 0.0003 |
| Mini-batch size | 128 | 64 |
| Momentum | 0.9 | 0.9 |
| Epochs | 20 | 60 |
| Embedding dimension | 180 | 180 |
| Number of convolution kernels | 128 | 128 |
| Dropout rate | 0.3 | 0.3 |
| Hidden dimension | 2048 | 2048 |
Figure 8. The fivefold CV process. After the test set is held out in each CV experiment, 80% of the remaining data forms the training set, while the other 20% is used to construct the validation set.
Error assessment (%).
| Model | CV 1 | CV 2 | CV 3 | CV 4 | CV 5 | Overall |
|---|---|---|---|---|---|---|
| SCVDIE-Ensemble | 1.332 | 1.160 | 0.948 | 1.137 | 2.205 | 1.419 |
| SCVDIE-CNN | 3.132 | 2.083 | 8.228 | 2.045 | 3.560 | 5.039 |
| SCVDIE-RNN | 9.501 | 5.495 | 1.399 | 6.854 | 2.086 | 4.768 |
| SCVDIE-RCNN | 2.265 | 5.170 | 1.247 | 5.815 | 1.104 | 3.505 |
| SCVDIE-DNN | 3.977 | 1.693 | 4.668 | 4.338 | 6.793 | 3.860 |
| SCVDIE-GRU | 4.703 | 1.389 | 12.271 | 1.105 | 1.445 | 4.117 |
| SCVDIE-Bi-GRU | 2.117 | 2.523 | 8.187 | 3.355 | 6.500 | 3.450 |
| SCVDIE-Transformer | 1.043 | 1.191 | 3.911 | 2.764 | 4.787 | 3.395 |
Performance evaluation.
| Approach | Accuracy (avg.) | Precision (avg.) | Recall (avg.) | F1 (avg.) | Prediction Accuracy (avg.) |
|---|---|---|---|---|---|
| SCVDIE-Ensemble | 95.46% | 96.81% | 97.26% | 97.57% | 97.42% |
| SCVDIE-CNN | 92.00% | 91.57% | 91.50% | 88.18% | 90.75% |
| SCVDIE-RNN | 92.29% | 85.44% | 88.68% | 91.26% | 88.87% |
| SCVDIE-RCNN | 89.34% | 90.66% | 90.56% | 90.54% | 90.87% |
| SCVDIE-DNN | 91.46% | 90.53% | 90.19% | 88.70% | 87.12% |
| SCVDIE-GRU | 89.11% | 88.81% | 90.52% | 90.78% | 90.05% |
| SCVDIE-Bi-GRU | 91.19% | 91.88% | 91.87% | 90.89% | 91.19% |
| SCVDIE-Transformer | 96.20% | 90.89% | 90.88% | 89.88% | 91.00% |
| Oyente | 72% | 38.5% | 57.6% | 46.1% | N/A |
| Securify | 57.9% | 39.6% | 48.0% | 45.6% | N/A |
| Mythril | 56.8% | 36.5% | 49.4% | 43.9% | N/A |
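The indicators in the table above all reduce to counts from a binary confusion matrix, treating "vulnerable" as the positive class. A self-contained sketch with toy labels (the example data is illustrative, not from the paper's experiments):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = vulnerable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # a classifier's predictions
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(acc, prec, rec, f1)  # 0.75 0.75 0.75 0.75
```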
Figure 9. Model performance comparison.
Comparison with existing methods.
| Model | F1 Score (SCVDIE) | F1 Score (Common Method) |
|---|---|---|
| SCVDIE-Ensemble | 97.42% | 90.51% |
| SCVDIE-CNN | 88.18% | 89.57% |
| SCVDIE-RNN | 91.26% | 81.54% |
| SCVDIE-RCNN | 90.54% | 81.06% |
| SCVDIE-DNN | 88.70% | 91.53% |
| SCVDIE-GRU | 90.78% | 89.81% |
| SCVDIE-Bi-GRU | 90.89% | 82.70% |
| SCVDIE-Transformer | 89.88% | 91.90% |
Figure 10. Comparison of model errors under training datasets of different sizes.