Ilias Kalouptsoglou, Miltiadis Siavvas, Dionysios Kehagias, Alexandros Chatzigeorgiou, Apostolos Ampatzoglou.
Abstract
Software security is a very important aspect for software development organizations that wish to provide high-quality and dependable software to their customers. A crucial part of software security is the early detection of software vulnerabilities. Vulnerability prediction is a mechanism that facilitates the identification (and, in turn, the mitigation) of vulnerabilities early in the software development cycle. The scientific community has recently devoted considerable attention to developing deep learning models that use text mining techniques to predict the existence of vulnerabilities in software components. However, other studies examine whether statically extracted software metrics can lead to adequate vulnerability prediction models. In this paper, both software metrics-based and text mining-based vulnerability prediction models are constructed and compared. A combination of software metrics and text tokens using deep learning models is examined as well, in order to investigate whether a combined model can lead to more accurate vulnerability prediction. For the purposes of the present study, a vulnerability dataset containing vulnerabilities from real-world software products is utilized and extended. The results of our analysis indicate that text mining-based models outperform software metrics-based models with respect to their F2-score, whereas enriching the text mining-based models with software metrics was not found to add any value to their predictive performance.
Keywords: dataset extension; deep learning; ensemble learning; machine learning; software metrics; text mining; vulnerability prediction
Year: 2022 PMID: 35626536 PMCID: PMC9140602 DOI: 10.3390/e24050651
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.738
Figure 1. The basic concept of vulnerability prediction.
Figure 2. The architecture of the Stacking classifier.
The statically extracted software metrics.
| Metric | Description |
|---|---|
| CC | Clone Coverage |
| CCL | Clone Classes |
| CCO | Clone Complexity |
| CI | Clone Instances |
| CLC | Clone Line Coverage |
| LDC | Lines of Duplicated Code |
| McCC, CYCL | Cyclomatic Complexity |
| NL | Nesting Level |
| NLE | Nesting Level without else-if |
| CD, TCD | (Total) Comment Density |
| CLOC, TCLOC | (Total) Comments Lines of Code |
| DLOC | Documentation Lines of Code |
| LLOC, TLLOC | (Total) Logical Lines of Code |
| LOC, TLOC | (Total) Lines of Code |
| NOS, TNOS | (Total) Number of Statements |
| NUMPAR, PARAMS | Number of Parameters |
| HOR_D | Number of Distinct Halstead Operators |
| HOR_T | Total Number of Halstead Operators |
| HON_D | Number of Distinct Halstead Operands |
| HON_T | Total Number of Halstead Operands |
| HLEN | Halstead Length |
| HVOC | Halstead Vocabulary Size |
| HDIFF | Halstead Difficulty |
| HVOL | Halstead Volume |
| HEFF | Halstead Effort |
| BUGS | Halstead Bugs |
| HTIME | Halstead Time |
| CYCL_DENS | Cyclomatic Density |
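The Halstead metrics in the table above (HVOC, HLEN, HVOL, HDIFF, HEFF, HTIME, BUGS) are all derived from the four base counts of operators and operands. A minimal sketch of the classic Halstead formulas, with illustrative input counts chosen here purely as an example:

```python
from math import log2

def halstead_metrics(n1, n2, N1, N2):
    """Derived Halstead metrics from the four base counts:
    n1/n2 = distinct operators/operands (HOR_D, HON_D),
    N1/N2 = total operators/operands (HOR_T, HON_T)."""
    vocab = n1 + n2                      # HVOC: vocabulary size
    length = N1 + N2                     # HLEN: program length
    volume = length * log2(vocab)        # HVOL: volume in bits
    difficulty = (n1 / 2) * (N2 / n2)    # HDIFF: error proneness
    effort = difficulty * volume         # HEFF: mental effort
    time = effort / 18                   # HTIME: seconds (Stroud number 18)
    bugs = volume / 3000                 # BUGS: delivered-bugs estimate
    return {"HVOC": vocab, "HLEN": length, "HVOL": volume,
            "HDIFF": difficulty, "HEFF": effort,
            "HTIME": time, "BUGS": bugs}

# Example: 5 distinct operators, 4 distinct operands,
# 12 operator occurrences, 9 operand occurrences.
m = halstead_metrics(5, 4, 12, 9)
```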
Figure 3. The diff file of the Cross-Site Scripting fixing commit of actionhero’s config/errors.js and servers/web.js.
Figure 4. The process of constructing the overall dataset of the proposed approaches.
A Bag of Words (BoW) subset of the dataset.
| Function Name | Null | This | Function | Push |
|---|---|---|---|---|
| initFileServer | 17 | 0 | 9 | 4 |
| api.sendFile | 0 | 0 | 1 | 0 |
| <anonymous>.followFileToServe | 2 | 0 | 3 | 0 |
| <anonymous>.sendFile | 6 | 0 | 3 | 3 |
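Each row of the BoW table above is a count vector: how many times each vocabulary token appears in a function's source code. A minimal sketch of this encoding, using hypothetical token lists (the actual tokens come from tokenized JavaScript function bodies):

```python
from collections import Counter

# Hypothetical tokenized function bodies for illustration only.
functions = {
    "initFileServer": ["null", "this", "function", "push", "null"],
    "api.sendFile": ["function", "return"],
}

# Fixed vocabulary: each column of the BoW table is one token.
vocab = ["null", "this", "function", "push"]

def bow_vector(tokens, vocab):
    """Map a token list to a count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

features = {name: bow_vector(toks, vocab) for name, toks in functions.items()}
# e.g. features["initFileServer"] counts two "null" tokens, one of each other
```

These count vectors then serve as the numerical input to a conventional classifier such as Random Forest.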
Figure 5. The overview of the sequences-of-tokens approach.
The chosen hyper-parameters of the Convolutional Neural Network (CNN) model.
| Hyper-Parameter Name | Hyper-Parameter Value |
|---|---|
| Number of Layers | 3 (Embedding-Convolutional-Dense) |
| Number of Convolutional Layers | 1 (1D CNN) |
| Embedding Size | 300 |
| Number of Filters | 128 |
| Kernel Size | 5 |
| Pooling | Global Max Pooling |
| Weight Initialization Technique | Glorot Uniform (Xavier) |
| Learning Rate | 0.01 |
| Gradient Descent Optimizer | Adam |
| Batch Size | 64 |
| Activation Function | ReLU |
| Output Activation Function | Sigmoid |
| Loss Function | Binary cross-entropy |
| Maximum Epochs | 100 |
| Early Stopping Patience | 10 |
| Monitoring Metric | Recall |
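The core of the CNN in the table is a 1D convolution over embedded token sequences followed by ReLU and global max pooling. A pure-NumPy sketch of that forward pass (random weights and inputs stand in for trained values; shapes follow the table's hyper-parameters):

```python
import numpy as np

def conv1d_global_max(seq_emb, filters, bias):
    """Minimal 1D convolution + ReLU + global max pooling.
    seq_emb: (seq_len, emb_dim) embedded token sequence;
    filters: (n_filters, kernel_size, emb_dim);
    returns one pooled activation per filter."""
    n_filters, k, _ = filters.shape
    seq_len = seq_emb.shape[0]
    out = np.empty((seq_len - k + 1, n_filters))
    for i in range(seq_len - k + 1):
        window = seq_emb[i:i + k]  # (k, emb_dim) slice of the sequence
        out[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    out = np.maximum(out, 0.0)     # ReLU activation
    return out.max(axis=0)         # global max pooling over positions

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 300))    # 20 tokens, embedding size 300
w = rng.normal(size=(128, 5, 300))  # 128 filters, kernel size 5
b = np.zeros(128)
pooled = conv1d_global_max(emb, w, b)  # shape (128,), fed to the dense layer
```

In the actual model this would be an Embedding-Conv1D-Dense stack trained with Adam and binary cross-entropy, as listed above.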
Evaluation results of software metrics-based models.
| Evaluation Metric | KNN | RF | Decision Trees | SVM | Naive Bayes | ANN |
|---|---|---|---|---|---|---|
| Accuracy (%) | 93.10 | 95.16 | 93.19 | 94.34 | 84.40 | 91.72 |
| Precision (%) | 72.85 | 90.42 | 73.10 | 94.62 | 23.75 | 73.65 |
| Recall (%) | 70.60 | 68.05 | 71.07 | 57.40 | 12.06 | 54.06 |
| F1-score (%) | 71.66 | 77.62 | 72.04 | 71.43 | 15.92 | 61.24 |
| F2-score (%) | 71.01 | 71.58 | 71.45 | 62.29 | 13.35 | 56.62 |
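The F1- and F2-scores in these tables are instances of the F-beta measure; beta = 2 weights recall more heavily, which suits vulnerability prediction, where a missed vulnerability is costlier than a false alarm. A minimal sketch, plugging in the Random Forest column above (small discrepancies against the table come from rounding the reported precision and recall):

```python
def f_beta(precision, recall, beta):
    """F-beta score: harmonic-style mean of precision and recall,
    with recall weighted beta times as much as precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Random Forest column: precision 90.42%, recall 68.05%.
p, r = 0.9042, 0.6805
f1 = f_beta(p, r, 1)  # close to the reported 77.62%
f2 = f_beta(p, r, 2)  # close to the reported 71.58%
```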
Evaluation results of software metrics-based models according to Ferenc et al.
| Evaluation Metric | KNN | RF | Decision Trees | SVM | Naive Bayes | ANN |
|---|---|---|---|---|---|---|
| F1-score (%) | 76 | 71 | 72 | 67 | 15 | 71 |
Evaluation results of BoW models.
| Evaluation Metric | RF | MLP |
|---|---|---|
| Accuracy (%) | 96.64 | 94.13 |
| Precision (%) | 93.16 | 77.82 |
| Recall (%) | 78.57 | 82.65 |
| F1-score (%) | 85.20 | 79.03 |
| F2-score (%) | 81.08 | 80.76 |
Evaluation results of models based on sequences of tokens.
| Evaluation Metric | CNN with Word2Vec Embeddings | CNN with FastText Embeddings |
|---|---|---|
| Accuracy (%) | 96.48 | 92.94 |
| Precision (%) | 86.12 | 66.64 |
| Recall (%) | 85.60 | 88.08 |
| F1-score (%) | 85.73 | 75.66 |
| F2-score (%) | 85.62 | 82.58 |
Characteristics of both the text mining-based and the software metrics-based models.
| | Sequences of Tokens | Bag of Words | Software Metrics |
|---|---|---|---|
| Machine/Deep Learning (ML/DL) | DL | ML | ML |
| Type of model | Neural network | Random Forest | Random Forest |
| Type of input | Embedded sequences | Token occurrences | Numerical values |
| Particular features | Convolutional | 100 trees | 100 trees |
Evaluation scores of both the text mining-based and the software metrics-based models.
| Evaluation Metric | Sequences of Tokens | Bag of Words | Software Metrics |
|---|---|---|---|
| Accuracy (%) | 96.48 | 96.64 | 95.16 |
| Precision (%) | 86.12 | 93.16 | 90.42 |
| Recall (%) | 85.60 | 78.57 | 68.05 |
| F1-score (%) | 85.73 | 85.20 | 77.62 |
| F2-score (%) | 85.62 | 81.08 | 71.58 |
Figure 6. Bar chart with the evaluation metrics of the sequences-of-tokens, BoW, and software metrics approaches.
Figure 7. The overview of the approach combining BoW and software metrics.
Figure 8. The overview of the approach combining sequences of tokens and software metrics.
Combination of text mining and software metrics.
| Evaluation Metric | Software Metrics and BoW | Software Metrics and Token Sequences |
|---|---|---|
| Accuracy (%) | 96.32 | 72.88 |
| Precision (%) | 93.55 | 30.57 |
| Recall (%) | 75.35 | 68.68 |
| F1-score (%) | 83.43 | 40.84 |
| F2-score (%) | 78.38 | 52.85 |
Figure 9. The overview of the voting approach between text mining and software metrics.
Voting classification between text mining-based and software metrics-based models.
| Evaluation Metric | Voting: Software Metrics and BoW | Voting: Software Metrics and Token Sequences |
|---|---|---|
| Accuracy (%) | 96.23 | 95.93 |
| Precision (%) | 94.54 | 88.42 |
| Recall (%) | 73.75 | 77.09 |
| F1-score (%) | 82.81 | 82.32 |
| F2-score (%) | 77.11 | 79.09 |
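Soft voting combines the two base models by averaging their predicted vulnerability probabilities and thresholding the result. A minimal sketch with hypothetical per-function probabilities (the weights, threshold, and values are illustrative, not the paper's):

```python
import numpy as np

def soft_vote(proba_metrics, proba_text, threshold=0.5, w=(0.5, 0.5)):
    """Weighted average of the vulnerability probabilities from the
    software-metrics model and the text-mining model, then threshold."""
    avg = w[0] * np.asarray(proba_metrics) + w[1] * np.asarray(proba_text)
    return (avg >= threshold).astype(int)

# Hypothetical probabilities for three functions from the two base models.
p_metrics = [0.9, 0.2, 0.6]
p_text = [0.4, 0.1, 0.7]
preds = soft_vote(p_metrics, p_text)  # averages 0.65, 0.15, 0.65
```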
Figure 10. The overview of the stacking approach between text mining and software metrics.
Stacking classifier evaluation.
| Evaluation Metric | Stacking-Software Metrics and Text Mining |
|---|---|
| Accuracy (%) | 96.78 |
| Precision (%) | 90.75 |
| Recall (%) | 82.31 |
| F1-score (%) | 86.29 |
| F2-score (%) | 83.86 |
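In stacking, a meta-learner is trained on the out-of-fold predictions of the base models, as sketched in Figure 10. A minimal scikit-learn sketch of the mechanism only: a single synthetic feature matrix stands in for the paper's separate software-metrics and text-mining inputs, and the base estimators and meta-learner are illustrative choices, not necessarily the paper's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the paper each base model would consume
# its own feature set (software metrics vs. text features).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("metrics_model", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("text_model", RandomForestClassifier(n_estimators=50, random_state=1)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base predictions
    cv=5,  # out-of-fold predictions avoid leaking training labels
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```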