Raghavendra Rao Althar, Debabrata Samanta, Manjit Kaur, Abeer Ali Alnuaim, Nouf Aljaffan, Mohammad Aman Ullah.
Abstract
Security of the software system is a prime focus area for software development teams. This paper explores data science methods for building a knowledge management system that can help software development teams ensure a secure software system is developed. Several approaches are explored using data from insurance-domain software development; these approaches illustrate the practical challenges of real-world implementation. The paper also discusses the capabilities of language modeling and its role in the knowledge system. Source code is modeled to build a deep software security analysis model, which can help software engineers build secure software by assessing software security during development. Extensive experiments show that the proposed models can efficiently exploit software language modeling to classify security vulnerabilities in software systems.
Year: 2021 PMID: 34987569 PMCID: PMC8723857 DOI: 10.1155/2021/8522839
Source DB: PubMed Journal: Comput Intell Neurosci
Algorithm 1: Text classification using CNN.
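A minimal sketch of Algorithm 1, assuming a Keras implementation; the vocabulary size, sequence handling, and kernel size are placeholders, while the embedding dimension, filter count, feedforward units, and dropout rate follow the DCNN specification tables below.

```python
import tensorflow as tf

# Hedged sketch of Algorithm 1: a 1-D CNN text classifier.
VOCAB_SIZE = 20_000       # placeholder; the paper derives it from the tokenizer
EMBED_DIM = 200           # embedding dimension from the specification table

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Conv1D(100, 3, activation="relu"),  # 100 filters; kernel size assumed
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # binary: vulnerable vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```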
Algorithm 2: Classification using bidirectional LSTM and an attention layer.
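A sketch of Algorithm 2 using a simple additive attention pooling over the BiLSTM states, which is one common reading of "attention layer"; layer sizes are placeholders, not the paper's values.

```python
import tensorflow as tf

# Hedged sketch of Algorithm 2: BiLSTM with attention pooling.
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(20_000, 200)(inputs)                 # assumed sizes
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(x)
# Attention: score each timestep, softmax-normalize, take the weighted sum.
scores = tf.keras.layers.Dense(1)(h)                               # (batch, time, 1)
weights = tf.keras.layers.Softmax(axis=1)(scores)
context = tf.keras.layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])    # (batch, 256)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(context)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```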
Figure 1: Working of sentence encoding.
Algorithm 3: Classification using USE with TensorFlow 1.0.
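A sketch of Algorithm 3 in TF 1.x session style, as the caption indicates. The TF Hub URL is the public Universal Sentence Encoder v2 module; the downstream classifier on top of the 512-dimensional sentence vectors is omitted.

```python
import tensorflow as tf        # TensorFlow 1.x, per the caption
import tensorflow_hub as hub

# Hedged sketch of Algorithm 3: USE sentence embeddings via TF Hub.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
sentences = tf.placeholder(tf.string, shape=[None])
embeddings = embed(sentences)  # (batch, 512) sentence vectors

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vecs = sess.run(embeddings,
                    feed_dict={sentences: ["buffer overflow in parser"]})
```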
Algorithm 4: Neural network language model.
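A sketch of Algorithm 4, assuming the public TF Hub NNLM sentence encoder used as a frozen layer in front of a small classifier; the classifier's layer sizes are placeholders.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Hedged sketch of Algorithm 4: pretrained NNLM encoder, frozen, plus a classifier.
nnlm = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2",
                      input_shape=[], dtype=tf.string, trainable=False)
model = tf.keras.Sequential([
    nnlm,                                            # (batch, 128) sentence vectors
    tf.keras.layers.Dense(64, activation="relu"),    # assumed size
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```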
Figure 2: BERT architecture.
Algorithm 5: BERT for tokenization and feature creation.
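A sketch of Algorithm 5, assuming the FullTokenizer from the TensorFlow Model Garden (listed in the library table below) and the public uncased BERT-base TF Hub model; the padding length of 128 is a placeholder.

```python
import tensorflow_hub as hub
from official.nlp.bert.tokenization import FullTokenizer

# Hedged sketch of Algorithm 5: WordPiece tokenization plus the three
# input features BERT expects (ids, mask, segment ids).
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)

def encode(sentence, max_len=128):
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence)[: max_len - 2] + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    pad = max_len - len(ids)
    # input ids, attention mask, segment ids (single-sentence input)
    return ids + [0] * pad, [1] * len(ids) + [0] * pad, [0] * max_len
```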
Algorithm 6: DistilBERT for tokenization.
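A sketch of Algorithm 6, assuming the Hugging Face transformers implementation of DistilBERT tokenization (the record does not name the library); the model name and sequence length are placeholders.

```python
from transformers import DistilBertTokenizer

# Hedged sketch of Algorithm 6: DistilBERT tokenization of one requirement text.
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
enc = tokenizer("heap corruption in the deserializer",
                padding="max_length", truncation=True, max_length=128)
print(enc["input_ids"][:10], enc["attention_mask"][:10])
```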
Figure 3: Software security model architecture.
DCNN architecture training specification.
| Parameter | Value |
|---|---|
| Vocabulary size | Based on the vocabulary size of the tokenizer |
| Embedding dimension | 200 |
| Number of filters | 100 |
| Feedforward network units | 256 |
| Number of classes | 2 |
| Dropout rate | 0.2 |
| Epochs | 1 |
| Trainable | False |
DCNN model construction (a Keras sketch follows the table).
| Layer | Configuration |
|---|---|
| Embedding | Vocabulary size, embedding dimension |
| Convolutional 1D bigram | Kernel size = 2; padding = valid; activation = ReLU |
| Convolutional 1D trigram | Kernel size = 3; padding = valid; activation = ReLU |
| Convolutional 1D four-gram | Kernel size = 4; padding = valid; activation = ReLU |
| Pooling | GlobalMaxPool1D |
| Dense | Activation = ReLU |
| Dropout | 0.2 |
| Last dense (for 2 classes) | Units = 1; activation = sigmoid |
| Last dense (for more than 2 classes) | Units = number of classes; activation = softmax |
| Loss (for 2 classes) | Binary cross-entropy |
| Loss (for multiclass) | Sparse categorical cross-entropy |
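Taken together, the two tables specify the model closely enough to sketch in Keras. The parallel-branch wiring (concatenating the bigram, trigram, and four-gram branches before the dense layer) is an assumption, since the tables list the layers without their connections.

```python
import tensorflow as tf

# Hedged sketch of the DCNN construct from the tables above.
VOCAB_SIZE = 20_000   # "based on the vocabulary size of the tokenizer"
NUM_CLASSES = 2

inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, 200)(inputs)
branches = []
for k in (2, 3, 4):   # bigram, trigram, four-gram branches
    b = tf.keras.layers.Conv1D(100, k, padding="valid", activation="relu")(x)
    branches.append(tf.keras.layers.GlobalMaxPooling1D()(b))
merged = tf.keras.layers.concatenate(branches)
dense = tf.keras.layers.Dense(256, activation="relu")(merged)
dropped = tf.keras.layers.Dropout(0.2)(dense)
if NUM_CLASSES == 2:
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(dropped)
    loss = "binary_crossentropy"
else:
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(dropped)
    loss = "sparse_categorical_crossentropy"

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
```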
Algorithm 7: BERT with DCNN model.
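A hedged sketch of Algorithm 7: BERT's per-token sequence output replaces the trainable embedding layer as input to the DCNN branches. The TF Hub preprocessing and encoder models are the public BERT-base ones; the exact wiring is an assumption.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Hedged sketch of Algorithm 7: frozen BERT encoder feeding DCNN branches.
preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=False)

text = tf.keras.Input(shape=(), dtype=tf.string)
seq = encoder(preprocess(text))["sequence_output"]   # (batch, 128, 768)
branches = [tf.keras.layers.GlobalMaxPooling1D()(
                tf.keras.layers.Conv1D(100, k, activation="relu")(seq))
            for k in (2, 3, 4)]
x = tf.keras.layers.Dropout(0.2)(
        tf.keras.layers.Dense(256, activation="relu")(
            tf.keras.layers.concatenate(branches)))
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(text, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```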
BERT question answering system library specifications (see the import sketch after the table).
| Module | Imported component |
|---|---|
| TensorFlow | TensorFlow Hub |
| official.nlp.bert.tokenization | FullTokenizer |
| official.nlp.bert.input_pipeline | create_squad_dataset |
| official.nlp.data.squad_lib | generate_tf_record_from_json_file |
| official.nlp | optimization |
| official.nlp.data.squad_lib | read_squad_examples |
| official.nlp.data.squad_lib | FeatureWriter |
| official.nlp.data.squad_lib | convert_examples_to_features |
| official.nlp.data.squad_lib | write_predictions |
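Read as import statements, the table corresponds to the following, assuming the TensorFlow Model Garden's tf-models-official package (the grouping into one block is for readability):

```python
# Imports implied by the library specification table above.
import tensorflow as tf
import tensorflow_hub as hub
from official.nlp import optimization
from official.nlp.bert.tokenization import FullTokenizer
from official.nlp.bert.input_pipeline import create_squad_dataset
from official.nlp.data.squad_lib import (
    generate_tf_record_from_json_file,
    read_squad_examples,
    FeatureWriter,
    convert_examples_to_features,
    write_predictions,
)
```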
BERT question answering system training phase specifications (see the optimizer sketch after the table).
| Parameter | Value |
|---|---|
| Training data size | 88 641 |
| Number of training batches | 500 |
| Batch size | 1 |
| Number of epochs | 1 |
| Initial learning rate | 5e-5 |
| Warmup steps | 10% of the number of training batches |
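A sketch of the corresponding optimizer setup, assuming the Model Garden's create_optimizer utility (consistent with the official.nlp optimization import above). The initial learning rate is read as 5e-5, the standard BERT fine-tuning value, and warmup is 10% of the 500 training batches, as specified.

```python
from official.nlp import optimization

# Hedged sketch of the training-phase setup from the table above.
NUM_BATCHES = 500
optimizer = optimization.create_optimizer(
    init_lr=5e-5,                              # read as 5e-5 (assumption)
    num_train_steps=NUM_BATCHES,
    num_warmup_steps=int(0.1 * NUM_BATCHES),   # 10% warmup per the table
)
```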
Algorithm 8: SQuAD dataset processing.
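A sketch of what Algorithm 8 plausibly does with the squad_lib utilities listed earlier: convert the raw SQuAD JSON into TFRecords, then build a tf.data pipeline with the table's batch size of 1. File paths are placeholders.

```python
from official.nlp.data.squad_lib import generate_tf_record_from_json_file
from official.nlp.bert.input_pipeline import create_squad_dataset

# Hedged sketch of Algorithm 8: SQuAD JSON -> TFRecord -> tf.data pipeline.
input_meta = generate_tf_record_from_json_file(
    "train-v1.1.json",       # SQuAD training file (placeholder path)
    "vocab.txt",             # BERT vocabulary file (placeholder path)
    "train.tf_record",       # output TFRecord path
    max_seq_length=384,
)
dataset = create_squad_dataset(
    "train.tf_record", input_meta["max_seq_length"],
    batch_size=1,            # batch size per the training table
    is_training=True)
```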
Comparison of all the experiments.
| Experiment | Process | Output |
|---|---|---|
| CNN and FastText embedding | CNN-based processing | Accuracy: 71.89%; precision: 0.88; recall: 0.72; F1-score: 0.77 |
| Bidirectional LSTM with FastText embedding | Bidirectional GRU or LSTM with global attention | Accuracy: 84.33%; precision: 0.91; recall: 0.84; F1-score: 0.87 |
| USE model | USE pretrained model with TF 1.0 | Accuracy: 92.61%; precision: 0.95; recall: 0.93; F1-score: 0.93 |
| NNLM | NNLM-based sentence encoder, with pretrained model | Accuracy: 90.16%; precision: 0.81; recall: 0.90; F1-score: 0.86 |
| BERT | BERT tokenization and TF Keras modeling | Accuracy: 91.39%; precision: 0.92; recall: 0.91; F1-score: 0.88 |
| DistilBERT | DistilBERT-based preprocessing of data | Accuracy: 94.77%; precision: 0.95; recall: 0.95; F1-score: 0.94 |
| BERT | Data preprocessing and tokenization with BERT | Accuracy: 97.44% |
Figure 4: Input sequence illustration in BERT.
Figure 5: Overview of the code2vec model.
Model configuration for BatchProgramClassifier (see the sketch after the table).
| Parameter | Value |
|---|---|
| Hidden dimension | 100 |
| Encoding dimension | 128 |
| Labels | 104 |
| Epochs | 15 |
| Batch size | 64 |
| Maximum tokens | Based on word2vec embeddings |
| Embedding dimension | Based on word2vec embeddings |
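A sketch of how the configuration above maps onto the ASTNN reference implementation's BatchProgramClassifier. This is an assumption: the record does not name the code base, though the 104 labels match ASTNN's program-classification benchmark. The word2vec path is a placeholder.

```python
import torch
from gensim.models.word2vec import Word2Vec
from model import BatchProgramClassifier  # model.py from the ASTNN repository (assumed)

# Placeholder path: word2vec embeddings pretrained on source-code tokens.
w2v = Word2Vec.load("node_w2v_128").wv
embeddings = w2v.vectors                 # numpy array, (max_tokens, embedding_dim)

model = BatchProgramClassifier(
    embedding_dim=embeddings.shape[1],   # "based on word2vec embeddings"
    hidden_dim=100,
    vocab_size=embeddings.shape[0] + 1,  # max tokens, from word2vec
    encode_dim=128,
    label_size=104,
    batch_size=64,
    use_gpu=torch.cuda.is_available(),
    pretrained_weight=embeddings,
)
# Training loop (15 epochs per the table) omitted.
```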