Jin Wang1,2, Yangning Tang1, Shiming He1,2, Changqing Zhao1, Pradip Kumar Sharma3, Osama Alfarraj4, Amr Tolba4,5.
Abstract
Log anomaly detection is an efficient method to manage modern large-scale Internet of Things (IoT) systems. A growing number of works apply natural language processing (NLP) methods, in particular word2vec, to log feature extraction. Word2vec can extract the relevance between words and vectorize the words. However, the computing cost of training word2vec is high. Anomalies in logs depend not only on an individual log message but also on the log message sequence. Therefore, the word vectors from word2vec cannot be used directly; they need to be transformed into log event vectors and then into log sequence vectors. To reduce computational cost and avoid multiple transformations, in this paper we propose an offline feature extraction model, named LogEvent2vec, which takes the log event as the input of word2vec to extract the relevance between log events and vectorize log events directly. LogEvent2vec can work with any coordinate transformation method and anomaly detection model. After obtaining the log event vectors, we transform them into log sequence vectors by bary or tf-idf, and three kinds of supervised models (Random Forests, Naive Bayes, and Neural Networks) are trained to detect anomalies. We have conducted extensive experiments on a real public log dataset from BlueGene/L (BGL). The experimental results demonstrate that LogEvent2vec can significantly reduce computational time, by a factor of 30, and improve accuracy compared with word2vec. LogEvent2vec with bary and Random Forest achieves the best F1-score, and LogEvent2vec with tf-idf and Naive Bayes needs the least computational time.
Keywords: IoT; device management; log anomaly detection; log event; log template; word2vec
Year: 2020 PMID: 32357404 PMCID: PMC7249657 DOI: 10.3390/s20092451
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Feature Extraction based on NLP.
| Method | Word | Log Event |
|---|---|---|
| Bag-of-words | Forming the log event vector by the occurrence number of words | Forming the log sequence vector by the occurrence number of the log event |
| Idf/Tf-idf | Forming the log event vector by the term frequency and weights of words | Forming the log sequence vector by the term frequency and weights of the log event |
| Word2vec | Forming the word vector by Word2vec | – |
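
The "Log Event" column of the table above can be illustrated with a small sketch. It treats each log sequence as a list of log event identifiers and builds a bag-of-events vector and a tf-idf vector for it. The function names, the event identifiers `E1` to `E3`, and the particular tf-idf variant (relative term frequency times `log(N / df)`) are assumptions chosen for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def bag_of_events(sequence, vocabulary):
    """Count how often each log event occurs in one log sequence."""
    counts = Counter(sequence)
    return [counts[event] for event in vocabulary]


def tf_idf_vectors(sequences, vocabulary):
    """Weight event frequencies by inverse document frequency.

    Assumes every sequence is non-empty; events that occur in no
    sequence get an idf of 0.
    """
    n = len(sequences)
    idf = {}
    for event in vocabulary:
        df = sum(1 for seq in sequences if event in seq)
        idf[event] = math.log(n / df) if df else 0.0
    vectors = []
    for seq in sequences:
        counts = Counter(seq)
        vectors.append([counts[e] / len(seq) * idf[e] for e in vocabulary])
    return vectors


# Hypothetical toy data: two log sequences over three log events.
vocabulary = ["E1", "E2", "E3"]
sequences = [["E1", "E2", "E1"], ["E2", "E3", "E2"]]

counts_first = bag_of_events(sequences[0], vocabulary)
tfidf = tf_idf_vectors(sequences, vocabulary)
```

Here `counts_first` is `[2, 1, 0]`, and the tf-idf weight of `E2` is zero in both sequences because it occurs in every sequence.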
Figure 1. The framework of log anomaly detection.
List of notations.
| Notation | Definition |
|---|---|
| | The log data |
| | The number of lines in log data |
| | The set of log events |
| | The number of log events |
| | The set of log sequences |
| | The window size which decides the length of a log sequence |
| | The vector space |
| | The |
| | The mapping function of log parsing |
| | The log event of log message |
| | The |
| | The vector of log event |
| | The prediction of log sequence |
| | The label of log sequence |
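
As a rough illustration of how the window size decides the length of a log sequence, the sketch below cuts a parsed stream of log events into consecutive windows. Two choices are assumptions made only for this example: the windows are fixed and non-overlapping, and a sequence is labeled anomalous when any message in it is anomalous. The function name and the toy data are hypothetical.

```python
def split_into_sequences(events, labels, window_size):
    """Cut a stream of log events into consecutive windows.

    events: log event identifier for each log message, in log order.
    labels: 0/1 anomaly label for each log message.
    Returns the log sequences and one 0/1 label per sequence.
    """
    sequences, sequence_labels = [], []
    for start in range(0, len(events), window_size):
        window = events[start:start + window_size]
        window_labels = labels[start:start + window_size]
        sequences.append(window)
        # Assumption: one anomalous message marks the whole sequence.
        sequence_labels.append(int(any(window_labels)))
    return sequences, sequence_labels


# Hypothetical parsed log: five messages, the third one anomalous.
events = ["E1", "E2", "E3", "E1", "E2"]
labels = [0, 0, 1, 0, 0]
sequences, sequence_labels = split_into_sequences(events, labels, window_size=2)
```

With a window size of 2, the five messages give three sequences, the last one shorter than the window.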
Figure 2. Overview of Log Event to vector based log anomaly detection.
Raw log and log event.
| Item | Content |
|---|---|
| Time stamp | 1117848119 |
| Data | |
| Node | |
| Time | 2005-06-03-18.21.59.871925 |
| Node repeat | |
| Message type | |
| Component | |
| Level | |
| Content | |
| Log event | |
Figure 3. Log event as input of the word2vec model. The target is one log event, and the remaining log events in the log sequence are taken as input.
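
The input and target arrangement described in the caption can be sketched as the construction of training pairs: each log event in a sequence takes a turn as the target, and all other events of that sequence form the input. This is only a sketch of the pairing step under that reading of the caption; the function name and event identifiers are hypothetical, and the embedding training itself is not shown.

```python
def context_target_pairs(sequence):
    """Build (input events, target event) pairs from one log sequence."""
    pairs = []
    for i, target in enumerate(sequence):
        # All other log events of the sequence serve as the input.
        context = sequence[:i] + sequence[i + 1:]
        pairs.append((context, target))
    return pairs


# Hypothetical log sequence of three log events.
pairs = context_target_pairs(["E1", "E2", "E3"])
```

With a library such as gensim, a comparable setup would presumably pass each log sequence as one "sentence" of event identifiers to a CBOW-style `Word2Vec` model (`sg=0`) whose window covers the whole sequence.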
Summary of BGL dataset.
| System | Time Span | Data Size | # Log Messages | # Anomalies |
|---|---|---|---|---|
| BGL | 7 months | 708 MB | 4,747,963 | 348,460 |
Components of the comparison schemes.
| Steps | Models |
|---|---|
| Word2vec input unit | Word/Log event |
| Coordinate transformation | Bary/Tf-idf |
| Anomaly detection model | Random Forests/Naive Bayes/Neural Networks |
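
The coordinate transformation step turns the log event vectors of one sequence into a single log sequence vector. A minimal sketch of both options follows, assuming that bary means the barycenter (plain average) of the event vectors and that the tf-idf variant sums the event vectors weighted by relative term frequency times `log(N / df)`. The exact weighting in the paper may differ, and all names and toy vectors are hypothetical.

```python
import math
from collections import Counter


def bary(sequence, event_vectors):
    """Average the vectors of all log events in a non-empty sequence."""
    dim = len(next(iter(event_vectors.values())))
    total = [0.0] * dim
    for event in sequence:
        for k, value in enumerate(event_vectors[event]):
            total[k] += value
    return [value / len(sequence) for value in total]


def tfidf_transform(sequence, all_sequences, event_vectors):
    """Sum the event vectors of a sequence, weighted by tf-idf.

    Assumes `sequence` is one of `all_sequences`, so every event
    has a document frequency of at least 1.
    """
    dim = len(next(iter(event_vectors.values())))
    n = len(all_sequences)
    result = [0.0] * dim
    for event, count in Counter(sequence).items():
        df = sum(1 for seq in all_sequences if event in seq)
        weight = (count / len(sequence)) * math.log(n / df)
        for k, value in enumerate(event_vectors[event]):
            result[k] += weight * value
    return result


# Hypothetical 2-dimensional log event vectors and two log sequences.
event_vectors = {"E1": [1.0, 0.0], "E2": [0.0, 1.0], "E3": [1.0, 1.0]}
all_sequences = [["E1", "E2", "E1"], ["E2", "E3", "E2"]]

bary_vector = bary(all_sequences[0], event_vectors)
tfidf_vector = tfidf_transform(all_sequences[0], all_sequences, event_vectors)
```

Either vector can then be fed to any of the anomaly detection models listed in the table.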
Figure 4. Performance of different schemes with bary and tf-idf coordinate transformations. (a) F1-score; (b) AUC.
Figure 5. Computational time of the feature extraction and anomaly detection with bary and tf-idf coordinate transformations. (a) Computational time of feature extraction; (b) computational time of anomaly detection model training.
Figure 6. Computational time of training and testing with bary and tf-idf coordinate transformations. (a) Computational time of issuing all predictions in the test set; (b) total computational time.
F1-score with different dimensions.
| dim(T) | W-b-RF | W-b-BN | W-b-NN | LE-b-RF | LE-b-BN | LE-b-NN |
|---|---|---|---|---|---|---|
| 5 | 0.821788573 | 0.638914042 | 0.599454543 | 0.848245935 | 0.654693534 | |
| 10 | 0.826054636 | 0.715535678 | 0.707411424 | 0.827861155 | 0.745818201 | |
| 20 | 0.834107143 | 0.72259094 | 0.803393267 | 0.879608688 | 0.782222222 | |
| 50 | 0.785066632 | 0.732549521 | 0.80844075 | 0.877704266 | 0.776404488 | |
| 100 | 0.814447561 | 0.751401056 | 0.747072721 | | 0.777671294 | 0.829474969 |
| 200 | 0.811296155 | 0.70370138 | 0.808328189 | | 0.800541113 | 0.826401595 |
| 500 | 0.761251469 | 0.72595185 | 0.766749974 | | 0.8009675 | 0.846823786 |
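
For reference, the F1-score reported in this table is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives, and false negatives. The counts in the example are made up and do not come from the experiments.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 8 detected anomalies, 2 false alarms, 2 missed anomalies.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```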
AUC with different dimensions.
| dim(T) | W-b-RF | W-b-BN | W-b-NN | LE-b-RF | LE-b-BN | LE-b-NN |
|---|---|---|---|---|---|---|
| 5 | 0.952121313 | 0.893721708 | 0.851681379 | 0.966854572 | 0.888070647 | |
| 10 | 0.946359642 | 0.910339212 | 0.883053652 | | 0.913831019 | 0.940199701 |
| 20 | 0.959049752 | 0.88620832 | 0.940347094 | | 0.939435145 | 0.950155508 |
| 50 | 0.936857925 | 0.909702317 | 0.932487478 | | 0.921604084 | 0.941439031 |
| 100 | 0.952108191 | 0.918905966 | 0.912478988 | | 0.928927259 | 0.911241156 |
| 200 | | 0.890634611 | 0.903616499 | 0.957165279 | 0.929082466 | 0.921258002 |
| 500 | 0.939609334 | 0.869648164 | 0.882033281 | | 0.917090636 | 0.92435774 |
Computational time of the feature extraction with different dimensions.
| dim(T) | W-b-RF | W-b-BN | W-b-NN | LE-b-RF | LE-b-BN | LE-b-NN |
|---|---|---|---|---|---|---|
| 5 | 928.467184365 | 928.154774189 | 928.213744521 | 33.440349197 | | 33.096653652 |
| 10 | 950.101264000 | 949.874471283 | 948.049470901 | 33.298034906 | | 33.481590986 |
| 20 | 959.311112213 | 955.330675745 | 955.955496573 | 33.588190031 | | 33.414360476 |
| 50 | 963.718184471 | 965.015199471 | 964.497252941 | 34.691981173 | 33.850553179 | |
| 100 | 1006.996418762 | 1008.646436644 | 1009.821412706 | 33.849840307 | | 33.166025877 |
| 200 | 1060.622915459 | 1063.216091013 | 1070.170886092 | 35.303260088 | | 36.286307726 |
| 500 | 1244.848617029 | 1245.274248505 | 1252.219076300 | 40.350662804 | | 39.244319487 |
Computational time of the anomaly detection with different dimensions.
| dim(T) | W-b-RF | W-b-BN | W-b-NN | LE-b-RF | LE-b-BN | LE-b-NN |
|---|---|---|---|---|---|---|
| 5 | 0.129946113 | 0.002494335 | 0.091389656 | 0.118011236 | | 0.092924643 |
| 10 | 0.174678040 | 0.002696085 | 0.088878870 | 0.140994787 | | 0.098157644 |
| 20 | 0.200891352 | 0.002576542 | 0.111003637 | 0.162158251 | | 0.103493547 |
| 50 | 0.302109051 | 0.003390408 | 0.131638622 | 0.226914167 | | 0.140838718 |
| 100 | 0.419996023 | 0.003537750 | 0.628135109 | 0.280465174 | | 0.576955175 |
| 200 | 0.592480183 | 0.003967857 | 1.132779264 | 0.374702978 | | 0.917661619 |
| 500 | 0.874522972 | 0.005608940 | 1.402945185 | 0.557642794 | | 1.660369825 |
Computational time of issuing all predictions in the test set with different dimensions.
| dim(T) | W-b-RF | W-b-BN | W-b-NN | LE-b-RF | LE-b-BN | LE-b-NN |
|---|---|---|---|---|---|---|
| 5 | 0.030201614 | | 0.007597148 | 0.030419779 | 0.008524370 | 0.008317709 |
| 10 | 0.030166960 | 0.007774067 | 0.008397007 | 0.028655720 | | 0.008075190 |
| 20 | 0.030082989 | 0.007681179 | 0.007875299 | 0.030849648 | | 0.008102942 |
| 50 | 0.031879997 | 0.007841539 | 0.008520412 | 0.031287003 | | 0.007965803 |
| 100 | 0.031735134 | 0.008501720 | 0.008821011 | 0.029254770 | | 0.007657909 |
| 200 | 0.031595659 | 0.009186029 | 0.091766162 | 0.029769993 | 0.007659483 | |
| 500 | 0.030633450 | 0.010237265 | 0.014094353 | 0.030339813 | 0.009008598 | |
Total computational time with different dimensions.
| dim(T) | W-b-RF | W-b-BN | W-b-NN | LE-b-RF | LE-b-BN | LE-b-NN |
|---|---|---|---|---|---|---|
| 5 | 928.627332091 | 928.164753020 | 928.312731326 | 33.588780212 | | 33.197896004 |
| 10 | 950.306108999 | 949.884941435 | 948.146746778 | 33.467685413 | | 33.587823820 |
| 20 | 959.542086554 | 955.340933466 | 956.074375510 | 33.781197929 | | 33.525956964 |
| 50 | 964.052173519 | 965.026431417 | 964.637411976 | 34.950182343 | 33.861068296 | |
| 100 | 1007.448149920 | 1008.658476114 | 1010.458368826 | 34.159560251 | | 33.750638962 |
| 200 | 1061.246991301 | 1063.229244900 | 1071.395431519 | 35.707733059 | | 37.210966873 |
| 500 | 1245.753773451 | 1245.290094709 | 1253.636115837 | 40.938645411 | | 40.913156271 |
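
The roughly 30-fold reduction in computational time stated in the abstract can be checked against this table. The sketch below divides the total time of the word-based scheme (W-b-RF) by that of the log-event-based scheme (LE-b-RF) for each dimension. The numbers are copied from those two columns; the variable names are illustrative only.

```python
# Total computational time per dimension: (W-b-RF, LE-b-RF),
# copied from the table of total computational time.
total_time = {
    5: (928.627332091, 33.588780212),
    10: (950.306108999, 33.467685413),
    20: (959.542086554, 33.781197929),
    50: (964.052173519, 34.950182343),
    100: (1007.448149920, 34.159560251),
    200: (1061.246991301, 35.707733059),
    500: (1245.753773451, 40.938645411),
}

# Speedup of the log-event input over the word input for each dimension.
speedup = {dim: word / event for dim, (word, event) in total_time.items()}

for dim, ratio in speedup.items():
    print(f"dim={dim}: {ratio:.1f}x")
```

For these two columns the ratio stays between 27 and 31 across all dimensions and is largest at dimension 500.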