Alper Ozcan, Cagatay Catal, Emrah Donmez, Behcet Senturk.
Abstract
Phishing is an attack that imitates the official websites of corporations such as banks, e-commerce sites, financial institutions, and governmental institutions. Phishing websites aim to access and retrieve users' sensitive information such as personal identification, social security numbers, passwords, e-mail addresses, credit card details, and other account information. Several anti-phishing techniques have been developed to cope with the increasing number of phishing attacks. Machine learning, and particularly deep learning, algorithms are nowadays the most important techniques for detecting and preventing phishing attacks because of their strong learning ability on massive datasets and their state-of-the-art results in many classification problems. Previously, two types of feature extraction techniques [i.e., character embedding-based and manual natural language processing (NLP) feature extraction] were used in isolation. However, researchers did not consolidate these features, and therefore the performance was not remarkable. Unlike previous works, our study presents an approach that uses both feature extraction techniques and discusses how to combine them to fully utilize the available data. This paper proposes hybrid deep learning models based on long short-term memory (LSTM) and deep neural network (DNN) algorithms for detecting phishing uniform resource locators (URLs) and evaluates the performance of the models on phishing datasets. The proposed hybrid deep learning models use both character embeddings and NLP features, thereby simultaneously exploiting deep connections between characters and revealing NLP-based high-level connections. Experimental results show that the proposed models outperform the other phishing detection models in terms of the accuracy metric.
Keywords: Deep learning; Machine learning; Phishing; Phishing detection
Year: 2021 PMID: 34393380 PMCID: PMC8349600 DOI: 10.1007/s00521-021-06401-z
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.102
Summary of the related work
| | Subasi and Kremic (2020) | Parra et al. (2020) | Aljofey et al. (2020) | Wei et al. (2019) | Our Approach (2021) |
|---|---|---|---|---|---|
| Dataset | UCI Machine Learning Repository: Phishing Websites Dataset | Detection of IoT botnet attacks N_IoT | Alexa, openphish, spamhaus.org, techhelplist.com etc. | Alexa, hphosts, Joewein, malwaredomains, and phishtank | Ebbu2017, PhishTank, Marchal2014 |
| Method | AdaBoost + SVM and Multiboosting | RNN-LSTM, DNN | Character level CNN | DNN | DNN+BiLSTM |
| Evaluation metrics | Accuracy, F1-Score and ROC Curves | Precision, Recall, F1-Score, Accuracy | Precision, Recall, F1-Score, Accuracy | Accuracy, Execution Time | Accuracy, F1-Score, AUC |
| Accuracy | 97.61% | 94.30% | 95.02% | 86.63% | 99.21% |
Fig. 1 Architecture of the Proposed Models
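The character-embedding branch of the proposed models first turns each URL into a fixed-length sequence of integer character IDs, which the embedding layer then maps to dense vectors. A minimal sketch of that encoding step, assuming an illustrative alphabet, maximum length, and padding scheme rather than the paper's exact configuration:

```python
# Character-level URL encoding sketch (alphabet, MAX_LEN, and the
# zero-padding convention are assumptions for illustration).
import string

ALPHABET = string.ascii_lowercase + string.digits + "-._/:?&=@"
CHAR2ID = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 is reserved for padding/unknown
MAX_LEN = 200  # assumed maximum URL length

def encode_url(url):
    """Map each character to an integer ID and right-pad to MAX_LEN."""
    ids = [CHAR2ID.get(c, 0) for c in url.lower()[:MAX_LEN]]
    return ids + [0] * (MAX_LEN - len(ids))

seq = encode_url("http://paypal-secure.example.com/login")
```

In the hybrid models, this fixed-length integer sequence feeds the LSTM/BiLSTM branch, while the manually extracted NLP feature vector feeds the DNN branch, and the two representations are combined before classification.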
Natural language processing (NLP) features [57]
| NLP Features | |
|---|---|
| Feature | Explanation |
| Raw Word Count | The number of words obtained after parsing the URL by special characters |
| Brand Check for Domain | Is the domain of the analyzed URL in the brand name list |
| Average Word Length | The average length of the words in the raw word list |
| Longest Word Length | The length of the longest word in the raw word list |
| Shortest Word Length | The length of the shortest word in the raw word list |
| Standard Deviation | Standard deviation of word lengths in the raw word list |
| Adjacent Word Count | Number of adjacent words processed in the WDM module |
| Average Adjacent Word Length | The average length of the detected adjacent words |
| Separated Word Count | The number of words obtained as a result of decomposing adjacent words |
| Keyword Count | The number of keywords in the URL |
| Brand Name Count | The number of brand names in the URL |
| Similar Keyword Count | The number of words in the URL that are similar to a keyword |
| Similar Brand Name Count | The number of words in the URL that are similar to a brand name |
| Random Word Count | The number of words in the URL created with random characters |
| Target Brand Name Count | The number of target brand names in the URL |
| Target Keyword Count | The number of target keywords in the URL |
| Other Words Count | The number of words that are not in the brand name or keyword lists but are in the English dictionary (e.g., computer, pencil, notebook) |
| Digit Count (3) | The number of digits in the URL, calculated separately for the domain, subdomain, and file path |
| Subdomain Count | The number of subdomains in the URL |
| Random Domain | Is the registered domain created with random characters |
| Length (3) | Length is calculated separately for the domain, subdomain and path |
| Known TLD | Is the registered TLD among the most widely used TLDs worldwide ["com", "org", "net", "de", "edu", "gov", etc.] |
| www, com (2) | The occurrence of "www" or "com" in the domain or subdomain, which is common in malicious URLs |
| Punycode | Punycode is a standard that allows the browser to decode certain special characters in the address field. Attackers may use Punycode to evade detection of malicious URLs |
| Special Character (8) | Within the URL, the components are separated from each other by dots. However, an attacker could create a malicious URL using some special characters {'-', '.', '/', '@', '?', '&', '=', '_'} |
| Consecutive Character Repeat | Attackers can make small changes in brand names or keywords to deceive users, for example by using the same character more than once |
| Alexa Check (2) | Alexa is a service that ranks frequently used websites according to their popularity. Is the domain in the Alexa top one million list |
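Several of the simpler features in the table above can be computed directly from the URL string. A sketch of a handful of them (raw word count, word-length statistics, per-component digit counts, subdomain count, special-character counts, and lengths), where the splitting rules and feature names are assumptions based on the table rather than the authors' exact implementation:

```python
# Sketch of a few URL-based NLP features from the table above.
# The word-splitting regex and subdomain heuristic are illustrative
# assumptions, not the paper's exact rules.
import re
from urllib.parse import urlparse

SPECIAL_CHARS = "-./@?&=_"

def nlp_features(url):
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.hostname or ""
    labels = host.split(".")
    # naive heuristic: everything left of the registered domain counts as a subdomain
    subdomains = labels[:-2] if len(labels) > 2 else []
    words = [w for w in re.split(r"[-./@?&=_:]+", url) if w]
    return {
        "raw_word_count": len(words),
        "average_word_length": sum(map(len, words)) / len(words) if words else 0,
        "longest_word_length": max(map(len, words), default=0),
        "shortest_word_length": min(map(len, words), default=0),
        "digit_count_domain": sum(c.isdigit() for c in host),
        "digit_count_path": sum(c.isdigit() for c in parsed.path),
        "subdomain_count": len(subdomains),
        "special_char_counts": {c: url.count(c) for c in SPECIAL_CHARS},
        "domain_length": len(host),
        "path_length": len(parsed.path),
    }

feats = nlp_features("http://login.paypa1-secure.example.com/verify?id=123")
```

Features that need external resources (brand name lists, keyword lists, the Alexa ranking, or the word decomposition module) are omitted here, since they depend on data the sketch does not have.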
Fig. 2 Architecture of the DNN model
Cross-validation results of the machine learning algorithms on Ebbu2017 phishing dataset
| Algorithm | Accuracy | AUC | F1-Score |
|---|---|---|---|
| Naive Bayes | 67.06% | 0.6716 | 0.7476 |
| kNN | 93.80% | 0.9377 | 0.9358 |
| Adaboost | 95.36% | 0.9534 | 0.9521 |
| Decision Tree | 96.87% | 0.9685 | 0.9678 |
| Ridge regression | 91.24% | 0.9110 | 0.9043 |
| LASSO | 89.97% | 0.8997 | 0.8994 |
| LightGBM | | | |
| XGBoost | 97.75% | 0.9774 | 0.9769 |
| Random Forest | 98.09% | 0.9807 | 0.9803 |
Bold values indicate the best overall result for the corresponding algorithm
Cross-validation results of the machine learning algorithms on phishtank dataset
| Algorithm | Accuracy | AUC | F1-Score |
|---|---|---|---|
| Naive Bayes | 73.69% | 0.7280 | 0.7909 |
| kNN | 88.03% | 0.8801 | 0.8841 |
| Adaboost | 89.38% | 0.8934 | 0.8980 |
| Decision Tree | 90.23% | 0.9026 | 0.8994 |
| Ridge regression | 85.30% | 0.8521 | 0.8611 |
| LASSO | 73.26% | 0.7327 | 0.7342 |
| LightGBM | | | |
| XGBoost | 92.61% | 0.9259 | 0.9269 |
| Random Forest | 93.30% | 0.9331 | 0.9334 |
Bold values indicate the best overall result for the corresponding algorithm
Hyperparameter search space and best hyperparameters
| Hyperparameter | Search space | Value |
|---|---|---|
| Optimizer | adam, adadelta, rmsprop, sgd | adam |
| Activation functions (Hidden layers) | relu, tanh, elu | relu |
| Dropout rate | 0.1–0.5 | 0.3 |
| Epoch | 10, 20, 40, 60 | 40 |
| Batch size | 16, 32, 64, 128 | 128 |
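The search space in the table above can be enumerated with a plain grid search. A minimal sketch, where the discretization of the dropout range and the scoring step are assumptions (a real run would train the model on each combination and keep the one with the best validation accuracy):

```python
# Grid enumeration of the hyperparameter search space from the table.
# The dropout range 0.1-0.5 is assumed to be discretized in steps of 0.1.
from itertools import product

SEARCH_SPACE = {
    "optimizer": ["adam", "adadelta", "rmsprop", "sgd"],
    "activation": ["relu", "tanh", "elu"],
    "dropout_rate": [0.1, 0.2, 0.3, 0.4, 0.5],
    "epochs": [10, 20, 40, 60],
    "batch_size": [16, 32, 64, 128],
}

def grid(space):
    """Yield every hyperparameter combination as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(grid(SEARCH_SPACE))
# 4 * 3 * 5 * 4 * 4 = 960 combinations to evaluate
```

The best combination reported in the table (adam, relu, dropout 0.3, 40 epochs, batch size 128) is one of these 960 candidates.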
Cross-validation results of the deep learning algorithms on Ebbu2017 phishing dataset
| Algorithm | Accuracy | AUC | F1-Score |
|---|---|---|---|
| DNN | 96.43% | 0.9644 | 0.9627 |
| CNN | 93.25% | 0.9326 | 0.9339 |
| RNN | 97.17% | 0.9718 | 0.9720 |
| LSTM | 98.24% | 0.9826 | 0.9828 |
| BiLSTM | 97.58% | 0.9759 | 0.9761 |
| DNN+LSTM | 98.62% | 0.9864 | 0.9865 |
| DNN+BiLSTM | | | |
Bold values indicate the best overall result for the corresponding algorithm
Cross-validation results of the deep learning algorithms on phishtank dataset
| Algorithm | Accuracy | AUC | F1-Score |
|---|---|---|---|
| DNN | 91.13% | 0.9125 | 0.9139 |
| CNN | 93.33% | 0.9332 | 0.9349 |
| RNN | 97.22% | 0.9720 | 0.9730 |
| LSTM | 98.23% | 0.9822 | 0.9827 |
| BiLSTM | 98.27% | 0.9826 | 0.9831 |
| DNN+LSTM | 98.98% | 0.9901 | 0.9910 |
| DNN+BiLSTM | | | |
Bold values indicate the best overall result for the corresponding algorithm