Cagatay Catal, Görkem Giray, Bedir Tekinerdogan, Sandeep Kumar, Suyash Shukla.
Abstract
Phishing attacks aim to steal confidential information using sophisticated methods, techniques, and tools such as content injection, social engineering, online social networks, and mobile applications. To avoid and mitigate the risks of these attacks, several phishing detection approaches have been developed, among which deep learning algorithms have provided promising results. However, the results and the corresponding lessons learned are fragmented across many studies, and a systematic overview of the use of deep learning algorithms in phishing detection is lacking. Hence, we performed a systematic literature review (SLR) to identify, assess, and synthesize the results on deep learning approaches for phishing detection as reported by the selected scientific publications. We address nine research questions and provide an overview of how deep learning algorithms have been used for phishing detection from several aspects. In total, 43 journal articles were selected from electronic databases to derive the answers to the defined research questions. Our SLR shows that, except for one study, all the models applied supervised deep learning algorithms. The most widely used data sources were URL-related data, third-party information on the website, website content-related data, and email. The most used deep learning algorithms were deep neural networks (DNN), convolutional neural networks, and recurrent neural networks/long short-term memory networks. DNN and hybrid deep learning algorithms provided the best performance among the deep learning-based algorithms. 72% of the studies did not apply any feature selection algorithm to build the prediction model. PhishTank was the most used dataset. While Keras and TensorFlow were the most preferred deep learning frameworks, 46% of the articles did not mention any framework.
This study also highlights several challenges for phishing detection to pave the way for further research.
Keywords: Cybersecurity; Deep learning; Machine learning; Malicious URL prediction; Phishing detection; Systematic literature review (SLR)
Year: 2022 PMID: 35645443 PMCID: PMC9125357 DOI: 10.1007/s10115-022-01672-x
Source DB: PubMed Journal: Knowl Inf Syst ISSN: 0219-3116 Impact factor: 2.531
Fig. 1 The architecture of (a) an RNN and (b) an RNN over a time step [1]
Fig. 2 The architecture of an autoencoder
Fig. 3 The architecture of an RBM network
Fig. 4 The architecture of a DBN network [54]
Fig. 5 Research questions mapped to the machine learning model life cycle, adapted from Amershi et al. [7]
Fig. 6 SLR process used in this study
Exclusion criteria
| # | Criterion |
|---|---|
| EC1 | Duplicate papers from multiple sources |
| EC2 | Papers without full text available |
| EC3 | Papers not written in English |
| EC4 | Papers not published in a journal |
| EC5 | Short papers, editorials, issue introductions |
| EC6 | Secondary studies, such as literature reviews, systematic mapping studies (SMS), and SLRs |
| EC7 | Papers which do not use deep learning for phishing detection |
| EC8 | Papers which only use traditional ML algorithms |
| EC9 | Papers which do not include empirical results |
Quality assessment criteria
| # | Question |
|---|---|
| Q1 | Are the aims of the study clearly stated? |
| Q2 | Are the scope and context and experimental design of the study clearly defined? |
| Q3 | Are the variables in the study likely to be valid and reliable? |
| Q4 | Is the research process documented adequately? |
| Q5 | Are all the study questions answered? |
| Q6 | Are the negative findings presented? |
| Q7 | Are the main findings stated clearly (regarding credibility, validity, and reliability)? |
| Q8 | Do the conclusions relate to the aim of the study, and are they reliable? |
Fig. 7 Quality score distribution of the selected papers
The data extraction form
| Field | Categories |
|---|---|
| Journal title | Free text |
| Publication year | Number |
| Paper title | Free text |
| Abstract | Free text |
| ML category | Supervised, Unsupervised, Semi-supervised |
| Data sources | Third Party info on Web site, Company logo, Email, Social media post, URL, the Web site (Content, Code) |
| Evaluation dataset | 5000 Best Websites, Alexa—Top Sites, Common Crawl, Contagio Mobile, Corpus of First Security and Privacy Analytics Anti-Phishing Shared Task, Curlie, Custom, DNS-BH—Malware Domain Blocklist by RiskAnalytics, joewein.de LLC, Malware Domain List, Nazario Phishing Corpus, OpenPhish, PhishTank, The Directory of the Web, The Enron—Spam Datasets, The Spamhaus Project, UCI dataset, Untroubled Software, VirusTotal |
| Feature selection method | Boruta, Correlation-based feature selection, Deep Belief Network, Genetic Algorithm, Greedy Selection algorithm, InfoGain, k-Best Chi2, L1 based Linear Support Vector Machine (L-SVM-L1), Optimal sensitive feature selection algorithm, Principal Component Analysis, Recursive Feature Elimination (RFE), Sparse Random Projection, Variance Threshold (VT), Not mentioned |
| DL Approaches | Autoencoder, CNN, DBN, DNN, Hybrid DL Model, RNN/LSTM |
| Evaluation parameters | Accuracy, AUC, F-measure, FNR, FPR, Precision/PPV, Recall/Sensitivity/TPR, Specificity/TNR |
| Validation method | Cross-validation, Hold-out, Not mentioned |
| Best algorithm | Autoencoder, CNN, DBN, DNN, Hybrid DL Model, Non-ML/DL Approach, RNN/LSTM, Traditional ML, Not mentioned |
| Implementation platform | H2O, Keras, MATLAB, Microsoft Cognitive Toolkit (CNTK), RStudio, TensorFlow, Theano, Not mentioned |
| Challenges and proposed solutions | Free text |
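The evaluation parameters listed in the extraction form all derive from the four cells of a binary confusion matrix, with phishing treated as the positive class. The sketch below shows the standard definitions; the class name `ConfusionCounts`, the helper names, and the counts are illustrative assumptions and do not come from any reviewed study. AUC is left out because it requires ranked prediction scores rather than counts.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary phishing classifier (phishing = positive class)."""
    tp: int  # phishing samples correctly flagged
    fp: int  # legitimate samples wrongly flagged
    tn: int  # legitimate samples correctly passed
    fn: int  # phishing samples missed


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def evaluation_metrics(c: ConfusionCounts) -> dict:
    """Compute the count-based metrics named in the extraction form."""
    precision = safe_div(c.tp, c.tp + c.fp)   # Precision / PPV
    recall = safe_div(c.tp, c.tp + c.fn)      # Recall / Sensitivity / TPR
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(c.tn, c.tn + c.fp),  # TNR
        "fpr": safe_div(c.fp, c.fp + c.tn),
        "fnr": safe_div(c.fn, c.fn + c.tp),
        "f_measure": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up counts, for illustration only.
metrics = evaluation_metrics(ConfusionCounts(tp=90, fp=5, tn=95, fn=10))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because FPR equals 1 minus specificity and FNR equals 1 minus recall, studies that report only some of these values can often be compared once the remaining ones are derived.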
Fig. 8 Number of papers until August 2020
Distribution of papers per journal
| Journal | # of papers | Reference(s) |
|---|---|---|
| IEEE Access | 4 | [ |
| Journal of Intelligent & Fuzzy Systems | 3 | [ |
| Neural Computing and Applications | 3 | [ |
| Security and Communication Networks | 2 | [ |
| Applied Intelligence | 1 | [ |
| Applied Soft Computing | 1 | [ |
| Computer Networks | 1 | [ |
| Computers & Security | 1 | [ |
| Data Technologies and Applications | 1 | [ |
| Electronics | 1 | [ |
| IEEE Internet of Things Journal | 1 | [ |
| IEEE Transactions on Big Data | 1 | [ |
| IET Information Security | 1 | [ |
| Information | 1 | [ |
| Information Security Journal | 1 | [ |
| Information Systems | 1 | [ |
| International Journal of Computational Intelligence and Applications | 1 | [ |
| International Journal of Computer Science, Engineering and Information Technology | 1 | [ |
| International Journal of Network Security | 1 | [ |
| International Journal of Network Security & Its Applications | 1 | [ |
| International Journal of Research in Engineering, Science and Management | 1 | [ |
| International Journal on Artificial Intelligence Tools | 1 | [ |
| Iran Journal of Computer Science | 1 | [ |
| Journal of Ambient Intelligence and Humanized Computing | 1 | [ |
| Journal of Computing and Information Technology | 1 | [ |
| Journal of Cyber Security Technology | 1 | [ |
| Journal of Enterprise Information Management | 1 | [ |
| Journal of Experimental & Theoretical Artificial Intelligence | 1 | [ |
| Journal of Information Processing | 1 | [ |
| Journal of Network and Computer Applications | 1 | [ |
| Journal of Systems and Information Technology | 1 | [ |
| Neural Networks | 1 | [ |
| Pervasive and Mobile Computing | 1 | [ |
| Sadhana | 1 | [ |
| Sensors | 1 | [ |
Fig. 9 Word cloud of the abstracts
Fig. 10 Distribution of data sources
Malware datasets and their web pages
| ID | Dataset | Web page |
|---|---|---|
| 1 | 5000 best websites | |
| 2 | Alexa—top sites | |
| 3 | Common crawl | |
| 4 | Contagio mobile | |
| 5 | Corpus of first security and privacy analytics anti-phishing shared task (IWSPA-AP 2018) | |
| 6 | Curlie | |
| 7 | DNS-BH—Malware domain blocklist by risk analytics | |
| 8 | Joewein.de LLC | |
| 9 | Malware domain list | |
| 10 | Nazario phishing corpus | |
| 11 | OpenPhish | |
| 12 | PhishTank | |
| 13 | The directory of the web | |
| 14 | The Enron—Spam Datasets | |
| 15 | The spamhaus project | |
| 16 | UCI dataset | |
| 17 | Untroubled software | |
| 18 | VirusTotal | |
Fig. 11 Distribution of datasets
Fig. 12 Feature selection
Fig. 13 Distribution of DL approaches
Fig. 14 Distribution of evaluation parameters
Fig. 15 Distribution of validation approaches
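The two validation approaches recorded in the extraction form, hold-out and cross-validation, differ in how often each sample is used for testing. The following is a minimal sketch of both using only the standard library; the function names, the 20% test fraction, and k = 5 are assumptions chosen for illustration, not settings reported by the reviewed studies.

```python
import random


def hold_out_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices once and reserve a fixed fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train, test)


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hold-out: a single train/test partition.
train_idx, test_idx = hold_out_split(10, test_fraction=0.2)

# Cross-validation: k partitions, each fold serving as the test set once.
folds = list(k_fold_splits(10, k=5))
```

Hold-out is cheaper because the model is trained once, while k-fold cross-validation trains k models and yields a performance estimate that depends less on one particular split.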
Fig. 16 Distribution of best-performing algorithms
Fig. 17 Distribution of DL implementation platforms
The challenges and proposed solutions
| Category type | Challenges (C1 to C14) | Proposed solutions (S1 to S7) | References |
|---|---|---|---|
| Model interpretability | C1. Not interpretable | No solution | [ |
| Model efficiency | C2. Long training time | No solution | [ |
| Model efficiency | C3. Fine-tuning the parameters | S1. Different parameters are used, but this is not a complete solution | [ |
| Model for specific cases | C4. Real-time phishing detection | No solution | [ |
| Model efficiency | C5. Requires too many computing resources | No solution; S2. Re-scaling the dataset | [ |
| Model efficiency | C6. Feature selection requiring a long time | No solution | [ |
| Model design considerations | C7. Overfitting problem | S3. Use a NN model that contains design risk minimization principle and Monte Carlo algorithm; S4. The new index value, decision tree, and local search method used for optimal feature selection; S5. Use the optimal feature selection method and neural network | [ |
| Model for specific cases | C8. Detecting phishing websites that use embedded objects such as Flash and JavaScript to replace textual content | No solution | [ |
| Model design considerations | C9. Multi-label classification | No solution | [ |
| Model for specific cases | C10. Detecting the structural changes in the URL and short phishing URLs (e.g., bitly, goo) | No solution | [ |
| Data/Dataset | C11. Obtaining adequate labeled data | No solution | [ |
| Data/Dataset | C12. Duplicate points in the public datasets | S6. K-medoids algorithm with an incremental method for medoid selection to remove duplicates | [ |
| Data/Dataset | C13. Different distributions of the real data and open datasets | S7. Use a small amount of labeled malicious data samples and apply KNN and K-Means algorithms to expand the samples. These expanded samples with high similarity to the manually labeled data are applied in further analysis | [ |
| Data/Dataset | C14. Short-lived suspicious websites that are not online at any time | No solution | [ |
Summary of deep learning algorithms
| Deep learning model | Description | Characteristics |
|---|---|---|
| DNN | Differs from shallow NNs in having more hidden layers | Can be used efficiently for classification and regression problems; can solve more complex problems than a shallow network; the number of parameters grows with the number of input features; more hidden layers increase model complexity |
| CNN | Suitable for 2-dimensional input data; convergence speed increases when ReLU is used | Neurons need not be fully connected as in a conventional artificial neural network; can solve more complex problems than a shallow network; a large labeled dataset may be needed; a large number of layers may be required to form the network |
| RNN | The previous stage's output is used as input in the current stage; processes input recurrently using internal memory; well suited for speech recognition applications | Useful for modeling sequence data; can process inputs of varying length; training is more difficult than for the other models because of its reliance on time; not capable of storing past information for a long duration |
| LSTM | Extension of the RNN model that works well in situations where an RNN fails; each cell in the LSTM unit comprises input, output, and forget gates; works well for large-range inputs | Can store past information for a long duration; capable of solving the vanishing gradient problem; performance can be affected by different random weight initializations; vulnerable to overfitting |
| Autoencoder | Frequently used for feature extraction or dimensionality reduction; aims to reconstruct the input using unsupervised learning | No need for labeled data; more robust; a pre-training stage is needed |
| RBM | A generative model that learns the probability distribution over the input data; detects hidden patterns in an unsupervised fashion; neurons in different layers form a bipartite graph | Missing values can be obtained using Gibbs sampling; can easily handle noisy data during reconstruction; the training process is difficult |
| DBN | Comprises a stack of RBM models; allows both supervised and unsupervised training of the network | Follows a sequential learning strategy for network initialization; the inferences maximize the likelihood; the training process may be expensive, as with RBM |
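The RNN row in the table above describes a recurrence in which the previous stage's output becomes part of the current stage's input. As a rough illustration of that idea, here is a forward pass of a vanilla (Elman-style) RNN in plain Python. The function name `rnn_forward` and the toy weights are hypothetical, picked by hand for the example; the reviewed studies typically used framework implementations such as Keras or TensorFlow instead.

```python
import math


def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state.

    inputs: list of input vectors (any sequence length)
    w_xh:   hidden_size x input_size weights (input -> hidden)
    w_hh:   hidden_size x hidden_size weights (previous hidden -> hidden)
    b_h:    bias vector of length hidden_size
    """
    hidden = [0.0] * len(b_h)  # internal memory starts at zero
    states = []
    for x in inputs:
        new_hidden = []
        for i in range(len(b_h)):
            total = b_h[i]
            total += sum(w * xj for w, xj in zip(w_xh[i], x))
            total += sum(w * hj for w, hj in zip(w_hh[i], hidden))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden  # this step's output feeds the next step
        states.append(hidden)
    return states


# Hypothetical toy weights: 2 hidden units, 1 input feature.
w_xh = [[0.5], [-0.3]]
w_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.1]

# The same weights handle sequences of different lengths.
short = rnn_forward([[1.0], [0.5]], w_xh, w_hh, b_h)
longer = rnn_forward([[1.0], [0.5], [0.2], [0.9]], w_xh, w_hh, b_h)
```

Each hidden state passes through `tanh` at every step, and gradients flowing back through many such steps tend to shrink. This is one way to see why a plain RNN struggles to retain information over long sequences, the limitation that the gated LSTM cell is designed to address.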
Summary of related work
| Study | Review type | Focus | Description |
|---|---|---|---|
| Khonji et al. [ | Non-systematic | Phishing detection techniques | Gives a high-level overview of detection approaches, offensive defense approaches, correction approaches, and prevention approaches. Detection at hour zero and a low false-positive rate are critical measures. Phishing solutions based on ML techniques are the most promising. |
| Varshney et al. [ | Non-systematic | Web phishing detection | Discusses various web phishing detection techniques and research gaps in web phishing detection. Found that search engine-based phishing detection techniques are the most suitable solutions. |
| Dou et al. [ | Systematic | Software-based web phishing detection | Analyzed software-based phishing detection techniques. Found that the true-positive rate (TPR), false-positive rate (FPR), and measures related to TPR and FPR are valuable performance measures. The PhishTank dataset is the most frequently used dataset, and phishing detection by blacklists toolbars are the most commonly used techniques. |
| Goel and Jain [ | Non-systematic | Mobile phishing attacks | Discusses mobile phishing attacks and their solutions. Also provides a taxonomy of phishing defense mechanisms that helps users understand the topic. |
| Mohammed Harun Babu et al. [ | Non-systematic | Deep Learning for cyber-security | Deep Learning for cyber-security |
| Sahoo et al. [ | Non-systematic | Malicious URL detection using machine learning | Discusses the applications of ML techniques for malicious URL detection, along with the requirements and challenges of developing ML-based solutions. Found that online ML algorithms are gaining attention because of the huge size of the training data. Feature selection is also important for the performance of ML techniques. |
| Wong [ | Non-systematic | Malicious web content detection using Deep Learning | Reviewed various heuristics-based, ML-based, and DL-based methods for malicious web content detection. Found that DL techniques combined with feature extraction provide an effective solution. |
| Ferreira [ | Non-systematic | Malicious URL detection using machine learning | Discusses the applications of ML techniques for the detection and prevention of malicious URLs. Discusses lexical and content-based features for ML-based malicious URL detection. Also discusses Naïve Bayes, Support Vector Machine, and online algorithms for malicious URL detection. |
| Berman et al. [ | Non-systematic | Deep Learning for cyber-security | Discusses various DL techniques and reviews studies that use DL techniques to provide solutions for cyber-security attacks. Found that the performance of different approaches varies with the problem domain. The TPR of the best-performing DL technique for identifying malicious domain names is 96.01%-99.86%, whereas the TPR for network intrusion detection is 92.33%-100%. Also provides a taxonomy of phishing defense mechanisms that helps users understand the topic. |
| Zuraiq and Alkasassbeh [ | Non-systematic | Phishing detection techniques | Reviewed various content-based, heuristic-based, and fuzzy rule-based phishing detection approaches based on various datasets. Found that no approach is better in every situation. |
| Kiruthiga and Akila [ | Non-systematic | Phishing website detection using machine learning | Discusses the applications of ML techniques for phishing website detection. Naive Bayes, Decision Tree, Support Vector Machine, and Random Forest are the most frequently used algorithms. PhishScore and PhishChecker systems have also been proposed for detection. |
| Singh and Meenu [ | Non-systematic | Phishing website detection using machine learning | Discusses the applications of ML techniques for phishing website detection, along with various protection approaches. Found that ML algorithms have achieved approximately 99% accuracy for phishing website detection by combining 30 features. |
| Benavides et al. [ | Systematic | Phishing attack solutions using Deep Learning | Discusses the applications of DL techniques for phishing website detection. Classified and analyzed various DL-based solutions. DL algorithms have not been explored enough for phishing website detection. There is still a need to identify one algorithm that can be useful in phishing website detection. |