Ali Aljofey1,2, Qingshan Jiang3, Abdur Rasool1,2, Hui Chen1,2, Wenyin Liu4, Qiang Qu1, Yang Wang5.
Abstract
Phishing websites are a growing threat because they are difficult to detect. They rely on internet users mistaking them for genuine sites and unknowingly revealing private information such as login IDs, passwords, and credit card numbers. This paper proposes a new approach to the phishing detection problem. Its features are derived from the URL character sequence (without prior phishing knowledge), various hyperlink information, and the textual content of the webpage; these are combined and used to train an XGBoost classifier. One of the major contributions of this paper is the selection of new features that can detect zero-hour attacks and do not depend on any third-party services. In particular, we extract character-level Term Frequency-Inverse Document Frequency (TF-IDF) features from the noisy parts of the HTML and the plaintext of the given webpage. Moreover, our proposed hyperlink features capture the relationship between the content and the URL of a webpage. Because no large phishing data set is publicly available, we created our own data set of 60,252 webpages to validate the proposed solution. It contains 32,972 benign webpages and 27,280 phishing webpages. For evaluation, each category of the proposed feature set is assessed separately, and various classification algorithms are employed. The empirical results show that the individual feature categories are valuable for phishing detection, and that integrating all the features improves detection further. The proposed approach achieved an accuracy of 96.76% with a false-positive rate of only 1.39% on our dataset, and an accuracy of 98.48% with a false-positive rate of 2.09% on a benchmark dataset, outperforming the existing baseline approaches.
Year: 2022 PMID: 35614133 PMCID: PMC9133026 DOI: 10.1038/s41598-022-10841-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Comparison of machine learning based phishing detection approaches.
| Approach | Description | Dataset | Limitations |
|---|---|---|---|
| Jain and Gupta | This approach filters phishing websites on the client side based on handcrafted URL features, hyperlink features, and identity keyword features using Random Forest | A private dataset of 2141 phishing webpages and 1918 benign webpages | It extracts manually designed URL features, which require human effort; identity features are language dependent because the top keywords are extracted from the website |
| Jain and Gupta | Proposed an anti-phishing approach using logistic regression, which relies on various hyperlink features extracted from the HTML content of the webpage | A private dataset of 1428 phishing and 1116 benign webpages | Limited dataset; the feature set depends entirely on the webpage content and fails when the content is replaced by images |
| Rao and Pais | The authors developed a two-level filtering technique to detect phishing sites using an enhanced blacklist and heuristic features | A public dataset of 5438 benign and 4097 phishing webpages | Benign sites always go through both levels of filtering |
| Jain and Gupta | An approach to classify websites based on two-level authentication: search engine and hyperlink information | A private dataset of 2000 benign and 2000 phishing webpages | Fails at the first level when newly constructed benign sites do not appear in the top search results; fails when the content of the webpage is replaced by an image |
| Sahingoz et al. | Uses NLP-based features, word vectors, and hybrid features, and then applies seven different machine learning algorithms to classify the URLs | A public dataset of 36,400 benign URLs and 37,175 phishing URLs | Cannot handle unseen characters in URLs; the method may fail to detect shorter URLs |
| Rao et al. | This technique proposes manually crafted URL features and TF-IDF based features, and uses them to classify URLs with a random forest classifier | A public dataset of 85,409 benign URLs and 40,668 phishing URLs | Extracts hand-crafted URL features, which require human effort and additional maintenance costs; the model may fail when phishing sites are hosted on free or compromised hosting servers |
| Aljofey et al. | A fast deep learning model based on the URL, which uses a character-level CNN, is proposed for phishing detection | A private dataset of 157,626 benign URLs and 161,016 phishing URLs | It depends entirely on the URL of the website; it does not consider whether the URL of the website is alive or returns an error |
| Le et al. | This technique applies CNN networks to both the characters and the words of the URL string for malicious URL detection | A private dataset of 4,683,425 benign URLs and 9,366,850 malicious URLs | Because the deep learning model uses both word-level and character-level embeddings, it requires sufficient memory |
| Xiao et al. | Proposed a technique named CNN–MHSA, which combines a convolutional neural network (CNN) and a multi-head self-attention (MHSA) mechanism to learn features in URLs and detect phishing | A private dataset of 45,000 benign and 43,984 phishing samples | The URL length parameter may affect the robustness of the model |
| Zheng et al. | Proposed a new Highway Hierarchical Neural Network (HDP-CNN) to detect phishing URLs. This method uses word-level embedding along with character-level embedding to achieve better performance | A private dataset of 344,794 benign URLs and 71,556 phishing URLs | Severe data imbalance probably causes the model to overfit on large datasets |
| Rao et al. | A machine learning technique that uses word embedding algorithms to generate a feature vector from the plain text and domain text extracted from the webpage content | A public dataset of 5438 phishing websites and 5076 benign websites with their URLs | The technique is language dependent; it fails when the content of the webpage is replaced by an image |
| Guo et al. | A phishing detection approach that creates heterogeneous information networks based on domain nodes, page resource nodes, and relationships between hyperlinks | A public dataset of 29,496 phishing samples and 30,649 benign samples | The approach may perform poorly when the webpage contains only a few hyperlinks |
| Proposed approach | A machine learning approach that uses a hybrid feature set including the URL character sequence, different hyperlink features, and character-level TF-IDF features from the plaintext and noisy part of the given webpage's HTML | A public data set consisting of 27,280 phishing URLs with HTML code and 32,972 benign pages | The plaintext-based features of a webpage are language dependent; requires access to the HTML source code of the webpage |
Figure 1. General architecture of the proposed approach.
Features used in the proposed approach.
| Category | No | Name |
|---|---|---|
| URL based features | F1 | Character sequences vectors |
| Textual content features | F2 | TF-IDF vector N-gram chars |
| Hyperlink information | F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, and F13 | Script_files, CSS_files, img_files, a_files, a_Null_hyperlinks, Null_hyperlinks, Total_hyperlinks, Internal_hyperlinks, External_hyperlinks, External/Internal_hyperlinks, and Error_hyperlinks |
| Login form information | F14 and F15 | Total_forms and Suspicious_form |
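The hyperlink and login-form rows of the feature table can be illustrated with a small counting routine. The following is a minimal sketch only: the record names the features but does not give their definitions, so the set of "null" targets, the internal/external test (same host as the page URL), the ratio's fallback value, and the names `HyperlinkCounter`, `NULL_TARGETS`, and `hyperlink_features` are all assumptions made for illustration. It covers a subset of the listed features and uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed definition of a "null" hyperlink; the source does not define it.
NULL_TARGETS = {"", "#", "#content", "javascript:void(0)"}


class HyperlinkCounter(HTMLParser):
    """Collects href/src targets and counts <form> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []       # raw href/src values in document order
        self.total_forms = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.total_forms += 1
        elif tag in ("a", "link"):
            # A missing or empty href is recorded as an empty target.
            self.links.append(attrs.get("href") or "")
        elif tag in ("script", "img") and "src" in attrs:
            self.links.append(attrs.get("src") or "")


def hyperlink_features(html, page_url):
    """Return a dict with a subset of the hyperlink and form features."""
    parser = HyperlinkCounter()
    parser.feed(html)
    page_host = urlparse(page_url).netloc.lower()
    null = internal = external = 0
    for raw in parser.links:
        target = raw.strip()
        if target.lower() in NULL_TARGETS:
            null += 1
            continue
        # Resolve relative links against the page URL before comparing hosts.
        host = urlparse(urljoin(page_url, target)).netloc.lower()
        if host == page_host:
            internal += 1
        else:
            external += 1
    return {
        "Total_hyperlinks": len(parser.links),
        "Null_hyperlinks": null,
        "Internal_hyperlinks": internal,
        "External_hyperlinks": external,
        # Assumption: the ratio falls back to 0.0 when there are no internal links.
        "External/Internal_hyperlinks": external / internal if internal else 0.0,
        "Total_forms": parser.total_forms,
    }
```

Calling `hyperlink_features(html, page_url)` on a page's source returns a flat dictionary that could be appended to the other feature vectors; the remaining features in the table (for example Error_hyperlinks, which presumably needs network access) are left out of this sketch.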
Figure 2. Phishing detection algorithm.
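One input to such a detection algorithm is the URL character sequence (feature F1). A rough sketch of how a URL might be turned into a fixed-length integer sequence follows. The alphabet, the maximum length of 200, the use of index 0 for both padding and unknown characters, and the function name `url_to_sequence` are assumptions for illustration; the record does not specify the encoding.

```python
import string

# Assumed alphabet: ASCII letters, digits, and punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation
# Index 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def url_to_sequence(url, max_len=200):
    """Map each character of the URL to an integer ID, then truncate or pad."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))
```

Because the mapping works on raw characters, it needs no hand-crafted rules or prior knowledge about phishing, which matches the motivation stated in the abstract.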
Figure 3. The process of generating text features.
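The character-level TF-IDF features (F2) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' pipeline: the n-gram range, the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1`, the L2 normalization, and the function names `char_ngrams` and `tfidf_vectors` are chosen for the example, since the record does not specify the weighting.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of the text for n in [n_min, n_max]."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(docs, n_min=1, n_max=3):
    """Return one sparse, L2-normalized TF-IDF vector (a dict) per document."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    # Assumed smoothed IDF; rarer n-grams receive larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors
```

Working at the character level means that misspelled or obfuscated tokens in the page text or in noisy HTML still share n-grams with their original forms, which is one plausible reason for preferring it over word-level TF-IDF here.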