Datasets for phishing websites detection.

Grega Vrbančič, Iztok Fister, Vili Podgorelec.

Abstract

Phishing is a fraudulent process in which an attacker tries to obtain sensitive information from a victim. These attacks are usually carried out via emails, text messages, or websites. Phishing websites, which are currently on the rise, look the same as legitimate sites, but their backend is designed to collect the sensitive information entered by the victim. Detecting phishing websites has recently attracted the attention of the machine learning community, which has built models for classifying phishing websites. This paper presents two dataset variations, consisting of 58,645 and 88,647 websites labeled as legitimate or phishing, which allow researchers to train classification models, build phishing detection systems, and mine association rules.
© 2020 The Author(s).

Keywords:  Classification; Computer security; Optimization; Phishing websites

Year:  2020        PMID: 33195768      PMCID: PMC7642806          DOI: 10.1016/j.dib.2020.106438

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the Data

These data consist of a collection of legitimate and phishing website instances. Each website is represented by a set of features and a label denoting whether the website is legitimate or phishing. The data can serve as input for the machine learning process. Machine learning and data mining researchers can benefit from these datasets, as can computer security researchers and practitioners. Computer security enthusiasts may find the datasets useful for building firewalls, intelligent ad blockers, and malware detection systems. Because the attributes in the datasets can be easily extracted, the datasets can help researchers and practitioners build classification models for systems that prevent phishing attacks. Finally, the datasets could also be used as a performance benchmark when developing state-of-the-art machine learning methods for phishing website classification.

Data Description

The presented dataset was collected and prepared for building and evaluating classification methods that detect phishing websites based on uniform resource locator (URL) properties, URL resolving metrics, and external services. The attributes of the prepared dataset can be divided into six groups. The first group comprises attributes based on the properties of the whole URL, presented in Table 1.
Table 1

Dataset attributes based on URL.

| Nr. | Attribute | Description | Format | Values |
| --- | --- | --- | --- | --- |
| 1 | qty_dot_url | Number of "." signs | Numeric | |
| 2 | qty_hyphen_url | Number of "-" signs | Numeric | |
| 3 | qty_underline_url | Number of "_" signs | Numeric | |
| 4 | qty_slash_url | Number of "/" signs | Numeric | |
| 5 | qty_questionmark_url | Number of "?" signs | Numeric | |
| 6 | qty_equal_url | Number of "=" signs | Numeric | |
| 7 | qty_at_url | Number of "@" signs | Numeric | |
| 8 | qty_and_url | Number of "&" signs | Numeric | |
| 9 | qty_exclamation_url | Number of "!" signs | Numeric | |
| 10 | qty_space_url | Number of " " signs | Numeric | |
| 11 | qty_tilde_url | Number of "~" signs | Numeric | |
| 12 | qty_comma_url | Number of "," signs | Numeric | |
| 13 | qty_plus_url | Number of "+" signs | Numeric | |
| 14 | qty_asterisk_url | Number of "*" signs | Numeric | |
| 15 | qty_hashtag_url | Number of "#" signs | Numeric | |
| 16 | qty_dollar_url | Number of "$" signs | Numeric | |
| 17 | qty_percent_url | Number of "%" signs | Numeric | |
| 18 | qty_tld_url | Top level domain character length | Numeric | |
| 19 | length_url | Number of characters | Numeric | |
| 20 | email_in_url | Is email present | Boolean | [0, 1] |

The second group comprises attributes based on the domain properties, presented in Table 2.
Table 2

Dataset attributes based on domain URL.

| Nr. | Attribute | Description | Format | Values |
| --- | --- | --- | --- | --- |
| 1 | qty_dot_domain | Number of "." signs | Numeric | |
| 2 | qty_hyphen_domain | Number of "-" signs | Numeric | |
| 3 | qty_underline_domain | Number of "_" signs | Numeric | |
| 4 | qty_slash_domain | Number of "/" signs | Numeric | |
| 5 | qty_questionmark_domain | Number of "?" signs | Numeric | |
| 6 | qty_equal_domain | Number of "=" signs | Numeric | |
| 7 | qty_at_domain | Number of "@" signs | Numeric | |
| 8 | qty_and_domain | Number of "&" signs | Numeric | |
| 9 | qty_exclamation_domain | Number of "!" signs | Numeric | |
| 10 | qty_space_domain | Number of " " signs | Numeric | |
| 11 | qty_tilde_domain | Number of "~" signs | Numeric | |
| 12 | qty_comma_domain | Number of "," signs | Numeric | |
| 13 | qty_plus_domain | Number of "+" signs | Numeric | |
| 14 | qty_asterisk_domain | Number of "*" signs | Numeric | |
| 15 | qty_hashtag_domain | Number of "#" signs | Numeric | |
| 16 | qty_dollar_domain | Number of "$" signs | Numeric | |
| 17 | qty_percent_domain | Number of "%" signs | Numeric | |
| 18 | qty_vowels_domain | Number of vowels | Numeric | |
| 19 | domain_length | Number of domain characters | Numeric | |
| 20 | domain_in_ip | URL domain in IP address format | Boolean | [0, 1] |
| 21 | server_client_domain | "server" or "client" in domain | Boolean | [0, 1] |

The third group comprises attributes based on the URL directory properties, presented in Table 3.
Table 3

Dataset attributes based on URL directory.

| Nr. | Attribute | Description | Format | Values |
| --- | --- | --- | --- | --- |
| 1 | qty_dot_directory | Number of "." signs | Numeric | |
| 2 | qty_hyphen_directory | Number of "-" signs | Numeric | |
| 3 | qty_underline_directory | Number of "_" signs | Numeric | |
| 4 | qty_slash_directory | Number of "/" signs | Numeric | |
| 5 | qty_questionmark_directory | Number of "?" signs | Numeric | |
| 6 | qty_equal_directory | Number of "=" signs | Numeric | |
| 7 | qty_at_directory | Number of "@" signs | Numeric | |
| 8 | qty_and_directory | Number of "&" signs | Numeric | |
| 9 | qty_exclamation_directory | Number of "!" signs | Numeric | |
| 10 | qty_space_directory | Number of " " signs | Numeric | |
| 11 | qty_tilde_directory | Number of "~" signs | Numeric | |
| 12 | qty_comma_directory | Number of "," signs | Numeric | |
| 13 | qty_plus_directory | Number of "+" signs | Numeric | |
| 14 | qty_asterisk_directory | Number of "*" signs | Numeric | |
| 15 | qty_hashtag_directory | Number of "#" signs | Numeric | |
| 16 | qty_dollar_directory | Number of "$" signs | Numeric | |
| 17 | qty_percent_directory | Number of "%" signs | Numeric | |
| 18 | directory_length | Number of directory characters | Numeric | |

The fourth group comprises attributes based on the URL file properties, presented in Table 4.
Table 4

Dataset attributes based on URL file name.

| Nr. | Attribute | Description | Format | Values |
| --- | --- | --- | --- | --- |
| 1 | qty_dot_file | Number of "." signs | Numeric | |
| 2 | qty_hyphen_file | Number of "-" signs | Numeric | |
| 3 | qty_underline_file | Number of "_" signs | Numeric | |
| 4 | qty_slash_file | Number of "/" signs | Numeric | |
| 5 | qty_questionmark_file | Number of "?" signs | Numeric | |
| 6 | qty_equal_file | Number of "=" signs | Numeric | |
| 7 | qty_at_file | Number of "@" signs | Numeric | |
| 8 | qty_and_file | Number of "&" signs | Numeric | |
| 9 | qty_exclamation_file | Number of "!" signs | Numeric | |
| 10 | qty_space_file | Number of " " signs | Numeric | |
| 11 | qty_tilde_file | Number of "~" signs | Numeric | |
| 12 | qty_comma_file | Number of "," signs | Numeric | |
| 13 | qty_plus_file | Number of "+" signs | Numeric | |
| 14 | qty_asterisk_file | Number of "*" signs | Numeric | |
| 15 | qty_hashtag_file | Number of "#" signs | Numeric | |
| 16 | qty_dollar_file | Number of "$" signs | Numeric | |
| 17 | qty_percent_file | Number of "%" signs | Numeric | |
| 18 | file_length | Number of file name characters | Numeric | |

The fifth group comprises attributes based on the URL parameter properties, presented in Table 5.
Table 5

Dataset attributes based on URL parameters.

| Nr. | Attribute | Description | Format | Values |
| --- | --- | --- | --- | --- |
| 1 | qty_dot_params | Number of "." signs | Numeric | |
| 2 | qty_hyphen_params | Number of "-" signs | Numeric | |
| 3 | qty_underline_params | Number of "_" signs | Numeric | |
| 4 | qty_slash_params | Number of "/" signs | Numeric | |
| 5 | qty_questionmark_params | Number of "?" signs | Numeric | |
| 6 | qty_equal_params | Number of "=" signs | Numeric | |
| 7 | qty_at_params | Number of "@" signs | Numeric | |
| 8 | qty_and_params | Number of "&" signs | Numeric | |
| 9 | qty_exclamation_params | Number of "!" signs | Numeric | |
| 10 | qty_space_params | Number of " " signs | Numeric | |
| 11 | qty_tilde_params | Number of "~" signs | Numeric | |
| 12 | qty_comma_params | Number of "," signs | Numeric | |
| 13 | qty_plus_params | Number of "+" signs | Numeric | |
| 14 | qty_asterisk_params | Number of "*" signs | Numeric | |
| 15 | qty_hashtag_params | Number of "#" signs | Numeric | |
| 16 | qty_dollar_params | Number of "$" signs | Numeric | |
| 17 | qty_percent_params | Number of "%" signs | Numeric | |
| 18 | params_length | Number of parameters characters | Numeric | |
| 19 | tld_present_params | TLD present in parameters | Boolean | [0, 1] |
| 20 | qty_params | Number of parameters | Numeric | |

The sixth group comprises attributes based on the URL resolving data and external metrics, presented in Table 6.
Table 6

Dataset attributes based on resolving URL and external services.

| Nr. | Attribute | Description | Format | Values |
| --- | --- | --- | --- | --- |
| 1 | time_response | Domain lookup time response | Numeric | |
| 2 | domain_spf | Domain has SPF | Boolean | [0, 1] |
| 3 | asn_ip | ASN | Numeric | |
| 4 | time_domain_activation | Domain activation time (in days) | Numeric | |
| 5 | time_domain_expiration | Domain expiration time (in days) | Numeric | |
| 6 | qty_ip_resolved | Number of resolved IPs | Numeric | |
| 8 | qty_nameservers | Number of resolved NS | Numeric | |
| 9 | qty_mx_servers | Number of MX servers | Numeric | |
| 10 | ttl_hostname | Time-To-Live (TTL) | Numeric | |
| 11 | tls_ssl_certificate | Has valid TLS/SSL certificate | Boolean | [0, 1] |
| 12 | qty_redirects | Number of redirects | Numeric | |
| 13 | url_google_index | Is URL indexed on Google | Boolean | [0, 1] |
| 14 | domain_google_index | Is domain indexed on Google | Boolean | [0, 1] |
| 15 | url_shortened | Is URL shortened | Boolean | |
| 16 | phishing | Is phishing website | Boolean | [0, 1] |

The first group is based on the values of the attributes of the whole URL string, while the values of the following four groups are based on particular sub-strings, as presented in Figure 1. The attributes of the last group are based on the URL resolving metrics as well as on external services such as the Google search index.
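
The sign-counting attributes of the first group can be illustrated with a short Python sketch. It is not the authors' extraction script: the helper names `count_signs` and `url_features`, the simplified email pattern, and the example URL are assumptions made for illustration, and `qty_tld_url` is left out because the text does not say how the top-level domain is determined.

```python
import re

# Signs counted in Table 1; the keys mirror the middle part of the attribute names.
SIGNS = {
    "dot": ".", "hyphen": "-", "underline": "_", "slash": "/",
    "questionmark": "?", "equal": "=", "at": "@", "and": "&",
    "exclamation": "!", "space": " ", "tilde": "~", "comma": ",",
    "plus": "+", "asterisk": "*", "hashtag": "#", "dollar": "$",
    "percent": "%",
}

# Assumption: a deliberately simple email pattern, not the rule used for the dataset.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def count_signs(text, suffix):
    """Return {'qty_<sign>_<suffix>': count} for every sign in SIGNS."""
    return {f"qty_{name}_{suffix}": text.count(char) for name, char in SIGNS.items()}


def url_features(url):
    """Compute the whole-URL counts, the URL length, and the email flag."""
    features = count_signs(url, "url")
    features["length_url"] = len(url)
    features["email_in_url"] = int(EMAIL_RE.search(url) is not None)
    return features


# Made-up URL, used only to exercise the functions.
example = url_features("http://example.com/a-b/login.php?user=me@example.org&x=1")
```

Because `count_signs` takes the suffix as a parameter, the same helper could be reused for the domain, directory, file, and parameter sub-strings.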
Fig. 1

Separation of the whole URL string into sub-strings.
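
One possible way to obtain such sub-strings is sketched below with `urlsplit` from Python's standard `urllib.parse` module. The exact split used for the dataset is the one defined in Figure 1; this sketch assumes that the domain is the host name, the directory is the path up to and including its last "/", the file is the rest of the path, and the parameters are the query string. The function name `split_url` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL (with scheme) into domain, directory, file, and params."""
    parts = urlsplit(url)
    # Assumption: the last "/" of the path separates the directory from the file name.
    directory, separator, file_name = parts.path.rpartition("/")
    return {
        "domain": parts.hostname or "",
        "directory": directory + separator,
        "file": file_name,
        "params": parts.query,
    }


# Made-up URL, used only to exercise the function.
pieces = split_url("http://example.com/a-b/login.php?user=me&x=1")
```

Note that `urlsplit` only recognizes the host when the URL carries a scheme such as `http://`; without one, the whole string is treated as a path.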

The dataset features 111 attributes in total, excluding the target phishing attribute, which denotes whether a particular instance is legitimate (value 0) or phishing (value 1). We prepared two variations of the dataset. In the first, the total number of instances is 58,645 and the target classes are more or less balanced, with 30,647 instances labeled as phishing websites and 27,998 instances labeled as legitimate. The second variant comprises 88,647 instances, with 30,647 labeled as phishing and 58,000 labeled as legitimate; its purpose is to mimic the real-world situation in which legitimate websites are more common. The distribution between the classes of both dataset variants is presented in Figure 2.
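
The class balance of the two variants follows directly from the counts quoted above. The short Python check below uses only those counts; the function name `summarize` is a hypothetical helper.

```python
# Instance counts as stated in the text.
PHISHING = 30_647
LEGITIMATE_SMALL = 27_998
LEGITIMATE_FULL = 58_000


def summarize(phishing, legitimate):
    """Return the total number of instances and the phishing share in percent."""
    total = phishing + legitimate
    return total, round(100 * phishing / total, 1)


small_total, small_share = summarize(PHISHING, LEGITIMATE_SMALL)
full_total, full_share = summarize(PHISHING, LEGITIMATE_FULL)

print(f"dataset_small: {small_total} instances, {small_share}% phishing")
print(f"dataset_full:  {full_total} instances, {full_share}% phishing")
```

Roughly half of the smaller variant and about a third of the larger variant are phishing instances, which matches the description of one variant as balanced and the other as imbalanced.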
Fig. 2

The distribution between classes for both dataset variations. The dataset_full denotes the larger dataset, while the dataset_small denotes the smaller dataset variation. The target class 0 denotes legitimate websites while the target class 1 denotes the phishing websites.


Experimental Design, Materials and Methods

In preparing the phishing website dataset variants presented in [2], we followed common steps that were also used in the preparation of similar datasets by Mohammad et al. [3] and Abdelhamid et al. [4]. Following that process, we first collected a list of 30,647 confirmed phishing URLs from the PhishTank [5] website. The list of legitimate URLs was obtained from the Alexa ranking website, from which we gathered 58,000 legitimate website URLs. Additionally, we obtained a list of 27,998 community labeled and organized URLs [1], which point to objectively reported news and are therefore also legitimate.

From the URL lists of phishing and legitimate websites we prepared, as already presented, two variants of the dataset. The smaller, more balanced dataset_small comprises instances of features extracted from the PhishTank URLs and instances of features extracted from the community labeled and organized URLs, which represent the legitimate ones. The larger, more imbalanced dataset consists of all of the instances from dataset_small and the additional instances of features extracted from the Alexa top sites URL list.

The complete process of extracting the features from the list of collected website addresses was conducted automatically, using a Python script. The extraction process is outlined in Algorithm 1. The procedure was run twice, each time with a different set of website addresses, as already described. The final outcome is two CSV files containing the extracted features. CSV files are easy to work with in various tools and programming libraries.
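
The overall flow described above (a list of URLs in, one CSV row of features plus a label out) might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' script: `extract_features` is a hypothetical stand-in that computes only two of the 111 attributes, the URLs are made up, and an in-memory buffer replaces the real CSV file.

```python
import csv
import io


def extract_features(url):
    """Hypothetical stand-in for the per-URL feature extraction step."""
    return {"qty_dot_url": url.count("."), "length_url": len(url)}


def build_rows(urls, label):
    """Attach the target label (1 = phishing, 0 = legitimate) to each feature row."""
    rows = []
    for url in urls:
        row = extract_features(url)
        row["phishing"] = label
        rows.append(row)
    return rows


def write_csv(rows, handle):
    """Write the rows as CSV; assumes at least one row with a shared set of keys."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Made-up URLs for illustration only; they are not taken from the dataset.
phishing_urls = ["http://login.example.test/verify.php"]
legitimate_urls = ["https://news.example.test/"]

buffer = io.StringIO()
write_csv(build_rows(phishing_urls, 1) + build_rows(legitimate_urls, 0), buffer)

# Reading the CSV back and separating the features from the target column.
buffer.seek(0)
records = list(csv.DictReader(buffer))
X = [[int(value) for key, value in r.items() if key != "phishing"] for r in records]
y = [int(r["phishing"]) for r in records]
```

The resulting `X` and `y` have the shape most classification libraries expect, so the same reading step could be applied to the published CSV files by replacing the buffer with an opened file.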
Algorithm 1

Feature extraction process


Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.
Subject:  Computer Science
Specific subject area:  Artificial Intelligence
Type of data:  csv file
How data were acquired:  Data were acquired through the publicly available lists of phishing and legitimate websites, from which the features presented in the datasets were extracted.
Data format:  Raw: csv file
Parameters for data collection:  For the phishing websites, only those from the PhishTank registry, which are verified by multiple users, were included. For the legitimate websites, we included the websites from publicly available, community labeled and organized lists [1], and from the Alexa top ranking websites.
Description of data collection:  The data comprise the features extracted from the collections of website addresses. In total, the data consist of 111 features, 96 of which are extracted from the website address itself, while the remaining 15 features were extracted using custom Python code.
Data source location:  Worldwide
Data accessibility:  Repository name: Mendeley Data; Data identification number: 10.17632/72ptz43s9v.1; Direct URL to data: https://doi.org/10.17632/72ptz43s9v.1
Related research article:  Vrbančič, Grega, Iztok Fister Jr, and Vili Podgorelec. "Parameter setting for deep neural networks using swarm intelligence on phishing websites classification." International Journal on Artificial Intelligence Tools 28.06 (2019): 1960008. DOI: 10.1142/S021821301960008X