Grega Vrbančič, Iztok Fister, Vili Podgorelec.
Abstract
Phishing is a fraudulent process in which an attacker tries to obtain sensitive information from a victim. These attacks are usually carried out via emails, text messages, or websites. Phishing websites, which are currently on a considerable rise, look the same as legitimate sites, but their backend is designed to collect sensitive information entered by the victim. Discovering and detecting phishing websites has recently also gained the attention of the machine learning community, which has built models and performed classifications of phishing websites. This paper presents two dataset variations, consisting of 58,645 and 88,647 websites labeled as legitimate or phishing, which allow researchers to train classification models, build phishing detection systems, and mine association rules.
Keywords: Classification; Computer security; Optimization; Phishing websites
Year: 2020 PMID: 33195768 PMCID: PMC7642806 DOI: 10.1016/j.dib.2020.106438
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Dataset attributes based on URL.
| Nr. | Attribute | Description | Format | Values |
|---|---|---|---|---|
| 1 | qty_dot_url | Number of ”.” signs | Numeric | |
| 2 | qty_hyphen_url | Number of ”-” signs | Numeric | |
| 3 | qty_underline_url | Number of ”_” signs | Numeric | |
| 4 | qty_slash_url | Number of ”/” signs | Numeric | |
| 5 | qty_questionmark_url | Number of ”?” signs | Numeric | |
| 6 | qty_equal_url | Number of ”=” signs | Numeric | |
| 7 | qty_at_url | Number of ”@” signs | Numeric | |
| 8 | qty_and_url | Number of ”&” signs | Numeric | |
| 9 | qty_exclamation_url | Number of ”!” signs | Numeric | |
| 10 | qty_space_url | Number of ” ” signs | Numeric | |
| 11 | qty_tilde_url | Number of ”~” signs | Numeric | |
| 12 | qty_comma_url | Number of ”,” signs | Numeric | |
| 13 | qty_plus_url | Number of ”+” signs | Numeric | |
| 14 | qty_asterisk_url | Number of ”*” signs | Numeric | |
| 15 | qty_hashtag_url | Number of ”#” signs | Numeric | |
| 16 | qty_dollar_url | Number of ”$” signs | Numeric | |
| 17 | qty_percent_url | Number of ”%” signs | Numeric | |
| 18 | qty_tld_url | Top level domain character length | Numeric | |
| 19 | length_url | Number of characters | Numeric | |
| 20 | email_in_url | Is email present | Boolean | [0, 1] |
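
The character-count attributes in the table above can be illustrated with a minimal Python sketch. This is not the authors' extraction code: the function names `count_symbols` and `url_features` and the e-mail pattern are assumptions made for illustration, and `qty_tld_url` is omitted because it needs a top-level domain list that is not described here.

```python
import re

# Symbol-to-name mapping mirrors the attribute names in the table above.
SYMBOLS = {
    ".": "dot", "-": "hyphen", "_": "underline", "/": "slash",
    "?": "questionmark", "=": "equal", "@": "at", "&": "and",
    "!": "exclamation", " ": "space", "~": "tilde", ",": "comma",
    "+": "plus", "*": "asterisk", "#": "hashtag", "$": "dollar",
    "%": "percent",
}

# Assumption: a deliberately simple e-mail pattern, not the authors' rule.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def count_symbols(text: str, suffix: str) -> dict:
    """Count every tracked symbol in `text`, e.g. qty_dot_url for suffix 'url'."""
    return {f"qty_{name}_{suffix}": text.count(sym) for sym, name in SYMBOLS.items()}


def url_features(url: str) -> dict:
    """Build the symbol counts plus length_url and email_in_url for a full URL."""
    features = count_symbols(url, "url")
    features["length_url"] = len(url)
    features["email_in_url"] = int(EMAIL_RE.search(url) is not None)
    return features


example_url = "http://example.com/a-b/login.php?user=a@b.com&x=1"
example = url_features(example_url)
```

The same `count_symbols` helper could be reused with the suffixes `domain`, `directory`, `file`, and `params` once the URL has been split into its sub-strings.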
Dataset attributes based on domain URL.
| Nr. | Attribute | Description | Format | Values |
|---|---|---|---|---|
| 1 | qty_dot_domain | Number of ”.” signs | Numeric | |
| 2 | qty_hyphen_domain | Number of ”-” signs | Numeric | |
| 3 | qty_underline_domain | Number of ”_” signs | Numeric | |
| 4 | qty_slash_domain | Number of ”/” signs | Numeric | |
| 5 | qty_questionmark_domain | Number of ”?” signs | Numeric | |
| 6 | qty_equal_domain | Number of ”=” signs | Numeric | |
| 7 | qty_at_domain | Number of ”@” signs | Numeric | |
| 8 | qty_and_domain | Number of ”&” signs | Numeric | |
| 9 | qty_exclamation_domain | Number of ”!” signs | Numeric | |
| 10 | qty_space_domain | Number of ” ” signs | Numeric | |
| 11 | qty_tilde_domain | Number of ”~” signs | Numeric | |
| 12 | qty_comma_domain | Number of ”,” signs | Numeric | |
| 13 | qty_plus_domain | Number of ”+” signs | Numeric | |
| 14 | qty_asterisk_domain | Number of ”*” signs | Numeric | |
| 15 | qty_hashtag_domain | Number of ”#” signs | Numeric | |
| 16 | qty_dollar_domain | Number of ”$” signs | Numeric | |
| 17 | qty_percent_domain | Number of ”%” signs | Numeric | |
| 18 | qty_vowels_domain | Number of vowels | Numeric | |
| 19 | domain_length | Number of domain characters | Numeric | |
| 20 | domain_in_ip | URL domain in IP address format | Boolean | [0, 1] |
| 21 | server_client_domain | ”server” or ”client” in domain | Boolean | [0, 1] |
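
The three domain-specific attributes that are not plain symbol counts can be sketched as follows. The function name `domain_features` is hypothetical, and the choices made here (vowels limited to `aeiou`, case-insensitive matching, IP detection through the standard `ipaddress` module) are assumptions rather than the authors' documented rules.

```python
import ipaddress


def domain_features(domain: str) -> dict:
    """Compute qty_vowels_domain, domain_length, domain_in_ip, server_client_domain."""
    lowered = domain.lower()

    # Assumption: the domain counts as "IP format" if it parses as IPv4 or IPv6.
    try:
        ipaddress.ip_address(domain)
        in_ip_format = 1
    except ValueError:
        in_ip_format = 0

    return {
        "qty_vowels_domain": sum(ch in "aeiou" for ch in lowered),
        "domain_length": len(domain),
        "domain_in_ip": in_ip_format,
        "server_client_domain": int("server" in lowered or "client" in lowered),
    }


named = domain_features("mail-server.example.com")
numeric = domain_features("192.168.0.1")
```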
Dataset attributes based on URL directory.
| Nr. | Attribute | Description | Format | Values |
|---|---|---|---|---|
| 1 | qty_dot_directory | Number of ”.” signs | Numeric | |
| 2 | qty_hyphen_directory | Number of ”-” signs | Numeric | |
| 3 | qty_underline_directory | Number of ”_” signs | Numeric | |
| 4 | qty_slash_directory | Number of ”/” signs | Numeric | |
| 5 | qty_questionmark_directory | Number of ”?” signs | Numeric | |
| 6 | qty_equal_directory | Number of ”=” signs | Numeric | |
| 7 | qty_at_directory | Number of ”@” signs | Numeric | |
| 8 | qty_and_directory | Number of ”&” signs | Numeric | |
| 9 | qty_exclamation_directory | Number of ”!” signs | Numeric | |
| 10 | qty_space_directory | Number of ” ” signs | Numeric | |
| 11 | qty_tilde_directory | Number of ”~” signs | Numeric | |
| 12 | qty_comma_directory | Number of ”,” signs | Numeric | |
| 13 | qty_plus_directory | Number of ”+” signs | Numeric | |
| 14 | qty_asterisk_directory | Number of ”*” signs | Numeric | |
| 15 | qty_hashtag_directory | Number of ”#” signs | Numeric | |
| 16 | qty_dollar_directory | Number of ”$” signs | Numeric | |
| 17 | qty_percent_directory | Number of ”%” signs | Numeric | |
| 18 | directory_length | Number of directory characters | Numeric | |
Dataset attributes based on URL file name.
| Nr. | Attribute | Description | Format | Values |
|---|---|---|---|---|
| 1 | qty_dot_file | Number of ”.” signs | Numeric | |
| 2 | qty_hyphen_file | Number of ”-” signs | Numeric | |
| 3 | qty_underline_file | Number of ”_” signs | Numeric | |
| 4 | qty_slash_file | Number of ”/” signs | Numeric | |
| 5 | qty_questionmark_file | Number of ”?” signs | Numeric | |
| 6 | qty_equal_file | Number of ”=” signs | Numeric | |
| 7 | qty_at_file | Number of ”@” signs | Numeric | |
| 8 | qty_and_file | Number of ”&” signs | Numeric | |
| 9 | qty_exclamation_file | Number of ”!” signs | Numeric | |
| 10 | qty_space_file | Number of ” ” signs | Numeric | |
| 11 | qty_tilde_file | Number of ”~” signs | Numeric | |
| 12 | qty_comma_file | Number of ”,” signs | Numeric | |
| 13 | qty_plus_file | Number of ”+” signs | Numeric | |
| 14 | qty_asterisk_file | Number of ”*” signs | Numeric | |
| 15 | qty_hashtag_file | Number of ”#” signs | Numeric | |
| 16 | qty_dollar_file | Number of ”$” signs | Numeric | |
| 17 | qty_percent_file | Number of ”%” signs | Numeric | |
| 18 | file_length | Number of file name characters | Numeric | |
Dataset attributes based on URL parameters.
| Nr. | Attribute | Description | Format | Values |
|---|---|---|---|---|
| 1 | qty_dot_params | Number of ”.” signs | Numeric | |
| 2 | qty_hyphen_params | Number of ”-” signs | Numeric | |
| 3 | qty_underline_params | Number of ”_” signs | Numeric | |
| 4 | qty_slash_params | Number of ”/” signs | Numeric | |
| 5 | qty_questionmark_params | Number of ”?” signs | Numeric | |
| 6 | qty_equal_params | Number of ”=” signs | Numeric | |
| 7 | qty_at_params | Number of ”@” signs | Numeric | |
| 8 | qty_and_params | Number of ”&” signs | Numeric | |
| 9 | qty_exclamation_params | Number of ”!” signs | Numeric | |
| 10 | qty_space_params | Number of ” ” signs | Numeric | |
| 11 | qty_tilde_params | Number of ”~” signs | Numeric | |
| 12 | qty_comma_params | Number of ”,” signs | Numeric | |
| 13 | qty_plus_params | Number of ”+” signs | Numeric | |
| 14 | qty_asterisk_params | Number of ”*” signs | Numeric | |
| 15 | qty_hashtag_params | Number of ”#” signs | Numeric | |
| 16 | qty_dollar_params | Number of ”$” signs | Numeric | |
| 17 | qty_percent_params | Number of ”%” signs | Numeric | |
| 18 | params_length | Number of parameters characters | Numeric | |
| 19 | tld_present_params | TLD present in parameters | Boolean | [0, 1] |
| 20 | qty_params | Number of parameters | Numeric | |
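
As a rough illustration of `params_length` and `qty_params`, the query string can be parsed with the standard library. The function name `params_features` is hypothetical, and counting parameters with blank values is an assumption. `tld_present_params` is left out because its exact rule is not described here.

```python
from urllib.parse import parse_qsl


def params_features(query: str) -> dict:
    """Compute params_length and qty_params for a raw query string (without '?')."""
    # Assumption: a parameter with an empty value still counts as a parameter.
    pairs = parse_qsl(query, keep_blank_values=True)
    return {"params_length": len(query), "qty_params": len(pairs)}


with_params = params_features("user=a&x=1&empty=")
without_params = params_features("")
```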
Dataset attributes based on resolving URL and external services.
| Nr. | Attribute | Description | Format | Values |
|---|---|---|---|---|
| 1 | time_response | Domain lookup time response | Numeric | |
| 2 | domain_spf | Domain has SPF | Boolean | [0, 1] |
| 3 | asn_ip | ASN | Numeric | |
| 4 | time_domain_activation | Domain activation time (in days) | Numeric | |
| 5 | time_domain_expiration | Domain expiration time (in days) | Numeric | |
| 6 | qty_ip_resolved | Number of resolved IPs | Numeric | |
| 8 | qty_nameservers | Number of resolved NS | Numeric | |
| 9 | qty_mx_servers | Number of MX | Numeric | |
| 10 | ttl_hostname | Time-To-Live (TTL) | Numeric | |
| 11 | tls_ssl_certificate | Has valid TLS | Boolean | [0, 1] |
| 12 | qty_redirects | Number of redirects | Numeric | |
| 13 | url_google_index | Is URL indexed on Google | Boolean | [0, 1] |
| 14 | domain_google_index | Is domain indexed on Google | Boolean | [0, 1] |
| 15 | url_shortened | Is URL shortened | Boolean | [0, 1] |
Fig. 1. Separation of the whole URL string into sub-strings.
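
The separation described in Fig. 1 can be approximated with `urllib.parse.urlsplit`. This is a sketch under stated assumptions, not the authors' procedure: the function name `split_url` is hypothetical, the URL is assumed to include a scheme such as `http://`, the directory is taken as the path up to and including its last slash, and the file name is whatever follows that slash.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into domain, directory, file, and params sub-strings."""
    parts = urlsplit(url)
    # rpartition splits the path at its last "/" into (directory, "/", file).
    directory, _, file_name = parts.path.rpartition("/")
    return {
        "domain": parts.hostname or "",
        "directory": directory + "/" if parts.path else "",
        "file": file_name,
        "params": parts.query,
    }


pieces = split_url("http://example.com/a-b/login.php?user=a&x=1")
bare = split_url("http://example.com")
```

Each sub-string can then be passed to its own feature function, which matches the per-part grouping of the attribute tables.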
Fig. 2. The distribution between classes for both dataset variations. The dataset_full denotes the larger dataset, while the dataset_small denotes the smaller dataset variation. The target class 0 denotes legitimate websites, while the target class 1 denotes phishing websites.
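
A class distribution like the one in Fig. 2 can be tallied from a CSV file with the standard library. The target column name `phishing` and the function name `class_distribution` are assumptions, and the three-row sample below is made-up illustration data, not taken from the datasets.

```python
import csv
import io
from collections import Counter


def class_distribution(csv_file, target_column: str = "phishing") -> Counter:
    """Count rows per target class (0 = legitimate, 1 = phishing)."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row[target_column]) for row in reader)


# Made-up sample standing in for an opened dataset file.
sample = io.StringIO("length_url,phishing\n20,0\n85,1\n31,0\n")
counts = class_distribution(sample)
```

For a real file, the same function could be called on an object returned by `open(path, newline="")`.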
Algorithm 1. Feature extraction process
| Subject | Computer Science |
| Specific subject area | Artificial Intelligence |
| Type of data | csv file |
| How data were acquired | Data were acquired through the publicly available lists of phishing and legitimate websites, from which the features presented in the datasets were extracted. |
| Data format | Raw: csv file |
| Parameters for data collection | For the phishing websites, only those from the PhishTank registry, which are verified by multiple users, were included. For the legitimate websites, we included websites from publicly available, community-labeled and organized lists. |
| Description of data collection | The data comprise features extracted from collections of website addresses. In total, the data consist of 111 features, 96 of which are extracted from the website address itself, while the remaining 15 features were extracted using custom Python code. |
| Data source location | Worldwide |
| Data accessibility | Repository name: Mendeley Data Data identification number: 10.17632/72ptz43s9v.1 Direct URL to data: |
| Related research article | Vrbančič, Grega, Iztok Fister Jr, and Vili Podgorelec. “Parameter setting for deep neural networks using swarm intelligence on phishing websites classification.” International Journal on Artificial Intelligence Tools 28.06 (2019): 1960008. DOI: |