Paul K Mvula, Paula Branco, Guy-Vincent Jourdan, Herna L Viktor.
Abstract
Due to rapid technological advances, more people are moving from traditional ways of doing business to ones that rely more heavily on electronic resources. This transition has attracted (and continues to attract) the attention of cybercriminals, referred to in this article as "attackers", who exploit the structure of the Internet to commit cybercrimes such as phishing. Phishing tricks users into revealing sensitive data, including personal information, banking and credit card details, IDs, passwords, and other important information, via replicas of legitimate websites of trusted organizations. In our digital society, the COVID-19 pandemic represents an unprecedented situation. Many individuals were left vulnerable to cyberattacks while attempting to gather credible information about this alarming situation, and attackers took advantage of it: attacks associated with the pandemic increased dramatically. Cyberattacks do not appear to be abating. For this reason, cyber-security corporations and researchers must constantly develop effective and innovative solutions to tackle this growing issue. Although several anti-phishing approaches are already in use, such as blacklists, visuals, heuristics, and other protective solutions, they cannot efficiently prevent imminent phishing attacks. In this paper, we propose machine learning models that use a limited number of features to classify COVID-19-related domain names as either malicious or legitimate. Our primary results show that a small set of lexical features carefully extracted from domain names allows models to achieve high scores; additionally, the number of subdomain levels, used as a feature, can have a large influence on the predictions.
Keywords: Cybersecurity; Hoeffding trees; Machine learning; Online learning; Phishing attacks; Supervised learning
Year: 2022 PMID: 35611122 PMCID: PMC9119958 DOI: 10.1016/j.eswa.2022.117553
Source DB: PubMed Journal: Expert Syst Appl ISSN: 0957-4174 Impact factor: 8.665
Fig. 1. Anti-phishing Development Areas. Varshney et al. (2016).
List of COVID-19-related keywords.
| corona | covid | ncov | wuhan |
| ncov-19 | virus | covid-19 | covid19 |
| sars | wuhanvirus | novelvirus | chinavirus |
List of features and their descriptions.
| Name | Type | Description |
|---|---|---|
| Length | Numeric | Two features to count the number of words and the number of characters in the domain name. |
| Containing “–” | Numeric | Checks whether the domain contains a hyphen. |
| Entropy of domain | Numeric | Three features to calculate Shannon’s Entropy on different parts of the domain name. |
| Tranco rank | Numeric | Checks whether a domain is on the Tranco ( |
| Ratio of the longest word | Numeric | This feature matches the longest word that a domain name contains and normalizes it by dividing by the length of the domain name. |
| Typo-squatting | Numeric | Checks whether a domain name contains typos by comparing it to a list of misspelled domain names. We generate the list of misspelled domain names from a given list of keywords using dnstwist. |
| Freenom TLD | Numeric | Checks whether the top-level domain belongs to Freenom. |
| Numbers other than “19” | Numeric | Checks whether a domain contains numbers other than 19. Hao et al. ( |
| Number of subdomain levels | Numeric | Counts the number of subdomains in the domain name. For example, if a domain name is “ |
| Label | Numeric | The label of the domain: class 0 denotes malicious domain names and class 1 denotes legitimate domain names. |
Available at https://tranco-list.eu/list/9622.
https://github.com/elceef/dnstwist.
https://www.freenom.com/en.
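Several of the lexical features listed above can be computed from the domain string alone. The following is a minimal sketch, not the authors' implementation. The function names, the choice to compute entropy over the whole name only, and the rule that the last two labels form the registered domain and TLD (so that every label before them counts as one subdomain level) are assumptions made for illustration. Features that need external data (Tranco rank, the dnstwist list, the Freenom TLD list) or word segmentation are left out.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the characters in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(domain: str) -> dict:
    """Compute a few string-only features for a domain name.

    Assumption: the last two labels are the registered domain and the TLD,
    which is wrong for multi-label suffixes such as "co.uk".
    """
    name = domain.lower().strip(".")
    labels = name.split(".")
    digit_runs = re.findall(r"\d+", name)  # maximal runs of digits
    return {
        "length_chars": len(name),
        "has_hyphen": int("-" in name),
        "entropy": shannon_entropy(name),
        "numbers_other_than_19": int(any(run != "19" for run in digit_runs)),
        "subdomain_levels": max(len(labels) - 2, 0),
    }


if __name__ == "__main__":
    # Hypothetical domain names, used only to exercise the function.
    for d in ("covid-19.info.example.com", "covid2020.com"):
        print(d, lexical_features(d))
```

Encoding the yes/no checks as 0 or 1 keeps every feature numeric, in line with the "Type" column of the table. A production version would likely rely on a public-suffix list rather than the two-label rule assumed here.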
Fig. 2. (left) and (right) architectures.
Different undersampling techniques.
| Method | Ratio | Fit-time (s) | G-Mean | F-1 | ROC AUC |
|---|---|---|---|---|---|
| NM | 20:80 | 0.391 | 0.930 | 0.892 | 0.935 |
| CNN | 65:35 | 2164.10 | 0.830 | 0.884 | 0.836 |
| ENN | 10:90 | 0.961 | 0.956 | 0.921 | 0.957 |
| TL | 10:90 | 0.897 | 0.933 | 0.880 | 0.937 |
Fig. 3. Data distribution before (left) and after (right) NM under-sampling.
Average scores of classifiers without “Subdomain Levels” feature.
| Classifier | Fit time | F-1 | Precision | Accuracy | TPR | TNR |
|---|---|---|---|---|---|---|
| DTC | | 85.9324 | 83.8446 | 93.8064 | 89.2683 | 94.9418 |
| RFC | 1.3252 | 91.1245 | 91.7768 | 96.3627 | | 97.7206 |
| GBM | 1.5327 | 90.0024 | 94.3979 | 96.1680 | 86.1179 | 98.6809 |
| XGBoost | 0.9023 | 91.2849 | 92.1589 | 96.4447 | 90.8309 | 97.8487 |
| SVM | 1.2799 | 90.2950 | | 96.3217 | 85.6049 | |
| MLP | 13.1045 | | 93.8070 | | 89.4470 | 98.4441 |
Average scores of classifiers with “Subdomain Levels” feature.
| Classifier | Fit time | F-1 | Precision | Accuracy | TPR | TNR |
|---|---|---|---|---|---|---|
| DTC | | 84.8376 | 81.3734 | 93.1096 | 90.0364 | 93.8779 |
| RFC | 1.170378 | 91.0747 | 90.3244 | 96.2602 | | 97.2400 |
| GBM | 1.344766 | | 95.3655 | | 89.4990 | 98.8986 |
| XGBoost | 0.889119 | 91.6890 | 91.8256 | 96.5984 | 91.8810 | 97.7780 |
| SVM | 1.335580 | 91.9326 | | 96.8955 | 88.3454 | |
| MLP | 9.574650 | 91.3232 | 92.5737 | 96.5318 | 90.3187 | 98.0853 |
Fig. 4. 10-fold cross-validation mean ROC-AUC.
10-fold cross-validation p-values.
| | SVM | DTC | RFC | GBM | XGBoost | MLP |
|---|---|---|---|---|---|---|
| SVM | 1.000000 | 0.491476 | | | | |
| DTC | 0.491476 | 1.000000 | 0.071581 | 0.067157 | 0.064616 | 0.092037 |
| RFC | | 0.071581 | 1.000000 | 0.941467 | 0.824637 | 0.544074 |
| GBM | | 0.067157 | 0.941467 | 1.000000 | 0.755018 | 0.493152 |
| XGBoost | | 0.064616 | 0.824637 | 0.755018 | 1.000000 | 0.420513 |
| MLP | | 0.092037 | 0.544074 | 0.493152 | 0.420513 | 1.000000 |
Performance of HT on ST1 to ST6.
| Stream | Accuracy | | Precision | Recall | F-1 | G-Mean |
|---|---|---|---|---|---|---|
| ST1 | | 0.8755 | 0.9095 | 0.8654 | 0.8869 | 0.9261 |
| ST2 | 0.9715 | 0.8276 | 0.8693 | 0.8187 | 0.8433 | 0.8991 |
| ST3 | 0.9693 | | 0.9621 | 0.8805 | 0.9195 | 0.9343 |
| ST4 | 0.9583 | 0.8653 | 0.9309 | 0.8544 | 0.8910 | 0.9170 |
| ST5 | 0.9371 | 0.8743 | | | | |
| ST6 | 0.9134 | 0.8269 | 0.9377 | 0.8859 | 0.9111 | 0.9130 |
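The F-1 and G-Mean columns are derived from more basic rates. As a rough sketch, assuming the usual definitions (F-1 as the harmonic mean of precision and recall, G-Mean as the geometric mean of the true positive and true negative rates), they can be computed as follows. The function names and the example counts are hypothetical, and the sketch assumes non-zero denominators.

```python
import math


def rates_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive the basic rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also called the true positive rate (TPR)
        "tnr": tn / (tn + fp),     # true negative rate
    }


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def g_mean(tpr: float, tnr: float) -> float:
    """Geometric mean of the true positive and true negative rates."""
    return math.sqrt(tpr * tnr)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    r = rates_from_counts(tp=80, fp=10, tn=900, fn=10)
    print(r)
    print(f1_score(r["precision"], r["recall"]), g_mean(r["recall"], r["tnr"]))
    # Precision and recall reported for ST2 in the table above.
    print(f1_score(0.8693, 0.8187))
```

Applied to the precision and recall reported for ST2, `f1_score` agrees with the tabulated F-1 of 0.8433 to about three decimal places; the small difference comes from rounding in the reported inputs.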
Examples of misspelled domain names generated from the domain “covid.com”.
| Original | Addition | Bitsquatting | Homoglyph | Insertion | Repetition |
|---|---|---|---|---|---|
| covid.com | covidb.com | aovid.com | covld.com | copvid.com | ccovid.com |
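To make the variant types in the table concrete, the sketch below generates four of them (addition, bitsquatting, insertion, repetition) for a single name. It is a simplified illustration and does not reproduce dnstwist's algorithms: the function names are hypothetical, insertion here allows any alphanumeric character at any interior position (real tools may restrict the candidates, for example to keyboard-adjacent keys), and homoglyph substitution is omitted because it needs a character-similarity table.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits


def addition(name: str) -> set:
    """Append one character to the end of the name."""
    return {name + c for c in ALPHABET}


def repetition(name: str) -> set:
    """Double each character of the name in turn."""
    return {name[:i] + ch + name[i:] for i, ch in enumerate(name)}


def insertion(name: str) -> set:
    """Insert one character at each interior position (simplified)."""
    return {
        name[:i] + c + name[i:]
        for i in range(1, len(name))
        for c in ALPHABET
    }


def bitsquatting(name: str) -> set:
    """Flip one bit of one character, keeping only alphanumeric results."""
    out = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in ALPHABET:
                out.add(name[:i] + flipped + name[i + 1:])
    return out


def variants(domain: str) -> dict:
    """Apply every generator to the part of the domain before the first dot."""
    name, _, rest = domain.partition(".")
    generators = {
        "addition": addition,
        "bitsquatting": bitsquatting,
        "insertion": insertion,
        "repetition": repetition,
    }
    return {
        kind: {f"{v}.{rest}" for v in make(name)}
        for kind, make in generators.items()
    }


if __name__ == "__main__":
    for kind, names in variants("covid.com").items():
        print(kind, len(names), sorted(names)[:5])
```

Each example in the table appears in the corresponding set: `covidb.com` under addition, `aovid.com` under bitsquatting (the letters `c` and `a` differ by a single bit), `copvid.com` under insertion, and `ccovid.com` under repetition.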