Jayaram Raghuram¹, David J Miller¹, George Kesidis².
Abstract
We propose a method for detecting anomalous domain names, with a focus on algorithmically generated domain names, which are frequently associated with malicious activities such as fast flux service networks, particularly for bot networks (or botnets), malware, and phishing. Our method learns a (null hypothesis) probability model from a large set of domain names that have been whitelisted by some reliable authority. Since these names are mostly assigned by humans, they are pronounceable, tend to have a distribution of characters, words, word lengths, and number of words that is typical of some language (mostly English), and often consist of words drawn from a known lexicon. In contrast, algorithmically generated domain names currently tend to have distributions that are quite different from those of human-created domain names. We propose a fully generative model for the probability distribution of benign (whitelisted) domain names, which can be used in an anomaly detection setting to identify putative algorithmically generated domain names. Unlike other methods, our approach can make detections without consulting the additional (latency-producing) information sources that are often used to detect fast flux activity. Experiments on a large, publicly available data set of domain names associated with fast flux service networks show encouraging results relative to several baseline methods, with higher detection rates at low false positive rates.
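The general idea of scoring a name against a null model fitted to benign names can be illustrated with a minimal sketch. This is not the authors' generative model, which also covers the number of substrings, their lengths, and word occurrences; it swaps in a simple smoothed character-bigram model. The names `WHITELIST`, `train_bigram_model`, and `anomaly_score`, the alphabet, and the smoothing constant are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical stand-in for a large whitelist of benign domain names.
WHITELIST = [
    "google", "facebook", "wikipedia", "amazon", "youtube", "twitter",
    "linkedin", "microsoft", "apple", "netflix", "weather",
    "musicblog", "newsforum",
]

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"
START, END = "^", "$"  # boundary symbols, assumed not to occur in names


def symbols(name):
    """Lowercase the name, drop characters outside ALPHABET, add boundaries."""
    return [START] + [c for c in name.lower() if c in ALPHABET] + [END]


def train_bigram_model(names, alpha=1.0):
    """Estimate P(next char | previous char) with add-alpha smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for name in names:
        seq = symbols(name)
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    vocab = list(ALPHABET) + [END]  # symbols that can follow another symbol
    model = {}
    for prev in [START] + list(ALPHABET):
        total = sum(counts[prev].values()) + alpha * len(vocab)
        model[prev] = {c: (counts[prev][c] + alpha) / total for c in vocab}
    return model


def anomaly_score(model, name):
    """Average negative log-likelihood per transition under the null model.

    Higher values mean the name is less typical of the benign training set.
    """
    seq = symbols(name)
    nll = -sum(math.log(model[p][c]) for p, c in zip(seq, seq[1:]))
    return nll / (len(seq) - 1)


if __name__ == "__main__":
    model = train_bigram_model(WHITELIST)
    for candidate in ["google", "weatherblog", "xqzvkjw"]:
        print(candidate, round(anomaly_score(model, candidate), 3))
```

In an anomaly detection setting, a name would be flagged when its score exceeds a threshold chosen on held-out benign names to meet a target false positive rate. With a whitelist this small the separation is only indicative.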
Keywords: Algorithmically generated domain names; Anomaly detection; Domain name modeling; Fast flux; Malicious domain names
Year: 2014 PMID: 25685511 PMCID: PMC4294760 DOI: 10.1016/j.jare.2014.01.001
Source DB: PubMed Journal: J Adv Res ISSN: 2090-1224 Impact factor: 10.479
Fig. 1. Plots of empirical PMF of the number of substrings, total length, length of the second substring, and length of the third substring estimated on a data set of normal domain names and on a data set of attack domain names.
Fig. 2. ROC curves for the test statistics based on the joint distribution of character sequences in the substrings parsed out of the domain names.
Fig. 3. ROC curves for the test statistics based on the distribution of the number of substrings, the total length, the length of the individual substrings, and the joint distribution of characters.
Fig. 4. ROC curves for the test statistics based on the modeling of substrings with word occurrences from a word list.
Examples of valid and attack test set domain names shown to illustrate some of the challenges in this detection problem.
| Parsed domain name | | Valid or attack |
|---|---|---|
| nkotb | 0.090852 | Valid |
| kdo od govern | 0.090903 | Attack |
| sua od years | 0.090997 | Attack |
| epupz | 0.091044 | Valid |
| asxetos | 0.092950 | Valid |
| ngo duck half | 0.094218 | Attack |
| cqu od federal | 0.094246 | Attack |
| loser boi music blog spot | 0.094316 | Valid |
| cool veg if exot | 0.094363 | Attack |
| images wun bit ip | 0.094422 | Attack |
| circle mat i me pav | 0.094657 | Attack |
| bauex per ten forum | 0.094719 | Valid |
| kreuz | 0.110932 | Valid |