Emmanuel Gbenga Dada1, Joseph Stephen Bassi1, Haruna Chiroma2, Shafi'i Muhammad Abdulhamid3, Adebayo Olusola Adetunmbi4, Opeyemi Emmanuel Ajibuwa5.
Abstract
The upsurge in the volume of unwanted emails, called spam, has created an intense need for the development of more dependable and robust anti-spam filters. Machine learning methods have recently been used successfully to detect and filter spam emails. We present a systematic review of some of the popular machine learning based email spam filtering approaches. Our review covers a survey of the important concepts, attempts, efficiency, and the research trend in spam filtering. The preliminary discussion in the study background examines the application of machine learning techniques to the email spam filtering process of leading internet service providers (ISPs) such as Gmail, Yahoo and Outlook. We discuss the general email spam filtering process and the various efforts by different researchers to combat spam through the use of machine learning techniques. Our review compares the strengths and drawbacks of existing machine learning approaches and the open research problems in spam filtering. We recommend deep learning and deep adversarial learning as future techniques that can effectively handle the menace of spam emails.
Keywords: Analysis of algorithms; Computer privacy; Computer science; Computer security; Deep learning; Machine learning; Naïve Bayes; Neural networks; Spam filtering; Support vector machines
Year: 2019 PMID: 31211254 PMCID: PMC6562150 DOI: 10.1016/j.heliyon.2019.e01802
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Fig. 1 The volume of spam emails, 4th quarter 2016 to 1st quarter 2018.
Fig. 2 Pictorial representation of the structure of this paper.
Summary of previous reviews in email spam filtering.
| Previous Reviews | Email Spam | Machine Learning | Comparative Analysis | Simulation Tool & Environment | Dataset Corpus | Architecture | Parameters | Period Covered |
|---|---|---|---|---|---|---|---|---|
| Lueg | | | | | | | | 2000–2005 |
| Wang | | | | | | | | 1995–2005 |
| Li et al. | | | | | | | | 1997–2006 |
| Cormack | | | | | | | | 2000–2008 |
| Sanz et al. | | | | | | | | 2000–2008 |
| Dhanaraj and Karthikeyani | | | | | | | | 1994–2013 |
| Bhowmick and Hazarika | | | | | | | | 2004–2013 |
| Laorden | √ | √ | | | | | | 2002–2014 |
| Our Review | | | | | | | | 2000–2018 |
Fig. 3 Email server spam filtering architecture.
Publicly available email spam corpus.
| Dataset name | Spam | Non-spam | Rate of spam | Year of creation | References |
|---|---|---|---|---|---|
| Spam archive | 15,090 | 0 | 100% | 1998 | Almeida and Yamakami |
| Spambase | 1,813 | 2,788 | 39% | 1999 | Sakkis et al. |
| Lingspam | 481 | 2,412 | 17% | 2000 | Sakkis et al. |
| PU1 | 481 | 618 | 44% | 2000 | Attar et al. |
| SpamAssassin | 1,897 | 4,150 | 31% | 2002 | Apache SpamAssassin |
| PU2 | 142 | 579 | 20% | 2003 | Zhang et al. |
| PU3 | 1,826 | 2,313 | 44% | 2003 | Zhang et al. |
| PUA | 571 | 571 | 50% | 2003 | Zhang et al. |
| Zh1 | 1,205 | 428 | 74% | 2004 | Zhang et al. |
| Gen spam | 31,196 | 9,212 | 78% | 2005 | Cormack and Lynam |
| TREC 2005 | 52,790 | 39,399 | 57% | 2005 | Androutsopoulos et al. |
| Biggio | 8,549 | 0 | 100% | 2005 | Biggio et al. |
| Phishing corpus | 415 | 0 | 100% | 2005 | Abu-Nimeh et al. |
| Enron-Spam | 20,170 | 16,545 | 55% | 2006 | Koprinska et al. |
| TREC 2006 | 24,912 | 12,910 | 66% | 2006 | Androutsopoulos et al. |
| TREC 2007 | 50,199 | 25,220 | 67% | 2007 | DeBarr and Wechsler |
| Princeton spam image benchmark | 1,071 | 0 | 100% | 2007 | Wang et al. |
| Dredze image spam dataset | 3,297 | 2,021 | 62% | 2007 | Dredze, Gevaryahu and Elias-Bachrach |
| Hunter | 928 | 810 | 53% | 2008 | Gao et al. |
| Spamemail | 1,378 | 2,949 | 32% | 2010 | CSMining Group |
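Filters evaluated on corpora such as those in the table above are typically trained as text classifiers; a minimal sketch of one common baseline, a multinomial Naïve Bayes filter with Laplace smoothing, is shown below. The toy messages are invented for illustration and are not drawn from any of the listed datasets.

```python
import math
from collections import Counter

# Invented toy corpus; real filters train on corpora such as Spambase
# or Ling-Spam. Label 1 = spam, 0 = non-spam (ham).
train = [
    ("win cash prize now", 1),
    ("free money offer now", 1),
    ("cheap pills free offer", 1),
    ("meeting agenda attached", 0),
    ("lunch with the team", 0),
    ("project report attached", 0),
]

def fit(data):
    """Count word occurrences per class for multinomial Naive Bayes."""
    counts = {0: Counter(), 1: Counter()}
    docs = {0: 0, 1: 0}
    for text, label in data:
        docs[label] += 1
        counts[label].update(text.split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def predict(text, counts, docs, vocab):
    """Return the class with the highest posterior log-probability."""
    total = docs[0] + docs[1]
    scores = {}
    for c in (0, 1):
        score = math.log(docs[c] / total)  # log prior
        n = sum(counts[c].values())
        for w in text.split():
            # Laplace-smoothed log likelihood of word w under class c
            score += math.log((counts[c][w] + 1) / (n + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

counts, docs, vocab = fit(train)
print(predict("free cash offer", counts, docs, vocab))      # → 1 (spam)
print(predict("team meeting report", counts, docs, vocab))  # → 0 (ham)
```

On a real corpus the same logic applies, only with tokenization, stop-word removal and a train/test split added.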
Levels of cost sensitivity of model.
| λ | Maximum Tolerance Level | Significance of the cost sensitivity |
|---|---|---|
| 999 | 0.999 | Filtered messages are discarded and no additional processing is carried out. |
| 9 | 0.9 | Filtering a non-spam message is penalized slightly more than allowing a spam message through; re-sending a wrongly filtered legitimate message is more cumbersome than deleting a spam message manually. |
| 1 | 0.5 | Used when the receiver is not concerned about missing a non-spam message. |
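The tolerance levels above appear to follow the cost-sensitive decision rule common in the spam-filtering literature: if misclassifying a legitimate message is λ times costlier than missing a spam, a message is flagged as spam only when P(spam | x) exceeds the threshold t = λ/(λ + 1). A minimal sketch of this relation (the function names are illustrative):

```python
def spam_threshold(lam: float) -> float:
    """Maximum tolerance level t for cost factor lambda: t = lam / (lam + 1)."""
    return lam / (lam + 1)

def is_spam(p_spam: float, lam: float) -> bool:
    """Flag a message as spam only if its posterior exceeds the threshold."""
    return p_spam > spam_threshold(lam)

# The three cost levels from the table above.
for lam in (999, 9, 1):
    print(lam, spam_threshold(lam))  # → 0.999, 0.9, 0.5
```

Note that a posterior of 0.95 is enough to filter a message at λ = 9 but not at λ = 999, which is why the strictest setting is reserved for filters that delete messages outright.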
Fig. 4 Architecture of neural network (NN) classifier.
Fig. 5 Rough Set (RS) email filtering process workflow from user mailbox.
Fig. 6 Decision tree algorithm for email spam filtering.
Summary of published papers that attempted spam filtering using Machine Learning techniques.
| Reference | Dataset Description | Proposed Technique | Compared Algorithm(s) | Performance Metrics | Limitation(s) |
|---|---|---|---|---|---|
| Karthika and Visalakshi | Spambase dataset | Hybridised ACO and SVM | Hybridised ACO and SVM with KNN, NB and SVM. | Accuracy, precision and recall | Very low performance. |
| Awad and Foqaha | Spambase dataset | Combined Radial Basis Function Neural Networks (RBFNN) with PSO algorithm (HC-RBFPSO) | PSO, RBFNN, MLP and ANN | Accuracy | Time taken to build MLP is very high. |
| Sharma and Suryawanshi | Spambase dataset | kNN classification with Spearman correlation | kNN with Spearman and kNN with Euclidean | Accuracy, precision, recall, and F-measure. | Low performance |
| Awad and ELseuofi | SpamAssassin | Bayesian classification, k-NN, ANNs, SVMs, Artificial Immune System and Rough sets | Bayesian classification, k-NN, ANNs, SVMs, Artificial Immune System and Rough sets. | Recall, precision and accuracy | Many of the state-of-the-art spam classification techniques were not examined. |
| Rajamohana, Umamaheswari and Abirami | Dataset built by Ott et al. (2011) | Adaptive binary flower pollination algorithm (ABFPA) | ABFPA, BPSO and SFLA for feature selection; ABFPA, NB and kNN for classification | Global best positions. | Standard evaluation metrics were not used to evaluate the performance of the proposed method. |
| Alkaht and Al Khatib | Randomly collected emails | Multi-stage neural networks for filtering spam | NN, MLP and Perceptron | Accuracy. | The method was not evaluated on a standard email corpus, and the training was time-consuming. |
| Sharma, Prajapat and Aslam | TREC07 dataset | MLP | MLP and NB | Accuracy, precision, and recall | Low performance |
| Mousavi and Ayremlou | Selected emails. | NB algorithm for spam classification | Not compared | Precision and Recall | No meaningful contribution to knowledge. Also, the performance of the method was not compared with other existing methods |
| Dhanaraj and Palaniswami | CSDMC2010 dataset | Firefly and Bayes classifiers | Firefly, NB, NN and PSO algorithm. | Sensitivity, specificity and accuracy | Low performance. |
| Choudhary and Dhaka | Words in data dictionary | GA | Not compared | Not stated | Performance not compared with other techniques. |
| Palanisamy, Kumaresan and Varalakshmi | Ling dataset | Negative selection and PSO | NSA, PSO, SVM, NB and DFS-SVM | Accuracy | Only accuracy was used to assess its performance. |
| Shrivastava and Bindu | 2248 emails | GA with Heuristic Fitness Function | Not compared | Classification accuracy. | Accuracy of the method not compared with that of other techniques. |
| Zavvar, Rezaei and Garavand | Spambase dataset | PSO, ANN and SVM | PSO, SOM, kNN and SVM | AUC | AUC was the only performance metric reported. |
| Idris and Mohammad | Datasets from Machine Learning and intelligent system | AIS | Not compared | False positive rate. | No standard metric was used to evaluate its performance, nor was its effectiveness compared with any other standard spam filtering method. |
| Sosa | 2200 e-mails from several senders to various receivers | Forward feature selection using a single-layer ANN as classifier with double 5-fold cross-validation | Not compared | Classification accuracy. | The effectiveness of the method was not compared with other known techniques. |
| Karthika, Visalakshi and Sankar | Spambase | GA-NB and ACO-NB | GA-NB and ACO-NB | Accuracy, recall, precision and F-measure. | There is no performance gain in the proposed algorithm compared to the existing approaches. |
| Bhagyashri and Pratap | SpamAssassin | Bayes algorithm | Not compared | Precision, recall and accuracy. | The performance of the method was not compared with other standard algorithms. |
| Zhao and Zhang | Spambase | Rough Set | RS and NB | Classification Accuracy, Precision and Recall. | Low performance |
| Kumar and Arumugan | Collected emails | Probabilistic neural network for classification of spam mails while Particle Swarm Optimization is used for feature selection | PNN, BLAST and NB | Specificity and sensitivity. | Low performance |
| Akinyelu and Adewumi | 2000 phishing and ham mails | Random Forest | Compared with Fette et al | False Positive and False Negative | Adequate performance metrics not used to evaluate the effectiveness of the method. |
| Akshita | PU1, PU2, PU3, PUA and Enron Spam | Deep Learning for Java (DL4J) Deep Networks | Dense MLP, SDAE and DBN | Accuracy, Recall, Precision and F1 | Time consuming training |
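The performance metrics that recur throughout the table above (accuracy, precision, recall, F-measure) all derive from the confusion matrix, with spam as the positive class. A minimal sketch with invented counts:

```python
def metrics(tp: int, fp: int, fn: int, tn: int):
    """Standard spam-filtering metrics from a confusion matrix.

    tp: spam correctly filtered; fp: ham wrongly filtered;
    fn: spam missed; tn: ham correctly passed.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Invented counts for illustration: 90 spam caught, 10 ham flagged,
# 5 spam missed, 95 ham passed.
acc, prec, rec, f1 = metrics(tp=90, fp=10, fn=5, tn=95)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# → 0.925 0.9 0.947 0.923
```

For spam filtering, precision matters most when false positives (lost legitimate mail) are costly, which is why several of the surveyed papers are faulted for reporting accuracy alone.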
[Algorithm listing: step 1, "Find Email Message class labels"; the remaining steps are not recoverable from the extraction.]