| Literature DB >> 35983156 |
Abstract
E-mail service providers and consumers find it challenging to distinguish between spam and nonspam e-mails. Spammers aim to spread false information by sending annoying messages that catch the public's attention. Various spam identification techniques have been proposed and evaluated in the past, but the results show that more research is required to improve accuracy and to reduce training time and error rate. This research therefore proposes a novel machine learning-based hybrid bagging method for e-mail spam identification that combines two machine learning methods: random forest and J48 (decision tree). The proposed framework categorizes e-mails into ham and spam. The database is split into multiple sets, and each set is provided as input to each method. In the preprocessing stage, tokenization, stemming, and stop word removal are performed. Correlation feature selection (CFS) is then employed to select the required features from the preprocessed data. The effectiveness of the presented method is evaluated in terms of true-negative rate, accuracy, recall, precision, false-positive rate, F-measure, and false-negative rate, and the outcomes of three studies are compared. According to the results, the presented hybrid bagged model-based SMD technology achieved 98 percent accuracy.
Year: 2022 PMID: 35983156 PMCID: PMC9381222 DOI: 10.1155/2022/2500772
Source DB: PubMed Journal: Comput Intell Neurosci
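
The abstract names correlation feature selection (CFS) but does not spell out how a feature subset is scored. As a rough illustration only, the sketch below uses the commonly cited CFS merit, `k * r_cf / sqrt(k + k*(k-1) * r_ff)`, where `r_cf` is the mean feature-to-class correlation and `r_ff` the mean feature-to-feature correlation. Using Pearson correlation, the helper names `pearson` and `cfs_merit`, and the toy term-count columns are all assumptions made here, not details taken from the paper.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(vx * vy)

def cfs_merit(columns, labels):
    """Merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    k = len(columns)
    # Mean absolute correlation between each feature and the class label.
    r_cf = sum(abs(pearson(c, labels)) for c in columns) / k
    # Mean absolute correlation between every pair of features.
    pairs = list(combinations(columns, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical toy data: 1 = spam, 0 = ham; columns are term indicators.
labels = [1, 1, 1, 0, 0, 0]
free = [1, 1, 0, 0, 0, 0]
free_copy = list(free)          # a perfectly redundant feature
meeting = [0, 0, 0, 1, 1, 0]    # a feature that adds new information

redundant = cfs_merit([free, free_copy], labels)
complementary = cfs_merit([free, meeting], labels)
print(redundant, complementary)
```

In this toy case the pair containing a duplicated column scores lower than the pair of less correlated columns, which is the behaviour CFS relies on to avoid redundant features.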
Figure 1. Common e-mail filter procedure.
Figure 2. Mechanism process of RF.
Figure 3. Decision tree structure.
Figure 4. System design.
Figure 5. E-mail classification based on spam mail identification.
Mail database.
| Database | Random forest | Decision tree |
|---|---|---|
| Training ham mails | 180 | 120 |
| Training spam mails | 130 | 170 |
| Testing ham mails | 150 | 50 |
| Testing spam mails | 110 | 90 |
| Average mails | 500 | 500 |
| Overall | 1000 | |
Figure 6. Bagging approach.
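
The bagging idea described in the abstract (split the data into multiple sets, train a learner on each, then combine the outputs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: one-split decision stumps stand in for J48 and random forest trees, even-numbered models see all features ("tree-style") while odd-numbered models see a random feature subset ("forest-style"), and the names `fit_stump`, `bagging_fit`, `bagging_predict`, and the toy term counts are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Train a one-split tree: pick the (feature, threshold) with fewest errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if r[f] <= t else right) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left, right)
    _, f, t, left, right = best
    return lambda row: left if row[f] <= t else right

def bagging_fit(rows, labels, n_models=11, seed=0):
    """Train each model on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    models = []
    for i in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sample_rows = [rows[j] for j in idx]
        sample_labels = [labels[j] for j in idx]
        if i % 2 == 0:
            feats = list(range(n_features))  # tree-style: all features
        else:
            # forest-style: random subset of the features
            feats = rng.sample(range(n_features), max(1, n_features // 2))
        models.append(fit_stump(sample_rows, sample_labels, feats))
    return models

def bagging_predict(models, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(m(row) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical term counts per mail: [free, click, meeting]; 1 = spam, 0 = ham.
rows = [[3, 2, 0], [2, 1, 0], [4, 1, 0], [1, 3, 0], [2, 2, 0],
        [0, 0, 1], [0, 0, 2], [0, 0, 1], [0, 0, 3], [0, 0, 2]]
labels = [1] * 5 + [0] * 5

models = bagging_fit(rows, labels)
print(bagging_predict(models, [3, 1, 0]), bagging_predict(models, [0, 0, 2]))
```

An odd number of models is used so that a two-class majority vote cannot tie.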
Working process.
| Parameter | Example |
|---|---|
| I/P | Subject: A new way to shop! Get newpass free for a year & enjoy benefits across brands! Continue to earn a minimum 5% Newcoins! Terms and conditions applied. Click here for more detail |
| Tokenization | “Subject” “:” “A” “new” “way” “to” “shop” “!” “Get” “newpass” “free” “for” “a” “year” “&” “enjoy” “benefits” “across” “brands” “!” “Continue” “to” “earn” “a” “minimum” “5%” “Newcoins” “!” “Terms” “and” “condition” “applied” “.” “Click” “here” “for” “more” “detail” |
| Stop word elimination | “new” “way” “shop” “Get” “newpass” “free” “year” “enjoy” “benefits” “across” “brands” “Continue” “earn” “minimum” “5%” “Newcoins” “Terms” “condition” “applied” “.” “Click” “here” “more” “detail” |
| Stemming | “new” “way” “shop” “Get” “newpass” “free” “year” “enjoy” “benefits” “across” “brands” “Continue” “earn” “minimum” “5%” “Newcoins” “Terms” “condition” “applied” “.” “Click” “here” “more” “detail” |
| Outcome | Spam mail |
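
The three preprocessing steps in the table above can be sketched as follows. This is only an illustration: the stop word list is a tiny hand-picked set, the `stem` function is a naive suffix stripper rather than a real stemmer such as Porter's, and unlike the table it lowercases stems and drops all punctuation tokens. All function names are hypothetical.

```python
import re

# Tiny illustrative stop word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "and", "for", "to", "the", "of", "in", "subject"}

def tokenize(text):
    """Split into word tokens (with an optional trailing %) and single punctuation marks."""
    return re.findall(r"\w+%?|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop stop words and punctuation-only tokens."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and re.search(r"\w", t)]

def stem(token):
    """Naive suffix stripping; a production system would use a proper stemmer."""
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = "Terms and conditions applied. Click here for more detail"
tokens = tokenize(text)
kept = remove_stop_words(tokens)
stems = [stem(t) for t in kept]
print(tokens)
print(kept)
print(stems)
```

The resulting stems would then be turned into features (for example term counts) before feature selection and classification.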
SMD evaluation measure.
| Assessment parameter | Specification | Model |
|---|---|---|
| Precision | The efficacy of the classifier is defined by precision. | $TP / (TP + FP)$ |
| Accuracy | The proportion of correctly forecasted values to the overall set. | $(TP + TN) / (TP + TN + FP + FN)$ |
| Recall | The proportion of positively labeled data that the classifier identifies as positive. | $TP / (TP + FN)$ |
| F-measure | Overall quality, demonstrated by the classifier's ability to produce efficient, beneficial results. | $2 \times (Precision \times Recall) / (Precision + Recall)$ |
| True-negative rate (TNR) | Spam mails correctly identified, as a percentage of all spam mails. | $TN / (TN + FP)$ |
| False-negative rate (FNR) | The proportion of spam e-mails that have been missed. | $FN / (FN + TP)$ |
| False-positive rate (FPR) | The number of spam e-mails mistakenly detected, as a proportion of overall spam mails. | $FP / (FP + TN)$ |
| True positive (TP) | The number of ham e-mails that were accurately detected. | — |
| False negative (FN) | The number of ham mails that were mistakenly classified as spam. | — |
| False positive (FP) | The number of spam messages that were mistakenly recognized as ham. | — |
| True negative (TN) | The number of spam e-mails that were appropriately detected. | — |
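
As a worked illustration of the measures listed above, the sketch below computes them from the four confusion-matrix counts, assuming the standard definitions. The function name `evaluation_measures` and the example counts are hypothetical and are not taken from the paper's results; the function also assumes that no denominator is zero.

```python
def evaluation_measures(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts.

    Assumes the standard definitions and non-zero denominators.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "true_negative_rate": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Hypothetical counts for 100 test mails.
m = evaluation_measures(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Note that the true-negative rate and false-positive rate share a denominator and sum to 1, as do recall and the false-negative rate.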
Analysis outcome (1).
| Parameter | Random forest | Decision tree | Hybrid bagging |
|---|---|---|---|
| Accuracy | 84 | 92 | 88 |
| Precision | 85 | 94 | 90 |
| Recall | 82 | 90 | 86 |
| F-score | 90 | 85 | 88 |
Figure 7. SMD parameter outcome on accuracy, precision, recall, and F-score.
Analysis outcome (2).
| Parameter | Random forest | Decision tree | Hybrid bagging |
|---|---|---|---|
| True positive | 82 | 90 | 86 |
| False positive | 15 | 5 | 11 |
| True negative | 87 | 93 | 84 |
| False negative | 20 | 10 | 14 |
Figure 8. Analysis outcome of true and false positive and negative.
Analysis outcome (3).
| Parameter | Random forest | Decision tree | Hybrid bagging |
|---|---|---|---|
| True-positive rate | 87 | 95 | 89 |
| False-positive rate | 20 | 10 | 16 |
| False-negative rate | 15 | 5 | 12 |
Figure 9. Analysis outcome of TN, FN, and FP.