Safaa S I Ismail, Romany F Mansour, Rasha M Abd El-Aziz, Ahmed I Taloba.
Abstract
Computing is now ubiquitous, and most people carry smart devices that let them communicate worldwide without range limitations. Messaging applications and phone calls carry much of this traffic, but e-mail remains the main professional channel: businesses and commercial and noncommercial organizations use it to communicate with one another and to share official documents and reports globally. That reach also attracts attackers and intruders, who craft false messages with attractive content and send them as e-mail to users around the world. Such unsolicited advertising or threatening messages are called spam, also known as junk mail, and typically contain advertisements, promotions for a company or institution, and similar content. E-mail is usually used to deliver text for business and official purposes, but sometimes voice instructions or messages must be sent to the recipient over the same e-mail pathway. Such voice-oriented e-mail is called voice mail. A voice mail delivers spoken instructions or information so that the recipient can carry out a particular task or receive an important message, and a voice-mail-enabled system lets users communicate with one another through speech input.
Voice mails are usually composed on personal computers or laptops and exchanged over the general e-mail pathway or through separate paid or free mail gateways. Spam in text e-mail has been studied extensively and several solutions exist, but comparable protections for voice-based e-mail are lacking. This paper presents a hybrid data processing mechanism, called Genetic Decision Tree Processing with Natural Language Processing (GDTPNLP), that handles both text-enabled and voice-enabled e-mails. The proposed approach identifies spam in both textual and speech-enabled e-mails. GDTPNLP provides a higher spam detection rate, together with improvements in text extraction speed, performance, cost efficiency, and accuracy. These results are explained in detail, with graphical output views, in the Results and Discussion section.
Year: 2022 PMID: 35371228 PMCID: PMC8970896 DOI: 10.1155/2022/7710005
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Sample spam e-mail.
Figure 2. E-mail spam filter procedure model.
Figure 3. Genetic procedure: schematic structure.
Description of dataset attributes.
| Attributes | Type | Description |
|---|---|---|
| 1–48 | word_freq_WORD | Percentage of words in the e-mail that match WORD |
| 49–54 | char_freq_CHAR | Percentage of characters in the e-mail that match CHAR |
| 55 | capital_run_length_average | Average length of consecutive capital letter sequences |
| 56 | capital_run_length_longest | Length of the longest consecutive capital letter sequence |
| 57 | capital_run_length_total | Total number of capital letters in the e-mail |
| 58 | Class attribute | Whether the e-mail is spam (class label 1) or not spam (class label 0) |
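The three capital-run attributes in the table can be derived directly from the message text. The following is a minimal sketch, assuming a run is a maximal sequence of consecutive ASCII capital letters; the function name `capital_run_features` is hypothetical and is not taken from the paper.

```python
import re


def capital_run_features(text: str) -> dict:
    """Compute capital-run statistics for one e-mail body.

    Assumption: a "run" is a maximal sequence of consecutive ASCII
    capital letters (A-Z); the paper does not spell out its definition.
    """
    # Lengths of every maximal run of capital letters in the text.
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return {
            "capital_run_length_average": 0.0,
            "capital_run_length_longest": 0,
            "capital_run_length_total": 0,
        }
    return {
        "capital_run_length_average": sum(runs) / len(runs),
        "capital_run_length_longest": max(runs),
        "capital_run_length_total": sum(runs),
    }


# Example message with three capital runs: WIN (3), FREE (4), NOW (3).
features = capital_run_features("WIN a FREE prize NOW")
print(features)
```

Long or frequent capital runs are a common surface cue in promotional messages, which is presumably why these attributes appear alongside the frequency attributes.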
Confusion Matrix.
| Actual (rows) / Predicted (columns) | Ham | Spam |
|---|---|---|
| Ham | TN | FP |
| Spam | FN | TP |
Note. Spam is the positive class. TN: True Negative (ham predicted as ham), TP: True Positive (spam predicted as spam), FP: False Positive (ham predicted as spam), FN: False Negative (spam predicted as ham).
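The four cells of the confusion matrix are enough to derive the usual summary metrics. The sketch below treats spam as the positive class, so a false positive is a ham message flagged as spam and a false negative is a spam message that slips through as ham. The function names and the toy labels are illustrative assumptions, not data from the paper.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TN, FP, FN, TP for a binary spam/ham task."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # spam correctly flagged as spam
        elif p == positive:
            fp += 1  # ham wrongly flagged as spam
        elif a == positive:
            fn += 1  # spam wrongly passed as ham
        else:
            tn += 1  # ham correctly passed as ham
    return tn, fp, fn, tp


def summary_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision, and recall from the four counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, invented for illustration only.
actual = ["ham", "ham", "spam", "spam", "spam", "ham"]
predicted = ["ham", "spam", "spam", "ham", "spam", "ham"]

counts = confusion_counts(actual, predicted)
scores = summary_metrics(*counts)
print(counts, scores)
```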
Accuracy and frequency comparison of the proposed algorithm with several traditional algorithms.
| Algorithm | Frequency (%) | Accuracy (%) |
|---|---|---|
| Naïve Bayes | 81 | 80 |
| Support Vector Machine | 90 | 90 |
| Nearest Neighbor | 89 | 89 |
| J48 | 89.7 | 89.7 |
| GDTPNLP | 95.9 | 98.6 |
Figure 4. Proposed algorithm GDTPNLP accuracy and frequency comparison with different traditional algorithms.
Figure 5. Graphical representation of textual feature extraction accuracy.
Textual feature extraction accuracy.
| Data size (bytes) | SVM | GDTPNLP |
|---|---|---|
| 100 | 45 | 85 |
| 200 | 56 | 92 |
| 300 | 62 | 93 |
| 400 | 65 | 95 |
| 500 | 69 | 99.6 |
Figure 6. Graphical representation of speech-to-text conversion accuracy.
Speech-to-text conversion accuracy.
| Voice data (dB) | With NLP | Without NLP |
|---|---|---|
| 100 | 99.8 | 85 |
| 200 | 99 | 92 |
| 300 | 98.6 | 93 |
| 400 | 99.5 | 95 |
| 500 | 99.8 | 99.6 |
Figure 7. Graphical representation of classification accuracy.
Classification accuracy.
| Data size (bytes) | SVM | DT classifier |
|---|---|---|
| 100 | 89.6 | 99.8 |
| 200 | 72.5 | 98.6 |
| 300 | 68.2 | 98.5 |
| 400 | 59.5 | 98.3 |
| 500 | 51.6 | 98.1 |
CPU execution time for e-mail spam detection.
| Features | Computation time (s) |
|---|---|
| Words | 12196.1 |
| 3 grams | 44605.3 |
| 4 grams | 87519.42 |
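The feature types in the table (words, 3-grams, 4-grams) can be illustrated with a short extraction sketch. The source does not say whether the n-grams are taken over characters or over words, so the character-level choice here is an assumption, as are the helper names `ngrams`, `word_counts`, and `char_ngram_counts`.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all contiguous n-item windows of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def word_counts(text):
    """Bag-of-words counts using simple whitespace tokenization."""
    return Counter(text.lower().split())


def char_ngram_counts(text, n):
    """Character n-gram counts (assumed level; the paper may use words)."""
    return Counter("".join(gram) for gram in ngrams(text.lower(), n))


message = "free free"
words = word_counts(message)
trigrams = char_ngram_counts(message, 3)
fourgrams = char_ngram_counts(message, 4)
print(words, trigrams, fourgrams)
```

On real corpora the number of distinct n-grams tends to grow with n, which is one plausible reason the reported computation time increases from words to 3-grams to 4-grams.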
Figure 8. Buffer size vs. CPU time.