Majdi K Qabalin, Muawya Naser, Mouhammd Alkasassbeh.
Abstract
Smartphones are an essential part of all aspects of our lives. Socially, politically, and commercially, there is almost complete reliance on smartphones as a communication tool, a source of information, and a means of entertainment. Rapid developments in information and cyber security have necessitated close attention to the privacy and protection of smartphone data. Spyware detection systems have recently been developed as a promising solution for protecting smartphone users' privacy. Android is the most widely used mobile operating system worldwide, making it a significant target for parties interested in compromising smartphone users' privacy. This paper introduces a novel dataset collected in a realistic environment, obtained through a novel data collection methodology based on a unified activity list. The data are divided into three main classes: the first class represents normal smartphone traffic; the second class represents traffic data for the spyware installation process; and the third class represents spyware operation traffic data. The random forest classification algorithm was adopted to validate this dataset and the proposed model. Two methodologies were adopted for data classification: binary-class and multi-class classification. The overall average accuracy was 79% for the binary-class classification and 77% for the multi-class classification. In the multi-class approach, the detection accuracy for the spyware systems (UMobix, TheWiSPY, MobileSPY, FlexiSPY, and mSPY) was 90%, 83.7%, 69.3%, 69.2%, and 73.4%, respectively; in binary-class classification, the detection accuracy for the same systems was 93.9%, 85.63%, 71%, 72.3%, and 75.96%, respectively.
Keywords: machine learning; privacy; random forest; spying systems; spyware; spyware dataset; stalkerware
Year: 2022 PMID: 35957337 PMCID: PMC9371186 DOI: 10.3390/s22155765
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Number of cyberattacks against Kaspersky Android mobile solutions users 2020–2021 [5].
Figure 2. Spyware detection approaches.
Summary of the reviewed state of the art in this area of study.
| Paper | Dataset Type | Permissions/Network Traffic | Data Analysis |
|---|---|---|---|
| Conti et al. | Limited with only one spying system | Network traffic-based | Machine learning-based |
| Ali-Gombe et al. | Malware only | Permissions-based | Statistical-based |
| Saad et al. | No dataset | Permissions | Fuzzy logic |
| Carlsson et al. | No dataset | Permissions-based | Statistical-based |
| Abualola et al. | Generic dataset | Internal binary code-based | Statistical-based |
| Pierazzi et al. | Generic dataset | Permissions-based | Machine learning-based |
| Han et al. | Malware only | Internal binary code-based | Machine learning-based |
| Kaur et al. | No dataset | Internal binary code-based | Machine learning-based |
| Vanjire et al. | Malware only | Internal binary code-based | Machine learning-based |
| Sutter et al. | No dataset | Internal binary code-based | Machine learning-based |
| Malik et al. | Limited with simple spyware | Network traffic-based | Machine learning-based |
| Anshul et al. | Malware only | Network traffic-based | Machine learning-based |
| Taylor et al. | Generic dataset | Network traffic-based | Machine learning-based |
Dataset identification details.
| Parameter | Value |
|---|---|
| Dataset Title | Android spyware |
| Data Type | PCAP files, CSV files |
| Data Class | Multivariate |
| Data Source | Android-based spyware tools |
| Applications Targeted | FlexiSPY, MobileSPY, mSPY, TheWiSPY, and UMobix |
| Data Format | PCAP files |
| Number of Files | 24 files |
| Total Data Size | 350 MB |
| Collection Strategy | Unified activity list |
| Data Scope | OSI layers 2–7; data link, network, transport, session, presentation, and application |
| Deployment Approach | Hybrid, host-based, and network-based |
| Time Constraints | Unified time interval |
| Number of Classes | Three classes: normal class, installation class, and operation class |
| File Integrity | MD5 |
| License Type | CC BY 4.0 |
| Data Privacy Compliance | GDPR, PDPC |
| Dataset Validation Technique | Confusion matrix |
| Data Collection Tool | PCAPDroid |
| Data Conversion Tool | CICFlowMeter |
| Data Preparation Tool | Tamr Unify |
| Data ML Analyzer Tool | Weka |
| Other Tools | Dr. Fone Root |
Spyware applications adopted in this research.
| System | Cost | Compatibility |
|---|---|---|
| mSPY | USD 240 | Rooted, non-rooted |
| uMobix | USD 320 | Rooted |
| MobileSPY | USD 230 | Rooted |
| FlexiSPY | USD 285 | Rooted, non-rooted |
| TheWiSPY | USD 325 | Rooted, non-rooted |
High-level summary of the applications’ features.
| System | Spying Scope | Platform | Upload | Sniffing |
|---|---|---|---|---|
| mSPY | Social media apps, keylogger, OS activity, update history, applications manifest, phone calls, microphone | Java, Kotlin | Periodic-based with fixed time interval | Events-based |
| uMobix | Social media apps, keylogger, OS activity, applications manifest, phone calls, microphone | PhoneGap, Java | Adjustable periodic | Adjustable in terms of periodic or event-based |
| MobileSPY | Social media apps, keylogger, OS activity, update history, applications, phone calls, SIM tracker | React Native, Java | Non-adjustable periodic | Events-based |
| FlexiSPY | Social media apps, keylogger, OS activity, update history, applications manifest, phone calls | Pure Java | Periodic-based with fixed time interval | Events-based |
| TheWiSPY | Social media apps, keylogger, OS activity, phone calls, microphone | React Native, Java | Adjustable periodic | Adjustable in terms of periodic or event-based |
Dataset files list and description.
| File Name | System Name | File Size | MD5 Hash | Data Tag |
|---|---|---|---|---|
| Normal_Traffic.pcap | SmartPhone Normal Traffic | 78.81 MB | 0151d5fc110f6f7a97ee52be29c99c9a | Normal Traffic |
| uMobix_Installation.pcap | uMobix | 14.37 MB | adab9d323fe85115a8cd8b38fcf45b0a | uMobix Inst. |
| uMobix_Traffic.pcap | uMobix | 16.28 MB | d1a8bbe1e6c0ad85ddb3ae3f0386cf83 | uMobix Traffic |
| TheWiSPY_Installation.pcap | TheWiSPY | 53.24 MB | 3b45d0ae1f1c9ca9c6b4542a6c956280 | TheWiSPY Inst. |
| TheWISPY_Traffic.pcap | TheWiSPY | 21.36 MB | f742fe72b9591a6a66e662eafc991c1b | TheWiSPY Traffic |
| mSPY Installation Process.pcap | mSPY | 11.34 MB | a5c90fbbefeb789fcacce36fd69a830a | mSPY Inst. |
| Mspy Traffic- Part1.pcap | mSPY | 25.94 MB | 298d7830454d522524c1fa6e98df9a99 | mSPY Traffic |
| Mspy Traffic- Part2.pcap | mSPY | 20.32 MB | 4ba6bd67e977126087a542715cf8143e | mSPY Traffic |
| MobileSpy_Traffic.pcap | MobileSPY | 12.76 MB | 8d7ec5fef06a896708dc486c6004e9c3 | MobileSPY Traffic |
| Mobilespy_Intallation_01.pcap | MobileSPY | 8.41 MB | 74200634455d33d5501622213f1ee8d0 | MobileSPY Inst |
| FlexSPY_Traffic.pcap | FlexiSPY | 22.32 MB | 3baf2d16713f8d94b1ff723061b8de09 | FlexiSPY Traffic |
| FlexiSPY_Installation.pcap | FlexiSPY | 16.78 MB | 570a6ddfffd72bb4f132823174cade66 | FlexiSPY Inst |
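The MD5 hashes listed above allow a downloaded copy of the dataset to be checked for integrity. The following is a minimal sketch of such a check, assuming the files sit in a single local directory; the helper names `md5_of_file` and `verify_dataset` are hypothetical, and only two of the listed digests are included for brevity.

```python
import hashlib
from pathlib import Path

# Expected digests copied from the file list above (two entries for brevity).
EXPECTED_MD5 = {
    "Normal_Traffic.pcap": "0151d5fc110f6f7a97ee52be29c99c9a",
    "uMobix_Traffic.pcap": "d1a8bbe1e6c0ad85ddb3ae3f0386cf83",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == want) if path.is_file() else None
    return results
```

Calling `verify_dataset("path/to/dataset")` returns one entry per listed file, so a mismatch or a missing capture is visible at a glance.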
Figure 3. Data distribution and volume.
Network protocol distribution according to class.
| Protocol | Class A | Class B | Class C |
|---|---|---|---|
| TCP | 77,679 | 105,347 | 88,833 |
| QUIC | 1411 | 196 | 39,160 |
| TLSv1.3 | 13,432 | 5853 | 20,590 |
| TLSv1.2 | 22,551 | 6078 | 9643 |
| UDP | 4020 | 196 | 1393 |
| DNS | 848 | 566 | 1940 |
| GQUIC | 161 | 67 | 1101 |
| HTTP | 33 | 18 | 45 |
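Raw counts are hard to compare across classes of different size, so it can help to convert each column to shares. The sketch below does this for the counts in the table above. It assumes that the share is taken relative to the sum of the listed rows only; because the listed protocols sit at different layers (for example, TLS runs over TCP), these shares are a rough comparison aid rather than an exact traffic breakdown. The function names are hypothetical.

```python
# Counts per protocol, copied from the table above as (Class A, Class B, Class C).
COUNTS = {
    "TCP": (77679, 105347, 88833),
    "QUIC": (1411, 196, 39160),
    "TLSv1.3": (13432, 5853, 20590),
    "TLSv1.2": (22551, 6078, 9643),
    "UDP": (4020, 196, 1393),
    "DNS": (848, 566, 1940),
    "GQUIC": (161, 67, 1101),
    "HTTP": (33, 18, 45),
}


def class_total(counts, class_index):
    """Sum the listed protocol counts for one class (0 = A, 1 = B, 2 = C)."""
    return sum(row[class_index] for row in counts.values())


def class_shares(counts, class_index):
    """Return each protocol's share of the listed counts for one class."""
    total = class_total(counts, class_index)
    return {proto: row[class_index] / total for proto, row in counts.items()}


if __name__ == "__main__":
    for index, label in enumerate("ABC"):
        shares = class_shares(COUNTS, index)
        print(label, {proto: round(value, 3) for proto, value in shares.items()})
```

Under this assumption, QUIC accounts for a far larger share of Class C than of Class A or Class B.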
Figure 4. Dataset network traffic protocol distribution.
Activity list adopted during data collection.
| Activity | Activity Type | Description |
|---|---|---|
| Unlocking screen | Operating system security | Unlocking the screen triggers an event that the spying system logs and sends to the control panel; the keylogger also collects the PIN. |
| Using instant messaging apps (WhatsApp, WeChat, Facebook, QQ, Snapchat, Telegram) | Data exchange, triggering OS APIs related to network infrastructure. | Simulate a real conversation between two accounts for each app and monitor the spying process to collect exchanged data between the spying client and control panel. |
| Opening camera and activating voice recording through dashboard panel. | Sensors related, camera and microphone handled under sensors APIs on the Android operating system. | Use the dashboard to open the camera and microphone to start eavesdropping. |
| Using encrypted end-to-end calls through messaging apps (WhatsApp, Telegram) | Sensors related; notifications related. | Spying systems rely on notifications to sniff encrypted messages. |
| File exchange activity | Memory related | Receiving new files via Bluetooth and other communication infrastructure. |
Figure 5. Class A data collection methodology workflow.
Figure 6. Class B data collection methodology workflow.
Code names used by each spyware system.
| System | Code Name |
|---|---|
| mSPY | Update services |
| uMobix | Play services |
| MobileSPY | Settings |
| FlexiSPY | Sync Services |
| TheWiSPY | System Settings |
Figure 7. Class B data collection methodology workflow.
Dataset benchmark parameters.
| Benchmark Specification | Value |
|---|---|
| Defined rules | A dataset that can be used to build models capable of detecting spyware on Android efficiently and effectively. |
| Dataset quality | All training samples were generated through a real-world process without simulation tools. |
| Dataset quantity | Almost 14,000 instances collected. |
| Dataset diversity | The targeted spyware systems, which belong to different companies, were selected after reviewing previous research in this field. |
| Dataset efficiency | Two-phase data collection (installation, operation) was adopted for each spyware system to provide more efficient samples. |
| Dataset eligibility | Dataset eligibility was tested using the random forest algorithm after analyzing the data with CICFlowMeter; the results are listed in detail and confirmed using machine learning. |
| Dataset consistency | To guarantee a consistent dataset, we adopted a unified activity list that was applied with respect to time for every spyware system listed. |
| Dataset accessibility | Dataset will be published online. |
| Dataset documentation | Dataset is fully documented. |
| Dataset rules testing | A machine learning model was built on the dataset to detect Android spyware; results ranged between 72% and 93%, with proper analysis and explanation. |
Figure 8. Random forest algorithm for decision tree simplification.
Figure 9. K-fold cross validation process [39].
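The k-fold cross-validation process named in the caption above can be sketched as an index splitter. This is a generic illustration, not the authors' Weka configuration: the number of folds is not stated here, so `k` is left as a parameter, and the function name `k_fold_indices` and the `seed` argument are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

Each sample appears in exactly one test fold, so averaging the per-fold accuracy uses every instance for evaluation once and for training in the remaining folds.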
Network traffic features set.
| Feature Name | Description |
|---|---|
| Src Port | Packet source port |
| Dst Port | Packet destination port |
| Protocol | Packet protocol |
| Flow duration | Duration of the flow in microseconds |
| total Fwd Packet | Total packets in the forward direction |
| total Bwd packets | Total packets in the backward direction |
| total Length of Fwd Packet | Total size of packet in forward direction |
| total Length of Bwd Packet | Total size of packet in backward direction |
| Fwd Packet Length Min | Minimum size of packet in forward direction |
| Fwd Packet Length Max | Maximum size of packet in forward direction |
| Fwd Packet Length Mean | Mean size of packet in forward direction |
| Fwd Packet Length Std | Standard deviation of packet size in forward direction |
| Bwd Packet Length Min | Minimum size of packet in backward direction |
| Bwd Packet Length Max | Maximum size of packet in backward direction |
| Bwd Packet Length Mean | Mean size of packet in backward direction |
| Bwd Packet Length Std | Standard deviation of packet size in backward direction |
| Flow Bytes/s | Number of flow bytes per second |
| Flow Packets/s | Number of flow packets per second |
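The two rate features at the end of the table follow from the totals and the flow duration listed earlier in it. The sketch below shows that arithmetic under stated assumptions: the rates cover both directions, the duration is in microseconds as described above, and a zero-duration flow is given zero rates. CICFlowMeter itself may treat that last case differently, and the function name `flow_rates` is hypothetical.

```python
def flow_rates(flow_duration_us, fwd_packets, bwd_packets, fwd_bytes, bwd_bytes):
    """Return (bytes_per_second, packets_per_second) for one bidirectional flow."""
    if flow_duration_us <= 0:
        # Assumption: avoid division by zero by reporting zero rates.
        return (0.0, 0.0)
    seconds = flow_duration_us / 1_000_000  # microseconds to seconds
    bytes_per_second = (fwd_bytes + bwd_bytes) / seconds
    packets_per_second = (fwd_packets + bwd_packets) / seconds
    return (bytes_per_second, packets_per_second)


if __name__ == "__main__":
    # Hypothetical flow: 2 s long, 6 forward and 4 backward packets.
    print(flow_rates(2_000_000, 6, 4, 1200, 800))
```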
Figure 10. Proposed detection model flow diagram.
Figure 11. Classification accuracy results.
Classification results.
| Parameter | FlexiSPY Binary | FlexiSPY Multi | MobileSPY Binary | MobileSPY Multi | mSPY Binary | mSPY Multi | TheWiSPY Binary | TheWiSPY Multi | UMobix Binary | UMobix Multi |
|---|---|---|---|---|---|---|---|---|---|---|
| Total Instances | 3186 | 3186 | 3169 | 3169 | 3050 | 3050 | 2227 | 2227 | 2298 | 2298 |
| Correctly Classified | 2306 | 2207 | 2253 | 2199 | 2239 | 2239 | 1907 | 1865 | 2160 | 2069 |
| Incorrectly Classified | 880 | 979 | 916 | 970 | 811 | 811 | 320 | 362 | 138 | 229 |
| Correct Percent | 72.30% | 69.20% | 71% | 69.30% | 73.40% | 73.40% | 85.60% | 83% | 93.90% | 90% |
| Incorrect Percent | 27.60% | 30.70% | 28.90% | 30.60% | 26.50% | 26.50% | 14.30% | 16.20% | 6% | 9.90% |
| Relative absolute error | 70.20% | 71.40% | 74.20% | 75.40% | 65% | 67.30% | 66.10% | 70.20% | 36.10% | 45.90% |
| Root relative squared error | 85.40% | 86.20% | 87.50% | 88.10% | 81% | 82.60% | 81% | 84.10% | 54.40% | 64.30% |
Multi-class confusion matrix results.
| | FlexiSPY A | FlexiSPY B | FlexiSPY C | MobileSPY A | MobileSPY B | MobileSPY C | mSPY A | mSPY B | mSPY C | TheWiSPY A | TheWiSPY B | TheWiSPY C | UMobix A | UMobix B | UMobix C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T* | 3186 | | | 3169 | | | 3050 | | | 2227 | | | 2298 | | |
| Class A* | 1446 | 282 | 32 | 1419 | 330 | 11 | 1503 | 235 | 22 | 1697 | 42 | 21 | 1723 | 32 | 5 |
| Class B* | 432 | 676 | 35 | 512 | 765 | 9 | 382 | 658 | 23 | 165 | 136 | 14 | 88 | 321 | 20 |
| Class C* | 126 | 72 | 85 | 71 | 37 | 15 | 91 | 58 | 78 | 102 | 18 | 32 | 24 | 60 | 25 |
T*—Total Instances. Class A*—Normal traffic. Class B*—Spy application operation traffic. Class C*—Spy application installation traffic.
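Overall accuracy and per-class detection rates can be recomputed directly from the confusion matrix above. The sketch below assumes that rows are actual classes and columns are predicted classes; the Class A row sums to the same value (1760) for every system, which is consistent with that reading. The function names are hypothetical.

```python
# Rows: actual class (A, B, C). Columns: predicted class (A, B, C).
# Values copied from the multi-class confusion matrix above.
CONFUSION = {
    "FlexiSPY": [[1446, 282, 32], [432, 676, 35], [126, 72, 85]],
    "MobileSPY": [[1419, 330, 11], [512, 765, 9], [71, 37, 15]],
    "mSPY": [[1503, 235, 22], [382, 658, 23], [91, 58, 78]],
    "TheWiSPY": [[1697, 42, 21], [165, 136, 14], [102, 18, 32]],
    "UMobix": [[1723, 32, 5], [88, 321, 20], [24, 60, 25]],
}


def correct_and_total(matrix):
    """Return (diagonal sum, sum of all cells) of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total


def accuracy(matrix):
    """Fraction of instances on the diagonal, i.e. correctly classified."""
    correct, total = correct_and_total(matrix)
    return correct / total


def per_class_recall(matrix):
    """For each actual class, the fraction of its instances predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


if __name__ == "__main__":
    for system, matrix in CONFUSION.items():
        recalls = [round(value, 3) for value in per_class_recall(matrix)]
        print(system, correct_and_total(matrix), round(accuracy(matrix), 4), recalls)
```

The diagonal sums can be compared against the "Correctly Classified" multi-class figures in the classification results, and the per-class values show how much of the overall accuracy is carried by Class A.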
Detailed accuracy results.
| | FlexiSPY Normal | FlexiSPY Malicious | MobileSPY Normal | MobileSPY Malicious | mSPY Normal | mSPY Malicious | TheWiSPY Normal | TheWiSPY Malicious | UMobix Normal | UMobix Malicious |
|---|---|---|---|---|---|---|---|---|---|---|
| Total Instances | 3186 | 3186 | 3169 | 3169 | 3050 | 3050 | 2227 | 2227 | 2298 | 2298 |
| *TP Rate | 0.794 | 0.637 | 0.792 | 0.61 | 0.826 | 0.699 | 0.955 | 0.486 | 0.975 | 0.825 |
| *FP Rate | 0.363 | 0.206 | 0.39 | 0.208 | 0.331 | 0.174 | 0.514 | 0.045 | 0.175 | 0.025 |
| F-Measure | 0.761 | 0.674 | 0.753 | 0.652 | 0.799 | 0.702 | 0.913 | 0.587 | 0.961 | 0.865 |
*TP—True Positive. *FP—False Positive.
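The three measures in the table above follow the standard definitions, which can be sketched from the four cells of a two-class confusion matrix. The counts in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the dataset. The function name `binary_metrics` is also hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (tp_rate, fp_rate, f_measure) for the positive class.

    Assumes every denominator is non-zero (each class has instances and
    at least one instance is predicted positive).
    """
    tp_rate = tp / (tp + fn)      # recall: positives correctly detected
    fp_rate = fp / (fp + tn)      # negatives wrongly flagged as positive
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return tp_rate, fp_rate, f_measure


if __name__ == "__main__":
    # Hypothetical counts: 100 malicious and 100 normal instances.
    print(binary_metrics(tp=80, fp=10, fn=20, tn=90))
```

In a two-class problem the FP rate of one class equals one minus the TP rate of the other, which is why, for example, the FlexiSPY column pairs 0.363 with 0.637.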
Literature comparison list.
| Paper | Spyware Systems | Deployment Approach | Analysis Technique | Analysis Algorithm | Dataset Availability | Results |
|---|---|---|---|---|---|---|
| Conti et al. | Cerberus, mSPY, TruthSPY | Network-based | Dynamic technique | RF, k-NN | Not available | RF 85%, k-NN 65%, LR 47% |
| Malik et al. | Generic embedded spyware tools, not monitoring systems | Network-based | Static technique | RF | Not available | 63% |
| This research | UMobix, TheWiSPY, MobileSPY, FlexiSPY, and mSPY | Hybrid approach | Hybrid approach | RF | Available under CC BY 4.0 | 79% for the binary-class classification and 77% for the multi-class classification |