Literature DB >> 35742272

NLP-Based Digital Forensic Analysis for Online Social Network Based on System Security.

Abstract

Social media evidence is the new topic in digital forensics. If social media information is correctly explored, there will be significant support for investigating various offenses. Exploring social media information to give the government potential proof of a crime is not an easy task. Digital forensic investigation is based on natural language processing (NLP) techniques and the blockchain framework proposed in this process. The main reason for using NLP in this process is for data collection analysis, representations of every phase, vectorization phase, feature selection, and classifier evaluation. Applying a blockchain technique in this system secures the data information to avoid hacking and any network attack. The system's potential is demonstrated by using a real-world dataset.

Entities: Chemical

Keywords: blockchain; digital forensics; machine learning; natural language processing

Mesh：

Year: 2022 PMID： 35742272 PMCID： PMC9222863 DOI： 10.3390/ijerph19127027

Source DB: PubMed Journal: Int J Environ Res Public Health ISSN： 1660-4601 Impact factor: 4.614

1. Introduction

Social media is generally used to communicate on the internet through various channels, to collaborate with different users, and to share information. The shared content supports researchers in investigating the potential of the criminal process. Social media does not have any limitations for content sharing related to victims, suspects, and witnesses [1,2]. The websites and applications are used to facilitate the sharing of content between connected networks. One of the social structures is the online social network (OSN), which which includes platforms such as Twitter or Facebook [3,4,5,6]. Forensic data extraction from social media platforms has become a considerable research problem [7,8,9]. Conventional digital forensics collects most of the information, which is a huge art of proof. Nevertheless, the extraction process is not practical on the OSN regarding the nature of the highly distributed network, shared content, and data size. Data collection from the individual subjects without any acceptable reason is almost unmanageable, and because of privacy laws, limited access is permitted [10,11,12]. Forensic data collection connects to the system operator for the formatting issue and data authenticity. The available digital forensics (DF) methods entail many challenges in cyber-physical systems. This includes the difficulties of data access, data originating from various locations, the traceability and transparency of evidence, and huge-volume data analysis. During the past few years, a large number of researchers have focused on forensic analysis based on cloud computing [13,14,15], evidence modeling [16,17,18], and assisting the community of law enforcement. Blockchain technology is a distributed ledger system that collects and saves the proper records in the decentralized format of a peer-to-peer network. The stored data are based on a timestamp block, and directly link with the chain based on proof of trust [19,20,21]. The advantages of applying blockchain in the DF system are to provide the digital evidence the accessibility of self-verification for ensuring the hash function and evidence chain verification. This process guarantees the system transparency, security, and immutability in case of examination. In this paper, we propose a digital forensic platform using the integrated method of NLP and blockchain, feature selection using machine learning techniques, network vectorization, and system security analysis. Moreover, the presented system focuses on the relationship between content communication and individuals. The system uses the supervised NLP for topic extraction and applies the feature selection for topic ranking to find the highly weighted topics. Regarding the ranked topics, the classifiers can train with the famous algorithms and generators, and the output will be effective classifiers that can modify various metrics for further investigation. The main contributions of this paper are summarized as below: This research applies natural language processing techniques for the detailed data analysis approach; One of the important aspects of this research is the multi data source input, which makes this process more competitive with other research results; The main focus of this research is a system security method which stores the OSN information in blockchain framework. The remainder of this process is summarized in Section 2, which reviews the recent related literature and the current state of the art. Section 3 presents the details of the proposed digital forensics approach and a performance evaluation. Section 4 presents details for the results, the implementation of the proposed digital forensics analysis, and finally, the conclusion.

2. Related Work

In this section, the state of the art in DF is presented in detail. The main focus is on two parts. One is digital forensics challenges in blockchain, and the other is the forensic attainment of social media content.

2.1. Digital Forensics Challenges in Blockchain

In DF, hash functions are applied to maintain digital integrity and generate the digital digest to avoid changes of digital assets [22,23]. Nevertheless, in the applications related to DF, the main focus is on disk drive integrity and the validation of data. The biggest concern is the hash validation and verification for special files such as images. The DF approach depends on the investigators’ experience [24,25]. Some of the challenges related to the existing DF are presented as trustworthiness, integrity, provenance improvement, scalability, and availability [26,27]. In terms of trustworthiness, the system is supposed to check the trust if insider threats to the blockchain environment improve the trust of evidence [28]. Regarding integrity, the system checks the events examinations and items in the digital investigation. A traditional investigation provides forensic activities and supports the data, tools, etc. The improvement of provenance fetching at the top of hash functionality gives the information of hash validation to examine the system’s behavior with creating the hash tree. The scalability in the hash tree is able to support nodes of the system, and it is capable of the hash digest in the deep level [29,30,31]. Every blockchain node contains the whole hash information, guaranteeing accuratcy. This aspect saves the digital data for investigating forensic events.

2.2. Forensic Attainment of Social Media Content

The DF attainment involves steps of proofing the criminal cases regarding location, security, data, etc. [32]. The data provided from social media is more understandable and easy to access for users [33,34,35]. To use this type of data, it is required to follow the legal and formal process [36]. This process is performed by a highly skilled person with sufficient knowledge of technical and legal matters [37,38,39]. The artifacts of DF identify the critical sources for social media evidence [40,41]. Thus, lots of research materials focus on the attainment of forensic evidence; the extraction of forensic information from social media concentrates on the identification of specific devices and detects the traces found by the device from the web browsers or media applications [42,43,44,45]. To collect the forensic data, the requirements are defined as relevant data collection from multiple websites, metadata collection from social media information, and certifying the data integration in the forensic collection [46,47]. DF footage is mostly used for the comparative analysis of images and objects to find the relative subjects to provide the opinion findings [48,49,50,51]. Table 1 shows the existing state of the art in forensic data analysis. The main focus is on the research approach, limitations, and advantages analyzed in DF. The selected works show various DF analyses based on machine learning, video clustering, chat data encryption, instant message analysis, etc.

Table 1

Comparison of the recent state-of-the-art DF methods.

Author	ProposedApproach	Advantages	Limitations
Choi et al.(2019) [52]	Digital forensicanalysis for theKakao Talk encrypteddata.	Data recoverywithout requiringa password froma user.	Difficult to protectuser-sensitiveinformation.
Zhang et al.(2018) [53]	Digital forensicanalysis forsmart phone instantmessaging.	Investigating thehistory of usermessages in fourAndroid mobileapplications.	Limitation incommunicationmode for one-to-one contacts.
Du et al.(2020) [54]	Future of artificialintelligence ininvestigation ofdigital forensics.	Survey of automatedevidence-processingmethods based on AItechniques.	Image data withlow qualityis difficult to trainand process further.
Xiao et al.(2019) [55]	Analysis of video-based evidenceinvestigationof digital forensics.	Identification offorensics and link-establishment toinvestigate theobjects.	Difficult to identifyhuman face recognition,motion detection, etc.
Jadir et al.(2018) [56]	Digital forensicenhancement fordocument clustering,	Enhancing thedocument clusteringperformance forpartitioning thecriminal reports andtext dataset.	Challenges of processingif the data numbersincrease.

3. Proposed NLP-Based Digital Forensic Analysis for Online Social Network

This section briefly presents the integration of NLP techniques with blockchain. Figure 1 shows the overview of the proposed digital forensic analysis in terms of the NLP and blockchain approach. The main goal of this system is to improve the security of the DF analysis regarding the information shared on social media.

Figure 1

Overview of the proposed digital forensic analysis.

NLP techniques are applied to analyze the collected dataset from every aspect to provide meaningful information to the proposed system. This process has five main layers: a processing layer, an interface layer, an analysis layer, a data layer, and a knowledge layer. The responsibility of the processing layer is to identify and acquire the system input. Inputs are from social networks and incident notifications for which identification requires the incident specification and incident boundary specification. The identification, extraction, data collection, the parser, and preservation are required before moving forward to acquisition. Completing this process, the forensic data is the input of the data layer. In this layer, the hybrid data mapping is used for global and local ontology and storing the data. The next layer is the analysis layer for which the interface query is an NLP semantic interface, and the analysis operators are correlation events and document analysis, location analysis, the relatedness of findings, frequency analysis, and relationship analysis. The analysis report moves to the interface layer, which contains the timelines, tweet cloud, temporal graph, interaction graph, frequency chart, and location chart. Next is the knowledge layer, where the relationship between the extracted dataset and its linking is processed. After completing the NLP steps and data processing, the analyzed dataset is ready to save into the blockchain framework. The main reason for using the blockchain framework is to secure the collected dataset with limited access to avoid hacking or attack. The blockchain framework contains the data collection, investigation, and verification processes, which provides the verified data to a court for defense and prosecution.

3.1. NLP-Based Digital Forensic Analysis

The presented knowledge model is an event-based system that prepares the social media analysis based on electronic forensics. The ontology technique is applied for representing the related knowledge of OSNs. The detailed explanations show the automated method process and provide formal information through ontologies for the true validation and automated techniques. The investigation of forensic models is from the collection of semi-automated processes. This model contribution provides the boundary identification of data collection from the social media distribution network. The model gives the limitations of forensic data in terms of appropriate parameters and automated collection. Figure 2 describes the details of every layer for the forensic data-analysis process: the automated operators, semantic querying, the rules of the ontology and taxonomy processes, and the identification of the data interchange regarding the defined layers processed for analysis of forensic data system. The data layer contains the parser, the data profile, the content, the data from the network, and activity data. The knowledge layer contains the local/global ontology process and mapping details. The analysis layer gives the information and further processes the operators, timeline, interaction, temporal patterns, and correlation analysis, and finally, the interface layer shows the user applications and interfaces.

Figure 2

Multi-layered implementation process.

The vectorization process in this approach is based on the latent Dirichlet allocation (LDA) to group some of the topics out of the data. LDA is a famous topic-discovery or topic-categorization approach that clearly separates the content into the clusters of similar data [57]. Each cluster contains similar information and the same direction in terms of meaning and content similarity or probability. Equation (1) presents the estimated topics t based on LDA and edge , which transforms to the vector that provides the probability of for every topic. For deciding the topic t, the perplexity model of LDA was applied. This model shows the model’s performance and how well it works. Equation (2) shows the process of the perplexity evaluation, where is the words’ probability output from the LDA model, and i is the number of words. Thus, the presented approach evaluates the LDA model perplexity for the vectorization. Equation (3) presents the vertices , which can be vectorized with a vector in m edges. Every vertex can have various numbers; in Equation (4), the vertex normalization is evaluated for the topic distribution. Based on the various generated vectors’ sizes, the last step is high dimensional. The presented system evaluates the feature relevancy composition to reduce the dimensions. The CFR algorithm is applied for the feature selection regarding the information that can discover topics’ degree of impact. Table 2 shows the details of the feature selection of the presented system.

Table 2

Vertices before and after feature selection.

Before Feature Selection
Samples	topic 1	topic 2	topic 3	topic 4	topic 5	y (label)
Vector 1	1	0	0	0	0	0
Vector 2	0	0	0.7	0.2	0.4	1
Vector 3	0.5	0.3	0.2	0	0.4	0
Vector 4	0	0.6	0	0	0.6	1
After Feature Selection
Vectors	topic 1	topic 2	topic 3	y (label)
Vector 1	1	0	0	0
Vector 2	0	0	0.2	1
Vector 3	0.5	0.3	0	0
Vector 4	0	0.6	0	1

3.2. Blockchain-Based Digital Forensic Analysis

The blockchain approach for the digital forensics process is used to secure the forensic data in terms of transparency and performance. As shown in Figure 3, each entity links together, e.g., users, devices, evidence items, etc. The significant part to guarantee the digital evidence integrity is based on the hierarchy level in an investigation of chains.

Figure 3

Blockchain-based evidence identification.

There are three main processes defined for the DF investigation: applying a smart contract to perform the evidence analysis automatically, e.g., email analysis, signature or file analysis, etc, and providing better auditability by improving the investigation transparency, thereby reducing the costs and used resources and increasing connection stability between third parties.

4. Experimental Results and Development Environment

This section describes the details of the collected dataset, the system performance evaluation, the experiments, and the results of NLP and blockchain in forensic data analysis. Table 3 shows the details of the development environment for the digital forensic analysis.

Table 3

Development environment.

Module	Component	Description
MachineLearning	Operating System	Microsoft Windows 10
	CPU	Intel (R) Core (TM)i7-8700@3.20 GHz
	Main Memory	16GB RAM
	Core Programming Language	Python
	IDE	PyCharm Professional 2020
	ML Algorithm	Random Forest
BlockchainFramework	Operating System	Ubuntu Linux 18.04 LTS
	Docker Engine	Version 18.06.1-ce
	Docker Composer	Version 1.13.0
	IDE	Composer Playground
	Programming Language	Node.js

4.1. Data Representation and Collection

The data collected in this system is from online social media (OSN), which implements the knowledge model by using ontologies and semantic web processes. The data collection is from famous social media websites, such as Facebook and Twitter, including comments, shares, news broadcasts, etc. The number of users in this environments is high and information sharing is very fast and impressive. Regarding this process, the records of fake shared information is also very high. In this process, we have used 80% of the collected dataset for the training set and 20% for the testing set. Table 4 presents the details of the collected dataset for this approach.

Table 4

Data information.

Data Type	Total Records
Facebook	5000
Twitter	6500
Blogs	6600
News	5500
Training Set	80%
Testing Set	20%

The data layer is responsible for normalizing the provided data and storing them in persistent memory. This memory implements a design that is based on web schema. The unstructured data analysis requires a developed and customized tool for further processing. The analysis layer presents the analysis operators for the automatic process of social media contexts. The computerized analysis method is applied for quick data analysis and evaluation in this process. The decision-making process is a representative evaluation of the human examiner regarding various crimes in the social network evidence. Figure 4 shows the details of data classification and automation solutions.

Figure 4

Multi-layered data processing.

Table 5 presents the details of the operators for the data analysis process. Eight operators use the subject and object correlation to analyze the dataset’s contents.

Table 5

Analysis operators list of the following processes.

Name of Operators	Details
Tweet Cloud	Object correlation method to provide thefast overview of users’ tweet topics.
Hashtag Cloud	Object correlation based on hashtags ofuser tweets.
Interaction Graph	Subject and object correlation for sortingcontacts between the social graph of userswith the highest communication frequency.
Interaction FrequencyAnalysis	Subject and objective correlation to performthe frequency analysis between two users andidentify the relationship of the users’communication.
Views Similarity	Rule-based correlation for nearest user-opinionidentification.
Trace Operator	Linking the evidence to the entity.
Temporal ActivityGraph	Using temporal correlation to analyze the useractivity patterns in a defined period.
Geo-location ActivityGraph	Object correlation for sorting the location basedon the tagged online content.

4.2. Performance Evaluation of the Proposed Online Digital Forensic Analysis

This part presents the performance evaluation of the proposed online forensic analysis. We have defined three metrics of precision P, recall R, and F-measure . In this process, we used the Random Forest algorithm to analyze this process and compare our results with the Decision Tree, Naive Bayes, Logistic Regression, and Support Vector Machine algorithms. The main reason for using Random Forest in this process is its good performance in terms of classification, as compared to the other algorithms [58]. Table 6 shows the details of each classifier’s performance for each fold. Equations (5)–(7) show the details of precision, recall, and f-measure in this process.

Table 6

Different classifiers’ performance evaluation for each fold.

Fold	Metrics	Decision Tree	NaiveBayes	LogisticRegression	RandomForest	Support VectorMachine
1	P	0.6595	0.8254	0.9486	0.9846	0.6487
	R	0.7511	0.9700	0.7111	0.6911	0.7348
	F1	0.6667	0.6174	0.8611	0.7811	0.7794
2	P	0.7198	0.3541	0.6736	0.8947	0.4955
	R	0.5511	0.7511	0.4711	0.5948	0.6564
	F1	0.6944	0.5656	0.5611	0.7182	0.5836
3	P	0.6111	0.6993	0.8793	0.9831	0.6939
	R	0.6311	0.9300	0.7511	0.8334	0.8479
	F1	0.6825	0.6111	0.7622	0.7939	0.7749
4	P	0.5968	0.7986	0.8611	0.9444	0.6232
	R	0.8711	0.8711	0.7911	0.7746	0.7498
	F1	0.6929	0.6477	0.7994	0.8337	0.6949
5	P	0.6374	0.4058	0.7929	0.8478	0.5498
	R	0.5111	0.8711	0.7111	0.6964	0.7699
	F1	0.5990	0.6566	0.7633	0.7982	0.6479

Figure 5 shows the perplexity records achieved from LDA and records 210 out of the 250 tested topics. Regarding this process, the data was vectorized for the 210 topics. In Figure 5, the x-axis presents the number of topics extracted from this process and the y-axis presents the perplexity of each topic category.

Figure 5

LDA-based preplexity records.

The next step is the cross-validation process. For each five-fold output, the classifier builds the m topics in the training set and validates the highest performance record. Table 6 gives the details of defined three classifiers per fold, and Table 7 shows the further evaluations of the metrics’ average scores.

Table 7

Average score of different classifiers.

Metrics	DecisionTree	NaiveBayes	LogisticRegression	RandomForest	Support VectorMachine
P	0.6449	0.5367	0.6393	0.0.9443	0.6279
R	0.6631	0.9191	0.6871	0.6943	0.7432
F1	0.6673	0.6197	0.7334	0.7611	0.6745

The presented system shows the benefits of feature selection in this process. Table 8 shows the details of the analysis with and without feature selection. The improvement of Random Forest’s performance is very visible. From the perspective of digital investigators, the feature selection is suitable to sort the related topics.

Table 8

Records with and without feature selection.

#	Metrics	DecisionTree	NaiveBayes	LogisticRegression	RandomForest	SupportVectorMachine
With featureselection	P	0.6449	0.6367	0.8513	0.9443	0.6279
	R	0.6631	0.9191	0.6871	0.6943	0.7432
	F1	0.6673	0.6197	0.7534	0.7611	0.6745
Without featureselection	P	0.6293	0.6176	0.8122	0.8321	0.4574
	R	0.6171	0.5351	0.6791	0.5467	0.6831
	F1	0.6372	0.5779	0.6974	0.6998	0.5445

The other benefit of applying sorted topics is identifying the communication between networks with a significant volume of data. Figure 6 shows the test set of 106 topics’ probability for each fold.

Figure 6

Topic probability analysis records.

4.3. Security Analysis of Online Digital Forensic Based on Blockchain

Security in forensic data is one of the most important and challenging aspects in this area. We have used the blockchain framework for online digital forensic security analysis to improve this system’s transparency and rate of the trust according to the following steps: The first step is digital evidence identification. The aim of this is to identify the digital fingerprint of evidence. Furthermore, one fingerprint is generated to examine the event for every certain claim; Based on the timestamp and additional information, the fingerprint records are written into the evidence block and appended to the blockchain; In the blockchain network, every participant holds a copy of the evidence blockchain. Figure 7 shows the JSON script for the evidence block.

Figure 7

Evidence block in JSON script.

Figure 8 presents the details of using blockchain in forensic analysis. There are four main sections, namely, data acquisition, identification, analysis, and presentation. Regarding the timeline, the transactional evidence record is in the blockchain framework. In data acquisition section, all the related information is saved in the blockchain. In the identification section, suspicious files are saved in the blockchain too. In the analysis stage, by using the hash function, various file types are analyzed and stored in blockchain. The presentation stage writes all the reports and findings to the blockchain.

Figure 8

Details of the process of the blockchain framework for forensic analysis.

5. Conclusions

Social media communication is an important source of evidence for criminal investigations, such as fake news or fake election investigations. In this paper, we proposed the integration of NLP techniques with blockchain to improve the security and performance of online digital forensics. In terms of NLP, the LDA topic modeling, feature extraction, and data analysis were applied for a detail analysis of the collected dataset. The collected information is from multi-source social media platforms, which provides more opportunity for results comparisons in various aspects, as compared with other state-of-the-art approaches. The Random Forest algorithm was applied on a real-world dataset and compared with the other four algorithms, namely, the Decision Tree, Naive Bayes, Logistic Regression, and Support Vector Machine algorithms. The main reason for selecting Random Forest for this system is the higher performance of this algorithm in classification tasks and related processes. The concept of blockchain in this system is to improve system security and trace process changes. The defined system is processed in the Hyperledger Fabric framework. The blockchain framework gives the opportunity to the system process of saving and securing the results, as well as all the digital forensic processess and data, with details. Future studies in this topic can apply the presented system to cybercriminal activities and fraud to overcome the recent issues in this field.

2 in total

1. Flood characterization based on forensic analysis of bridge collapse using UAV reconnaissance and CFD simulations.

Authors: Marianna Loli; Stergios Aristoteles Mitoulis; Angelos Tsatsis; John Manousakis; Rallis Kourkoulis; Dimitrios Zekkos
Journal: Sci Total Environ Date: 2022-02-03 Impact factor: 7.963

Review 2. Forensic image analysis - CCTV distortion and artefacts.

Authors: Dilan Seckiner; Xanthé Mallett; Claude Roux; Didier Meuwly; Philip Maynard
Journal: Forensic Sci Int Date: 2018-02-06 Impact factor: 2.395

2 in total