
FakeBERT: Fake news detection in social media with a BERT-based deep learning approach.

Rohit Kumar Kaliyar1, Anurag Goswami1, Pratik Narang2.   

Abstract

In the modern era of computing, the news ecosystem has shifted from traditional print media to social media outlets. Social media platforms allow us to consume news much faster but with less editorial oversight, which results in the spread of fake news at an incredible pace and scale. In recent research, many useful methods for fake news detection employ sequential neural networks to encode news content and social context-level information, analyzing the text sequence in a unidirectional way. A bidirectional training approach is therefore a priority for modelling the relevant information in fake news, as it can improve classification performance by capturing semantic and long-distance dependencies in sentences. In this paper, we propose a BERT-based (Bidirectional Encoder Representations from Transformers) deep learning approach (FakeBERT) that combines BERT with parallel blocks of single-layer deep Convolutional Neural Networks (CNNs) having different kernel sizes and filters. Such a combination is useful for handling ambiguity, which is the greatest challenge to natural language understanding. Classification results demonstrate that our proposed model (FakeBERT) outperforms existing models, with an accuracy of 98.90%. © Springer Science+Business Media, LLC, part of Springer Nature 2021.


Keywords:  BERT; Deep learning; Fake news; Neural network; Social media

Year:  2021        PMID: 33432264      PMCID: PMC7788551          DOI: 10.1007/s11042-020-10183-2

Source DB:  PubMed          Journal:  Multimed Tools Appl        ISSN: 1380-7501            Impact factor:   2.757


Introduction

In the past few years, social media platforms such as Twitter, Facebook, and Instagram have become very popular because they make it easy to acquire information and provide a quick platform for sharing it [10, 21]. The availability of unauthenticated data on these platforms has attracted massive attention among researchers, and the platforms have become hot-spots for sharing fake news [16, 46]. Fake news is an important issue because of its tremendous negative impact [16, 46, 53], and it has drawn increasing attention from researchers, journalists, politicians, and the general public. In terms of writing style, fake news is written or published with the intent to mislead people and to damage the image of an agency, entity, or person, for either financial or political benefit [14, 35, 39, 53]. A few examples of fake news are shown in Fig. 1. These examples were trending during the COVID-19 pandemic and the 2016 U.S. presidential general election.
Fig. 1

Examples of some fake news spread over social media (Source: Facebook®)

In the research context, the following related terms are often linked with fake news:
Rumour: A rumour [4, 12, 16] is an unverified claim about an event that passes from individual to individual in society. It may concern an occurrence, an article, or any social issue of public concern, and it can become a socially dangerous phenomenon in any human culture.
Hoax: A hoax is a falsehood deliberately fabricated to masquerade as the truth [43]. Hoaxes have been increasing at an alarming rate. A hoax is also known by similar names such as prank or jape.

Existing approaches for fake news detection

Detecting fake news is challenging because it is intentionally written to falsify information. Earlier theories [1] are valuable in guiding research on fake news detection using different classification models. Existing approaches to fake news detection can be broadly categorized as (i) news content-based learning and (ii) social context-based learning. News content-based approaches [1, 14, 51, 53] deal with the writing style of published news articles. In these techniques, the main focus is to extract features from a fake news article that relate to both its information and its writing style. Fake news publishers regularly have malicious intent to spread distorted and misleading content, which requires particular writing styles, not found in true news stories, to appeal to and convince a wide range of consumers. Within this category, style-based methodologies [12, 35, 53] help to capture the writing style of manipulators by using linguistic features to identify fake articles. However, it is difficult to detect fake news accurately using only news content-based features [14, 33, 46], so we also need to investigate how users engage with fake news articles. Social context-based approaches [14, 17, 38, 51, 53] deal with the latent information between the user and the news article. Social engagements (the semantic relationship between news articles and users) can be used as a significant feature for fake news detection. Within this category, instance-based methodologies [51] deal with the behaviour of users towards a social media post to infer the veracity of the original news story. Propagation-based methodologies [51] deal with the relations among relevant social media posts to guide the learning of validity scores by propagating credibility values between users, posts, and news. Approaches to fake news detection are shown in Fig. 2.
Most of the existing and useful methods [14, 38, 51] combine news content and context-level features using unidirectional pre-trained word embedding models (such as GloVe, TF-IDF, word2vec, etc.). There is large scope for using bidirectional pre-trained word embedding models with powerful feature extraction capability.
Fig. 2

Approaches for fake news detection


Our contribution

Among existing approaches [1, 33, 40], many useful methods for detecting fake news have been presented using traditional machine learning models. The primary advantage of a deep learning model over classical feature-based approaches is that it does not require handcrafted features; instead, it identifies the best feature set on its own. The powerful learning ability of a deep CNN is primarily due to its multiple feature extraction stages, which can automatically learn representations from the dataset. Existing work [18, 19, 26] discusses several inspiring ideas for advancing deep Convolutional Neural Networks (CNNs), such as exploiting temporal and channel information, depth of architecture, and graph-based multi-path information processing. The idea of using a block of layers as a structural unit is also gaining popularity among researchers. In this paper, we propose a BERT-based deep learning approach (FakeBERT) that combines parallel blocks of single-layer CNNs with Bidirectional Encoder Representations from Transformers (BERT). We utilize BERT as a sentence encoder, which can accurately capture the context representation of a sentence. This work contrasts with previous research [9] in which researchers looked at a text sequence in a unidirectional way (either left to right or right to left for pre-training). Many existing and useful methods [9, 24] have been presented that use sequential neural networks to encode the relevant information. However, a deep neural network with a bidirectional training approach can be a more accurate solution for the detection of fake news. Our proposed method improves the performance of fake news detection through its ability to capture semantic and long-distance dependencies in sentences.
To design our proposed architecture, we added a classification layer on top of the encoder output, multiplied the output vector by the embedding matrix, and finally calculated the probability of each vector with the softmax function. Our model combines BERT with three parallel blocks of 1D convolutional neural networks having different kernel sizes and filters, each followed by a max-pooling layer. With this combination, the documents were processed using different CNN topologies by varying the kernel size (different n-grams), the filters, and the number of hidden layers or nodes. FakeBERT consists of five convolution layers and five max-pooling layers, followed by two densely connected layers, with one embedding layer (the BERT layer) at the input. In each layer, several filters are applied to extract information from the training dataset. Such a combination of BERT with a one-dimensional deep convolutional neural network (1D-CNN) is useful for handling large-scale structured as well as unstructured text. It effectively addresses ambiguity, which is the greatest challenge to natural language understanding. Experiments were conducted to validate the performance of our proposed model. Several performance evaluation parameters (training accuracy, validation accuracy, False Positive Rate (FPR), and False Negative Rate (FNR)) were taken into consideration to validate the classification results. Extensive experiments demonstrate that our proposed model outperforms the existing benchmarks for classifying fake news. Our model, built on the bidirectional pre-trained BERT, achieved an accuracy of 98.90%. Our proposed approach improves on the baseline approaches by 4% and is promising for the detection of fake news.
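The final softmax step mentioned above can be illustrated with a minimal sketch. The two input scores and the class order (real, fake) are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the two classes (index 0 = real, index 1 = fake).
probs = softmax([0.3, 2.1])
predicted_class = probs.index(max(probs))
```

The class with the largest probability is taken as the prediction; equal scores give equal probabilities.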

Related work

This section briefly summarizes work in the field of fake news detection. Kumar et al [21] presented a comprehensive survey of diverse aspects of fake news, covering the different categories of fake news, existing algorithms for fake news detection, and future directions. Shin et al [37] investigated fundamental theories across various disciplines to enhance the interdisciplinary study of fake news. The authors examined the problem from four perspectives: the false knowledge it carries (what type of false message the content conveys), writing styles (the different styles used to create fake news), propagation patterns (which trends it follows when shared in a network), and the credibility of its creators and spreaders (the credibility score of a news creator and spreader). Bondielli et al [4] presented a hybrid approach for detecting automated spammers by amalgamating community-based features with other feature categories, namely meta-content and interaction-based features. Ahmed et al [1] focused on the automatic detection of fake content using online fake reviews. The authors explored two different feature extraction methods for classifying fake news, examined six different machine learning models, and showed improved results compared to existing state-of-the-art benchmarks. Allcott et al [2] presented a quantitative report on the impact of fake news on social media in the 2016 U.S. presidential general election and its effect on U.S. voters. The authors investigated the authentic and unauthentic URLs related to fake news from the BuzzFeed dataset. Shu et al [38] investigated a way to automate the process through hashtag recurrence.
In this research article, the authors also presented a comprehensive review of detecting fake news on social media, false news classifications based on psychology and social concepts, and existing algorithms from a data mining perspective. Ghosh et al [14] investigated the impact of web-based social networking on political decisions. A considerable amount of research [2, 53, 54] has been done on detecting political news articles. The authors investigated the effect of various political groups on the discussion of fake news as an agenda. They also explored Twitter data from six Venezuelan government officials in order to investigate bot collaboration. Their findings suggest that political bots in Venezuela tend to imitate members of political parties or ordinary citizens. Zhou et al [53] investigated the ability of social media to aggregate the judgments of a large community of users. In further investigation, they explained machine learning approaches with the goal of developing better rumour detection. They investigated the difficulties posed by the spread of rumours, rumour classification, and deception for the advancement of such frameworks. They also investigated how such strategies can be used to build systems that help individuals make decisions when evaluating the integrity of data gathered from various social media platforms. Vosoughi et al [46] identified salient features of rumours by investigating three aspects of information spread online: linguistic style, characteristics of the people involved in propagating information, and network propagation subtleties. The authors evaluated their proposed algorithm on 209 rumours representing 938,806 tweets collected from real-world events, including the 2013 Boston Marathon bombings, the 2014 Ferguson unrest, and the 2014 Ebola epidemic.
They compared the effectiveness of their proposed framework with existing methods. The primary objective of their study was to introduce a novel way of assessing style similarity between different text contents. They implemented numerous machine learning models and achieved an accuracy of 51% for fake news detection. Chen et al [7] proposed an unsupervised learning model combining recurrent neural networks and auto-encoders to distinguish rumours as anomalies from other credible micro-blogs based on users’ behaviours. The experimental results show that their model achieved an accuracy of 92.49% with an F1 score of 89.16%. Further, Yang et al [49] reached comparable conclusions for detecting false rumours. Studying the 2011 riots in England, the authors observed that any improvement in handling stories based on false rumours could produce good results. In their investigation of the 2013 Boston Marathon bombings, they found some exciting news stories, most of which were rumours that had a significant impact on the share market. Shu et al [39] explored the connection between fake and real facts available on social media platforms using an open tweet dataset. This dataset was created by gathering tweets from Twitter that contain URLs from fact-checking websites. They found that URLs are the most common way to share news articles on platforms that limit the length of user posts (for example, Twitter’s 140-character limit). In further investigation, they used a hoax-based dataset that gives a more accurate prediction for distinguishing fake news stories by comparing them against known news sources from reputable fact-checking sites.
Monteiro et al [25] collected a fake news dataset in the Portuguese language and investigated their results based on different linguistic features. The authors achieved a highest accuracy of 49% using machine learning techniques. Karimi et al [20] analyzed 360 satirical news articles covering civics, science, business, and soft news. They also proposed an SVM-based model whose five features are absurdity, humour, grammar, negative affect, and punctuation. Their framework achieved an accuracy of 38.81%. Perez-Rosas et al [29] addressed the automatic identification of fake content in online news articles and presented a comprehensive analysis of linguistic features in false news content. Castillo et al [5] investigated feature-based methods to assess the credibility of tweets on Twitter. Roy et al [34] explored a neural embedding approach using a deep recurrent model. They used a weighted n-gram bag-of-words model with statistical features and other external features obtained through feature engineering. They then combined all features and classified fake news with an accuracy of 43.82%. Wang et al [47] presented a novel dataset for fake news detection and proposed a hybrid architecture to solve the fake news problem. Their model has two main components: a Convolutional Neural Network for meta-data representation learning, followed by a Long Short-Term Memory (LSTM) neural network. Despite being complicated, with many parameters to optimize, their model performs poorly on the test set, with an accuracy of only 27.4%. Peters et al [30] took a different perspective on detecting fake news by looking at its linguistic characteristics.
Despite substantial dependence on lexical resources, the performance on the political set was even lower than that of [47], with an accuracy of only 22.0%. In many existing studies [13, 23, 28, 42], authors have explored the problem using a real-world fake news dataset: Fake-News. Ahmed et al [1] utilised TF-IDF (Term Frequency-Inverse Document Frequency) as a feature extraction method with different machine learning models. Extensive experiments were performed with LR (Linear-regression model), obtaining an accuracy of 89.00%. They subsequently showed an accuracy of 92% using their LSVM (Linear Support Vector Machine). Liu et al [23] investigated methods for recognizing false tweets. The authors utilized a corpus of more than 8 million tweets gathered from supporters of the presidential candidates in the U.S. general election. They employed deep CNNs for fake news detection, utilised the concept of subjectivity analysis, and obtained an accuracy of 92.10%. O’Brien et al [28] applied deep learning strategies for classifying fake news and achieved an accuracy of 93.50% using a black-box method. Ghanem et al [13] adopted different word embeddings, including n-gram features, to detect the stances in fake articles and obtained an accuracy of 48.80%. Ruchansky et al [35] employed a deep hybrid model for classifying fake news. They utilized news-user relationships as an essential factor and achieved an accuracy of 89.20%. Singh et al [42] investigated LIWC (Linguistic Inquiry and Word Count) features using traditional machine learning methods for classifying fake news. Using an SVM (support vector machine) as the classifier, they obtained an accuracy of 87.00%.
Jwa et al [18] explored an approach to automatic fake news detection. They used the Bidirectional Encoder Representations from Transformers (BERT) model to detect fake news by analyzing the relationship between the headline and the body text of a news story. Their results improve the F-score by 0.14 over existing state-of-the-art models. Weiss et al [48] investigated the origins of the term “fake news” and the factors contributing to its current prevalence. This lack of consensus may have future implications for students in particular and higher education. Crestani et al [8] proposed a novel model that can classify a user as a potential fact checker or a potential fake news spreader. Their model was based on a Convolutional Neural Network (CNN) and combined word embeddings with features that represent users’ personality traits and linguistic patterns.

Methodology

In this section, we give an overview of word embedding, GloVe word embedding, the BERT model, fine-tuning of BERT, and the selection of hyperparameters. Our proposed model (FakeBERT) and other deep learning architectures are also examined in this section.

Word embedding

Word embeddings [30] are widely used in both machine learning and deep learning models. They help to reduce training time and improve the overall classification performance of the model. Pre-trained representations can be either static or contextual (refer to Fig. 3 for more details). Contextual models generate a representation of each word that is based on the other words in the sentence. Word2Vec and GloVe [50] are currently among the most widely used word embedding models that convert words into meaningful vectors. To use pre-trained embedding models for training, we replace the parameters of the processing layer with the input embedding vectors. We then maintain the index and freeze this layer, preventing it from being updated during gradient descent [30, 31]. Our experiments show that embedding-based input vectors play a valuable role in text classification tasks.
Fig. 3

An Overview of existing word-embedding models


GloVe

GloVe is a weighted least-squares model [3] that is trained on the co-occurrence counts of the words in the input. It leverages statistical information effectively by training on the non-zero elements of a word-to-word co-occurrence matrix. GloVe is an unsupervised training model that is useful for finding the correlation between two words from their distance in a vector space [31]. The generated vectors are known as word embedding vectors. We have used word embeddings as semantic features in addition to n-grams because they represent the semantic distances between words in context. The smallest embedding package, called “glove.6B.zip”, is 822 MB. The GloVe model is trained on a dataset of one billion words with a dictionary of 400 thousand words. Embedding vectors are available in different sizes: 50, 100, 200, and 300 dimensions. In this paper, we have taken the 100-dimensional version.
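As an illustration of how pre-trained GloVe vectors are typically turned into an embedding matrix, the sketch below parses GloVe's plain-text format (one word per line followed by its vector components, separated by spaces). The in-memory sample, the 3-dimensional vectors, and the `word_index` vocabulary are hypothetical stand-ins; a real run would read a file such as glove.6B.100d.txt with 100 dimensions. This is a sketch under those assumptions, not the authors' loading code.

```python
import io

def load_glove(handle):
    """Parse GloVe's text format: one word per line followed by its vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# Tiny stand-in for the real file (3 dimensions instead of 100).
sample = io.StringIO("news 0.1 0.2 0.3\nfake 0.4 0.5 0.6\n")
glove = load_glove(sample)
word_index = {"fake": 1, "news": 2, "hoax": 3}  # hypothetical tokenizer vocabulary
embedding_matrix = build_embedding_matrix(word_index, glove, dim=3)
```

Words missing from the pre-trained vocabulary (here "hoax") keep an all-zero row, which is one common convention; random initialization is another.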

BERT

BERT [11] is an advanced pre-trained word embedding model based on the transformer encoder architecture [44]. We utilize BERT as a sentence encoder, which can accurately capture the context representation of a sentence [30]. BERT removes the unidirectional constraint by using a masked language model (MLM) [44]. The MLM randomly masks some of the tokens from the input and predicts the original vocabulary id of each masked word based only on its context. MLM has enabled BERT to outperform previous embedding methods. It is a deeply bidirectional system that is capable of handling unlabelled text by jointly conditioning on both left and right context in all layers. In this research, we have extracted embeddings for a sentence or a set of words, or by pooling the sequence of hidden states for the whole input sequence. A deep bidirectional model is more powerful than a shallow left-to-right and right-to-left model. In the existing research [11], two types of BERT models have been investigated for context-specific tasks:
BERT Base (refer to Table 1 for its parameter settings): Smaller in size, computationally affordable, and not applicable to complex text mining operations.
Table 1

Parameters for BERT-Base

Parameter Name | Value of Parameter
Number of Layers | 12
Hidden Size | 768
Attention Heads | 12
Number of Parameters | 110M
BERT Large (refer to Table 2 for its parameter settings): Larger in size, computationally expensive, and able to process large text data to deliver the best results.
Table 2

Parameters for BERT-Large

Parameter Name | Value of Parameter
Number of Layers | 24
Hidden Size | 1024
Attention Heads | 16
Number of Parameters | 340M
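The parameter totals in Tables 1 and 2 can be roughly reproduced from the layer count and hidden size. The sketch below assumes values that are not stated in the tables: a WordPiece vocabulary of 30522 entries, 512 position embeddings, 2 segment types, a feed-forward inner size of four times the hidden size, and a pooling layer on top. It is an approximate count under those assumptions, so the results land near, not exactly on, the rounded 110M and 340M figures.

```python
def transformer_encoder_params(layers, hidden, vocab=30522, max_pos=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (weights and biases)."""
    ffn = 4 * hidden                 # assumed feed-forward inner size
    layer_norm = 2 * hidden          # scale and shift vectors
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # query, key, value, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

base = transformer_encoder_params(layers=12, hidden=768)    # Table 1 settings
large = transformer_encoder_params(layers=24, hidden=1024)  # Table 2 settings
```

The number of attention heads does not appear in the count because the heads split the hidden dimension rather than add to it.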

Fine-tuning of BERT

Fine-tuning of BERT [11] is a process that allows it to model many downstream tasks, irrespective of the text form (single text or text pairs). Only limited work is available on enhancing the computing power of BERT to improve performance on target tasks. The BERT model uses a self-attention mechanism to unify the word vectors as inputs, which includes bidirectional cross-attention between two sentences. There are a few fine-tuning factors that we need to consider:
1) The first factor is the pre-processing of long text, since the maximum sequence length of BERT is 512. In our research, we have taken a sequence length of 512.
2) The second factor is layer selection. The official BERT-base model consists of an embedding layer, a 12-layer encoder, and a pooling layer.
3) The third factor is the over-fitting problem. BERT can be fine-tuned with different learning parameters for different context-specific tasks [44] (refer to Table 2 for more information).
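The first factor, fitting long text into the 512-token limit, can be sketched as follows. Keeping only the head of the article is one simple strategy and is an assumption here; the text does not say which truncation strategy was used. The token strings are hypothetical placeholders rather than real WordPiece output.

```python
MAX_LEN = 512  # maximum sequence length stated in the text

def truncate_for_bert(tokens, max_len=MAX_LEN):
    """Keep the head of a long token list, leaving room for [CLS] and [SEP]."""
    body = tokens[: max_len - 2]
    return ["[CLS]"] + body + ["[SEP]"]

# Hypothetical 2000-token article.
article = ["tok%d" % i for i in range(2000)]
encoded = truncate_for_bert(article)
```

Shorter inputs pass through unchanged apart from the two special tokens and would still need padding up to the fixed length before batching.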

Deep learning models for fake news detection

Deep learning models are well known for achieving state-of-the-art results in a wide range of artificial intelligence applications [31]. This section provides an overview of the deep learning models used in our research, along with their architectures. Experiments have been conducted using deep learning-based models (CNN and LSTM [15]) and our proposed model (FakeBERT) with different pre-trained word embeddings.
a) Convolutional Neural Network (CNN): Fig. 4 shows the computational graph of our designed CNN model. This model truncates, zero-pads, and tokenizes each fake news article separately and passes it into an embedding layer. In this architecture (refer to Table 3 and Fig. 4), the first convolution layer holds 128 filters with kernel_size = 5, which reduces the input sequence length from 1000 to 996 after convolution. Each convolution layer is followed by a max-pooling layer that reduces the dimension further. The first max-pooling layer, with pool size 5, reduces the length to 1/5th of 996, i.e. 199. The second convolution layer holds 128 filters with kernel_size = 5, which reduces the length from 199 to 195. The following max-pooling layer, with pool size 5, reduces the length to 1/5th of 195, i.e. 39. After three convolution layers, a flatten layer is added to convert the 2-D input to 1-D. Subsequently, there are two hidden layers having 128 neurons each. The outputs of the CNNs are passed through a dense layer with dropout and then through a softmax layer to yield a stance classification. The number of trainable parameters is also shown in Table 3.
Fig. 4

CNN model

Table 3

CNN layered architecture

Layer | Input size | Output size | Param number
Embedding | 1000 | 1000 × 100 | 25187700
Conv1D | 1000 × 100 | 996 × 128 | 64128
Maxpool | 996 × 128 | 199 × 128 | 0
Conv1D | 199 × 128 | 195 × 128 | 82048
Maxpool | 195 × 128 | 39 × 128 | 0
Conv1D | 39 × 128 | 35 × 128 | 82048
Maxpool | 35 × 128 | 1 × 128 | 0
Flatten | 1 × 128 | 128 | 0
Dense | 128 | 128 | 16512
Dense | 128 | 2 | 258
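The sizes and parameter counts in Table 3 follow from simple arithmetic, sketched below. The sketch assumes stride-1 convolutions without padding and non-overlapping pooling; the final pool size of 35 is inferred from the 35 → 1 row of the table rather than stated in the text.

```python
def conv1d_out(length, kernel):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel + 1

def maxpool_out(length, pool):
    """Output length of non-overlapping max-pooling (floor division)."""
    return length // pool

def conv1d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units

length = 1000
lengths = []
for pool in (5, 5, 35):  # assumption: the last pool spans the remaining sequence
    length = conv1d_out(length, 5)
    lengths.append(length)
    length = maxpool_out(length, pool)
    lengths.append(length)

conv_params = [conv1d_params(5, 100, 128), conv1d_params(5, 128, 128)]
dense_layer_params = [dense_params(128, 128), dense_params(128, 2)]
```

Here `lengths` walks through the Output size column (996, 199, 195, 39, 35, 1), and the parameter lists match the Conv1D and Dense rows.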
b) Long Short Term Memory Network (LSTM): In this paper, we have implemented an LSTM model having four dense layers with batch normalization for the classification of fake news. Optimal hyperparameters were also selected for accurate results. Table 4 shows the layered architecture of the LSTM model.
Table 4

LSTM layered architecture

Layer | Input size | Output size | Param number
Embedding | 1000 × 100 | 1000 × 100 | 25187700
Dropout | 1000 × 100 | 1000 × 100 | 0
Conv1D | 1000 × 100 | 1000 × 32 | 16032
Maxpool | 1000 × 32 | 500 × 32 | 0
Conv1D | 500 × 32 | 500 × 64 | 6208
Maxpool | 500 × 64 | 250 × 64 | 0
LSTM | 250 × 64 | 100 | 66000
Batch-Normalization | 100 | 100 | 400
Dense | 100 | 256 | 25856
Dense | 256 | 128 | 32896
Dense | 128 | 64 | 8256
Dense | 64 | 2 | 130
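The Param number column of Table 4 can be checked with the standard counting formulas, sketched below. The convolution kernel sizes (5 and 3) are not given in the text; they are assumptions chosen because they reproduce the listed counts.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (input_dim * units + units * units + units)

def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units

def conv1d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * in_channels * filters + filters

lstm = lstm_params(64, 100)
batch_norm = 4 * 100  # scale, shift, moving mean, moving variance per feature
dense = [dense_params(100, 256), dense_params(256, 128),
         dense_params(128, 64), dense_params(64, 2)]
convs = [conv1d_params(5, 100, 32), conv1d_params(3, 32, 64)]  # assumed kernel sizes
```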

Proposed model: FakeBERT

In this paper, the most fundamental advantage of selecting a deep convolutional neural network is automatic feature extraction. In our proposed model, we pass the input in the form of a tensor in which local elements correlate with one another. More concrete results can be achieved with a deep architecture that develops hierarchical representations. Fig. 5 shows the computational graph of our proposed approach (FakeBERT). In many existing and useful studies [6, 52], the problem of fake news has been examined using a unidirectional pre-trained word embedding model followed by a 1D convolutional-pooling network [52]. Our model benefits from an automated feature engineering approach [36]. Its inputs are the vectors generated by word embedding from BERT. We give input vectors of equal dimension to all three convolutional layers in the parallel blocks [26], each followed by a pooling layer. The choices of the number of convolutional layers, kernel sizes, number of filters, optimal hyperparameters, etc. [19, 26] that make our model more accurate are as follows:
Fig. 5

FakeBERT model


Convolutional layer

The convolutional layer consists of a set of filters and kernels [52] for better semantic representations of words having different lengths. The main operations are matrix multiplications whose results pass through a non-linear activation function to produce the final output. In our proposed model, we have used three parallel blocks of 1D-CNN having one layer in each block, followed by two sequential convolutional layers after the concatenation step, with different kernel sizes and filters.
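A minimal single-channel sketch of this operation is given below: one kernel slides over a sequence, and each window sum passes through ReLU. The input values and the all-ones kernels are hypothetical; the real model uses 128 learned filters per block over BERT embedding vectors. Applying kernels of width 3, 4 and 5 to the same input mirrors the idea of parallel blocks that look at different n-gram sizes.

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide one kernel over a 1-D sequence, then apply ReLU to each window sum."""
    k = len(kernel)
    out = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        z = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, z))
    return out

# Hypothetical input and kernels; three branches with different kernel widths.
signal = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0]
branches = [conv1d_relu(signal, [1.0] * k) for k in (3, 4, 5)]
```

Each branch produces a sequence of length `len(signal) - k + 1`, which is why wider kernels yield slightly shorter outputs.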

Max-pooling layer

The max-pooling layer down-samples [27, 36] the output obtained from the convolutional layer and reduces the number of computations needed in the system. Its function is to progressively reduce the spatial size of the representation and thus the number of parameters and computations in the network. In our proposed model, we have used five max-pooling layers (three in the parallel blocks of 1D-CNN and two after the sequential convolutional layers).
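The down-sampling can be sketched in a few lines. The sketch assumes non-overlapping windows and drops a trailing partial window, which is a common default but is not stated in the text; the input values are placeholders.

```python
def max_pool1d(sequence, pool_size):
    """Non-overlapping max-pooling; a trailing partial window is dropped."""
    n = len(sequence) // pool_size
    return [max(sequence[i * pool_size:(i + 1) * pool_size]) for i in range(n)]

# A length-996 feature map pooled with size 5, as in the 996 -> 199 step.
features = list(range(996))
pooled = max_pool1d(features, 5)
```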

Flatten layer

In between the convolutional layer and the fully connected layer, there is a Flatten layer. Flattening transforms a two-dimensional matrix of features into a vector that can be fed into a fully connected neural network classifier.

Dense layer

A dense layer is a regular layer of neurons in a neural network. Each neuron receives input from all the neurons in the previous layer and is thus densely connected. The layer has a weight matrix W, a bias vector b, and the activations a of the previous layer. In many existing and useful methods [36, 45], authors have mostly used one or two dense layers in their proposed networks to prevent over-fitting. In our proposed model, we have also used two dense layers with different numbers of nodes.
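In terms of W, b and a, a dense layer computes z = W a + b before any activation is applied. The sketch below shows this with small hypothetical values; W is stored with one row per output neuron.

```python
def dense(a, W, b):
    """Compute z = W a + b for one layer; W has one row per output neuron."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical activations, weights and biases (3 inputs, 2 output neurons).
a = [1.0, 2.0, 3.0]
W = [[0.1, 0.2, 0.3],
     [0.0, -1.0, 0.5]]
b = [0.5, 0.0]
z = dense(a, W, b)
```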

Dropout

Dropout is a regularization technique [36, 45] in which randomly selected neurons are ignored during training. Their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to them on the backward pass. Dropout works by randomly setting the outgoing edges of hidden units to 0 at each update of the training phase. We have applied dropout to the dense layers in the network, with a dropout value of 0.2 in our investigations.
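A sketch of dropout with the rate of 0.2 used in the text follows. Scaling the surviving activations by 1/(1 - rate), often called inverted dropout, is an assumption made here because it is the usual convention in current frameworks; the text does not specify it. The seed and the all-ones input are illustrative only.

```python
import random

def dropout(activations, rate=0.2, training=True, rng=random):
    """Zero each activation with probability `rate` during training."""
    if not training:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - rate
    # Inverted dropout: survivors are scaled by 1/keep so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0] * 10000
dropped = dropout(hidden, rate=0.2, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```

With many units, `zero_fraction` comes out close to the configured rate.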

Activation Function

ReLU refers to the Rectified Linear Unit, the most commonly deployed activation function [22, 41] for the outputs of CNN neurons. The main advantage of ReLU over other activation functions is that it does not activate all the neurons at the same time. ReLU is computed after the convolution and, like tanh or sigmoid, is a non-linear activation function. The equation of ReLU can be written as:
$$f(z) = \max(0, z)$$
where z is the input.

Loss Function (L)

The cross-entropy compares the model’s prediction with the label, which is the true probability distribution. The cross-entropy goes down as the prediction gets more accurate and becomes zero if the prediction is perfect. As such, the cross-entropy can be used as a loss function to train a classification model. For example, predicting a probability of 0.014 when the actual observation label is 1 would be bad and result in a high loss value. In binary classification, where the number of classes (M) equals 2, cross-entropy can be calculated as:

$$L = -\left(y \log(p) + (1 - y) \log(1 - p)\right)$$

If M > 2 (i.e. multi-class classification), we calculate a separate loss for each class label per observation and sum the result:

$$L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$$

Here y is a binary indicator (0 or 1) of whether class label c is the correct classification for observation o, and p is the predicted probability that observation o is of class c.

The computational graph and layered architecture of our proposed FakeBERT model are shown in Table 5 and Fig. 5. In this design, the input is fed to three parallel blocks of 1D-CNN, each with one convolutional layer of 128 filters. The first convolutional layer has kernel_size = 3, which reduces the input embedding vector from 1000 to 998; the second has kernel_size = 4, which reduces it from 1000 to 997; and the third has kernel_size = 5, which reduces it from 1000 to 996. Each convolutional layer is followed by a max-pooling layer with kernel_size = 5, which reduces the vector to one fifth of its length, e.g. from 996 to 199. After concatenation of the three blocks, a convolutional layer with kernel_size = 5 and 128 filters is applied. Subsequently, there are two hidden layers having 384 and 128 nodes respectively. The number of trainable parameters in each layer is also presented in Table 5 (see the Param number column).

This model is not computationally complex to train on any real-world fake news dataset. The work was carried out using the NVIDIA DGX-1 V100 machine. The machine is equipped with 40600 CUDA cores, 5120 tensor cores, 128 GB RAM and 1000 TFLOPS speed.
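Returning to the loss function defined in this section, the following sketch evaluates the binary and multi-class cross-entropy for a single observation. It assumes the natural logarithm and clips probabilities by a small eps to avoid log(0); both choices are common practice but are assumptions here, and the function names are illustrative.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """-(y*log(p) + (1-y)*log(1-p)) for one observation."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """-sum_c y_c * log(p_c) for one observation with M classes."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))


# The example from the text: true label 1, predicted probability 0.014.
bad = binary_cross_entropy(1, 0.014)
good = binary_cross_entropy(1, 0.99)
print(round(bad, 2))   # about 4.27, a high loss
print(round(good, 2))  # about 0.01, a low loss

# With M = 2 and one-hot labels, the multi-class form gives the same value.
same = categorical_cross_entropy([0, 1], [0.986, 0.014])
```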
Table 5

FakeBERT layered architecture

| Layer | Input size | Output size | Param number |
|---|---|---|---|
| Embedding | 1000 | 1000 × 100 | 25187700 |
| Conv1D | 1000 × 100 | 998 × 128 | 38528 |
| Conv1D | 1000 × 100 | 997 × 128 | 51328 |
| Conv1D | 1000 × 100 | 996 × 128 | 64128 |
| Maxpool | 998 × 128 | 199 × 128 | 0 |
| Maxpool | 997 × 128 | 199 × 128 | 0 |
| Maxpool | 996 × 128 | 199 × 128 | 0 |
| Concatenate | 199 × 128, 199 × 128, 199 × 128 | 597 × 128 | 0 |
| Conv1D | 597 × 128 | 593 × 128 | 82048 |
| Maxpool | 593 × 128 | 118 × 128 | 0 |
| Conv1D | 118 × 128 | 114 × 128 | 82048 |
| Maxpool | 114 × 128 | 3 × 128 | 0 |
| Flatten | 3 × 128 | 384 | 0 |
| Dense | 384 | 128 | 49280 |
| Dense | 128 | 2 | 258 |
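The sizes and parameter counts in Table 5 can be reproduced with simple arithmetic. The sketch below assumes stride-1 convolutions without padding and non-overlapping pooling; the pool size of 30 for the last max-pooling layer is inferred from the 114 to 3 reduction in the table and is not stated in the text. All helper names are illustrative.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel x input channels x filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def pool_out_len(length, pool_size):
    """Output length of non-overlapping max pooling."""
    return length // pool_size


def dense_params(n_in, n_out):
    return n_in * n_out + n_out


SEQ_LEN, EMB_DIM, FILTERS = 1000, 100, 128

# Three parallel blocks with kernel sizes 3, 4 and 5, each followed by pooling.
branch_params = []
branch_lengths = []
for k in (3, 4, 5):
    conv_len = conv1d_out_len(SEQ_LEN, k)            # 998, 997, 996
    branch_params.append(conv1d_params(k, EMB_DIM, FILTERS))
    branch_lengths.append(pool_out_len(conv_len, 5))  # 199 each

merged_len = sum(branch_lengths)  # concatenation along the sequence axis
len_a = pool_out_len(conv1d_out_len(merged_len, 5), 5)
len_b = pool_out_len(conv1d_out_len(len_a, 5), 30)  # pool size 30 is inferred
flat = len_b * FILTERS

print(branch_params)                    # [38528, 51328, 64128]
print(merged_len, len_a, len_b, flat)   # 597 118 3 384
print(conv1d_params(5, FILTERS, FILTERS))             # 82048
print(dense_params(flat, 128), dense_params(128, 2))  # 49280 258
```

The three branches are joined along the sequence axis, which is why the concatenated length is 3 × 199 = 597 while the number of channels stays at 128.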

Experiments

Experiments have been conducted using deep learning models (CNN and LSTM) and our proposed model (FakeBERT) with pre-trained word embedding techniques (BERT and GloVe). The performance of the different classification models is recorded and compared with the benchmark results.

Dataset description

In this paper, we have done extensive experiments using the real-world fake news dataset.1 It (refer Table 9) consists of two files: (i) train.csv, the labelled training dataset, and (ii) test.csv, a testing dataset without the label. It is a collection of fake and real news articles propagated during the 2016 U.S. General Presidential Election. Table 10 shows the number of instances for each class label in the dataset.
Table 9

Attributes in the fake news dataset

| Attribute | Number of Instances |
|---|---|
| ID (unique value to the news article) | 20800 |
| title (main heading related to particular news) | 20242 |
| author (name of the creator of that news) | 18843 |
| text (complete news article) | 20761 |
| label (information about that the article as fake or real) | 20800 |

Table 10

Fake news dataset with the class labels

| Class label | Number of Instances |
|---|---|
| True | 10540 |
| False | 10260 |

Hyperparameter setting

The selection of optimal hyperparameters is one of the main steps of any deep learning solution. Existing deep learning models explicitly define optimal hyperparameters that account for several factors such as memory and cost. The best values also depend on whether the dataset is balanced or imbalanced. There are two fundamental approaches to selecting them: automatic and manual selection. Both methods are equally valid, but manual selection requires deep knowledge of the model, while automatic selection has a high computational cost. Tables 6, 7 and 8 list the values of the hyperparameters used in our investigations.
Table 6

Optimal hyperparameters with CNN

| Hyperparameter | Value |
|---|---|
| Number of convolution layers | 3 |
| Number of max pooling layers | 3 |
| Number of dense layers | 2 |
| Number of Flatten layers | 1 |
| Loss function | Categorical-crossentropy |
| Activation function | Relu |
| Learning rate | 0.001 |
| Optimizer | Ada-delta |
| Number of epochs | 10 |
| Batch size | 128 |

Table 7

Optimal hyperparameters with LSTM

| Hyperparameter | Value |
|---|---|
| Number of convolution layers | 2 |
| Number of max pooling layers | 2 |
| Number of dense layers | 4 |
| Dropout rate | 0.2 |
| Optimizer | Adam |
| Activation function | Relu |
| Loss function | Binary-crossentropy |
| Number of epochs | 10 |
| Batch size | 64 |

Table 8

Optimal hyperparameters with FakeBERT

| Hyperparameter | Value |
|---|---|
| Number of convolution layers | 5 |
| Number of max pooling layers | 5 |
| Number of dense layers | 2 |
| Number of Flatten layers | 1 |
| Dropout rate | 0.2 |
| Optimizer | Adadelta |
| Activation function | Relu |
| Loss function | Categorical-crossentropy |
| Number of epochs | 10 |
| Batch size | 128 |

Evaluation parameters

To evaluate the performance of FakeBERT, we have considered the accuracy, cross-entropy loss, FPR (False Positive Rate), FNR (False Negative Rate), and confusion matrix (refer Table 11 for more details) as evaluation metrics.
Table 11

Representation of confusion matrix

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | True negative (TN) | False positive (FP) |
| Actual positive | False negative (FN) | True positive (TP) |
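The evaluation metrics follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions, accuracy = (TP + TN) / total, FPR = FP / (FP + TN) and FNR = FN / (FN + TP), and applies them to the counts reported for CNN with BERT in Table 17; the function name is illustrative.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Accuracy, false positive rate and false negative rate from raw counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),  # share of actual negatives predicted positive
        "fnr": fn / (fn + tp),  # share of actual positives predicted negative
    }


# Counts from Table 17 (CNN with BERT).
m = confusion_metrics(tn=1004, fp=63, fn=90, tp=942)
print(round(m["accuracy"], 4))  # 0.9271
print(round(m["fpr"], 4))       # 0.059
print(round(m["fnr"], 4))       # 0.0872
```

These values agree with the 92.70% accuracy in Table 21 and the FPR and FNR of 0.0590 and 0.0872 in Table 22 for the same model.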

Results and discussion

We have investigated and analyzed the results with several classifiers having different types of learning paradigms (different optimal hyper-parameters and architectures). Classification results demonstrate that the capability of automatic feature extraction with deep learning models plays an essential role in the accurate detection of fake news. Our proposed model (FakeBERT) produced more accurate results as compared to existing benchmarks with an accuracy of 98.90%.

Classification results using machine learning models

Firstly, several experiments were conducted to estimate the performance of selected machine learning classifiers (Multinomial Naive Bayes (MNB), Random Forest (RF), Decision Tree (DT), and K-nearest neighbor (KNN)) using the real-world fake news dataset. Using MNB with GloVe, we achieved an accuracy of 89.97%; the respective confusion matrix is shown in Table 12. Confusion matrices for the other machine learning classifiers are shown in Tables 13, 14 and 15. The decision-tree algorithm provides an accuracy of 73.65%. Of these classifiers, MNB predicts labels closest to the actual labels on the testing dataset (for more details refer to Table 12), and on this balanced dataset it provided the most accurate results. Machine learning-based classification results are tabulated in Table 21 and Fig. 6.

In this research, we first investigated the performance of different machine learning models with a uni-directional pre-trained model and found that accuracy was not up to the mark on the real-world fake news dataset. A bidirectional training model, which is a more powerful feature extractor [44], was therefore a priority for investigation. Motivated by this fact, we introduced BERT, a bidirectional transformer encoder-based pre-trained word embedding model. BERT is a more powerful feature extractor than GloVe and provides effective results for NLP tasks. Experiments conducted using the BERT-based machine learning approach achieved improved classification results.

Deep learning is a subset of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts. One of deep learning’s main advantages over other machine learning is its capacity to perform feature engineering on its own. A deep learning algorithm scans the data for features that correlate and combines them to enable faster learning.
Table 12

Confusion matrix for MNB with GloVe

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 853 (TN) | 111 (FP) |
| Actual positive | 73 (FN) | 898 (TP) |

Table 13

Confusion matrix for KNN with GloVe

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 282 (TN) | 762 (FP) |
| Actual positive | 200 (FN) | 836 (TP) |

Table 14

Confusion matrix for DT with GloVe

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 631 (TN) | 413 (FP) |
| Actual positive | 135 (FN) | 901 (TP) |

Table 15

Confusion matrix for RF with GloVe

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 683 (TN) | 361 (FP) |
| Actual positive | 234 (FN) | 802 (TP) |
Table 21

Classification results with BERT and GloVe

| Word embedding model | Classification model | Accuracy (%) |
|---|---|---|
| TF-IDF (using unigrams and bigrams) | Neural Network | 94.31 |
| BOW (Bag of words) | Neural Network | 89.23 |
| Word2Vec | Neural Network | 75.67 |
| GloVe | MNB | 89.97 |
| GloVe | DT | 73.65 |
| GloVe | RF | 71.34 |
| GloVe | KNN | 53.75 |
| BERT | MNB | 91.20 |
| BERT | DT | 79.25 |
| BERT | RF | 76.40 |
| BERT | KNN | 59.10 |
| GloVe | CNN | 91.50 |
| GloVe | LSTM | 97.25 |
| BERT | CNN | 92.70 |
| BERT | LSTM | 97.55 |
| BERT | Our Proposed model (FakeBERT) | 98.90 |
Fig. 6

Classification results with GloVe


Classification results using deep learning models

To improve the classification results and to address the issues in the machine learning implementations, more experiments have been conducted with deep learning models (CNN, LSTM, and FakeBERT), and their performance on the real-world fake news dataset has been recorded. We have designed a deep convolutional network with BERT as the word embedding model; our deep learning-based approach is built on top of BERT. Using the GloVe-based deep approach with a convolutional neural network (CNN) and Long Short Term Memory (LSTM), we found improved classification results, with accuracies of 91.50% and 97.25% respectively after 10 epochs (see Table 21). The confusion matrix for LSTM with GloVe is shown in Table 16. Experiments have also been conducted using CNN and LSTM with BERT; the respective confusion matrices are shown in Tables 17 and 18. Using BERT, we achieved a validation accuracy of 92.70% with CNN and 97.55% with LSTM after 10 epochs. We have found in our investigation that our BERT approach provided state-of-the-art results in fake news classification.
Table 16

Confusion matrix for LSTM with GloVe

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 1030 (TN) | 8 (FP) |
| Actual positive | 47 (FN) | 995 (TP) |

Table 17

Confusion matrix for CNN with BERT

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 1004 (TN) | 63 (FP) |
| Actual positive | 90 (FN) | 942 (TP) |

Table 18

Confusion matrix for LSTM with BERT

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 1032 (TN) | 7 (FP) |
| Actual positive | 44 (FN) | 998 (TP) |
To validate the performance of our BERT-based deep learning model (FakeBERT), several experiments have been conducted with optimized hyperparameters. Our model achieved the most accurate results, with an accuracy of 98.90%. The respective confusion matrix is shown in Table 19, and the selected hyperparameters are listed in Table 8. Fig. 7 shows the accuracy and cross-entropy loss of our implemented CNN model on the real-world fake news dataset. As seen from Fig. 8, the training loss decays more quickly with the BERT-based model than with earlier word embedding models (such as GloVe and Word2Vec). The architecture of our implemented BERT-based model (FakeBERT) is given in Fig. 5 and Table 5. Table 21 reports the accuracy of the implemented FakeBERT model, 98.90% on the test set. As investigated above, the pre-trained embedding-based models consistently outperform the others by a significant margin. The training loss of the BERT approach decays comparatively fast and without any inconsistencies. Fig. 9 shows clearly that the cross-entropy loss falls quickly with the FakeBERT model. We achieved more accurate results with our proposed model than with the other implemented models, with minimal loss.

To further validate the performance of our proposed model, we have considered two more evaluation parameters (FPR and FNR). Results are tabulated in Table 22. With our proposed model (FakeBERT), the FPR and FNR are 1.60% and 0.59% respectively. Its FNR is the lowest of all models in Table 22, and only the two LSTM-based models report a lower FPR. With most other classification models, the values of FPR and FNR are considerably higher.
Table 19

Confusion matrix for FakeBERT with BERT

| | Predicted negative | Predicted positive |
|---|---|---|
| Actual negative | 1045 (TN) | 6 (FP) |
| Actual positive | 17 (FN) | 1012 (TP) |
Fig. 7

Accuracy and cross entropy loss using CNN

Fig. 8

Accuracy and cross entropy loss using FakeBERT

Fig. 9

Classification results with BERT

Table 22

False Positive Rate (FPR) and False Negative Rate (FNR)

| Word Embedding Model | Classification Model | FPR | FNR |
|---|---|---|---|
| TF-IDF (using unigrams and bigrams) | Neural Network | 0.04684 | 0.0742 |
| BOW (Bag of words) | Neural Network | 0.1040 | 0.0862 |
| Word2Vec | Neural Network | 0.1320 | 0.3416 |
| GloVe | MNB | 0.1151 | 0.0752 |
| GloVe | DT | 0.3956 | 0.1303 |
| GloVe | RF | 0.3458 | 0.2259 |
| GloVe | KNN | 0.7299 | 0.1931 |
| BERT | MNB | 0.0985 | 0.0789 |
| BERT | DT | 0.1660 | 0.2429 |
| BERT | RF | 0.1245 | 0.3318 |
| BERT | KNN | 0.4037 | 0.4110 |
| GloVe | CNN | 0.0989 | 0.0776 |
| GloVe | LSTM | 0.0080 | 0.0482 |
| BERT | CNN | 0.0590 | 0.0872 |
| BERT | LSTM | 0.0077 | 0.0451 |
| BERT | FakeBERT | 0.0160 | 0.0059 |
We observed that using a bidirectional pre-trained word embedding (BERT) leads to faster training of the model and lower cross-entropy loss. Consistently in classification tasks, precision and recall improve when we use a pre-trained word embedding (trained on a sufficiently large corpus). Table 21 lists the results for both machine learning and deep learning models. It demonstrates clearly that our proposed model (FakeBERT) achieves state-of-the-art results compared with the benchmark results of the other classification models. Table 20 gives a comparative analysis of the proposed method against existing benchmarks on the Kaggle real-world fake news dataset. Among these benchmarks, the highest reported classification accuracy is 93.50%. Table 21 demonstrates clearly that our proposed model gives comparatively more accurate results and better performance (testing accuracy, FPR, FNR, cross-entropy loss). Cross-entropy loss is also much lower using BERT as the training model (for more details refer to Fig. 10). Using our BERT-based deep convolutional approach (FakeBERT), we achieved an accuracy of 98.90%, as compared to 98.36% with GloVe (Table 22).
Table 20

Our proposed model vs existing benchmarks with real-world fake news dataset

| Authors | Accuracy (%) |
|---|---|
| Ghanem et al [13] | 48.80 |
| Singh et al [42] | 87.00 |
| Ahmed et al [1] using LR-unigram model | 89.00 |
| Ruchansky et al [35] | 89.20 |
| Ahmed et al [1] using LSVM model | 92.00 |
| Liu et al [23] | 92.10 |
| O’Brien et al [28] | 93.50 |
| Our Proposed model (FakeBERT) | 98.90 |
Fig. 10

Cross entropy loss with CNN, LSTM, and FakeBERT

Conclusion and future scope

In this research, we have demonstrated the performance of our proposed model (FakeBERT, a BERT-based deep convolutional approach) for fake news detection. Our model is a combination of BERT and three parallel blocks of 1D-CNN having convolutional layers with different kernel sizes and filters for better learning. It is built on top of a bidirectional transformer encoder-based pre-trained word embedding model (BERT). Classification results demonstrate that FakeBERT provides more accurate results, with an accuracy of 98.90%. The accuracy of FakeBERT is better than that of the current state-of-the-art models on the real-world fake news dataset Fake-News. This dataset consists of thousands of fake and real news articles propagated during the 2016 U.S. General Presidential Election. We evaluated our models with different parameters (accuracy, FPR, FNR, and cross-entropy loss). In future work, we will design a hybrid approach (combining content, context, and temporal level information from news articles) applicable to both binary and multi-class real-world fake news datasets. This hybrid approach can be valuable for detecting instances of fake news in multi-label datasets that propagate in a graph. We will further study the problem of fake news from the viewpoint of the different echo-chambers that exist in social media data, where an echo-chamber can be considered a group of people having the same opinion on a social concern. The prime motivation for introducing echo-chambers is that every user of a social media platform is connected to others in a graph-like structure, like a community, rather than acting in isolation.