Literature DB >> 35697814

Character gated recurrent neural networks for Arabic sentiment analysis.

Eslam Omara1, Mervat Mousa2, Nabil Ismail2.   

Abstract

Sentiment analysis is a Natural Language Processing (NLP) task concerned with opinions, attitudes, emotions, and feelings. It applies NLP techniques for identifying and detecting personal information from opinionated text. Sentiment analysis deduces the author's perspective regarding a topic and classifies the attitude polarity as positive, negative, or neutral. In the meantime, deep architectures applied to NLP reported a noticeable breakthrough in performance compared to traditional approaches. The outstanding performance of deep architectures is related to their capability to disclose, differentiate and discriminate features captured from large datasets. Recurrent neural networks (RNNs) and their variants Long-Short Term Memory (LSTM), Gated Recurrent Unit (GRU), Bi-directional Long-Short Term Memory (Bi-LSTM), and Bi-directional Gated Recurrent Unit (Bi-GRU) architectures are robust at processing sequential data. They are commonly used for NLP applications as they-unlike RNNs-can combat vanishing and exploding gradients. Also, Convolution Neural Networks (CNNs) were efficiently applied for implicitly detecting features in NLP tasks. In the proposed work, different deep learning architectures composed of LSTM, GRU, Bi-LSTM, and Bi-GRU are used and compared for Arabic sentiment analysis performance improvement. The models are implemented and tested based on the character representation of opinion entries. Moreover, deep hybrid models that combine multiple layers of CNN with LSTM, GRU, Bi-LSTM, and Bi-GRU are also tested. Two datasets are used for the models implementation; the first is a hybrid combined dataset, and the second is the Book Review Arabic Dataset (BRAD). The proposed application proves that character representation can capture morphological and semantic features, and hence it can be employed for text representation in different Arabic language understanding and processing tasks.
© 2022. The Author(s).

Entities:  

Mesh:

Year:  2022        PMID: 35697814      PMCID: PMC9192763          DOI: 10.1038/s41598-022-13153-w

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.996


Introduction

Natural language processing considers many tasks to analyze the text structure and understand its semantics. The extracted syntactic and semantic information is then exploited for a higher-level task. Examples of NLP tasks are Part-of-speech Tagging (POS)[1], Chunking or shallow parsing[2], Parsing[3], Semantic role labeling (SRL)[1], Named entity recognition (NER)[1], Word-sense disambiguation[4], Anaphora resolution (pronoun resolution)[5], Sentence classification[6], Sentiment analysis[7], Emotion detection (ED)[8,9], Document classification[10], Text summarization[11], Machine translation[3], and Question answering (QA)[2]. Natural language processing tasks can be categorized according to the revealed information as[1,2]: Syntactic tasks as part-of-speech tagging, chunking, and parsing. Semantic tasks include sentiment analysis, emotion detection, document classification, text summarization, machine translation, question answering, sentence classification, word-sense disambiguation, semantic role labelling, named entity recognition, and anaphora resolution. NLP tasks were investigated by applying statistical and machine learning techniques. Recently, deep learning (DL) structures are extensively used in NLP. Deep learning models can identify and learn features from raw data, and they registered superior performance in various fields[12]. In addition to natural language processing, DL were employed in computer vision, handwriting recognition, speech recognition, object detection, cancer detection, biological image classification, face recognition, stock market analysis, and many others[13]. Deep learning applies a variety of architectures capable of learning features that are internally detected during the training process. RNNs are deep learning architectures commonly used for sequence modelling. The recurrence connection in RNNs supports the model to memorize dependency information included in the sequence as context information in natural language tasks[14]. And hence, RNNs can account for words order within the sentence enabling preserving the context[15]. Unlike feedforward neural networks that employ the learned weights for output prediction, RNN uses the learned weights and a state vector for output generation[16]. Long-Short Term Memory (LSTM), Gated Recurrent Unit (GRU), Bi-directional Long-Short Term Memory (Bi-LSTM), and Bi-directional Gated Recurrent Unit (Bi-GRU) are variants of the simple RNN. The variants are based on the notion of gates[16,17]. Contrary to RNN, gated variants are capable of handling long term dependencies. Also, they can combat vanishing and exploding gradients by the gating technique[14]. Bi-directional recurrent networks can handle the case when the output is predicted based on the input sequence's surrounding components[18]. LSTM is the most widespread DL architecture applied to NLP as it can capture far distance dependency of terms[15]. GRUs implemented in NLP tasks are more appropriate for small datasets and can train faster than LSTM[17]. In addition to gated RNNs, Convolutional Neural Network (CNN) is another common DL architecture used for feature detection in different NLP tasks. For example, CNNs were applied for SA in deep and shallow models based on word and character features[19]. Moreover, hybrid architectures—that combine RNNs and CNNs—demonstrated the ability to consider the sequence components order and find out the context features in sentiment analysis[20]. These architectures stack layers of CNNs and gated RNNs in various arrangements such as CNN-LSTM, CNN-GRU, LSTM-CNN, GRU-CNN, CNN-Bi-LSTM, CNN-Bi-GRU, Bi-LSTM-CNN, and Bi-GRU-CNN. Convolutional layers help capture more abstracted semantic features from the input text and reduce dimensionality. RNN layers capture the gesture of the sentence from the dependency and order of words. Meanwhile, many customers create and share content about their experience on review sites, social channels, blogs etc. The valuable information in the authors tweets, reviews, comments, posts, and form submissions stimulated the necessity of manipulating this massive data. The revealed information is an essential requirement to make informed business decisions. Sentiment analysis is a crucial NLP task that aims at studying and understanding personal emotions, behaviours, opinions, feelings, and assessments of various targets such as services, facilities, products, problems, items, firms, occasions, topics, and public figures[8,19]. Understanding individuals sentiment is the basis of understanding, predicting, and directing their behaviours. By applying NLP techniques, SA detects the polarity of the opinioned text and classifies it according to a set of predefined classes. Statistical, machine learning and deep learning methodologies applied for SA performance improvement tackled problems such as capturing context information, considering dialectical language, handling social media text's unique nature and identifying the sentiment holder. In the Arabic language, the character form changes according to its location in the word. It can be written connected or disconnected at the end, placed within the word, or found at the beginning. Besides, diacritics or short vowels control the word phonology and alter its meaning. These characteristics propose challenges to word embedding and representation[21]. Further challenges for Arabic language processing are dialects, morphology, orthography, phonology, and stemming[21]. In addition to the Arabic nature related challenges, the efficiency of word embedding is task-related and can be affected by the abundance of task-related words[22]. Therefore, a convenient Arabic text representation is required to manipulate these exceptional characteristics. Most implementations of LSTMs and GRUs for Arabic SA employed word embedding to encode words by real value vectors. Besides, the common CNN-LSTM combination applied for Arabic SA used only one convolutional layer and one LSTM layer. Up to the available knowledge, the performance of deep LSTM, GRU, Bi-LSTM, and Bi-GRU has not been investigated in Arabic language SA using character representation. Furthermore, deep hybrid combinations such as CNN-LSTM, CNN-GRU, LSTM-CNN, GRU-CNN, CNN-Bi-LSTM, CNN-Bi-GRU, Bi-LSTM-CNN, and Bi-GRU-CNN have not been studied or compared. Therefore, the contributions of the proposed work are: Four architectures of deep LSTM, GRU, Bi-LSTM and Bi-GRU are investigated based on character features for Arabic sentiment analysis. Eight deep hybrid CNN-LSTM, CNN-GRU, LSTM-CNN, GRU-CNN, CNN-Bi-LSTM, CNN-Bi-GRU, Bi-LSTM-CNN, and Bi-GRU-CNN structures that merge layers of different architectures are also implemented and validated. The presented deep networks are tested on two datasets; the first is a hybrid dataset that was built from multiple available datasets dedicated to Arabic SA. The second and the benchmarking dataset is the Arabic book reviews dataset (BRAD). The proposed application examines the ability of deep networks to detect discriminating features from data represented at the character level. Extensive empirical analysis of the predictive performance of the twelve networks using the two datasets is conducted to find out the architectures that best fit the low-level representation. The remainder of the paper is organized as follows: the “Sentiment analysis” section explains notions, concepts, and definitions related to sentiment analysis and the “Feature representation” section discusses the approaches commonly used to represent features for NLP tasks. The literature review is introduced in the “Related work” section. The “Applied models” section clarifies in details the structure and settings of the implemented networks. Results invistigation and empirical analysis are proposed in the “Experiments and results” section. Finally, the concluded results and further future work are declared in the “Conclusion” section.

Sentiment analysis

SA research depends on data originating from social media, such as tweets, reviews, and comments. Lately, medical services, stock market, and human emotions were discussed while early topics included reviews, product features, and elections[23]. Sentiment analysis has been studied at multiple granularity levels: document, sentence, and aspect. Each opinionated text is considered one unit and assigned a positive, negative, or neutral polarity at the document level. The document holds an opinion regarding a single entity and has one opinion holder. Opinions that maintain multiple entities assessment cannot be analyzed using this level[6,24]. Sentence level SA begins with determining if the sentence expresses an opinion or not (subjective or objective). This step is known as subjectivity classification. Next, the sentiment orientation of emotional sentences is identified by multi-class or binary classification. The multi-class classification assigns a positive, negative, or neutral category to subjective sentences, whereas the binary type considers only positive and negative classes[6,25]. A more fine-grained SA is the aspect level or phrase level that defines the quintuple (Object, Aspect, Sentiment Orientation, Opinion Holder, Time) components of an opinion concerning an entity or an entity feature. It is also called feature-based sentiment analysis. An argument about an object may hold a positive orientation regarding a characteristic and a negative orientation regarding another characteristic, so it is not positive or negative for the whole entity[24,25]. Sentiment analysis is generally applied using three approaches. Most machine learning algorithms applied for SA are mainly supervised approaches such as Support Vector Machine (SVM), Naïve Bayes (NB), Artificial Neural Networks (ANN), and K-Nearest Neighbor (KNN)[26]. A large labelled dataset is required to train a robust classifier. But, large pre-annotated datasets are usually unavailable and extensive work, cost, and time are consumed to annotate the collected data. Lexicon based approaches use sentiment lexicons that contain words and their corresponding sentiment scores. The corresponding value identifies the word polarity (positive, negative, or neutral). These approaches do not use labelled datasets but require wide-coverage lexicons that include many sentiment holding words. Dictionaries are built by applying corpus-based or dictionary-based approaches[6,26]. The lexicon approaches are popularly used for Modern Standard Arabic (MSA) due to the lack of vernacular Arabic dictionaries[6]. Sentiment polarities of sentences and documents are calculated from the sentiment score of the constituent words/phrases. Most techniques use the sum of the polarities of words and/or phrases to estimate the polarity of a document or sentence[24]. The lexicon approach is named in the literature as an unsupervised approach because it does not require a pre-annotated dataset. It depends mainly on the mathematical manipulation of the polarity scores, which differs from the unsupervised machine learning methodology. The hybrid approaches (Semi-supervised or weakly supervised) combine both lexicon and machine learning approaches. It manipulates the problem of labelled data scarcity by using lexicons to evaluate and annotate the training set at the document or sentence level. Un-labelled data are then classified using a classifier trained with the lexicon-based annotated data[6,26].

Feature representation

Processing unstructured data such as text, images, sound records, and videos are more complicated than processing structured data. The difficulty of capturing semantics and concepts of the language from words proposes challenges to the text processing tasks. A document can not be processed in its raw format, and hence it has to be transformed into a machine-understandable representation[27]. Selecting the convenient representation scheme suits the application is a substantial step[28]. The fundamental methodologies used to represent text data as vectors are Vector Space Model (VSM) and neural network-based representation. Text components are represented by numerical vectors which may represent a character, word, paragraph, or the whole document. VSM can be formulated by many approaches[28,29]. Binary representation is an approach used to represent text documents by vectors of a length equal to the vocabulary size. Documents are quantized by One-hot encoding to generate the encoding vectors[30]. The representation does not preserve word meaning or order, so similar words cannot be distinguished from entirely different worlds. One-hot encoding of a document corpus is a vast sparse matrix resulting in a high dimensionality problem[28]. This representation is referred to as discrete or local representation[29]. The bag of Word (BOW) approach constructs a vector representation of a document based on the term frequency. BOW is widely speared for text classification applications[27]. However, a drawback of BOW representation is that word order is not preserved, resulting in losing the semantic associations between words. Another limitation is that each word is represented as a distinct dimension. The representation vectors are sparse, with too many dimensions equal to the corpus vocabulary size[31]. Also, there exist many cases of polysemous and homonymous. Polysemy refers to the presence of many possible meanings for a word. Homonymy means the existence of two or more words with the same spelling or pronunciation but different meanings and origins. Words with different semantics and the same spelling have the same representation. And synonym words with different spelling have completely different representations[28,29]. Representing documents based on the term frequency does not consider that common words have higher occurrence than other words, and so the corresponding dimensions are defined by much higher values than rare but discriminating words. Term weighting techniques are applied to assign appropriate weights to the relevant terms to handle such problems. Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting schema that uses term frequency and inverse document frequency to discriminate items[29]. Bag-Of-N-Grams (BONG) is a variant of BOW where the vocabulary is extended by appending a set of N consecutive words to the word set. The N-words sequences extracted from the corpus are employed as enriching features. But, the number of words selected for effectively representing a document is difficult to determine[27]. The main drawback of BONG is more sparsity and higher dimensionality compared to BOW[29]. Bag-Of-Concepts is another document representation approach where every dimension is related to a general concept described by one or multiple words[29]. Alternatively, words can be quantized by a distributed representation. Each word is assigned a continuous vector that belongs to a low-dimensional vector space. Neural networks are commonly used for learning distributed representation of text, known as word embedding[27,29]. Popular neural models used for learning word embedding are Continuous Bag-Of-Words (CBOW)[32], Skip-Gram[32], and GloVe[33] embedding. In CBOW, word vectors are learned by predicting a word based on its context. A context is a predefined number of words around the expected word. Skip-Gram follows a reversed strategy as it predicts the context words based on the centre word. GloVe uses the vocabulary words co-occurrence matrix as input to the learning algorithm where each matrix cell holds the number of times by which two words occur in the same context. A discriminant feature of word embedding is that they capture semantic and syntactic connections among words. Embedding vectors of semantically similar or syntactically similar words are close vectors with high similarity[29]. Learning word embedding depends on a distributional assumption which supposes that words with similar meanings occur in similar contexts and hence they have comparable distributions[27]. Relying on word co-occurrence may place antonymous words near each other in the vector space, which can be a drawback of word embedding. For example, “good and bad” may be assigned close vectors because they often appear in similar contexts. The efficiency of word embedding may be affected by such cases, especially in tasks like SA[29]. In the proposed investigation, the SA task is inspected based on character representation, which reduces the vocabulary set size compared to the word vocabulary. Besides, the learning capability of deep architectures is exploited to capture context features from character encoded text.

Related work

Recurrent neural networks (RNNs) and their gated variants, Long-Short Term Memory (LSTM) and Gated Recurrent Unit (GRU), have been applied in different NLP tasks such as text generation, sentiment analysis, machine translation, question answering, and summarization. The applications exploit the capability of RNNs and gated RNNs to manipulate inputs composed of sequences of words or characters[17,34]. RNNs process chronological sequence in both input and output, or only one of them. According to the investigated problem, RNNs can be arranged in different topologies[16]. In addition to the homogenous arrangements composed of one type of deep learning networks, there are hybrid architectures combine different deep learning networks. The hybrid architectures avail from the outstanding characteristic of each network type to empower the model. CNN, LSTM, Bi-LSTM, and GRU were implemented using word and character embedding for sentiment categorization[34]. Bi-LSTM showed the best performance using the word embedding, whereas CNN reported the best performance using the character embedding. The results were further enhanced by combining the features disclosed from character CNN and word Bi-LSTM in a hybrid model. The integrated features were fed to the classification layer for polarity identification, and the model showed more boosted performance. Also, CNN, RNN, LSTM, GRU, and CNN-LSTM were tested for sentiment analysis of product reviews and based on word embedding, the CNN-LSTM architecture registered the highest performance[35]. LSTM reported the second-highest performance. It was highlighted that LSTM is efficient at NLP tasks. Shallow LSTM, GRU, Bi-LSTM, and Bi-GRU were trained and compared using the Amazon review corpus[36]. Results reported that bi-directional structures reached higher performance compared to unidirectional versions. Additionally, GRU trained faster and outperformed LSTM. A comparative study was conducted applying multiple deep learning models based on word and character features[37]. Three CNN and five RNN networks were implemented and compared on thirteen reviews datasets. One, nine, and twenty-nine layers CNN models were implemented. Also, RNN, LSTM, GRU, Bi-LSTM, and Bi-GRU architectures were tested. Although the thirteen datasets included reviews, the deep models performance varied according to the domain and the characteristics of the dataset. Based on word-level features Bi-LSTM, GRU, Bi-GRU, and the one layer CNN reached the highest performance on numerous review sets, respectively. Based on character level features, the one layer CNN, Bi-LSTM, twenty-nine layers CNN, GRU, and Bi-GRU achieved the best measures consecutively. A sentiment categorization model that employed a sentiment lexicon, CNN, and Bi-GRU was proposed in[38]. Sentiment weights calculated from the sentiment lexicon were used to weigh the input embedding vectors. The CNN-Bi-GRU network detected both sentiment and context features from product reviews better than the networks that applied only CNN or Bi-GRU. For Arabic SA, a lexicon was combined with RNN to classify sentiment in tweets[39]. An RNN network was trained using feature vectors computed using word weights and other features as percentage of positive, negative and neutral words. RNN, SVM, and L2 Logistic Regression classifiers were tested and compared using six datasets. In addition, LSTM models were widely applied for Arabic SA using word features and applying shallow structures composed of one or two layers[15,40-42], as shown in Table 1.
Table 1

Arabic sentiment analysis using RNNs and gated RNN.

[39][15][40][41][42][47][48][49][43][44][45][46][14][50]
ModelRNN

LSTM

DNN

LSTM

CNN

RCNN

LSTM

CNN

LSTM

LSTM

Bi-LSTM

CNN

Bi-LSTM

Bi-LSTM

LSTM

CNN

CNN

LSTM

CNN-LSTM

CNN-LSTMsCNN-LSTMCNN-LSTM

LSTM

GRU

Bi-LSTM

Bi-GRU

GRU
LayersTwo layersOne layerOne layerOne layer

One layer

Two layers

One layerOne layer

One layer

Three layers

Conv. layer

LSTM layer

Conv. layer

Two LSTM layers

Conv. layer

LSTM layer

One layerOne layer
FeaturesWord weightBinary vectorsWord embeddingWord embeddingWord embeddingWord embeddingWord embeddingWord embeddingWord embedding

Character

Character N-gram

Word

Word embeddingWord embeddingEmojisWord embedding
Dataset

ASTD 9174 tweets

MASTD 1850 tweets

ArSAS 19,762 tweets

GS 4191 tweets

Syrian 2000 tweets

ArTwitter 2000 tweets

LABR

63,000 reviews

40,000 tweets

MSAC

2000 reviews SemEval-2017 task 4

2000 tweets and comments

LABR

16,448 reviews

TSAC

17,069 comments

ASTD

10,000 tweets

ASTD 1589 tweets

ArTwitter 1951tweets

LABR 16,448 reviews

MPQA 9996 news articles

Multi-domain resource 31,598 reviews

Main-AHS 2026 tweets

ASTD

10,000 tweets

ArTwitter

2000 tweets

Main-AHS

2026 tweets

Sub-AHS

1732 tweets

Ar-Twitter

2000 tweets

ASTD

2479 tweets

Ar-Twitter 2000 tweets

ASTD 10,000 tweets

LABR 63,000

reviews

SemEval

2017 task 4-A

9655 tweets

ASTD

10,000 tweets

ArSAS

17,784 tweets

Tweets

YouTube comments 2091

Religious hate 6600

tweets

Preprocessing

Tokenization

Normalization

Stemming

Deleting non-Arabic words, numbers, URLs, users mentions, hashtags, stop words, names

Detecting intensification, emoji, idioms, negation

Tokenization

Stemming

Removing punctuation marks, stop-words, tab space, blank space

Removing special characters, none Arabic letters, diacritics, elongation

Normalization

Manual correction

Tokenization

Normalization

Stemming

Deleting stop words

Normalization

Stemming

Removing diacritics, repetition, punctuations, stop words, non-Arabic words

Normalization

Removing numbers, punctuation symbols, elongation, diacritics

Replace emoticons with emoji

Tokenize emoji

Tokenization

Normalization

Stemming

Removing stop words, punctuations, Latin characters, digits

Deleting non-Arabic symbols, dialectical marks, punctuation marks, tatweel, duplicate characterTokenization (character, character 5-g, word)

Normalization

Stemming

Lemmatization

Removing stop words, duplicated letters

Filtering non-Arabic words

Spelling correction

Normalization

Removing elongation, unknown characters, diacritics, punctuation

Normalization

Removing tweets with no Emojis

Normalization

Handling elongation

Deleting stop words, diacritics, punctuations, emojis, tatweel, one-letter words, non-Arabic characters

Arabic sentiment analysis using RNNs and gated RNN. LSTM DNN LSTM CNN RCNN LSTM CNN LSTM Bi-LSTM CNN Bi-LSTM Bi-LSTM LSTM CNN CNN LSTM CNN-LSTM LSTM GRU Bi-LSTM Bi-GRU One layer Two layers One layer Three layers Conv. layer LSTM layer Conv. layer Two LSTM layers Conv. layer LSTM layer Character Character N-gram Word ASTD 9174 tweets MASTD 1850 tweets ArSAS 19,762 tweets GS 4191 tweets Syrian 2000 tweets ArTwitter 2000 tweets LABR 63,000 reviews MSAC 2000 reviews SemEval-2017 task 4 2000 tweets and comments LABR 16,448 reviews TSAC 17,069 comments ASTD 10,000 tweets ASTD 1589 tweets ArTwitter 1951tweets LABR 16,448 reviews MPQA 9996 news articles Multi-domain resource 31,598 reviews Main-AHS 2026 tweets ASTD 10,000 tweets ArTwitter 2000 tweets Main-AHS 2026 tweets Sub-AHS 1732 tweets Ar-Twitter 2000 tweets ASTD 2479 tweets Ar-Twitter 2000 tweets ASTD 10,000 tweets LABR 63,000 reviews SemEval 2017 task 4-A 9655 tweets ASTD 10,000 tweets ArSAS 17,784 tweets Tweets YouTube comments 2091 Religious hate 6600 tweets Tokenization Normalization Stemming Deleting non-Arabic words, numbers, URLs, users mentions, hashtags, stop words, names Detecting intensification, emoji, idioms, negation Tokenization Stemming Removing punctuation marks, stop-words, tab space, blank space Removing special characters, none Arabic letters, diacritics, elongation Normalization Manual correction Tokenization Normalization Stemming Deleting stop words Normalization Stemming Removing diacritics, repetition, punctuations, stop words, non-Arabic words Normalization Removing numbers, punctuation symbols, elongation, diacritics Replace emoticons with emoji Tokenize emoji Tokenization Normalization Stemming Removing stop words, punctuations, Latin characters, digits Normalization Stemming Lemmatization Removing stop words, duplicated letters Filtering non-Arabic words Spelling correction Normalization Removing elongation, unknown characters, diacritics, punctuation Normalization Removing tweets with no Emojis Normalization Handling elongation Deleting stop words, diacritics, punctuations, emojis, tatweel, one-letter words, non-Arabic characters LSTMs were used for classifying short tweets and lengthy reviews. It was noted that LSTM outperformed CNN in SA when used in a shallow structure based on word features. Applying the data shuffling augmentation technique enhanced the LSTM model performance[40]. In another context, the impact of morphological features on LSTM and CNN performance was tested by applying different preprocessing steps steps such as stop words removal, normalization, light stemming and root stemming[41]. It was reported that preprocessing steps that eliminate text noise and reduce distortions in the feature space affect the classification performance positively. Whilst, preprocessing actions that cause the loss of relevant morphological information as root stemming affected the performance. Also, in[42], different settings of LSTM hyper-parameters as batch size and output length, was tested using a large dataset of book reviews. Combinations of CNN and LSTM were implemented to predict the sentiment of Arabic text in[43-46]. In a CNN–LSTM model, the CNN feature detector find local patterns and discriminating features and the LSTM processes the generated elements considering word order and context[46,47]. Most CNN-LSTM networks applied for Arabic SA employed one convolutional layer and one LSTM layer and used either word embedding[43,45,46] or character representation[44]. Temporal representation was learnt for Arabic text by applying three stacked LSTM layers in[43]. The model performance was compared with CNN, one layer LSTM, CNN-LSTM and combined LSTM. Also, different optimizers were tested as Adam, Rmsprop, Adagrad and SGD. A worthy notice is that combining two LSTMs outperformed stacking three LSTMs due to the dataset size, as deep architectures require extensive data for feature detection. Morphological diversity of the same Arabic word within different contexts was considered in a SA task by utilizing three types of feature representation[44]. Character, Character N-Gram, and word features were employed for an integrated CNN-LSTM model. The fine-grained character features enabled the model to capture more attributes from short text as tweets. The integrated model achieved an enhanced accuracy on the three datasets used for performance evaluation. Moreover, a hybrid dataset corpus was used to study Arabic SA using a hybrid architecture of one CNN layer, two LSTM layers and an SVM classifier[45]. The CNN-LSTM model was tested using one and two LSTM Layers. Stacked LSTM layers produced feature representations more appropriate for class discrimination. Various word embedding approaches were assessed. The results highlighted that the model realized the highest performance on the largest considered dataset. The online Arabic SA system Mazajak was developed based on a hybrid architecture of CNN and LSTM[46]. The model was evaluated on three benchmarking datasets. The applied word2vec word embedding was trained on a large and diverse dataset to cover several dialectal Arabic styles. Bi-LSTM, the bi-directional version of LSTM, was applied to detect sentiment polarity in[47-49]. A bi-directional LSTM is constructed of a forward LSTM layer and a backward LSTM layer. The fore cells handle the input from start to end, and the back cells process the input from end to start. The two layers work in reverse directions, enabling to keep the context of both the previous and the following words[47,48]. LSTM, Bi-LSTM and deep LSTM and Bi-LSTM with two layers were evaluated and compared for comments SA[47]. It was reported that Bi-LSTM showed more enhanced performance compared to LSTM. The deep LSTM further enhanced the performance over LSTM, Bi-LSTM, and deep Bi-LSTM. The authors indicated that the Bi-LSTM could not benefit from the two way exploration of previous and next contexts due to the unique characteristics of the processed data and the limited corpus size. Also, CNN and Bi-LSTM models were trained and assessed for Arabic tweets SA and achieved a comparable performance[48]. The separately trained models were combined in an ensemble of deep architectures that could realize a higher accuracy. In addition, The ability of Bi-LSTM to encapsulate bi-directional context was investigated in Arabic SA in[49]. CNN and LSTM were compared with the Bi-LSTM using six datasets with light stemming and without stemming. Results emphasized the significant effect of the size and nature of the handled data. The highest performance on large datasets was reached by CNN, whereas the Bi-LSTM achieved the highest performance on small datasets. GRUs were studied in[14,50] for Arabic sentiment identification. LSTM, Bi-LSTM, GRU, and Bi-GRU were used to predict the sentiment category of Arabic microblogs depending on Emojis features[14]. Results reported that Bi-GRU outperformed Bi-LSTM with slightly different performance on a small dataset of short dialectical Arabic tweets. Experiments evaluated diverse methods of combining the bi-directional features and stated that concatenation led to the best performance for LSTM and GRU. Besides, the detection of religious hate speech was analyzed as a classification task applying a GRU model and pre-trained word embedding[50]. The embedding was pre-trained on a Twitter corpus that contained different Arabic dialects. GRU outperformed other machine learning and lexicon-based classifiers. Supporting the GRU model with handcrafted features about time, content, and user boosted the recall measure. A hybrid parallel model that utlized three seprate channels was proposed in[51]. The channels outputs were concatenated and fed to the final dense layer. Each channel is an independant model with a distinct input. Character CNN, word CNN, and sentence Bi-LSTM-CNN channels were trained parallel. A positioning binary embedding scheme (PBES) was proposed to formulate contextualized embeddings that efficiently represent character, word, and sentence features. The model was validated on 34 Arabic sentiment analysis datasets. Binary and tertiary hybrid datasets were also used for the model assessment. The model performance was more evaluated using the IMDB movie review dataset. Experimental results showed that the model outperformed the baselines for all datasets. Another hybridization paradigm is combining word embedding and weighting techniques. Combinations of word embedding and weighting approaches were investigated for sentiment analysis of product reviews[52]. The embedding schemes Word2vec, GloVe, FastText, DOC2vec, and LDA2vec were combined with the TF-IDF, inverse document frequency, and smoothed inverse document frequency weighting approaches. To account for word relevancy, weighting approaches were used to weigh the word embedding vectors to account for word relevancy. Weighted sum, centre-based, and Delta rule aggregation techniques were utilized to combine embedding vectors and the computed weights. RNN, LSTM, GRU, CNN, and CNN-LSTM deep networks were assessed and compared using two Twitter corpora. The experimental results showed that the CNN-LSTM structure reached the highest performance. The LSTM network achieved the second-best performance. Word embedding models such as FastText, word2vec, and GloVe were integrated with several weighting functions for sarcasm recognition[53]. Weighting mechanisms include TF-IDF, term-frequency, odds ratio, balanced distributional concentration, inverse gravity moment, short text weighting, regularized entropy, inverse false negative—true positive—inverse category frequency, relevance frequency, and inverse question frequency—question frequency-inverse category frequency were employed. The deep learning structures RNN, GRU, LSTM, Bi-LSTM, and CNN were used to classify text as sarcastic or not. Three sarcasm identification corpora containing tweets, quote responses, news headlines were used for evaluation. The proposed representation integrated word embedding, weighting functions, and N-gram techniques. The weighted representation of a document was computed as the concatenation of the weighted unigram, bigram and trigram representations. The three layers Bi-LSTM model trained with the trigrams of inverse gravity moment weighted embedding realized the best performance. Combinations of word embedding and handcrafted features were investigated for sarcastic text categorization[54]. Sarcasm was identified using topic supported word embedding (LDA2Vec) and evaluated against multiple word embedding such as GloVe, Word2vec, and FastText. The CNN trained with the LDA2Vec embedding registered the highest performance, followed by the network that was trained with the GloVe embedding. Handcrafted features namely pragmatic, lexical, explicit incongruity, and implicit incongruity were combined with the word embedding. Diverse combinations of handcrafted features and word embedding were tested by the CNN network. The best performance was achieved by merging LDA2Vec embedding and explicit incongruity features. The second-best performance was obtained by combining LDA2Vec embedding and implicit incongruity features.

Applied models

The hybrid notion was considered in SA by combining different features (word and character[22]; word and weighting techniques[52]; character, word, and sentence[51]), deep architectures (CNN and LSTM)[43-46], approaches (lexicon-based and deep learning)[38,39], and domains (video games reviews, cell phones reviews, and food reviews)[37]. Furthermore, different dialects were merged in the training corpus[22]. The proposed work applies multiple ways of hybridization namely hybrid deep architectures (CNN-LSTM, CNN-GRU, LSTM-CNN, GRU-CNN, CNN-Bi-LSTM, CNN-Bi-GRU, Bi-LSTM-CNN, and Bi-GRU-CNN), hybrid language styles (MSA and dialectical), and hybrid data sources (tweets, reviews) which proposes more challenges for Arabic SA. In addition, deep models based on a single architecture (LSTM, GRU, Bi-LSTM, and Bi-GRU) are also investigated. The datasets utilized to validate the applied architectures are a combined hybrid dataset and the Arabic book review corpus (BRAD).

Network design

Multiple deep architectures are implemented for Arabic SA. All architectures employ a character embedding layer to convert encoded text entries to a vector representation. Feature detection is conducted in the first architecture by three LSTM, GRU, Bi-LSTM, or Bi-GRU layers, as shown in Figs. 1 and 2. The discrimination layers are three fully connected layers with two dropout layers following the first and the second dense layers. In the dual architecture, feature detection layers are composed of three convolutional layers and three max-pooling layers arranged alternately, followed by three LSTM, GRU, Bi-LSTM, or Bi-GRU layers. Finally, the hybrid layers are mounted between the embedding and the discrimination layers, as described in Figs. 3 and 4.
Figure 1

LSTM/GRU architecture (created by Microsoft PowerPoint 2010).

Figure 2

Bi-LSTM/Bi-GRU architecture (created by Microsoft PowerPoint 2010).

Figure 3

CNN-LSTM/CNN-GRU architecture (created by Microsoft PowerPoint 2010).

Figure 4

CNN-Bi-LSTM/CNN-Bi-GRU architecture (created by Microsoft PowerPoint 2010).

LSTM/GRU architecture (created by Microsoft PowerPoint 2010). Bi-LSTM/Bi-GRU architecture (created by Microsoft PowerPoint 2010). CNN-LSTM/CNN-GRU architecture (created by Microsoft PowerPoint 2010). CNN-Bi-LSTM/CNN-Bi-GRU architecture (created by Microsoft PowerPoint 2010).

Network settings

Each opinion entry is represented as a sequence of characters. The character vocabulary includes all characters found in the dataset (Arabic characters, , Arabic numbers, English characters, English numbers, emoji, emoticons, and special symbols). A vocabulary set of (746) characters is used for encoding the text corpus. Opinion entries are quantized as sequences of length 1014 characters. The training methodology and settings are conducted following[27,55]. Python, Keras, and Tensorflow are used for the models application. CNN, LSTM, GRU, Bi-LSTM, and Bi-GRU layers are trained on CUDA11 and CUDNN10 for acceleration. The implementation is conducted on NVIDIA GEFORCE GTX 1070 GPU. The settings of the applied architectures are stated in Tables 2 and 3.
Table 2

Network settings for feature detection layers.

ArchitectureParameterSettings
Gated RNNsEmbedding size16
LSTM, GRU, Bi-LSTM, Bi-GRU layers3
LSTM, GRU cells100 in each layer
Bi-LSTM, Bi-GRU cells100 in each direction
Gated RNNs—CNNEmbedding size16
LSTM, GRU, Bi-LSTM, Bi-GRU layers3
LSTM, GRU cells100 in each layer
Bi-LSTM, Bi-GRU cells100 in each direction
CNN layers3
CNN local receptive field (kernel)3
CNN feature maps512
Pooling layers3
Pooling size3
CNN—gated RNNsEmbedding size16
CNN layers3
CNN local receptive field (kernel)3
CNN feature maps512
Pooling layers3
Pooling size3
LSTM, GRU, Bi-LSTM, Bi-GRU layers3
LSTM, GRU cells100 in each layer
Bi-LSTM, Bi-GRU cells100 in each direction
Table 3

Network settings for discrimination layers.

ParameterSettings
Dense layer one cells2048
Dropout 10.5
Dense layer two cells2048
Dropout 20.5
Dense layer three cells1
Network settings for feature detection layers. Network settings for discrimination layers.

Experiments and results

Data preparation and preprocessing

Two datasets are used for training and testing the described architectures. The first dataset is a hybrid dataset built from ten free accessible Arabic sentiment analysis corpora. Opinion entries are composed in colloquial and modern standard Arabic and belong to various domains: tweets, product reviews, restaurant reviews, hotel reviews, book reviews, and movie reviews. Only positive and negative categories are used to build the training set. The combined, balanced and hybrid dataset contains (146,388) samples. Table 4 describes the corpora used to construct the mixed dataset.
Table 4

The Arabic sentiment analysis corpora.

NameTotal entriesDomainPositive entriesNegative entries
[59] TDS2000Tweeter10001000
[60] ASTD10,006Tweeter7991684
[61] SemEval671Tweeter222128
[62] Social media posts3200Tweeter/blogs7191760
[63] OCA500Movie reviews250250
[64] LABR63,257Book reviews42,8328224
[65] LARGE34,492Reviews24,9486650
[66] Health services2026Tweeter6281398
[67] HARD409,562Hotel reviews52,84952,849
[68] ArSAS21,064Tweeter46437840
The Arabic sentiment analysis corpora. The second dataset is BRAD, a publicly available corpus for Arabic sentiment analysis[56]. BRAD was collected from “http://www.goodreads.com” and includes (510,598) book reviews. The balanced dataset contains (156,506) samples. Reviews are composed in modern standard and colloquial Arabic. Books were rated on a scale from 1 to 5 where ratings 4 and 5 belong to the positive category and ratings 1 and 2 belong to the negative category. For both sets, 70% of the samples are reserved for training, 20% are used for development, and 10% are employed for testing.

Results analysis

The measures used to evaluate the efficiency of the applied models are accuracy and F-score. Accuracy is the percentage of correctly predicted samples. F-score is the harmonic mean of precision and recall[34]. Precision is the ratio between the correctly predicted positive entries to all the entries that are predicted as positive. The recall is the ratio between the correctly predicted positive entries to all the entries that are real positive. The performance measures indicate the ability of the deep models to discriminate both polarity categories. Equations (1), (2), (3) and (4) identify how the estimates are calculated[69]:where: TP is the number of true-positive instances, TN is the number of true-negative instances, FP is the number of false-positive instances, FN is the number of false-negative instances. To mitigate bias and preserve the text semantics no extensive preprocessing as stemming, normalization, and lemmatization is applied to the datasets, and the considered vocabulary includes all the characters that appeare in the dataset[57,58]. Also, all terms in the corpus are encoded, including stop words and Arabic words composed in English characters that are commonly removed in the preprocessing stage. The elimination of such observations may influence the understanding of the context. GRU models showed higher performance based on character representation than LSTM models. Although the models share the same structure and depth, GRUs learned and disclosed more discriminating features. On the other hand, the hybrid models reported higher performance than the one architecture model. Employing LSTM, GRU, Bi-LSTM, and Bi-GRU in the initial layers showed more boosted performance than using CNN in the initial layers. In addition, bi-directional LSTM and GRU registered slightly more enhanced performance than the one-directional LSTM and GRU. Comparing the performance of the trained models using the hybrid dataset indicates that the Bi-GRU-CNN model achieved the highest accuracy, 89.67, followed by the GRU-CNN model with 89.65% as stated in Table 5. The Bi-LSTM model registered the least accuracy with 87.85. The highest LSTM accuracy is 89.30% achieved by the Bi-LSTM-CNN model, and the lowest accuracy is 88.12 reported by the CNN-LSTM model. Results show that starting the models with CNN layers is not beneficial for detecting efficient features.
Table 5

Comparison of applied models’ performance on the hybrid dataset.

ArchitecturePrecisionRecallF-scoreAccuracy
LSTM88.8988.8588.8488.85
GRU88.9888.9288.9288.92
LSTM-CNN88.8288.7988.7988.79
GRU-CNN89.6689.6589.6589.65
CNN-LSTM88.1388.1288.1288.12
CNN-GRU88.1188.1088.1088.10
Bi-LSTM88.0387.8587.8387.85
Bi-GRU88.9688.8588.8488.85
Bi-LSTM-CNN89.3189.3089.3089.30
Bi-GRU-CNN89.6989.6789.6689.67
CNN-Bi-LSTM88.5288.5188.5188.51
CNN-Bi-GRU88.2088.1788.1788.17

Significant values are in given in bold.

Comparison of applied models’ performance on the hybrid dataset. Significant values are in given in bold. The Bi-GRU-CNN model showed the highest performance with 83.20 accuracy for the BRAD dataset, as reported in Table 6. In addition, the model achived nearly 2% improved accuracy compared to the Deep CNN ArCAR System[21] and almost 2% enhanced F-score, as clarified in Table 7. The GRU-CNN model registered the second-highest accuracy value, 82.74, with nearly 1.2% boosted accuracy. Also, the LSTM model with 82.14 increased the accuracy by almost 0.7%.
Table 6

Comparison of applied models’ performance on the BRAD data set.

ArchitecturePrecisionRecallF-scoreAccuracy
LSTM82.1682.1482.1482.14
GRU80.6280.6280.6180.61
LSTM-CNN80.2479.9079.8579.91
GRU-CNN82.7482.7482.7482.74
CNN-LSTM80.6980.6980.6880.68
CNN-GRU80.8180.8180.8180.81
Bi-LSTM81.1481.1381.1381.13
Bi-GRU81.1881.1881.1881.18
Bi-LSTM-CNN80.0980.0780.0780.07
Bi-GRU-CNN83.2283.2083.2083.20
CNN-Bi-LSTM80.8980.8980.8980.89
CNN-Bi-GRU80.4480.4480.4480.44

Significant values are in given in bold.

Table 7

Comparison of Bi-GRU-CNN performance and the related literature on the BRAD dataset.

ArchitecturePrecisionRecallF-scoreAccuracy
[21] Deep CNN ArCAR system81.4881.4481.4581.46
Bi-GRU-CNN83.2283.2083.2083.20
GRU-CNN82.7482.7482.7482.74
LSTM82.1682.1482.1482.14
Comparison of applied models’ performance on the BRAD data set. Significant values are in given in bold. Comparison of Bi-GRU-CNN performance and the related literature on the BRAD dataset. Another experiment was conducted to evaluate the ability of the applied models to capture language features from hybrid sources, domains, and dialects. The models trained on the mixed dataset are tested using the BRAD test set. The Bi-GRU-CNN model reported the highest performance on the BRAD test set, as shown in Table 8. The hybrid model can correctly classify nearly 76% of the test set. Results prove that the knowledge learned from the hybrid dataset can be exploited to classify samples from unseen datasets. The exhibited performace is a consequent on the fact that the unseen dataset belongs to a domain already included in the mixed dataset. Using a giant hybrid dataset can increase the model capability.
Table 8

Performance of models trained on the Hybrid dataset and tested using the BRAD test set.

ArchitecturePrecisionRecallF-scoreAccuracy
LSTM75.4271.3570.1371.32
GRU75.0270.9469.6770.90
LSTM-CNN77.0374.0373.2874.01
GRU-CNN77.2974.6874.0574.66
CNN-LSTM74.8871.4970.4671.46
CNN-GRU75.0871.7670.7771.73
Bi-LSTM72.4165.9763.3065.93
Bi-GRU74.1468.9767.1868.93
Bi-LSTM-CNN77.4374.1973.3874.16
Bi-GRU-CNN78.6775.9975.3975.96
CNN-Bi-LSTM75.4272.1671.2272.13
CNN-Bi-GRU74.9671.7070.7271.67

Significant values are in given in bold.

Performance of models trained on the Hybrid dataset and tested using the BRAD test set. Significant values are in given in bold. The accuracy of the LSTM based architectures versus the GRU based architectures is illastrated in Fig. 5. Results show that GRUs are more powerful to disclose features from the rich hybrid dataset. On the other hand, LSTMs are more sensitive to the nature and size of the manipulated data. Stacking multiple layers of CNN after the LSTM, GRU, Bi-GRU, and Bi-LSTM reduced the number of parameters and boosted the performance. The networks parameters are listed in Table 9.
Figure 5

Accuracy of LSTM/GRU based architectures (created by Microsoft PowerPoint 2010).

Table 9

The networks parameers.

ArchitectureTotal params (M)
LSTM212.088
GRU212.036
LSTM-CNN43.905
GRU-CNN43.853
CNN-LSTM13.599
CNN-GRU13.497
Bi-LSTM420.124
Bi-GRU419.980
Bi-LSTM-CNN44.429
Bi-GRU-CNN44.284
CNN-Bi-LSTM21.540
CNN-Bi-GRU21.296
Accuracy of LSTM/GRU based architectures (created by Microsoft PowerPoint 2010). The networks parameers. Precision, Recall, and F-score of the trained networks for the positive and negative categories are reported in Tables 10 and 11. The inspection of the networks performance using the hybrid dataset indicates that the positive recall reached 0.91 with the Bi-GRU and Bi-LSTM architectures. Considering the positive category the recall or sensitivity measures the network ability to discriminate the actual positive entries[69]. The precision or confidence which measures the true positive accuracy registered 0.89 with the GRU-CNN architecture. Similar statistics for the negative category are calculated by predicting the opposite case[70]. The negative recall or specificity evaluates the network identification of the actual negative entries registered 0.89 with the GRU-CNN architecture. The negative precision or the true negative accuracy, which estimates the ratio of the predicted negative samples that are really negative, reported 0.91 with the Bi-GRU architecture.
Table 10

Performance measures of the hybrid dataset.

PrecisionRecallF-scorePrecisionRecallF-score
a. LSTMb. GRU
Negative0.90240.87110.8865Negative0.90500.86980.8870
Positive0.87550.90580.8904Positive0.87470.90860.8913
Average0.88890.88850.8884Average0.88980.88920.8892
c. LSTM-CNNd. GRU-CNN
Negative0.89860.87460.8864Negative0.90140.89040.8959
Positive0.87780.90130.8894Positive0.89180.90260.8972
Average0.88820.88790.8879Average0.89660.89650.8965
e. CNN-LSTMf. CNN-GRU
Negative0.88910.87100.8799Negative0.88630.87420.8802
Positive0.87360.89130.8824Positive0.87590.88790.8818
Average0.88130.88120.8812Average0.88110.88100.8810
g. Bi-LSTMh. Bi-GRU
Negative0.90700.84340.8740Negative0.91080.86130.8854
Positive0.85360.91360.8826Positive0.86850.91570.8914
Average0.88030.87850.8783Average0.88960.88850.8884
i. Bi-LSTM-CNNj. Bi-GRU-CNN
Negative0.89990.88440.8921Negative0.90610.88500.8954
Positive0.88640.90160.8939Positive0.88760.90830.8978
Average0.89310.89300.8930Average0.89690.89670.8966
k. CNN-Bi-LSTMl. CNN-Bi-GRU
Negative0.89290.87520.8839Negative0.89140.86930.8802
Positive0.87760.89500.8862Positive0.87250.89410.8832
Average0.88520.88510.8851Average0.88200.88170.8817
Table 11

Performance measures of the BRAD dataset.

PrecisionRecallF-scorePrecisionRecallF-score
a. LSTMb. GRU
Negative0.81370.83450.8240Negative0.80840.80330.8058
Positive0.82960.80830.8188Positive0.80390.80900.8065
Average0.82160.82140.8214Average0.80620.80620.8061
c. LSTM-CNNd. GRU-CNN
Negative0.77090.85190.8094Negative0.83050.82340.8270
Positive0.83390.74610.7875Positive0.82440.83140.8279
Average0.80240.79900.7985Average0.82740.82740.8274
e. CNN-LSTMf. CNN-GRU
Negative0.80970.80300.8064Negative0.81290.80110.8070
Positive0.80400.81070.8073Positive0.80330.81510.8091
Average0.80690.80690.8068Average0.80810.80810.8081
g. Bi-LSTMh. Bi-GRU
Negative0.80860.81650.8125Negative0.81160.81280.8122
Positive0.81410.80610.8101Positive0.81190.81070.8113
Average0.81140.81130.8113Average0.81180.81180.8118
i. Bi-LSTM-CNNj. Bi-GRU-CNN
Negative0.80950.78730.7983Negative0.84010.82070.8303
Positive0.79240.81420.8031Positive0.82420.84330.8337
Average0.80090.80070.8007Average0.83220.83200.8320
k. CNN-Bi-LSTMl. CNN-Bi-GRU
Negative0.81040.80720.8088Negative0.80660.80170.8041
Positive0.80740.81060.8090Positive0.80230.80710.8047
Average0.80890.80890.8089Average0.80440.80440.8044
Performance measures of the hybrid dataset. Performance measures of the BRAD dataset. On the other side, for the BRAD dataset the positive recall reached 0.84 with the Bi-GRU-CNN architecture. The precision or confidence registered 0.83 with the LSTM-CNN architecture. The negative recall or Specificity acheived 0.85 with the LSTM-CNN architecture. The negative precision or the true negative accuracy reported 0.84 with the Bi-GRU-CNN architecture. The confusion matrices of the networks are stated in Tables 12 and 13. In some cases identifying the negative category is more significant than the postrive category, especially when there is a need to tackle the issues that negatively affected the opinion writer. In such cases the candidate model is the model that efficiently discriminate negative entries.
Table 12

Confusion matrices of the hybrid dataset.

Predicted labelPredicted label
True labelNegativePositiveTrue labelNegativePositive
a. LSTMb. GRU
Negative70851048Negative70741059
Positive7667367Positive7437390
c. LSTM-CNNd. GRU-CNN
Negative71131020Negative7242891
Positive8037330Positive7927341
e. CNN-LSTMf. CNN-GRU
Negative70841049Negative71101023
Positive8847249Positive9127221
g. Bi-LSTMh. Bi-GRU
Negative68596859Negative70051128
Positive7037430Positive6867447
i. Bi-LSTM-CNNj. Bi-GRU-CNN
Negative7193940Negative7198935
Positive8007333Positive7467387
k. CNN-Bi-LSTMl. CNN-Bi-GRU
Negative71181015Negative70701063
Positive8547279Positive8617272
Table 13

Confusion matrices of the BRAD dataset.

Predicted labelPredicted label
True labelNegativePositiveTrue labelNegativePositive
a. LSTMb. GRU
Negative65411297Negative62961542
Positive14986315Positive14926321
c. LSTM-CNNd. GRU-CNN
Negative66771161Negative64541384
Positive19845829Positive13176496
e. CNN-LSTMf. CNN-GRU
Negative62941544Negative62791559
Positive14796334Positive14456368
g. Bi-LSTMh. Bi-GRU
Negative64001438Negative63711467
Positive15156298Positive14796334
i. Bi-LSTM-CNNj. Bi-GRU-CNN
Negative61711667Negative64331405
Positive14526361Positive12246589
k. CNN-Bi-LSTMl. CNN-Bi-GRU
Negative63271511Negative62841554
Positive14806333Positive15076306
Confusion matrices of the hybrid dataset. Confusion matrices of the BRAD dataset.

Conclusion

Deep neural architectures have proved to be efficient feature learners, but they rely on intensive computations and large datasets. In the proposed work, LSTM, GRU, Bi-LSTM, Bi-GRU, and CNN were investigated in Arabic sentiment polarity detection. Character features are used to encode the morphology and semantics of text. The applied models showed a high ability to detect features from the user-generated text. The model layers detected discriminating features from the character representation. GRU models reported more promoted performance than LSTM models with the same structure. Moreover, deep hybrid networks realized the highest performance measures. Combining LSTM, GRU, Bi-LSTM, and Bi-GRU with CNN boosted the performance. Bi-GRU-CNN hybrid models registered the highest accuracy for the hybrid and BRAD datasets. On the other hand, the Bi-LSTM and LSTM-CNN models wrote the lowest performance for the hybrid and BRAD datasets. The proposed Bi-GRU-CNN model reported 89.67% accuracy for the mixed dataset and nearly 2% enhanced accuracy for the BRAD corpus. In addition, the Bi-GRU-CNN trained on the hyprid dataset identified 76% of the BRAD test set. Therefore, hybrid models that combine different deep architectures can be implemented and assessed in different NLP tasks for future work. Also, the performance of hybrid models that use multiple feature representations (word and character) may be studied and evaluated.
  1 in total

1.  Arabic Sentiment Classification Using Convolutional Neural Network and Differential Evolution Algorithm.

Authors:  Abdelghani Dahou; Mohamed Abd Elaziz; Junwei Zhou; Shengwu Xiong
Journal:  Comput Intell Neurosci       Date:  2019-02-26
  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.