Ji Mingyu1, Zhou Jiawei1, Wei Ning1. 1. Department of Software Engineering, Faculty of Information and Computer Engineering, Northeast Forestry University, Harbin, China.
Abstract
Multimodal sentiment analysis is an essential task in natural language processing in which machines analyze and recognize emotions through logical reasoning and mathematical operations after learning multimodal emotional features. To address the problems of how to effectively fuse multimodal data and how to model the relevance between modalities, we propose an attention-based feature relevance fusion multimodal sentiment analysis model (AFR-BERT). In the data preprocessing stage, text features are extracted using the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers), and BiLSTM (Bidirectional Long Short-Term Memory) is used to obtain the internal information of the audio. In the data fusion stage, the multimodal data fusion network effectively fuses multimodal features through the interaction of text and audio information. In the data analysis stage, the multimodal data association network analyzes the data by exploring the correlation of the fused information between text and audio. In the data output stage, the model outputs the results of multimodal sentiment analysis. We conducted extensive comparative experiments on the publicly available sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that AFR-BERT improves on classical multimodal sentiment analysis models on the relevant performance metrics. In addition, ablation experiments and example analysis show that the multimodal data analysis network in AFR-BERT can effectively capture and analyze the sentiment features in text and audio.
1. Introduction
With the development of internet technology, people often express their feelings about their daily lives and their opinions on hot topics, read product reviews before making purchases, and write down their impressions after using products and enjoying services. This generates a huge amount of multimodal data. Mining the sentiment in these multimodal data has become a popular research topic in natural language processing, data mining, and user requirement analysis.
Traditional sentiment analysis is limited to a single textual modality that expresses sentiment primarily through words, phrases, and their relationships [1]. Kim et al. [2] used convolutional neural networks to model sentences with convolution and pooling. Wang et al. [3] proposed a disconnected recurrent neural network that restricts the flow of textual information to a fixed number of steps. Lin et al. [4] applied the self-attention mechanism to sentence modeling. Numerous studies indicate that a person's emotional change cannot be determined effectively from a single entity or event [5]; for sentiment analysis, a single modality cannot accurately determine a person's emotion. Psychologist Mehrabian [6] found that in daily conversation words convey only 7% of emotion, voice and its characteristics convey 38%, and facial expressions and body language convey 55%, which indicates that facial expressions and voice carry the primary emotional information. However, since sentiment analysis of human micro-expressions and micro-motions is still imperfect, our research focuses mainly on text and audio. As research progressed, researchers discovered correlations and complementarities between the semantic information contained in text and the acoustic information contained in speech.
Likewise, researchers found that the emotional interaction between the text and audio modalities can provide more comprehensive information for sentiment analysis [7]. How to effectively fuse text and audio information thus becomes one of the issues that must be considered in multimodal sentiment analysis.
To solve the above problems, we propose an attention-based feature relevance fusion multimodal sentiment analysis model (AFR-BERT), building on the studies in the literature [8, 9]. We add audio data to traditional text sentiment analysis, making the data itself more malleable and recognizable. First, text and audio features are preprocessed using Bidirectional Encoder Representations from Transformers [10] (BERT [11]) and a bidirectional long short-term memory network (BiLSTM) [12] to obtain the unimodal features of text and audio. After the unimodal features are obtained, the multimodal features are fused using a multimodal data fusion network based on the attention mechanism. Then, the self-attention mechanism is adopted to reduce the dependence on external information and capture the internal correlations of the features. Finally, the processed multimodal information is classified by sentiment. To demonstrate the effectiveness of our method, we conducted research and experiments on the public sentiment benchmark datasets CMU-MOSI [13] and CMU-MOSEI [14]. The experimental results show that the proposed multimodal sentiment analysis model not only effectively fuses multimodal features and improves the accuracy of sentiment analysis, but also has the advantage of focusing on the relevance of the modal information.
2. Related work
Multimodal sentiment analysis has become an important research topic in natural language processing, mainly involving the computational study of information such as opinions and emotional states in data composed of text, images, audio, or even video [15]. Current research focuses on feature learning and multimodal fusion. For feature learning, Zadeh et al. [16] designed a multi-attention memory fusion network for sentiment evaluation through view interaction. Hazarika et al. [17] proposed a multimodal sentiment analysis model that fuses text and audio feature learning and provides a self-attention mechanism for multimodal sentiment weighting. In contrast to these attention-based feature fusion methods, Liu et al. [18] proposed a low-rank multimodal fusion method that uses a low-rank tensor to enhance the efficiency of the model and improve sentiment analysis performance while reducing the number of parameters. Hazarika et al. [19] proposed a modality-invariant and -specific representations method, which combines losses including distributional similarity, orthogonal loss, reconstruction loss, and task prediction loss to learn modality-invariant and modality-specific representations. Concerning multimodal fusion, researchers have proposed many solutions for proactively fusing multimodal features. So far, the main fusion strategies are feature fusion, decision fusion, and model fusion [20]. Feature fusion cascades different unimodal features into multimodal features. Decision fusion mainly applies different combination strategies to different unimodal data. Model fusion combines feature fusion and decision fusion, which increases the complexity and training difficulty of the model while combining the advantages of both [21]. Priyasad et al. [22] blended text and audio information for sentiment analysis, designing a deep convolutional neural network (DCNN [23]) and recurrent neural network (RNN [24]) cascaded network to extract text and audio features, and finally fused the features of the different modalities by a cross-attention layer. D. Krishna et al. [8] proposed a cross-modal attention mechanism and a one-dimensional convolutional neural network to implement multimodal alignment and sentiment analysis, with a 1.9% improvement in accuracy over previous methods. Poria et al. [25] proposed a deep learning model based on contextual BiLSTM [12], which uses the contextual information of audio data to obtain more sentiment features, independently examines and classifies the features of the text and audio modalities in decision fusion, and later fuses the results into a decision vector with a combination strategy. Although the above models investigated feature extraction and multimodal fusion methods, they all ignored the correlation and complementarity between semantic and speech information, which directly impact feature fusion and sentiment analysis results.
Pre-trained language models
Influenced by transfer learning, pre-trained language models have made breakthroughs in natural language processing (NLP), learning generic language representations from massive corpora and greatly improving downstream tasks without the need for manual annotation. Peters et al. [26] proposed ELMo (Embeddings from Language Models), a novel deep bidirectional language model. ELMo is a deep contextual word representation method: it is pre-trained on a large text corpus to obtain generic semantic representations, which are then transferred as features to downstream tasks. Extensive experiments have shown that ELMo significantly improves the performance of NLP tasks. Radford et al. [27] proposed GPT (Generative Pre-Training), a generative pre-trained transformer language model. GPT pre-trains word vectors on a large-scale unsupervised text corpus and fine-tunes them on small-scale supervised text data. Experiments show that GPT achieves impressive results in tasks such as text translation, semantic matching, question answering, and inference. Devlin et al. [11] proposed BERT, based on the bidirectional encoder representations of the transformer [10]. BERT uses a new masked language model for pre-training to generate deep bidirectional language representations, and significantly outperforms other pre-trained language models, achieving state-of-the-art results on 11 NLP tasks.
Multitask learning [28] is a widely used learning method and likewise a derived transfer learning method. Its purpose is to optimize multiple learning tasks simultaneously and to improve the model's generalization and prediction performance on each task using the information shared between tasks. Multitask learning generally comes in two types. In the first, there is a primary task and a secondary task: the primary task is the main function of the model, while the secondary task helps train the primary task. In the second, multiple equal tasks build on each other. The first type is widely used in deep learning, where the selection of appropriate auxiliary tasks is crucial for the success of a multitask learning framework [29]: the domain information carried by the training signal of tasks related to the main task serves as an inductive bias that improves the generalization of the main task. In recent years, multitask learning has come to be used in deep learning, allowing deeper optimization of models, better access to data representations, and more comprehensive mining of data information. Yu et al. [30] focused on refining visual features by learning multiple related tasks simultaneously for given target information and proposed a target-oriented multimodal BERT (TomBERT). The motivation is the observation that correlated images in a sample highlight the focused target and reflect the users' emotions towards it. Specifically, TomBERT first learns the image features associated with the target and then uses the transformer model to fuse them with the text features. Xu et al. [31] proposed the multiple interaction memory network, which includes two interaction memory networks for monitoring the textual and visual information of a given target, to capture the global information of the data more comprehensively. These studies show that multitask learning can help a model achieve better performance.
3. Materials and methods
3.1 Dataset
CMU-MOSI (Multimodal Opinion-level Sentiment Intensity) [13] is one of the most popular benchmark datasets, containing 93 videos with a total of 2199 conversations. Each conversation has a sentiment label in the range [-3,+3]; we define labels > 0 as positive sentiment and labels <= 0 as negative sentiment. The training, validation, and test sets in CMU-MOSI contain 52, 10, and 31 videos, with 1284 (679 positive, 605 negative), 229 (124 positive, 105 negative), and 686 (277 positive, 409 negative) conversations, respectively. The division of our experimental dataset strictly follows the CMU-MOSI dataset format. Information about the CMU-MOSI dataset is shown in Table 1.
Table 1
CMU-MOSI dataset information.

Data       Train sets   Valid sets   Test sets   Total
Video      52           10           31          93
Dialogue   1284         229          686         2199
Positive   679          124          277         1080
Negative   605          105          409         1119
Polarity   [-3,+3]
The CMU-MOSEI (Multimodal Opinion Sentiment and Emotion Intensity) [14] dataset comes from over 1,000 online YouTube speakers and contains 3229 videos with a total of 22676 conversations. Each conversation has an emotion label. At a coarse granularity, the labels fall in the range [-3,+3], and we define labels > 0 as positive emotions and labels <= 0 as negative emotions; at a fine granularity, they are divided into six emotion labels: anger, disgust, fear, happiness, sadness, and surprise. The CMU-MOSEI training, validation, and test sets contain 2550, 300, and 679 videos, with 16216 (11499 positive, 4717 negative), 1835 (1333 positive, 502 negative), and 4625 (3281 positive, 1344 negative) conversations, respectively. The relevant information of the CMU-MOSEI dataset is shown in Table 2.
Table 2
CMU-MOSEI dataset information.

Data       Train sets   Valid sets   Test sets   Total
Video      2550         300          679         3229
Dialogue   16216        1835         4625        22676
Positive   11499        1333         3281        16113
Negative   4717         502          1344        6563
Polarity   [-3,+3]
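As a concrete illustration of the labeling convention above, the sketch below maps a continuous sentiment score in [-3,+3] to the 2-class and 7-class labels. The rounding convention for the 7-class task is our assumption (a common practice), not stated explicitly in the text.

```python
def to_binary(score):
    """2-class label: positive if score > 0, negative if score <= 0."""
    return "positive" if score > 0 else "negative"

def to_seven_class(score):
    """7-class label: round the score in [-3, +3] to the nearest integer
    and clip it to the range (an assumed, commonly used convention)."""
    return int(max(-3, min(3, round(score))))
```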
3.2 Evaluation metrics
The performance evaluation metrics of the experiments include 2-class accuracy (ACC2), 7-class accuracy (ACC7), weighted average F1-score (F1), and mean absolute error (MAE).
Accuracy is calculated as follows, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives:
ACC = (TP + TN) / (TP + TN + FP + FN)
F1 is a weighted harmonic mean of precision P and recall R:
F1 = 2PR / (P + R)
MAE is the absolute error between the predicted value and the true value, where y_i denotes the true value and ŷ_i denotes the predicted value:
MAE = (1/n) Σ_{i=1..n} |y_i − ŷ_i|
To demonstrate the validity of the model, the Pearson correlation coefficient (Corr) is further used to measure the degree of correlation between the predicted and true labels; the closer Corr is to 1, the better the performance of the model. With y_i, ŷ_i, ȳ, and ŷ̄ denoting the true and predicted values and their corresponding mean values:
Corr = Σ_i (y_i − ȳ)(ŷ_i − ŷ̄) / sqrt( Σ_i (y_i − ȳ)² · Σ_i (ŷ_i − ŷ̄)² )
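The metrics above can be sketched in a few lines of NumPy. The binary accuracy and F1 here use the > 0 / <= 0 sign convention from the dataset description; the paper reports a weighted-average F1 over both classes, while this sketch shows the plain binary F1 for brevity.

```python
import numpy as np

def acc2(y_true, y_pred):
    """Binary accuracy on the sign of the sentiment score (> 0 is positive)."""
    return np.mean((y_true > 0) == (y_pred > 0))

def f1_binary(y_true, y_pred):
    """F1 = 2PR/(P+R), treating score > 0 as the positive class."""
    t, p = y_true > 0, y_pred > 0
    tp = np.sum(t & p)
    precision = tp / max(np.sum(p), 1)
    recall = tp / max(np.sum(t), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return np.mean(np.abs(y_true - y_pred))

def corr(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted scores."""
    return np.corrcoef(y_true, y_pred)[0, 1]
```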
3.3 Baseline
In the field of multimodal sentiment analysis, the classical models LMF, MFN, etc., and the recently proposed multimodal sentiment analysis models CM-BERT, Self-MM, etc. have achieved notable results. On CMU-MOSI and CMU-MOSEI, AFR-BERT is compared against the following baseline models:
TFN. The tensor fusion network uses a tensor-based multimodal fusion approach to model intermodal dynamics, aggregating unimodal, bimodal, and trimodal interactions [32].
LMF. The low-rank multimodal fusion network utilizes a low-rank weight tensor to improve multimodal fusion efficiency without compromising performance [4].
MFN. The memory fusion network adopts a neural network model with multi-view sequential learning for multimodal sentiment analysis by integrating view-specific and cross-view information [14].
RAVEN. The recurrent attended variation embedding network considers the fine-grained structure of nonverbal word sequences and dynamically adjusts word representations based on nonverbal cues [33].
ICCN. The interaction canonical correlation network exploits the outer product of feature pairs and deep canonical correlation analysis to learn useful multimodal embedding features [34].
MFM. The multimodal factorization model jointly optimizes cross-modal data and labels to generate discriminative targets, and then ensures that the learned representations are rich in intra- and inter-modal features by using these targets to predict label sentiment [35].
MulT. The multimodal transformer is an end-to-end model that extends the standard transformer network to learn representations directly from unaligned multimodal streams [36].
CM-BERT. The cross-modal BERT model introduces information from the audio modality to help the text modality fine-tune the pre-trained BERT model, and then uses a novel multimodal attention fusion method that dynamically adjusts word weights through the interaction of the text and audio modalities [9].
MISA. The modality-invariant and -specific representations model learns a factorized subspace of each modality to provide a better representation as input to fusion [19].
Self-MM. The self-supervised multitask multimodal model uses a self-supervised multitask learning strategy, with a unimodal label generation module and a strategy that adjusts the weight of each subtask based on the designed multimodal labels and modal representations, to improve recognition accuracy [37].
MAG-BERT. The multimodal adaptation gate for BERT attaches a gate structure to the BERT model, modifying it with attention and adaptive vectors conditioned on nonverbal behavior to continuously improve the model's multimodal recognition accuracy [38].
3.4 Method
Fig 1 presents the structure of the attention-based mechanism feature relevance fusion multimodal sentiment analysis model (AFR-BERT).
Fig 1
Structure of AFR-BERT multimodal sentiment analysis model.
AFR-BERT is divided into four network modules, which correspond to data input, data fusion, data analysis, and data output.
The AFR-BERT model consists of the following four main components: the Data Preprocessing Layer, the Multimodal Fusion Layer, the Multimodal Association Layer, and the Output Layer.
The Data Preprocessing Layer preprocesses the text and audio data.
Text data. The text data can be viewed as consisting of phrases and relations with contextual dependencies. In the experiments, the output of the last BERT encoder layer is taken as the text features. The text sequence of word-piece tokens is T = [T_1, T_2, …, T_n], where n is the sequence length. The BERT model prepends a [CLS] classification token to the input sequence, and the output sequence after embedding and encoding is computed as:
R = BERT([CLS], T_1, T_2, …, T_n)
Audio data. The speech signal features are extracted by COVAREP [39]. The time step of each word in the text data is obtained by P2FA [40], and the audio features are averaged over the corresponding steps. Since multimodal feature fusion requires matrix operations on the data, the audio features must have the same length as the text features, so the missing parts of the speech features are padded with zeros. With 0 denoting the zero vector, the audio features are represented as:
A = [A_1, A_2, …, A_m, 0, …, 0]
The audio feature representation with contextual information is obtained by BiLSTM [12]. LSTM [41] is a mechanism that uses memory cells and gates; it not only remembers long-term historical information but also alleviates the problems of vanishing and exploding gradients. The LSTM core structure consists of a forget gate, an input gate, an output gate, and a memory cell. The LSTM is computed as follows:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
x_t is the input audio feature at time t. C_t is the cell state and C̃_t is the temporary (candidate) cell state. h_t is the hidden state at time t, and h_{t−1} is the hidden state at the previous time step. σ and tanh are activation functions. W is a weight matrix and b a bias vector. f_t represents the forget gate, i_t the memory (input) gate, and o_t the output gate. Since a one-way LSTM cannot use inter-discourse contextual information, Huang et al. [12] proposed BiLSTM (Bidirectional Long Short-Term Memory) to obtain long-term historical information at each moment of discourse through forward and backward LSTMs. The specific structure is shown in Fig 2, with the following expressions:
h→_t = LSTM(x_t, h→_{t−1})
h←_t = LSTM(x_t, h←_{t+1})
h_t = h→_t ⊕ h←_t
h→_t denotes the forward LSTM output, h←_t denotes the backward LSTM output, ⊕ is the splicing (concatenation) operation, and h_t is the BiLSTM output.
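The gate equations above can be sketched as a single NumPy LSTM step plus a forward/backward pass whose hidden states are spliced at each time step. The stacked-gate parameterization is an implementation convenience, not the paper's exact layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the four gate blocks [f, i, o, g] and is
    applied to the concatenation [h_{t-1}, x_t], matching the equations above."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:d])            # forget gate f_t
    i = sigmoid(z[d:2 * d])       # memory (input) gate i_t
    o = sigmoid(z[2 * d:3 * d])   # output gate o_t
    g = np.tanh(z[3 * d:])        # candidate cell state
    c = f * c_prev + i * g        # new cell state C_t
    h = o * np.tanh(c)            # new hidden state h_t
    return h, c

def bilstm(xs, Wf, bf, Wb, bb, d):
    """Run a forward and a backward LSTM over the sequence and splice
    (concatenate) their hidden states at each time step."""
    h, c = np.zeros(d), np.zeros(d)
    fwd = []
    for x in xs:
        h, c = lstm_step(x, h, c, Wf, bf)
        fwd.append(h)
    h, c = np.zeros(d), np.zeros(d)
    bwd = []
    for x in reversed(xs):
        h, c = lstm_step(x, h, c, Wb, bb)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f_t, b_t]) for f_t, b_t in zip(fwd, bwd)]
```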
Fig 2
BiLSTM model structure.
(Forward) means forward propagation of the model. (Backward) means model backward propagation.
The Multimodal Fusion Layer fuses the multimodal data features. Fig 3 shows the structure of the cross-modal fusion attention mechanism proposed in this paper.
Fig 3
Cross-modal fusion attention mechanism structure.
(T) represents text feature data. (A) represents audio feature data. (ReLU, Row Softmax, Softmax, Concat) are all function operations. (Mask) is the mask matrix.
In this paper, the multimodal sequence data take two main forms: text (T) and audio (A). Because the modal features are extracted by different methods, the text and audio sequence features X_m, m ∈ {T, A}, differ in dimensionality. Following the literature [36], we use a 1D temporal convolutional layer as a sequence alignment tool to ensure that both have the same dimension:
X̂_m = Conv1D(X_m, k_m), m ∈ {T, A}
k_{T,A} denotes the size of the convolution kernel for the text and audio modalities, and T̂ and Â denote the text and audio feature data after the convolution.
The cross-modal fusion attention mechanism is one of the cores of AFR-BERT. Cross-modal attention uses the information interaction between the text and audio modalities to adjust the weights of the model and fine-tune the pre-trained language model BERT, as shown in Fig 3. T̂ and Â are the text features and audio features obtained from the data preprocessing layer. The text interaction matrix N1 and the audio interaction matrix N2 are obtained by applying ReLU to the interaction products of the text and audio features.
In the data preprocessing layer, the model pads the speech features with zero vectors to the same length as the text features. To reduce the influence of the padded sequences, the model applies a mask matrix N in the attention computation, where 0 marks the positions of real tokens: the feature data at padded positions are set to −∞ (negative infinity) by the mask matrix, so their attention scores become 0 after the Softmax function. The probability distribution over each feature sequence is then computed with soft attention (a row-wise Softmax) to obtain the bimodal attention representation matrices M1 and M2.
After obtaining the bimodal attention matrices, the feature representations of the two modalities are concatenated to help capture the important emotional factors across modalities, yielding the multimodal fusion matrix X.
The Multimodal Association Layer models the correlation of the multimodal data. The emotional information of the text modality is often closely related to the emotional changes of the audio modality, and the emotional characteristics of the audio are usually related to acoustic factors such as pitch, vocal intensity, loudness, and duration.
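A minimal NumPy sketch of the masked fusion step described above: padded audio positions are pushed to −∞ before the row-wise softmax, so they receive zero attention weight. The ReLU-of-product form of the interaction matrix is our assumption, based on the operations labeled in Fig 3.

```python
import numpy as np

def row_softmax(scores):
    """Softmax over each row; -inf entries receive exactly zero weight."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_cross_attention(T, A, pad):
    """T: (n, d) text features; A: (n, d) zero-padded audio features;
    pad: (n,) bool, True where the audio position is padding."""
    interact = np.maximum(T @ A.T, 0.0)   # ReLU interaction matrix (assumed form)
    interact[:, pad] = -np.inf            # mask out padded audio columns
    weights = row_softmax(interact)       # padded positions get weight 0
    return weights @ A                    # bimodal attention representation
```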
How to filter a small amount of important information out of a large amount of information, reduce the reliance on external information, and capture internally relevant important information is one of the core questions of the AFR-BERT research. In this paper, we adopt the self-attention mechanism of the transformer [10], also known as scaled dot-product attention. Its specific structure is shown in Fig 4, and it is calculated as follows:
Fig 4
Scaled dot product attention structure.
(Q) means the query matrix. (K) means the key matrix. (V) means the value matrix. (Mask) represents the matrix operation for processing variable-length sequences. (√d_k) is the scaling factor.
We define the multimodal data with computed internal correlations as X_s:
X_s = Attention(Q, K, V) = Softmax(QKᵀ / √d_k) V
The attention output X_s and the text output sequence R of the last BERT encoder layer are processed with a residual connection and normalization (Add&Norm). This lets the network stack effectively, avoids the degradation caused by vanishing gradients as depth grows, and improves the accuracy and convergence speed of the model. The above calculation yields the classifiable aggregated feature data X_a:
X_a = LayerNorm(X_s + R)
The Output Layer outputs the sentiment classification results.
The multimodal fused feature data X_a is computed by the fully connected layer and the Softmax function to derive the sentiment classification results:
y = Softmax(W_s (W_f X_a + b_f) + b_s)
W_f and b_f are the weight and bias of the fully connected layer. W_s and b_s are the weight and bias of the softmax layer. X_a is the aggregated feature, and y is the sentiment classification result.
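The association and output layers can be sketched together in NumPy: scaled dot-product self-attention, a residual Add&Norm against the BERT text sequence, and a fully connected plus softmax head. The feature sizes and the mean pooling before classification are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """X_s = Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def add_norm(x, residual, eps=1e-6):
    """Residual connection followed by layer normalization over features."""
    y = x + residual
    mu = y.mean(axis=-1, keepdims=True)
    sd = y.std(axis=-1, keepdims=True)
    return (y - mu) / (sd + eps)

def output_layer(x, W_f, b_f, W_s, b_s):
    """y = Softmax(W_s (W_f x + b_f) + b_s) on pooled aggregated features."""
    return softmax(W_s @ (W_f @ x + b_f) + b_s)
```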
3.5 Experimental settings
Parameters in deep learning can usually be divided into trainable parameters and hyperparameters. Trainable parameters are learned by the backpropagation algorithm during model training, while hyperparameters are set manually, based on existing experience, before training begins. The hyperparameters determine to some extent the final performance of the model. We use a basic grid search to tune the hyperparameters and select the best settings based on the performance of AFR-BERT on the validation set. For AFR-BERT, the hyperparameters and tuning ranges are: learning rate (0.00001-0.001), batch size (16-128), max sequence length (32-96), number of epochs (10-100), and hidden dimensions of BiLSTM (64-512). Mean squared error (MSE) is used as the loss function, and Adam as the optimizer. Whenever training of AFR-BERT with a specific hyperparameter setting finishes, the features learned by AFR-BERT are used as input to the same downstream task models. The test results under the optimal parameter settings are reported in Table 3.
Table 3
The optimal parameter settings report.

Parameter                     Value
learning rate                 0.0002
batch size                    32
max sequence length           50
epochs                        30
hidden dimensions of BiLSTM   200
loss function                 MSE
optimizer                     Adam
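Grid search over the hyperparameter ranges above can be sketched with itertools.product. The evaluate function here is a stand-in for training AFR-BERT and scoring it on the validation set; its peak is placed at the settings reported in Table 3 purely for illustration.

```python
from itertools import product

# Hyperparameter grid drawn from the tuning ranges reported above.
grid = {
    "learning_rate": [1e-5, 2e-4, 1e-3],
    "batch_size": [16, 32, 64, 128],
    "hidden_dim": [64, 200, 512],
}

def evaluate(params):
    """Stand-in for training the model and returning a validation score.
    A real run would train AFR-BERT with `params` and score the valid set."""
    return -abs(params["learning_rate"] - 2e-4) \
           - abs(params["batch_size"] - 32) / 100 \
           - abs(params["hidden_dim"] - 200) / 1000

def grid_search(grid, score_fn):
    """Exhaustively score every combination and keep the best one."""
    keys = list(grid)
    best, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        s = score_fn(params)
        if s > best_score:
            best, best_score = params, s
    return best

best = grid_search(grid, evaluate)
```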
4. Results and discussion
We designed three sets of experiments to verify the validity of the model from different directions and discussed the experimental results through qualitative analysis.
4.1 Comparison experiments
A comparison experiment sets up two or more experiments and compares their results to explore the relationship between various factors and the experimental subjects. We compare the baseline models and the AFR-BERT model on CMU-MOSI and CMU-MOSEI.
Table 4 shows the experimental results of the baseline models and AFR-BERT on CMU-MOSI for the evaluation metrics (ACC2, ACC7, F1, MAE). Higher values of ACC2, ACC7, and F1 indicate higher model performance, while lower values of MAE indicate higher model performance.
Table 4
Comparative experiments of multimodal sentiment analysis models on the dataset CMU-MOSI.

Model        MAE     ACC2    F1      ACC7
TFN(B)       0.901   80.82   80.77   34.94
LMF(B)       0.917   82.47   82.45   33.239
MFN          0.965   77.40   77.30   -
RAVEN        0.915   78.00   76.60   -
ICCN         0.860   83.00   83.00   39.00
MFM(B)       0.877   81.72   81.64   35.42
MulT(B)      0.861   83.00   82.80   40.00
CM-BERT      0.729   84.50   84.50   44.90
MISA(B)      0.783   83.40   83.60   -
Self-MM(B)   0.713   85.98   85.95   -
MAG-BERT     0.740   86.10   86.00   -
AFR-BERT     0.702   86.74   86.23   43.61

(B) means the language features are based on BERT. (-) means null value. The bolded values represent the best values of the performance indicators.
From the experimental results in Table 4, it can be concluded that the AFR-BERT model produced new results on CMU-MOSI and improved the related performance evaluation metrics. On the binary sentiment classification task, AFR-BERT achieved 86.74% on ACC2, an improvement of 0.64%-9.34% over the baseline models. Similarly, our model achieved an improvement of 0.23%-9.63% on F1. On the sentiment score classification task, the AFR-BERT model achieved 43.61% on ACC7, second only to CM-BERT. On the regression task, AFR-BERT lowered the MAE value by about 0.011-0.263. Moreover, most of the above baselines used text, audio, and video, whereas our model achieved these results using only text and audio information.
To further capture the degree of correlation between the true sentiment values and the model predictions on the CMU-MOSI dataset, we used the Pearson correlation coefficient (Corr) to measure the degree of linear correlation between the two. Fig 5 shows the histogram of the comparative experimental results on the Corr metric on CMU-MOSI. Comparing the histograms, AFR-BERT and Self-MM both achieved the optimal value of 0.798; the correlation between the model's predicted values and the true labels approaches a very strong degree of correlation. On the regression task, AFR-BERT improved Corr by 0.007-0.166 over the other models except Self-MM.
Fig 5
Cross-sectional histograms of Corr metrics for each model on the CMU-MOSI.
To demonstrate the generalizability of the AFR-BERT model, we conducted the same comparative experiments on CMU-MOSEI. Table 5 shows the experimental results of the baseline models and AFR-BERT on CMU-MOSEI for the evaluation metrics (ACC2, F1, MAE). Since most of the baseline models did not report ACC7 results on CMU-MOSEI, we do not compare that metric.
Table 5
Comparative experiments of multimodal sentiment analysis models on the dataset CMU-MOSEI.
Model        MAE    ACC2   F1
TFN(B)       0.901  80.82  80.77
LMF(B)       0.623  82.00  82.10
MFN          -      76.00  76.00
RAVEN        0.614  79.10  79.50
ICCN(B)      0.565  84.18  84.15
MFM(B)       0.568  84.40  84.30
MulT(B)      0.580  82.50  82.30
MISA(B)      0.555  85.50  85.30
Self-MM(B)   0.536  85.17  85.30
MAG-BERT     -      84.70  84.50
AFR-BERT     0.530  86.23  86.15
(B) means the language features are based on BERT. (-) means null value. The bolded values represent the best values of the performance indicators.
From the experimental results in Table 5, it can be concluded that the AFR-BERT model produced better results on the CMU-MOSEI dataset and improved all performance evaluation metrics. On the binary sentiment classification task, AFR-BERT achieved 86.23% on ACC2, an improvement of 0.80%-10.23% over the baseline models. Similarly, our model achieved 86.15% on F1, an improvement of 0.85%-10.15% over the baselines. On the regression task, AFR-BERT lowered the MAE by about 0.006-0.093 compared to the baseline models.

To further capture the degree of correlation between the true sentiment values and the model predictions on CMU-MOSEI, we again used the Pearson correlation coefficient (Corr) to measure the degree of linear correlation between the two. Fig 6 shows a histogram of the comparative Corr results on the CMU-MOSEI dataset. AFR-BERT and Self-MM achieved the optimal value of 0.772, close to 0.8, so the correlation between the model's predicted values and the true labels approaches a very strong correlation. On the regression task, AFR-BERT improved Corr by 0.007-0.110 over the baseline models. Since the original papers of some models do not report Corr, those entries are left empty.
Fig 6
Cross-sectional histograms of Corr metrics for each model on the CMU-MOSEI dataset.
These two sets of comparison experiments show that the proposed AFR-BERT model generally outperformed the classical multimodal sentiment analysis models on each performance metric, which verifies the effectiveness and correctness of our method.
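The metrics used throughout these comparisons (ACC2, F1, MAE) can be derived from the continuous sentiment predictions roughly as follows. This is a sketch assuming the common MOSI/MOSEI convention of treating non-negative scores as positive; the exact binarization varies between papers:

```python
def evaluate(preds, labels):
    """Compute ACC2, F1, and MAE from continuous sentiment scores in [-3, 3]."""
    # Binarize: non-negative -> positive (1), negative -> 0.
    bp = [1 if p >= 0 else 0 for p in preds]
    bl = [1 if l >= 0 else 0 for l in labels]
    acc2 = sum(int(a == b) for a, b in zip(bp, bl)) / len(bl)
    # F1 over the positive class.
    tp = sum(1 for a, b in zip(bp, bl) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(bp, bl) if a == 1 and b == 0)
    fn = sum(1 for a, b in zip(bp, bl) if a == 0 and b == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # MAE on the raw regression scores.
    mae = sum(abs(p - l) for p, l in zip(preds, labels)) / len(labels)
    return acc2, f1, mae
```

Note that ACC2 and F1 depend only on the sign of the prediction, while MAE and Corr are sensitive to the magnitude, which is why all four are reported together.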
4.2 Ablation experiments
Ablation experiments help us understand a relatively complex neural network by removing parts of it and studying the resulting performance. To investigate the effect of individual modules on the overall model, we conducted four sets of experiments on CMU-MOSI: the full model and the following three ablated variants.

AFR-BERT(-BiLSTM). We removed the BiLSTM module and fed the extracted audio unimodal features, together with the text features output by the encoder, directly to the data fusion layer.

AFR-BERT(-MFL). We removed the multimodal fusion layer (MFL) and passed the concatenated text and audio features directly to the data association layer.

AFR-BERT(-MAL). We removed the multimodal association layer (MAL) and fed the multimodal data directly to the output layer for sentiment classification.

Table 6 shows the results of the ablation experiments on CMU-MOSI. We chose ACC2, F1, and MAE as the model performance evaluation metrics.
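The three ablated variants can be pictured as switches over a simplified pipeline. In this sketch every stage is a stand-in placeholder using toy arithmetic, not the actual AFR-BERT layers:

```python
def bilstm(audio):
    """Stand-in for the BiLSTM audio encoder (placeholder arithmetic)."""
    return [2 * a for a in audio]

def mfl(text, audio):
    """Stand-in for the Multimodal Fusion Layer (MFL)."""
    return [t + a for t, a in zip(text, audio)]

def mal(fused):
    """Stand-in for the Multimodal Association Layer (MAL)."""
    return [f / 2 for f in fused]

def forward(text, audio, use_bilstm=True, use_mfl=True, use_mal=True):
    """Run the pipeline, skipping one stage to mimic an ablation in Table 6."""
    if use_bilstm:
        audio = bilstm(audio)
    # When MFL is removed, the text and audio features are simply concatenated.
    fused = mfl(text, audio) if use_mfl else text + audio
    return mal(fused) if use_mal else fused
```

Each ablation corresponds to calling `forward` with one flag set to False, so any performance drop can be attributed to the disabled stage.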
Table 6
Ablation experiments on the CMU-MOSI dataset.
Model               ACC2   F1     MAE
AFR-BERT(-BiLSTM)   81.43  80.98  0.864
AFR-BERT(-MFL)      78.54  76.84  0.944
AFR-BERT(-MAL)      80.47  79.43  0.907
AFR-BERT            86.74  86.23  0.702
(-) means subtracting the corresponding network. The bolded values represent the best values of the performance indicators.
From the experimental results, the AFR-BERT model shows a drop in every evaluation metric after removing the BiLSTM, MFL, or MAL module. After removing the BiLSTM module, ACC2 decreased by 5.31% and F1 by 5.25% on the binary sentiment classification task, while MAE increased by 0.162 on the regression task. This shows that extracting audio features with BiLSTM helps the model mine the emotional information in the data. After removing the MFL module, ACC2 decreased by 8.2% and F1 by 9.36%, while MAE increased by 0.242. This implies that applying MFL to multimodal data improves the model's effectiveness at sentiment classification. After removing the MAL module, ACC2 decreased by 6.27% and F1 by 6.8%, while MAE increased by 0.205. This validates that MAL can filter the small amount of important information out of a large amount of information and effectively capture the internal correlation between modalities. Overall, the ablation results show that BiLSTM contributes greatly in the feature extraction stage, MFL in the data fusion stage, and MAL in the data analysis stage.
4.3 Example analysis
To demonstrate the importance of multimodal data fusion and multimodal data correlation analysis, we selected some samples from CMU-MOSI for testing. The sentiment polarity of each dialogue ranges from very strongly negative (-3) to very strongly positive (3). As shown in Table 7, each sample contains text and audio information, the true sentiment label, and the prediction of the AFR-BERT model. In Examples 2 and 3, positive words such as "welcome", "jokes", "laugh", and "like" would easily be judged as positive emotions from the text modality alone, but the real emotions are nervous (negative) and low (negative). Our model identified the correlation between the text and audio data, mined the important multimodal features, and accurately judged that the emotions of Examples 2 and 3 were both negative, thus predicting the real emotions.
Table 7
Sample analysis.
Case  Data   Content                                                                True     AFR-BERT
1     Text   I really did enjoy as well wasn’t too fond of the ending.              P(+1.4)  P(+0.6)
      Audio  Cheerful tone
2     Text   Except their eyes are kind of like this welcome to the polar express.  N(-0.6)  N(-0.2)
      Audio  Tense tone
3     Text   Maybe 5 jokes could make me laugh.                                     N(-1.8)  N(-0.3)
      Audio  Low tone
4     Text   But umm I liked it.                                                    P(+1.8)  P(+0.5)
      Audio  Emphasis on tone
(P) means positive emotions. (N) means negative emotions. The numbers represent emotional scores, where positive numbers represent positive scores and negative numbers represent negative scores.
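The P/N notation in Table 7 maps a continuous CMU-MOSI score in [-3, 3] to a polarity label. A small illustrative helper (assuming, as in the table note, that non-negative scores count as positive):

```python
def polarity_label(score):
    """Format a sentiment score in [-3, 3] as the P(+x)/N(-x) notation of Table 7."""
    # Non-negative scores are treated as positive emotion, negative as negative.
    sign = "P" if score >= 0 else "N"
    return f"{sign}({score:+g})"
```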
4.4 Qualitative analysis
The text modality alone has great limitations in understanding the human emotions in Examples 2 and 3 of Table 7, while AFR-BERT can adjust the emotional intensity by considering speech information and thus correctly capture human emotions. These examples show that AFR-BERT integrates the text and audio modalities to explore hidden emotional information more deeply.

Table 6 provides the results of the ablation experiments. It is clear from these results that the modules have a clear impact on AFR-BERT's performance. First, extracting audio features with BiLSTM helps AFR-BERT mine the emotional information in speech data. Second, the Multimodal Fusion Layer (MFL) effectively fuses multimodal data and helps improve AFR-BERT's sentiment classification performance. Third, the Multimodal Association Layer (MAL) filters the small amount of important information out of a large number of messages and effectively captures the internal correlation between modalities, helping AFR-BERT discern the true sentiment information in the data.

Tables 4 and 5 show the results of the comparison experiments on CMU-MOSI and CMU-MOSEI. The AFR-BERT model outperforms the other baseline models on the performance evaluation metrics, which confirms the correctness and strong performance of the AFR-BERT model.
5. Conclusion
We propose an attention-based mechanism feature relevance fusion multimodal sentiment analysis model (AFR-BERT). Unlike traditional text-only unimodal sentiment analysis, the model adds audio modal information to obtain more comprehensive information and captures more sentiment features through the information interaction between the text and audio modalities. While focusing on multimodal data fusion, we also analyze the correlation between multimodal data, which greatly improves sentiment classification. Comparison experiments were conducted on CMU-MOSI and CMU-MOSEI; the results show that, compared with classical multimodal sentiment analysis models, AFR-BERT effectively improves the performance of multimodal sentiment analysis. Ablation experiments were conducted on CMU-MOSI; the results indicate that the BiLSTM, MFL, and MAL modules contribute to feature extraction, feature fusion, correlation extraction, and sentiment recognition for multimodal data. In future work, we will try to integrate visual information with the text and audio information to dissect sentiment changes more deeply and further improve the model's performance in sentiment analysis.
Processed dataset.
Processed dataset includes text and audio data. (RAR)
PONE-D-22-11384
AFR-BERT: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model
PLOS ONE
Dear Dr. Zhou,Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.Please submit your revised manuscript by Jul 17 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.We look forward to receiving your revised manuscript.Kind regards,Sriparna Saha, PhDAcademic EditorPLOS ONEJournal Requirements:When submitting your revision, we need you to address these additional requirements.1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found athttps://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf2. 
Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.3. Thank you for stating the following in the Acknowledgments Section of your manuscript:"This work is sponsored by National Natural Science Foundation of China (Grant No.61901103).We note that you have provided funding information. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:"This work is sponsored by National Natural Science Foundation of China (Grant No. 43561901103)."Please include your amended statements within your cover letter; we will change the online submission form on your behalf.Additional Editor Comments:Both the reviewers have suggested some major changes; Authors are requested to incorporate the suggestions of both the reviewers to improve the paper.[Note: HTML markup is below. Please do not edit.]Reviewers' comments:Reviewer's Responses to Questions
Comments to the Author1. Is the manuscript technically sound, and do the data support the conclusions?The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: PartlyReviewer #2: Partly********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: YesReviewer #2: No********** 3. Have the authors made all data underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: YesReviewer #2: Yes********** 4. Is the manuscript presented in an intelligible fashion and written in standard English?PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: YesReviewer #2: No********** 5. Review Comments to the AuthorPlease use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. 
(Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: I am not sure because I am not find much difference between MISA and proposed methodology.Kindly provide the Code.Without experiment I am not able to comment anything.Atleast provide some screen shots.proofreading required.Reviewer #2: Strengths:1. Sections are detailed and informative for an average reader. All necessary information are provided.2. The experiments performed are thorough with rigorous ablation studies.Weakness:1. The evaluation of the proposed method seems weak and more so, the improvement of the proposed method with the best performing baseline is marginal (in few decimal points), which limits the effectiveness of the approach.2. One major problem I have with the manuscript, overall, is the poor writing style. There is huge scope for improvement in the presentation, specifically numerous typos, grammatical errors, etc. Some of the expositions are shown below. Considering myself as an average versed reader, it was very difficult to ignore all sort of typos (random capitalizations, missing punctuations, spellings, ) occurring in each and every line. The manuscript is not acceptable in its current form based on presentation issues alone.Comments to the authors:In table 6 caption, there is a mention of bolded values, however I don't find any.At multiple places in the manuscript, the authors chose to write that their method 'significantly improves existing systems performances. This is slightly misleading as, (a). performance improvements are marginal compared to existing best method, (b). significance test not done to show that the obtained results are not attained by chance.Typos and Expositions:hard to follow. break into shorter sentences. 
"With the development of Internet technology and the popularity of many social media, in daily life, people often express their feelings about daily life and opinions on hot topics in microblogs, are used to reading product reviews before consumption, and like to write down their feelings after experiencing products and enjoying services, which consequently generates a large amount of multimodal data."Wang [3]proposed a -> space should be given after each citationLin et al [4] used the -> al is followed by a fullstopCMU-MOSI [12] (CMU Mul-timodal -> multimodal can be written without a hyphen or it may come after multi, but definitely after mul. same for 're-quire'Sentiment Intensity)dataset -> remove paranthesisNLP, BERT, etc. -> abbreviated terms must be defined with its full form at their first occurrenceLine 215, 'COVARER' - correct spelling'method. a The 215 time step' -> misplaced 'a'line 218, re-quires -> requiresline 418 -> ablation experiments: capitalization not doneIn all equations, inconsistent superscript and subscript usage. line 232 mentioned X in caps which do not relate to any variable in the equations.********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: NoReviewer #2: No**********[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". 
If this link does not appear, there are no attachment files.]While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.8 Jul 2022Dear Editors and Reviewers:Thank you for your letter and for the reviewers' comments concerning our manuscript entitled “AFR-BERT: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model” (PONE-D-22-11384). Those comments are all valuable and very helpful for revising and improving our paper, as well as the important guiding significance to our research. We have studied the comments carefully and have made corrections which we hope to meet with approval. Revised portions are marked in yellow on the paper. The corrections in the paper and the responses to the editors' and reviewers' comments are as follows:editors:1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.Response: Thank you very much for your review. After our tireless efforts, we have made corresponding changes to the style of the manuscript, the writing requirements, including the naming of the document. We hope our manuscript will meet your requirements, and we also hope to hear from you soon.2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. 
In these cases, all author-generated code must be made available without restrictions upon publication of the work.Response: We have placed the main code covered by this manuscript at https://github.com/Learn-Be/AFR-BERT.3. We note that you have provided funding information. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement.Response: We have removed any funding-related text from the manuscript.Reviewers:1. Is the manuscript technically sound, and do the data support the conclusions?The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.Reviewer #1: PartlyReviewer #2: PartlyResponse: Thank you very much for your constructive comments. To show that our results are not obtained by chance, we have added Section 3.5 (Experimental Settings) to our manuscript, which provides a detailed analysis of the experimental parameters of our method. At the same time, section 4.4 (Qualitative analysis) has been added to our manuscript to further explain and elaborate on the experimental history of the comparison, ablation and case studies. We sincerely hope that our modifications and improvements will meet with your approval.2. Has the statistical analysis been performed appropriately and rigorously?Reviewer #1: YesReviewer #2: NoResponse: Thank you very much for your constructive comments. 
In order to perform the statistical analysis correctly and rigorously, the analysis of the experimental results was fully refined and the results obtained for each experiment were synthesized in the qualitative analysis phase in Section 4.4 in order to obtain the best method.3. Have the authors made all data underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: YesResponse: Thank you very much for your affirmation and recognition of our work.4. Is the manuscript presented in an intelligible fashion and written in standard English?PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.Reviewer #1: YesReviewer #2: NoResponse: We apologize for the poor language of our manuscript. We worked on the manuscript for a long time and the repeated addition and removal of sentences and sections obviously led to poor readability. We have now worked on both language and readability and have also involved native English speakers for language corrections. We really hope that the flow and language level have been substantially improved.Reviewer 1:1. I am not sure because I am not finding much difference between MISA and proposed methodology. 
Kindly provide the Code. Without experiment I am not able to comment anything. At least provide some screen shots. proofreading required.Response: Thank you very much for your constructive comments. Your comments have had an unimaginable impact on the future direction of our team's research. We express our great appreciation. We find that MISA is fundamentally different from these existing works. MISA does not use contextual information and neither focus on complex fusion mechanisms. Instead, it stresses the importance of representation learning before fusion (MISA/2 RELATED WORK/ line 27-30). On the contrary, we have paid attention to the contextual information before fusion, proposed a new fusion algorithm for complex fusion mechanisms, and performed data analysis for the fused multimodal data. And we have elaborated and introduced the features of MISA method in our paper, which we hope can answer your questions (line 58-62). To refine the details of the experiment, we have added section 3.5 in our text, where we further describe the settings of the experimental parameters. We also place our method at https://github.com/Learn-Be/AFR-BERT.Reviewer 2:1. Sections are detailed and informative for an average reader. All necessary information is provided.Response: Thank you very much for your affirmation.2. The experiments performed are thorough with rigorous ablation studies.Response: Thank you very much for your comments and for inspiring us.3. The evaluation of the proposed method seems weak and more so, the improvement of the proposed method with the best performing baseline is marginal (in few decimal points), which limits the effectiveness of the approach.Response: Thank you very much for your constructive comments. Your comments have helped us a lot in our research work. First of all, we acknowledge that the performance improvement of our proposed method is indeed limited compared to the best method, which is undeniable. 
However, it is well known that multimodal data is huge, complex and contains a large amount of interfering data, which makes the performance improvement very difficult. We also regret this. Second, our approach is not without merits. It contributes in the following ways.1. we get rid of the limitations of traditional unimodality and outperform the current optimal model using only text and audio data.2. Our approach takes a new approach to the research areas of feature extraction, multimodal data fusion and multimodal data analysis.4. One major problem I have with the manuscript, overall, is the poor writing style. There is huge scope for improvement in the presentation, specifically numerous typos, grammatical errors, etc. Some of the expositions are shown below. Considering myself as an average versed reader, it was very difficult to ignore all sort of typos (random capitalizations, missing punctuations, spellings, ) occurring in each and every line. The manuscript is not acceptable in its current form based on presentation issues alone.Response: We apologize for the poor language of our manuscript. We worked on the manuscript for a long time and the repeated addition and removal of sentences and sections obviously led to poor readability. We have now worked on both language and readability and have also involved native English speakers for language corrections. We really hope that the flow and language level have been substantially improved.5. Comments to the authors:In table 6 caption, there is a mention of bolded values, however I don't find any. At multiple places in the manuscript, the authors chose to write that their method 'significantly improves existing systems performances. This is slightly misleading as, (a). performance improvements are marginal compared to existing best method, (b). significance test not done to show that the obtained results are not attained by chance.Response: We apologize for the writing errors in our manuscript. 
We are also very grateful to you for pointing out our mistakes. We have made detailed corrections and improvements in response to your comments, and we hope our efforts will meet your requirements. Details can be found in Table 7. We acknowledge that the performance improvement of our proposed method is limited compared to the best method; this is undeniable. However, multimodal data are large, complex, and contain a great deal of interfering data, which makes performance improvement very difficult. We also regret this. To show that our results are not obtained by chance, we have added Section 3.5 (Experimental Settings) to our manuscript, which provides a detailed analysis of the experimental parameters of our method. We have also added Section 4.4 (Qualitative Analysis), which further explains and elaborates on the comparison, ablation, and case-study experiments.

6. Typos and Expositions:

a) Hard to follow; break into shorter sentences: "With the development of Internet technology and the popularity of many social media, in daily life, people often express their feelings about daily life and opinions on hot topics in microblogs, are used to reading product reviews before consumption, and like to write down their feelings after experiencing products and enjoying services, which consequently generates a large amount of multimodal data."

Response: We are very sorry for the irregularities in writing, and thank you very much for pointing out where we went wrong. We have revised this long sentence: "With the development of internet technology, people often express their feelings about their daily lives and opinions on hot topics, are used to reading product reviews before consumption, and like to write down their feelings after experiencing products and enjoying services. This has led to the generation of a huge amount of multimodal data." (lines 2-5)

b) "Wang [3]proposed a" -> a space should be given after each citation.

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "Wang et al. [3] proposed" (line 11)

c) "Lin et al [4] used the" -> "al" is followed by a full stop.

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "Lin et al. [4] used the" (line 12)

d) "CMU-MOSI [12] (CMU Mul-timodal" -> "multimodal" can be written without a hyphen, or the hyphen may come after "multi", but definitely not after "mul"; the same applies to "re-quire".

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "CMU-MOSI (Multimodal Opinion Level Sentiment Intensity)" (line 129)

e) "Sentiment Intensity)dataset" -> remove the parenthesis.

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "CMU-MOSI (Multimodal Opinion Level Sentiment Intensity) [12] is one" (line 129)

f) NLP, BERT, etc. -> abbreviated terms must be defined with their full forms at their first occurrence.

Response: We are very sorry for the irregularities in writing. We have corrected the mistake.

g) Line 215, "COVARER" -> correct the spelling.

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "COVAREP" (line 215)

h) "method. a The 215 time step" -> misplaced "a".

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "The time step of" (line 215)

i) Line 218, "re-quires" -> "requires".

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "requires" (line 217)

j) Line 418, "ablation experiments" -> capitalization not done.

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "Ablation experiments" (line 453)

k) Inconsistent superscript and subscript usage in all the equations; line 232 mentions X in capitals, which does not correspond to any variable in the equations.

Response: We are very sorry for the irregularities in writing. We have corrected the mistake. -> "xt is the input feature of the audio at moment t." (line 232)

We appreciate your warm work earnestly and hope that the corrections will meet with your approval. Once again, thank you very much for your comments and suggestions.

Submitted filename: Response to Reviewers.docx

18 Aug 2022

AFR-BERT: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model

PONE-D-22-11384R1

Dear Dr. Zhou,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double-check that your user information is up to date. If you have any billing-related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance.
Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sriparna Saha, PhD
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions
Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5.
Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Please proofread the paper. Please perform a t-test, as the results are too close. A few lines have grammatical issues; please correct them.

Reviewer #2: The authors have made sufficient revisions and have addressed all the concerns raised. Therefore, it could be accepted as a regular paper.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Soumitra Ghosh

**********

23 Aug 2022

PONE-D-22-11384R1

AFR-BERT: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model

Dear Dr. Jiawei:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact.
If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff
on behalf of
Dr. Sriparna Saha
Academic Editor
PLOS ONE