Abstract
The Russian language is still not as well-resourced as English, especially in the field of sentiment analysis of Twitter content. Though several sentiment analysis datasets of tweets in Russian exist, all of them are either automatically annotated or manually annotated by a single annotator. Thus, either there is no measure of inter-annotator agreement, or the annotation may be focused on a specific domain. In this article, we present RuSentiTweet, a new sentiment analysis dataset of general domain tweets in Russian. RuSentiTweet is currently the largest in its class for Russian, with 13,392 tweets manually annotated with moderate inter-rater agreement into five classes: Positive, Neutral, Negative, Speech Act, and Skip. As a source of data, we used Twitter Stream Grab, a historical collection of tweets obtained from the general Twitter API stream, which provides a 1% sample of the public tweets. Additionally, we released a RuBERT-based sentiment classification model that achieved F1 = 0.6594 on the test subset.
Keywords: Russian; Sentiment analysis; Sentiment dataset
Year: 2022 PMID: 36092008 PMCID: PMC9454938 DOI: 10.7717/peerj-cs.1039
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Sentiment analysis datasets of Russian language texts.
A more detailed description of each dataset can be found in Smetanin (2020a), Smetanin & Komarov (2021a), Kotelnikov (2021), as well as in the original papers (if published). For datasets that contain several subsets from different data sources, we indicated only those subsets that are made from tweets.
| Dataset | Data source | Domain | Annotation | Classes | Size | Link |
|---|---|---|---|---|---|---|
| Twitter Sentiment for 15 European Languages | | General | Manual | 3 | 107,773 | |
| SemEval-2016 Task 5 Russian | | Restaurants | Manual | 3 | 405 | |
| SentiRuEval-2016 | | Telecom and banks | Manual | 3 | 23,595 | |
| SentiRuEval-2015 | | Telecom and banks | Manual | 4 | 16,318 | |
| RuTweetCorp | | General | Automatic | 3 | 334,836 | |
| Kaggle Russian_twitter_sentiment | | n/a | n/a | 2 | 226,832 | |
Figure 1. An example of the user interface for annotators in Yandex.Toloka in Russian (on the left) and its translation into English (on the right).
The green block with quotation marks contains the text of the tweet. Under the block with the text, there are numbered sentiment classes, where 1 is Negative, 2 is Neutral, 3 is Positive, 4 is Speech Act, and 5 is Skip. Numbers are used as hotkeys during annotation.
Examples of tweets with no agreement between annotators.
| Tweet (Russian) | Tweet (English) | Annotator 1 | Annotator 2 | Annotator 3 |
|---|---|---|---|---|
| ю ноу блин | yu know damn it | Skip | Neutral | Negative |
| @USER доброе утро всем дэдди сегодня | @USER good morning daddies to everyone today | Speech | Positive | Skip |
| Путешествуем по Уэльсу. не уважать мужчин | Traveling in Wales. disrespect men | Negative | Neutral | Skip |
| тот факт что в энимал кроссинге так мало прикольных мышиных жителей | the fact that there are so few funny mouse inhabitants in animal crossing | Negative | Positive | Neutral |
| Кто в нижнем родился, в верхнем не сгодился. | Who was born in the bottom, did not fit in the top. | Negative | Neutral | Skip |
Distance between classes for interval Krippendorff’s α, where 0 means that classes are the same, 1 means that classes are close to each other, and 2 means that classes are far away from each other.
| Class | Negative | Neutral | Positive | Speech | Skip |
|---|---|---|---|---|---|
| Negative | 0 | 1 | 2 | 2 | 1 |
| Neutral | 1 | 0 | 1 | 1 | 1 |
| Positive | 2 | 1 | 0 | 0 | 1 |
| Speech | 2 | 1 | 0 | 0 | 1 |
| Skip | 1 | 1 | 1 | 1 | 0 |
Note:
Positive and Speech classes have zero distance between them; they both represent positive sentiment as per RuSentiment guidelines.
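
The distance table above can be plugged into Krippendorff's α as a custom difference function. The following is a minimal sketch, not the authors' implementation: it assumes the squared table distance serves as the interval difference, uses the standard coincidence-matrix formulation, and the function names `delta_sq` and `krippendorff_alpha` are illustrative. The sample units are the five no-agreement tweets listed earlier.

```python
from collections import Counter
from itertools import permutations

CLASSES = ["Negative", "Neutral", "Positive", "Speech", "Skip"]

# Distance table from the article (rows and columns follow CLASSES).
DISTANCE = [
    [0, 1, 2, 2, 1],
    [1, 0, 1, 1, 1],
    [2, 1, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
]


def delta_sq(a, b):
    """Squared distance between two labels (assumption: interval-style metric)."""
    d = DISTANCE[CLASSES.index(a)][CLASSES.index(b)]
    return d * d


def krippendorff_alpha(units):
    """Compute alpha = 1 - D_o / D_e from a list of per-tweet label lists."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label cannot be paired
        for i, j in permutations(range(m), 2):
            coincidences[(labels[i], labels[j])] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    observed = sum(w * delta_sq(c, k) for (c, k), w in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * delta_sq(c, k) for c in totals for k in totals
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0  # no variation in labels; treated here as full agreement
    return 1.0 - observed / expected


# The five no-agreement examples from the table of annotator disagreements.
no_agreement = [
    ["Skip", "Neutral", "Negative"],
    ["Speech", "Positive", "Skip"],
    ["Negative", "Neutral", "Skip"],
    ["Negative", "Positive", "Neutral"],
    ["Negative", "Neutral", "Skip"],
]
print(krippendorff_alpha(no_agreement))
```

Because Positive and Speech have zero distance, a tweet labeled Positive by one annotator and Speech by another contributes no observed disagreement in this sketch.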
Figure 2. Text length distribution.
Most common unigrams, bigrams, and emojis without stop words, punctuation, and numbers.
Stop words were removed using NLTK (Bird, Klein & Loper, 2009). Most unigrams and bigrams can have several English translations depending on the context. The table provides only one translation option.
| Unigram (Russian) | Unigram (English) | Count | Bigram (Russian) | Bigram (English) | Count | Emoji | Count |
|---|---|---|---|---|---|---|---|
| это | it | 1,117 | доброе утро | good morning | 39 | | 443 |
| просто | simply | 355 | спокойной ночи | good night | 26 | | 313 |
| спасибо | thanks | 306 | спасибо большое | thanks a lot | 24 | | 246 |
| хочу | want | 249 | самом деле | actually | 23 | | 240 |
| ещё | yet | 223 | это просто | it’s simple | 23 | | 120 |
| почему | why | 209 | опубликовано фото | published photo | 18 | | 119 |
| очень | very | 205 | сих пор | so far | 17 | | 118 |
| всё | all | 204 | руб г | rub g | 16 | | 113 |
| блять | fuck | 184 | днем рождения | birthday | 15 | | 104 |
| вообще | generally | 174 | все ещё | still | 13 | | 100 |
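
Frequency counts of this kind can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the tiny `STOP_WORDS` set stands in for the NLTK Russian stop-word list (available as `nltk.corpus.stopwords.words("russian")` once the corpus is downloaded), and the tokenizer is an assumption. Bigrams are formed after stop-word removal, which is consistent with entries such as "самом деле" in the table.

```python
import re
from collections import Counter

# Illustrative subset only; in practice the NLTK Russian list would be used.
STOP_WORDS = {"и", "в", "не", "на", "я", "что", "с", "до"}


def tokenize(text):
    """Lowercase and keep runs of letters, dropping punctuation and numbers."""
    return re.findall(r"[^\W\d_]+", text.lower())


def count_ngrams(texts, stop_words=STOP_WORDS):
    """Return (unigram Counter, bigram Counter) over stop-word-filtered tokens."""
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        tokens = [t for t in tokenize(text) if t not in stop_words]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Made-up sample texts, for illustration only.
sample = ["На самом деле, это просто!", "на самом деле 123"]
unigrams, bigrams = count_ngrams(sample)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Emoji counting is omitted here because it would need an emoji inventory that the article does not specify.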
Five-class sentiment classification on RuSentiTweet.
| Model | Precision | Recall | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| RuBERT | 0.6793 | 0.6449 | 0.6594 | 0.6675 |
| MNB | 0.5867 | 0.5021 | 0.5216 | 0.5189 |
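
The article refers to two ways of averaging F1, macro and weighted. The difference between them can be sketched in plain Python as below. The function name `f1_scores` and the toy labels are assumptions for illustration; the authors' evaluation code is not shown in the source.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro F1, weighted F1) for two equal-length label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true instances per class
    pairs = list(zip(y_true, y_pred))

    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0

    # Macro: every class counts equally, regardless of its size.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: each class contributes in proportion to its support.
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


# Toy labels, made up for illustration.
y_true = ["Neutral", "Neutral", "Neutral", "Positive"]
y_pred = ["Neutral", "Neutral", "Positive", "Positive"]
print(f1_scores(y_true, y_pred))
```

On imbalanced data, macro F1 penalizes poor performance on small classes more heavily than weighted F1 does, which is why the two values can differ for the same model.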
Five-class sentiment classification studies.
| Study | Dataset | Model | Accuracy | Precision | Recall | Macro F1 | Weighted F1 |
|---|---|---|---|---|---|---|---|
| | NaijaSenti | XLM-R-base+LAFT | n/a | n/a | n/a | n/a | 0.795 |
| | NaijaSenti | M-BERT+LAFT | n/a | n/a | n/a | n/a | 0.7700 |
| | sentiment@USNavy | BART large + CNN | n/a | n/a | n/a | 0.596 | n/a |
| | RuSentiment | M-BERT-Base | n/a | 0.6722 | 0.6907 | 0.6794 | 0.7244 |
| | RuSentiment | RuBERT | n/a | 0.7089 | 0.7362 | 0.7203 | 0.7571 |
| | RuSentiment | M-USE-CNN | n/a | 0.6571 | 0.6708 | 0.6627 | 0.7105 |
| | RuSentiment | M-USE-Trans | n/a | 0.6821 | 0.6982 | 0.6860 | 0.7342 |
| | TripAdvisor | Dempster–Shafer-based model | 0.79 | 0.5 | 0.47 | 0.49 | n/a |
| | CitySearch | Dempster–Shafer-based model | 0.79 | 0.48 | 0.48 | 0.48 | n/a |
| | RuSentiment | Multilingual BERT | n/a | n/a | n/a | n/a | 0.7082 |
| | RuSentiment | RuBERT | n/a | n/a | n/a | n/a | 0.7263 |
| | RuSentiment | SWCNN + fastText Twitter | n/a | n/a | n/a | n/a | 0.7850 |
| | RuSentiment | BiGRU + ELMo Wiki | n/a | n/a | n/a | n/a | 0.6947 |
| | YouTube | LSTM | 0.5424 | n/a | n/a | 0.5320 | n/a |
| | | Logistic Regression | 0.6899 | 0.6053 | 0.6899 | 0.6354 | n/a |
| | SST-5 | RNTN | 0.41 | n/a | n/a | 0.32 | n/a |
| | | Naïve Bayes | 0.7177 | 0.716 | 0.718 | n/a | n/a |
| | LABR | SVM | 0.503 | n/a | n/a | n/a | 0.491 |
| | ROMIP-2012 (Movies) | n/a | 0.407 | n/a | n/a | 0.377 | n/a |
| | ROMIP-2012 (Books) | SVM | 0.481 | 0.339 | 0.496 | 0.402 | n/a |
| | ROMIP-2012 (Cameras) | n/a | 0.480 | n/a | n/a | 0.336 | n/a |
| | ROMIP-2011 (Movies) | SVM | 0.599 | n/a | n/a | 0.286 | n/a |
| | ROMIP-2011 (Books) | SVM | 0.622 | n/a | n/a | 0.291 | n/a |
| | ROMIP-2011 (Cameras) | SVM | 0.626 | n/a | n/a | 0.342 | n/a |
Note:
We selected only those studies that considered five sentiment classes and reported at least one of the following classification measures: Precision, Recall, macro F1, weighted F1. Among all datasets, only ROMIP (Chetviorkin, Braslavskiy & Loukachevich, 2013; Chetvirokin & Loukachevitch, 2013) and RuSentiment (Rogers et al., 2018) datasets are in Russian.
Figure 3. Confusion matrix for RuSentiTweet.
Figure 4. Confusion matrix for RuSentiment, created using models from Smetanin & Komarov (2021a).
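
For readers who want to reproduce a confusion matrix like those in the figures, a minimal sketch follows. The class order, the function name `confusion_matrix`, and the toy labels are assumptions for illustration; rows are true classes and columns are predicted classes.

```python
from collections import Counter

CLASSES = ["Negative", "Neutral", "Positive", "Speech", "Skip"]


def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Return a matrix where entry [i][j] counts true class i predicted as class j."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def row_normalize(matrix):
    """Convert counts to per-true-class proportions (rows with no samples stay zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy labels, made up for illustration.
y_true = ["Positive", "Positive", "Speech", "Negative"]
y_pred = ["Positive", "Speech", "Positive", "Negative"]
for name, row in zip(CLASSES, confusion_matrix(y_true, y_pred)):
    print(f"{name:>8}: {row}")
```

Row normalization makes classes of different sizes comparable, which helps when one class dominates the test set.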
Examples of tweet classification.
All usernames and URLs were replaced with keywords for anonymity purposes.
| Tweet (Russian) | Tweet (English) | True class | Predicted class |
|---|---|---|---|
| @USERNAME @USERNAME @USERNAME @USERNAME @USERNAME Помедорус | @USERNAME @USERNAME @USERNAME @USERNAME @USERNAME Pomedorus | Skip | Skip |
| @USERNAME ты не лохушка ЛОЛ я тебе завидую…. у меня травма из за интернета вот я лохушка | @USERNAME you’re not a sucker LOL I envy you…. I’m traumatized because of the internet I’m a sucker | Skip | Negative |
| @USERNAME Котиков Одриосолу Дождь | @USERNAME Cats Odriosolu Rain | Skip | Neutral |
| @USERNAME Уж лучше твоя грудь | @USERNAME Your breasts are better | Skip | Positive |
| Как сережки URL | How do you like the earrings URL | Neutral | Neutral |
| @USERNAME Реквием по мечте | @USERNAME Requiem for a dream | Neutral | Positive |
| @USERNAME ПОДОЖДИ НУ МНЕ КАЗАЛОСЬ ДА | @USERNAME WAIT I THINK YES | Neutral | Negative |
| @USERNAME Спокойной ночи и сладких снов | @USERNAME Good night and sweet dreams | Speech | Speech |
| @USERNAME как дела моя хорошая?? ( ) | @USERNAME how are you my dear?? ( ) | Speech | Positive |
| @USERNAME Это классно что у тебя есть эти люди | @USERNAME It’s great that you have these people | Positive | Positive |
| На самом деле я ловлю уруру с этого облачка. | In fact, I catch ururu from this cloud. | Positive | Neutral |
| @USERNAME Это классно что у тебя есть эти люди | What kind of morons are you, many have a school/work day tomorrow | Negative | Negative |
| интересный факт: смысла в клипах тхт больше, чем в твоей жизни | Interesting fact: there is more sense in txt clips than in your life | Negative | Neutral |
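
The replacement of usernames and URLs with keywords, as seen in the examples above, can be sketched with two regular expressions. The patterns and the function name `anonymize` are assumptions; the source does not state which rules the authors applied.

```python
import re

# Assumed patterns: Twitter handles are ASCII letters, digits, and underscores.
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")
URL_RE = re.compile(r"https?://\S+")


def anonymize(text):
    """Replace URLs with 'URL' and user mentions with '@USERNAME'."""
    text = URL_RE.sub("URL", text)  # URLs first, so '@' inside a URL is not touched
    return MENTION_RE.sub("@USERNAME", text)


# Hypothetical input, made up for illustration.
print(anonymize("@ivan_90 Как сережки https://t.co/abc123"))
```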