Rhea Mahajan, Vibhakar Mansotra.
Abstract
Twitter is one of the most popular micro-blogging and social networking platforms, where users post their opinions, preferences, activities, thoughts, and views as tweets of at most 280 characters. To study and analyse the social behavior and activities of users across a region, it is necessary to identify the location of each tweet. This paper aims to predict the city-level geolocation of real-time tweets collected over a period of 30 days, using a combination of a convolutional neural network and a bidirectional long short-term memory network that draws on features within the tweets and features associated with the tweets. We also compare our results with previous baseline models; our experiments show a significant improvement over the baseline methods, achieving an accuracy of 92.6% with a median error of 22.4 km for city-level prediction.
Keywords: Bidirectional long short-term memory; Convolutional neural network; Geolocation; Social networking platform; Twitter
Year: 2021 PMID: 34254044 PMCID: PMC8264169 DOI: 10.1007/s41019-021-00165-1
Source DB: PubMed Journal: Data Sci Eng ISSN: 2364-1541
Chronological list of some important works on geolocation prediction of tweets
| Study | Data set | Features used | Techniques |
|---|---|---|---|
| Han et al. [ | The regional North America geolocation Dataset, WORLD | Location indicative words | Naïve Bayes and Logistic regression |
| Han et al. [ | WORLD | Tweet text and meta data | Topic-based modeling using multinomial Naïve Bayes classifier |
| Han et al. [ | WORLD | Location indicative words, hashtags, user mentions and meta data | Naïve Bayes and logistic regression |
| Huang and Carley [ | Real time tweets | Tweet text and meta data | CNN |
| Huang and Carley [ | Twitter US, Twitter World, W-NUT | Tweet text, meta data and network features | Hierarchical method using neural network |
| Huang et al. [ | W-NUT 2016 | Subword features | Multi-head self-attention mechanism and CNN |
| Our approach | Real-time geo-tagged English-language tweets collected across 10 cities of India | Tweet text, user self-declared home location, and user display name as word embeddings | Combination of CNN and BiLSTM |
Dataset description
| No. of tweets | No. of users | Country | Cities | Time zone |
|---|---|---|---|---|
| 45,678 | 21,544 | India | 10 | One (GMT+5:30) |
Fig. 1 Architecture of the proposed approach
Fig. 2 Architecture of the proposed approach
Model hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch size | 512 |
| Sequence length | 30 |
| Number of classes | 10 |
| Vocabulary ( | 175,409 |
| Embedded vector ( | 128 |
| Shape of the tensor (batch size, sequence length, embedded vector length) | [512 × 30 × 128] |
| Embedded matrix ( | 175,409 × 128 |
| Epochs | 50 |
| Learning rate | 10⁻⁴ |
| Optimizer | RMSprop |
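
As a quick sanity check on the sizes listed in the hyperparameter table, the arithmetic can be sketched as follows. The variable names are illustrative and do not come from the paper; only the four numeric values are taken from the table.

```python
# Sizes copied from the hyperparameter table; the names are illustrative.
batch_size = 512
sequence_length = 30
embedding_dim = 128
vocab_size = 175_409

# Shape of one embedded mini-batch:
# (batch size, sequence length, embedded vector length).
batch_shape = (batch_size, sequence_length, embedding_dim)

# Number of values held in one embedded mini-batch.
batch_elements = batch_size * sequence_length * embedding_dim

# Number of entries in the embedding matrix:
# one vector of length 128 per vocabulary entry.
embedding_params = vocab_size * embedding_dim

print(batch_shape)
print(batch_elements)    # 1966080
print(embedding_params)  # 22452352
```

The embedding matrix alone therefore accounts for roughly 22.5 million values, which is worth keeping in mind when comparing it with the 45,678 tweets in the dataset.
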
Performance of the model
| City | Precision | Recall | F1-score | Accuracy | Output probability |
|---|---|---|---|---|---|
| Lucknow | 0.726891 | 0.667954 | 0.696177 | 0.966944 | 0.625 |
| Patna | 0.834008 | 0.67541 | 0.746377 | 0.969352 | 0.676 |
| Bhopal | 0.90201 | 0.518038 | 0.658112 | 0.959173 | 0.518 |
| Ahmedabad | 0.714286 | 0.431034 | 0.537634 | 0.938813 | 0.431 |
| Hyderabad | 0.566929 | 0.251309 | 0.348247 | 0.941003 | 0.252 |
| Chandigarh | 0.583643 | 0.291822 | 0.389095 | 0.946038 | 0.290 |
| Bengaluru | 0.882102 | 0.61791 | 0.726741 | 0.948884 | 0.617 |
| Gurugram | 0.533279 | 0.738202 | 0.619227 | 0.911559 | 0.741 |
| New Delhi | 0.384338 | 0.645015 | 0.481669 | 0.798818 | 0.647 |
| Mumbai | 0.718521 | 0.850889 | 0.779123 | 0.884194 | 0.851 |
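
The F1-score column can be cross-checked against the precision and recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is ours, and the three rows are copied from the table above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from three rows of the table.
rows = {
    "Lucknow": (0.726891, 0.667954, 0.696177),
    "New Delhi": (0.384338, 0.645015, 0.481669),
    "Mumbai": (0.718521, 0.850889, 0.779123),
}

for city, (p, r, reported) in rows.items():
    print(f"{city}: computed F1 = {f1_score(p, r):.6f}, reported = {reported}")
```

For these rows the computed values agree closely with the reported ones, which suggests the table uses the standard per-class definition.
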
Comparison of our approach with previous baseline models for city-level prediction
| Model | Accuracy | Acc@Top5 | Median error (km) |
|---|---|---|---|
| Han et al. [ | 0.260 | – | 260 |
| Han et al. [ | 0.389 | 0.595 | 77.5 |
| Huang and Carley [ | 0.528 | 0.711 | 28.0 |
| Huang and Carley [ | 0.720 | – | 28.2 |
| Proposed approach | 0.926 | 0.951 | 22.4 |
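
To make the Acc@Top5 and median-error columns concrete, here is a minimal sketch of how such metrics are commonly computed. It assumes the error is the great-circle (haversine) distance between the true and predicted locations, which this record does not confirm; the function names, the toy predictions, and the approximate city-centre coordinates are all illustrative and not taken from the paper.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(true_coords, pred_coords):
    """Median distance between true and predicted coordinates."""
    return median(haversine_km(*t, *p) for t, p in zip(true_coords, pred_coords))


def accuracy_at_k(true_labels, ranked_predictions, k=5):
    """Fraction of samples whose true label is among the top-k predictions."""
    hits = sum(t in ranked[:k] for t, ranked in zip(true_labels, ranked_predictions))
    return hits / len(true_labels)


# Approximate city-centre coordinates (assumed values, for illustration only).
MUMBAI = (19.0760, 72.8777)
NEW_DELHI = (28.6139, 77.2090)
BENGALURU = (12.9716, 77.5946)

# Toy example: two tweets located correctly, one Bengaluru tweet placed in Mumbai.
true_coords = [MUMBAI, NEW_DELHI, BENGALURU]
pred_coords = [MUMBAI, NEW_DELHI, MUMBAI]
print(median_error_km(true_coords, pred_coords))

# Toy ranked predictions, most probable city first.
true_labels = ["Mumbai", "New Delhi", "Bengaluru"]
ranked = [["Mumbai", "Gurugram"], ["Gurugram", "New Delhi"], ["Hyderabad", "Mumbai"]]
print(accuracy_at_k(true_labels, ranked, k=1))
print(accuracy_at_k(true_labels, ranked, k=2))
```

Because the median ignores the size of the largest errors, a model can report a small median error even when a minority of tweets are placed in a distant city, which is why accuracy and median error are reported side by side.
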
Fig. 3 City-level prediction results. The height of the blue bar shows the percentage of tweets from each city whose location is predicted correctly. The height of the orange bar shows the percentage of tweets from each city whose location is predicted incorrectly
Fig. 4 Precision and recall of each city
Fig. 5 Confusion matrix showing true and predicted labels