H Long Nguyen, Dorian Tsolak, Anna Karmann, Stefan Knauff, Simon Kühne.
Abstract
More and more, social scientists are using (big) digital behavioral data for their research. In this context, the social network and microblogging platform Twitter is one of the most widely used data sources. In particular, geospatial analyses of Twitter data are proving to be fruitful for examining regional differences in user behavior and attitudes. However, ready-to-use spatial information in the form of GPS coordinates is only available for a tiny fraction of Twitter data, limiting research potential and making it difficult to link with data from other sources (e.g., official statistics and survey data) for regional analyses. We address this problem by using the free-text locations provided by Twitter users in their profiles to determine the corresponding real-world locations. Since users can enter any text as a profile location, automated identification of geographic locations based on this information is highly complicated. With our method, we are able to assign over a quarter of the more than 866 million German tweets collected to real locations in Germany. This represents a vast improvement over the 0.18% of tweets in our corpus to which Twitter assigns geographic coordinates. Based on the geocoding results, we are able to determine not only a corresponding place for users with valid profile locations, but also the administrative level to which the place belongs. Enriching Twitter data with this information ensures that they can be directly linked to external data sources at different levels of aggregation. We show possible use cases for the fine-grained spatial data generated by our method and how it can be used to answer previously inaccessible research questions in the social sciences. We also provide a companion R package, nutscoder, to facilitate reuse of the geocoding method in this paper.
Keywords: Twitter; geocoding; official statistics; regional analysis; spatial linkage
Year: 2022 PMID: 35755485 PMCID: PMC9220088 DOI: 10.3389/fsoc.2022.910111
Source DB: PubMed Journal: Front Sociol ISSN: 2297-7775
Figure 1. Number of Twitter users who tweeted (retweets and tweets from verified accounts not included) between October 15, 2018, and October 14, 2021, per NUTS-3 region in Germany according to Twitter geotags and our geocoding results using user profile locations.
NUTS-3 regions with the fewest users based on Twitter geotags.
| NUTS-3 code | Region | Users |
|---|---|---|
| DEB3G | Kusel | 6 |
| DEG0D | Sömmerda | 9 |
| DE255 | Schwabach | 9 |
| DE272 | Kaufbeuren | 10 |
| DE22C | Dingolfing-Landau | 11 |
| DE247 | Coburg | 11 |
| DE926 | Holzminden | 11 |
| DE267 | Haßberge | 11 |
| DEG0N | Eisenach, Stadt | 11 |
| DE234 | Amberg-Sulzbach | 12 |
| DE23A | Tirschenreuth | 12 |
| DEB37 | Pirmasens, kreisfreie Stadt | 12 |
| DEG06 | Eichsfeld | 12 |
| DEG0A | Kyffhäuserkreis | 12 |
Figure 2. Number of monthly active Twitter users in our dataset.
Random sample of geocoding results where the input is the Twitter profile location and the output is the corresponding administrative regions in Germany.
| Profile location | NUTS-1 | NUTS-2 | NUTS-3 |
|---|---|---|---|
| fRaNkFuRt | DE7 | DE71 | DE712 |
| Aicha vorm Wald | DE2 | DE22 | DE228 |
| Schwei | DE9 | DE94 | DE94G |
| Brochenzell | DE1 | DE14 | DE147 |
| hh | DE6 | DE60 | DE600 |
| nrw | DEA | – | – |
| Jena, Germany | DEG | DEG0 | DEG03 |
| Aub, Deutschland | DE2 | DE26 | DE26C |
| Germany-Mülheim an der Ruhr | DEA | DEA1 | DEA16 |
| Kuhbach im Schwarzwald | DE1 | DE13 | DE134 |
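
The three output columns in the sample table above are nested: each NUTS level extends its parent's code by one character, so the most specific code determines the levels above it, and an input such as "nrw" resolves only to NUTS-1. The sketch below illustrates that prefix structure together with a toy, case-insensitive lookup. The `GAZETTEER` dictionary and the `normalize`, `nuts_levels`, and `geocode` functions are hypothetical names chosen for illustration; they are not the nutscoder API and do not reproduce the authors' matching method.

```python
from typing import Optional

# Hypothetical toy gazetteer: normalized profile location -> most specific NUTS code.
# The entries are copied from the sample table; a real gazetteer would be far larger.
GAZETTEER = {
    "frankfurt": "DE712",
    "hh": "DE600",
    "nrw": "DEA",
    "jena": "DEG03",
}


def normalize(location: str) -> str:
    """Lowercase, drop a trailing country name, and collapse whitespace."""
    text = location.casefold().strip()
    for suffix in (", germany", ", deutschland"):
        if text.endswith(suffix):
            text = text[: -len(suffix)]
    return " ".join(text.split())


def nuts_levels(code: str) -> dict:
    """Split a NUTS code into its levels (3, 4, and 5 characters long)."""
    return {
        "NUTS-1": code[:3] if len(code) >= 3 else None,
        "NUTS-2": code[:4] if len(code) >= 4 else None,
        "NUTS-3": code[:5] if len(code) >= 5 else None,
    }


def geocode(location: str) -> Optional[dict]:
    """Return the NUTS levels for a known location, or None if it is not found."""
    code = GAZETTEER.get(normalize(location))
    return nuts_levels(code) if code else None
```

Calling `geocode("fRaNkFuRt")` in this sketch yields all three levels, while `geocode("nrw")` fills only NUTS-1, mirroring the dashes in the table.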
Number of tweets per user from October 15, 2018, to October 14, 2021. Retweets and tweets from verified accounts are excluded.
| | | | | |
|---|---|---|---|---|
| Geocoded with profile location | 230.0 | 9 | 1,939 | 792,298 |
| Geotagged by Twitter | 29.8 | 2 | 1,108 | 226,900 |
| No geolocation | 42.9 | 1 | 669 | 447,564 |
Performance of our geocoding method.
| | | | | | |
|---|---|---|---|---|---|
| NUTS-1 | 13,423 | 92.74 | - | - | - |
| NUTS-2 | 12,919 | 90.92 | - | - | - |
| NUTS-3 | 12,793 | 86.07 | - | - | - |
| All levels | 13,423 | 85.70 | 95.87 | 0 | 18.35 |
Figure 3. Share of Twitter users geotagged by Twitter and geocoded with profile locations vs. share of the German population by NUTS-3 region.
Figure 4. Number of users who tweeted in support of the Green party during the 30-day period leading up to the 2021 German federal election, divided by population, per NUTS-2 region.
Figure 5. Most common name for German bread rolls by NUTS-3 region.
Figure 6. Percentage of Twitter users who used gender-inclusive language at least once by NUTS-3 region.
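
The population-normalized counts described for Figure 4 can be sketched as a count of distinct users per region divided by that region's population. This is a minimal illustration, not the authors' pipeline: the function name `users_per_capita` is hypothetical, and the sample user IDs and population figures are placeholders, not data from the paper or from official statistics.

```python
from collections import defaultdict


def users_per_capita(user_regions, population):
    """Count distinct users per region and divide by the region's population."""
    users = defaultdict(set)
    for user_id, region in user_regions:
        users[region].add(user_id)  # a set ignores repeated tweets by one user
    return {
        region: len(ids) / population[region]
        for region, ids in users.items()
        if population.get(region)  # skip regions with missing or zero population
    }


# Placeholder inputs for illustration only; not data from the paper.
sample = [("u1", "DE71"), ("u2", "DE71"), ("u1", "DE71"), ("u3", "DE60")]
pop = {"DE71": 4_000, "DE60": 2_000}
rates = users_per_capita(sample, pop)
```

Counting users rather than tweets keeps a few very active accounts from dominating a region's value.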
Regression models of the proportion of gender-inclusive language users in NUTS-3 regions.
| | | | | |
|---|---|---|---|---|
| (Intercept) | 0.058*** | 0.082*** | −0.110* | −0.106* |
| | (0.007) | (0.007) | (0.047) | (0.048) |
| Population density (log) | 0.016*** | 0.003* | 0.002 | 0.003 |
| | (0.001) | (0.002) | (0.002) | (0.002) |
| Share academic employees | | 0.004*** | 0.004*** | 0.003*** |
| | | (0.000) | (0.000) | (0.000) |
| Share female population (20-40y) | | | 0.004*** | 0.004*** |
| | | | (0.001) | (0.001) |
| λ | | | | 0.185** |
| | | | | (0.069) |
| R² | 0.273 | 0.470 | 0.492 | 0.503 |
| Num. obs. | 401 | 401 | 401 | 401 |
| Log likelihood | 856.052 | 919.463 | 927.781 | 930.947 |
***p < 0.001; **p < 0.01; *p < 0.05.
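
The table does not name the estimator behind the model that reports λ. If it is a spatial error model, which is an assumption here and not something this excerpt states, λ would be the spatial autoregressive coefficient on the error term. With $y$ the proportion of gender-inclusive language users per NUTS-3 region, $X$ the covariates, and $W$ an assumed spatial weights matrix over the 401 regions, that specification can be written as:

```latex
\begin{aligned}
y &= X\beta + u, \\
u &= \lambda W u + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I).
\end{aligned}
```

Under this reading, a positive λ indicates that the unexplained part of the outcome is positively correlated between neighboring regions.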