Ryan Compton1, Craig Lee1, Jiejun Xu1, Luis Artieda-Moncada1, Tsai-Ching Lu1, Lalindra De Silva2, Michael Macy3.
Abstract
We demonstrate how one can generate predictions for several thousand incidents of Latin American civil unrest, often many days in advance, by surfacing informative public posts available on Twitter and Tumblr. The data mining system presented here runs daily and requires no manual intervention. Identification of informative posts is accomplished by applying multiple textual and geographic filters to a high-volume data feed consisting of tens of millions of posts per day that have been flagged as public by their authors. Predictions are built by annotating the filtered posts, typically a few dozen per day, with demographic, spatial, and temporal information. Key to our textual filters is the fact that social media posts are necessarily short, making it possible to infer topic by simply searching for co-mentions of typically unrelated terms within the same post (e.g., a future date co-mentioned with an unrest keyword). Additional textual filters then apply a logistic regression classifier trained to recognize accounts belonging to organizations that are likely to announce civil unrest. Geographic filtering is accomplished despite sparsely available GPS information and without relying on sophisticated natural language processing. A geocoding technique that infers non-GPS-known user locations via the locations of their GPS-known friends provides us with location estimates for 91,984,163 Twitter users at a median error of 6.65 km. We show that announcements of upcoming events tend to localize within a small geographic region, allowing us to forecast event locations that are not explicitly mentioned in text. We annotate our forecasts with demographic information by searching the collected posts for demographic-specific keywords generated by hand as well as with the aid of DBpedia. Our system has been in production since December 2012 and, at the time of this writing, has produced 4,771 distinct forecasts for events across ten Latin American nations.
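The co-mention filter described above can be illustrated with a minimal sketch. It assumes Spanish-language posts, a single date format ("18 de marzo"), and a tiny hand-picked keyword list; the keyword set, the regular expression, and both function names are hypothetical and are not taken from the system itself.

```python
import re
from datetime import date

# Hypothetical keyword list; the system's actual vocabulary is not given here.
UNREST_KEYWORDS = {"marcha", "protesta", "huelga", "manifestación"}

# Spanish month names mapped to month numbers.
MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}

# Matches dates written as "<day> de <month>", e.g. "18 de marzo" (assumed format).
DATE_RE = re.compile(r"\b(\d{1,2}) de (" + "|".join(MONTHS) + r")\b", re.IGNORECASE)


def mentioned_future_date(text, posted_on):
    """Return the first date in `text` that falls after `posted_on`, else None.

    Simplification: the mentioned date is assumed to be in the posting year.
    """
    for day, month in DATE_RE.findall(text):
        try:
            candidate = date(posted_on.year, MONTHS[month.lower()], int(day))
        except ValueError:  # impossible dates such as "31 de febrero"
            continue
        if candidate > posted_on:
            return candidate
    return None


def passes_comention_filter(text, posted_on):
    """True when a post mentions both an unrest keyword and a future date."""
    words = set(re.findall(r"\w+", text.lower()))
    has_keyword = bool(words & UNREST_KEYWORDS)
    return has_keyword and mentioned_future_date(text, posted_on) is not None
```

Because posts are short, requiring both signals in the same post is a cheap proxy for topic; a production filter would need many more date formats and languages.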
Manual examination of 2,859 posts surfaced by our method revealed that only 108 were discussing topics unrelated to civil unrest. Examination of 2,596 forecasts generated between 2013-07-01 and 2013-11-30 found that 1,192 (45.9%) matched the exact date of, and fell within a 100 km radius of, a civil unrest event reported in traditional news media.
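The matching criterion (same date, within 100 km) can be sketched as follows. This assumes great-circle (haversine) distance on a spherical Earth; the abstract does not state how distance was computed, and the function names and tuple layout are hypothetical.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def forecast_matches(forecast, event, radius_km=100.0):
    """True when forecast and event share a date and lie within `radius_km`.

    Both arguments are (date, lat, lon) tuples (an assumed layout).
    """
    f_date, f_lat, f_lon = forecast
    e_date, e_lat, e_lon = event
    same_day = f_date == e_date
    return same_day and haversine_km(f_lat, f_lon, e_lat, e_lon) <= radius_km


# The reported match rate follows directly from the counts in the text.
match_rate = round(100 * 1192 / 2596, 1)  # 45.9
```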
Keywords: Computational social science; Data and text mining; Information retrieval
Year: 2014 PMID: 26594609 PMCID: PMC4643851 DOI: 10.1186/s13388-014-0004-6
Source DB: PubMed Journal: Secur Inform
Figure 1. Empirical CDF of the median absolute deviation of retweeter locations of 4,004 forecasts generated by our model. With over 80% probability the retweeters are dispersed by less than 500 km.
Figure 2. Histogram of forecasts per retweeter dispersion level. Retweeters typically localize within a small radius. We take the center of the retweeter locations to be the forecast location.
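As a rough illustration of the two quantities in Figures 1 and 2, the sketch below takes the "center" to be the coordinate-wise median of retweeter latitudes and longitudes and the dispersion to be the median haversine distance from that center. Both definitions are assumptions made for illustration; the captions do not specify the exact estimator, and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def forecast_location(points):
    """Coordinate-wise median of (lat, lon) pairs (assumed stand-in for the center)."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return median(lats), median(lons)


def dispersion_km(points):
    """Median distance (km) of the points from their center, a MAD-style spread."""
    c_lat, c_lon = forecast_location(points)
    return median(haversine_km(lat, lon, c_lat, c_lon) for lat, lon in points)
```

A median-based center and spread are robust to a few distant retweeters, which is why one far-away account does not move the forecast location much. The coordinate-wise median would misbehave near the antimeridian, which is not a concern for Latin American coordinates.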
Figure 3. Example forecast. A march related to Petroleos Mexicanos (Pemex) is planned for March 18 in Mexico City. Our system detected the event on March 5th. The interactive map provides end-users with links to retweeter accounts.
Figure 4. Cumulative sum of the number of forecasts generated since 2012-12-17. The increased number of warnings per day in November 2013 was due primarily to improvements in date tagging.
Number of forecasts generated for each country
| Number of forecasts | Country |
|---|---|
| 500 | Argentina |
| 778 | Brazil |
| 317 | Chile |
| 557 | Colombia |
| 134 | Ecuador |
| 69 | El Salvador |
| 1235 | Mexico |
| 128 | Paraguay |
| 65 | Uruguay |
| 985 | Venezuela |
Mexico is highly active on twitter.com and receives the most coverage from our system. Timeframe: 2012-12-17 until 2014-01-14.
Total number of forecasts generated by our system
| Data feed | Number of forecasts | Average lead time (days) |
|---|---|---|
| Twitter only | 5150 | 3.91 |
| Tumblr only | 198 | 6.38 |
| Both | 1298 | 2.93 |
| Total | 6596 | 3.81 |
Timeframe: 2012-12-17 until 2014-03-10.
Figure 5. Number of events forecast to happen per day in Brazil during June 2013. Our system underreported the initial wave of protests, but successfully captured a major uptick in late June. Average lead time: 5.58 days.
Figure 6. Venn diagram showing the number of Tumblr posts passing each filter.
Figure 7. Snapshots of Tumblr posts (detected by our system) showing planned future civil unrest events.
Number of forecasts generated for June 2013 from the different data feeds
| Data feed | Number of forecasts | Average lead time (days) |
|---|---|---|
| Twitter only | 525 | 5.57 |
| Tumblr only | 51 | 5.98 |
| Both | 4 | 2.75 |
| Total | 580 | 5.58 |
Surprisingly, of the 580 forecasts, only 4 were visible on both Twitter and Tumblr.