Mohamed Reda Bouadjenek, Scott Sanner, Zahra Iman, Lexing Xie, Daniel Xiaoliang Shi.
Abstract
Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using standard classifiers, there remain many open questions about the efficacy of such classification-based filtering approaches. For example, over a year or more after training, how well do such classifiers generalize to future novel topical content, and are such results stable across a range of topics? In addition, how robust is a topic classifier over the time horizon, e.g., can a model trained in one year be used for making predictions in the subsequent year? Furthermore, what features, feature classes, and feature attributes are most critical for long-term classifier performance? To answer these questions, we collected a corpus of over 800 million English Tweets via the Twitter streaming API during 2013 and 2014 and learned topic classifiers for 10 diverse themes ranging from social issues to celebrity deaths to the "Iran nuclear deal". The results of this long-term study of topic classifier performance provide a number of important insights, among them that: (i) such classifiers can indeed generalize to novel topical content with high precision over a year or more after training, though performance degrades with time, (ii) the classes of hashtags and simple terms contain the most informative feature instances, (iii) removing tweets containing training hashtags from the validation set allows better generalization, and (iv) the simple volume of tweets by a user correlates more with their informativeness than their follower or friend count.
In summary, this work provides a long-term study of topic classifiers on Twitter that further justifies classification-based topical filtering approaches while providing detailed insight into the feature properties most critical for topic classifier performance.
Keywords: Data analysis; Social network analysis; Topic classification
Year: 2022; PMID: 35721404; PMCID: PMC9202616; DOI: 10.7717/peerj-cs.991
Source DB: PubMed; Journal: PeerJ Comput Sci; ISSN: 2376-5992
Feature Statistics of our 811,683,028 tweet corpus.
| 85,794,831 | 13,607,023 | 46,391,269 | 18,244,772 | 16,212,640 |

| 10,196 | 8.67 | 2 | running_status |
| 1,653,159 | 13.91 | 1 | #retweet |
| 6,291 | 1.26 | 1 | tweet_all_time |
| 10,848,224 | 9,562.34 | 130 | london |
| 241,896,559 | 492.37 | 1 | rt |

| 592,363 | 10.08 | 1 | #retweet |
| 26,293 | 5.44 | 1 | dimensionist |
| 739,120 | 641.5 | 2 | london |
| 1,799,385 | 6,616.65 | 1 | rt |

| 18,167 | 2 | 0 | daily_astrodata |
| 2,440,969 | 1,837.79 | 21 | uk |
Figure 1: (A–F) Per capita tweet frequency across different international and U.S. locations for different topics.
The legend provides the number of tweets per 1 million capita.
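The per-capita normalization named in the legend can be sketched as below. This is a minimal illustration only: the function name, the location keys, and all counts and populations are made up and do not come from the paper.

```python
def tweets_per_million(tweet_count: int, population: int) -> float:
    """Return the number of tweets per 1 million inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return tweet_count * 1_000_000 / population


# Hypothetical (made-up) tweet counts and populations, for illustration only.
example_locations = {
    "location_a": (1_200, 8_000_000),
    "location_b": (300, 500_000),
}

# Normalizing by population makes small and large locations comparable.
per_capita = {
    name: tweets_per_million(count, population)
    for name, (count, population) in example_locations.items()
}
```

With these made-up numbers the smaller location ends up with the higher per-capita rate, which is the kind of effect a raw count would hide.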
Train/Validation/Test Hashtag samples and statistics.
| Tennis | Space | Soccer | Iran nuclear deal | Human disaster | Celebrity death | Social issues | Natural disaster | Epidemics | LGBT |
|---|---|---|---|---|---|---|---|---|---|
| 62 | 112 | 144 | 12 | 57 | 33 | 37 | 61 | 55 | 30 |
| 14 | 32 | 42 | 2 | 8 | 4 | 5 | 4 | 17 | 9 |
| 14 | 17 | 21 | 3 | 12 | 7 | 8 | 17 | 13 | 5 |
| 21,716 | 5,333 | 14,006 | 6,077 | 153,612 | 155,121 | 27,423 | 46,432 | 14,177 | 1,344 |
| 191,905 | 46,587 | 123,073 | 54,045 | 1,363,260 | 1,376,872 | 244,106 | 411,609 | 125,092 | 11,915 |
| 884 | 2,281 | 4,073 | 1,261 | 53,340 | 23,710 | 3,088 | 843 | 4,348 | 50 |
| 7,860 | 20,368 | 36,341 | 11,363 | 473,791 | 210,484 | 27,598 | 7,456 | 39,042 | 443 |
| 1,510 | 5,908 | 11,503 | 368 | 34,055 | 7,334 | 14,566 | 5,240 | 3,105 | 692 |
| 13,746 | 53,348 | 103,496 | 3,256 | 305,662 | 65,615 | 130,118 | 47,208 | 27,828 | 6,325 |
| #usopenchampion | #asteroids | #worldcup | #irandeal | #gazaunderattack | #robinwilliams | #policebrutality | #earthquake | #ebola | #loveislove |
| #novakdjokovic | #astronauts | #lovesoccer | #iranfreedom | #childrenofsyria | #ripmandela | #michaelbrown | #storm | #virus | #gaypride |
| #wimbledon | #satellite | #fifa | #irantalk | #iraqwar | #ripjoanrivers | #justice4all | #tsunami | #vaccine | #uniteblue |
| #womenstennis | #spacecraft | #realmadrid | #rouhani | #bombthreat | #mandela | #freetheweed | #abfloods | #chickenpox | #homo |
| #tennisnews | #telescope | #beckham | #nuclearpower | #isis | #paulwalker | #newnjgunlaw | #hurricanekatrina | #theplague | #gaymarriage |
Cutoff threshold and corresponding number of unique values of candidate features CF for learning. Thresholds were chosen to balance the number of each type of feature.
| Frequency threshold | #Unique values |
|---|---|
| 235 | 206,084 |
| 65 | 201,204 |
| 230 | 200,051 |
| 160 | 205,884 |
| 200 | 204,712 |
| – | 1,017,935 |
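A frequency cutoff of the kind listed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function name is hypothetical, the toy hashtag stream is made up, and treating the threshold as inclusive (count at least the threshold) is an assumption the caption does not settle.

```python
from collections import Counter


def candidate_features(feature_occurrences, threshold):
    """Keep feature values observed at least `threshold` times.

    Assumption: the cutoff is inclusive (count >= threshold).
    """
    counts = Counter(feature_occurrences)
    return {value for value, n in counts.items() if n >= threshold}


# Toy stream of hashtag occurrences (made up); the threshold of 2 is
# illustrative and far smaller than the thresholds in the table.
hashtags = ["#ebola", "#rip", "#ebola", "#worldcup", "#rip", "#ebola"]
kept = candidate_features(hashtags, threshold=2)
```

Raising the threshold shrinks the set of unique values, which is how a separate cutoff per feature type can bring each type to a similar count.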
Performance of topical classifier learning algorithms across metrics and topics with the mean performance over all topics shown in the right column with ± 95% confidence intervals.
The best mean performance per metric is shown in bold.
| Tennis | Space | Soccer | Iran nuclear deal | Human disaster | Celebrity death | Social issues | Natural disaster | Epidemics | LGBT | Mean | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
| 0.6452 | 0.5036 |
|
| 0.9293 | 0.5698 |
| 0.4005 | 0.1559 | 0.6782 |
|
|
| 0.5859 | 0.8471 | 0.3059 | 0.9584 | 0.4224 | 0.4658 | 0.5030 | 0.3518 | 0.4050 | 0.1689 | 0.5014 |
|
|
| 0.702 | 0.840 |
| 0.586 | 0.603 | 0.469 | 0.370 | 0.248 | 0.136 | 0.082 | 0.471 |
|
|
| 0.9344 |
| 0.5509 | 0.9757 | 0.6658 |
|
| 0.8306 |
|
|
|
|
|
| 0.9550 | 0.7751 | 0.4739 | 0.9752 | 0.598 | 0.542 | 0.5078 | 0.9599 | 0.5317 | 0.1774 | 0.6496 |
|
|
|
| 0.2 | 0.3 |
| 0.5 | 0.8 | 0.2 |
| 0.5 |
| 0.61 |
|
|
| 0.1 |
| 0.0 | 0.9 | 0.7 | 0.1 | 0.0 | 0.3 | 0.1 | 0.0 | 0.3 |
|
|
|
|
|
| 0.8 | 0.4 | 0.3 | 0.0 | 0.1 | 0.0 | 0.2 | 0.42 |
|
|
|
| 0.5 | 0.5 |
|
|
|
|
|
| 0.5 |
|
|
|
|
| 0.0 | 1.0 |
| 0.7 | 0.9 | 0.0 | 0.9 | 0.3 | 0.4 | 0.62 |
|
|
|
| 0.65 |
|
| 0.74 | 0.94 | 0.59 |
| 0.45 | 0.2 | 0.696 |
|
|
| 0.56 |
| 0.0 | 0.98 | 0.39 | 0.36 | 0.16 | 0.37 | 0.48 | 0.1 | 0.435 |
|
|
| 0.73 | 0.72 | 0.31 | 0.70 |
| 0.44 | 0.48 | 0.34 | 0.02 | 0.100 | 0.472 |
|
|
|
| 0.94 | 0.43 | 0.98 | 0.62 |
|
| 0.9 |
|
|
|
|
|
| 1.0 | 0.59 | 0.34 | 1.0 | 0.72 | 0.54 | 0.39 | 0.96 | 0.54 | 0.24 | 0.632 |
|
|
| 0.653 | 0.703 | 0.545 | 0.299 |
| 0.884 | 0.574 | 0.919 | 0.267 | 0.076 |
|
|
|
| 0.551 | 0.667 | 0.29 |
| 0.338 | 0.542 | 0.655 | 0.287 | 0.319 |
| 0.4151 |
|
|
|
|
|
| 0.218 | 0.525 | 0.547 | 0.215 | 0.173 | 0.154 | 0.064 | 0.438 |
|
|
| 0.728 | 0.464 | 0.576 | 0.331 | 0.463 |
|
| 0.728 |
| 0.159 | 0.5549 |
|
|
| 0.571 | 0.821 | 0.53 | 0.329 | 0.476 | 0.84 | 0.49 |
| 0.234 | 0.083 | 0.5303 |
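To show how the Mean column relates to the per-topic values, the sketch below recomputes the mean for one row of the table that survived with all ten per-topic values and attaches a 95% interval. The interval method (Student's t with 9 degrees of freedom) is an assumption; the caption does not say how the intervals were obtained.

```python
import math
import statistics

# Per-topic scores copied from one fully preserved row of the table above
# (the row's method and metric labels were lost in extraction).
scores = [0.5859, 0.8471, 0.3059, 0.9584, 0.4224,
          0.4658, 0.5030, 0.3518, 0.4050, 0.1689]

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
# Assumption: a t-interval over the 10 topics; the source does not state this.
T_CRIT_DF9 = 2.262

mean = statistics.mean(scores)
standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
half_width = T_CRIT_DF9 * standard_error

print(f"{mean:.4f} ± {half_width:.4f}")
```

The recomputed mean rounds to 0.5014, which agrees with the last cell of that row.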
Top tweets for each topic from Logistic Regression method results, marked with ✗ as irrelevant, ✓ as relevant and labeled as topical, and ★ as relevant but labeled as non-topical (a mislabeled example).
| Tennis | Space |
|---|---|
| ✓ PHOTOS; @andy_murray in @usopen QF match v Novak Djokovic … @usta @BritishTennis #USOpen2014… | ✗ RT @wandakki: Chuck’s Story - My 600-lb Life — |
| ✓ PHOTOS; British #1 @andy_murray in @usopen Quarter-Finals match v Novak Djokovic … @usta @BritishTennis #USOpen2014… | ✗ RT @arist_brain: Path. #Switzerland (by Roman Burri) #travel #landscape #nature #path #sky #alps #clouds… |
| ✓ RT @fi_sonic: PHOTOS; @andy_murray in 75 75 64 win over Jo-wilfried Tsonga to reach @usopen QFs. @BritishTennis… | ✗ TeamFest Winner Circle by Dee n Ralph on Etsy--Pinned with |
| ✓ PHOTOS; #21 seed @sloanetweets in her @usopen 2nd round match v Johanna Larsson … @USTA @WTA #USOpen2014… | ✓ RT @NASA: Fire @YosemiteNPS as seen by NASA’s Aqua satellite on Sunday. #EarthRightNow… |
| ✓ “ @fi_sonic: PHOTOS; @DjokerNole celebrating his @usopen QF match win 76 67 62 64 v Andy Murray … @usta #USOpen2014… | ✓ RT @NASA: Arkansas April 27 tornado track seen by NASA’s EO-1 satellite. |
| Soccer | Iran nuclear deal |
| ✓ RT @FOXSoccer: Cameron in for Beckerman #USA lineup: Howard, Gonzalez, Bradley, Besler, Beasley, Dempsey… | ✓ RT @JavadDabiran: #Iran-Executions, #Women rights abuse, #IranHRviolations soar under Hassan Rouhani #No2Rouhani… |
| ✓ RT @FOXSoccer: Cameron in for Beckerman #USA lineup: Howard, Gonzalez, Bradley, Besler, Beasley, Dempsey | ✓ RT @HellenaRezai: #Iran-Executions, #Women rights abuse, #IranHRviolations soar under Hassan Rouhani #No2Rouhani… |
| ★ RT @Gerrard8FanPage: Luis Suarez has scored seven goals in six Barclays Premier League appearances against Sunderland. | ✓ RT @peymaneh123: #Iran-Executions, #Women rights abuse, #IranHRviolations soar under Hassan Rouhani #No2Rouhani… |
| ★ RT @BBCMOTD: Federico Fazio is the first player sent off on his PL debut since Samba Diakite for #QPR in Feb 2012 #THFC… | ✓ RT @IACNT: #Iran nuclear threat bigger than claimed: |
| ★ @JamesYouCun* well I’d say Migs, moreno sakho toure (if fit) manquilio Lucas can gerrard sterling Coutinho markovic and borini | ✓ RT @YelloJackets: #Iran-Executions, Women rights abuse and #IranHRviolations soar under Hassan Rouhani |
| Human disaster | Celebrity death |
| ✓ @IlenePrusher if one thinks of Gazan kids as potential Hamas fighters Gazan women as potential Hamas fighters’ mothers, yes! | ✓ #RIPRise Heaven gained another angel yet another angel, you will be happy with EunB, all our prayers are for you… |
| ★ RT @jala_leb: This is GAZA not Hiroshima @BarackObama @David_Cameron @un @hrw | ✓ RT @WeGotLoves: EunB, Manager, Driver Rise passed away. Very heartbreaking news. Deep condolences to their family. |
| ✓ RT @jallubk: THIS AGAIN: BOYCOTT ISRAEL OR WE WILL BOYCOTT YOU, @robbiewilliams ! #IsraelKillsKids… | ✓ RT @sehuntella: eunb, manager, driver and rise passed away. what a heartbreaking news. deep condolences to their family |
| ✗ RT @notdramadriven: Nailed it @KenWahl1 @DrMartyFox @jjauthor @shootingfurfun @CarmineZozzora… | ✓ RT @missA_TH: Our deep condolences to family, friends and fans of EunB Rise. May they rest in peace. Heaven has |
| ✓ RT @TelecomixCanada: @Op_Israel #Article51 of the Geneva Convention: | ✓ Rest in peace Rise! Heaven now gained two angels. #RipRise #PrayForLadiesCode My condolences :( |
| Social issues | Natural disaster |
| ✓ RT @RightCandidates: THANK YOU DEMOCRAT RACE BAITERS #tcot #america #women #millennials #tlot… | ✓ RT @ianuragthakur: I appeal to friends supporters @BJYM to help in the relief efforts fr #KashmirFloods… |
| ✗ RT @2AFight: The Bill of Rights IS my Patriot Act #2A #NRA #MolonLabe #RKBA #ORPUW #PJNET #tgdn… | ✓ RT @RSS_Org: RSS Press Release: An Appeal to the Society to donate for Relief Fund to help #KashmirFloods Victims… |
| ✗ The Supreme Court Judicial Tyranny | ✓ RT @punkboyinsf: #BREAKING California Gov. Jerry Brown has declared a state of emergency following… |
| ✓ RT @RightCandidates: THANK YOU DEMOCRAT RACE BAITERS FOR THIS #tcot #america #women #FergusonDecision… | ✓ RT @nbcbayarea: #BREAKING California Gov. Jerry Brown has declared a state of emergency following… |
| ✗ Race-Baiting for Profit RT | ✓ RT @coolfunnytshirt: Congress ke bure din! RT @timesnow: Congress leader Saifuddin Soz heckled by flood victims |
| Epidemics | LGBT |
| ✓ RT @justgrateful: Surgeon General Nominee is Blocked by NRA #occupy #uppers #tcot #ccot #topprog #EbolaCzar… | ✗ RT @CSGV: Take a bite out of the crime. Oppose traitors preparing for war w/ our gov’t. #NRA #NRAAM Cliven Bundy… |
| ✓ RT @nhdogmom: Why don’t we have Surgeon General/Medical #EbolaCzar … GOP RWNJ’s is why!!… | ✗ IRS employee suspended for pro-Obama… - Washington Times: |
| ✗ New York seen like never before! #cool #photo #black white #atmospheric #moody | ✓ Pa. gay-marriage ban overturned |
| ✗ RT @ryangrannand:.@CouncilW9 asking developer for a sign plan. #waltham | ✓ RT @OR4Marriage: RT this AMAZING quote from yesterday’s ruling striking down #Oregon’s marriage ban! #OR4M #lgbt… |
| ✗ GOOD OFFER!! | ✓ @briansbrown YOU ANTI-GAY BIGOTS ARE BOX-OFFICE-POISON EVEN FOR MOST REPUBLICANS. #LGBT… |
Figure 2: Longitudinal analysis of classifier generalization.
(A–D) The performance of the topic classifier (mean over all 10 topics with 95% confidence intervals) from 50 to 350 days after training, evaluated according to (A) mean AP (MAP), (B) P@10, (C) P@100, and (D) P@1000. Best fit linear regressions are shown as dashed lines. (E) Results averaged over time with 95% confidence intervals.
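The ranking metrics named in this caption can be sketched as below. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, relevance is given as a 0/1 list in ranked order, and AP is normalized by the number of relevant items in the ranked list, which is one common convention and an assumption here.

```python
def precision_at_k(relevance, k):
    """P@k: fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """AP: mean of P@i over the ranks i that hold a relevant item."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_per_topic):
    """MAP: AP averaged over topics."""
    aps = [average_precision(r) for r in relevance_per_topic]
    return sum(aps) / len(aps)


# Toy ranked list (made up): 1 = topical tweet, 0 = non-topical tweet.
ranked = [1, 0, 1, 1, 0]
p_at_3 = precision_at_k(ranked, 3)
ap = average_precision(ranked)
```

P@k only looks at the head of the ranking, while AP rewards placing every relevant tweet early, which is why the two can move differently over time.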
Figure 3: Matrix of mean Mutual Information values for different feature types vs. topics.
The last column and last row represent the average of mean values across all topics and all features respectively. All values should be multiplied by 10^-8.
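For readers unfamiliar with the quantity in this matrix, the sketch below computes Mutual Information between a binary feature (present or absent in a tweet) and a binary topic label from a 2x2 table of counts. The function name and the choice of base-2 logarithm are assumptions; the source does not state the log base or the exact estimator used.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between a binary feature and a binary topic label.

    n11: tweets with the feature that are topical
    n10: tweets with the feature that are non-topical
    n01: tweets without the feature that are topical
    n00: tweets without the feature that are non-topical
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each tuple is (joint count, feature marginal, label marginal).
    for joint, feature_total, label_total in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if joint > 0:  # a zero cell contributes nothing (0 * log 0 := 0)
            mi += (joint / n) * math.log2(joint * n / (feature_total * label_total))
    return mi
```

A feature that is independent of the topic scores 0, and one that perfectly predicts a balanced label scores 1 bit; rare features in a very large corpus yield tiny values, consistent with a 10^-8 scale.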
Figure 4: Scatter plot showing ranking of topics w.r.t. Mutual Information vs. Average Precision.
There is clearly a negative correlation, with a Kendall τ coefficient of −0.68.
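A Kendall τ coefficient compares two rankings by counting concordant and discordant pairs. The sketch below implements the simple tau-a variant without tie correction; which variant the authors used is not stated, so this is an assumption, and the example rankings are made up rather than taken from the figure.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up rankings of five items under two criteria.
rank_a = [1, 2, 3, 4, 5]
rank_b = [5, 4, 3, 2, 1]   # exact reversal of rank_a
rank_c = [2, 1, 4, 3, 5]   # two adjacent swaps relative to rank_a

tau_reversed = kendall_tau_a(rank_a, rank_b)
tau_swapped = kendall_tau_a(rank_a, rank_c)
```

A value near −1 means the two rankings are close to reversed, so a coefficient of −0.68 indicates that topics ranked high on one measure tend to rank low on the other.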
Figure 5: Box plots of Mutual Information values (y-axis) per feature type across topics (x-axis labels).
Figure 6: Top p% features ranked by Mutual Information.
The top five features for each feature type and topic based on Mutual Information.
| Topics/Top10 | Natural disaster | Epidemics | Iran deal | Social issues | LGBT | Human disaster | Celebrity death | Space | Tennis | Soccer |
|---|---|---|---|---|---|---|---|---|---|---|
| From | from_japan | changedecopine | mazandara | debtadvisoruk | stevendickinson | witfp | boiknox | daily_astrodata | tracktennisnews | makeupbella |
| | everyearthquake | stylishoz | freeiran9292 | nsingerdebtpaid | mgdauber | ydumozyf | jacanews | freesolarleads | novakdjokovic_i | sport_agent |
| | quakestoday | drdaveanddee | hhadi119 | negativeequityf | lileensvf1 | syriatweeten | ewnreporter | sciencewatchout | i_roger_federer | yasmingoode |
| | equakea | soliant_schools | balouchn2 | iris_messenger | kevinwhipp | rk70534 | rowwsupporter | houston_jobs | andymurrayfans1 | sportsroadhouse |
| | davewinfields | msgubot | jeffandsimon | dolphin_ls | petermabraham | gosyrianews | flykiidchris | lenautilus | rafaelnadal_fan | losangelessrh |
| Hashtag | #earthquake | #health | #iran | #ferguson | #tcot | #syria | #rip | #science | #wimbledon | #worldcup |
| | #haiyan | #uniteblue | #irantalks | #mikebrown | #pjnet | #gaza | #ripcorymonteith | #sun | #tennis | #lfc |
| | #storm | #ebola | #iranian | #ericgarner | #p2 | #israel | #riprobinwilliams | #houston | #usopen | #football |
| | #PrayForThePhilippines | #healthcare | #rouhani | #blacklivesmatter | #uniteblue | #gazaunderattack | #rippaulwalker | #starwars | #nadal | #worldcup2014 |
| | #tornado | #fitness | #irantalksvienna | #icantbreathe | #teaparty | #isis | #robinwilliams | #scifi | #wimbledon2014 | #sports |
| Location | With everyone | USA | France | St Louis MO | USA | Syria | South Africa | Houston TX | Worldwide | Liverpool |
| | Earth | Francophone | Tehran Iran | Washington DC | Bordentown New Jersey | Palestine | Pandaquotescom | Germany | London | Manchester |
| | Philippines | United States | Inside of Iran | St Louis | Global Markets | Syrian Arab Republic | Johannesburg South Africa | Houston | The Midlands | London |
| | Don’t follow me am i a bot | Gainesville FL USA | Iran | Virginia US | The blue regime of Maryland | Israel | Johannesburg | Rimouski | London UK | Anfield |
| | Global planet earth | Boulder Colorado | Washington DC | Saint Louis MO | Lancaster county PA | Washington DC | Cape Town | In a galaxy far far ebay | Wimbledon | Bangil East Java Indonesia |
| Mention | @oxfamgb | @foxtramedia | @ap | @natedrug | @jjauthor | @ifalasteen | @nelsonmandela | @nasa | @wimbledon | @lfc |
| | @gabriele_corno | @obi_obadike | @afp | @deray | @2anow | @drbasselabuward | @realpaulwalker | @philae2014 | @usopen | @fifaworldcup |
| | @weatherchannel | @who | @iran_policy | @antoniofrench | @gop | @revolutionsyria | @ddlovato | @maximaxoo | @atpworldtour | @ussoccer |
| | @twcbreaking | @kayla_itsines | @4freedominiran | @bipartisanism | @pjnet_blog | @unicef | @robinwilliams | @esa_rosetta | @andy_murray | @mcfc |
| | @redcross | @canproveit | @orgiac | @theanonmessage | @espuelasvox | @free_media_hub | @historicalpics | @astro_reid | @wta | @realmadriden |
| Term | typhoon | health | nuclear | police | obama | israeli | robin | space | tennis | liverpool |
| | philippines | ebola | regime | protesters | gun | israel | williams | solar | murray | cup |
| | magnitude | outbreak | iran | officer | america | gaza | walker | moon | djokovic | supporting |
| | storm | virus | iranian | cops | obamacare | palestinian | cory | houston | federer | match |
| | usgs | acrx | mullahs | protest | gop | killed | paul | star | nadal | goal |
Figure 7: Boxplots for the distribution of Mutual Information values (y-axis) of different features as a function of their attribute values (binned on x-axis).
Plots (A–E) respectively show attributes {favorite count, follower count, friend count, hashtag count, tweet count} for From feature. Plots (F–J) respectively show attributes tweetCount and userCount for Hashtag, userCount for Location feature, tweetCount for Mention and Term features.
Figure 8: Density plots for the frequency values of feature attributes (x-axis) vs. Mutual Information (y-axis).
Plots (A–E) respectively show the following attributes: number of tweets for the User feature, number of tweets for the Hashtag feature, number of users using the Hashtag feature, number of tweets for the Mention feature, and number of tweets for the Term feature.