Prashanth Rao, Maite Taboada.
Abstract
We present a topic modelling and data visualization methodology to examine gender-based disparities in news articles by topic. Existing research in topic modelling is largely focused on the text mining of closed corpora, i.e., those that include a fixed collection of composite texts. We showcase a methodology to discover topics via Latent Dirichlet Allocation, which can reliably produce human-interpretable topics over an open news corpus that continually grows with time. Our system generates topics, or distributions of keywords, for news articles on a monthly basis, to consistently detect key events and trends aligned with events in the real world. Findings from 2 years' worth of news articles in mainstream English-language Canadian media indicate that certain topics feature either women or men more prominently and exhibit different types of language. Perhaps unsurprisingly, topics such as lifestyle, entertainment, and healthcare tend to be prominent in articles that quote more women than men. Topics such as sports, politics, and business are characteristic of articles that quote more men than women. The data shows a self-reinforcing gendered division of duties and representation in society. Quoting female sources more frequently in a caregiving role and quoting male sources more frequently in political and business roles enshrines women's status as caregivers and men's status as leaders and breadwinners. Our results can help journalists and policy makers better understand the unequal gender representation of those quoted in the news and facilitate news organizations' efforts to achieve gender parity in their sources. The proposed methodology is robust, reproducible, and scalable to very large corpora, and can be used for similar studies involving unsupervised topic modelling and language analyses.
Keywords: corpus linguistics; gender bias; machine learning; natural language processing; news media; topic modelling
Year: 2021 PMID: 34222857 PMCID: PMC8242240 DOI: 10.3389/frai.2021.664737
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Data for the study, October 1, 2018–September 30, 2020. Note: “people mentioned” is the number of all persons named in all the articles per outlet. “People quoted” is the number of mentioned persons who were quoted one or more times in each article. Each “quote” is only counted once per person per article.
| Outlet | Number of articles | People mentioned | Percentage women mentioned (%) | People quoted | Percentage women quoted (%) |
|---|---|---|---|---|---|
| CBC News | 151,288 | 749,351 | 30.2 | 286,770 | 32.6 |
| CTV News | 158,249 | 565,990 | 28.3 | 240,215 | 29.7 |
| Global News | 86,386 | 353,733 | 25.3 | 133,021 | 30.1 |
| HuffPost Canada | 15,765 | 86,429 | 29.2 | 25,975 | 30.8 |
| National Post | 27,925 | 166,080 | 21.4 | 45,663 | 23.9 |
| The Globe and Mail | 87,121 | 496,142 | 22.1 | 153,881 | 23.2 |
| The Toronto Star | 85,609 | 509,851 | 23.1 | 163,255 | 25.1 |
| Overall | 612,343 | 2,927,576 | 25.7 | 1,048,780 | 27.9 |
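The counting rule in the table note (each person counted once per article, however often they are quoted) can be illustrated with a minimal sketch. The record layout `(outlet, article_id, person, n_quotes)` and the function name `count_people` are hypothetical, and counting mentions once per person per article is an assumption; the authors' actual extraction pipeline is not shown in this excerpt.

```python
from collections import defaultdict


def count_people(records):
    """Tally people mentioned and people quoted per outlet.

    records: iterable of (outlet, article_id, person, n_quotes) tuples
    (a hypothetical layout chosen for illustration).
    Returns {outlet: (people_mentioned, people_quoted)}.
    """
    mentioned = defaultdict(set)
    quoted = defaultdict(set)
    for outlet, article_id, person, n_quotes in records:
        # A person contributes at most one entry per article.
        mentioned[outlet].add((article_id, person))
        if n_quotes >= 1:
            # Several quotes by the same person in one article count once.
            quoted[outlet].add((article_id, person))
    return {o: (len(mentioned[o]), len(quoted[o])) for o in mentioned}


if __name__ == "__main__":
    sample = [
        ("A", 1, "X", 3),  # quoted three times, counted once
        ("A", 1, "Y", 0),  # mentioned only
        ("A", 2, "X", 1),  # same person, different article
    ]
    print(count_people(sample))
```

Using sets keyed on `(article_id, person)` makes the once-per-article rule explicit without any extra bookkeeping.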
FIGURE 1. Monthly counts of articles that contain majority male/female sources.
FIGURE 2. Topic model preprocessing and transformation steps.
Curated stopwords by category.
| Category | Example words |
|---|---|
| Social media related | Post, sign, like, love, tag, star, call, group, video, photo, pic, inbox |
| URL and embeds | http, https, href, ref, com, cbc, ctv, src, twsrc, 5etfw |
| Frequent common nouns | People, man, woman, family, friend, news, report, press, page, story |
| Light verbs | Call, comment, continue, do, feel, give, get, take, like, make, tell, think |
| Time of the day/week | Morning, afternoon, evening, today, yesterday, tomorrow |
| Time periods | Day, week, month, year |
| Time zones | Edt, est, pdt, pst |
| Days of the week | Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday |
| Months of the year | January, February, March, ..., October, November, December |
| Year | 2018, 2019, 2020 |
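As a rough illustration of how a curated, category-based stopword list like the one above might be applied, the sketch below merges a few of the listed categories into one set and filters a token list. The dictionary keys, the function name `remove_stopwords`, and the assumption that tokens are already lowercased or lemmatized are all illustrative choices, not details taken from the study.

```python
# A small subset of the example words from the table, grouped by category.
# The category keys are hypothetical identifiers chosen for this sketch.
CURATED_STOPWORDS = {
    "social_media": {"post", "sign", "like", "love", "tag", "star"},
    "url_and_embeds": {"http", "https", "href", "ref", "com"},
    "time_periods": {"day", "week", "month", "year"},
}


def remove_stopwords(tokens, stopword_groups=CURATED_STOPWORDS):
    """Return the tokens that do not appear in any stopword category."""
    stop = set().union(*stopword_groups.values())
    # Compare case-insensitively so "Post" and "post" are treated alike.
    return [t for t in tokens if t.lower() not in stop]


if __name__ == "__main__":
    print(remove_stopwords(["Post", "election", "week", "budget", "https"]))
```

Keeping the categories separate until filter time makes it easy to add or drop a whole category when the list is revised.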
Best LDA hyperparameters obtained for a monthly analysis of news data.
| Hyperparameter | Description | Value |
|---|---|---|
| K | Number of topics | 15 |
| maxIter | Maximum iterations | 150 |
| vocabSize | Vocabulary size | 5,000 |
| minTF | Minimum term frequency | 1 |
| minDF | Minimum document frequency | 2% |
| maxDF | Maximum document frequency | 80% |
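To make the vocabulary-related hyperparameters concrete, here is a minimal plain-Python sketch of how document-frequency bounds, a vocabulary cap, and a per-document minimum term frequency could interact. The parameter names resemble those of Spark MLlib's `CountVectorizer`, but this is an illustrative re-implementation under that assumption, not the authors' code; `build_vocabulary` and `vectorize` are hypothetical helpers, and the LDA fit itself (governed by K and maxIter) is not shown.

```python
from collections import Counter


def build_vocabulary(docs, vocab_size=5000, min_df=0.02, max_df=0.80):
    """Select vocabulary terms from tokenized documents.

    A term is kept if the fraction of documents containing it lies in
    [min_df, max_df]; the most frequent kept terms, up to vocab_size,
    form the vocabulary.
    """
    n_docs = len(docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [t for t in doc_freq if min_df <= doc_freq[t] / n_docs <= max_df]
    # Order by corpus frequency (descending), then alphabetically for ties.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:vocab_size]


def vectorize(tokens, vocabulary, min_tf=1):
    """Count vocabulary terms in one document, dropping counts below min_tf."""
    counts = Counter(tokens)
    return {t: counts[t] for t in vocabulary if counts[t] >= min_tf}


if __name__ == "__main__":
    docs = [["tax", "budget", "tax"], ["tax", "hockey"],
            ["tax", "budget"], ["tax", "vote"]]
    vocab = build_vocabulary(docs, vocab_size=10, min_df=0.5, max_df=0.8)
    print(vocab, vectorize(["budget", "budget", "tax"], vocab))
```

With the table's values, a term would need to appear in at least 2% and at most 80% of a month's articles to be eligible, which removes both very rare terms and near-universal ones before the 5,000-term cap is applied.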
FIGURE 3. Monthly topic gender prominence for nine recurring topics (average over all outlets).
FIGURE 4. Corpus analysis results for top 400 articles from “Sports” (June 2019).
FIGURE 5. Corpus analysis results for top 200 articles from “Sports” (July 2019).
FIGURE 6. Corpus analysis results for top 200 articles from “Sports” (August 2019).
FIGURE 7. Corpus analysis results for top 200 articles from “Business” (October 2018).
FIGURE 8. Corpus analysis results for top 200 articles from “Business” (December 2018).
FIGURE 9. Corpus analysis results for top 200 articles from “Lifestyle” (December 2019).
FIGURE 10. Corpus analysis results for top 200 articles from “Healthcare and medical research” (March 2019).
FIGURE 11. Corpus results for top 400 articles from four “Healthcare and COVID-19” topics (March 2020).
FIGURE 12. Corpus analysis results for top 200 articles from “Crimes and sexual assault” (February 2020).