Kunal Khadilkar, Ashiqur R KhudaBukhsh, Tom M Mitchell.
Abstract
We use a suite of cutting-edge natural language processing methods to quantify and characterize societal and gender biases in popular movie content. Our data set consists of English subtitles of popular movies from Bollywood, the Mumbai film industry, spanning 7 decades (700 movies). In addition, we include movies from Hollywood and movies nominated for the Academy Awards for contrastive purposes. Our findings indicate that while the overall portrayal of women has improved over time in popular movie dialogues from both Bollywood and Hollywood, modern films still exhibit considerable gender bias and are yet to achieve equal representation among genders. We also observe a strong bias favoring fair skin color in Bollywood content that occurred consistently across all time periods we considered. While our geographic representation analysis indicates improved inclusion over time for several Indian states, it also reveals a long-standing under-representation of many northeastern Indian states.
Keywords: Bollywood; Hollywood; gender bias; social bias
Year: 2021 PMID: 35199062 PMCID: PMC8848024 DOI: 10.1016/j.patter.2021.100409
Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899
Illustrative examples of misogynistic dialogues present in blockbuster Bollywood movies (movie names are presented in parentheses; movie revenues are presented in brackets)
| Dialogue (Romanized Hindi) | Approximate English translation |
|---|---|
| Akeli ladki khuli tijori ki tarah hoti hai (Jab We Met) [generated movie revenue ≈$14,899,137] | A girl who is alone is like an open treasure. (Jab We Met) |
| Marriage se pehle ladkiyan sex object hoti hain, aur marriage ke baad they object to sex! (Kambakkht Ishq) [generated movie revenue ≈$17,531,586] | Before marriage, girls are sex objects, and after marriage, girls object to sex. (Kambakkht Ishq) |
| Tu ladki ke peeche bhagega, ladki paise ke peeche bhagegi. Tu paise ke piche bhagega, ladki tere peeche bhagegi (Wanted) [generated movie revenue ≈$27,630,059] | You are chasing the girls, while the girls are chasing money. If you start chasing money, girls will automatically chase you. (Wanted) |
The dialogues (left) are in Romanized Hindi, and their approximate English translations are presented in the right column.
Figure 1. Evolving pronoun-usage trends in our Bollywood and Hollywood corpora, contrasted with the Google Books data set
A value greater than 50 indicates relatively fewer occurrences of female pronouns in the corpus. We present the confidence intervals in Table S1 in the supplemental information to avoid visual clutter.
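The caption implies a percentage-style measure in which values above 50 mean that male pronouns outnumber female ones. A minimal sketch of one such measure, assuming it is the share of male pronouns among all gendered pronouns; the pronoun lists and the function name `male_pronoun_percentage` are illustrative and are not taken from the paper:

```python
import re

# Illustrative pronoun lists (assumption); the paper's exact lists are not given here.
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


def male_pronoun_percentage(text: str) -> float:
    """Return male pronouns as a percentage of all gendered pronouns in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    male = sum(token in MALE_PRONOUNS for token in tokens)
    female = sum(token in FEMALE_PRONOUNS for token in tokens)
    total = male + female
    if total == 0:
        raise ValueError("text contains no gendered pronouns")
    return 100.0 * male / total


sample = "He said his sister was late. She lost her keys, and he found them."
# Three male pronouns (he, his, he) and two female pronouns (she, her).
print(male_pronoun_percentage(sample))
```

Under this assumed definition, a perfectly balanced text scores 50, and the score rises as female pronouns become relatively rarer.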
Figure 2. WEAT scores for Bollywood and Hollywood across different time periods
The three time periods are 1950–69, 1970–99, and 2000–20. For a given movie industry and time period, the WEAT score is averaged over five runs with 95% confidence intervals shown. A larger positive value indicates greater bias toward men. Further experimental details are described in the WEAT section.
Figure 3. WEAT scores for Bollywood, Hollywood, and world movies
The world movies corpus consists of English subtitles of 150 movies nominated in the foreign film category at the Academy Awards. The WEAT score is averaged over five runs with 95% confidence intervals shown. A larger positive value indicates greater bias toward men. Further experimental details are described in the WEAT section.
Figure 4. WEAT scores for romance and action films
The WEAT score is averaged over five runs with 95% confidence intervals shown. A larger positive value indicates greater bias toward men. Further experimental details are described in the WEAT section.
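The WEAT captions describe a score averaged over five runs with 95% confidence intervals. A minimal sketch of how such a number could be produced, assuming the standard WEAT effect size over cosine similarities and a Student t interval across runs; the toy 2-D vectors, the per-run scores, and all function names are hypothetical, and the paper's exact word sets and interval method are not given here:

```python
import math
from statistics import mean, stdev

# Two-sided 95% t critical value for 5 runs (4 degrees of freedom).
T_CRIT_95_DF4 = 2.776


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """How much closer w is to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """WEAT effect size: difference of mean associations over pooled std dev."""
    sx = [association(x, attrs_a, attrs_b) for x in targets_x]
    sy = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)


def mean_with_ci95(run_scores):
    """Mean and 95% half-width for exactly five independent runs."""
    if len(run_scores) != 5:
        raise ValueError("this sketch hard-codes the t value for five runs")
    half_width = T_CRIT_95_DF4 * stdev(run_scores) / math.sqrt(len(run_scores))
    return mean(run_scores), half_width


# Hypothetical embeddings: A/B are attribute words, X/Y are target words.
A, B = [(1.0, 0.0)], [(0.0, 1.0)]
X = [(0.9, 0.1), (0.8, 0.3)]
Y = [(0.1, 0.9), (0.2, 0.7)]
print(weat_effect_size(X, Y, A, B))

# Hypothetical per-run WEAT scores from five trainings of the same embedding.
print(mean_with_ci95([1.20, 1.30, 1.10, 1.25, 1.15]))
```

With A as the male attribute set, a positive effect size means the X targets sit closer to male terms than the Y targets do, which matches the sign convention in the captions.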
Figure 5. Nearest neighbors of man and woman over the years
The overall average valence of nearest neighbors according to the lexicon provided in Ramaswamy for a given time period is presented in blue font. The three time periods are 1950–69, 1970–99, and 2000–20.
Cloze test results
| Probe | BERT (pre-trained) | BERT fine-tuned on Bollywood 1950–69 | BERT fine-tuned on Bollywood 2000–20 | BERT fine-tuned on Hollywood 1950–69 | BERT fine-tuned on Hollywood 2000–20 |
|---|---|---|---|---|---|
| | man (0.091), widow (0.083), woman (0.083), doctor (0.077), slave (0.074), soldier (0.074), bachelor (0.061), merchant (0.058), farmer (0.054), lawyer (0.053) | prostitute (0.081), servant (0.081), woman (0.081), slave (0.074), bachelor (0.074), doctor (0.071), lawyer (0.069), man (0.066), widow (0.066), maid (0.032) | doctor (0.093), woman (0.092), servant (0.088), lawyer (0.085), maid (0.082), Hindu (0.079), nurse (0.058), teacher (0.056), gardener (0.043), lady (0.037) | woman (0.071), slave (0.068), servant (0.067), nurse (0.064), lady (0.062), man (0.049), teacher (0.043), lawyer (0.037), peasant (0.028), maid (0.021) | woman (0.091), lawyer (0.085), doctor (0.082), nurse (0.078), teacher (0.077), man (0.073), writer (0.071), secretary (0.069), prostitute (0.065), professional (0.063) |
| | man (0.088), soldier (0.084), gentleman (0.079), farmer (0.076), merchant (0.073), woman (0.069), slave (0.069), bachelor (0.068), doctor (0.067), carpenter (0.053) | man (0.087), gentleman (0.085), lawyer (0.079), lawyer (0.077), servant (0.072), doctor (0.058), farmer (0.041), worker (0.029), craftsman (0.015), slave (0.009) | doctor (0.087), lawyer (0.083), policeman (0.074), man (0.069), farmer (0.049), bachelor (0.043), gardener (0.028), servant (0.023), soldier (0.021), mechanic (0.016) | carpenter (0.071), policeman (0.071), lawyer (0.067), soldier (0.066), farmer (0.062), gentleman (0.058), servant (0.053), man (0.049), peasant (0.043), slave (0.039) | man (0.097), lawyer (0.093), soldier (0.087), doctor (0.083), carpenter (0.074), gentleman (0.063), clergyman (0.061), farmer (0.039), writer (0.021), craftsman (0.017) |
Predicted tokens are ranked by decreasing probability with probabilities mentioned in parentheses.
BERT (pre-trained) denotes the pre-trained BERT; the remaining columns denote BERT fine-tuned on the named corpus. The Bollywood corpora consist of movies between 1950 and 1969 and between 2000 and 2020 in our Bollywood data set. Similarly, the Hollywood corpora consist of movies between 1950 and 1969 and between 2000 and 2020 in our Hollywood data set. The average valence score of the cloze test outputs is computed using a well-known lexicon presented in Warriner et al. Further experimental details are presented in the cloze test and free form completions section. Additional cloze test results are presented in the supplemental information (Table S4).
Percentage increase in average valence score for cloze test completions between old movies and new movies
| | Bollywood (%) | Hollywood (%) |
|---|---|---|
| Women | 22.84 | 7.55 |
| Men | 6.00 | 15.60 |
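The percentages above compare the average valence of cloze completions between old and new movies. A minimal sketch of that arithmetic, assuming valence is a plain mean of per-word lexicon scores and that words missing from the lexicon are skipped; the toy lexicon values, the word lists, and both function names are hypothetical and are not the paper's data:

```python
def average_valence(words, lexicon):
    """Mean lexicon valence of the words that have a score."""
    scores = [lexicon[word] for word in words if word in lexicon]
    if not scores:
        raise ValueError("no word has a valence score")
    return sum(scores) / len(scores)


def percent_increase(old_value, new_value):
    """Relative change from old_value to new_value, in percent."""
    return 100.0 * (new_value - old_value) / old_value


# Hypothetical valence scores, for illustration only.
TOY_LEXICON = {"servant": 4.0, "slave": 2.0, "doctor": 6.0, "teacher": 7.0}

old_completions = ["servant", "slave", "doctor"]
new_completions = ["doctor", "teacher", "lawyer"]  # "lawyer" has no score and is skipped

old_valence = average_valence(old_completions, TOY_LEXICON)
new_valence = average_valence(new_completions, TOY_LEXICON)
print(percent_increase(old_valence, new_valence))
```

A positive result means the completions for the newer corpus carry more pleasant words on average than those for the older corpus.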
Values calculated based on Bollywood movie dialogues, by time period
| 1950–69 | 1970–99 | 2000–20 |
|---|---|---|
| 73.9 | 76.4 | 54.5 |
Cloze test results for the probe “A beautiful woman should have [MASK] skin”
| BERT (pre-trained) | BERT fine-tuned on Bollywood 1950–69 | BERT fine-tuned on Bollywood 2000–20 | BERT fine-tuned on Hollywood 1950–69 | BERT fine-tuned on Hollywood 2000–20 |
|---|---|---|---|---|
| soft (0.092), beautiful (0.082), pale (0.079), tanned (0.059), smooth (0.043) | fair (0.089), no (0.081), pale (0.078), tanned (0.067), tan (0.065) | fair (0.082), tanned (0.081), golden (0.058), smooth (0.043), pale (0.039) | fair (0.081), pale (0.074), blue (0.069), golden (0.067), gold (0.056) | fair (0.086), pale (0.076), tanned (0.065), golden (0.041), dark (0.032) |
Predicted tokens are ranked by decreasing probability with probabilities mentioned in parentheses. BERT (pre-trained) denotes the pre-trained BERT; the remaining columns denote BERT fine-tuned on the named corpus. The Bollywood corpora consist of movies between 1950 and 1969 and between 2000 and 2020 in our Bollywood data set. Similarly, the Hollywood corpora consist of movies between 1950 and 1969 and between 2000 and 2020 in our Hollywood data set. Further experimental details are presented in the cloze test and free form completions section. Additional cloze test results are presented in the supplemental information (Table S5).
Figure 6. Nearest neighbors of beautiful over the years
The three time periods are 1950–69, 1970–99, and 2000–20.
Figure 7. Nearest neighbors of dowry over the years
The three time periods are 1950–69, 1970–99, and 2000–20.
Distribution of six major religions in India according to the decennial censuses conducted in 1951, 1961, 1971, 1981, 1991, 2001, and 2011
| Religion | 1951 (%) | 1961 (%) | 1971 (%) | 1981 (%) | 1991 (%) | 2001 (%) | 2011 (%) |
|---|---|---|---|---|---|---|---|
| Hinduism | 84.1 | 83.4 | 82.7 | 82.6 | 81.5 | 80.5 | 79.8 |
| Islam | 9.8 | 10.7 | 11.2 | 11.4 | 12.6 | 13.4 | 14.2 |
| Christianity | 2.3 | 2.4 | 2.6 | 2.4 | 2.3 | 2.3 | 2.3 |
| Sikhism | 1.9 | 1.8 | 1.9 | 2.0 | 1.9 | 1.9 | 1.7 |
| Buddhism | 0.7 | 0.7 | 0.7 | 0.7 | 0.8 | 0.8 | 0.7 |
| Jainism | 0.5 | 0.5 | 0.5 | 0.5 | 0.4 | 0.4 | 0.4 |
Figure 8. Nearest neighbors of religion over the years
The three time periods are 1950–69, 1970–99, and 2000–20.
Figure 9. Nearest neighbors of Hindu and Muslim over the years
The three time periods are 1950–69, 1970–99, and 2000–20.
Highly frequent surnames occurring in Bollywood movies (in decreasing order of frequency)
| Most-frequent surnames |
|---|
| Singh, Krishna, Khan, Rai, Ali, Kapoor, Sharma, Mohan, Prasad, Khanna, Shah, Lal, Thakur, Dev, Shekhar, Chaudhary, Gandhi, Verma, Gupta, Prakash, Rana, Nath, Patel, Pandey, Roy, Pandit, Saxena, Mathur, Roshan, Bachchan, Pal, Mehta, Narayan, Das, Rode, Dayal, Mehra, Bhagat, Shastri, Chandra, Patil, Banerjee, Tilak, Rao, Tripathi, Yadav, Kumari, Suman, Mukherjee, Bhatia, Acharya, Chatterjee, Rehman, Iyer |
Figure 10. Religious representation in Bollywood movies (left) contrasted with ground truth census data (right)
The three time periods are 1950–69, 1970–99, and 2000–20.
Surnames of doctors in Bollywood movies
| Surnames of doctors |
|---|
| Kapur, Chopra, Khurana, Tripathi, Kapoor, Ansari, Awasthi, Kothari, Mathur, Puri, Nayak, Bhalerao, Sawant, Tandon, Swamy, Banerjee, Verma, Rana, Ruby, Singh, Shrivastav, Khanna, Bhandari, Tiwari, Saxena, Shinde, Mehta, Goenka, Kumar, Goswami |
City mentions in movies from our Bollywood corpus, by time period
| 1950–1969 | 1970–1999 | 2000–2020 |
|---|---|---|
| Bombay/Mumbai (51) | Bombay/Mumbai (68) | Bombay/Mumbai (83) |
| Delhi (27) | Delhi (45) | Delhi (52) |
| Kolkata/Calcutta (23) | Kolkata/Calcutta (18) | Amritsar (9) |
| Lucknow (14) | Lucknow (12) | Bangalore/Bengaluru (9) |
| Madras/Chennai (10) | Simla/Shimla (12) | Kolkata/Calcutta (9) |
| Agra (6) | Madras/Chennai (10) | Pune (8) |
| Srinagar (6) | Pune (8) | Lucknow (7) |
| Simla/Shimla (6) | Bangalore/Bengaluru (7) | Hyderabad (6) |
| Mathura (5) | Nagpur (6) | Madras/Chennai (6) |
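The table merges old and new city names (for example, Bombay/Mumbai) into one entry. A minimal sketch of how such merged tallies could be produced from subtitle lines, assuming a simple token match against an alias map; the alias map, the sample lines, and the function name `count_city_mentions` are illustrative, and whether the paper counts individual mentions or movies is not stated here:

```python
import re
from collections import Counter

# Illustrative alias map (assumption): each spelling maps to one merged label.
CITY_ALIASES = {
    "bombay": "Bombay/Mumbai",
    "mumbai": "Bombay/Mumbai",
    "calcutta": "Kolkata/Calcutta",
    "kolkata": "Kolkata/Calcutta",
    "madras": "Madras/Chennai",
    "chennai": "Madras/Chennai",
    "simla": "Simla/Shimla",
    "shimla": "Simla/Shimla",
    "bangalore": "Bangalore/Bengaluru",
    "bengaluru": "Bangalore/Bengaluru",
    "delhi": "Delhi",
}


def count_city_mentions(subtitle_lines):
    """Count city mentions, folding alternate spellings into one label."""
    counts = Counter()
    for line in subtitle_lines:
        for token in re.findall(r"[a-z]+", line.lower()):
            city = CITY_ALIASES.get(token)
            if city is not None:
                counts[city] += 1
    return counts


sample_lines = [
    "I am leaving for Bombay tomorrow.",
    "Mumbai never sleeps.",
    "She came all the way from Delhi.",
]
print(count_city_mentions(sample_lines).most_common())
```

Running the same function separately on the subtitles of each time period would yield one ranked column per period.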
Figure 11. Geographic representation in Bollywood movies
(A) Geographical representation in films during the period 1950–1999.
(B) Geographical representation in films post 2000.
(C) States with the least or no representation (less than 0.2% of movies in the entire corpus) over the last 70 years. The base maps used for this plot are sourced from the Government of India. The authors are aware that these maps include disputed territories. These maps do not constitute judgments on existing disputes.
Average amount for text completion results on the input sentence “The ransom amount is” using fine-tuned GPT-2 models
| | 1950–69 | 1970–99 | 2000–20 |
|---|---|---|---|
| Predicted ransom amount | 594,805 ± 43,159 | 10,959,940 ± 123,217.34 | 29,688,280 ± 119,544.28 |
| Inflation-adjusted amount | – | 2,194,830 | 21,000,280 |
The inflation-adjusted values for 594,805 INR in 1960 are presented in the bottom row.
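The table invites a comparison between what inflation alone would predict and what the models generate. A short worked sketch of that arithmetic using only the numbers in the table; the function names are hypothetical, and treating the second and third value columns as later periods than the first is an assumption:

```python
BASE_AMOUNT = 594_805  # predicted amount in the first column (INR)

# Values copied from the second and third columns of the table.
predicted = {"second column": 10_959_940, "third column": 29_688_280}
inflation_adjusted = {"second column": 2_194_830, "third column": 21_000_280}


def implied_inflation_multiplier(adjusted_amount, base_amount=BASE_AMOUNT):
    """Factor by which inflation alone scales the base amount."""
    return adjusted_amount / base_amount


def growth_beyond_inflation(predicted_amount, adjusted_amount):
    """How many times larger the prediction is than the inflation-only value."""
    return predicted_amount / adjusted_amount


for column in predicted:
    multiplier = implied_inflation_multiplier(inflation_adjusted[column])
    excess = growth_beyond_inflation(predicted[column], inflation_adjusted[column])
    print(f"{column}: inflation x{multiplier:.2f}, prediction/adjusted = {excess:.2f}")
```

A ratio above 1 in the last column of the printout means the generated ransom amounts grew faster than inflation alone would explain.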
Cloze test results for “The biggest problem in India is [MASK]”
| BERT | | |
|---|---|---|
| corruption (0.034), poverty (0.020), malaria (0.019), pollution (0.012), hunger (0.012), terrorism (0.009), unemployment (0.008), drought (0.008), famine (0.007), war (0.003), tourism (0.001) | poverty (0.078), love (0.072), war (0.067), hunger (0.049), unemployment (0.043), India (0.042), famine (0.029), money (0.023), marriage (0.012), education (0.011), Kashmir (0.009) | poverty (0.074), Pakistan (0.072), Kashmir (0.053), terrorism (0.051), corruption (0.037), India (0.031), drugs (0.021), dowry (0.016), unemployment (0.014), hunger (0.009), rape (0.006) |
Cloze test results for “The biggest problem in America is [MASK]”
| BERT | | |
|---|---|---|
| poverty (0.076), corruption (0.072), unemployment (0.061), crime (0.045), terrorism (0.042), racism (0.027), pollution (0.021), hunger (0.016), war (0.012), cancer (0.009), inequality (0.003) | war (0.092), poverty (0.083), money (0.062), unemployment (0.053), slavery (0.051), immigration (0.045), alcoholism (0.041), education (0.032), imperialism (0.023), Russia (0.023), hunger (0.019) | poverty (0.088), slavery (0.082), immigration (0.078), unemployment (0.073), money (0.071), war (0.065), racism (0.053), hunger (0.024), communism (0.016), America (0.011), education (0.006) |
Data set splits for Bollywood
| Corpus | Industry | Time period |
|---|---|---|
| | Bollywood | 1950–69 |
| | Bollywood | 1970–99 |
| | Bollywood | 2000–20 |
Data set splits for Hollywood
| Corpus | Industry | Time period |
|---|---|---|
| | Hollywood | 1950–69 |
| | Hollywood | 1970–99 |
| | Hollywood | 2000–20 |