| Literature DB >> 36188727 |
Lena Seewann1, Roland Verwiebe1, Claudia Buder1, Nina-Sophie Fritsch1.
Abstract
Social media platforms provide a large array of behavioral data relevant to social scientific research. However, key information such as sociodemographic characteristics of agents are often missing. This paper aims to compare four methods of classifying social attributes from text. Specifically, we are interested in estimating the gender of German social media creators. By using the example of a random sample of 200 YouTube channels, we compare several classification methods, namely (1) a survey among university staff, (2) a name dictionary method with the World Gender Name Dictionary as a reference list, (3) an algorithmic approach using the website gender-api.com, and (4) a Multinomial Naïve Bayes (MNB) machine learning technique. These different methods identify gender attributes based on YouTube channel names and descriptions in German but are adaptable to other languages. Our contribution will evaluate the share of identifiable channels, accuracy and meaningfulness of classification, as well as limits and benefits of each approach. We aim to address methodological challenges connected to classifying gender attributes for YouTube channels as well as related to reinforcing stereotypes and ethical implications.Entities:
Keywords: YouTube; authorship attribution; gender; machine learning; text based classification methods
Year: 2022 PMID: 36188727 PMCID: PMC9515904 DOI: 10.3389/fdata.2022.908636
Source DB: PubMed Journal: Front Big Data ISSN: 2624-909X
Information regarded in single classification methods.
|
|
|
|
|
| |
|---|---|---|---|---|---|
| Channel name |
|
|
|
| . |
| Channel description |
|
|
| . |
|
| Channel profile picture |
| . | . | . | . |
| Video content |
| . | . | . | . |
| Information from other platform (e.g., Twitter) |
| . | . | . | . |
Source: own illustration.
Accuracy, precision and recall of classification methods.
|
|
|
|
|
|
| ||
|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
| ||
| Male | Accuracy | 0.688 | 0.598 | 0.593 | 0.675 | 0.718 |
|
| Precision | 0.972 | 0.838 | 0.747 | 0.869 |
| 0.801 | |
| Recall | 0.535 | 0.477 | 0.569 | 0.667 | 0.746 |
| |
| Female | Accuracy | 0.925 | 0.754 | 0.879 | 0.800 | 0.869 |
|
| Precision | 0.824 | 0.274 | 0.533 | 0.222 | 0.228 |
| |
| Recall | 0.538 | 0.538 |
| 0.667 | 0.500 | 0.500 | |
| Multi-Agent | Accuracy | 0.905 | 0.764 | - | 0.900 |
| 0.834 |
| Precision | 0.702 | 0.390 | - | 0.571 | 0.520 |
| |
| Recall | 0.868 | 0.421 | - | 0.800 |
| 0.368 | |
| NA | Accuracy | 0.678 | 0.819 | 0.573 | 0.975 |
| 0.709 |
| Precision | 0.047 | 0.030 | 0.200 | 1.000 | 0.250 |
| |
| Recall | 0.500 | 0.200 | 0.326 | 0.500 | 0.250 |
| |
| Total sample | Accuracy | 0.598 | 0.467 | 0.522 | 0.675 | 0.698 |
|
| Macro-Precision | 0.636 | 0.383 | 0.494 | 0.658 |
| 0.525 | |
| Macro-Recall | 0.610 | 0.409 | 0.503 | 0.666 |
| 0.478 | |
| Brier score | - | - | 0.158 | 0.040 | 0.061 | - |
Source: own calculations; N = 200. Accuracy is the ratio of correctly predicted cases within all observations. Precision is the ratio of all correctly predicted cases within all prediction in one class. Recall is the ratio of correctly predicted cases within all cases that actually belong to said class. The Brier-score shows the accuracy of the probabilistic prediction. Bold values represent the highest scores in each row.
Exemplary results of different classification methods.
|
|
|
|
|
| ||
|---|---|---|---|---|---|---|
| Jana's Welt | Female | Female | NA | Male | Male | Male |
|
| ||||||
| Hi I am Felipe. I only do YouTube and Twitch as a hobby. On this channel is actually only gaming content such as Fortnite. | Male | NA | Female | NA | Male | Male |
| Male | Male | Female | NA | Multi-agent | Multi-agent | |
| NA (lgbtqia+) | Female | Female | Female | NA | Female | |
| Multi-agent | Multi-agent | Multi-agent | Male | Multi-agent | Multi-agent | |
| Female | NA | Female | Female | Female | Female |
Source: own calculations, texts were translated from German to English by the authors.
Overview of the results.
|
|
|
|
|
| |
|---|---|---|---|---|---|
| Performance | High, especially with multiple sources combined | Low, especially when text includes a lot of non-name noise | Moderate, depending on the noise within the text | High, especially for large samples and majority groups | High, even for minority groups |
| Limits and benefits | Time consuming, though with little requirements | Very low when already present dictionary (e.g., WGND) are used; text preprocessing might be necessary | Very low when small data volumes are processed, large volumes require a fee | Presence of a training sample is required. Otherwise, low number of parameters | Low, but dependent on existing models that are included and their requirements |
| Meaningfulness | High, though dependent on the openness of answers available to respondents | Dependent on the noise within the text, and number of words misidentified as names; identifies names can be accessed | Dependent on the noise within the text, and number of words misidentified as names; high accessibility of feature probabilities | High accessibility of feature probabilities, chance of stereotypical classification | Dependent on previous models included in the vote and their meaningfulness |
| Ethical challenges | Reinforcement of stereotypes based on individual experiences of respondents | Reinforcement of stereotypes based on country-specific name lists | Reinforcement of stereotypes based on structure of unknown online reference data | Reinforcement of stereotypes based on bias in the training data | Reinforcement on stereotypes and misclassification of included models |
Source: own illustration.