Text mining of Reddit posts: Using latent Dirichlet allocation to identify common parenting issues.

Elizabeth M Westrupp^1,2,3, Christopher J Greenwood^1,2,4, Matthew Fuller-Tyszkiewicz^1,2, Tomer S Berkowitz^1,2, Lauryn Hagg^1, George Youssef^1,2,4.

Abstract

Parenting interventions offer an evidence-based method for the prevention and early intervention of child mental health problems, but to-date their population-level effectiveness has been limited by poor reach and engagement, particularly for fathers, working mothers, and disadvantaged families. Tailoring intervention content to parents' context offers the potential to enhance parent engagement and learning by increasing relevance of content to parents' daily experiences. However, this approach requires a detailed understanding of the common parenting situations and issues that parents face day-to-day, which is currently lacking. We sought to identify the most common parenting situations discussed by parents on parenting-specific forums of the free online discussion forum, Reddit. We aimed to understand perspectives from both mothers and fathers, and thus retrieved publicly available data from r/Daddit and r/Mommit. We used latent Dirichlet allocation to identify the 10 most common topics discussed in the Reddit posts, and completed a manual text analysis to summarize the parenting situations (defined as involving a parent and their child aged 0-18 years, and describing a potential/actual issue). We retrieved 340 (r/Daddit) and 578 (r/Mommit) original posts. A model with 31 latent Dirichlet allocation topics was best fitting, and 24 topics included posts that met our inclusion criteria for manual review. We identified 45 unique but broadly defined parenting situations. The majority of parenting situations were focused on basic childcare situations relating to eating, sleeping, routines, sickness, and toilet training; or related to how to respond to child negative emotions or difficult behavior. Most situations were discussed in relation to infant or toddler aged children, and there was high consistency in the themes raised in r/Daddit and r/Mommit. 
Our results offer potential to tailor parenting interventions in a meaningful way, creating opportunities to develop content and resources that are directly relevant to parents' lived experiences.

Year: 2022 PMID: 35108299 PMCID: PMC8809584 DOI: 10.1371/journal.pone.0262529

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

A key limitation of community-based parenting interventions is that they are usually offered as one-size-fits-all packages, which stands in contrast to evidence that parents differ in the techniques that are most applicable to them [1-4], and therefore require tailored approaches [5]. Tailoring may take into account parents’ needs or their day-to-day parenting context. To-date, tailoring in parenting interventions has focused on parents’ perceived needs through adapting or adding intervention modules to be relevant to specific parent or child subpopulations [6-8], or determined by individual parent results from baseline surveys [9-11]. In contrast, tailoring related to parenting context has been minimal, despite the potential for this approach to increase parent engagement and intervention efficacy, in light of low rates of homework completion in parenting interventions [3], and recognized limitations in the human ability to generalize skills outside of specific learning contexts [12]. To enable a more systematic way of tailoring interventions to specific parenting situations, and thus facilitate parents to immediately implement intervention concepts in their daily lives, the field requires an accurate map of the wide range of day-to-day parenting situations that parents may need support with. The current study aims to meet this need, by investigating the most common issues that draw parents online to discuss their experiences of parenting with other parents. Parenting interventions offer a strong evidence-based method for the prevention or early intervention of childhood mental health problems [9, 13]. Mental disorders rank as one of the highest causes of the global burden of disease [14]. To assist prevention efforts, the World Health Organization Nurturing Care Framework published in 2018 called for greater investment in adult education to enhance ‘nurturing care’ and thus promote positive early childhood development [13]. 
Despite many decades of investment and research, the long-term and population-level effectiveness of parenting programs has been lower than expected [1]. Children are influenced by their parents through multiple pathways, including modeling of behavior [15, 16], the emotional climate of the family [15], the nature and quality of the home environment [17], and via their parents’ functioning and parenting practices [18-20]. Parenting interventions often target one or more of these domains, with the intention of modifying early determinants of children’s socio-emotional development. The field of parenting interventions is at an important juncture. Despite established efficacy for those who use the programs, parenting interventions have had very low reach (<10% population) [21, 22], particularly for fathers, working mothers, and disadvantaged families [3]. Most parenting interventions also struggle with low rates of parent engagement, adherence, homework completion, and participation [3, 23, 24]. A review of the qualitative evidence related to barriers for parent involvement in parenting programs found that parents identified two common reasons for dropping out: lack of fit with the therapist, and the nature of the program content or delivery method, for example, disliking the group activities, feeling that the content wasn’t relevant, or that their needs were not recognized [25]. Fathers also identified mother-oriented services as a key barrier [25]. The next wave of parenting interventions need to address these barriers by designing more widely acceptable and flexible interventions, and through motivating and engaging parents by ensuring that content is flexibly tailored to a range of parent contexts and needs. There are a number of different ways that community interventions can be tailored; for example, according to identified need or context. In regards to tailoring according to parent needs, there are two key approaches. 
The first approach has been to modify established interventions to meet the needs of specific parent sub-groups, such as parents of children with autism spectrum disorders [6], incarcerated parents [7], or parents struggling with sibling conflict [8]. This type of whole-of-intervention tailoring aims to increase the relevance of intervention content to the perceived needs of the targeted sub-group of parents, but does not systematically adapt content to needs or the context of individual parents. A second approach has been tested more recently, and involves adapting the allocation of intervention content based on parent-level needs. This is illustrated in an online parenting intervention program where parents are provided with individualized recommendations regarding the specific subset or order of modules they should complete based on their results from a baseline parenting survey [9-11]. This approach has promising results [9-11]. However, to-date, there has been much more limited focus on tailoring according to parents’ context. Although relatively neglected, there may be great potential to enhance parent engagement and learning through tailoring interventions to context, i.e., to specific parenting situations. In general, core intervention concepts within parenting interventions tend to be introduced and described at a higher-order conceptual level, or in relation to set hypothetical parenting examples. In face-to-face individual and group interventions, some degree of tailoring to context may occur in the sense that practitioners may discuss intervention concepts in relation to situations that parents raise during the session [26-28]. However, in most manualized parenting programs, there is only limited time to discuss parent-led examples, and this is even more limited in group-based or online programs. 
The fact that only a few examples are typically discussed, and that (in group settings) the examples may not even be relevant to individual parents, means that parents must later, and on their own, recall the higher-level concepts, and then reflect on how they might best be adapted to their own parenting situations. This mental gymnastics often must occur in-the-moment as parents attempt to change their usual patterns of responding to their child, while actually interacting with their child. Imagine the possibility that parents can instead select a recent parenting situation that they have experienced in the past 24 to 48 hours and that they found difficult to manage. Instead of core concepts being described generically, they could be discussed directly in relation to the parent-selected example. In this case, it is likely that the parents’ attention and interest would be heightened, given that the situation is immediately relevant to them. Description of content in relation to recent, known, examples is also likely to reduce parents’ cognitive load, by reducing their working memory requirements; parents no longer need to consider and remember the parameters of a hypothetical example while simultaneously learning about new parenting concepts and techniques, and then attempting to problem-solve how they could adapt it to their own relevant example. Instead, if intervention content is tailored to a recent parenting situation, it is likely they will have opportunity to implement the tailored intervention concepts when the situation occurs again, given that many parenting situations are often repeated, particularly those that draw parents online to discuss or seek advice from other parents. Tailoring interventions to specific parenting situations requires a detailed understanding of the common parenting situations and issues that parents face day-to-day. To our knowledge, this is currently lacking. 
The current study therefore aims to identify common parenting situations experienced by parents on a daily basis, which parents themselves identify as being difficult to manage. Our study seeks to better understand the types of parenting situations that motivate parents to reach out to other parents. Research shows that the internet has become a primary source of support for modern parents; parents go online, at any time of day or night, and seek parenting information and support from other parents on social media and online parenting forums [29]. Publicly available data from online parenting forums offer an incredibly rich source of parent-centered data, expressed in parents’ own words, and therefore uniquely untouched by researcher bias. Parent response options in quantitative surveys, and even in qualitative data collection procedures, are inherently influenced by researcher-design factors and choices. In comparison, online parenting forums reflect a naturally occurring situation where parents are free to interact and discuss parenting issues without any constraints [30]. Therefore, data from online discussion forums offer considerable opportunity to create an accurate map of parenting situations to enable intervention tailoring. The current study will use internet scraping to retrieve rich, publicly available data from the online discussion forum, Reddit. Reddit is one of the most popular social media platforms in the world, and the r/Daddit and r/Mommit forums have 163,000 and 106,000 members (at the time of writing), respectively. Importantly, the use of Reddit enables us to address the limitations of previously mother-centric research, by offering scope to investigate fathers’ experiences in a rare father-specific forum where fathers are comfortable talking to other fathers about their parenting experiences [31].
We will utilise an increasingly popular machine learning text mining method, latent Dirichlet allocation (LDA) [32-35], to conduct a text analysis identifying the most common topics discussed in the Reddit posts. The LDA algorithm performs probability-based text mining to extract a set of topics from a corpus of text, based on patterns of words associated with each topic [32, 33]. LDA is a data-driven method, which minimizes researcher bias in the generation of topics, selection of key words, and ranking of relevant posts per topic [33]. This approach enables text analysis of very large datasets where more traditional qualitative text analysis approaches may not be feasible. LDA has been used to identify topics on Reddit discussion forums related to public health interest in same-sex marriage [36]; depression, mental health, treatment, and relationships [37]; Ebola, influenza, electronic cigarettes, and marijuana [38]; electronic cigarettes and Hookah use [39], and eating disorders [40]. LDA has also been used to investigate parenting, for example, one study compared LDA topics for mothers and fathers, and for fathers of preterm versus term children, derived from interview data assessing parents’ reflective functioning [41]. The current study will use LDA and qualitative synthesis (i.e., manual text analysis) to address the following research question: What are the specific parenting situations being discussed within topics that emerge from text analysis of the Reddit r/Daddit and r/Mommit forums? Extraction of content from both of these parent-related forums will enable greater coverage of parenting issues for both fathers and mothers, in light of research showing gender differences in parenting topics discussed on Reddit [30]. It will also enable evaluation of potential differences in the concerns raised by each.

Methods

Ethics approval

The current study was approved by the Deakin University Human Ethics Advisory Group—Health (Project number: HEAG-H 140_2019). In line with HEAG-H advice, it was not possible to directly quote individual posts from publicly available datasets, thus data are presented in aggregated form only. Data were collected and used in accordance with Reddit’s Terms and Conditions.

Data analysis

Fig 1 outlines our processing and data analysis pipeline. Broadly, this comprised (a) data extraction, (b) data cleaning, (c) LDA topic modeling, and (d) qualitative synthesis. Of relevance, our approach to data cleaning and topic modeling is based on previously published work [42]. R software v3.6.1 [43] was used for all processing and analysis.

Fig 1

Data processing and analysis pipeline.

Data extraction

We scraped publicly available data using ‘RedditExtractoR’ v2.1.5 [44] from two forums focused on experiences of mothers and fathers on Reddit (i.e., r/Daddit; r/Mommit). Data scraping was conducted on 11 February 2020 and we retained those posts that were the first of a particular thread (known as the ‘original post’) and that contained textual data in the body of the post, e.g., did not comprise just an image. Responses to the original posts were not included in analyses as this study aimed to only identify common parenting situations rather than describe the conversations and language used surrounding these situations. Due to limits in how data can be scraped from Reddit, only the first 10 pages of original posts from these forums could be scraped at any one time. Thus, we extracted 340 (r/Daddit) and 578 (r/Mommit) original posts that spanned an approximate 4–6 week period prior to our date of data scraping (r/Daddit earliest original post extraction date was 13 Jan 2020; r/Mommit earliest original post extraction date was 12 Dec 2019).

Data preparation

The corpus of data was prepared for analysis using the ‘tm’ v0.7–6 [45] package. This process involved removing all non-alphabetical characters, punctuation, and blank spaces; converting all text to lowercase; removing all non-words; and removing “stop words” (e.g., “the”, “is”, “at”, “which”, “on”). We also manually ‘stemmed’ the corpus by reviewing the entire list of unique words and ensuring words with different suffixes (e.g., “happy”, “happiness”, “happiest”) were coded as the same word (e.g., “happy”). During this review we also corrected obvious misspellings and international differences in spelling.
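The cleaning steps above can be illustrated with a minimal Python sketch. This is not the authors' R code (which used the ‘tm’ package); the stop-word list and the stemming map are short illustrative placeholders, and the name clean_post is hypothetical. Removing non-words would additionally require a dictionary lookup, which is omitted here.

```python
import re

# Illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "of", "to"}

# Hypothetical manual stemming map: variant -> canonical form.
MANUAL_STEMS = {"happiness": "happy", "happiest": "happy"}


def clean_post(text):
    """Return the list of cleaned tokens for one post."""
    text = text.lower()
    # Replace every run of non-alphabetical characters with a single space.
    text = re.sub(r"[^a-z]+", " ", text)
    tokens = text.split()
    # Drop stop words, then map suffix variants to one canonical word.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [MANUAL_STEMS.get(t, t) for t in tokens]
```

A manual map is used here instead of an algorithmic stemmer because the text describes reviewing the list of unique words by hand.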

LDA topic modelling

Common topics within the corpus were identified using latent Dirichlet allocation (LDA) [33], which is a common method used to identify themes in collections of scholarly textual data [32-35], as well as large sets of documents or texts in research or clinical contexts [46]. An outline of the statistical methodology of LDA is presented in detail elsewhere [33, 47], but is briefly described here. LDA is a Bayesian probabilistic modelling method that aims to identify the unknown number of latent topics that are assumed to underlie a body of text [33]. LDA draws from a Dirichlet distribution to generate distributions of probabilities that describe how (1) words (i.e., word-topic-probabilities) and (2) documents (i.e., document-topic-probabilities) are related to the latent topics underlying the dataset. Specifically, word-topic-probabilities are estimates of the probability a word is generated from a specific topic, whilst document-topic-probabilities are estimates of the probability that a topic has been generated in a specific document [48]. Inspection of the highest word-topic and document-topic probabilities for each topic can help characterize the theme of each latent topic. In the current study, we used LDA to find common topics within the corpus of original posts from r/Daddit and r/Mommit, and from these we aimed to identify parenting situations. To conduct the LDA, we converted the corpus to a document-term-matrix, comprising rows representing each original post (i.e., document = original post) and columns representing each word in the corpus (i.e., term = word). Each cell in the document-term-matrix contains the number of times a specific word (defined by the column) occurred in a specific post (defined by the row). From this document-term-matrix, the entire corpus was represented, including patterns of words that commonly occur together within the same post. The LDA was conducted using the ‘topicmodels’ v0.2–8 package [49].
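As an illustration of the document-term-matrix described above, the following Python sketch builds one from already-tokenized posts. It is a simplified stand-in for what the R packages do internally; the function name build_dtm and the dense list-of-lists layout are assumptions made for readability (real implementations typically store the matrix sparsely).

```python
from collections import Counter


def build_dtm(tokenized_posts):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in post i."""
    # One column per unique word in the corpus, in alphabetical order.
    vocabulary = sorted({tok for post in tokenized_posts for tok in post})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for post in tokenized_posts:
        row = [0] * len(vocabulary)
        for word, count in Counter(post).items():
            row[column[word]] = count
        matrix.append(row)
    return vocabulary, matrix
```

Each row corresponds to one original post and each column to one term, matching the definition given in the text.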
As per Kosinski et al. [50], LDA hyperparameters were set to delta = 0.1 and alpha = 50/k (where k is the number of LDA topics being estimated). To identify the number of LDA topics that best fit the patterns within the corpus, we first estimated a 2-topic model and then sequentially increased the number of LDA topics being modelled until we estimated a 50-topic model. We then selected the best fitting model based on the perplexity value for each model; a common method for evaluating model fit in LDA models [33, 51], where models with lower perplexity are considered to fit the data better than models with higher perplexity. Our estimate of perplexity for each model was derived as an average of the perplexity across a five-fold cross validation process. Within this process, the corpus of original posts was randomly split into five portions with a model generated (i.e., trained) on four of the portions, and then validated on the fifth portion. This process was repeated until each portion had been used for validation once. We also used the ‘LDAvis’ v0.3.2 package [52] to obtain the percentage of tokens (i.e., words) contributing to a specific topic.
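The cross-validated model selection described above can be sketched in Python as follows. This is only an outline under stated assumptions: the LDA fitting itself is delegated to a hypothetical callback, fit_and_score, which is assumed to train a k-topic model on the training posts and return the held-out log-likelihood and token count. Perplexity is computed with the standard definition, the exponential of the negative log-likelihood per held-out token.

```python
import math
import random


def five_fold_indices(n_docs, n_folds=5, seed=0):
    """Randomly split document indices 0..n_docs-1 into n_folds portions."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def perplexity(total_log_likelihood, total_tokens):
    """exp(-log-likelihood per held-out token); lower means better fit."""
    return math.exp(-total_log_likelihood / total_tokens)


def mean_cv_perplexity(n_docs, k, fit_and_score, n_folds=5, seed=0):
    """Average held-out perplexity of a k-topic model across the folds.

    fit_and_score is a hypothetical callback (assumption): it trains LDA on
    train_idx and returns (log_likelihood, n_tokens) for the held-out posts.
    """
    alpha, delta = 50 / k, 0.1  # hyperparameters as stated in the text
    folds = five_fold_indices(n_docs, n_folds, seed)
    scores = []
    for held_out in folds:
        # Train on every portion except the one held out for validation.
        train_idx = [i for fold in folds if fold is not held_out for i in fold]
        log_lik, n_tokens = fit_and_score(train_idx, held_out, k, alpha, delta)
        scores.append(perplexity(log_lik, n_tokens))
    return sum(scores) / len(scores)


def best_k(mean_perplexity_by_k):
    """Pick the number of topics with the lowest mean perplexity."""
    return min(mean_perplexity_by_k, key=mean_perplexity_by_k.get)
```

In use, mean_cv_perplexity would be called once for each k from 2 to 50, and best_k applied to the resulting dictionary of averages.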

Qualitative analysis and interpretation

As mentioned previously, to interpret the topics derived from the optimal model, we relied on two metrics that form the basis of interpretation for LDA models. The first is a matrix of values that quantify the probability that each word in the corpus would be generated from a specific topic (i.e., β matrix; higher β = the word is more likely to occur in the topic). The second is a matrix of values that quantifies the proportion of words in an original post estimated to be generated by a specific topic (i.e., γ matrix; higher γ = the original post is more aligned with the topic). Specifically, we focused interpretation on the 10 most relevant posts per topic according to γ values. We completed a qualitative synthesis of the topics and original posts via a manual text analysis to interpret each topic and characterize specific parenting situations described in original posts within each topic. This procedure involved classifying each post according to whether it met a pre-specified definition of a ‘parenting situation’, defined as follows: (1) the post referenced a scenario involving a child aged 0–18 years; and (2) the scenario involved a parent and their child; and (3) referenced a potential or actual difficulty or issue, for which the parent was giving or seeking advice in how to manage or improve. Posts were also only included if one or more of the 10 topic words was used in a way that conveyed meaning relevant to the central themes of a given post and topic, rather than being peripheral.
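Selecting the 10 most relevant posts per topic from the γ matrix can be sketched in Python as below. The layout gamma[d][t] (one row per post, one column per topic) and the function name top_posts_per_topic are assumptions for illustration; the same ranking applied to the β matrix would give the most probable words per topic.

```python
def top_posts_per_topic(gamma, n_top=10):
    """gamma[d][t] is the proportion of post d attributed to topic t.

    Returns, for each topic, the indices of the n_top posts with the
    highest gamma, most relevant first.
    """
    n_topics = len(gamma[0])
    ranking = []
    for t in range(n_topics):
        # Sort post indices by their gamma value for topic t, descending.
        order = sorted(range(len(gamma)), key=lambda d: gamma[d][t], reverse=True)
        ranking.append(order[:n_top])
    return ranking
```

The returned indices identify which posts would be read in the manual text analysis for each topic.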

Results

LDA results

Fig 2 presents the perplexity values for each of the estimated topic models, from 2 to 50 LDA topics. Dots represent perplexity scores for each of the 5-fold cross-validation models for each number of topics. The blue line represents the average perplexity scores across the 5-fold cross validation. The lowest average perplexity was for a model with 31 topics (mean perplexity = 917.64). This 31-topic model was found to be the best performing LDA model as it had the lowest average perplexity of all the models (see S1 Table). The β values for the 10 most probable words associated with each of the 31 LDA topics, and the range of γ values for the top 10 posts contributing to each of the 31 topics, are presented in S1 Fig and S2 Table, respectively. The average probability across all documents and the average and range of probabilities for the top 10 documents for each topic are presented in S3 Table. On inspection of the top 10 original posts related to each topic (i.e., with the highest γ values), 24 of the 31 LDA topics contained posts that met the inclusion criteria in terms of being consistent with our pre-specified definition of a parenting situation, and meaningfully reflecting the topic words (i.e., with the highest β values).

Fig 2

Perplexity scores as a function of number of topics estimated.

Parenting situations

Table 1 summarizes the LDA topic words, themes, and description of parenting situations for the 24 LDA topics that were included in the final analysis. The percentage of tokens contributing to each topic was relatively balanced, ranging from 2.6% (topic 31) to 4.1% (topic 2). Just eight of the 24 LDA topics included posts where parents were giving advice, either recounting their own recent experiences to illustrate a specific successful parenting moment that they hoped others might learn from (e.g., just found a good strategy for clearing their sick child’s congestion), or some parents posted with more generic advice (i.e., not specific to a recent parenting experience, but based on their accumulated parenting experience), such as providing advice on strategies for specific situations (e.g., toilet training advice). The remainder of the topics involved posts where the parent was seeking advice from other parents on the Reddit parenting forums, most commonly in the format of a parent describing a specific parenting situation and asking others for advice on how they might handle it.

Table 1

Summary of topic words, themes, and situation descriptions related to parenting situations described in Reddit parenting posts.

Topic (% of tokens)	LDA topic words	Topic themes	Descriptions of parenting situations
2 (4.1%)	Sleep, night, time, wake, bed, hour, put, nap, minute, crib	Waking in night	Child sleep: waking in night wanting to play, feed or be held.
3 (2.8%)	Get, just, nappy, use, train, toddler, toilet, bad, run, poo	Sickness and toilet training	Use of strategies for managing child with ’bad’ cold or other sickness (e.g., runny nose).
3 (2.8%)		Sickness and toilet training	Issues or stages of toilet training.
4 (2.8%)	Day, bit, one, today, little, spend, start, bad, part, small	Child refusing food or whining	Child refusing one food/part of a meal
4 (2.8%)		Child refusing food or whining	Child spending day crying/fussy; whinging if parent says no to something they want.
5 (3.1%)	Girl, boy, one, want, find, daughter, little, look, interest, anything	Gender stereotypes	Parent concern about gender stereotypes (boy/girl): e.g., son teased for wanting to wear nail polish; parent wants child to be involved in sports, not just gender-stereotypical activities.
7 (2.9%)	Old, year, month, now, almost, sure, seem, still, last, half, yesterday	Managing multiple children and sibling conflict	Children (of different ages in months/years) only want to be picked up by mum.
			Child is acting younger than age (in months/years)—sleep or toilet training regression after birth of sibling.
			Managing multiple children of different ages with different needs and conflict.
8 (3.2%)	Will, partner, work, try, just, take, advice, anxiety, new, still	Managing child emotions	Parent trying to manage child emotions/anxiety in new situations or teaching new skill (e.g., toilet training).
9 (3.4%)	Feel, like, just, much, morning, guilt, still, issue, better, though	Parent guilt about time to self	Parent feeling guilty about wanting time to themselves, away from child.
10 (3.7%)	Say, tell, want, talk, know, school, let, made, break, ok	How to talk to upset children	How to talk to kids when they’re upset, want something they can’t have, or won’t do as asked.
11 (3.3%)	Feed, month, try, bottle, eat, breast, milk, give, breastfeed, formula	Child refusing food	Child refusing to eat, breastfeed or feed from bottle.
12 (3.5%)	Friend, really, can, get, move, since, new, good, support, live	Talking to children about change	Talking to child about moving to new daycare (e.g. leaving friends).
13 (3.1%)	Playroom, put, much, look, also, watch, come, dog, away, toy	Setting rules	How to manage rules around safe rooms to play in the house.
13 (3.1%)		Setting rules	How to manage rules for putting toys away after play.
15 (2.8%)	Amp, even, know, use, see, xb, sister, phone, wonder, lie	Child lying and phone use	Managing a child who has lied about something they did.
15 (2.8%)	Amp, even, know, use, see, xb, sister, phone, wonder, lie	Child lying and phone use	Managing child’s use of phone.
17 (3.0%)	Help, go, time, need, can, get, please, advice, appreciate, great	Bed-time and juggling multiple children	Child refusing to go to bed on time; can’t go to sleep.
17 (3.0%)		Bed-time and juggling multiple children	Advice on activities parents can do to manage multiple children at home.
20 (3.1%)	Get, hold, walk, cry, start, head, sit, hand, arm, push	Physical support to child learning or upset	How to hold toddler when crying/having tantrum.
			How to support child learning to walk.
			Child hitting herself when upset and crying.
			Managing when a child prefers (to be held when crying) by one parent over another.
21 (2.9%)	Can, story, love, book, read, yo, hear, way, mean, oh	Books to introduce difficult subjects	Child asking about death while reading books.
21 (2.9%)	Can, story, love, book, read, yo, hear, way, mean, oh	Books to introduce difficult subjects	Books to prepare a child for separation from parents.
22 (3.0%)	Baby, first, know, week, new, born, recommend, hair, newborn, able	How to support a crying baby	How to support baby crying during tummy time.
23 (3.7%)	Like, son, think, never, ask, know, love, thing, start, always	Managing child and parent autonomy; responding to child negative emotions and behaviour	How to respond to child always asking ’why’ about everything.
			Thinking about how to manage strong feelings without always hitting their child.
			Should parents ask baby’s permission before giving massage.
			Feeling judged for wanting time away from son.
			Child rude, disrespectful when asked to do something or when parent says no.
			Managing child lying about use of their phone.
24 (3.3%)	Go, month, back, want, just, end, daughter, come, happen, hear, around	Child developmental stages	Weaning a child from breastfeeding.
			Child going backwards in toileting.
			Child sleep and tantrums going backwards (regression) after birth of sibling.
25 (3.6%)	Back, doctor, now, take, went, since, see, us, hospital, come	Managing child sleep	Daughter will not sleep through the night; went to see doctor.
26 (3.1%)	Go, time, get, try, just, thing, frustrate, lot, easy, still	Child or parent frustration	Baby frustrated trying tummy time.
			Father frustrated with daughter at bed-time or crying only with him.
			Parent frustrated child won’t listen to "no".
			Son always climbing on something, parent worried about falls.
28 (4.0%)	Day, home, work, time, stay, leave, week, take, job, daycare	Creating routine	Mum at home and trying to create routine for the day.
29 (2.7%)	Can, idea, anyone, us, good, buy, hard, lot, share, usual, favourite	Strategies for calming child	Ideas for good strategies to calm child.
30 (3.5%)	Child, year, want, parent, family, don, house, us, come, live	Child wants own way	Parent feels child wants their own way all the time.
31 (2.6%)	Tip, guy, advice, without, thank, car, look, infant, seat, appreciate	Burping baby	Advice for how to best burp baby.

Based on the manual text analysis of posts within the LDA topics, we identified 45 unique but broadly defined parenting situations. The most prevalent themes from LDA topics meeting our inclusion criteria involved parents managing parenting situations related to child sleep, sickness, toilet training, child food refusal or fussiness, children whining when they didn’t get what they wanted, gender stereotyping, managing multiple children/siblings, and managing children who are showing a regression in their behavior in relation to age. There were two very common higher-order themes, each identified in more than a third of the parenting situations, relating to (1) managing basic childcare situations, i.e., related to eating, sleeping, routines, sickness, toilet training; and (2) how to respond to child negative emotions and/or difficult behavior. Some topics were generated from parenting situations based on just one post e.g., Topic 31’s sole post was seeking advice on burping a baby. Other topics were generated from parenting situations based on a number of posts e.g., Topic 2 had multiple posts on child sleeping habits.

Parent and child characteristics

Table 2 summarizes the child ages and subreddit source for each of the included topic themes. The majority of the included posts (80%) referred to a child or children of infant or toddler age. Posts about children 4 years and older were focused on child use of technology, supporting children in gender non-stereotypical activities (e.g., sports, make-up), managing sibling conflict or the needs of multiple children at home, and using bibliotherapy to talk to children about change or potentially upsetting subjects. In relation to the source of posts, one third of the included posts were from the r/Daddit forum, while the remainder were from r/Mommit. The majority of the LDA topics included posts from both sources, meaning that most of the identified parenting situations were discussed by mothers and fathers online. The two most common higher-order themes (i.e., managing basic childcare situations and responding to child negative emotions and/or difficult behavior) were also reflected in both r/Mommit and r/Daddit posts. The identity (e.g., gender) of the person writing each post was not known, and is therefore not able to be described.

Table 2

Child ages and subreddit source reflected in included topic themes.

Topic	Topic themes	Child ages^a	Subreddit source
2	Waking in night	Infant (8); Toddler (2)	r/Mommit; r/Daddit
3	Sickness and toilet training	Infant (4); Toddler (5)	r/Mommit; r/Daddit
4	Child refusing food or whining	Infant (1); Toddler (2); Young child (1)	r/Mommit; r/Daddit
5	Gender stereotypes	Young child (6)	r/Mommit; r/Daddit
7	Managing multiple children and sibling conflict	Infant (2); Toddler (4); Young child (2)	r/Daddit
8	Managing child emotions	Toddler (2)	r/Mommit
9	Parent guilt about time to self	Infant (2); Toddler (1)	r/Mommit
10	How to talk to upset children	Toddler (1); Young child (3)	r/Mommit; r/Daddit
11	Child refusing food	Infant (6); Toddler (1)	r/Mommit; r/Daddit
12	Talking to children about change	Toddler (1)	r/Mommit
13	Setting rules	Toddler (2); Young child (1)	r/Mommit; r/Daddit
15	Child lying and phone use	Older child (3)	r/Daddit
17	Bed-time and juggling multiple children	Toddler (1); Young child (3)	r/Mommit
20	Physical support to child learning or upset	Infant (2); Toddler (2)	r/Mommit; r/Daddit
21	Books to introduce difficult subjects	Toddler (1); Young child (1)	r/Mommit
22	How to support a crying baby	Infant (1)	r/Mommit
23	Managing child and parent autonomy; responding to child negative emotions and behaviour	Infant (1); Toddler (2); Young child (2)	r/Mommit; r/Daddit
24	Child developmental stages	Infant (1); Toddler (2); Young child (1)	r/Mommit; r/Daddit
25	Managing child sleep	Toddler (1)	r/Mommit
26	Child or parent frustration	Infant (1); Toddler (4)	r/Mommit; r/Daddit
28	Creating routine	Infant (1)	r/Mommit
29	Strategies for calming child	Infant (1); Toddler (1)	r/Mommit
30	Child wants own way	Infant (1); Toddler (1)	r/Daddit
31	Burping baby	Infant (1)	r/Daddit

^a Child age categories were coded as follows: Infant (0–11 months); Toddler (1–3 years); Young child (4–10 years); Older child (10+ years). Numbers in parentheses indicate the number of children mentioned within posts in the given topic related to that age group.

Whilst we focused our interpretation on topics that met our inclusion criteria, interested readers can find a summary of themes for posts not meeting the inclusion criteria for the current study in S4 Table. In summary, parents posted on parenting subreddits for a large number of reasons, including medical issues not related to being a parent; managing relationships with family members other than their children, such as the relationship with their partner; and managing a range of situations related to being a parent, but not specific to a parent-child interaction, including moving house, seeking advice around their own feelings and wellbeing, navigating special occasions and holidays, their child’s school, and managing issues related to pets and children.

Discussion

Principal results

The objective of the current study was to support the necessary first step in developing parenting interventions that are tailored to parents’ context, by providing a detailed understanding of the common parenting situations and issues that parents face day-to-day. In this study, we sought to identify the most common day-to-day parenting situations discussed online on the r/Daddit and r/Mommit subreddit forums to inform meaningful tailoring in parenting interventions. Using the innovative and increasingly popular machine learning approach, latent Dirichlet allocation (LDA) [33], we were able to exploit the rich source of person-centered parenting data available online, in order to identify parenting situations that are most applicable to parents day-to-day. The LDA topic modeling extracted 31 LDA topics, of which 24 met our inclusion criteria by relating to a difficult parenting situation involving a parent and their child. From these, we identified 45 unique but broadly defined parenting situations involving a parenting issue or difficult interaction between a parent and their child aged 0–18 years. Our study is the first we are aware of to systematically describe the most topical and common parenting situations that parents seek advice and support around. Over two-thirds of the LDA topics meeting our inclusion criteria for analysis related to two primary themes. The first was managing basic childcare situations (i.e., eating, sleeping, routines, sickness, and toilet training); the second was advice on how to respond to child negative emotions and/or difficult behavior. Although there is little in the way of published evidence documenting the specific parenting examples used in parenting interventions, it is the experience of the authors that these examples tend to relate to situations involving the second theme (i.e., managing child negative emotions and difficult behavior).
Our results suggest that many parents are also very interested in how to manage children in day-to-day childcare tasks. The parent-child relationship is equally formed around management of these daily rhythms, as it is in the management of difficult child behavior or emotions. Thus, it seems appropriate that future parenting interventions could include content focused on parent-child interactions across the breadth of parenting situations described in our results. The inclusion of posts from r/Daddit and r/Mommit meant that our results reflected perspectives from both mothers and fathers. Despite clear evidence that fathers are central in influencing child developmental outcomes [53, 54], including child mental health outcomes [19, 55], fathers are chronically underrepresented in parenting research and parenting interventions (<20% of attendees) [56]. A systematic review of fathers’ representation in observational parenting studies identified that just 10% of 667 parenting studies included results separately for fathers; 1% were focused solely on fathers compared to 36% focused solely on mothers [57]. Our study findings address this gap by describing topics discussed by mothers and fathers. Although we identified some differences in LDA topics discussed in each of the forums, on the whole, there was a high level of consistency in the themes raised in r/Daddit and r/Mommit. However, our findings suggest that parents raise different questions depending on the age of their child, suggesting that parenting interventions should tailor intervention content to be age-specific. Further, the vast majority of included posts (80%) related to infant or toddler aged children. These findings suggest that early childhood may be an important time for intervention when parents are particularly open to receiving support and advice. We note that just 24 of the 31 LDA topics identified met our inclusion criteria for analysis as a parenting situation. 
The remaining posts traversed a wide range of themes, including managing family situations and relationships in the extended family context, parents’ own health concerns, and household-related issues. Parents increasingly use online tools in a multitude of ways to assist with parenting, including connecting through social media, assisting with managing family life, and providing multiple sources of online information [29]. Our data reflect that parents use Reddit to access support from other parents for a wide range of reasons. Although some of the LDA topics identified in the current study were not deemed to be immediately relevant in terms of tailoring parenting interventions, they may be useful for understanding how parents access support online.

Applications for tailoring parenting interventions

E-interventions, delivered online or via smartphone apps, offer potential for overcoming cost, time, and geographic barriers faced by traditional interventions [58]. However, results to date have been disappointing. A systematic review showed that current technology-based parenting interventions are still under-servicing key groups, such as fathers, non-urban parents, and parents from a low socio-economic background [59]. For the most part, technology-based parenting interventions have not changed the format or content from traditional face-to-face, module-based approaches [58]. Tailoring in technology-based parenting interventions has been minimal to date, despite evidence that engagement is strongest when interventions are flexibly tailored to be maximally relevant to individual parents [58, 60]. Findings from the current study provide scope to address these limitations. We identify 45 of the most commonly discussed parenting situations on the parenting-specific forums of Reddit. Our results provide a detailed picture of the common scenarios that mothers and fathers face day-to-day, which could be applied to tailor resources in population-based parenting interventions.

Strengths and limitations

The use of LDA topic modeling and manual text analysis was both a strength and potential limitation of the current study. LDA offers a powerful data-driven method for analyzing large datasets, and may discern patterns in qualitative data that becomes infeasible to analyze with traditional qualitative coding approaches as data quantity increases. However, LDA lacks specific recommendations around best practice approaches in particular settings. For example, there is no single method for determining the most appropriate number of LDA topics to model within the data [32]. Further, decisions made in the data preparation steps to clean the data prior to running the LDA modeling have potential to introduce researcher bias and impact the results. Our findings may also be limited by the scope offered in the Reddit parenting forums. For example, while widely used, not all parents use Reddit. Further, for Reddit users, some parents may be more likely to comment on others’ posts, rather than create an original post themselves. In both cases, the experiences of these parents will not be reflected in our results. There may also be bias in the subjects that parents raise for discussion on Reddit, for example, based on parents’ prior interpretation of what is acceptable within particular forums. In addition, some topics may be less likely to be identified by the LDA process, such as those related to sensitive or less common situations, or where a range of different words are used to refer to the same meaning, thus making it less likely to be identified as a topic by the LDA modelling. Further research efforts to augment publicly available posts with data directly obtained from Reddit users about their demographics and other parenting-related constructs may provide added context with which to appraise the potential generalizability of results obtained from online forum conversations. 
Our results reflect data collected from the Reddit platform in 2020, but are likely to be generalizable to other online forums used by parents, such as Facebook and Instagram. There is emerging evidence investigating the way in which parents use other social media sites, including Facebook parenting groups [61, 62]. However, Reddit is uniquely amenable to topic modelling, given that all posts are publicly available, which is not the case with Facebook and Instagram. We expect that the way in which parents discuss parenting issues would be relatively consistent across platforms, but there is currently no evidence to assess this claim. This could be tested by future research investigating whether there are systematic differences between parent users of different platforms, such as demographic characteristics (e.g., parent user age, gender, country of residence, spoken language).
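As a rough illustration of the topic-number question raised above, one common heuristic is to fit models for several candidate numbers of topics (K) and keep the K with the lowest mean held-out perplexity across cross-validation folds. The sketch below assumes the per-fold perplexities have already been computed; the function names and all numeric values are hypothetical and are not taken from the study.

```python
import math
from statistics import mean


def perplexity(log_likelihood, n_tokens):
    """Perplexity of held-out text: exp of the negative log-likelihood per token."""
    return math.exp(-log_likelihood / n_tokens)


def best_k(fold_perplexities):
    """Return the K whose mean held-out perplexity across folds is lowest."""
    return min(fold_perplexities, key=lambda k: mean(fold_perplexities[k]))


# Hypothetical held-out perplexities: three candidate K values, 5 folds each.
# These numbers are invented for illustration only.
fold_perplexities = {
    5: [520.0, 512.5, 530.1, 508.8, 525.2],
    10: [471.3, 465.0, 480.4, 462.9, 474.6],
    20: [478.8, 472.1, 490.0, 469.5, 481.7],
}

print(best_k(fold_perplexities))  # prints 10
```

Lower perplexity indicates better prediction of held-out text, but it is only one criterion; topic interpretability still has to be judged by the researcher.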

Conclusions

Our results support the use of LDA text mining for the purpose of understanding broad themes discussed in online parenting forums. We identified 45 unique parenting situations describing a wide range of parenting contexts, but most commonly related to basic childcare situations, i.e., related to eating, sleeping, routines, sickness, toilet training; or related to advice about how to respond to child negative emotions and/or difficult behavior. The current study was novel in analyzing parents’ own words to understand the most common day-to-day parenting issues experienced by parents. Internet forums represent a rich source of potentially unbiased data for researchers, offering a method for observing and analyzing naturally-occurring adult conversations online. These findings offer potential to tailor parenting interventions in a meaningful way, creating opportunities to develop content and resources that are directly relevant to parents’ lived experiences.

Mean perplexity values across 5-fold cross-validations for K-topic solutions.

(DOCX)

Mean, standard deviation, and range of gamma values for the top 10 posts contributing to each topic.

(DOCX)
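The summary named in this caption (mean, standard deviation, and range of γ for the posts contributing most to each topic) can be sketched as follows. Here γ is taken to be a posts-by-topics matrix of document-topic probabilities; the function name, the toy matrix, and the use of two top posts instead of ten are assumptions made for illustration.

```python
from statistics import mean, stdev


def summarise_top_posts(gamma, topic, n_top=10):
    """Mean, SD, and range of gamma for the n_top posts loading highest on a topic.

    gamma[d][k] is the document-topic probability of post d for topic k.
    """
    top = sorted((row[topic] for row in gamma), reverse=True)[:n_top]
    return {
        "mean": mean(top),
        "sd": stdev(top) if len(top) > 1 else 0.0,
        "min": min(top),
        "max": max(top),
    }


# Hypothetical gamma matrix: 4 posts x 2 topics (each row sums to 1).
gamma = [
    [0.90, 0.10],
    [0.70, 0.30],
    [0.20, 0.80],
    [0.50, 0.50],
]

# Summarise the two highest-loading posts for topic 0 (values 0.90 and 0.70).
print(summarise_top_posts(gamma, topic=0, n_top=2))
```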

Range and average document-topic-probabilities for the top ten documents for each topic.

(DOCX)

Summary of themes from posts not meeting the study inclusion criteria.

(DOCX)

β values for the 10 most probable words associated with each of the 31 LDA topics.

(TIF)

27 Apr 2021

PONE-D-21-07563

Text mining of Reddit posts: Using latent Dirichlet allocation to identify common parenting issues

PLOS ONE

Dear Dr. Westrupp,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jun 11 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Ali B. Mahmoud, Ph.D.
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1) Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2) PLOS ONE has specific requirements for studies using personal data from third-party sources, including social media, blogs, other internet sources, and phone companies (https://journals.plos.org/plosone/s/submission-guidelines#loc-personal-data-from-third-party-sources). These requirements include confirming data are collected and used in accordance with the company or website’s Terms and Conditions, obtaining appropriate ethics or data protection body review, and ensuring appropriate consent from individuals whose data are used in research. In this case, please ensure that your Ethics statement is in compliance with guidelines, and that you have complied with the company's (i.e., Reddit) Terms and Conditions, with appropriate permissions.

3) Please upload a copy of Figure 2, to which you refer in your text. If the figure is no longer to be included as part of the submission please remove all reference to it within the text.

4) We note that you have indicated that data from this study are available upon request.
PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: No
Reviewer #2: No

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The article topic is exciting and essential for all families and young people who want to form a family. However, there are some notes.

1. The research lacks a clear explanation of the methodology, while the Methods section focuses more on the LDA model's steps.

2. The size of the data used is relatively small, and it is preferable to increase the number or cite similar articles that use the same or less data volume.

3. The authors have deleted the stop words, but in table one, I find some words that are considered stop words like will, can, since and others. So is there a reason not to delete them, and if there is a reason, the authors should write it.

4. The authors stemmed the words manually; why? Note: some tools do it automatically

5. Authors do not write the percentage of each topic to the overall size of the dataset. Through this percentage, we can find out the actual weight of each topic. Note: the authors referred to this point on page 9, row 164, but did not implement it

6. The authors wrote on page 13, row 256, "Specifically, we focused interpretation on the 10 most relevant posts per topic according to γ values.". It is better to specify a percentage rather than a fixed number.

7. page 15, row 285 to 291, more explanation needs to be mentioned

8. Table 1 column "LDA topic words ", some words", not refer to the topic themes like topic 4. day, one, today topic 7. last, yesterday topic 9. like, much topic 10. ok it is general words that can be on different topics. Personally prefer to eliminate it as stop words, and some words are not clear like yo and oh on topic 21,

9. The authors wrote on page 24, "there is no single method for determining the most appropriate number of LDA topics to model within the data." this is true, for that we sometimes resort to the help of experts in the field of study to determine the topics' numbers "human judgment ". I recommend that authors use this method to determine the number of topics because the 31 topics selected in this study are a big number, and some of them refer to the same themes.

Reviewer #2: The authors present a data mining-based approach to identify common parenting situations discussed by parents on parenting platforms such as r/Daddit and r/Mommit. LDA is used to identify the most common topics in a date set extracted and scraped from relevant websites. The work is technically sound and the findings are interesting and has potential to feed future applications/chatbots that may benefit and reach to larger number of parents in different geographical locations and variety of languages. I do have some concerns that requires to be answered before the paper is accepted for publication:

1- Why were only 2-topics being estimated at the beginning of the study? Isn't that more topics would contribute to an informative results?

2- How are the number of models being increased till the number reached 50 topics? The method used must be shared.

3- What is the perplexity value of each model not being shared?

4- On what basis the parenting situations were defined? Any logic behind the identification of those situations? Probably previous experiences or data from key informants has fed in such selection. For instance, posts that involves a child aged 0-18 may not involve parenting issues. How it was decided to included as a parenting situation?

5-Figure 2 that represents the perplexity values for topic models scenarios is missing from the article.

6-How did your study conclude that parents use Reddit to access support from other parents? I haven't seen any data that supports such a claim.

7-Some of LDA topics, that were considered not immediately relevant to parenting interventions, may have used different keywords to inquire about important parenting matters. How to ensure such topics are included in the analysis?

8-The authors should acknowledge the fact that Daddit and Mommit may have been used by specific communities/ countries so the issue of generalisation of results must be discussed in depth.

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Submitted filename: PONE-D-21-07563_comment.docx

19 Jul 2021

Please see attached formatted document with table showing reviewer comments and our response. The section pasted here might be harder to read.

The article topic is exciting and essential for all families and young people who want to form a family. However, there are some notes.

1. The research lacks a clear explanation of the methodology, while the Methods section focuses more on the LDA model's steps.

We have now added some more technical details about the LDA methodology in the method section (see “LDA topic modelling”) and also provided references for technical descriptions. The edited section is presented below (on page 12):

“An outline of the statistical methodology of LDA is presented in detail elsewhere [47, 48], but briefly described here. LDA is a Bayesian probabilistic modelling method that aims to identify the unknown number of latent topics that are assumed to underlie a body of text [33]. LDA draws from a Dirichlet distribution to generate distributions of probabilities that describe how (1) words (i.e., word-topic-probabilities) and (2) documents (i.e., document-topic-probabilities) are related to the latent topics underlying the dataset. Specifically, word-topic-probabilities are estimates of the probability a word is generated from a specific topic, whilst document-topic-probabilities are estimates of the probability that a topic has been generated in a specific document [49]. Inspection of the highest word-topic and document-topic probabilities for each topic can help characterize the theme of each latent topic.”

2. The size of the data used is relatively small, and it is preferable to increase the number or cite similar articles that use the same or less data volume.

The issue of sample size for LDA remains unclear in the literature, with no firm direction about required N nor sufficient empirical appraisal of impacts of sample size on obtained results. Indeed, a key challenge is that sample sizes for machine learning algorithms cannot be calculated a priori because model performance is entirely based on the strength of the signals underlying a dataset, which can only be identified using machine learning itself. This means that the dataset sizes are not comparable since these will be modeling different sets of topics. We also note that many studies do not report on the number of documents they have used. In fact, we have just completed a scoping review of 47 papers using LDA in the psychological sciences (under review), and found only 34% of the papers reported this information. Some examples of papers not reporting:

Ruiz, N., Witting, A., Ahnert, L., & Piskernik, B. (2020). Reflective functioning in fathers with young children born preterm and at term. Attachment & Human Development, 22(1), 32-45. https://doi.org/10.1080/14616734.2019.1589059

Barry, A. E., Valdez, D., Padon, A. A., & Russell, A. M. (2018). Alcohol Advertising on Twitter—A Topic Model. American Journal of Health Education, 49(4), 256-263. https://doi.org/10.1080/19325037.2018.1473180

Carpenter, J., Crutchley, P., Zilca, R. D., Schwartz, H. A., Smith, L. K., Cobb, A. M., & Parks, A. C. (2016). Seeing the 'big' picture: Big data methods for exploring relationships between usage, language, and outcome in Internet intervention data. Journal of Medical Internet Research, 18(8), e241-e241. https://doi.org/10.2196/jmir.5725

Hemmatian, B., Sloman, S. J., Cohen Priva, U., & Sloman, S. A. (2019). Think of the consequences: A decade of discourse about same-sex marriage. Behavior Research Methods, 51(4), 1565-1585. https://doi.org/10.3758/s13428-019-01215-3

Gerber, M. S. (2014). Predicting crime using Twitter and kernel density estimation. Decision Support Systems, 61, 115-125. https://doi.org/10.1016/j.dss.2014.02.003

As such, we argue that the topics that are generated, and the validity of those topics in understanding or being used for their intended purpose, are more relevant. We also wish to emphasize that the results from our study reflect the level of data available via Reddit for parenting conversations, and are not a subsample of possible content. This constrains our capacity to expand the analysis, but also provides validity to results as they are reflective of the breadth and content of current parenting conversations on Reddit.

3. The authors have deleted the stop words, but in table one, I find some words that are considered stop words like will, can, since and others. So is there a reason not to delete them, and if there is a reason, the authors should write it.
Please note that we have used the stop words function in the tm package, which uses a list of stop words that can be found here: http://snowball.tartarus.org/algorithms/english/stop.txt. The authors of this package explain that some words are not included as stop words since they are common homonyms (e.g., “can”, “will”). Nevertheless, we do not believe that removing these potential words would alter the interpretations of these topics, which should remain stable even with stop words removed given these words should not impact interpretation of the theme of a topic (i.e., because these words are not meaningful). Given the complexity of the data that is presented on an online forum, we appreciate some level of noise may be generated from these words. However, despite this noise, we believe that the results are still meaningful, with minimal impact on our aim to identify the parenting situations.

4. The authors stemmed the words manually; why? Note: some tools do it automatically

We appreciate the availability of some algorithms that can be used to automatically stem words in a corpus. However, we are mindful that automatic stemming approaches can reduce precision because this can lead to dissimilar meaning words being treated as the same stemmed word in analysis (see p. 34 of the reference below for examples). Given the nature of our dataset, and the potential for some overlapping words, we believed it was worth engaging in a manual stemming approach rather than automation of this task. For example, in our manual stemming approach there was clear advantage in discriminating some words that would automatically be stemmed to the same stem for the purpose of analysis (e.g., “grandfather”, “grandmother”, “granny”, “granddad”, “grandpa” = all stemmed to “grandparent”; whilst “granddaughter”, “grandchild”, “grandson” = all stemmed to “grandchild”). As such, we believe our manual approach was appropriate in this instance.

Reference: Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval.

5. Authors do not write the percentage of each topic to the overall size of the dataset. Through this percentage, we can find out the actual weight of each topic. Note: the authors referred to this point on page 9, row 164, but did not implement it.

We have now provided the average probability and range of probabilities for the top 10 documents for each topic in Supplementary Table 3. Whilst this does provide an indication of the relationship between documents and topics, we note that the aim of this study specifically was to identify the situations that were commonly experienced by parents, rather than focusing on the latent topics that were estimated. Consequently, we believe that this information is best presented as supplementary material. For convenience, we have presented this table below (at the end of the document).

6. The authors wrote on page 13, row 256, "Specifically, we focused interpretation on the 10 most relevant posts per topic according to 𝛾 values.". It is better to specify a percentage rather than a fixed number.

As noted in response 5, we have now provided a summary of the range and average document-topic-probabilities for the top 10 documents for each topic. However, as noted in our previous response we believe this information is best suited to the supplementary material given that the primary focus of the manuscript was to identify the parenting situations within each topic.

7. page 15, row 285 to 291, more explanation needs to be mentioned

We have added additional detail to better summarize the key information presented in Table 1 (page 16):

“Just eight of the 24 LDA topics included posts where parents were giving advice, either recounting their own recent experiences to illustrate a specific successful parenting moment that they hoped others might learn from (e.g., just found a good strategy for clearing their sick child’s congestion), or some parents posted with more generic advice (i.e., not specific to a recent parenting experience, but based on their accumulated parenting experience), such as providing advice on strategies for specific situations (e.g., toilet training advice). The remainder of the posts involved the parent seeking advice from other parents on the Reddit parenting forums, most commonly in the format of a parent describing a specific parenting situation and asking others for advice on how they might handle it.”

8. Table 1 column "LDA topic words ", some words", not refer to the topic themes like topic 4. day, one, today topic 7. last, yesterday topic 9. like, much topic 10. ok it is general words that can be on different topics. Personally prefer to eliminate it as stop words, and some words are not clear like yo and oh on topic 21.

We acknowledge that some of these words appear to be unrelated to the topics. However, we accept some level of “noise” within the dataset. Our revisions to the data were limited to stemming, so that patterns that emerged were likely to be a better reflection of how words relate, even if some words were unexpected. This was preferable, in our view, to overly interfering with the data to force a solution that made sense to us a priori. Moreover, given our study was aimed primarily at identifying the parenting situations that are commonly experienced, we believe that these examples will not impact on the identification of such situations.
As such we accept that whilst not perfect, these represent common issues in collecting data from websites such as reddit and do not impact on our primary focus on identifying parenting situations. 9. The authors wrote on page 24, "there is no single method for determining the most appropriate number of LDA topics to model within the data." this is true, for that we sometimes resort to the help of experts in the field of study to determine the topics' numbers "human judgment ". I recommend that authors use this method to determine the number of topics because the 31 topics selected in this study are a big number, and some of them refer to the same themes. We thank the reviewer for this suggestion. Given that the primary aim of this study was to identify the parenting situations that were discussed within each topic, we believe that our data driven approach (which suggested a 31 topic model) was appropriate. In this case, having a large number of topics was not necessarily burdensome given that it was the identification of the situations that was the aim of the study, and that it was therefore possible that there could be more (or less) than 31 situations that could be identified within the topics. As noted in the discussion (page 23), we: “...extracted 31 LDA topics, of which 24 met our inclusion criteria in terms of being related to a difficult parenting situation involving their child. From these, we identified 45 unique but broadly-defined parenting situations involving a parenting issue or difficult interaction between a parent and their child aged 0-18 years.” As such, our approach was not reliant on identifying a small number of topics, and our inclusion criteria focused only on topics that comprised and there was clear differentiation within the situations. Consequently we believe our data driven approach appropriately addressed our research questions. Submitted filename: RevisionsResponse_FINAL 28-6-21.docx Click here for additional data file. 7 Sep 2021
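As a rough illustration of the manual stemming described in the response to comment 4, the following Python sketch maps hand-picked word variants to a shared stem and then drops stop words. The mapping entries are taken from the grandparent/grandchild example in the response. The function name `normalize_tokens`, the tiny stop word set, and the simple tokenizer are assumptions made for illustration; this is not the authors' R code and does not reproduce the tm stop word list.

```python
import re

# Hand-curated stem map, using the grandparent/grandchild example from the response.
MANUAL_STEMS = {
    "grandfather": "grandparent",
    "grandmother": "grandparent",
    "granny": "grandparent",
    "granddad": "grandparent",
    "grandpa": "grandparent",
    "granddaughter": "grandchild",
    "grandson": "grandchild",
    "grandchild": "grandchild",
}

# Illustrative stop words only (assumption); a real list would be far longer.
STOP_WORDS = {"the", "a", "and", "my", "is", "to", "her", "his"}


def normalize_tokens(text):
    """Lowercase, tokenize, apply the manual stem map, then drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = [MANUAL_STEMS.get(tok, tok) for tok in tokens]
    return [tok for tok in stemmed if tok not in STOP_WORDS]


tokens = normalize_tokens("My granny and the grandson")
```

Because the map is explicit, related words can be kept apart deliberately, which is the advantage the authors claim over an automatic stemmer.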

PONE-D-21-07563R1

Text mining of Reddit posts: Using latent Dirichlet allocation to identify common parenting issues

PLOS ONE Dear Dr. Westrupp, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Moreover, the current manuscript omitted responses and revision work concerning the comments made by Reviewer 2. Therefore, I invite you to address/respond to the comments by both reviewers cautiously in your next revision. Please submit your revised manuscript by Oct 22 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript.

Kind regards, Ali B. Mahmoud, Ph.D. Academic Editor PLOS ONE

Journal Requirements: Additional Editor Comments (if provided):

Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. 
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: (No Response) Reviewer #2: (No Response) ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: (No Response) ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: No Reviewer #2: (No Response) ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: (No Response) ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: (No Response) ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1:

1. In my previous review, I wrote, "The research lacks a clear explanation of the methodology, while the Methods section focuses more on the LDA model's steps." I guess that the authors do not understand what I mean, so I will rewrite my review. Data extraction, data preparation, LDA topic modeling, and qualitative synthesis are general steps we apply in any LDA analysis. For this reason, the methodology paragraph should explain how the LDA modeling will be used to achieve the research goal. Otherwise, this research is nothing more than an application of LDA modeling and does not provide anything new.

2. There is not a complete stop words list; therefore, the possibility to update this list is available.

    import nltk
    # Requires the NLTK stop word corpus: nltk.download('stopwords')
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.append('newWord')

or

    import nltk
    stopwords = nltk.corpus.stopwords.words('english')
    newStopWords = ['stopWord1', 'stopWord2']
    stopwords.extend(newStopWords)

3. In my previous review, I wrote, "Authors do not write the percentage of each topic to the overall size of the dataset. Through this percentage, we can find out the actual weight of each topic." In this comment, I mean the topic, not the document. The authors can use tools like this "https://pyldavis.readthedocs.io/en/latest/modules/API.html" to get the percentage of each topic. This percentage is significant to know which topics are more representative.

4. 
The appearance of stop words in the table of topics is unacceptable; thus, they must be deleted before applying the LDA modeling, as I mentioned earlier. Furthermore, the analysis is based on the appearance of the words within the same document and in different documents; consequently, when we delete any word, a new word will appear in the topic's list of words, which can change the meaning of this topic.

5. The authors have mentioned that "we note that the aim of this study specifically was to identify the situations that were commonly experienced by parent." The identity of the situations will be identified through the results of LDA modeling; consequently, any problems in the application of the model will prevent us from reaching the real results "situations".

Reviewer #2: The author hasn't responded to my review that was sent with the decision letter. Please ensure that the below points are addressed clearly in the response letter. The authors present a data mining-based approach to identify common parenting situations discussed by parents on parenting platforms such as r/Daddit and r/Mommit. LDA is used to identify the most common topics in a data set extracted and scraped from relevant websites. The work is technically sound and the findings are interesting and have potential to feed future applications/chatbots that may benefit and reach a larger number of parents in different geographical locations and a variety of languages. I do have some concerns that need to be answered before the paper is accepted for publication:

1- Why were only 2-topics being estimated at the beginning of the study? Wouldn't more topics contribute to more informative results?

2- How are the number of models being increased till the number reached 50 topics? The method used must be shared.

3- What is the perplexity value of each model not being shared?

4- On what basis were the parenting situations defined? Any logic behind the identification of those situations? Probably previous experiences or data from key informants have fed into such selection. For instance, posts that involve a child aged 0-18 may not involve parenting issues. How was it decided to include them as a parenting situation?

5- Figure 2, which represents the perplexity values for topic model scenarios, is missing from the article.

6- How did your study conclude that parents use Reddit to access support from other parents? I haven't seen any data that supports such a claim.

7- Some of the LDA topics that were considered not immediately relevant to parenting interventions may have used different keywords to inquire about important parenting matters. How to ensure such topics are included in the analysis?

8- The authors should acknowledge the fact that Daddit and Mommit may have been used by specific communities/countries, so the issue of generalisation of results must be discussed in depth.

********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: Yes: Eiad Yafi

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. 
Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

21 Oct 2021

Ali B. Mahmoud, Ph.D. Academic Editor, PLOS ONE

Wednesday, 15 September 2021

Dear Dr Mahmoud

Thank you for your support of our paper. We want to pass on our sincere apologies to Reviewer 2 for unintentionally omitting their comments in our submitted response document. We have now addressed all comments, and provide a detailed response to both Reviewer 1 and 2’s comments in the table below.

Warm regards, Dr Elizabeth Westrupp, Deakin University

Submitted filename: Response to reviewers v2.docx

29 Nov 2021
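Reviewer 1's comment 3 asks for the share of each topic in the whole corpus, whereas the authors report the mean and range of document-topic probabilities for the top 10 documents per topic. The Python sketch below shows one way both summaries could be derived from a document-topic matrix. The matrix `gamma` is made-up toy data and the function names are hypothetical. Treating a topic's share as its mean probability across all documents is one common convention and is an assumption here, not a method confirmed by the reviewer or the authors.

```python
def topic_shares(gamma):
    """Mean document-topic probability per topic (one estimate of topic weight)."""
    n_docs = len(gamma)
    n_topics = len(gamma[0])
    return [sum(row[k] for row in gamma) / n_docs for k in range(n_topics)]


def top_docs_summary(gamma, topic, n_top=10):
    """Mean, min and max probability over the n_top most relevant documents."""
    probs = sorted((row[topic] for row in gamma), reverse=True)[:n_top]
    return {"mean": sum(probs) / len(probs), "min": probs[-1], "max": probs[0]}


# Toy matrix (invented): 4 documents x 2 topics, each row sums to 1.
gamma = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.4, 0.6],
]
shares = topic_shares(gamma)                         # corpus-level weight per topic
summary = top_docs_summary(gamma, topic=0, n_top=2)  # per-topic top-document summary
```

The first function answers the corpus-level question (which topics are most represented), while the second describes only how strongly the most relevant posts load on a given topic.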

PONE-D-21-07563R2

Text mining of Reddit posts: Using latent Dirichlet allocation to identify common parenting issues

PLOS ONE Dear Dr. Westrupp, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, the reviewers have recommended publication but also suggested a few minor corrections before the paper is fully accepted. Therefore, I invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Jan 13 2022 11:59PM.

21 Dec 2021

Dear Dr Mahmoud

Thank you for your support of our paper. We have explained below that we disagree with Reviewer 1’s perspective in regards to stop words. We have included a detailed explanation below, and note that our approach is backed up in a number of published papers.

Warm regards, Dr Elizabeth Westrupp, Deakin University

Reviewer #1: I still have the same comment. The appearance of stop words in the table of topics is unacceptable; thus, it must be deleted before applying the LDA modeling, as I mentioned earlier. Furthermore, the analysis is based on the appearance of the words within the same document and in different documents; consequently, when we delete any word, a new word will appear in the topic's list of words, which can change the meaning of this topic. The stop word forbids giving an explanation of the meaning of the topics.

OUR RESPONSE

We believe we have conducted an appropriate level of stopword removal. Our approach focussed on removing the most commonly recognised stopwords, as well as stopwords common to the online setting. We used a standard stopword list implemented in the ‘tm’ package for our study (https://rdrr.io/rforge/tm/man/stopwords.html) containing common stopwords. In addition, we added to this list our own list of stopwords relevant to the online setting (e.g., “goodbye”, “howdy”, “html”, “com”, “login”), and removed single letters and common two-letter combinations that would appear after removal of punctuation (e.g., “nt”).

Schofield et al. (2017) conducted simulation studies examining the impact of stopwords on interpretation and found little benefit in a continuously iterative process of updating a stoplist and then re-estimating the model. They argue that this approach does not impact the interpretation of the topics, which usually relies on inspection of the words that have the highest word-topic probabilities. They suggest that simply ignoring those stopwords, should they appear in the list (which they term “post-hoc stopword removal”), is entirely appropriate and does not invalidate the estimated model. Specifically, they note in their conclusion (pp. 435-436): “Generating a corpus-specific stoplist to remove rarer contentless words provides relatively little utility to training a model. To obtain the benefit of a stoplist, it suffices to remove the most frequent, obvious stopwords from a corpus without developing a specific stoplist for the problem setting. If these methods are not sufficient, we find that post-hoc stopword removal can significantly improve coherence while avoiding many of the efficiency and epistemological bias issues of iterative stoplist curation.”

Given that we have removed the most common stopwords as defined by well-used text mining packages, as well as words that are specific to the online setting, we believe our topics are robust to the inclusion of any additional ad hoc stopwords identified post hoc. Moreover, we believe our findings are robust since our focus was on the parenting contexts identified within the topics. Consequently, we were not limited to interpreting any of the top words for each topic but rather interrogated the posts to identify the parenting situations.

Reference: Schofield, A., Magnusson, M., Mimno, D. (2017). Pulling out the stops: Rethinking stopword removal for topic models. 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of Conference, 2, 432–436. https://doi.org/10.18653/v1/e17-2069.

Submitted filename: Response to reviewers v2.docx

29 Dec 2021

Text mining of Reddit posts: Using latent Dirichlet allocation to identify common parenting issues

PONE-D-21-07563R3

Dear Dr. 
Westrupp, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Ali B. Mahmoud, Ph.D. Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? 
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: (No Response) ********** 7. 
PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No 6 Jan 2022 PONE-D-21-07563R3 Text mining of Reddit posts: Using latent Dirichlet allocation to identify common parenting issues Dear Dr. Westrupp: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Ali B. Mahmoud Academic Editor PLOS ONE
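The "post-hoc stopword removal" that the authors cite from Schofield et al. (2017) in their final response can be pictured as filtering a fitted topic's ranked word list instead of re-estimating the model. The following is a minimal Python sketch under stated assumptions: a topic is represented as (word, probability) pairs, the probabilities are invented toy values, and the function name `posthoc_remove` is hypothetical. The extra stop words "day", "one" and "today" echo the general words Reviewer 1 pointed out in Table 1.

```python
def posthoc_remove(topic_words, stop_words, n_top=5):
    """Drop stop words from a fitted topic's ranked word list, then keep n_top."""
    ranked = sorted(topic_words, key=lambda pair: pair[1], reverse=True)
    kept = [word for word, _ in ranked if word not in stop_words]
    return kept[:n_top]


# Toy word-topic probabilities (invented for illustration only).
topic = [
    ("sleep", 0.12), ("day", 0.10), ("night", 0.09),
    ("one", 0.07), ("nap", 0.06), ("today", 0.05), ("bed", 0.04),
]
extra_stop_words = {"day", "one", "today"}
top_words = posthoc_remove(topic, extra_stop_words, n_top=3)
```

The fitted model is left untouched in this approach; only the displayed word list changes, which is why it avoids the repeated re-estimation that Reviewer 1's suggestion would require.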
