An observational analysis of the trope "A p-value of < 0.05 was considered statistically significant" and other cut-and-paste statistical methods.

Nicole M White¹, Thirunavukarasu Balasubramaniam², Richi Nayak², Adrian G Barnett¹.

Abstract

Appropriate descriptions of statistical methods are essential for evaluating research quality and reproducibility. Despite continued efforts to improve reporting in publications, inadequate descriptions of statistical methods persist. At times, reading statistical methods sections can conjure feelings of déjà vu, with content resembling cut-and-pasted or "boilerplate text" from already published work. Instances of boilerplate text suggest a mechanistic approach to statistical analysis, where the same default methods are being used and described using standardized text. To investigate the extent of this practice, we analyzed text extracted from published statistical methods sections from PLOS ONE and the Australian and New Zealand Clinical Trials Registry (ANZCTR). Topic modeling was applied to analyze data from 111,731 papers published in PLOS ONE and 9,523 studies registered with the ANZCTR. PLOS ONE topics emphasized definitions of statistical significance, software and descriptive statistics. One in three PLOS ONE papers contained at least one sentence that was a direct copy from another paper. 12,675 papers (11%) closely matched the sentence "a p-value < 0.05 was considered statistically significant". Common topics across ANZCTR studies differentiated between study designs and analysis methods, with matching text found in approximately 3% of sections. Our findings quantify a serious problem affecting the reporting of statistical methods and shed light on perceptions about the communication of statistics as part of the scientific process. Results further emphasize the importance of rigorous statistical review to ensure that adequate descriptions of methods are prioritized over relatively minor details such as p-values and software when reporting research outcomes.

Year: 2022 PMID: 35263374 PMCID: PMC8906599 DOI: 10.1371/journal.pone.0264360

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

An ideal statistical analysis uses appropriate methods to draw insights from data and inform the research questions. Unfortunately, many current statistical analyses are far from ideal, with researchers often using the wrong methods, misinterpreting the results, or failing to adequately check their assumptions [1]. Some researchers take a “mechanistic” approach to statistics, copying the few methods they know regardless of their appropriateness, and then going through the motions of the analysis [2]. Applying this form of methodological illiteracy is at odds with the principles of scientific inquiry, yet continues to pervade published scientific research [3]. This paradox has been exemplified during the COVID-19 pandemic, which has led to unprecedented levels of published research of largely poor quality [4, 5]. Many researchers lack adequate training in research methods, and statistics is something they do with trepidation and even ignorance [6, 7]. However, using the wrong statistical methods can cause real harm [6, 8] and bad statistical practices are being used to abet weak science [2]. Statistical mistakes are a key source of research waste and are contributing to the current reproducibility crisis in science [9]. Even when the correct methods are used, many researchers fail to describe them adequately, making it difficult to reproduce the results [10, 11]. Poor statistical methods might not be caught by reviewers, as they may not be qualified to judge the statistics. A recent survey of editors found that only 23% of health and medical journals used expert statistical review for all articles [12], which was little different from a survey from 22 years ago [13]. There is guidance for researchers on how to write up their statistical methods and results. 
The International Committee of Medical Journal Editors recommend that researchers should: “Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to judge its appropriateness for the study and to verify the reported results” [14]. More detailed guidance is given by the SAMPL and EQUATOR guidelines [15, 16] covering all aspects of research reporting tailored to different study designs. Both of these guidelines were led by Doug Altman, who spoke for many years about the need for better statistical reporting. The awareness and use of these guidelines could be improved. There were 303 Google Scholar citations to the SAMPL paper (as at 8 October 2021) which is a good citation number for most papers, but is low considering the millions of papers that use statistical analysis. A potential contributor to poor reporting is the temptation for researchers to re-use descriptions of the same default statistical methods, to make their papers resemble those of their peers and increase perceived chances of publication [17]. As these default choices become more common, valid criticism by reviewers and journal editors becomes increasingly difficult, as past use may be argued by researchers as offering precedent for the conduct of analysis within their discipline [18]. Two statisticians on this paper (AB and NW) have heard researchers admit that they have copied-and-pasted their statistical methods sections from other papers. To investigate the extent of this practice, we applied topic modelling to analyze text within statistical methods sections, as part of published journal articles and clinical trials protocols. Modelling results were used to estimate the extent that researchers are using cut-and-paste or ‘boilerplate’ statistical methods sections. Boilerplate text is that “which can be reused in new contexts or applications without significant changes to the original” [19]. 
The use of boilerplate text indicates that researchers are emphasizing the same details about chosen statistical analyses, and potentially giving little thought to the conduct and transparent reporting of the statistical methods used.

Materials and methods

Data sources

We used two openly available data sources to find statistical methods sections: research articles published in PLOS ONE and study protocols registered on the Australian and New Zealand Clinical Trials Registry (ANZCTR). Data sources were chosen as examples of common research outputs that include descriptions of statistical methods that are either planned or were used for analyzing studies.

Public Library of Science (PLOS ONE)

PLOS ONE is an open access mega-journal that publishes original research across a wide range of scientific fields. Article submissions are handled by an academic editor who selects peer reviewers based on their self-nominated area(s) of expertise. Currently there are 324 academic editors out of 9,648 (3%) with the keywords of “statistics (mathematics)” or “statistical methods” in their expertise list (web search on 25-May-2021, https://journals.plos.org/plosone/static/editorial-board). Submissions do not undergo formal statistical review. Instead, reviewers are required to assess submissions against several publication criteria, including whether: “Experiments, statistics, and other analyses are performed to a high technical standard and are described in sufficient detail” [20]. All reviewers are asked the question: “Has the statistical analysis been performed appropriately and rigorously?”, with the possible responses of “Yes”, “No” and “I don’t know”. In September 2019, author instructions were updated to allow citations of established materials, methods and protocols, provided sufficient details are given for approaches to be understood independently of chosen references [21]. Authors are encouraged to follow published reporting guidelines such as EQUATOR, to ensure that chosen statistical methods are appropriate for the study design, and adequate details are provided to enable independent replication of results.

Data on all PLOS ONE articles can be accessed via the PLOS Application Programming Interface (API). This enabled us to conduct searches of full-text articles and analyze data on articles’ text content and general attributes such as publication date and field(s) of research. All available papers regardless of publication date were considered. We applied a two-step approach to identify statistical methods sections:

Step 1: Targeted API searches were completed using the R package ‘rplos’ [22]. Search queries targeted analysis-related terms, combining the words “data” or “statistical” with one of: “analysis”, “analyses”, “method”, “methodology” or “model(l)ing”. Terms could appear anywhere within the main body of the article, to account for the placement of relevant text in different sections, for example, in the Materials and Methods section versus Results. Search results were indexed by a unique Digital Object Identifier (DOI). Attribute data collected per DOI included journal volume and subject classification(s).

Step 2: PLOS ONE does not use standardized headings to preface statistical methods sections. To address this, we performed partial matching on available headings against frequently used terms in initial search results: ‘Statistical analysis’, ‘Statistical analyses’, ‘Statistical method’, ‘Statistical methods’, ‘Statistics’, ‘Data analysis’ and ‘Data analyses’.

All available data were downloaded on 3 July 2020. Code to complete steps 1 and 2 is available at https://github.com/agbarnett/stats_section/code/plosone.
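The partial heading match in Step 2 can be illustrated with a minimal Python sketch. This is not the authors' code (their pipeline is written in R and linked above); the function names, the normalisation rule and the example headings are assumptions made for illustration only. Only the list of heading terms is taken from the text.

```python
import re

# Heading terms listed in Step 2 of the text.
HEADING_TERMS = [
    "statistical analysis", "statistical analyses",
    "statistical method", "statistical methods",
    "statistics", "data analysis", "data analyses",
]

def normalise(heading):
    """Lower-case a heading and collapse punctuation and whitespace (assumed rule)."""
    heading = re.sub(r"[^a-z0-9 ]+", " ", heading.lower())
    return re.sub(r"\s+", " ", heading).strip()

def is_stats_heading(heading):
    """Partial match: True if any target term occurs anywhere in the heading."""
    cleaned = normalise(heading)
    return any(term in cleaned for term in HEADING_TERMS)

def find_stats_sections(sections):
    """Keep only the sections whose heading partially matches a target term."""
    return {h: text for h, text in sections.items() if is_stats_heading(h)}

# Hypothetical article, represented as heading -> section text.
article = {
    "Introduction": "...",
    "2.4 Statistical analysis": "Data are presented as mean and SEM.",
    "Data analysis and software": "All analyses used R.",
    "Results": "...",
}
print(list(find_stats_sections(article)))
```

Because the match is partial, a heading such as "Descriptive statistics" would also be selected; whether that is desirable depends on how inclusive the search is meant to be.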

Australia and New Zealand Clinical Trials Registry (ANZCTR)

The ANZCTR was established in 2005 as part of a coordinated global effort to improve research quality and transparency in clinical trials reporting; observational studies can also be registered. All studies registered on ANZCTR are publicly available and can be searched via an online portal (https://www.anzctr.org.au). Details required for registration follow a standardized template [23], which covers participant eligibility, the intervention(s) being evaluated, study design and outcomes. The information provided must be in English. Studies are not peer reviewed. For the statistical methods section, researchers are asked to provide a brief description of sample size calculations, statistical methods and planned analyses, although this section is not compulsory [23]. Studies are reviewed by ANZCTR staff for completeness of key information, which does not include the completeness of the statistical methods sections. All studies available on ANZCTR were downloaded on 1 February 2020 in XML format. For our analysis, we used all text available in the “Statistical methods” section. We also collated basic information about the study including the study type (interventional or observational), submission date, number of funders and target sample size. These variables were chosen as we believed they might influence the completeness of the statistical methods section. For example, we hypothesized that larger studies and those with funding would have more complete sections. We were also interested in changes over time. Studies prior to 2013 were excluded as the statistical methods section appeared to have been introduced in 2013. Some studies were first registered on the alternative trial database clinicaltrials.gov and then also posted to ANZCTR. We excluded these studies because almost all of them had no completed statistical methods section, as this section is not included in clinicaltrials.gov.

Statistical methods

Full-text processing

Text cleaning aimed to standardize notation and statistical terminology, whilst minimizing changes to article style and formatting. R code used for data extraction and cleaning is available from https://github.com/agbarnett/stats_section. Mathematical notation was converted from Unicode characters to plain text. Symbols outside of Unicode blocks including ‘%’ (percent) and ‘<’ (‘less-than’) were converted into plain text. General formatting was removed, including carriage returns, punctuation marks, in-text references (e.g. “[42]”), centered equations, and other non-ASCII characters. Bracketed text was retained with brackets removed to maximize content for analysis. Stop words including pronouns, contractions and selected prepositions were removed. We retained selected stop words that, if excluded, may have changed the context of statistical methods being described, for example ‘between’ and ‘against’. We compiled an extensive list of statistical terms to standardize reported descriptions of statistical methods. An initial list was compiled by calculating individual word frequencies and identifying relevant terms. Extra terms were sourced from index searches of three statistics textbooks [24-26]. Plurals (e.g., ‘chi-squares’), unhyphenated terms (e.g., ‘chi square’) and combined terms (e.g., ‘chisquare’) were transformed to singular, hyphenated form (e.g., ‘chi-square’). Common statistical tests were also hyphenated (e.g., ‘hosmer lemeshow’ to ‘hosmer-lemeshow’).
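As an illustration of the cleaning steps described above, the following Python sketch applies them to a single sentence. It is a simplified stand-in for the authors' R code, not a reproduction of it: the stop-word list is a short hypothetical one, only two term patterns are included, and the plain-text forms for ‘±’ and ‘=’ are assumed from the forms that appear in Table 1.

```python
import re

# Symbol-to-text conversions; '%' and '<' are named in the text, the other two
# follow the forms visible in Table 1 (assumption).
SYMBOLS = {"%": " percent ", "<": " less-than ", "±": " plus-or-minus ", "=": " equal-to "}

# Two of the term standardizations given as examples in the text.
TERM_PATTERNS = [
    (re.compile(r"\bchi[\s-]?squares?\b"), "chi-square"),
    (re.compile(r"\bhosmer[\s-]?lemeshow\b"), "hosmer-lemeshow"),
]

# Hypothetical short stop-word list; 'between' and 'against' are deliberately
# absent so that they are retained, as described in the text.
STOP_WORDS = {"the", "a", "an", "of", "we", "our", "it", "was", "were", "is", "are"}

def clean_text(text):
    """Standardize one piece of statistical methods text."""
    text = text.lower()
    # Remove in-text references such as [42], [4, 5] or [24-26].
    text = re.sub(r"\[\d+(?:\s*[,-]\s*\d+)*\]", " ", text)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    for pattern, standard in TERM_PATTERNS:
        text = pattern.sub(standard, text)
    # Keep words and numbers, allowing internal hyphens and decimal points;
    # brackets, other punctuation and non-ASCII characters are dropped.
    tokens = re.findall(r"[a-z0-9]+(?:[.\-][a-z0-9]+)*", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_text("The Chi squares test was used [42]; p < 0.05 (two-sided)."))
```

Bracketed content such as "(two-sided)" survives with the brackets removed, which mirrors the stated aim of maximizing content for analysis.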

Analysis of missing statistical methods sections

Statistical methods sections were missing for some studies downloaded from ANZCTR, including sections labelled as “Not applicable”, “Nil” or “None”. Since these studies would be excluded from topic modeling, we examined whether the statistical methods section was more likely to be missing for particular types of study. Analysis considered a logistic regression model estimated in the Bayesian framework ([27]; www.r-inla.org), with missing statistical methods section (yes/no) as the dependent variable. The independent variables were date, study type, number of funders and target sample size, which was log2-transformed because of a large positive skew. Results were reported as odds ratios with 95% credible intervals.

Topic modelling

Text from statistical methods sections was analyzed using Non-Negative Matrix Factorization (NMF). NMF is an established approach for topic modelling, and provides an effective solution for text-based clustering when dealing with high-dimensional data [28, 29]. For N studies, let $P \in \mathbb{R}^{N \times M}$ denote a content matrix of text from statistical methods sections, comprising M unique terms. Text clustering algorithms for identifying common topics across studies require P to be represented with a vector space model. In our case, unique terms in P are modelled using the tf-idf (term frequency × inverse document frequency) weighting schema, to account for the relative importance of common and rare terms. A common problem facing text clustering algorithms is the curse of dimensionality due to the high number of terms in the doc × term matrix representation [30, 31]. Text-based methods based on distance, density or probability therefore face difficulties in high-dimensional settings [32-34]. Specifically, the difference between the distances to near and far points becomes negligible [31]. This behavior directly affects the performance of distance-based clustering methods such as k-means [35] in accurately identifying subgroups (topics) present in the data. Furthermore, sparseness associated with high-dimensional matrix representations does not allow for differentiation between topics based on density differences [32, 36]. To address these limitations, NMF deals with high-dimensional data by mapping it to a lower-dimensional space. This mapping is achieved by approximating P with two non-negative factor matrices, $W \in \mathbb{R}^{M \times g}$ and $H \in \mathbb{R}^{N \times g}$ [31], such that $P \approx H W^{\top}$. The number of subgroups of common topics inferred from the data is given by g. The matrix factorization process estimates the lower-dimensional non-negative factor matrices W and H such that they represent the high-dimensional P with the least error.

Estimation of W and H is achieved by optimizing an objective function; for NMF, the Frobenius norm is used, equivalent to minimizing the sum of squared approximation errors over all elements of P:

$$\min_{W \ge 0,\; H \ge 0} \; \lVert P - H W^{\top} \rVert_F^2 = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( P_{ij} - (H W^{\top})_{ij} \right)^2 .$$

Following estimation, H contains the information regarding topic membership for all studies. In our case, topic membership (1, …, g) for a statistical methods section is inferred from the maximum coefficient value in the corresponding row of H, also known as the topic coherence score. For our two datasets, we applied NMF with g = 10 topics.
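To make the factorization concrete, the sketch below builds a small tf-idf matrix and fits NMF on a toy corpus in plain Python. Several things here are assumptions rather than details from the paper: the matrix orientation (rows of P and H index sections, rows of W index terms), the tf-idf variant (raw count times ln(N/df) + 1), the solver (Lee and Seung style multiplicative updates; the paper does not state which solver was used) and the toy sentences. It is meant only to show the mechanics of minimizing the squared Frobenius error and reading topic membership from the rows of H.

```python
import math
import random

def tfidf(docs):
    """Return (matrix, vocabulary); rows are documents, columns are terms."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    doc_freq = {t: sum(t in doc for doc in tokenised) for t in vocab}
    matrix = [[doc.count(t) * (math.log(len(docs) / doc_freq[t]) + 1.0) for t in vocab]
              for doc in tokenised]
    return matrix, vocab

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def frobenius_error(p, h, w):
    """Squared Frobenius norm of P - H W^T."""
    approx = matmul(h, transpose(w))
    return sum((p[i][j] - approx[i][j]) ** 2
               for i in range(len(p)) for j in range(len(p[0])))

def nmf(p, g, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF: P (N x M) is approximated by H (N x g) times W^T."""
    rng = random.Random(seed)
    n, m = len(p), len(p[0])
    h = [[rng.random() + 0.1 for _ in range(g)] for _ in range(n)]
    w = [[rng.random() + 0.1 for _ in range(g)] for _ in range(m)]
    pt = transpose(p)
    for _ in range(n_iter):
        # H <- H * (P W) / (H W^T W), element-wise
        num, den = matmul(p, w), matmul(h, matmul(transpose(w), w))
        h = [[h[i][k] * num[i][k] / (den[i][k] + eps) for k in range(g)] for i in range(n)]
        # W <- W * (P^T H) / (W H^T H), element-wise
        num, den = matmul(pt, h), matmul(w, matmul(transpose(h), h))
        w = [[w[j][k] * num[j][k] / (den[j][k] + eps) for k in range(g)] for j in range(m)]
    return h, w

def assign_topics(h):
    """Topic membership: index of the largest coefficient in each row of H."""
    return [max(range(len(row)), key=row.__getitem__) for row in h]

# Toy corpus (hypothetical, already cleaned).
docs = ["spss software version", "spss software used",
        "anova tukey test", "anova tukey comparisons"]
p, vocab = tfidf(docs)
h, w = nmf(p, g=2)
print(assign_topics(h), round(frobenius_error(p, h, w), 3))
```

On a corpus like this, one would expect the two software sentences and the two ANOVA sentences to receive different topic labels, although which label is 0 and which is 1 depends on the random initialization. For data on the scale described in the text, a sparse-matrix implementation from an established library would be the practical choice.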

Content analysis

Results were summarized by word clouds and n-gram analysis to identify frequently occurring terms within topics. Evidence of boilerplate text was assessed at the section and sentence levels using a modified version of the Jaccard similarity index. We chose the Jaccard index as an easy to interpret measure; for two pieces of tokenized text A and B, we defined the similarity score as J(A, B) = |A∩B|/|B|. Calculating similarities relative to a target piece of text (B) allowed us to identify instances of similar text either as a complete sentence, or embedded within larger sentences. Analyses considered text tokenized at the word level, with locality-sensitive hashing applied to reduce the number of pairwise comparisons [37]. Instances of boilerplate text were defined by a Jaccard index of 0.9 or higher.
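The modified Jaccard score defined above is simple to compute. The sketch below is a minimal Python illustration that treats each piece of text as a set of whitespace-separated word tokens; the set-based treatment, the function names and the example sentences are assumptions, and the locality-sensitive hashing step used to limit pairwise comparisons is not shown.

```python
def tokenise(text):
    """Word-level tokens of already-cleaned text (assumed whitespace-separated)."""
    return text.lower().split()

def modified_jaccard(candidate, target):
    """J(A, B) = |A ∩ B| / |B|, where B is the target text."""
    a, b = set(tokenise(candidate)), set(tokenise(target))
    if not b:
        raise ValueError("target text must contain at least one token")
    return len(a & b) / len(b)

def is_boilerplate(candidate, target, threshold=0.9):
    """Boilerplate is defined in the text by a score of 0.9 or higher."""
    return modified_jaccard(candidate, target) >= threshold

target = "a p-value less-than 0.05 was considered statistically significant"
embedded = "for all tests a p-value less-than 0.05 was considered statistically significant"
variant = "a p-value less-than 0.01 was considered significant"

print(modified_jaccard(embedded, target))  # 1.0: target is embedded in a longer sentence
print(modified_jaccard(variant, target))   # 0.75: 6 of the 8 target tokens are shared
```

Dividing by |B| rather than |A∪B| is what lets a target sentence score 1.0 when it is embedded in a longer sentence, whereas the standard Jaccard index would penalize the extra words.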

Results

Public Library of Science (PLOS ONE)

Targeted keyword searches using the PLOS ONE application programming interface (API) returned 131,847 papers, of which 111,731 (85%) included a statistical methods section (S1 Fig). In the final sample, 94,608 (85%) papers returned an exact match against one or more common section headings: 63,982 for ‘statistical analysis’, 13,343 for ‘statistical analyses’ and 13,510 for ‘data analysis’. All papers included “Biology and life sciences” (n = 107,584), “Earth sciences” (n = 7,605) and/or “Computer and information sciences” (n = 5,190) in their top 3 subject classifications.

Statistical methods sections had a median length of 129 words and inter-quartile range of 63 to 258 words. 7,701 articles (7%) had a statistical methods section of 500 words or more. 19,077 articles (17%) had statistical methods sections with 50 words or less, equal to the length of this paragraph.

Topics reflected the use of statistical software (topics 3 and 5), descriptive statistics (topic 6), group based hypothesis testing (topics 1 and 4) and statistical significance (topics 1 and 9) (Fig 1). Also identified were topics related to regression (topic 2), meta-analysis (topic 7) and experimental designs (topic 10). At the section level, 528 studies (0.47%) were a direct cut-and-paste from another paper; 37,333 studies (33%) included at least one exact match at the sentence level.

Fig 1

Word clouds for ten topics for statistical methods sections published in PLOS ONE.

Definitions of statistical significance at α = 0.05 were the most common form of boilerplate text, found in approximately 1 in 10 of all included studies (Table 1). Topic 1 (n = 3,775) combined statistical significance with Student’s t-test. Topic 9 (n = 6,104) focused on multiple thresholds for declaring statistical significance such as “*p < 0.05, **p < 0.01 and ***p < 0.001”, a practice that has been criticized [38]. Minor variations of this phrase were identified in 40% of all studies assigned to this topic.

Table 1

Examples of boilerplate text from PLOS ONE papers based on targeted n-gram searches (sentence level).

Topic	Statistical methods text	Potential matches	Jaccard score, median (IQR)	Boilerplate
1	Statistical analysis was performed using student t-test	3,015	0.5 (0.5 to 0.75)	189
2	Continuous variables were expressed as mean plus-or-minus standard deviation	1,228	0.82 (0.73 to 0.91)	494
2	Categorical variables were expressed as frequencies and percentages	643	0.75 (0.63 to 0.88)	38
3	All statistical analysis was performed using Graphpad Prism software	6,844	0.56 (0.44 to 0.78)	263
4	One-way analysis of variance (ANOVA) was used for multiple comparisons and a Tukey post-hoc test was applied where appropriate	6,660	0.43 (0.33 to 0.52)	6
5	Statistical analysis was performed using SPSS version 17.0 SPSS Inc Chicago IL USA	9,005	0.58 (0.42 to 0.75)	539
6	Data are expressed as mean plus-or-minus SEM	4,455	0.78 (0.67 to 0.89)	321
7	Summary estimates including 95 percent confidence intervals (CIs) were calculated	4,057	0.4 (0.3 to 0.5)	6
8	The significance level was set at p equal-to 0.05	3,397	0.5 (0.5 to 0.7)	262
9	p less-than 0.05 p less-than 0.01 **p less-than 0.001	5,559	0.83 (0.83 to 0.92)	2,510
10	All data are representative of at least three independent experiments	1,722	0.6 (0.47 to 0.7)	83
All topics	A p-value less-than 0.05 was considered statistically significant	64,639	0.6 (0.5 to 0.8)	12,675
	Data are presented as mean plus-or-minus SEM	33,471	0.67 (0.67 to 0.78)	1,648
	Statistical analysis was performed using Student’s t-test	44,699	0.5 (0.38 to 0.63)	1,043

N-grams are marked in bold. Potential matches refers to the number of studies that contained the target n-gram at least once. Boilerplate text was defined by a Jaccard score of 0.9 or higher. IQR: Inter-quartile range.

Statistical software topics differentiated between GraphPad Prism (topic 3: n = 9,879) and SPSS (topic 5: n = 9,574). Targeted searches for the n-gram “GraphPad Prism” returned 6,844 potential matches, including 263 studies that used the boilerplate text “statistical analysis was performed using GraphPad Prism” (Table 1). Common variants included software version (e.g. “version 5.0 for windows”) and location information (e.g. “La Jolla/San Diego CA USA”). Similar instances were identified for “SPSS” in topic 5, with 539 out of 9,005 studies (6%) identified as boilerplate text. Software details in both topics were frequently paired with hypothesis testing methods and definitions of statistical significance (S2 Fig).

Boilerplate text for descriptive statistics reflected the presentation of data as means plus or minus standard errors or standard deviations (Topic 6: 321/4,746 studies; 6.7%). In topic 2, an example of recycled text was “Continuous variables were expressed as mean ± standard deviation” (494 studies; 2.5%). Similar to other topics, descriptions were often paired with univariate hypothesis tests followed by more complex analyses, software and statements of statistical significance (S3 Fig).

Australia and New Zealand Clinical Trials Registry (ANZCTR)

We downloaded 28,008 studies and found that 9,523 (34%) had a completed statistical methods section (S1 Fig). The median length of sections was 136 words with an inter-quartile range of 74 to 230 words. Eight studies were only one word, including “ANOVA”, “SPSS” and even “SSPS”. Observational studies were less likely to have a missing statistical methods section compared with interventional studies (Table 2). Missing sections became less likely over time. Studies with more funders and a larger target sample size were less likely to have a missing statistical methods section.

Table 2

Logistic regression results for study characteristics associated with missing statistical methods sections in ANZCTR.

Variable	Odds ratio	95% CI
Study type = Observational	0.78	(0.69, 0.89)
Date (per year)	0.90	(0.88, 0.91)
Number of funders	0.80	(0.74, 0.86)
Target sample size (per doubling)	0.90	(0.88, 0.92)
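Because target sample size entered the model on the log2 scale, its odds ratio in Table 2 applies to each doubling of sample size. The short Python sketch below shows how such a per-doubling odds ratio translates to other fold changes; the helper name is hypothetical and the calculation assumes the effect is linear in log2(sample size), as the model form implies.

```python
import math

def odds_ratio_for_fold_change(or_per_doubling, fold_change):
    """Odds ratio implied when the predictor is multiplied by fold_change,
    given an odds ratio per doubling from a model that uses log2(x)."""
    return or_per_doubling ** math.log2(fold_change)

# Table 2: odds ratio of 0.90 per doubling of target sample size.
print(odds_ratio_for_fold_change(0.90, 2))   # one doubling: 0.90
print(odds_ratio_for_fold_change(0.90, 4))   # two doublings: 0.90 squared, about 0.81
print(odds_ratio_for_fold_change(0.90, 10))  # tenfold increase: about 0.70
```

Under this reading, a study with ten times the target sample size of another would have roughly 30% lower odds of a missing statistical methods section, holding the other variables fixed.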

Since studies registered with ANZCTR described planned analyses, we hypothesized that some studies did not specify statistical methods because they had yet to consult with a statistician. Targeted searches for “statistician” across all topics returned 381 studies, with examples including “Statistical analysis will be done in collaboration with a statistician” and “Pilot study at this point will use a statistician professionally to determine sample size calculations as required”. Topic modelling results reflected sample size calculations (topics 2 and 5), study designs (topics 4, 5, 6 and 8), quantitative methods (topics 3, 7, 9 and 10) and qualitative methods (topic 1) (Fig 2).

Fig 2

Word clouds for ten topics for statistical methods sections published in ANZCTR.

Evaluation of boilerplate text revealed that sections from 484 studies (5.1%) were close matches and 251 (2.6%) were an exact cut-and-paste from another study (Table 3). At the sentence level, the proportion of studies with shared text varied by topic, from 12% in topic 5 (pilot studies) to 38% in topic 3 (Student’s t-test) (S4 Fig).

Table 3

Results of boilerplate analysis applied to the ANZCTR dataset.

Topic (number of studies)	Word count, median (IQR)	Sentences, median (IQR)	Matching studies: section	Matching studies: 1+ sentences
1: Qualitative methods (842)	116 (58 to 207)	6 (3 to 10)	46 (23)	196 (171)
2: Sample size calculations (1,753)	147 (92 to 231)	6 (3 to 9)	40 (22)	311 (259)
3: Student’s t-test (923)	119 (75 to 178)	6 (4 to 8)	62 (32)	354 (292)
4: Efficacy and safety studies (871)	174 (97 to 268)	7 (4 to 12)	56 (7)	190 (162)
5: Pilot studies (737)	78 (40 to 129)	4 (2 to 6)	39 (24)	88 (78)
6: Safety and tolerability studies (507)	127 (73 to 220)	6 (4 to 10)	40 (23)	182 (159)
7: Descriptive analysis (328)	39 (20 to 65)	2 (1 to 4)	43 (41)	59 (57)
8: Intervention studies (826)	174 (98 to 275)	7 (4 to 11)	14 (6)	129 (106)
9: Linear models (1,728)	172 (95 to 298)	7 (4 to 12)	85 (44)	554 (486)
10: Analysis of variance (1,008)	131 (76 to 214)	5 (3 to 9)	59 (29)	236 (209)

The number of studies with Jaccard similarity scores greater than or equal to 0.9 from pairwise comparisons are presented; the number of studies with cut-and-pasted text is given in brackets.

Thematic analysis of n-grams differentiated between study designs and statistical methods topics (S5 Fig). At the n-gram level, we noted the use of similar methods across multiple topics. For example, while topic 3 (Student’s t-test) was dominated by mentions of group-based hypothesis tests as expected, the same topic also referenced linear modelling/regression methods and descriptive statistics. Similarly, the use of linear modelling/regression methods was referenced across multiple topics covering quantitative and qualitative methods. Among study design topics, matching sentences highlighted the planned use of intention-to-treat analysis and descriptive statistics. For topic 6 (safety and tolerability studies), approximately 1 in 3 studies had evidence of boilerplate text at the sentence level, which included different combinations of summary statistics for presenting study variables. In contrast, topic 4 (efficacy and safety studies) returned 211 matches against the n-gram “95 percent”; subsequent review identified 28 studies that were close matches to the phrase “at a confidence level of 95% and a precision around the estimate of 5%, a minimum of 73 patients will be included”. Among methods topics, definitions of statistical significance were a recurring theme. Some topics simply stated the main analysis method, for example, “descriptive statistics” (topic 7; 16 exact matches). Examples of close matching sentences and Jaccard similarity scores are given in Table 4. Finally, we noted the use of the same methods among subgroups of topics. For example,

Table 4

Example boilerplate text from ANZCTR studies with the highest number of matches per topic (sentence level).

Topic	Statistical methods text	Potential matches	Jaccard score, median (IQR)	Boilerplate
1	All analyses will be conducted on an intention-to-treat basis	153	0.55 (0.52 to 0.73)	11
2	The sample size is adjusted for a 10% drop-out rate	1,224	0.42 (0.33 to 0.5)	9
3	Continuous normally distributed variables will be compared using student t-test and reported as means standard deviation while non-normally distributed data will be compared using wilcoxon rank-sum tests and reported as medians inter-quartile range	134	0.32 (0.2 to 0.4)	8
4	At a confidence level of 95 percent and a precision around the estimate of 5% a minimum of 73 patients will be included	211	0.46 (0.33 to 0.58)	28
5	No formal sample size calculation was performed	163	0.43 (0.43 to 0.57)	4
6	Continuous variables will be summarized by mean standard deviation median minimum and maximum	65	0.77 (0.69 to 0.85)	15
7	Descriptive statistics will be used	246	0.8 (0.55 to 0.8)	69
8	Analyses will be conducted on an intention-to-treat basis	149	0.55 (0.46 to 0.73)	16
9	Linear mixed models will be used to analyze the data	238	0.6 (0.6 to 0.8)	20
10	Data will be analyzed using standardised non-parametric or parametric statistical methods where appropriate (using) repeated measures ANOVA	206	0.29 (0.24 to 0.35)	5
All topics	A p-value less-than 0.05 will be considered statistically significant	1,967	0.55 (0.36 to 0.73)	267
	Analyses will be conducted on an intention-to-treat basis	1,630	0.6 (0.5 to 0.7)	191
	Baseline characteristics will be summarised using descriptive statistics	1,375	0.5 (0.5 to 0.63)	23

The number of boilerplate matches for each sentence was based on a Jaccard score of 0.9 or higher. Potential matches refers to the number of studies that contained the target n-gram at least once.

Discussion

The aim of our analysis was to identify common themes in statistical methods sections both in terms of chosen methods and how these methods are being communicated. Our findings provide evidence of boilerplate statistical methods sections, resulting from likely cut-and-pasting and slight modifications to existing text descriptions. Results from topic modeling further identified distinct themes across statistical methods sections that emphasised details about study design, chosen methods, p-values and software. This is a strong sign of the ritualistic practice of statistics where researchers go through the motions rather than using conscientious practice [2]. Despite the extensive array of statistical tests available, our results show that authors are often reporting the same few methods. In related work, a content-based analysis of ecology and conservation journals summarised trends in linear modelling using n-grams including “t-test”, “ANOVA” and “regression”; results provided evidence of a movement towards model-based inference [39]. We found that Student’s t-test and ANOVA were commonly cited methods for comparing groups in both PLOS ONE and ANZCTR datasets. For statistical methods sections in PLOS ONE, we also found that many studies followed a generic template, combining chosen statistical methods with descriptive statistics for summarizing data, statements of statistical significance and/or choice of software. When investigating cases of boilerplate text, results based on n-grams versus close matches at the sentence level varied considerably by topic. These findings suggest that there is a tendency for researchers to default to the same common statistical methods when completing analyses, in line with the view of statistical analysis as a mechanistic process. However, for studies that use the same statistical methods, text used to describe important details may vary. 
Defining statistical significance at p < 0.05 was the most common example of boilerplate text in both datasets. The widespread use of statistical significance is troubling given the bright-line thinking it engenders [40] and the common misinterpretations of p-values [41, 42]. Nonetheless, conflicting views about the use of statistical significance remain. In a follow-up survey of signatories to an article calling for the end of statistical significance [43], 22% of respondents said they were likely to continue using the concept in future publications [44]. Reasons cited included the mindful use of statistical significance in combination with other evidence and concerns about the feasibility of abandoning statistical significance given its engrained usage in published literature. At the same time, null hypothesis significance testing has been cited as a root cause fueling the reproducibility crisis, and a problem that has been difficult to shift [45]. Two topics identified in the PLOS ONE dataset highlighted statistical software. Similarly, some sections extracted from ANZCTR only stated the software, implying that this was the primary criterion for statistical analysis. As Doug Altman said, “Many people think that all you need to do statistics is a computer and appropriate software” [6]. This is far from the truth, and whilst it is important for researchers to mention the software and version used for reproducibility purposes, it is a minor detail compared with explaining which methods were used and why. One reason inadequate methods sections get published is because many journals do not use statistical reviewers, despite empirical evidence showing they improve manuscript quality [12]. It is possible that the exact details of statistical methods are viewed as relatively unimportant by authors and reviewers, and something that can be read last or even skipped [46]. Some journals foster this lack of importance by putting the methods section last. 
Statistical methods sections may be getting less scrutiny than other sections because of their position in the paper, their relatively low word counts, and because they so often contain boilerplate text. Another potential reason that authors resort to boilerplate text is the overly critical approach to statistics taken by some reviewers, who pounce on anything outside the accepted dogma [47]. Whilst checklists are a useful tool to improve statistical reporting, peer review by nonstatistical reviewers and editors cannot replace expert appraisal of the appropriateness of the statistical methods used [48].

Mechanisms to encourage authors to share their analysis code would provide an alternative route for checking what statistical methods were used. This is not a perfect solution, as we still want authors to accurately report their methods in their paper, but it does increase transparency. A recent paper found that code sharing was very low in biomedical papers, with just 2% of a sample of over 6,000 papers sharing code [49]. The introduction of incentives for code sharing, such as article badges, has to date shown limited efficacy [50]; however, further research in this area may offer potential solutions for promoting reproducibility.

Our approach for identifying boilerplate text was not intended as a form of plagiarism detection, but rather as evidence of standardised descriptions being used. For simple study designs, a boilerplate description might be adequate to promote consistency in reporting and meet reporting requirements. For example, ANZCTR sections commonly reported sample size justifications and planned analyses using intention-to-treat principles. Beyond statistical methods sections, initiatives such as 2WeekSR have been developed to streamline the completion of systematic reviews, including the use of automation to generate consistent descriptions of results suitable for use in papers [51].
However, if boilerplate descriptions are to be used, they must provide readers with sufficient details to confirm that appropriate methods were used and enable independent verification of results. Unfortunately, this is not always the case. For example, a study of papers that used ANOVA found 95% did not contain the information needed to determine what type of ANOVA was performed. This lack of information could well be because the authors used a boilerplate statistical methods section that was missing key details.

Our analysis focused on studies with a clearly marked statistical methods section, based on predefined section headings. It is therefore possible that some of the papers excluded from our analysis conducted statistical analyses but placed descriptions elsewhere. For PLOS ONE, excluded papers may have described statistical methods as part of the supplementary material, which tends to be less structured than the main text. Similarly, since submissions to both PLOS ONE and ANZCTR do not undergo compulsory statistical review, our results may not be generalizable to all journals and registries, especially those that consistently use a statistical reviewer. Given the large sample sizes for both datasets, it was not feasible to check whether papers used the correct methods.

Inclusion flowchart for studies downloaded for each case study.

A: PLOS ONE; B: ANZCTR. For (A), “Non-specific analysis” refers to studies where the use of statistical methods could not be determined based on section headings; e.g., “Microarray analysis”.

Common combinations of statistical methods in topic related to the use of GraphPad Prism (topic 3) and SPSS (topic 5).

General themes for statistical methods were based on targeted word searches and categorized into statistical significance, descriptive statistics, parametric hypothesis tests, nonparametric hypothesis tests, linear modelling/regression and software. The most frequent combinations of themes are given on the x-axis, with the corresponding number of studies on the y-axis.
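The theme assignment described in this caption can be sketched in a few lines. This is only an illustration under stated assumptions: the search terms in `THEME_PATTERNS` are hypothetical examples (the actual word lists are not given here), and the function names `themes_in` and `count_combinations` are invented for this sketch.

```python
import re
from collections import Counter

# Hypothetical search terms for each theme named in the caption.
# The real word lists used in the analysis are not reproduced here.
THEME_PATTERNS = {
    "statistical significance": r"\bp\s*-?\s*value|statistically significant",
    "descriptive statistics": r"\bmean\b|\bmedian\b|standard deviation",
    "parametric hypothesis tests": r"t-test|anova",
    "nonparametric hypothesis tests": r"mann-whitney|wilcoxon|kruskal-wallis",
    "linear modelling/regression": r"regression|linear model",
    "software": r"\bspss\b|graphpad|\bsas\b|\bstata\b",
}


def themes_in(text):
    """Return the tuple of themes whose search terms appear in the text."""
    lowered = text.lower()
    return tuple(
        theme
        for theme, pattern in THEME_PATTERNS.items()
        if re.search(pattern, lowered)
    )


def count_combinations(sections):
    """Count how many sections share each combination of themes."""
    return Counter(themes_in(section) for section in sections)


example = (
    "Data are presented as mean and standard deviation. Groups were compared "
    "with Student's t-test using GraphPad Prism. A p-value < 0.05 was "
    "considered statistically significant."
)
print(themes_in(example))
```

Sorting the counter returned by `count_combinations` with `most_common()` would give the kind of ranking shown on the x-axis of the figure, with the counts on the y-axis.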

Common combinations of statistical methods in topic related to descriptive statistics (topics 2 and 6).

Total matching sentences by topic for the ANZCTR dataset.

A match was defined as any pair of sentences between ANZCTR studies with a Jaccard score equal to 0.9 or higher.
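The matching rule in this caption can be illustrated with a short sketch. It assumes that each sentence is reduced to a set of lowercase word tokens before the Jaccard score (shared tokens divided by all distinct tokens) is computed. The tokenizer and the function names `tokenize`, `jaccard` and `is_match` are illustrative assumptions and not the authors' code; only the 0.9 threshold comes from the text.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphanumeric tokens.

    This tokenization is an assumption made for the sketch.
    """
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard score: size of the intersection over size of the union."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_match(a, b, threshold=0.9):
    """Flag a pair of sentences as a match at the 0.9 threshold from the text."""
    return jaccard(a, b) >= threshold


s1 = "A p-value < 0.05 was considered statistically significant."
s2 = "A p-value of < 0.05 was considered statistically significant."
s3 = "Data were analysed using SPSS version 22."

print(jaccard(s1, s2), is_match(s1, s2))
print(jaccard(s1, s3), is_match(s1, s3))
```

Under this tokenization the first pair differs only by the word "of", so nine of ten distinct tokens are shared and the pair is flagged as a match, while the third sentence shares no tokens with the first.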

Summary of close matches at the sentence level (x-axis) by ANZCTR themes inferred from common n-grams (y-axis), organized by study design (A) and methods-based (B) topics.

A close match was defined as any pair of sentences between ANZCTR studies with a Jaccard score equal to 0.9 or higher.

20 Dec 2021

PONE-D-21-33560

An observational analysis of the trope "A p-value of < 0.05 was considered statistically significant" and other cut-and-paste statistical methods

PLOS ONE Dear Dr. White, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Kindly address all comments from reviewers 1 and 2. Reviewer 1 has suggested some additional literature to include in your discussion, I would be very grateful if you could consider these suggestions. Please submit your revised manuscript by Feb 03 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Benedikt Ley, PhD Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. 
We note that Figure 1 and 2 in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission: a. You may seek permission from the original copyright holder of Figure 1 and 2 to publish the content specifically under the CC BY 4.0 license. We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text: “I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.” Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission. In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].” b. 
If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only. 3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. 4. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. 
Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: I think the authors of this paper have made a commendable effort to screen two major public scientific databases. I have two major comments, one of which is somewhat personal. 1) The paper is at least partly framed as if the main problem is that sentences are copied from other papers; at least this was my first impression from the abstract. 
However, I think one could well argue that for the sake of clarity and comprehensibility, the same words *should* be used by different authors for the same methods. Thus, if a study used the default alpha of 0.05, a sentence like “a p-value of < 0.05 was considered statistically significant” *should* be written in the paper. Given that probably most screened studies did use the default 0.05, I think it is actually a bad sign that only 1 in 10 studies explicitly wrote it. If we want that people become aware that they are using the same statistical default options since almost a century, it would be helpful if they wrote in their papers, e.g., “we used the default p = 0.05 significance level and the default null hypothesis of zero effect”. I thus suggest you should clarify and discuss that your aim was not to study plagiarism but the extent to which people use the same old default methods – or at least the extent to which they explicitly write that they used those methods, which is probably not the same thing, which you should discuss. I suggest that your results on the use of boilerplate text strongly underestimate the extent of a ritualistic practice of statistics, and that it would be desirable if more people would clearly say that they are committed to a ritualistic practice. In other words, I suggest that more use of boilerplate text might be desirable if people use boilerplate methods. 2) You did not extensively repeat the discussion about statistical significance, which is fine. However, I happen to be author of the article calling for the end of statistical significance that you discussed but did not cite (here’s the article: https://doi.org/10.1038/d41586-019-00857-9). I thus have three minor comments on your discussion starting from line 269: a) You cited the paper by Diaz-Quijano et al. 
(2020) who performed a survey on our signatories and found that around 22% of the 151 respondents said that they would likely use the concept of “statistical significance” again in future publications. I think you should write “22% of respondents *said they* were likely to continue using the concept in future publications”. b) You then wrote that “Reasons cited included the mindful use of p-values in combination with other evidence and concerns about the feasibility of abandoning p-values given their engrained usage in published literature.” You should exchange both mentions of “p-values” with “statistical significance”, as is written in Diaz-Quijano et al. (2020) and in our original article “Retire statistical significance”. Of course, statistical significance and P-values are not the same. I and my co-authors have nothing against mindful use of P-values, which is what we try to advocate. On a side note, I would probably be among the people saying they would likely use the concept of statistical significance in future publications; for example, I’m using it whenever I write about the problems with this concept. c) Your reference for misinterpretations of P-values is Goodman’s “A Dirty Dozen”, which is a good paper; an updated version that you could cite as well and on which Goodman is co-author is https://doi.org/10.1007/s10654-016-0149-3. Signed review: Valentin Amrhein Reviewer #2: This is an interesting paper which aims to quantify the extent of ‘cutting and pasting’ or ‘recycling’ of statistical methods sections across publications and trial registries. This is an attempt to quantify and better understand poor reporting of statistical methods and statistical methods used in published papers and registered trials. The authors included statistical methods sections from articles published in PLOS ONE and study protocols registered in the Australian and New Zealand Clinical Trials Registry. This was an interesting paper with unsurprising but novel findings. 
Poor statistical methods is an ongoing problem and this is highlighted by the authors’ finding that only 17% of articles had a statistical methods section with 50 words or less! I have only minor comments/suggestions to the manuscript, which are provided below. The authors should also be commended for making all of their code freely available. Minor comments: Methods: 1. I note that the authors didn’t include a heading of ‘statistical methods’ in their manuscript, which of course means that their own study wouldn’t be picked up in future update of this study. 2. In Steps 1 and 2 for searching PLOS ONE, it would be helpful to provide the link to the code in Github. 3. The last paragraph of the Data sources/ANZCTR section describes a statistical analysis to investigate if particular studies were more likely to be missing a statistical methods section. This section doesn’t seem to fit with the rest of the section as it describes a statistical method (specific to the ANZCTR). Results: 4. In Figure S1, it’s not clear what ‘non-specific analysis’ means. 5. In lines 193-195, the reported figures (95518, 64144, 13380 and 13627) don’t match the figures in S1. It would be good if there was a more obvious link between the text and the figure. 6. Similar to the above comment, in lines 230, the figure ‘9623 had a completed statistical methods section’ doesn’t match the `analysed’ figure in Fig S1. It would be good if there was an obvious link between the text and the figure. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. 
Reviewer #1: Yes: Valentin Amrhein Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

6 Feb 2022 The editor asked us to consider some additional literature to include in the discussion, suggested by Reviewer 1. All of these suggestions have been incorporated in the revised manuscript. Reviewer #1: I think the authors of this paper have made a commendable effort to screen two major public scientific databases. I have two major comments, one of which is somewhat personal. 1) The paper is at least partly framed as if the main problem is that sentences are copied from other papers; at least this was my first impression from the abstract. However, I think one could well argue that for the sake of clarity and comprehensibility, the same words *should* be used by different authors for the same methods. Thus, if a study used the default alpha of 0.05, a sentence like “a p-value of < 0.05 was considered statistically significant” *should* be written in the paper. Given that probably most screened studies did use the default 0.05, I think it is actually a bad sign that only 1 in 10 studies explicitly wrote it. If we want that people become aware that they are using the same statistical default options since almost a century, it would be helpful if they wrote in their papers, e.g., “we used the default p = 0.05 significance level and the default null hypothesis of zero effect”. I thus suggest you should clarify and discuss that your aim was not to study plagiarism but the extent to which people use the same old default methods – or at least the extent to which they explicitly write that they used those methods, which is probably not the same thing, which you should discuss. I suggest that your results on the use of boilerplate text strongly underestimate the extent of a ritualistic practice of statistics, and that it would be desirable if more people would clearly say that they are committed to a ritualistic practice. In other words, I suggest that more use of boilerplate text might be desirable if people use boilerplate methods. 
RESPONSE: Thank you for offering this interesting perspective. We agree that boilerplate text may be beneficial to provide clarity about the same method and/or key assumptions made as part of analysis. There may also be cases where a boilerplate section may be adequate to meet reporting requirements, provided that sufficient details are given to allow readers to assess study quality. We alluded to this point in the original manuscript (Discussion, p9 lines 296 – 299), but agree that this discussion could be expanded. We hope that our paper will reinforce the statistical community’s concern around the ritualistic practice of statistics and poor reporting. To address this feedback, we have added new text throughout the manuscript to clarify the purpose of our analysis as follows: Abstract: the opening sentence of the second paragraph has been revised to “Instances of boilerplate text suggest a mechanistic approach to statistical analysis, where the same default methods are being used and described using standardized text. To investigate the extent of this practice, we analyzed text extracted from published statistical methods sections from PLOS ONE and the Australian and New Zealand Clinical Trials Registry (ANZCTR).” Introduction, page 2 -Line 33: The opening sentence has been modified to “A potential contributor to poor reporting is the temptation for researchers to re-use descriptions of the same default statistical methods, to make their papers resemble those of their peers and increase perceived chances of publication [17]. 
As these default choices become more common, valid criticism by reviewers and journal editors becomes increasingly difficult, as past use may be argued by researchers as offering precedent for the conduct of analysis within their discipline [18]” -Line 46: the final sentence of the Introduction has been revised to “The use of boilerplate text indicates that researchers are emphasizing the same details about chosen statistical analyses, and potentially giving little thought into the conduct and transparent reporting of statistical methods used.” Results: ANZCTR, page 9 -lines 258 – 263: Additional text has been added to highlight the use of the same statistical methods across topics, as reflected in Figure S5 “At the n-gram level, we noted the use of similar methods across multiple topics. For example, while topic 3 (student's t-test) was dominated by mentions of group-based hypothesis tests as expected, the same topic also referenced linear modelling/regression methods and descriptive statistics. Similarly, the use of linear modelling/regression methods was referenced across multiple topics covering quantitative and qualitative methods.”. This finding is somewhat expected as statistical methods sections can either state a single method for analysis, or multiple related methods as part of the same analysis. Discussion, pages 8 – 11 -line 277: Added the opening sentence “The aim of our analysis was to identify common themes in statistical methods sections both in terms of chosen methods and how these methods are being communicated”. -line 280: Added the sentence “Results from topic modeling further identified distinct themes across statistical methods sections that emphasized details about study design, chosen methods, p-values and software” -page 9, lines 285 – 299: This paragraph has been completely revised to emphasize the finding that researchers tend to use the same statistical methods repeatedly, although the text used to describe them may vary. 
In the original manuscript, this paragraph was the fourth paragraph of the Discussion. In the revised manuscript, this paragraph has been moved up to the second paragraph of the Discussion given its focus on the use of the same methods across studies. -page 10, lines 339 – 353: This is a new paragraph incorporating text from the original Discussion section. The paragraph expands on the discussion of cases when boilerplate text might be sufficient for consistency/to meet reporting requirements. In this paragraph, we suggest cases where boilerplate text may be helpful but emphasise that such text must contain sufficient details to enable independent replication of results. As part of this discussion, we provide an example of an automation initiative (2WeekSR) for assisting with completion of systematic reviews. 2) You did not extensively repeat the discussion about statistical significance, which is fine. However, I happen to be author of the article calling for the end of statistical significance that you discussed but did not cite (here’s the article: https://doi.org/10.1038/d41586-019-00857-9). I thus have three minor comments on your discussion starting from line 269: a) You cited the paper by Diaz-Quijano et al. (2020) who performed a survey on our signatories and found that around 22% of the 151 respondents said that they would likely use the concept of “statistical significance” again in future publications. I think you should write “22% of respondents *said they* were likely to continue using the concept in future publications”. b) You then wrote that “Reasons cited included the mindful use of p-values in combination with other evidence and concerns about the feasibility of abandoning p-values given their engrained usage in published literature.” You should exchange both mentions of “p-values” with “statistical significance”, as is written in Diaz-Quijano et al. (2020) and in our original article “Retire statistical significance”. 
Of course, statistical significance and P-values are not the same. I and my co-authors have nothing against mindful use of P-values, which is what we try to advocate. On a side note, I would probably be among the people saying they would likely use the concept of statistical significance in future publications; for example, I’m using it whenever I write about the problems with this concept. c) Your reference for misinterpretations of P-values is Goodman’s “A Dirty Dozen”, which is a good paper; an updated version that you could cite as well and on which Goodman is co-author is https://doi.org/10.1007/s10654-016-0149-3. RESPONSE: Thank you for bringing this missing reference to our attention. We have now cited this article in the revised Discussion (page 10, reference [43]). Further revisions have been completed as follows: For (a) and (b), the relevant text has been updated as suggested. For (c) The suggested reference has been added to the Discussion (page 10, reference [42]). Reviewer #2: This is an interesting paper which aims to quantify the extent of ‘cutting and pasting’ or ‘recycling’ of statistical methods sections across publications and trial registries. This is an attempt to quantify and better understand poor reporting of statistical methods and statistical methods used in published papers and registered trials. The authors included statistical methods sections from articles published in PLOS ONE and study protocols registered in the Australian and New Zealand Clinical Trials Registry. This was an interesting paper with unsurprising but novel findings. Poor statistical methods is an ongoing problem and this is highlighted by the authors’ finding that only 17% of articles had a statistical methods section with 50 words or less! I have only minor comments/suggestions to the manuscript, which are provided below. The authors should also be commended for making all of their code freely available. Minor comments: Methods: 1. 
I note that the authors didn’t include a heading of ‘statistical methods’ in their manuscript, which of course means that their own study wouldn’t be picked up in future update of this study.

RESPONSE: Thank you for identifying this oversight! We have added the heading ‘Statistical methods’ to the Materials and Methods section of the revised manuscript; see page 4, line 120. We have also updated Figure S1 to include the number of studies that included either “Statistical methods” or “Statistical methodology” as a section heading. Both section headings were included in our initial analysis as partial matches to “Statistical method”, but we have now listed them for clarity.

2. In Steps 1 and 2 for searching PLOS ONE, it would be helpful to provide the link to the code in Github.

RESPONSE: We have added the following sentence after the explanation of Steps 1 and 2 for searching PLOS ONE (page 3, line 94): “Code to complete steps 1 and 2 is available at https://github.com/agbarnett/stats_section/code/plosone.” To improve the organisation of the cited GitHub repository, we have also updated its README file and folder structure. We hope that these changes make our code more accessible to readers.

3. The last paragraph of the Data sources/ANZCTR section describes a statistical analysis to investigate if particular studies were more likely to be missing a statistical methods section. This section doesn’t seem to fit with the rest of the section as it describes a statistical method (specific to the ANZCTR).

RESPONSE: We agree that this paragraph would be better placed in the Statistical Methods section.
In the revised manuscript, we have moved this paragraph to the Statistical Methods section and added the subsection heading ‘Analysis of missing statistical methods sections’ (see page 4, lines 141 – 150). The revised text is: “Statistical methods sections were missing for some studies downloaded from ANZCTR, including sections labelled as “Not applicable”, “Nil” or “None”. Since these studies would be excluded from topic modeling, we examined if there were particular studies where the statistical methods section was more likely to be missing.”

Results:

4. In Figure S1, it’s not clear what ‘non-specific analysis’ means.

RESPONSE: We used the term ‘non-specific analysis’ to describe studies that included the word ‘analysis’ as part of a section heading, but could not be matched against our set of predefined section headings; e.g., ‘Microarray analysis’. We agree that this could be communicated better and have amended the caption for Figure S1 to include a definition for non-specific analysis (page 11, line 359).

5. In lines 193-195, the reported figures (95518, 64144, 13380 and 13627) don’t match the figures in S1. It would be good if there was a more obvious link between the text and the figure.

RESPONSE: Thank you for pointing out this discrepancy. Figure S1 has been updated to include numbers reported in the main text. Please note that revised numbers in the manuscript have changed to resolve double counting of studies. For the PLOS ONE dataset, we have also included the exclusion reason “Other” as part of the flowchart, to flag remaining studies where initial search terms could not be attributed to standard section headings within XML files.

6. Similar to the above comment, in lines 230, the figure ‘9623 had a completed statistical methods section’ doesn’t match the `analysed’ figure in Fig S1. It would be good if there was an obvious link between the text and the figure.
RESPONSE: Figure S1 has been updated to match reported numbers in the main text. When revising, we noted a small error in the final sample size where 9,632 should be 9,523. This discrepancy has been resolved in the Abstract, the main text and in Tables 3 and 4.

A copy of reviewer responses has been uploaded as the file 'Response to Reviewers.pdf'.

Submitted filename: Response to Reviewers.pdf

9 Feb 2022

An observational analysis of the trope "A p-value of < 0.05 was considered statistically significant" and other cut-and-paste statistical methods

PONE-D-21-33560R1

Dear Dr. White,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Benedikt Ley, PhD
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Many thanks for addressing all comments by the reviewers and commenting on the resolved coding issues.

Reviewers' comments:

15 Feb 2022

PONE-D-21-33560R1

An observational analysis of the trope “A p-value of < 0.05 was considered statistically significant” and other cut-and-paste statistical methods

Dear Dr. White:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr Benedikt Ley
Academic Editor
PLOS ONE


An observational analysis of the trope "A p-value of < 0.05 was considered statistically significant" and other cut-and-paste statistical methods.

Introduction

Materials and methods

Data sources

Public Library of Science (PLOS ONE)

Australia and New Zealand Clinical Trials Registry (ANZCTR)

Statistical methods

Full-text processing

Analysis of missing statistical methods sections

Topic modelling

Content analysis

Results

Public Library of Science (PLOS ONE)

Australia and New Zealand Clinical Trials Registry (ANZCTR)

Discussion

Inclusion flowchart for studies downloaded for each case study.

Common combinations of statistical methods in topic related to the use of GraphPad Prism (topic 3) and SPSS (topic 5).

Common combinations of statistical methods in topic related to descriptive statistics (topics 2 and 6).

Total matching sentences by topic for the ANZCTR dataset.

Summary of close matches at the sentence level (x-axis) by ANZCTR themes inferred from common n-grams (y-axis), organized by study design (A) and methods-based (B) topics.
