Ghost in the machine or monkey with a typewriter-generating titles for Christmas research articles in The BMJ using artificial intelligence: observational study.

Abstract

OBJECTIVE: To determine whether artificial intelligence (AI) can generate plausible and engaging titles for potential Christmas research articles in The BMJ.
DESIGN: Observational study.
SETTING: Europe, Australia, and Africa.
PARTICIPANTS: 1 AI technology (Generative Pre-trained Transformer 3, GPT-3) and 25 humans.
MAIN OUTCOME MEASURES: Plausibility, attractiveness, enjoyability, and educational value of titles for potential Christmas research articles in The BMJ generated by GPT-3 compared with historical controls.
RESULTS: AI generated titles were rated at least as enjoyable (159/250 responses (64%) v 346/500 responses (69%); odds ratio 0.9, 95% confidence interval 0.7 to 1.2) and attractive (176/250 (70%) v 342/500 (68%); 1.1, 0.8 to 1.4) as real control titles, although the real titles were rated as more plausible (182/250 (73%) v 238/500 (48%); 3.1, 2.3 to 4.1). The AI generated titles overall were rated as having less scientific or educational merit than the real controls (146/250 (58%) v 193/500 (39%); 2.0, 1.5 to 2.6); this difference, however, became non-significant when humans curated the AI output (146/250 (58%) v 123/250 (49%); 1.3, 1.0 to 1.8). Of the AI generated titles, the most plausible was "The association between belief in conspiracy theories and the willingness to receive vaccinations," and the highest rated was "The effects of free gourmet coffee on emergency department waiting times: an observational study."
CONCLUSIONS: AI can generate plausible, entertaining, and scientifically interesting titles for potential Christmas research articles in The BMJ; as in other areas of medicine, performance was enhanced by human intervention.
© Author(s) (or their employer(s)) 2019. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.

Year: 2021 PMID: 34911737 PMCID: PMC8684048 DOI: 10.1136/bmj-2021-067732

Source DB: PubMed Journal: BMJ ISSN: 0959-8138

Introduction

Recent developments in machine learning and artificial intelligence (AI) are likely to revolutionise aspects of medical practice over the next decade. Although simple, human applied, rule based algorithms have been used in medical settings for decades, more recent developments in computer processing power and exponential increases in available data have enabled the development of systems that can optimise their own performance without human intervention. These are already in routine use in non-medical settings, for example, to target advertisements or articles of interest on social media, and to generate art and music. Increasing evidence shows that when given access to large imaging databases these algorithms can already be used effectively to diagnose breast and lung cancer, retinal disease, and intracranial haemorrhage, with similar accuracy to that of human experts.1 Such tools are likely to be able to offer decision support in other areas of medical practice soon, and frameworks for reporting AI and machine learning research are being developed to match this need.2

A detailed description of how AI works is beyond the scope of this article, but essentially AI comprises multilayered neural networks, which themselves are a group of linked algorithms, with outputs tuned to collectively respond to a stimulus from a particular input. Most traditional AIs are task specific (ie, trained on one form of labelled data), so that they become experts at, for example, categorising images or playing chess. More recent methods allow unsupervised learning by identifying patterns within massive datasets.
Once developed, however, AIs are metaphorical black boxes, with an input and an output but with workings that cannot be explained or interrogated; if trained on a dataset with an unknown inherent bias, the AI might inherit this in a way that is difficult to detect.3

Currently, the most up to date general purpose language AI is the Generative Pre-trained Transformer 3 (GPT-3) developed by OpenAI (San Francisco, CA). GPT-3 was trained using 175 billion varied items of text, including the entirety of Wikipedia and a collection of books and websites.4 From a starting prompt GPT-3 is capable of translation, answering questions, and even writing newspaper articles.5 GPT-3 is a commercial product, and, because of concerns about the potential for misuse, it can only be accessed by submitting a proposal and being accepted onto a Beta program.

Although traditionally computers have been thought incapable of innovative or independent thought, given the developments in technology it seemed timely to evaluate the capability of AI to generate worthwhile hypotheses for medical research. Since 1982 The BMJ has published a special Christmas edition, featuring articles in which evidence based science is combined with more light hearted or quirky themes.6 In this study we determined whether AI generated titles for potential Christmas research articles in The BMJ would meet the brief of combining scientific merit with engaging and entertaining subject matter.

Methods

We took the titles of the 13 most read Christmas research articles of the past 10 years in The BMJ and used these to construct a prompt instructing GPT-3 to generate similar titles (supplementary file). Both authors independently scored the 57 titles GPT-3 generated on a scale of 1 to 6 for scientific merit, entertainment, and plausibility. We used the mean composite scores from this process to rank the titles and select the 10 highest rated and 10 lowest rated newly generated titles.

Despite an extensive review of the literature on the use of AI to generate titles for Christmas research articles in The BMJ, we were unable to identify any articles that could provide the required sample size. For this small study to disprove our null hypothesis that AI would be incapable of generating plausible titles, we used a convenience sample of 25 medical doctors from a range of specialties and settings: paediatricians, physicians in adult medicine, general practitioners, and anaesthetists from Africa, Australia, and Europe. The participants were required to self-declare that they were familiar with the usual content and format of the Christmas issue of The BMJ. They were then asked to complete an online survey containing 10 randomly selected titles of Christmas research articles obtained from the archive of The BMJ and the 10 highest rated and 10 lowest rated AI generated article titles (fig 1). The titles were presented to each participant in a random order, and the participants were blinded to which of the three categories (real articles, 10 highest rated AI generated titles, and 10 lowest rated AI generated titles) each title belonged. The participants were told that the list contained a mixture of real and AI generated titles but not the proportion of each.
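
The ranking and selection step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the placeholder titles and scores are invented, the function names are hypothetical, and treating the composite as the plain mean of every score from both raters is an assumption, because the text does not state how the composite was formed.

```python
from statistics import mean

# Hypothetical ratings: title -> one (merit, entertainment, plausibility)
# tuple per rater, each on the 1 to 6 scale. All values are invented.
ratings = {
    "placeholder title A": [(5, 4, 6), (6, 5, 5)],
    "placeholder title B": [(2, 3, 1), (1, 2, 2)],
    "placeholder title C": [(4, 4, 3), (3, 5, 4)],
    "placeholder title D": [(1, 6, 2), (2, 5, 1)],
}


def composite_score(rater_scores):
    """Mean of every individual score given to one title (assumed definition)."""
    return mean(score for rater in rater_scores for score in rater)


def select_extremes(all_ratings, n):
    """Return (n highest rated, n lowest rated) titles by composite score."""
    if 2 * n > len(all_ratings):
        raise ValueError("n is too large for the two groups to be disjoint")
    ranked = sorted(
        all_ratings,
        key=lambda title: composite_score(all_ratings[title]),
        reverse=True,
    )
    # The lowest rated group keeps the ranking order, so its last entry
    # is the title with the lowest composite score.
    return ranked[:n], ranked[-n:]


highest, lowest = select_extremes(ratings, n=1)
print("highest rated:", highest)
print("lowest rated:", lowest)
```

In the study the same idea would apply to 57 generated titles with n set to 10; ties in the composite score would need an explicit rule, which the text does not describe.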

Fig 1

Ten randomly selected actual titles of Christmas research articles in The BMJ and 10 highest rated and 10 lowest rated artificial intelligence (AI) generated articles

Using a seven level Likert scale (absolutely not, probably not, maybe not, unsure, maybe, probably, absolutely), the participants rated each paper according to four statements: This is a real BMJ paper; I want to read this; This would be funny/enjoyable to read; and This would be scientifically/educationally useful. They were also asked to select which of the 30 titles was the most plausible overall and which was the funniest.

We assessed the ability of GPT-3 to generate titles unaided by comparing the proportion of real titles with positive Likert scores (5 to 7) with the proportion of the 10 highest and 10 lowest rated titles combined with positive scores. To determine if human curation was beneficial to AI, we performed the same comparison between the real titles and the 10 highest rated titles. Ordinal regression was used to test statistical significance between groups. Data were analysed using R version 4.0.5,7 the Tidyverse,8 and Likert packages.
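
The dichotomisation of Likert responses into positive and non-positive scores can be sketched as follows. The authors analysed their data in R; this Python sketch only illustrates the arithmetic. It assumes that the seven labels map to scores 1 to 7 in the order listed, so that "maybe", "probably", and "absolutely" are the positive scores (5 to 7). The function names and sample responses are invented for illustration.

```python
# Seven level scale in the order given in the text; the 1 to 7 numbering
# is an assumption based on that order.
LIKERT_LEVELS = (
    "absolutely not",
    "probably not",
    "maybe not",
    "unsure",
    "maybe",
    "probably",
    "absolutely",
)


def likert_score(label):
    """Convert a Likert label to its assumed 1 to 7 score."""
    return LIKERT_LEVELS.index(label.strip().lower()) + 1


def positive_proportion(labels, threshold=5):
    """Return (positive responses, total responses, proportion positive)."""
    scores = [likert_score(label) for label in labels]
    if not scores:
        raise ValueError("at least one response is required")
    positives = sum(score >= threshold for score in scores)
    return positives, len(scores), positives / len(scores)


# Invented responses to one statement, for illustration only.
sample = ["maybe", "unsure", "absolutely", "probably not"]
positives, total, proportion = positive_proportion(sample)
print(f"{positives}/{total} positive ({proportion:.0%})")
```

Collapsing the scale this way is what produces counts such as 182/250; the ordinal regression reported by the authors works on the full seven level scale instead.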

Patient and public involvement

Although the topic of this paper does not directly apply to specific patient groups, we did speak to patients about the study. We also asked a member of the public to comment on our manuscript after submission.

Results

AI generated highest and lowest rated titles combined

When the titles of real Christmas research articles in The BMJ were compared with the combined list of highest and lowest rated AI generated titles (fig 2), the real titles were rated as more likely to be an actual article (182/250 responses (73%) v 238/500 responses (48%); odds ratio 3.1, 95% confidence interval 2.3 to 4.1; P<0.001) and more likely to be scientifically or educationally useful (146/250 (58%) v 193/500 (39%); 2.0, 1.5 to 2.6; P<0.001). AI generated titles were rated as attractive to read as the real article titles (176/250 (70%) v 342/500 (68%); 1.1, 0.8 to 1.4; P=0.49) and as enjoyable (159/250 (64%) v 346/500 (69%); 0.9, 0.7 to 1.2; P=0.55).
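
As a rough cross-check on the counts above, a crude odds ratio with a Wald 95% confidence interval can be computed directly from the dichotomised responses. This is a different technique from the ordinal regression used in the study, so it should only be expected to land near the reported estimates, not to reproduce them: for the "actual article" statement it gives an odds ratio a little under 3, compared with the reported 3.1. The function name is hypothetical.

```python
import math


def crude_odds_ratio(pos_a, n_a, pos_b, n_b, z=1.96):
    """Crude odds ratio of a positive response (group A v group B).

    Returns (odds ratio, lower bound, upper bound) using a Wald interval
    on the log odds ratio. This is not the ordinal regression estimate.
    """
    neg_a, neg_b = n_a - pos_a, n_b - pos_b
    if min(pos_a, neg_a, pos_b, neg_b) <= 0:
        raise ValueError("all four cells of the 2x2 table must be positive")
    odds_ratio = (pos_a / neg_a) / (pos_b / neg_b)
    se_log_or = math.sqrt(1 / pos_a + 1 / neg_a + 1 / pos_b + 1 / neg_b)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts taken from the text: rated likely to be an actual article,
# real titles 182/250 v AI generated titles 238/500.
plausibility = crude_odds_ratio(182, 250, 238, 500)
print("crude OR %.2f (95%% CI %.2f to %.2f)" % plausibility)
```

The gap between the crude and reported values is expected, because the regression uses all seven Likert levels and not only the positive versus non-positive split.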

Fig 2

Real titles of Christmas research articles in The BMJ compared with top 10 and bottom 10 ranked AI generated titles using seven point Likert scales

Curated AI generated titles

When the real titles were compared with the top ranked AI generated ones curated by humans (fig 3), the real titles were still believed to be more likely to represent an actual article (182/250 (73%) v 147/250 (59%); 2.2, 1.6 to 3.0; P<0.001) but were no longer rated as significantly more educationally useful (146/250 (58%) v 123/250 (49%); 1.3, 1.0 to 1.8; P=0.08). The selected group of top ranked AI titles were still rated as attractive to read as the real titles (176/250 (70%) v 185/250 (74%); 0.9, 0.6 to 1.2; P=0.45) and as enjoyable (159/250 (64%) v 180/250 (72%); 0.8, 0.6 to 1.1; P=0.25).

Fig 3

Real titles of Christmas research articles in The BMJ compared with curated top 10 ranked AI generated titles using seven point Likert scales

When the participants were asked to choose the single most plausible title, 10 (40%) chose one that had been AI generated, the most popular being “The association between belief in conspiracy theories and the willingness to receive vaccinations.” For the single funniest title only six (24%) participants chose a real article (fig 4).

Fig 4

Most plausible and funniest titles chosen by participants. Logo with Santa’s hat indicates titles of real Christmas research articles in The BMJ

Discussion

In this small study, AI generated titles for potential Christmas research articles in The BMJ were at least as entertaining and attractive to readers in our sample as titles of actual articles that were published in the Christmas issue of The BMJ. Real titles performed significantly better than AI generated ones (both curated and non-curated by the human participants) in terms of plausibility, although it was not possible to differentiate inherent plausibility from the participants’ familiarity with previous published Christmas research articles in The BMJ. A small number of well known articles included by chance in our sample could have substantially skewed the results. The only two titles to be rated both as the most plausible and the funniest were “The survival time of chocolates on hospital wards: covert observational study” (which was the third most accessed Christmas research article in the month of its publication, with 298 841 readers) and “The effects of free gourmet coffee on emergency department waiting times: an observational study,” now our potential submission for the 2022 Christmas issue of The BMJ.

When we considered the perceived scientific value of the articles in our sample, AI generated titles not selected by humans performed noticeably worse than real titles. When a subsequent step of human curation was applied, the performance of the AI generated titles came within the range of the real titles. This finding fits with previous work on AI, suggesting that the best results come from combining machine learning with human oversight.9 Both human and machine decision making are limited by the quality and quantity of inputs. Humans are psychologically limited by how much data they can review, retain, and process, whereas machines are more likely to be constrained by the method of input.
In our study, GPT-3 “knew” about the subject matter, wording, and associations of previously successful article titles but did not have the experience of clinical practice shared by the authors and study participants. Although humans might see the real world application of a study about the effect of clinician sleep deprivation on mortality in the intensive care unit, AI, with its inputs, sees this as no more or less useful than understanding the effects of applying superglue to nipples as a distraction from erectile dysfunction at work, nor can it understand whether the titles are offensive.

One limitation of our study is that we compared articles that had been accepted by, rather than submitted to, The BMJ for publication in its Christmas issue with the outputs of GPT-3. The performance of GPT-3 might have been better if this broader sample had been used.

Although our study might be the first to consider the use of AI to generate titles of research articles and to determine the attractiveness of those articles to potential readers, interest in the use of AI to generate research hypotheses is growing. For example, it has been proposed that the Euretos platform, mainly used by preclinical researchers to identify potential targets and biomarkers, could be used to generate hypotheses based on published papers, with subsequent expert review determining which of these are appropriate research directions to pursue.10

The findings of our study reinforce the essential role humans have in directing AI and curating its output. It is overwhelmingly likely, however, that recent developments in AI and machine learning will change the way work is done in healthcare, whether this is through improving diagnostic speed and accuracy, decision support, or reducing medical error. AI has the potential to change the way we select and interact with the medical literature; our study is an early demonstration of the way these technologies might also change the way we produce that literature.

Conclusion

Even in the context of quirky titles such as those that appear in the Christmas issues of The BMJ, AI has the potential to generate plausible outputs that are engaging and could attract potential readers. Attracting interest can only be done with expert guidance, however, as some of the article titles in our study were irrelevant or offensive. This finding mirrors the potential use of AI in clinical medicine, as decision support rather than as outright replacement of clinicians.

Recent parallel advances in technology and digitisation have led to a rapid development of artificial intelligence (AI) and machine learning.
In medicine, early applications of AI have been based around image recognition and diagnostics but with great potential for broader use.
The most recent AI systems are capable of advanced language recognition, interpretation, and generation.
Titles of potential Christmas research articles in The BMJ generated by AI were as attractive and entertaining to readers as real titles published in the Christmas issue of The BMJ.
With an additional stage of human intervention, the titles also performed similarly in terms of potential scientific and educational value.
AI could have a role in generating hypotheses or directions for future research.

13 in total

1. Christmas crackers: highlights from past years of The BMJ's seasonal issue.

Authors: Navjoyt Ladher
Journal: BMJ Date: 2016-12-15

2. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness.

Authors: Sebastian Vollmer; Bilal A Mateen; Gergo Bohner; Franz J Király; Rayid Ghani; Pall Jonsson; Sarah Cumbers; Adrian Jonas; Katherine S L McAllister; Puja Myles; David Granger; Mark Birse; Richard Branson; Karel G M Moons; Gary S Collins; John P A Ioannidis; Chris Holmes; Harry Hemingway
Journal: BMJ Date: 2020-03-20

Review 3. The three ghosts of medical AI: Can the black-box present deliver?

Authors: Thomas P Quinn; Stephan Jacobs; Manisha Senadeera; Vuong Le; Simon Coghlan
Journal: Artif Intell Med Date: 2021-08-28 Impact factor: 5.326

4. Morphology and size of stem cells from mouse and whale: observational study.

Authors: Martin J Hoogduijn; Johanna C van den Beukel; Lidewij C M Wiersma; Jooske Ijzer
Journal: BMJ Date: 2013-12-12

5. Are "armchair socialists" still sitting? Cross sectional study of political affiliation and physical activity.

Authors: Adrian Bauman; Joanne Gale; Karen Milton
Journal: BMJ Date: 2014-12-11

6. Efficacy of educational video game versus traditional educational apps at improving physician decision making in trauma triage: randomized controlled trial.

Authors: Deepika Mohan; Coreen Farris; Baruch Fischhoff; Matthew R Rosengart; Derek C Angus; Donald M Yealy; David J Wallace; Amber E Barnato
Journal: BMJ Date: 2017-12-12

7. Intellectual engagement and cognitive ability in later life (the "use it or lose it" conjecture): longitudinal, prospective study.

Authors: Roger T Staff; Michael J Hogan; Daniel S Williams; L J Whalley
Journal: BMJ Date: 2018-12-10

8. Working 9 to 5, not the way to make an academic living: observational analysis of manuscript and peer review submissions over time.

Authors: Adrian Barnett; Inger Mewburn; Sara Schroter
Journal: BMJ Date: 2019-12-19

Review 9. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis.

Authors: Ravi Aggarwal; Viknesh Sounderajah; Guy Martin; Daniel S W Ting; Alan Karthikesalingam; Dominic King; Hutan Ashrafian; Ara Darzi
Journal: NPJ Digit Med Date: 2021-04-07

10. Effect of therapeutic suggestions during general anaesthesia on postoperative pain and opioid use: multicentre randomised controlled trial.

Authors: Hartmuth Nowak; Nina Zech; Sven Asmussen; Tim Rahmel; Michael Tryba; Guenther Oprea; Lisa Grause; Karin Schork; Manuela Moeller; Johannes Loeser; Katharina Gyarmati; Corinna Mittler; Thomas Saller; Alexandra Zagler; Katrin Lutz; Michael Adamzik; Ernil Hansen
Journal: BMJ Date: 2020-12-10