Robin Marlow1,2, Dora Wood3. 1. Bristol Royal Hospital for Children, Bristol, BS2 8BJ, UK robin.marlow@bristol.ac.uk. 2. Centre for Academic Child Health, University of Bristol, Bristol, UK. 3. Bristol Royal Hospital for Children, Bristol, BS2 8BJ, UK.
Recent developments in machine learning and artificial intelligence (AI) are likely to revolutionise aspects of medical practice over the next decade. Although simple human applied rule based algorithms have been used in medical settings for decades, more recent developments in computer processing power and exponential increases in available data have enabled the development of systems that can optimise their own performance without human intervention. These are already in routine use in non-medical settings, for example, to target advertisements or articles of interest on social media, and to generate art and music.
Increasing evidence shows that when given access to large imaging databases these algorithms can already be used effectively to diagnose breast and lung cancer, retinal disease, and intracranial haemorrhage, with similar accuracy to that of human experts.1 Such tools are likely to be able to offer decision support in other areas of medical practice soon, and frameworks for reporting AI and machine learning research are being developed to match this need.2
A detailed description of how AI works is beyond the scope of this article, but essentially AI comprises multilayered neural networks, which themselves are a group of linked algorithms, with outputs tuned to collectively respond to a stimulus from a particular input. Most traditional AIs are task specific (ie, trained on one form of labelled data), so that they become experts at, for example, categorising images or playing chess. More recent methods allow unsupervised learning by identifying patterns within massive datasets. Once developed, however, AIs are metaphorical black boxes, with an input and an output but no means of explaining or interrogating their workings; if trained on a dataset with an unknown inherent bias, the AI might inherit this in a way that is difficult to detect.3
Currently, the most up to date general purpose language AI is the Generative Pre-trained Transformer 3 (GPT-3) developed by OpenAI (San Francisco, CA). GPT-3 has 175 billion parameters and was trained on a large and varied corpus of text, including the entirety of Wikipedia and a collection of books and websites.4 From a starting prompt GPT-3 is capable of translation, answering questions, and even writing newspaper articles.5 GPT-3 is a commercial product, and, because of concerns about the potential for misuse, it can only be accessed by submitting a proposal and being accepted onto a Beta program.
Although traditionally computers have been thought incapable of innovative or independent thought, given the developments in technology it seemed timely to evaluate the capability of AI to generate worthwhile hypotheses for medical research. Since 1982 The BMJ has published a special Christmas edition, featuring articles in which evidence based science is combined with more light hearted or quirky themes.6 In this study we determined whether AI generated titles for potential Christmas research articles in The BMJ would meet the brief of combining scientific merit with engaging and entertaining subject matter.
Methods
We took the titles of the 13 most read Christmas research articles of the past 10 years in The BMJ and used these to construct a prompt instructing GPT-3 to generate similar titles (supplementary file). Both authors independently scored the 57 titles GPT-3 generated on a scale of 1 to 6 for scientific merit, entertainment, and plausibility. We used the mean composite scores from this process to rank the titles and select the 10 highest rated and 10 lowest rated newly generated titles.
Despite an extensive review of the literature on the use of AI to generate titles for Christmas research articles in The BMJ, we were unable to identify any articles that could provide the required sample size. For this small study to disprove our null hypothesis that AI would be incapable of generating plausible titles, we used a convenience sample of 25 medical doctors from a range of specialties and settings: paediatricians, physicians in adult medicine, general practitioners, and anaesthetists from Africa, Australia, and Europe.
The participants were required to self-declare that they were familiar with the usual content and format of the Christmas issue of The BMJ. They were then asked to complete an online survey containing 10 randomly selected titles of Christmas research articles obtained from the archive of The BMJ and the 10 highest rated and 10 lowest rated AI generated article titles (fig 1). The titles were presented to each participant in a random order, and participants were blinded to which of the three categories (real articles, 10 highest rated AI generated titles, and 10 lowest rated AI generated titles) each title belonged. The participants were told that the list contained a mixture of real and AI generated titles but not the proportion of each.
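The selection and blinding steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' procedure: the function names `rank_titles` and `build_survey`, the dictionary keys, and the unweighted averaging of the three criteria across the two raters are all assumptions, because the text does not say how the composite score was calculated or how the survey software randomised the order.

```python
import random
from statistics import mean


def rank_titles(scores, n_select=10):
    """Rank titles by mean composite score and return (top, bottom).

    `scores` maps each title to a list of per-rater dicts holding the
    three 1-6 ratings. Assumption: the composite is the unweighted mean
    of the three criteria, averaged over raters.
    """
    composite = {
        title: mean(
            mean((r["merit"], r["entertainment"], r["plausibility"]))
            for r in ratings
        )
        for title, ratings in scores.items()
    }
    ranked = sorted(composite, key=composite.get, reverse=True)
    return ranked[:n_select], ranked[-n_select:]


def build_survey(real, ai_top, ai_bottom, seed):
    """Return titles in a random order plus a separate category key.

    Only the shuffled titles are shown to the participant; the key stays
    with the investigators so that the participant remains blinded.
    """
    items = (
        [(t, "real") for t in real]
        + [(t, "ai_top") for t in ai_top]
        + [(t, "ai_bottom") for t in ai_bottom]
    )
    rng = random.Random(seed)  # one seed per participant (hypothetical)
    rng.shuffle(items)
    shown = [title for title, _ in items]
    key = dict(items)
    return shown, key
```

Using a different seed for each participant gives every participant their own ordering of the 30 titles while keeping each ordering reproducible.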
Fig 1
Ten randomly selected actual titles of Christmas research articles in The BMJ and 10 highest rated and 10 lowest rated artificial intelligence (AI) generated articles
Using a seven level Likert scale (absolutely not, probably not, maybe not, unsure, maybe, probably, absolutely), the participants rated each paper according to four statements: This is a real BMJ paper; I want to read this; This would be funny/enjoyable to read; and This would be scientifically/educationally useful. They were also asked to select which of the 30 titles was the most plausible overall and which the funniest.
We assessed the ability of GPT-3 to generate titles unaided by comparing the proportion of real titles with positive Likert scores (5 to 7) with the proportion of the 10 highest and 10 lowest rated titles combined with positive scores. To determine if human curation was beneficial to AI, we performed the same comparison between the real titles and the 10 highest rated titles. Ordinal regression was used to test statistical significance between groups. Data were analysed using R version 4.0.5,7 the Tidyverse,8 and Likert packages.
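The definition of a positive response (Likert scores 5 to 7) can be illustrated with a short Python sketch. It assumes that the seven labels map to scores 1 to 7 in the order they are listed in the text, so that "maybe", "probably", and "absolutely" count as positive. The names `LIKERT_LEVELS`, `likert_score`, and `proportion_positive` are hypothetical. The authors analysed their data in R, so this is not their code.

```python
# Assumed mapping: labels score 1..7 in the order listed in the text.
LIKERT_LEVELS = [
    "absolutely not",
    "probably not",
    "maybe not",
    "unsure",
    "maybe",
    "probably",
    "absolutely",
]


def likert_score(label):
    """Convert a response label to its 1-7 score."""
    return LIKERT_LEVELS.index(label) + 1


def proportion_positive(labels, threshold=5):
    """Return (positive count, total, proportion) for scores >= threshold."""
    scores = [likert_score(label) for label in labels]
    positive = sum(score >= threshold for score in scores)
    return positive, len(scores), positive / len(scores)
```

Applied to all responses for one statement within one group of titles, this gives the counts and percentages of positive responses that are compared between groups.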
Patient and public involvement
Although the topic of this paper does not directly apply to specific patient groups, we did speak to patients about the study. We also asked a member of the public to comment on our manuscript after submission.
Results
AI generated highest and lowest rated titles combined
When the titles of real Christmas research articles in The BMJ were compared with the combined list of highest and lowest rated AI generated titles (fig 2), the real titles were rated as more likely to be an actual article (182/250 responses (73%) v 238/500 responses (48%); odds ratio 3.1, 95% confidence interval 2.3 to 4.1; P<0.001) and more likely to be scientifically or educationally useful (146/250 (58%) v 193/500 (39%); 2.0, 1.5 to 2.6; P<0.001). AI generated titles were rated as similarly attractive to read as the real article titles (176/250 (70%) v 342/500 (68%); 1.1, 0.8 to 1.4; P=0.49) and as similarly enjoyable (159/250 (64%) v 346/500 (69%); 0.9, 0.7 to 1.2; P=0.55).
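As a rough check on the size of these effects, a crude odds ratio can be calculated directly from the reported counts of positive responses. The Python sketch below is an illustration only and is not the authors' analysis. The reported odds ratios come from ordinal regression on the seven level scale, so a crude 2×2 estimate will not reproduce them exactly. The Wald interval also treats every response as independent, although each participant rated several titles. The function name `crude_odds_ratio` is hypothetical.

```python
import math


def crude_odds_ratio(pos_a, n_a, pos_b, n_b, z=1.96):
    """Crude odds ratio (group A v group B) with a Wald 95% interval.

    Treats every response as independent, which is a simplification.
    """
    a, b = pos_a, n_a - pos_a  # group A: positive, not positive
    c, d = pos_b, n_b - pos_b  # group B: positive, not positive
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, lower, upper


# "This is a real BMJ paper": real titles 182/250 v AI titles 238/500
estimate, lower, upper = crude_odds_ratio(182, 250, 238, 500)
print(f"crude odds ratio {estimate:.2f} ({lower:.2f} to {upper:.2f})")
```

For these counts the crude estimate is about 2.9, which is close to the reported 3.1 but not identical to it.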
Fig 2
Real titles of Christmas research articles in The BMJ compared with top 10 and bottom 10 ranked AI generated titles using seven point Likert scales
Curated AI generated titles
When the real titles were compared with the top ranked AI generated ones curated by humans (fig 3), the real titles were still believed to be more likely to represent an actual article (182/250 (73%) v 147/250 (59%); 2.2, 1.6 to 3.0; P<0.001), but the difference in perceived educational usefulness was no longer statistically significant (146/250 (58%) v 123/250 (49%); 1.3, 1.0 to 1.8; P=0.08). The selected group of top ranked AI titles were still rated as similarly attractive to read as the real titles (176/250 (70%) v 185/250 (74%); 0.9, 0.6 to 1.2; P=0.45) and as similarly enjoyable (159/250 (64%) v 180/250 (72%); 0.8, 0.6 to 1.1; P=0.25).
Fig 3
Real titles of Christmas research articles in The BMJ compared with curated top 10 ranked AI generated titles using seven point Likert scales
When the participants were asked to choose the single most plausible title, 10 (40%) chose one that had been AI generated, the most popular being “The association between belief in conspiracy theories and the willingness to receive vaccinations.” For the single funniest title only six (24%) participants chose a real article (fig 4).
Fig 4
Most plausible and funniest titles chosen by participants. Logo with Santa’s hat indicates titles of real Christmas research articles in The BMJ
Discussion
In this small study, AI generated titles for potential Christmas research articles in The BMJ were at least as entertaining and attractive to readers in our sample as titles of actual articles that were published in the Christmas issue of The BMJ. Real titles performed significantly better than AI generated ones (with and without human curation) in terms of plausibility, although it was not possible to differentiate inherent plausibility from the participants’ familiarity with previous published Christmas research articles in The BMJ. A small number of well known articles included by chance in our sample could have substantially skewed the results.
The only two titles to be rated both as the most plausible and the funniest were “The survival time of chocolates on hospital wards: covert observational study” (which was the third most accessed Christmas research article in the month of its publication, with 298 841 readers) and “The effects of free gourmet coffee on emergency department waiting times: an observational study,” now our potential submission for the 2022 Christmas issue of The BMJ.
When we considered the perceived scientific value of the articles in our sample, AI generated titles not selected by humans performed noticeably worse than real titles. When a subsequent step of human curation was applied, the performance of the AI generated titles came within the range of the real titles.
This finding fits with previous work on AI, suggesting that the best results come from combining machine learning with human oversight.9 Both human and machine decision making are limited by the quality and quantity of inputs. Humans are psychologically limited by how much data they can review, retain, and process, whereas machines are more likely to be constrained by the method of input. In our study, GPT-3 “knew” about the subject matter, wording, and associations of previously successful article titles but did not have the experience of clinical practice shared by the authors and study participants. Although humans might see the real world application of a study of the effect of clinician sleep deprivation on mortality in the intensive care unit, AI, with its inputs, sees this as no more or less useful than understanding the effects of applying superglue to nipples as a distraction from erectile dysfunction at work, and it cannot judge whether a title is offensive.
One limitation of our study is that we compared the outputs of GPT-3 with articles that had been accepted by, rather than submitted to, The BMJ for publication in its Christmas issue. The relative performance of GPT-3 might have been better had this broader sample of submitted articles been used as the comparator.
Although our study might be the first to consider the use of AI to generate titles of research articles and to determine the attractiveness of those articles to potential readers, interest in the use of AI to generate research hypotheses is growing. For example, it has been proposed that the Euretos platform, mainly used by preclinical researchers to identify potential targets and biomarkers, could be used to generate hypotheses based on published papers, with subsequent expert review determining which of these are appropriate research directions to pursue.10
The findings of our study reinforce the essential role humans have in directing AI and curating its output. It is overwhelmingly likely, however, that recent developments in AI and machine learning will change the way work is done in healthcare, whether this is through improving diagnostic speed and accuracy, decision support, or reducing medical error. AI has the potential to change the way we select and interact with the medical literature; our study is an early demonstration of the way these technologies might also change the way we produce that literature.
Conclusion
Even in the context of quirky titles such as those that appear in the Christmas issues of The BMJ, AI has the potential to generate plausible outputs that are engaging and could attract potential readers. Attracting interest can only be done with expert guidance, however, as some of the article titles in our study were irrelevant or offensive. This finding mirrors the potential use of AI in clinical medicine, as decision support rather than as outright replacement of clinicians.
Recent parallel advances in technology and digitisation have led to a rapid development of artificial intelligence (AI) and machine learning.
In medicine, early applications of AI have been based around image recognition and diagnostics but with great potential for broader use.
The most recent AI systems are capable of advanced language recognition, interpretation, and generation.
Titles of potential Christmas research articles in The BMJ generated by AI were as attractive and entertaining to readers as real titles published in the Christmas issue of The BMJ.
With an additional stage of human intervention, the titles also performed similarly in terms of potential scientific and educational value.
AI could have a role in generating hypotheses or directions for future research.