Liliana Laranjo, Adam G Dunn, Huong Ly Tong, Ahmet Baki Kocaballi, Jessica Chen, Rabia Bashir, Didi Surian, Blanca Gallego, Farah Magrabi, Annie Y S Lau, Enrico Coiera.
Abstract
Objective: Our objective was to review the characteristics, current applications, and evaluation measures of conversational agents with unconstrained natural language input capabilities used for health-related purposes.
Year: 2018 PMID: 30010941 PMCID: PMC6118869 DOI: 10.1093/jamia/ocy072
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Characterization of conversational agents
| Characteristic | Category | Description |
|---|---|---|
| Platform | — | Software application delivered via mobile device (eg smartphone, tablet), laptop or desktop computer, or via web browser; SMS; telephone; or multimodal platform |
| Dialogue management | Finite-state | The user is taken through a dialogue consisting of a sequence of pre-determined steps or states |
| | Frame-based | The user is asked questions that enable the system to fill slots in a template in order to perform a task. The dialogue flow is not pre-determined but depends on the content of the user's input and the information that the system has to elicit |
| | Agent-based | These systems enable complex communication between the system, the user, and the application. There are many variants of agent-based systems, depending on which aspects of intelligent behavior are designed into the system. In agent-based systems, communication is viewed as an interaction between two agents, each capable of reasoning about its own actions and beliefs, and sometimes also about the actions and beliefs of the other agent. The dialogue model takes the preceding context into account, so the dialogue evolves dynamically as a sequence of related steps that build on each other |
| Dialogue initiative | User | The user leads the conversation |
| | System | The system leads the conversation |
| | Mixed | Both the user and the system can lead the conversation |
| Input modality | Spoken | The user uses spoken language to interact with the system |
| | Written | The user uses written language to interact with the system |
| Output modality | — | Written, spoken, visual (eg non-verbal communication such as facial expressions or body movements) |
| Task-oriented | Yes | The system is designed for a particular task and set up to have short conversations, in order to obtain the information necessary to achieve the goal (eg booking a consultation) |
| | No | The system is not directed to the short-term achievement of a specific end-goal or task (eg purely conversational chatbots) |
Adapted from McTear 2002; Chu-Carroll et al. 1997; and McTear et al. 2016.
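To make the finite-state versus frame-based distinction in the table above concrete, the following is a minimal illustrative sketch of frame-based, slot-filling dialogue management for a task-oriented agent such as a consultation-booking assistant. It is not taken from any of the reviewed systems; the slot names, prompts, and the `extract_slots` stub are assumptions for illustration, and a deployed system would replace the stub with a real natural language understanding module.

```python
# Minimal sketch of a frame-based dialogue manager for a task-oriented
# health agent (eg booking a consultation). Slot names, prompts, and the
# extract_slots() stub are illustrative assumptions, not from the review.

FRAME = {  # slots the system must fill before it can complete the task
    "reason": None,
    "preferred_day": None,
    "preferred_time": None,
}

PROMPTS = {
    "reason": "What is the reason for your visit?",
    "preferred_day": "Which day suits you best?",
    "preferred_time": "Do you prefer morning or afternoon?",
}

def extract_slots(utterance: str) -> dict:
    """Placeholder for natural language understanding: map free-text
    input to slot values. A real system would use an NLU module here."""
    slots = {}
    text = utterance.lower()
    if "morning" in text or "afternoon" in text:
        slots["preferred_time"] = text
    # ... further intent/entity matching would go here ...
    return slots

def next_action(frame: dict) -> str:
    """Unlike a finite-state design, the flow is not pre-determined:
    the system simply asks about whichever required slot is still empty."""
    for slot, value in frame.items():
        if value is None:
            return PROMPTS[slot]
    return "Thanks, I have everything I need to book your consultation."

# Example turn: the user volunteers information in any order, and the
# system keeps eliciting only what is still missing.
FRAME.update(extract_slots("I'd like an afternoon appointment please"))
print(next_action(FRAME))  # -> asks for the reason for the visit
```

A finite-state agent would instead walk through the prompts in a fixed order regardless of what the user has already provided, which is why frame-based designs are typically paired with mixed dialogue initiative.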
Examples of technical evaluation measures for conversational agents and their individual modules
| Module | Evaluation measures |
|---|---|
| Conversational agent as a whole (global measures) | Dialogue success rate (% successful task completion); dialogue-based cost measures (duration, number of turns necessary to achieve a task, number of repetitions, corrections, or interruptions) |
| Automatic speech recognition | Word accuracy, word error rate, word insertion rate, word substitution rate, sentence accuracy |
| Natural language understanding | Percentage of words correctly understood, not covered, or partially covered; % sentences correctly analyzed; % words outside the dictionary; % sentences whose final semantic representation is the same as the reference; % correct frame units, considering the actual frame units; frame-level accuracy; frame-level coverage |
| Dialogue management | Percentage of correct responses; % half-answers; % times the system works trying to solve a problem; % times the user acts trying to solve a problem |
| Natural language generation | Number of times the user requests a repetition of the reply provided by the system; user response time; number of times the user does not answer; rate of out-of-vocabulary words |
| Speech synthesis | Intelligibility of the synthetic speech and naturalness of the voice |
Abbreviations: %, percentage
Adapted from López-Cózar et al. 2011; Walker et al. 1997
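As an illustration of two of the measures listed above, the sketch below computes word error rate for automatic speech recognition output (via word-level edit distance) and dialogue success rate as a global measure. These are standard definitions rather than code from any of the reviewed studies, and the example strings and outcomes are invented.

```python
# Minimal sketch of two technical measures from the table above: word error
# rate (automatic speech recognition) and dialogue success rate (global).

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level edit distance (Levenshtein)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def dialogue_success_rate(dialogues: list[bool]) -> float:
    """Fraction of dialogues in which the task was completed successfully."""
    return sum(dialogues) / len(dialogues) if dialogues else 0.0

print(word_error_rate("book an appointment for monday",
                      "book appointment for monkey"))    # 0.4 (1 deletion + 1 substitution)
print(dialogue_success_rate([True, True, False, True]))  # 0.75
```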
Figure 1. Flow diagram of included studies, in which 17 studies (14 conversational agents) were identified from 1513 articles in the initial database search (April 2017). Search updates were conducted until February 2018, with 3 new papers identified for full-text screening.
Characteristics of the conversational agents evaluated in the included studies
| First author, year | Platform | Dialogue management | Dialogue initiative | Input | Output | Task-oriented |
|---|---|---|---|---|---|---|
| Fitzpatrick et al., 2017 | Platform-independent app | Frame-based | Mixed | Written | Written | No |
| Tanaka et al., 2017 | Windows computer app; ECA | Finite-state | System | Spoken | Spoken, written, visual | Yes |
| Miner et al., 2016 | Mobile device app | Agent-based | User | Spoken | Spoken, written | No |
| Ireland et al., 2016 | Mobile device app; chatbot | Frame-based | Mixed | Spoken | Spoken, written | No |
| Rhee et al., 2014 | SMS | Frame-based | Mixed | Written | Written | No |
| Hudlicka, 2013 | Web browser app; ECA | Frame-based | Mixed | Written | Written | No |
| Crutzen et al., 2011 | Windows computer app; chatbot | Frame-based | Mixed | Written | Written | No |
| Philip et al., 2017 | Windows computer app; ECA | Finite-state | System | Spoken | Spoken | Yes |
| Lucas et al., 2017 | Multimodal platform; ECA | Finite-state | System | Spoken | Spoken | Yes |
| Philip et al., 2014 | Windows computer app; ECA | Finite-state | System | Spoken | Spoken | Yes |
| Beveridge and Fox, 2006 | Telephone and web browser app | Frame-based | Mixed | Spoken | Spoken | Yes |
| Black et al., 2005 | Telephone | Finite-state | System | Spoken | Spoken | Yes |
| Levin and Levin, 2006 | Telephone | Finite-state | System | Spoken | Spoken | Yes |
| Giorgino et al., 2005 | Telephone | Frame-based | Mixed | Spoken | Spoken | Yes |
Abbreviations: app: application; ECA: Embodied Conversational Agent; SMS: Short Message Service
Conversational agent type considered unspecified where neither an ECA nor a chatbot;
Woebot, Woebot Labs: instant messenger app, platform independent;
Automated skills trainer developed from MMDAgent (http://www.mmdagent.jp);
Harlie the Chatbot (http://www.itee.uq.edu.au/cis/harlie);
mASMAA, an extension of TRIPS (The Rochester Interactive Planning System);
Virtual Mindfulness Coach;
Bzz Dutch chatbot for Windows Live Messenger;
SimSensei Virtual Agent, based on the MultiSense perception system, a multimodal sensing platform which fuses information from web cameras, the Microsoft Kinect, audio capture, and processing hardware (http://multicomp.ict.usc.edu/?p=1799);
HOMEY project – home monitoring through an intelligent dialogue system (http://www.openclinical.org/dm_homey.html#);
DI@L-log: although the system allows for dual-tone multi-frequency input, this is rarely used, as all interactions can occur via spoken language;
Pain Monitoring Voice Diary, developed by Spacegate, Inc;
Not objectively reported in the paper, but inferred from descriptions of the CA, sample dialogues, or other published material on the system
Figure 2. Characteristics of included conversational agents in terms of task-orientation, dialogue management, and dialogue initiative.
Study characteristics and results from the evaluation of conversational agents supporting patients and consumers
| Author, year | Health domain | CA purpose | Technical performance | User experience | Health-related measures |
|---|---|---|---|---|---|
| Technology supporting patients and consumers | | | | | |
| Fitzpatrick et al., 2017 | Mental health (depression, anxiety) | Psychotherapy support, education | NR | • High overall satisfaction (4.3/5 Likert scale) • Participants interacted with the CA 12.1 times • Issues in spoken language understanding | • Reduced depression symptoms (PHQ-9): effect size |
| Tanaka et al., 2017 | Mental health (autism) | Social skills practice, education | NR | NR | Improved narrative skills scores (pre-post, one-tailed): • Study 1 (audiovisual feedback) |
| Miner et al., 2016 | Mental and physical health, violence | Question answering, personal assistance, conversational | NR | • CAs frequently did not recognize the health concern • Responses were often incomplete and inconsistent • Referral to appropriate health resources was rare • No variation in responses by tone or sex of the user • Issues in spoken language understanding and/or dialogue management | • Siri, Google Now, and S Voice responded appropriately to the statement "I want to commit suicide"; Siri and Google Now referred the user to a suicide prevention helpline • Siri recognized physical concerns and referred to nearby medical facilities |
| Ireland et al., 2016 | Language impairment | Education, practice (feedback on speech and communication) | NR | • High overall satisfaction (nq) • Issues in spoken language understanding and/or dialogue management; low speed of processing | NR |
| Rhee et al., 2014 | Asthma | Data collection, self-monitoring | NR | • High overall satisfaction (nq) • Average response rates to each diary question: 81-97% • Common topic of user questions: symptoms • Issues: technical, spoken language understanding | • Improved self-management and treatment adherence (nq) • Improved awareness of symptoms and triggers (nq) |
| Hudlicka, 2013 | Mental health | Education, practice | NR | • High overall satisfaction (nq) • Issues: spoken language understanding | • Increased self-reported meditation frequency and duration |
| Crutzen et al., 2011 | Sexual health, substance abuse | Education | Average duration of conversations: 3 min and 57 secs | • Ease of use: mean 47.8, SD 31.4; Reliability: mean 73.7, SD 27.4; Usefulness: mean 56.4, SD 51.5 [scores 0-100; scale not validated] | NR |
Abbreviations: CA: conversational agent; CBT: cognitive behavioral therapy; d: Cohen’s d, effect size indicating the standardized difference between two means; ECA: Embodied Conversational Agent; GAD-7: Generalized Anxiety Disorder 7-item scale, measures the frequency and severity of anxious thoughts and behaviors over the past 2 weeks; min: minutes; nq: not quantified in the paper; NR: not reported; p: p-value, measure of statistical significance; PANAS: positive and negative affect schedule 20-item scale; PHQ-9: Patient Health Questionnaire 9-item scale, measures the frequency and severity of depressive symptoms; RCT: randomized controlled trial; SD: standard deviation
Studies evaluating the same conversational agent were grouped together;
Technology supporting patients and consumers: systems that support individuals with health-related aspects of their lives.
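For readers less familiar with the standardized effect sizes reported in this table (eg Cohen's d for change in PHQ-9 scores), the following is a minimal sketch of the standardized-mean-difference calculation defined in the abbreviations above. The group scores are hypothetical and are not data from any included study.

```python
# Minimal sketch of Cohen's d, the effect size referenced in the table
# above (standardized difference between two group means, eg PHQ-9
# scores in intervention vs control arms). The sample numbers are made up.

from statistics import mean, stdev

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """d = (mean_a - mean_b) / pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Hypothetical post-intervention PHQ-9 scores (lower = fewer symptoms)
control      = [12.0, 10.0, 14.0, 11.0, 13.0]
intervention = [9.0, 8.0, 11.0, 7.0, 10.0]
print(round(cohens_d(control, intervention), 2))  # ≈ 1.9; positive d favours the intervention
```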
Study characteristics and results from the evaluation of conversational agents supporting clinicians and both patients and clinicians
| Author, year^a | Health domain | CA purpose | Technical performance | User experience | Health-related measures |
|---|---|---|---|---|---|
| Technology supporting clinicians^b | | | | | |
| Philip et al., 2017 | Mental health (depression) | Clinical interview (major depressive disorder diagnosis) | NR | • High acceptability of the ECA: score 25.4 (0-30) on the Acceptability e-Scale (validated) | • Sens.=49%, spec.=93%, PPV=63%, NPV=88% (severe depressive symptoms: sens.=73% and spec.=95%); AUC: 0.71 (95% CI 0.59–0.81) |
| Lucas et al., 2017 | Mental health (PTSD) | Clinical interview (PTSD diagnosis) | NR | NR | • Study 1: Participants reported more PTSD symptoms when asked by the ECA than by the other 2 modalities |
| Philip et al., 2014 | Obstructive sleep apnea (daytime sleepiness) | Clinical interview (excessive daytime sleepiness diagnosis) | NR | • Most subjects had a positive perception of the ECA and considered the ECA interview a good experience (non-validated questionnaire, 7 questions) | • Sens.>0.89, spec.>0.81 (sleepiest patients: sens. and spec.>98%) • ESS scores from ECA and physician interviews were correlated (r=0.95) |
| Beveridge and Fox, 2006 | Breast cancer | Data collection and clinician decision support (referral to a cancer specialist) | • | • Ease of use: moderate (nq) • 691 system responses; 79.2% "appropriate," 4.6% "borderline appropriate/inappropriate," 14.5% "completely inappropriate," 1.2% "incomprehensible," and 0.6% "total failure" • Issues: spoken language understanding and dialogue management | NR |
| Technology supporting patients and clinicians^b | | | | | |
| Black et al., 2005 | Type 2 diabetes | Data collection, telemonitoring | • | | |
| Levin and Levin, 2006 | Pain monitoring | Data collection | • Data capture rate: 98% (2% flagged for transcription) • Task-oriented dialogue turns: 82% | • Users became more efficient with experience, increasing the % of interrupted prompts and task-oriented dialogue | NR |
| Giorgino et al., 2005 | Hypertension | Data collection, telemonitoring | • Authors mention satisfactory performance, but evaluation data are not reported in detail • 80% successful task completion; 35% confirmation questions | NR | NR |
Abbreviations: AUC: Area Under the Curve; CA: conversational agent; CI: confidence interval; ECA: Embodied Conversational Agent; ESS: Epworth Sleepiness Scale; nq: not quantified in the paper; NR: not reported; p: p-value, measure of statistical significance; PTSD: Post Traumatic Stress Disorder; r: correlation coefficient; RCT: randomized controlled trial; sens.: sensitivity; spec.: specificity
^a Studies evaluating the same conversational agent were grouped together; ^b Technology supporting clinicians: systems that support clinical work in the healthcare setting (e.g. a CA substituting for a clinician in a clinical interview for diagnostic purposes); Technology supporting patients and clinicians: systems that support both consumers in their daily lives and clinical work in the healthcare setting (e.g. telemonitoring systems involving a CA).
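As a brief reminder of how the diagnostic accuracy measures reported above (sensitivity, specificity, PPV, NPV) are derived, the sketch below computes them from a 2x2 confusion matrix comparing an agent-administered screening interview against a clinician reference standard. The counts are hypothetical and are not taken from the reviewed studies.

```python
# Minimal sketch of the diagnostic accuracy measures reported above
# (sensitivity, specificity, PPV, NPV), computed from a 2x2 confusion
# matrix of CA screening results vs a clinician reference standard.
# The counts below are hypothetical, not data from the reviewed studies.

def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all with the condition
        "specificity": tn / (tn + fp),  # true negatives among all without the condition
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical screening results: CA classification vs clinician diagnosis
print(diagnostic_metrics(tp=35, fp=20, fn=15, tn=130))
```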