| Literature DB >> 34824135 |
Robert P Lennon, Robbie Fraleigh, Lauren J Van Scoy, Aparna Keshaviah, Xindi C Hu, Bethany L Snyder, Erin L Miller, William A Calo, Aleksandra E Zgierska, Christopher Griffin.
Abstract
Qualitative research remains underused, in part due to the time and cost of annotating qualitative data (coding). Artificial intelligence (AI) has been suggested as a means to reduce those burdens and has been applied in exploratory studies to reduce the burden of coding. However, methods to date use AI analytical techniques that lack transparency, potentially limiting acceptance of results. We developed an automated qualitative assistant (AQUA) using a semiclassical approach, replacing Latent Semantic Indexing/Latent Dirichlet Allocation with a more transparent graph-theoretic topic extraction and clustering method. Applied to a large dataset of free-text survey responses, AQUA generated unsupervised topic categories and circle hierarchical representations of free-text responses, enabling rapid interpretation of data. When tasked with coding a subset of free-text data into user-defined qualitative categories, AQUA demonstrated intercoder reliability in several multicategory combinations with a Cohen's kappa comparable to human coders (0.62-0.72), enabling researchers to automate coding on those categories for the entire dataset. The aim of this manuscript is to describe pertinent components of best practices of AI/machine learning (ML)-assisted qualitative methods, illustrating how primary care researchers may use AQUA to rapidly and accurately code large text datasets. The contribution of this article is providing guidance that should increase AI/ML transparency and reproducibility. © Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.
Keywords: qualitative research
Year: 2021 PMID: 34824135 PMCID: PMC8627418 DOI: 10.1136/fmch-2021-001287
Source DB: PubMed Journal: Fam Med Community Health ISSN: 2305-6983
Figure 1. Procedural diagram for the application of AQUA to free-text data used in the Illustration. AQUA, automated qualitative assistant.
Figure 2. Unsupervised clustering (circle hierarchy) of survey responses. (A) Circles represent a linguistic community or topic. Nested circles represent hierarchical organisation. The smallest circles are individual survey responses. The seven parent unsupervised topics are labelled. (B) Parent unsupervised topics.
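The record does not detail AQUA's exact graph-theoretic algorithm, but the idea of transparent, unsupervised topic communities can be sketched in pure Python: link responses that share enough words, then treat connected components of that graph as topic communities. The `cluster_responses` helper, the `min_shared` threshold, and the sample responses below are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def cluster_responses(responses, min_shared=2):
    """Toy graph-theoretic clustering (illustrative, not AQUA itself):
    connect responses sharing >= min_shared words, then report the
    connected components of that graph as topic communities."""
    tokens = [set(r.lower().split()) for r in responses]
    # Build an adjacency list over response indices.
    adj = {i: set() for i in range(len(responses))}
    for i, j in combinations(range(len(responses)), 2):
        if len(tokens[i] & tokens[j]) >= min_shared:
            adj[i].add(j)
            adj[j].add(i)
    # Connected components via depth-first search.
    seen, communities = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(adj[node] - seen)
        communities.append(sorted(comp))
    return communities

responses = [
    "i do not trust the vaccine",
    "trust the vaccine science",
    "worried about my family health",
    "my family health worries me",
]
print(cluster_responses(responses))  # → [[0, 1], [2, 3]]
```

Because every edge and component is inspectable, a researcher can audit why two responses were grouped together, which is the transparency advantage the abstract claims over latent-variable methods.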
Kappa values evaluating the inter-rater reliability (agreement) between human coders and supervised training of three-topic training models
| Topics | Train 0.2 | Train 0.4 | Train 0.6 | Train 0.8 |
| --- | --- | --- | --- | --- |
| 124 | 0.67 (±0.01) | 0.70 (±0.02) | 0.68 (±0.03) | 0.70 (±0.04) |
| 135 | 0.66 (±0.02) | 0.66 (±0.02) | 0.67 (±0.05) | 0.67 (±0.06) |
| 145 | 0.67 (±0.02) | 0.71 (±0.02) | 0.70 (±0.02) | 0.72 (±0.02) |
| 245 | 0.67 (±0.04) | 0.69 (±0.03) | 0.71 (±0.03) | 0.72 (±0.06) |
| 257 | 0.68 (±0.02) | 0.70 (±0.01) | 0.71 (±0.02) | 0.70 (±0.02) |
Topic labels are: (1) distrust, (2) media messaging, (3) trusted sources of information, (4) personal medical concerns, (5) family concerns, (6) societal concerns, (7) barriers to recommendations, (8) no worries, (9) other.
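The kappa values in the table are Cohen's kappa, a chance-corrected agreement statistic: kappa = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected if the two coders labelled independently. A minimal sketch of the computation (the `cohens_kappa` function and the toy label lists are assumptions for illustration, not the study's data):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' labels on the same items."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of items labelled identically.
    po = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement under independence of the coders'
    # marginal label distributions.
    ca, cb = Counter(coder_a), Counter(coder_b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

a = ["distrust", "media", "media", "other", "distrust", "other"]
b = ["distrust", "media", "other", "other", "distrust", "media"]
print(round(cohens_kappa(a, b), 2))  # → 0.5
```

Kappa values in the 0.62-0.72 range reported above are conventionally read as substantial agreement, which is why matching human coders at that level justifies automating the remaining coding.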
Figure 3. Human-coded topics (rows) are compared with unsupervised topic communities (columns). Each row depicts the distribution of responses for each human-coded topic across unsupervised topic communities. The heat chart to the right shows the gradient colour scheme for percent agreement (darker is better agreement).