Chau Tong1, Drew Margolin1, Rumi Chunara2,3, Jeff Niederdeppe1,4, Teairah Taylor1, Natalie Dunbar5, Andy J King6,7.
Abstract
BACKGROUND: Common methods for extracting content in health communication research typically involve using a set of well-established queries, usually names of medical procedures or diseases, that are often technical or rarely used in the public discussion of health topics. Although these methods produce high precision (ie, retrieve highly relevant content), they tend to overlook health messages that feature colloquial language and layperson vocabularies on social media. Given that such messages could contain misinformation or obscure content that circumvents official medical concepts, correctly identifying (and analyzing) them is crucial to the study of user-generated health content on social media platforms.
Keywords: NLP; computational textual analysis; health communication; health information retrieval; natural language processing; network analysis; public health; search term identification; social media; word embeddings; word2vec
Year: 2022 PMID: 36040760 PMCID: PMC9472050 DOI: 10.2196/37862
Source DB: PubMed Journal: JMIR Med Inform
Neighbor terms to “colonoscopy” and similarity scores.
| Term^a | Similarity score | Rank |
| “suprep” | 0.9722890 | 1 |
| “peg” | 0.9519246 | 2 |
| “sutab” | 0.9513488 | 3 |
| “plenvu” | 0.9504289 | 4 |
| “glycol” | 0.9498276 | 5 |
| “miralax” | 0.9449067 | 6 |
| “rectal” | 0.9435940 | 7 |
| “cleanse” | 0.9422708 | 8 |
| “cologuard” | 0.9421358 | 9 |
| “colorectal” | 0.9403084 | 10 |
^a Neighbor terms are the terms with the greatest semantic similarity to “colonoscopy” (ie, the highest similarity scores and therefore the lowest rank numbers), based on YouTube video data. Score refers to the cosine similarity between word embeddings (ie, terms) in a multidimensional vector space.
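The ranking in the table above rests on cosine similarity between term vectors. The following is a minimal sketch of that computation, not the authors' pipeline: the 3-dimensional vectors and the helper names `cosine_similarity` and `nearest_neighbors` are hypothetical, chosen only for illustration, whereas a real word2vec model would supply vectors with many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def nearest_neighbors(focal, embeddings, k=10):
    """Return the k terms most similar to `focal`, best first."""
    target = embeddings[focal]
    scored = [
        (term, cosine_similarity(target, vec))
        for term, vec in embeddings.items()
        if term != focal
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors; these are NOT the embeddings used in the study.
toy_embeddings = {
    "colonoscopy": [0.9, 0.1, 0.3],
    "suprep": [0.8, 0.2, 0.3],
    "miralax": [0.7, 0.4, 0.2],
    "guitar": [0.0, 0.9, 0.1],
}

neighbors = nearest_neighbors("colonoscopy", toy_embeddings, k=2)
for rank, (term, score) in enumerate(neighbors, start=1):
    print(rank, term, round(score, 4))
```

Because cosine similarity depends only on the angle between vectors, terms that occur in similar contexts can score close to 1 even when their raw frequencies differ widely.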
Retrieval statistics in the sampled videos for the top 6 neighbors of “colonoscopy.”
| Terms | Sample of coded videos, N | Relevant (precision), n (%) | Relevant and mention of “colonoscopy,” n (%) | Relevant and does not mention “colonoscopy” (recall improvement), n (%) |
| “suprep” | 25 | 18 (72) | 9 (36) | 9 (36) |
| “peg” | 25 | 1 (4) | 0 (0) | 1 (4) |
| “sutab” | 25 | 4 (16) | 4 (16) | 0 (0) |
| “plenvu” | 25 | 23 (92) | 15 (60) | 8 (32) |
| “glycol” | 25 | 0 (0) | 0 (0) | 0 (0) |
| “miralax” | 25 | 5 (20) | 2 (8) | 3 (12) |
| Total | 150 | 51 (34) | 30 (20) | 21 (14) |
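The percentages in this table follow directly from the coded counts. As a rough check, the sketch below recomputes per-term and pooled precision and recall improvement from the counts reported above; the dictionary layout and the function name `summarize` are illustrative assumptions.

```python
# Counts copied from the table: (sample size, relevant and mentions
# "colonoscopy", relevant and does not mention "colonoscopy").
coded = {
    "suprep": (25, 9, 9),
    "peg": (25, 0, 1),
    "sutab": (25, 4, 0),
    "plenvu": (25, 15, 8),
    "glycol": (25, 0, 0),
    "miralax": (25, 2, 3),
}


def summarize(sample, with_mention, without_mention):
    """Return (relevant count, precision, recall-improvement share)."""
    relevant = with_mention + without_mention
    return relevant, relevant / sample, without_mention / sample


per_term = {term: summarize(*counts) for term, counts in coded.items()}

# Pool the counts across all six terms to reproduce the "Total" row.
pooled = summarize(*(sum(column) for column in zip(*coded.values())))

for term, (relevant, precision, new_share) in per_term.items():
    print(f"{term}: relevant={relevant}, precision={precision:.0%}, new={new_share:.0%}")
print(f"total: relevant={pooled[0]}, precision={pooled[1]:.0%}, new={pooled[2]:.0%}")
```

The last column is the share of sampled videos that are relevant but would be missed by a query on “colonoscopy” alone, which is why it is read as the recall improvement.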
Euclidean distance between the text features of the original “colonoscopy” video set and the video sets generated from the top 6 neighbor terms^a.
| Term | 1 | 2 | 3 | 4 | 5 | 6 |
| “colonoscopy” | 0 | 255.61 | 257.97 | 241.5 | 248.9 | 254.68 |
| “miralax” | N/A^b | 0 | 6.32 | 20.1 | 21.8 | 7.14 |
| “peg” | N/A | N/A | 0 | 22.2 | 23.1 | 6.86 |
| “plenvu” | N/A | N/A | N/A | 0 | 20.6 | 19.08 |
| “suprep” | N/A | N/A | N/A | N/A | 0 | 20.57 |
| “sutab” | N/A | N/A | N/A | N/A | N/A | 0 |
^a Cell values indicate dissimilarities between the text features of each pair of video sets. Columns 1 to 6 follow the row order (“colonoscopy,” “miralax,” “peg,” “plenvu,” “suprep,” and “sutab”). Larger values indicate larger distances, and 0 indicates identical text features. “Glycol” was removed because it retrieved 0 relevant videos.
^b N/A: not applicable.
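To make the distance measure concrete, the sketch below computes a Euclidean distance between two small sets of texts represented as raw term-frequency vectors. This is an illustration under stated assumptions: the source does not specify the tokenization or feature weighting used, and the example texts and helper names (`term_frequencies`, `euclidean_distance`) are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(texts):
    """Count lowercase whitespace-separated tokens across a set of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two term-frequency vectors."""
    vocabulary = set(freq_a) | set(freq_b)
    # A Counter returns 0 for terms it has not seen.
    return math.sqrt(sum((freq_a[t] - freq_b[t]) ** 2 for t in vocabulary))


# Hypothetical video descriptions, for illustration only.
set_one = ["colonoscopy prep day", "my colonoscopy experience"]
set_two = ["bowel prep tips", "prep day vlog"]

freq_one = term_frequencies(set_one)
freq_two = term_frequencies(set_two)
print(round(euclidean_distance(freq_one, freq_two), 3))
```

A distance of 0 means the two sets have identical term counts, matching the diagonal of the table.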
Figure 1. Relative frequencies of words in the colonoscopy video set and the combined top 5 neighbor term video set. Words that are “key” to each video set were plotted. Original: the set of videos found with the search query “colonoscopy.” Reference: the set of videos found with the 5 nearest terms to “colonoscopy” (“suprep,” “peg,” “sutab,” “plenvu,” and “miralax”). chi2: chi-square value.
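A word's “keyness” to one video set can be scored with a chi-square statistic on a 2×2 table of word counts. The sketch below is one common formulation (Pearson chi-square without continuity correction, signed by the direction of the difference); whether the figure used exactly this variant is an assumption, and the function name `keyness_chi2` and the counts in the example are hypothetical.

```python
def keyness_chi2(target_count, target_total, reference_count, reference_total):
    """Signed Pearson chi-square for one word in a target vs reference corpus.

    Positive values mean the word is overrepresented in the target corpus,
    negative values mean it is overrepresented in the reference corpus.
    """
    a = target_count                      # word occurrences in target
    b = reference_count                   # word occurrences in reference
    c = target_total - target_count       # all other tokens in target
    d = reference_total - reference_count # all other tokens in reference
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # The observed count exceeds the expected count exactly when a*d > b*c.
    return chi2 if a * d >= b * c else -chi2


# Hypothetical counts: a word seen 20 times in 100 target tokens
# and 5 times in 100 reference tokens.
print(round(keyness_chi2(20, 100, 5, 100), 2))
```

Ranking all words by this signed score and plotting the extremes at both ends gives the kind of two-sided keyness plot described in the caption.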
Figure 2. Visualization of distances between video sets. Hierarchical cluster analysis indicating dissimilarities and distances between the original set (videos found with the search query “colonoscopy”) and the sets of videos found with the 5 nearest terms to “colonoscopy” (“suprep,” “peg,” “sutab,” “plenvu,” and “miralax”).
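The clustering in Figure 2 can be approximated from the pairwise Euclidean distances reported in the table above. The sketch below runs a small agglomerative clustering with average linkage in plain Python; the linkage method is an assumption, since the source does not state which one was used, and the function name `average_linkage` is illustrative.

```python
from itertools import combinations

labels = ["colonoscopy", "miralax", "peg", "plenvu", "suprep", "sutab"]

# Upper triangle of the Euclidean distance table (row i, column j > i).
upper = {
    ("colonoscopy", "miralax"): 255.61, ("colonoscopy", "peg"): 257.97,
    ("colonoscopy", "plenvu"): 241.5, ("colonoscopy", "suprep"): 248.9,
    ("colonoscopy", "sutab"): 254.68,
    ("miralax", "peg"): 6.32, ("miralax", "plenvu"): 20.1,
    ("miralax", "suprep"): 21.8, ("miralax", "sutab"): 7.14,
    ("peg", "plenvu"): 22.2, ("peg", "suprep"): 23.1, ("peg", "sutab"): 6.86,
    ("plenvu", "suprep"): 20.6, ("plenvu", "sutab"): 19.08,
    ("suprep", "sutab"): 20.57,
}
dist = {frozenset(pair): value for pair, value in upper.items()}


def average_linkage(labels, dist):
    """Merge the two closest clusters repeatedly; return the merge history."""
    def average(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        merges.append((set(c1), set(c2), average(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


merges = average_linkage(labels, dist)
for left, right, height in merges:
    print(sorted(left), "+", sorted(right), "at", round(height, 2))
```

Under these assumptions the neighbor-term sets merge with one another at small distances, and the original “colonoscopy” set joins last at a much larger distance, which is the pattern the distance table suggests.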
Relevance of newly found videos by the number of links to the original set of colonoscopy videos (total degree).
| Total degree^a | Count of videos with total degree, N | Number of videos coded as “relevant” (relevancy), n (%) | Cumulative count of nonduplicate videos, N | Cumulative count of nonduplicate relevant videos, n | Cumulative precision^b (%) | Cumulative recall^c (%) | Cumulative F1 score^d (%) |
| 44 | 1 | 1 (100) | 1 | 1 | 100 | 2.7 | 5.3 |
| 41 | 1 | 1 (100) | 2 | 2 | 100 | 5.4 | 10.3 |
| 26 | 1 | 1 (100) | 3 | 3 | 100 | 8.1 | 15.0 |
| 23 | 1 | 1 (100) | 4 | 4 | 100 | 10.8 | 19.5 |
| 22 | 1 | 1 (100) | 5 | 5 | 100 | 13.5 | 23.8 |
| 21 | 1 | 1 (100) | 6 | 6 | 100 | 16.2 | 27.9 |
| 20 | 2 | 2 (100) | 8 | 8 | 100 | 21.6 | 35.6 |
| 19 | 1 | 1 (100) | 9 | 9 | 100 | 24.3 | 39.1 |
| 18 | 1 | 1 (100) | 10 | 10 | 100 | 27.0 | 42.6 |
| 17 | 2 | 2 (100) | 12 | 12 | 100 | 32.4 | 49.0 |
| 16 | 1 | 1 (100) | 13 | 13 | 100 | 35.1 | 52.0 |
| 15 | 2 | 2 (100) | 15 | 15 | 100 | 40.5 | 57.7 |
| 14 | 1 | 1 (100) | 16 | 16 | 100 | 43.2 | 60.4 |
| 13 | 1 | 1 (100) | 17 | 17 | 100 | 45.9 | 63.0 |
| 12 | 2 | 2 (100) | 19 | 19 | 100 | 51.4 | 67.9 |
| 11 | 2 | 2 (100) | 21 | 21 | 100 | 56.8 | 72.4 |
| 10 | 1 | 1 (100) | 22 | 22 | 100 | 59.5 | 74.6 |
| 9 | 1 | 1 (100) | 23 | 23 | 100 | 62.2 | 76.7 |
| 7 | 2 | 1 (50) | 25 | 24 | 96 | 64.9 | 77.4 |
| 6 | 1 | 1 (100) | 26 | 25 | 96 | 67.6 | 79.4 |
| 5 | 2 | 0 (0) | 28 | 25 | 89 | 67.6 | 76.9 |
| 4 | 2 | 2 (100) | 30 | 27 | 90 | 73.0 | 80.6 |
| 3 | 2 | 1 (50) | 32 | 28 | 88 | 75.7 | 81.2 |
| 2 | 5 | 1 (20) | 37 | 29 | 78 | 78.4 | 78.4 |
| 1 | 5 | 1 (20) | 42 | 30 | 71 | 81.1 | 75.9 |
| 0 | 71 | 7 (10) | 113 | 37 | 33 | 100 | 49.3 |
^a The sum of connections each new video has with the videos in the original colonoscopy video set.
^b The cumulative count of relevant videos divided by the cumulative count of all videos.
^c The cumulative count of relevant videos divided by the total number of new, nonduplicate relevant videos (37).
^d The harmonic mean of cumulative precision and cumulative recall.
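The cumulative columns can be reproduced from the first three columns by lowering the degree threshold one step at a time. The sketch below does this with the counts from the table; the variable and function names are illustrative, and the last metric is the harmonic mean described in footnote d.

```python
# (total degree, videos with that degree, videos coded as relevant),
# copied from the table above.
rows = [
    (44, 1, 1), (41, 1, 1), (26, 1, 1), (23, 1, 1), (22, 1, 1), (21, 1, 1),
    (20, 2, 2), (19, 1, 1), (18, 1, 1), (17, 2, 2), (16, 1, 1), (15, 2, 2),
    (14, 1, 1), (13, 1, 1), (12, 2, 2), (11, 2, 2), (10, 1, 1), (9, 1, 1),
    (7, 2, 1), (6, 1, 1), (5, 2, 0), (4, 2, 2), (3, 2, 1), (2, 5, 1),
    (1, 5, 1), (0, 71, 7),
]


def cumulative_metrics(rows):
    """Precision, recall, and their harmonic mean for each degree cutoff."""
    total_relevant = sum(relevant for _, _, relevant in rows)
    results = []
    cum_videos = cum_relevant = 0
    for degree, count, relevant in sorted(rows, reverse=True):
        cum_videos += count
        cum_relevant += relevant
        precision = cum_relevant / cum_videos
        recall = cum_relevant / total_relevant
        total = precision + recall
        harmonic = 2 * precision * recall / total if total else 0.0
        results.append((degree, cum_videos, cum_relevant, precision, recall, harmonic))
    return results


metrics = cumulative_metrics(rows)
best = max(metrics, key=lambda row: row[5])
print(f"best cutoff: degree >= {best[0]}, harmonic mean = {best[5]:.1%}")
```

With these counts the harmonic mean peaks at a cutoff of degree ≥3, whereas degree ≥1 keeps 42 videos, 30 of which are relevant.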
Summary retrieval statistics for “colonoscopy,” “FOBT,” “mammogram,” and “pap test.”
| Focal term | Top nearest neighbor terms | Sample of coded videos (videos per term) | New and nonduplicate relevant videos (set A), N | Videos with degree ≥1 (set B)^a, N | Videos with degree ≥1^a and coded as new and relevant, n (A∩B) | Precision, n/N (%) | Recall, n/N (%) |
| Colonoscopy | “suprep”, “peg”, “sutab”, “plenvu”, “glycol”, “miralax” | 150 (25) | 37 | 42 | 30 | 30/42 (71) | 30/37 (81) |
| FOBT^b | “iFOBT”, “hemosure”, “immunochemical”, “immunostics”, “guaiac” | 125 (25) | 50 | 33 | 27 | 27/33 (82) | 27/50 (54) |
| Mammogram | “smartcurve”, “breastcheck”, “biopsy”, “ultrasound”, “breastcancerawareness” | 250 (50) | 77 | 28 | 23 | 23/28 (82) | 23/77 (30) |
| Pap test | “Colposcopy”, “Smear”, “ASCUS”^c, “papsmear”, “STD”^d | 250 (50) | 87 | 65 | 59 | 59/65 (91) | 59/87 (68) |
^a Videos with at least one connection to the original set of videos retrieved with the focal terms.
^b FOBT: fecal occult blood test.
^c ASCUS: atypical squamous cells of undetermined significance.
^d STD: sexually transmitted disease.
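The precision and recall columns in this summary table can be read as set ratios over set A (new, nonduplicate relevant videos) and set B (videos with degree ≥1). The notation below is a restatement of the column headers, with the Pap test row worked through as an example.

```latex
\[
\text{Precision} = \frac{|A \cap B|}{|B|},
\qquad
\text{Recall} = \frac{|A \cap B|}{|A|}
\]
\[
\text{Pap test:}\quad
\frac{|A \cap B|}{|B|} = \frac{59}{65} \approx 0.91,
\qquad
\frac{|A \cap B|}{|A|} = \frac{59}{87} \approx 0.68
\]
```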