Chaoyuan Zuo, Kritik Mathur, Dhruv Kela, Noushin Salek Faramarzi, Ritwik Banerjee
Abstract
Natural language undergoes significant transformation from the domain of specialized research to general news intended for wider consumption. This transition makes the information vulnerable to misinterpretation, misrepresentation, and incorrect attribution, all of which may be difficult to identify without adequate domain knowledge and may exist even in the presence of explicit citations. Moreover, newswire articles seldom provide a precise correspondence between a specific claim and its origin, making it harder to identify which claims, if any, reflect the original findings. For instance, an article stating "Flagellin shows therapeutic potential with H3N2, known as Aussie Flu." contains two claims ("Flagellin ... H3N2," and "H3N2, known as Aussie Flu") that may be true or false independent of each other, and it is prima facie unclear which claims, if any, are supported by the cited research. We build a dataset of sentences from medical news along with the sources from peer-reviewed medical research journals they cite. We use these data to study what a general reader perceives to be true, and how to verify the scientific source of claims. Unlike existing datasets, this captures the metamorphosis of information across two genres with disparate readership and vastly different vocabularies and presents the first empirical study of health-related fact-checking across them.
Keywords: Check-worthiness; Claim extraction; Cross-genre information retrieval; Fact-checking; Misinformation; Natural language processing
Year: 2022 PMID: 35128039 PMCID: PMC8807956 DOI: 10.1007/s41060-022-00310-7
Source DB: PubMed Journal: Int J Data Sci Anal
An article citing and (mis)quoting peer-reviewed research while presenting medical information. General trust in the publisher and the mere existence of the hyperlink (bold) are powerful markers of credibility; the reader often trusts such information without further verification.
[Figure: screenshot of a news article with an embedded hyperlink to the cited research "Comprehension of Top 200 Prescribed Drugs in the US as a Resource for Pharmacy Teaching, Training and Practice".]
Sentences with at least one primary claim worth verifying along with embedded citations (bold). Claims unsupported by the cited research are marked by a red asterisk (*). All sources last accessed on May 5, 2020
| (A) | In a research published in the … |
| (B) | Flaxseed fiber reportedly helps … |
| | (b) Flaxseed fiber helps ... pressure |
| (C) | Health workers have been using a vaccine made by Merck, which has been … |
| | (b) It showed some success ... DRC |
| (D) | Some experts say they can … |
| | (b) gaming ... problem-solving abilities |
| | (c) gamers have sedentary lifestyles |
| | (d) gamers ... mental health issues |
Fig. 1 Compared to other fact-checking datasets (FEVER [81] and CLEF-2019 CheckThat! [5]), sentences in medical newswire are long and complex, often positioning the primary claim(s) within a larger context of other information
Annotations on medical newswire claims perceived as verifiable and check-worthy, showing the number of sentences with agreement and disagreement (e.g., for single-citation sentences, 125/(4757+125) ≈ 2.56% disagreement). The two main types of disagreements in sentences with only one embedded hyperlink (bold) to peer-reviewed research are over the (1) inclusion of the post-modifier and (2) scope of the primary claim itself
| | No. of embedded citations in a sentence | | | Total |
|---|---|---|---|---|
| | 1 | 2 | 3 | |
| Annotators agree | 4757 | 151 | 51 | 4959 |
| Annotators disagree | 125 | 23 | 9 | 157 |
| Disagreement (%) | 2.56 | 13.22 | 15.00 | 3.07 |
(1) "This may help to prevent delay …"
Annotator 1: "This may ... with aging" (inclusion of complex post-modifier of sarcopenia)
Annotator 2: "This may ... sarcopenia"
(2) "Your body starts a fever because the flu virus …"
Annotator 1: "body starts ... temperatures" (causality perceived as the primary check-worthy claim)
Annotator 2: "flu virus ... temperatures" (only the effect perceived as the primary check-worthy claim)
Fig. 2 Distribution of citations over publication sources. Only the top ten are shown for brevity, including "others"
Fig. 3 The distributions of sentence lengths, embedded hyperlink text span lengths, and check-worthy claim lengths
Fig. 4 Prominent words (size proportional to a word's frequency) in (a) complete sentences containing check-worthy claims, and (b) check-worthy claims alone
Fig. 5 The BiLSTM+CRF architecture, with pretrained word embeddings serving as the input
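The figure names the claim-extraction architecture only. As a rough illustration, a minimal BiLSTM-CRF tagger over pretrained embeddings might look like the sketch below; it uses the third-party pytorch-crf package, and the dimensions, BIO tag set, and hyperparameters are placeholders, not the paper's settings.

```python
# Minimal BiLSTM-CRF sequence tagger (a sketch, not the paper's exact model).
# Assumes the third-party package `pytorch-crf`: pip install pytorch-crf
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=256, num_tags=3):
        super().__init__()
        # num_tags=3 assumes a BIO scheme over claim spans: B-CLAIM, I-CLAIM, O.
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        self.emit = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, embeddings, tags=None, mask=None):
        # embeddings: (batch, seq_len, embed_dim) pretrained word vectors.
        h, _ = self.lstm(embeddings)
        emissions = self.emit(h)
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask, reduction='mean')
        # Inference: Viterbi decoding of the best tag sequence.
        return self.crf.decode(emissions, mask=mask)
```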
Training, development, and test sets for claim extraction. "Single": sentences with a single embedded citation; "All": sentences with multiple citations are also included, repeated with one citation per copy (a sketch of this expansion follows the table)
| | Number of sentences | |
|---|---|---|
| | Single | All |
| Training | 3550 | 3868 |
| Development | 627 | 695 |
| Test | 580 | 549 |
| Total | 4757 | 5212 |
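To make the expansion step in the caption concrete, a minimal sketch is below; the field names (`text`, `citations`) are illustrative placeholders, not the dataset's actual schema.

```python
# Sketch: expand multi-citation sentences into one copy per citation.
# Field names (`text`, `citations`) are illustrative, not the dataset schema.
def expand_sentences(sentences):
    expanded = []
    for sent in sentences:
        for citation in sent["citations"]:
            expanded.append({"text": sent["text"], "citation": citation})
    return expanded

# Consistent with the tables above: 4757 single-citation sentences plus
# 151 two-citation and 51 three-citation sentences expand to
# 4757 + 151*2 + 51*3 = 5212 sentence-citation instances.
```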
Claim extraction results on the test set, showing Precision (P), Recall (R), and F1 under strict and relaxed matching. The rows below the BERT baseline additionally use the position embedding. Pretrained BERT fine-tuned on the training set, but without the BiLSTM-CRF layer, serves as the baseline
| Embedding | Strict | | | Relaxed | | |
|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 |
| GloVe | 70.0 | 68.7 | 69.3 | 71.1 | 70.3 | 71.0 |
| BioNLP | 71.9 | 68.2 | 70.0 | 73.9 | 70.2 | 72.0 |
| Flair | 77.5 | 74.2 | 75.8 | 78.2 | 75.0 | 76.6 |
| RoBERTa | 78.5 | 75.1 | 76.8 | 79.3 | 75.9 | 77.5 |
| BioBERT | 74.7 | 77.1 | 75.3 | 77.8 | | |
| GloVe+RoBERTa | 79.4 | 75.1 | 77.2 | 80.2 | 75.9 | 78.0 |
| BioBERT+RoBERTa | 79.0 | 76.0 | 77.5 | 79.6 | 76.6 | 78.1 |
| BioBERT+Flair | 78.7 | 79.6 | | | | |
| BERT (baseline) | 72.3 | 80.7 | 76.3 | 73.7 | 81.7 | 77.5 |
| GloVe | 72.8 | 74.4 | 73.6 | 74.5 | 76.1 | 75.2 |
| GloVe | 73.6 | 73.7 | 73.6 | 75.1 | 75.3 | 75.2 |
| RoBERTa | 79.9 | 81.1 | 80.5 | 81.1 | 82.4 | 81.7 |
| RoBERTa | 82.1 | 81.5 | 81.8 | 83.1 | 82.5 | 82.8 |
| BioBERT+Flair | | | | | | |
The best results are in bold
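A plausible reading of the strict/relaxed columns, assuming strict requires exact span boundaries while relaxed credits any overlapping prediction (the paper's precise relaxed criterion may differ), is sketched below.

```python
# Sketch: strict vs. relaxed span-level precision/recall/F1.
# Assumption: strict = exact (start, end) match; relaxed = any overlap.
def span_prf(gold, pred, relaxed=False):
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    match = overlaps if relaxed else (lambda g, p: g == p)
    tp_pred = sum(any(match(g, p) for g in gold) for p in pred)
    precision = tp_pred / len(pred) if pred else 0.0
    tp_gold = sum(any(match(g, p) for p in pred) for g in gold)
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: a predicted span that overlaps the gold span but misses its end.
gold, pred = [(5, 12)], [(5, 10)]
print(span_prf(gold, pred))                # strict:  (0.0, 0.0, 0.0)
print(span_prf(gold, pred, relaxed=True))  # relaxed: (1.0, 1.0, 1.0)
```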
Likert-type rating scale used in the annotation task for cross-genre claim verification
| Score | Relation between the sentence from the abstract of the cited research publication and the newswire claim |
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
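The verification experiments below group claim-abstract pairs into Supported, Unsupported, and Uncertain. Since the scale descriptions did not survive above, the mapping below is only an assumed illustration of how 1-5 scores might collapse into those three classes, not the paper's actual rule.

```python
# Hypothetical mapping from 1-5 Likert support scores to the three classes
# used later (Supported / Unsupported / Uncertain). The thresholds are an
# assumption for illustration only, not the paper's actual rule.
def likert_to_class(score: int) -> str:
    if score >= 4:
        return "Supported"
    if score <= 2:
        return "Unsupported"
    return "Uncertain"
```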
Fig. 6 The distribution of scores indicating how well a claim from newswire is supported by (1) a specific sentence from the abstract of the cited peer-reviewed research, and (2) the entire abstract of that publication
Pairs of (1) a claim (italics) perceived by readers as being supported by the cited research, and (2) a sentence from that research
| | Newswire claim | Sentence from the cited research |
|---|---|---|
| (A) | | Substituting 5% energy intake from vegetable protein for animal protein was associated with a 23% (95% CI: 16, 30) reduced risk of T2D. |
| (B) | | Hamstrings were loaded isometrically during good-mornings but dynamically during deadlifts. |
| (C) | | It is believed that genetic factors, host immune system disorders, intestinal microbiota dysbiosis, and environmental factors contribute to the pathogenesis of UC |
Size of the training, development, and test sets for cross-genre (newswire and medical research literature) verification of claims
| | Number of claim–abstract pairs | | | |
|---|---|---|---|---|
| | Supported | Unsupported | Uncertain | Total |
| Training | 28 | 7 | 45 | 80 |
| Development | 15 | 3 | 23 | 41 |
| Test | 25 | 6 | 51 | 82 |
| Total | 68 | 16 | 119 | 203 |
Claim verification results, with the models fine-tuned further on STS and MedSTS, evaluating the ranking of sentences in the cited research by the mean squared error (MSE) and Pearson correlation coefficient (PCC). The subsequent classification results (Precision, Recall, F1, Accuracy) are shown in italics
| Model | MSE | PCC | P | R | F1 | Accuracy |
|---|---|---|---|---|---|---|
| BERT | 0.865 | 0.506 | | | | |
| BERT | 0.851 | 0.536 | | | | |
| BERT | 0.832 | 0.540 | | | | |
| BioBERT | 0.518 | 0.729 | | | | |
| BioBERT | | | | | | |
| BioBERT | | | | | | |
| XLNet | 0.646 | 0.663 | | | | |
| XLNet | 0.522 | | | | | |
| XLNet | | | 67.1 | 67.6 | 67.1 | |
Significant results are highlighted in bold
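As a rough illustration of the setup in the caption above, the sketch below treats claim verification as STS-style sentence-pair regression with Hugging Face Transformers and evaluates with MSE and PCC. The model name, example pairs, and gold scores are placeholders, and a real run would first fine-tune the regression head on STS/MedSTS.

```python
# Sketch: score (claim, abstract-sentence) pairs as STS-style regression,
# then evaluate with MSE and Pearson correlation. The model name and data
# are placeholders; without fine-tuning, the scores are meaningless.
import torch
from scipy.stats import pearsonr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # single-output regression head
model.eval()

def score_pairs(claim, sentences):
    """Return a support score for the claim against each abstract sentence."""
    enc = tokenizer([claim] * len(sentences), sentences,
                    padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).logits.squeeze(-1).tolist()

# Illustrative gold 1-5 ratings and abstract sentences (not real data).
gold = [4.0, 1.0, 3.0]
pred = score_pairs("Flaxseed fiber helps lower blood pressure",
                   ["Abstract sentence one.",
                    "Abstract sentence two.",
                    "Abstract sentence three."])
mse = sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)
pcc, _ = pearsonr(gold, pred)
print(f"MSE={mse:.3f}  PCC={pcc:.3f}")
```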
Fig. 7 BioBERT and XLNet results (both fine-tuned on the STS and MedSTS benchmarks), showing the percentage of true labels classified across the three categories
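The per-true-label percentages in Fig. 7 amount to a row-normalized confusion matrix; a minimal sketch with scikit-learn follows, where the labels and predictions are made up for illustration.

```python
# Sketch: percentage of each true label assigned to each predicted category,
# i.e., a row-normalized confusion matrix (example data is illustrative).
from sklearn.metrics import confusion_matrix

labels = ["Supported", "Unsupported", "Uncertain"]
y_true = ["Supported", "Uncertain", "Uncertain", "Unsupported"]
y_pred = ["Supported", "Uncertain", "Supported", "Uncertain"]

cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm * 100)  # each row sums to 100%: how one true label is distributed
```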