Stefano Di Sotto, Marco Viviani.
Abstract
The increasing availability of online content raises several questions about effective access to information. In particular, the possibility for almost everyone to generate content without any traditional intermediary has, on the one hand, led to a process of "information democratization" and, on the other hand, negatively affected the genuineness of the information disseminated. This issue is particularly relevant when accessing health information, which has an impact at both the individual and the societal level. Laypersons often lack sufficient health literacy to decide whether or not to rely on this information, and expert users cannot cope with such a large amount of content. For these reasons, there is a need to develop automated solutions that can assist both experts and non-experts in discerning between genuine and non-genuine health information. To contribute to this area, in this paper we study and analyze distinct groups of features and machine learning techniques that can be effective in assessing misinformation in online health-related content, whether in the form of Web pages or social media content. To this aim, and for evaluation purposes, we consider several publicly available datasets that have only recently been generated for the assessment of health misinformation from different perspectives.
Keywords: consumer health; data science; deep learning; health misinformation; information access; information disorder; machine learning; social Web
Year: 2022 PMID: 35206359 PMCID: PMC8872515 DOI: 10.3390/ijerph19042173
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. The DISCERN questionnaire. Graphic elaboration of the 16 questions extracted from http://www.discern.org.uk/discern_instrument.php (accessed on 3 February 2022).
Dimensionality of the original datasets.
| Data | CoAID | ReCOVery | FakeHealth (Release) | FakeHealth (Story) |
|---|---|---|---|---|
| Textual contents | 3555 | 2029 | 606 | 1690 |
| Tweet IDs | 151,964 | 140,820 | 47,338 | 384,073 |
| Retweet IDs | - | - | 16,959 | 92,758 |
| Reply IDs | 122,150 | - | 1575 | 20,644 |
Dimensionality of the gathered and cleaned datasets.
| Data | CoAID | ReCOVery | FakeHealth (Release) | FakeHealth (Story) |
|---|---|---|---|---|
| Textual contents | 1820 | 1910 | 594 | 1498 |
| Tweet IDs | 74,722 | 42,153 | 44,547 | 315,709 |
| Retweet IDs | 65,464 | 43,024 | 16,070 | 99,971 |
| Reply IDs | 29,969 | - | 1253 | 14,472 |
| User IDs | 164,891 | 58,495 | 28,893 | 206,798 |
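The effect of gathering and cleaning can be made concrete by comparing the two dimensionality tables. The following is a minimal sketch that computes, for each dataset, the share of textual contents that survives cleaning. The counts are copied from the tables; the variable and function names (`original`, `cleaned`, `retention`) are illustrative and not part of the original work.

```python
# Textual-content counts copied from the two dimensionality tables.
original = {
    "CoAID": 3555,
    "ReCOVery": 2029,
    "FakeHealth (Release)": 606,
    "FakeHealth (Story)": 1690,
}
cleaned = {
    "CoAID": 1820,
    "ReCOVery": 1910,
    "FakeHealth (Release)": 594,
    "FakeHealth (Story)": 1498,
}


def retention(before, after):
    """Return the fraction of items kept after cleaning, per dataset."""
    return {name: after[name] / before[name] for name in before}


rates = retention(original, cleaned)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

Under these counts, CoAID keeps roughly half of its textual contents, while FakeHealth (Release) keeps about 98%.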
List of the linguistic-stylistic features considered.
| Features | Examples/Explanations |
|---|---|
| Strong modals | might, could, can, would, may |
| Weak modals | should, ought, need, shall, will |
| Conditionals | if |
| Negations | no, not, neither, nor, never |
| Conclusive conjunctions | therefore, thus, furthermore |
| Subordinating conjunctions | until, despite, in spite, though |
| Following conjunctions | but, however, otherwise, yet |
| Definite determiners | the, this, that, those, these |
| Personal pronouns | I, you |
| First person | I, we, me, my, mine, us, our |
| Second person | you, your, yours |
| Third person | he, she, him, her, his, it, its |
| Question particles | why, what, when, which, who |
| Adjectives | correct, extreme, long, visible |
| Adverbs | maybe, about, probably, much |
| Proper nouns | names of places, things, etc. |
| Other nouns | other nouns |
| To be form | be, am, is, are, was, were, been |
| To have form | have, has, had, having |
| Past tense verb | past tense verb |
| Gerund | gerund |
| Participle verb | past or present participle verb |
| Superlatives | superlative adjectives or adverbs |
| Exclamation | exclamation mark |
| Other | other terms |
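Several of the features listed above are closed word lists, so they can be counted with a simple lexicon lookup. The sketch below illustrates this for a subset of the table. It is not the authors' implementation: the tokenizer, the normalization by token count, and the names `LEXICONS` and `stylistic_features` are assumptions made for illustration. Features that need part-of-speech tagging (adjectives, gerunds, superlatives, and so on) are left out.

```python
import re
from collections import Counter

# Word lists copied from the feature table (closed classes only).
LEXICONS = {
    "strong_modals": {"might", "could", "can", "would", "may"},
    "weak_modals": {"should", "ought", "need", "shall", "will"},
    "conditionals": {"if"},
    "negations": {"no", "not", "neither", "nor", "never"},
    "first_person": {"i", "we", "me", "my", "mine", "us", "our"},
    "second_person": {"you", "your", "yours"},
    "third_person": {"he", "she", "him", "her", "his", "it", "its"},
}


def stylistic_features(text):
    """Return per-token relative frequencies for each lexicon feature.

    Assumption: counts are divided by the number of tokens; the source
    does not state how (or whether) the features are normalized.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter()
    for token in tokens:
        for name, words in LEXICONS.items():
            if token in words:
                counts[name] += 1
    features = {name: counts[name] / total for name in LEXICONS}
    # Exclamation marks are counted on the raw text, not on tokens.
    features["exclamation"] = text.count("!") / total
    return features


feats = stylistic_features("You should not take this if you can avoid it!")
```

A vector built this way could be concatenated with a text representation, which is one plausible reading of the "+all" suffix in the classifier names reported in the results tables.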
Figure 2. Architecture of the Convolutional Neural Network.
Figure 3. Architecture of the Bidirectional LSTM network.
Global evaluation results for the CoAID dataset.
| Dataset | Classifier | AUC | F-measure |
|---|---|---|---|
| CoAID | CNN(WE) | | |
| | CNN(WE+all) | 0.962 | 0.943 |
| | ML(WE+all) | 0.925 | 0.914 |
| | ML(BoW-TF-IDF+all) | 0.898 | 0.865 |
| | ML(BoW-binary+all) | 0.892 | 0.863 |
| | Bi-LSTM(WE+all) | 0.849 | 0.859 |
| | Bi-LSTM(WE) | 0.848 | 0.857 |
| | HPN | 0.844 | 0.858 |
| | ML(LIWC) | 0.669 | 0.789 |
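The two metrics reported in the results tables can be computed as in the following minimal sketch. It assumes binary labels (1 for the positive class), computes AUC as the probability that a positive item is scored above a negative one, and computes the F-measure as the balanced F1 on the positive class. The source does not state which F-measure variant or averaging scheme is used, so that choice is an assumption, as are the function names and the toy data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive item has the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def f_measure(labels, preds):
    """Balanced F-measure (F1) on the positive class (assumed variant)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented purely for illustration.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.4, 0.35, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

toy_auc = auc(y_true, y_score)
toy_f1 = f_measure(y_true, y_pred)
```

AUC depends only on how the scores rank the items, whereas the F-measure depends on the decision threshold, which is one reason the two columns can order classifiers differently.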
Global evaluation results for the ReCOVery dataset.
| Dataset | Classifier | AUC | F-measure |
|---|---|---|---|
| ReCOVery | ML(WE+all) | | 0.848 |
| | ML(BoW-TF-IDF+all) | 0.915 | 0.771 |
| | CNN(WE) | 0.913 | |
| | ML(BoW-binary+all) | 0.903 | 0.709 |
| | CNN(WE+all) | 0.896 | 0.828 |
| | ML(LIWC) | 0.817 | 0.743 |
| | Bi-LSTM(WE+all) | 0.741 | 0.655 |
| | Bi-LSTM(WE) | 0.734 | 0.673 |
| | HPN | 0.716 | 0.694 |
Global evaluation results for the FakeHealth dataset.
| Dataset | Classifier | AUC | F-measure |
|---|---|---|---|
| FakeHealth | ML(BoW-TF-IDF+all) | | 0.653 |
| | ML(WE+all) | 0.687 | |
| | ML(BoW-binary+all) | 0.675 | 0.641 |
| | CNN(WE) | 0.661 | 0.602 |
| | CNN(WE+all) | 0.645 | 0.597 |
| | ML(LIWC) | 0.608 | 0.598 |
| | Bi-LSTM(WE) | 0.583 | 0.574 |
| | Bi-LSTM(WE+all) | 0.563 | 0.539 |
| | HPN | 0.581 | 0.593 |
| FakeHealth | ML(BoW-TF-IDF+all) | | 0.627 |
| | CNN(WE) | 0.700 | 0.624 |
| | CNN(WE+all) | 0.698 | 0.655 |
| | ML(LIWC) | 0.694 | 0.704 |
| | ML(BoW-binary+all) | 0.679 | 0.609 |
| | ML(WE+all) | 0.657 | |
| | Bi-LSTM(WE+all) | 0.656 | 0.602 |
| | Bi-LSTM(WE) | 0.654 | 0.602 |
| | HPN | 0.563 | 0.660 |
Evaluation of effectiveness by feature class.
| AUC | CoAID | ReCOVery | FakeHealth (Release) | FakeHealth (Story) |
|---|---|---|---|---|
| | 0.624 | 0.708 | 0.576 | 0.630 |
| | 0.601 | 0.774 | | 0.532 |
| | 0.610 | 0.612 | 0.595 | |
| | 0.729 | | 0.525 | 0.548 |
| | | 0.795 | 0.602 | 0.563 |