Rishabh Upadhyay, Gabriella Pasi, Marco Viviani.
Abstract
Research on countering the online spread of non-genuine information, from opinion spam to fake news, has attracted growing interest across multiple domains in recent years. Partly because of the COVID-19 outbreak and the subsequent proliferation of unfounded claims and highly biased content, attention has recently focused on solutions that automatically assess the genuineness of health information. Most of these approaches, applied to both Web pages and social media content, rely primarily on handcrafted features in conjunction with Machine Learning. In this article, we instead propose a health misinformation detection model whose features are embedded representations of selected structural and content characteristics of Web pages, obtained using an embedding model pre-trained on medical data. These features feed a deep learning classification model that distinguishes genuine health information from health misinformation. The purpose of this article is therefore to evaluate the effectiveness of the proposed model, named Vec4Cred, on this problem. The model is an evolution of a previous one, with respect to which this work introduces and illustrates new features and architectural choices.
Keywords: Consumer health; Deep learning; Health misinformation; Machine learning; Natural language processing
Year: 2022 PMID: 35915807 PMCID: PMC9330960 DOI: 10.1007/s11042-022-13368-z
Source DB: PubMed Journal: Multimed Tools Appl ISSN: 1380-7501 Impact factor: 2.577
Summary of pseudo-automated approaches
| Paper | Health context | Source | Participants | Age | Disease experience | Outcome |
|---|---|---|---|---|---|---|
| [ | Mental health | Web pages | 5 | N/R | Mental health patients | Comprehensiveness, authoritativeness, trustworthiness, and currency of health information on mental health Websites positively affect users’ perceptions w.r.t. message quality |
| [ | HIV prevention | Web pages | 40 | 18-24 | African female college students | Interactive features, practical advice, and content authored by familiar and trustworthy sources positively affect users’ perceptions w.r.t. message utility |
| [ | General health | Web pages | 44 | Mean = 37 | Italian-speaking adults with different health literacy | Adults with low health literacy mentioned less established credibility criteria compared to those mentioned by the group with the higher health literacy; however, common credibility criteria for both groups are: medical authorship, identifiable authorship, institutional authorship, and presence of additional author’s information |
| [ | Diet | Web pages | 252 | N/R | N/R | Health literacy plays an important role in decreasing the effect of health misinformation on the consumer |
Summary of automated approaches
| Paper | Focus | Health context | Dataset | Features | Model(s) |
|---|---|---|---|---|---|
| [ | Quality (Credibility) | Breast cancer | 780 Web pages from BCKOnline | 34 feature vectors obtained by 8 BCKOnline metadata elements, including title, description, creator, publisher, type, rights, subject, and audience | SMO, Naive Bayes, J48, IB1, ZeroR |
| [ | Reliability | General health | 360 Web pages | Link-based features, commercial features, PageRank features, presentation features, word features | SVM |
| [ | Quality | Shingles, flu and migraine | 246 Web pages | Linguistic and formal features | Multinomial naive Bayes, k-nearest neighbour, SVM, stochastic gradient descent, logistic regression, multilayer perceptron |
| [ | Veracity | General health | Health Stack Exchange (3,958 questions and 2,260 tags) question answering dataset | Word embedding features and sentiment features (i.e., polarity and subjectivity) | Neural networks and convolutional neural networks |
| [ | Quality | Breast cancer, arthritis, and depression | 269 Web pages | BERT and BioBERT features | Random forests and HEA neural networks |
| [ | Reliability | Vaccination | 259 texts classified as reliable and 183 as unreliable | Bag-of-words features | Naive Bayes and logistic regression |
| [ | Misinformation | Cancer and diabetes | 8,368 articles (fake: 2,084) | Word embedding features | Bidirectional GRU |
Fig. 1 The multi-layer architecture of Vec4Cred, illustrating several configurations of the model. In (a), only the Web page content and its DOM structure are considered; this information is employed in all model configurations. (a) + (b) is the configuration in which the URL of the Web page is also considered, as in the Web2Vec model [15]. (a) + (c) is the configuration proposed in [58], which considers the links present in the content of the target Web page. (a) + (b) + (c) is the configuration in which we add the URLs, in the form of domain names, present in the target Web page. The addition of (d) denotes the configuration that also considers the keywords extracted from the pages referred to by the links present in the target Web page. Finally, the last configuration, denoted by the addition of (e), considers the parts of speech of the target Web page content
Fig. 2 Example of the construction of the POS-level corpus
Fig. 3 Example of the construction of the word-level corpus from URLs in the target page
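The construction of a word-level corpus from the URLs in a target page could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the helper names (`LinkCollector`, `url_word_corpus`) are hypothetical, and the tokenization rule (splitting each link's domain name on dots and hyphens) is an assumption made only for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import re


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def url_word_corpus(html):
    """Return one list of lowercase domain-name tokens per absolute link.

    Assumption: tokens are obtained by splitting the host on '.' and '-'.
    """
    collector = LinkCollector()
    collector.feed(html)
    corpus = []
    for href in collector.hrefs:
        host = urlparse(href).hostname  # None for relative links
        if host:
            tokens = [t for t in re.split(r"[.\-]", host.lower()) if t]
            corpus.append(tokens)
    return corpus


# Hypothetical page fragment with one absolute and one relative link.
page = (
    '<p>See <a href="https://www.example-health.org/a">this</a> '
    'and <a href="/local">that</a>.</p>'
)
print(url_word_corpus(page))
```

Relative links carry no domain name, so this sketch skips them; a real pipeline might instead resolve them against the URL of the target page.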
Fig. 4 Example of the construction of the word-level corpus for keywords extracted from the linked page content in the target Web page
Fig. 5 Example of the (Web page content) word-level embedding phase
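A word-level embedding phase of this kind typically maps each token to an integer index, pads or truncates the sequence to a fixed length, and looks up one vector per index. The sketch below illustrates that general idea only. The function names, the `<pad>`/`<unk>` conventions, and the randomly initialized table are all assumptions; the table merely stands in for the embedding model pre-trained on medical data mentioned in the abstract.

```python
import random


def build_vocab(corpus):
    """Map each distinct token to an index; 0 is padding, 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for doc in corpus:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to indices, truncating or right-padding to max_len."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def embed(ids, table):
    """Look up one embedding vector per index."""
    return [table[i] for i in ids]


# Toy word-level corpus (hypothetical content, for illustration only).
corpus = [["vitamin", "c", "cures", "colds"], ["vaccines", "are", "safe"]]
vocab = build_vocab(corpus)

# Placeholder table: a zero vector for padding, random vectors elsewhere.
# A real system would load pre-trained vectors here instead.
rng = random.Random(0)
dim = 4
table = [[0.0] * dim] + [
    [rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(len(vocab) - 1)
]

ids = encode(["vitamin", "d", "cures"], vocab, max_len=5)
vectors = embed(ids, table)
print(ids)  # "d" is out of vocabulary, so it maps to the <unk> index
```

The resulting fixed-length sequence of vectors is the kind of input a downstream neural classifier would consume.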
Evaluation results
| Model | Metric | D1 | D2 | D3 | D3(BI) |
|---|---|---|---|---|---|
| Web2Vec (Baseline) | Accuracy | 80.34 | – | 72.31 | 71.32 ± 2.0 |
| | F1 | 86.80 | – | 73.16 | 71.88 ± 1.7 |
| | AUC | 69.84 | – | 71.34 | 70.77 ± 1.0 |
| Web2Vec+L (Baseline) | Accuracy | 81.11 | – | 72.56 | 72.23 ± 1.0 |
| | F1 | 86.78 | – | 73.10 | 71.98 ± 1.5 |
| | AUC | 78.44 | – | 71.11 | 72.00 ± 1.0 |
| GoodIT (Baseline) | Accuracy | 89.89 | 98.32 | 74.12 | 72.70 ± 2.0 |
| | F1 | 93.78 | 97.01 | 76.61 | 75.00 ± 2.1 |
| | AUC | 85.69 | 97.71 | 74.56 | 75.40 ± 1.0 |
| Vec4Cred ( | Accuracy | | | | 79.33 ± 2.0 |
| | F1 | | | | 79.01 ± 1.0 |
| | AUC | | | | 78.19 ± 1.0 |
| Vec4Cred ( | Accuracy | | | | 81.00 ± 1.0 |
| | F1 | | | | 81.00 ± 1.8 |
| | AUC | | | | 79.00 ± 3.4 |
| Vec4Cred ( | Accuracy | | | | 82.00 ± 1.3 |
| | F1 | | | | 82.00 ± 1.2 |
| | AUC | | | | 81.00 ± 1.0 |
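For readers who want to reproduce this style of reporting, the three metrics can be computed from binary labels and classifier scores as in the sketch below. It is a generic illustration with hypothetical toy data, not the evaluation code behind the table. In particular, formatting repeated runs as mean ± sample standard deviation is an assumption, since the record does not state what the ± values denote.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win (pairwise formulation of ROC AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(values):
    """Format percentages from repeated runs as 'mean ± sample std'."""
    return f"{mean(values):.2f} ± {stdev(values):.1f}"


# Hypothetical toy labels, hard predictions, and scores.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.2, 0.3, 0.8, 0.6]

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores))
print(summarize([80.0, 82.0, 81.0]))
```

The pairwise AUC is quadratic in the number of examples, which is fine for small test sets; a rank-based formulation scales better for large ones.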