Bissan Audeh1, Michel Beigbeder1, Antoine Zimmermann1, Philippe Jaillon2, Cédric Bousquet3.
Abstract
The extraction of information from social media is an essential yet complicated step for data analysis in multiple domains. In this paper, we present Vigi4Med Scraper, a generic open source framework for extracting structured data from web forums. Our framework is highly configurable; using a configuration file, the user can freely choose the data to extract from any web forum. The extracted data are anonymized and represented in a semantic structure using Resource Description Framework (RDF) graphs. This representation enables efficient manipulation by data analysis algorithms and allows the collected data to be directly linked to any existing semantic resource. To avoid server overload, an integrated proxy with caching functionality imposes a minimal delay between sequential requests. Vigi4Med Scraper represents the first step of Vigi4Med, a project to detect adverse drug reactions (ADRs) from social networks funded by the French drug safety agency Agence Nationale de Sécurité du Médicament (ANSM). Vigi4Med Scraper has successfully extracted more than 200 gigabytes of data from the web forums of over 20 different websites.
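The caching and minimal-delay behaviour described in the abstract can be illustrated with a minimal sketch. This is not the Vigi4Med Scraper implementation: the class name `PoliteCachingFetcher`, its parameters, and the injected `fetch`, `clock`, and `sleep` callables are all hypothetical, chosen only to show the idea of answering repeated requests from a cache and spacing out the requests that actually reach the server.

```python
import time
from typing import Callable, Dict, Optional


class PoliteCachingFetcher:
    """Hypothetical sketch: cache pages by URL and enforce a minimum
    delay between requests that actually reach the remote server."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        min_delay: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch          # function that really downloads a URL
        self._min_delay = min_delay  # seconds between two real requests
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # A cached page costs the server nothing, so no delay is needed.
        if url in self._cache:
            return self._cache[url]
        # Otherwise wait until min_delay has elapsed since the last real request.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

In practice, `fetch` could be any function that downloads a page and returns its text. Injecting the clock and sleep functions is a design choice made here so the delay logic can be exercised without real waiting or network access.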
Year: 2017 PMID: 28122056 PMCID: PMC5266266 DOI: 10.1371/journal.pone.0169658
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Vigi4Med Scraper structure.
Fig 2. Web forums structure.
Fig 3. An example of threads within a forum page.
Fig 4. An example of messages within a thread page.
Fig 5. Semantic data representation.
Fig 6. An example of the generated RDF graph.
Fig 7. Anonymization.
Fig 8. Scraped sites results.
Comparison of Vigi4Med Scraper and other similar systems.
| Approach | Efficiency | Page flipping | Data object detection | Conceptual representation | Privacy | Availability |
|---|---|---|---|---|---|---|
| Muslea et al. [ | ✕ | ✕ | ✔ | ✕ | ✕ | ✕ |
| Crescenzi et al. [ | ✕ | ✕ | ✔ | ✕ | ✕ | ✕ |
| Guo et al. [ | ✔ | ✕ | ✕ | ✕ | ✕ | ✕ |
| Cai et al. [ | ✔ | ✕ | ✕ | ✕ | ✕ | ✕ |
| Wang et al. [ | ✔ | ✔ | ✕ | ✕ | ✕ | ✕ |
| Yang et al. [ | ✔ | ✔ | ✔ | ✕ | ✕ | ✕ |
| Jiang et al. [ | ✔ | ✔ | ✔ | ✕ | ✕ | ✕ |
| Vigi4Med Scraper | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |