Richard Zowalla, Thomas Wetter, Daniel Pfeifer.
Abstract
BACKGROUND: The internet has become an increasingly important resource for health information. However, with a growing number of web pages, it is nearly impossible for humans to manually keep track of evolving and continuously changing content in the health domain. To better understand the nature of all web-based health information in a specific language, it is important to identify (1) information hubs for the health domain, (2) content providers of high prestige, and (3) important topics and trends in the health-related web. In this context, an automatic web crawling approach can provide the necessary data for a computational and statistical analysis to answer questions (1) to (3).
Keywords: distributed system; health information; internet; web crawling
Year: 2020 PMID: 32706701 PMCID: PMC7414401 DOI: 10.2196/17853
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Figure 1. Schematic representation of the web graph traversal by a crawler. Pages colored in blue represent processed pages; in green, pages referenced in the frontier; in gray, undiscovered web content. Pages in dashed blue represent so-called initial seed pages.
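The traversal shown in Figure 1 can be sketched in a few lines. This is a minimal illustration only, assuming a breadth-first visiting order and a toy in-memory link graph; the names `crawl`, `get_links`, and `toy_web` are hypothetical and do not come from the article, and a real crawler would fetch pages over HTTP instead.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first traversal of a web graph starting from seed pages."""
    frontier = deque(seeds)   # discovered but not yet processed (green)
    seen = set(seeds)         # every page ever placed in the frontier
    processed = []            # pages already handled (blue), in visit order
    while frontier and len(processed) < max_pages:
        url = frontier.popleft()
        processed.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return processed, list(frontier)

# Hypothetical toy graph standing in for real HTTP fetches.
toy_web = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["seed"],
}
processed, frontier = crawl(["seed"], lambda u: toy_web.get(u, []), max_pages=3)
```

After three pages are processed, the pages still waiting in `frontier` correspond to the green nodes in the figure, and anything not yet linked from a processed page remains undiscovered.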
Figure 2. Architecture of a focused crawler based on the StormCrawler software development kit. Spouts (tap symbol) emit data (here: URLs); bolts (lightning symbol) process data (ie, fetch, parse, and store the extracted content). Bolts can be enhanced with URL filters (white filter symbol) or parse filters (black filter symbol). URL filters are used to remove URLs based on predefined criteria. Parse filters include URL filters but are primarily used to clean the parsed content and compute topic relevance and priority.
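The idea of chaining URL filters can be illustrated independently of StormCrawler (which is a Java toolkit). The sketch below is written in Python purely for illustration and does not reproduce the StormCrawler API; the function names, the suffix list, and the length limit are all assumptions. Each filter either returns the URL or returns `None` to drop it.

```python
from urllib.parse import urlparse

def host_suffix_filter(allowed_suffixes):
    """Keep only URLs whose host ends with one of the given suffixes."""
    suffixes = tuple(allowed_suffixes)
    def _filter(url):
        host = urlparse(url).hostname or ""
        return url if host.endswith(suffixes) else None
    return _filter

def max_length_filter(limit):
    """Drop URLs longer than `limit` characters."""
    def _filter(url):
        return url if len(url) <= limit else None
    return _filter

def apply_filters(url, filters):
    """Run the URL through each filter in order; stop at the first rejection."""
    for url_filter in filters:
        url = url_filter(url)
        if url is None:
            return None
    return url

# Hypothetical configuration: restrict a crawl to .de, .at, and .ch hosts.
filters = [host_suffix_filter([".de", ".at", ".ch"]), max_length_filter(200)]
kept = apply_filters("https://www.example.de/page", filters)
dropped = apply_filters("https://www.example.com/page", filters)
```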
Figure 3. Relationship between target, relevant, and crawled web pages. Recall is estimated based on known relevant target pages and an underlying independence assumption.
Figure 4. Recall estimate equation.
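The equation itself is not reproduced in this record. As a hedged illustration of the idea described for Figure 3, one common estimator takes a set of pages already known to be relevant and measures the fraction of them that the crawl reached; under the independence assumption, that fraction serves as an estimate of recall over all relevant pages. The function name and the toy URLs below are hypothetical, and this may differ from the authors' exact formula.

```python
def estimate_recall(crawled_urls, known_relevant_urls):
    """Fraction of known relevant pages that appear among the crawled pages."""
    known = set(known_relevant_urls)
    if not known:
        raise ValueError("need at least one known relevant page")
    found = known & set(crawled_urls)
    return len(found) / len(known)

# Hypothetical example: 2 of the 4 known relevant pages were crawled.
recall_estimate = estimate_recall(
    crawled_urls=["a", "b", "c", "x"],
    known_relevant_urls=["a", "b", "d", "e"],
)
```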
Figure 5. Workflow of a support vector machine–based text classification system: black lines indicate the training process; blue lines indicate the classification process; slanted boxes represent data; rectangular boxes represent computational steps.
Figure 6. Workflow of the crowd-sourcing approach to build a test corpus for the purpose of classifier evaluation. Black lines indicate the assessment process; slanted boxes represent data; rectangular boxes represent processing steps.
Total number of acquired articles and respective class labels of various German content providers.
| Content provider | Class | Certified [a] | Organization | Articles | Words (mean) | Words (median) | Sentences (mean) |
| Wikipedia Health | H [b] | no | Wikimedia Foundation | 28,436 | 429 | 254 | 31 |
| Wikipedia General | G [c] | no | Wikimedia Foundation | 18,364 | 736 | 266 | 26 |
| Common Crawl | G | no | Common Crawl Foundation | 36,297 | 480 | 429 | 33 |
| Deutsches Ärzteblatt | H | no | German Medical Association, National Association of Statutory Health Insurance Physicians | 9638 | 1852 | 520 | 136 |
| Onmeda | H | yes | Gofeminin.de GmbH | 636 | 6564 | 6113 | 439 |
| gesundheitsinformation.de | H | yes | Institute for Quality and Efficiency in Healthcare | 235 | 1923 | 1799 | 139 |
| Apotheken Umschau | H | yes | Wort & Bild Verlag | 1907 | 1052 | 658 | 73 |
| GESUNDheit.gv.at | H | no | Ministry of Social Affairs (Austria) | 2929 | 295 | 221 | 21 |
| Total | — [d] | — | — | 98,442 | 741 | 339 | 44 |
[a] Yes indicates that a provider is certified by the Health On The Net Foundation Code of Conduct or another certification provider.
[b] H: health-related language.
[c] G: general language.
[d] Not applicable.
Total number of articles used in the training and test corpus per content provider with corresponding class labels: health-related language (H) and general language (G).
| Content provider | Class | Training documents | Test documents | Total documents |
| Wikipedia | H [a] | 22,748 | 5688 | 28,436 |
| Wikipedia | G [b] | 10,339 | 2585 | 12,924 |
| Common Crawl | G | 24,685 | 6172 | 30,857 |
| Deutsches Ärzteblatt | H | 7710 | 1928 | 9638 |
| Onmeda | H | 509 | 127 | 636 |
| gesundheitsinformation.de | H | 189 | 46 | 235 |
| Apotheken Umschau | H | 1525 | 382 | 1907 |
| GESUNDheit.gv.at | H | 2343 | 586 | 2929 |
| Total | — [c] | 70,048 | 17,514 | 87,562 |
[a] H: health-related language.
[b] G: general language.
[c] Not applicable.
Confusion matrices and related evaluation metrics for the test and crowd-validated data sets.
| Evaluation data set | SVM [a] | Baseline: Health | Baseline: General | Sum | Accuracy | Precision | Recall |
| Test data set | — [b] | — | — | — | 0.937 | 0.934 | 0.94 |
| | Health | 8182 | 575 | 8757 | — | — | — |
| | General | 522 | 8235 | 8757 | — | — | — |
| | Sum | 8704 | 8810 | 17,514 | — | — | — |
| Crowd-validated data set | — | — | — | — | 0.966 | 0.954 | 0.989 |
| | Health | 181 | 11 | 192 | — | — | — |
| | General | 2 | 190 | 192 | — | — | — |
| | Sum | 183 | 211 | 384 | — | — | — |
[a] SVM: support vector machine.
[b] Not applicable.
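The reported metrics for the 17,514-document test set can be recomputed from the four confusion-matrix cells. The sketch below assumes that rows are SVM predictions, columns are baseline labels, and Health is the positive class; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),   # share of predicted Health that is Health
        "recall": tp / (tp + fn),      # share of actual Health that was found
    }

# Cells taken from the test data set rows of the table above.
test_set = binary_metrics(tp=8182, fp=575, fn=522, tn=8235)
rounded = {name: round(value, 3) for name, value in test_set.items()}
```

Under these assumptions the rounded values agree with the accuracy, precision, and recall listed for the test data set.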
Figure 7. Harvest rate over time, measured at the end of each day (the dashed line represents the mean harvest rate). The drop at day 53 was caused by an outage at our data center. Peak at day 106: the Storm cluster was extended by three additional virtual machines. Peaks at days 157, 158, 191, 194, and 222: the crawl was resumed after infrastructure maintenance for urgent security updates that required a restart of the host system and/or the virtual machines.
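This record does not define the harvest rate. In the focused-crawling literature it usually denotes the share of fetched pages that are judged relevant, and the sketch below assumes that definition. The daily counts are invented for illustration and are not data from the study.

```python
def harvest_rate(relevant_pages, fetched_pages):
    """Share of fetched pages that were classified as relevant."""
    if fetched_pages == 0:
        return 0.0
    return relevant_pages / fetched_pages

# Hypothetical (relevant, fetched) counts at the end of two crawl days.
daily_counts = [(400, 1000), (500, 2000)]
daily_rates = [harvest_rate(r, f) for r, f in daily_counts]
mean_rate = sum(daily_rates) / len(daily_rates)   # the dashed line in such a plot
```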
Domains of the 25 top-ranked websites under the country-code top-level domain .de, ranked by PageRank, with their respective publishers.
| Rank | Domain | Publisher | Type |
| 1 | www.rki.de | Robert Koch Institute | PI [a] |
| 2 | www.aerzteblatt.de | Deutscher Ärzte-Verlag GmbH | PI |
| 3 | www.charite.de | Charité–Berlin University of Medicine | PI |
| 4 | www.deutsche-alzheimer.de | Deutsche Alzheimer Gesellschaft | NPO [b] |
| 5 | www.aerztezeitung.de | Springer Medizin Verlag GmbH | PO [c] |
| 6 | www.dge.de | Deutsche Gesellschaft für Ernährung | NPO |
| 7 | www.g-ba.de | Gemeinsamer Bundesausschuss (Federal Joint Committee) | PI |
| 8 | www.bzga.de | Bundeszentrale für gesundheitliche Aufklärung (Federal Centre for Health Education) | PI |
| 9 | www.bundesgesundheitsministerium.de | Bundesministerium für Gesundheit (Federal Ministry of Health) | PI |
| 10 | www.apotheken-umschau.de | Wort & Bild Verlag | PO |
| 11 | www.dimdi.de | Deutsches Institut für Medizinische Dokumentation und Information (German Institute for Medical Documentation and Information) | PI |
| 12 | www.gesundheitsinformation.de | Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen (Institute for Quality and Efficiency in Healthcare) | PI |
| 13 | www.osteopathie.de | Verband der Osteopathen Deutschland eV | NPO |
| 14 | www.krebsgesellschaft.de | Deutsche Krebsgesellschaft eV | NPO |
| 15 | www.bfarm.de | Bundesinstitut für Arzneimittel und Medizinprodukte (Federal Institute for Drugs and Medical Devices) | PI |
| 16 | www.kbv.de | Kassenärztliche Bundesvereinigung | PI |
| 17 | www.krebshilfe.de | Stiftung Deutsche Krebshilfe | NPO |
| 18 | www.tk.de | Techniker Krankenkasse (Health Insurance) | PO |
| 19 | www.ebm-netzwerk.de | Deutsches Netzwerk Evidenzbasierte Medizin eV | NPO |
| 20 | www.bmg.bund.de | Bundesministerium für Gesundheit (Federal Ministry of Health) | PI |
| 21 | www.netdoktor.de | NetDoktor.de GmbH | PO |
| 22 | www.drk.de | Deutsches Rotes Kreuz eV (German Red Cross) | NPO |
| 23 | www.herzstiftung.de | Deutsche Herzstiftung | NPO |
| 24 | www.klinikum.uni-heidelberg.de | Universitätsklinikum Heidelberg | PI |
| 25 | www.aok.de | AOK Gesundheitskasse (Health Insurance) | PO |
[a] PI: public institution.
[b] NPO: nonprofit organization.
[c] PO: private organization.
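Rankings like the one above rely on PageRank scores computed over a link graph. The following is a minimal power-iteration sketch, assuming a damping factor of 0.85 and an even redistribution of rank from hosts without outgoing links; the host names and links are invented placeholders, not crawl data, and the authors' actual computation may differ.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; `graph` maps each host to the hosts it links to.

    Every link target must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling host: spread its rank evenly over all hosts.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical host graph: three hosts link to a central hub.
toy_graph = {
    "hub.example": [],
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["hub.example", "a.example"],
}
ranks = pagerank(toy_graph)
top_host = max(ranks, key=ranks.get)
```

Sorting hosts by their score in descending order then yields a ranking of the kind shown in the table.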
Domains of the 25 top-ranked websites under the country-code top-level domain .at, ranked by PageRank, with their respective publishers.
| Rank | Domain | Publisher | Type |
| 1 | www.gesundheit.gv.at | Bundesministerium für Arbeit, Soziales, Gesundheit und Konsumentenschutz (Ministry of Social Affairs) | PI [a] |
| 2 | www.meduniwien.ac.at | University of Vienna | PI |
| 3 | www.bmgf.gv.at | Bundesministerium für Arbeit, Soziales, Gesundheit und Konsumentenschutz (Ministry of Social Affairs) | PI |
| 4 | www.sozialministerium.at | Bundesministerium für Arbeit, Soziales, Gesundheit und Konsumentenschutz (Ministry of Social Affairs) | PI |
| 5 | www.apotheker.or.at | Österreichische Apothekenkammer (Austrian Pharmaceutical Association) | PI |
| 6 | www.sam-pharma.at | Pharma Handel GmbH | PO [b] |
| 7 | www.aerztekammer.at | Österreichische Ärztekammer (Austrian Medical Association) | PI |
| 8 | www.univie.ac.at | University of Vienna | PI |
| 9 | www.herz-ambulatorium.at | Individual Person | PO |
| 10 | www.herz-ordination.at | Individual Person | PO |
| 11 | www.tg-steiermark.at | TG Therapeutische Gemeinschaft Betriebs GmbH | NPO [c] |
| 12 | www.impuls-fs.at | Institut für medizinisch-physiotherapeutische Untersuchung, Lehre und Schulung | PO |
| 13 | www.medunigraz.at | University of Graz | PI |
| 14 | www.brustvergroesserung-leicht.at | Individual Person | PO |
| 15 | www.bmg.gv.at | Bundesministerium für Arbeit, Soziales, Gesundheit und Konsumentenschutz (Ministry of Social Affairs) | PI |
| 16 | www.kages.at | Steiermärkische Krankenanstaltengesellschaft mbH | PO |
| 17 | science.orf.at | Österreichischer Rundfunk (Austrian Broadcasting Corporation) | PI |
| 18 | www.gynmed.at | Individual Person | PO |
| 19 | www.fhstp.ac.at | St. Pölten University of Applied Sciences | PI |
| 20 | www.dr-boehm.at | Individual Person | PO |
| 21 | bmg.gv.at | Bundesministerium für Arbeit, Soziales, Gesundheit und Konsumentenschutz (Ministry of Social Affairs) | PI |
| 22 | www.novartis.at | Novartis AG | PO |
| 23 | www.babyforum.at | FOKUS KIND Medien, CRAFT & VALUE | PO |
| 24 | femmestyle.at | Schönheitschirurgie femmestyle | PO |
| 25 | www.pfizer.at | Pfizer Inc | PO |
[a] PI: public institution.
[b] PO: private organization.
[c] NPO: nonprofit organization.
Domains of the 25 top-ranked websites under the country-code top-level domain .ch, ranked by PageRank, with their respective publishers.
| Rank | Domain | Publisher | Type |
| 1 | www.uzh.ch | University of Zurich | PI [a] |
| 2 | www.usz.ch | Universitätsspital Zürich | PI |
| 3 | www.srf.ch | Schweizerische Radio- und Fernsehgesellschaft (Swiss Broadcasting Corporation) | PI |
| 4 | www.netdoktor.ch | netdoktor GmbH | PO [b] |
| 5 | www.pancreas-help.ch | Schweizer Selbsthilfeorganisation Pankreaserkrankungen | NPO [c] |
| 6 | www.mutterglueck.ch | Individual Person | PO |
| 7 | www.association-osteo-swiss.ch | Schweizerischer Verband der Osteopathen | NPO |
| 8 | www.unibas.ch | University of Basel | PI |
| 9 | www.ethz.ch | ETH Zurich (Swiss Federal Institute of Technology in Zurich) | PI |
| 10 | www.rheumaliga.ch | Rheumaliga Schweiz | NPO |
| 11 | www.lungenliga.ch | Lungenliga Schweiz | NPO |
| 12 | www.rotpunkt-apotheken.ch | Rotpunkt-Pharma AG | PO |
| 13 | www.pharmawiki.ch | PharmaWiki GmbH | PO |
| 14 | www.bayer.ch | Bayer AG | PO |
| 15 | www.patientensicherheit.ch | Stiftung Patientensicherheit Schweiz | NPO |
| 16 | saez.ch | EMH Schweizerischer Ärzteverlag AG | NPO |
| 17 | www.swissheart.ch | Schweizerische Herzstiftung | NPO |
| 18 | gesundheitsfoerderung.ch | Gesundheitsförderung Schweiz | NPO |
| 19 | sensomotorische-lebensweisen.ch | Individual Person | PO |
| 20 | www.spitaluster.ch | Spital Uster | PO |
| 21 | symptome.ch | NOXA GmbH | PO |
| 22 | www.meineimpfungen.ch | Stiftung meineimpfungen | NPO |
| 23 | unicef.ch | United Nations International Children's Emergency Fund | NPO |
| 24 | www.bauchtumor.ch | Universitätsspital Bern | PI |
| 25 | www.fettabsaugungen.ch | FSnD Ltd | PO |
[a] PI: public institution.
[b] PO: private organization.
[c] NPO: nonprofit organization.
Figure 8. A small extract of the host-aggregated web graph with a focus on the website www.rki.de. The surrounding nodes represent websites within a maximum link distance of two from www.rki.de. An edge between two nodes implies that at least one hyperlink exists, in either direction, between web pages of the two websites. Only websites are included whose content is highly health-related (ie, which were automatically classified as belonging to H with a probability of at least 0.93) and that have at least one incoming and one outgoing link. The bigger a node and its caption, the higher its PageRank. For clarity, directional arrows are not shown.
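The selection rules in the Figure 8 caption can be sketched as a bounded breadth-first search over the host graph. The code below is an illustration under stated assumptions: links are treated as undirected for measuring distance, hosts failing the filters are not traversed through, and the function name, host names, and probabilities are all hypothetical.

```python
from collections import deque

def neighborhood(edges, center, health_prob, max_distance=2, min_prob=0.93):
    """Hosts within `max_distance` links of `center` that pass the filters.

    `edges` is an iterable of directed (source_host, target_host) links.
    Returns a dict mapping each selected host to its link distance.
    """
    outgoing, incoming, undirected = {}, {}, {}
    for src, dst in edges:
        outgoing.setdefault(src, set()).add(dst)
        incoming.setdefault(dst, set()).add(src)
        undirected.setdefault(src, set()).add(dst)
        undirected.setdefault(dst, set()).add(src)

    def keep(host):
        # Highly health-related, with at least one incoming and one outgoing link.
        return (health_prob.get(host, 0.0) >= min_prob
                and host in outgoing and host in incoming)

    distance = {center: 0}
    queue = deque([center])
    while queue:
        host = queue.popleft()
        if distance[host] == max_distance:
            continue  # do not expand beyond the distance limit
        for nxt in undirected.get(host, ()):
            if nxt not in distance and keep(nxt):
                distance[nxt] = distance[host] + 1
                queue.append(nxt)
    return distance

# Hypothetical host links and classifier probabilities.
edges = [("center", "a"), ("a", "center"), ("a", "b"), ("b", "a"),
         ("b", "c"), ("c", "b"), ("center", "d"), ("e", "center")]
probs = {"center": 0.99, "a": 0.95, "b": 0.97, "c": 0.99, "d": 0.99, "e": 0.5}
selected = neighborhood(edges, "center", probs)
```

In this toy input, host `c` is excluded because it is three links away, `d` because it has no outgoing link, and `e` because its probability is below the threshold.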