A primer on theory-driven web scraping: Automatic extraction of big data from the Internet for use in psychological research.

Richard N Landers, Robert C Brusso, Katelyn J Cavanaugh, Andrew B Collmus.

Abstract

The term big data encompasses a wide range of approaches to collecting and analyzing data in ways that were not possible before the era of modern personal computing. One approach to big data of great potential to psychologists is web scraping, which involves the automated collection of information from webpages. Although web scraping can create massive datasets with tens of thousands of variables, it can also be used to create modestly sized, more manageable datasets with tens of variables but hundreds of thousands of cases in a matter of hours, well within the skillset of most psychologists to analyze. In this article, we demystify web scraping methods as currently used to examine research questions of interest to psychologists. First, we introduce an approach called theory-driven web scraping in which the choice to use web-based big data must follow substantive theory. Second, we introduce data source theories, a term used to describe the assumptions a researcher must make about a prospective big data source in order to meaningfully scrape data from it. Critically, researchers must derive specific hypotheses to be tested based upon their data source theory, and if these hypotheses are not empirically supported, plans to use that data source should be changed or eliminated. Third, we provide a case study and sample code in Python demonstrating how web scraping can be conducted to collect big data along with links to a web tutorial designed for psychologists. Fourth, we describe a 4-step process to be followed in web scraping projects. Fifth and finally, we discuss legal, practical and ethical concerns faced when conducting web scraping projects. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
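As a rough illustration of what "automated collection of information from webpages" can look like in Python, the sketch below pulls one text case per matching element out of an HTML string. It is not the article's sample code: the markup (`<p class="post">`), the `PostExtractor` class, the `scrape_posts` function, and the `SAMPLE_PAGE` string are all hypothetical, and the page is embedded inline rather than downloaded so that the example runs without network access.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <p class="post"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []        # one string per extracted case
        self._in_post = False  # True while inside a matching <p>

    def handle_starttag(self, tag, attrs):
        # Start a new case only for paragraphs carrying the assumed class.
        if tag == "p" and dict(attrs).get("class") == "post":
            self._in_post = True
            self.posts.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_post = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is still part of the post.
        if self._in_post:
            self.posts[-1] += data


def scrape_posts(html):
    """Return the stripped text of each post found in an HTML string."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return [post.strip() for post in parser.posts]


# Hypothetical page content standing in for a downloaded webpage.
SAMPLE_PAGE = """
<html><body>
  <p class="post">First forum post.</p>
  <p class="ad">Sponsored link</p>
  <p class="post">Second <b>forum</b> post.</p>
</body></html>
"""

posts = scrape_posts(SAMPLE_PAGE)
print(posts)
```

In a real project the HTML would come from a live request (for example via the standard library's `urllib.request.urlopen`), and the choice of which elements to extract would follow from the researcher's data source theory rather than from a fixed class name as assumed here.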

Year:  2016        PMID: 27213980     DOI: 10.1037/met0000081

Source DB:  PubMed          Journal:  Psychol Methods        ISSN: 1082-989X


  7 in total

1.  A computational reward learning account of social media engagement.

Authors:  Björn Lindström; Martin Bellander; David T Schultner; Allen Chang; Philippe N Tobler; David M Amodio
Journal:  Nat Commun       Date:  2021-02-26       Impact factor: 14.919

2.  Temporal trends in incidence of time-loss injuries in four male professional North American sports over 13 seasons.

Authors:  Garrett S Bullock; Elizabeth Murray; Jake Vaughan; Stefan Kluzek
Journal:  Sci Rep       Date:  2021-04-15       Impact factor: 4.379

3.  Analysis of mental and physical disorders associated with COVID-19 in online health forums: a natural language processing study.

Authors:  Rashmi Patel; Fabrizio Smeraldi; Maryam Abdollahyan; Jessica Irving; Conrad Bessant
Journal:  BMJ Open       Date:  2021-11-05       Impact factor: 2.692

4.  Visualization analysis of junior school students' pubertal timing and social adaptability using data mining approaches.

Authors:  Youzhong Ma; Ruiling Zhang; Yongxin Zhang
Journal:  Heliyon       Date:  2022-08-28

5.  Big data hurdles in precision medicine and precision public health.

Authors:  Mattia Prosperi; Jae S Min; Jiang Bian; François Modave
Journal:  BMC Med Inform Decis Mak       Date:  2018-12-29       Impact factor: 2.796

6.  Characterization of Rookie Season Injury and Illness and Career Longevity Among National Basketball Association Players.

Authors:  Chelsea L Martin; Amelia J H Arundale; Stefan Kluzek; Tyler Ferguson; Gary S Collins; Garrett S Bullock
Journal:  JAMA Netw Open       Date:  2021-10-01

7.  Temporal Trends and Severity in Injury and Illness Incidence in the National Basketball Association Over 11 Seasons.

Authors:  Garrett S Bullock; Tyler Ferguson; Jake Vaughan; Desiree Gillespie; Gary Collins; Stefan Kluzek
Journal:  Orthop J Sports Med       Date:  2021-06-14