Ka-Shing Cheung, Wai K Leung, Wai-Kay Seto.
Abstract
Big Data, which are characterized by certain unique traits like volume, velocity and variety, have revolutionized the research of multiple fields including medicine. Big Data in health care are defined as large datasets that are collected routinely or automatically, and stored electronically. With the rapidly expanding volume of health data collection, it is envisioned that the Big Data approach can improve not only individual health, but also the performance of health care systems. The application of Big Data analysis in the field of gastroenterology and hepatology research has also opened new research approaches. While it retains most of the advantages and avoids some of the disadvantages of traditional observational studies (case-control and prospective cohort studies), it allows for phenomapping of disease heterogeneity, enhancement of drug safety, as well as development of precision medicine, prediction models and personalized treatment. Unlike randomized controlled trials, it reflects the real-world situation and studies patients who are often under-represented in randomized controlled trials. However, residual and/or unmeasured confounding remains a major concern, which requires meticulous study design and various statistical adjustment methods. Other potential drawbacks include data validity, missing data, incomplete data capture due to the unavailability of diagnosis codes for certain clinical situations, and individual privacy. With continuous technological advances, some of the current limitations with Big Data may be further minimized. This review will illustrate the use of Big Data research on gastrointestinal and liver diseases using recently published examples.
Keywords: Colorectal cancer; Epidemiology; Gastric cancer; Gastrointestinal bleeding; Healthcare dataset; Hepatocellular carcinoma; Inflammatory bowel disease
Year: 2019 PMID: 31293336 PMCID: PMC6603810 DOI: 10.3748/wjg.v25.i24.2990
Source DB: PubMed Journal: World J Gastroenterol ISSN: 1007-9327 Impact factor: 5.742
Advantages and shortcomings of Big Data analysis (with proposed solutions)
| Advantages | |
| Clinical data readily available with minimal resources required | |
| Can study rare exposures | |
| Can study rare events | |
| Can study long-term effects | |
| Real-world data | |
| Large sample size | |
| Subgroup analysis | |
| Sensitivity analysis | |
| Interaction of different variables | |
| Adjustment of outcome to a multitude of risk factors | |
| Precise estimation of effect size | |
| Reliable capture of small variations in incidence or disease flare | |
| No selection bias if | |
| Shortcomings | Proposed solutions |
| Data validity | Cross reference with medical records in a subset of the sample |
| Missing data | Statistical methods to deal with missing data, |
| | Text mining or natural language processing of unstructured data |
| Incomplete capture of variables or unavailability of certain diagnosis codes | Surrogate markers ( |
| | Inclusion of a large set of measured variables |
| | Text mining or natural language processing of unstructured data |
| Privacy | De-identification of individuals |
| | Review of study plan by local ethics committee |
| Hypothesis-free predictive models | Validation in prospective studies or randomized controlled trials |
| Residual and/or unmeasured confounding | Inclusion of a large set of measured variables |
| | Inclusion of RCT datasets with extensive collection of data and outcomes for trial participants or linkage with other data sources |
| | Fulfilment of Bradford Hill criteria |
| Reverse causality/protopathic bias (outcome of interest leads to exposure of interest) | Cohort study design instead of case-control study design |
| | Excluding prescriptions of drugs of interest ( |
| Example: Early symptoms of undiagnosed GC lead to PPI use, rather than PPIs causing GC | |
| Selection bias | Encompassing entire study population ( |
| Indication bias (or confounding by indication/disease severity) | Balance of patient characteristics, in particular comorbidities that are indications for a certain treatment ( |
| | Negative control exposure |
| Confounding by functional status and cognitive impairment | Balance of patient characteristics, in particular comorbidities that can affect functional and cognitive status ( |
| Healthy user bias / adherer bias (individuals who are more health conscious tend to have better health outcomes) | Adjustment for other lifestyle factors – text mining or natural language processing of unstructured data |
| Immortal time bias (arises when the study outcome cannot occur during a period of follow-up due to study design) | Landmark analysis |
| | Analysis using time varying covariates |
| Ascertainment bias / surveillance bias / detection bias (differential degree of surveillance or screening for the outcome among exposed and unexposed individuals) Example: PPI users may undergo upper endoscopy more frequently than non-PPI users, and hence more GC is detected in PPI users | Selection of an unexposed group with a similar likelihood of screening/testing |
| | Selection of an outcome that is likely to be diagnosed equally in exposed and control groups |
| | Adjustment for the surveillance rate |
| Access to healthcare | Stratified analysis according to patients’ residential regions ( |
| Selective prescription and treatment in frail and very sick patients | PS methodology (trimming of areas of non-overlap, PS matching, PS by treatment interaction) |
COPD: Chronic obstructive pulmonary disease; RCT: Randomized controlled trial; GC: Gastric cancer; PPI: Proton pump inhibitor; PS: Propensity score.
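The table above lists landmark analysis as one remedy for immortal time bias. A minimal sketch of the idea follows, assuming a hypothetical `Patient` record with times measured in days since cohort entry; the field names and the 180-day landmark in the usage note are illustrative and do not come from any of the cited studies. Exposure is classified only from information available at the landmark, and follow-up is restarted there, so no exposed patient is credited with event-free time accrued before treatment began.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    # Hypothetical record; all times are days since cohort entry.
    follow_up_days: int                # time to outcome or censoring
    had_outcome: bool                  # True if the outcome occurred
    first_exposure_day: Optional[int]  # None if never exposed


def landmark_cohort(patients, landmark_day):
    """Classify exposure at a fixed landmark and restart follow-up there."""
    rows = []
    for p in patients:
        # Patients whose follow-up ended on or before the landmark are excluded.
        if p.follow_up_days <= landmark_day:
            continue
        # Exposure status is frozen at the landmark; later starters count as unexposed.
        exposed = (p.first_exposure_day is not None
                   and p.first_exposure_day <= landmark_day)
        rows.append({
            "exposed": exposed,
            "time_from_landmark": p.follow_up_days - landmark_day,
            "had_outcome": p.had_outcome,
        })
    return rows
```

Calling `landmark_cohort(patients, 180)` would return one row per patient still under follow-up at day 180, ready to pass to any survival model. The choice of landmark is a study design decision, and a sensitivity analysis over several landmarks is a common complement.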
Advantages of propensity score methodology
| Addressing “curse of dimensionality” when EPV < 10 | Traditional multivariable regression models yield similar results if EPV ≥ 10 |
| Recognition of subjects with absolute indications (or contraindications) of an intervention | Exclusion of areas of non-overlap of the PS distribution between exposed and unexposed groups to ensure comparability |
| Identification of PS interaction with treatment | Variation of effectiveness of an intervention according to indications (PS) may only be identified |
EPV: Events per variable; PS: Propensity score.
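Two quantities from the table above can be made concrete with a short sketch: the events-per-variable (EPV) ratio, and the exclusion of areas of non-overlap in the PS distribution. The function names are hypothetical, and the min/max rule used here for the common-support region is only one of several trimming conventions; the source does not state which rule the cited studies applied.

```python
def events_per_variable(n_events, n_covariates):
    """EPV = number of outcome events / number of covariates in the model."""
    if n_covariates <= 0:
        raise ValueError("n_covariates must be positive")
    return n_events / n_covariates


def trim_to_common_support(ps_exposed, ps_unexposed):
    """Keep only propensity scores inside the range shared by both groups.

    Assumed rule: the common-support region runs from the larger of the two
    group minima to the smaller of the two group maxima.
    """
    lower = max(min(ps_exposed), min(ps_unexposed))
    upper = min(max(ps_exposed), max(ps_unexposed))
    kept_exposed = [p for p in ps_exposed if lower <= p <= upper]
    kept_unexposed = [p for p in ps_unexposed if lower <= p <= upper]
    return (lower, upper), kept_exposed, kept_unexposed
```

With 60 outcome events and 12 candidate covariates, `events_per_variable(60, 12)` gives an EPV of 5, below the threshold of 10 named in the table, which is the situation in which a PS approach is suggested over a traditional multivariable model. Subjects trimmed by `trim_to_common_support` correspond to those with near-absolute indications or contraindications, who have no comparable counterpart in the other group.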
Examples of studies on gastric cancer research by utilization of large healthcare datasets
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | GC | 80255 | Nationwide retrospective cohort study | Early |
| Wu et al[ | |||||
| Comparison with general population to derive SIR | |||||
| Volume, Velocity and Variety | |||||
| GC | 52161 | Nationwide retrospective cohort study | Association between NSAIDs and GC | ||
| Wu et al[ | |||||
| Comparison with general population to derive SIR | |||||
| Volume, Velocity and Variety | |||||
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | GC | 63397 | Territory-wide retrospective cohort study | Association between PPIs and GC |
| Cheung et al[ | |||||
| PS regression adjustment | |||||
| Volume, Velocity and Variety | |||||
| GC | 63605 | Territory-wide retrospective cohort study | Association between aspirin and GC | ||
| Cheung et al[ | |||||
| PS regression adjustment | |||||
| Volume, Velocity and Variety | |||||
| GC | 63397 | Territory-wide retrospective cohort study | Effect of | ||
| Leung et al[ | |||||
| Comparison with general population to derive SIR | |||||
| Volume, Velocity and Variety | |||||
| GC | 7266 | Territory-wide retrospective cohort study | Association between metformin and GC | ||
| Cheung et al[ | |||||
| PS regression adjustment | |||||
| Sensitivity analysis: PS weighting by IPTW and PS matching | |||||
| Volume, Velocity and Variety | |||||
| Sweden | Swedish Cancer Registry | GC | 797067 | Nationwide retrospective cohort study | Association between PPIs and GC |
| Brusselaers et al[ | |||||
| Swedish Prescribed Drug Registry | Comparison with general population to derive SIR | ||||
| Volume, Velocity and Variety | |||||
| GC | 95176 | Nationwide retrospective cohort study | Effect of | ||
| Doorakkers et al[ | |||||
| Comparison with general population to derive SIR | |||||
| Volume, Velocity and Variety | |||||
| United States | Kaiser Permanente (KP) | GC | 61684 | Retrospective cohort study | Association between different PPIs and GC |
| Schneider et al[ | |||||
| Volume, Velocity and Variety | |||||
This list is not exhaustive, but serves to provide a few distinct examples of how Big Data analysis can generate high-quality research outputs in the field of gastroenterology and hepatology. 3V: Volume/velocity/variety; GC: Gastric cancer; SIR: Standardized incidence ratio; H. pylori: Helicobacter pylori; NSAIDs: Non-steroidal anti-inflammatory drugs; PS: Propensity score; PPIs: Proton pump inhibitors; IPTW: Inverse probability of treatment weighting.
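Two of the abbreviations in the footnote above, SIR and IPTW, reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the strata and rates in the usage note are invented numbers, and the IPTW weights shown are the basic unstabilized form, which may differ from what the cited studies used.

```python
def standardized_incidence_ratio(observed_cases, strata):
    """SIR = observed cases / expected cases.

    `strata` is a list of (person_years, reference_rate) pairs, one per
    age/sex stratum, where reference_rate is the general-population
    incidence per person-year in that stratum.
    """
    expected = sum(person_years * rate for person_years, rate in strata)
    if expected <= 0:
        raise ValueError("expected case count must be positive")
    return observed_cases / expected, expected


def iptw_weight(exposed, propensity_score):
    """Unstabilized inverse probability of treatment weight for one subject."""
    if not 0.0 < propensity_score < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    # Exposed subjects are weighted by 1/PS, unexposed subjects by 1/(1 - PS).
    return 1.0 / propensity_score if exposed else 1.0 / (1.0 - propensity_score)
```

For example, a cohort contributing 10000 person-years in a stratum with a reference rate of 0.0005 and 5000 person-years in a stratum with a rate of 0.002 has about 15 expected cases, so 30 observed cases would correspond to an SIR of about 2. An exposed subject with a PS of 0.25 receives an IPTW weight of 4, up-weighting patients who were treated despite a low predicted probability of treatment.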
Examples of studies on hepatocellular carcinoma research by utilization of large healthcare datasets
| Taiwan, China | Publicly available data on HCC-related genes | HCC | n.a. | Signature inversion study | Anti-cancer effects of chlorpromazine and trifluoperazine on HCC |
| Chen et al[ | |||||
| Volume, Velocity and Variety | |||||
| Connectivity Map (CMap) -- includes 6100 drug-mediated expression profiles | |||||
| Taiwan National Health Insurance Database (NHID) | HCC | 4569 | Nationwide retrospective cohort study | Association between NA therapy and HCC recurrence among patients with HBV-related HCC after liver resection | |
| Wu et al[ | |||||
| Volume, Velocity and Variety | |||||
| Taiwan National Health Insurance Database (NHID) | HCC | 292290 | Nationwide case-control study | Association between DM and HCC | |
| Chen et al[ | |||||
| Volume, Velocity and Variety | |||||
| Taiwan National Health Insurance Database (NHID) | HCC | 43190 | Nationwide retrospective cohort study | Association between NA therapy and HCC among CHB patients | |
| Wu et al[ | |||||
| PS matching | |||||
| Volume, Velocity and Variety | |||||
| China | The Cancer Genome Atlas (TCGA) database | HCC | n.a. | Signature inversion study | Anti-cancer effect of prenylamine on HCC |
| Wang et al[ | |||||
| Volume, Velocity and Variety | |||||
| Connectivity Map (CMap) | |||||
| South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | HCC | 24156 | Nationwide retrospective cohort study | Difference between tenofovir and entecavir on reducing HCC risk |
| Choi et al[ | |||||
| Volume, Velocity and Variety | |||||
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | HCC | Entire Hong Kong population between 1999 and 2012 | Territory-wide retrospective cohort study | Association between NA therapy and HCC among CHB patients |
| Seto et al[ | |||||
| Volume, Velocity and Variety | |||||
| Sweden | Swedish Cancer Registry | HCC | 9160 CHB patients | Nationwide retrospective cohort study | Association between concomitant HBV/HDV infection and HCC |
| Ji et al[ | |||||
| Swedish Patient Registry | |||||
| Comparison with general population to derive SIR | |||||
| Volume, Velocity and Variety | |||||
This list is not exhaustive, but serves to provide a few distinct examples of how Big Data analysis can generate high-quality research outputs in the field of gastroenterology and hepatology. 3V: Volume/velocity/variety; HCC: Hepatocellular carcinoma; NA: Nucleos(t)ide analogue; DM: Diabetes mellitus; PS: Propensity score; CHB: Chronic hepatitis B; SIR: Standardized incidence ratio; HDV: Hepatitis D virus.
Examples of studies on gastrointestinal bleeding and/or proton pump inhibitor research by utilization of large healthcare datasets
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | PUD | 403567 | Nationwide retrospective cohort study | Effect of |
| Wu et al[ | |||||
| Volume, Velocity and Variety | |||||
| PUD | 32235 | Nationwide retrospective cohort study | Risk of rebleeding from PUD in ESRD patients | ||
| Wu et al[ | |||||
| Volume, Velocity and Variety | |||||
| PPIs | 6552 | Nationwide retrospective cohort study | Effect of clopidogrel and PPIs on ACS | ||
| Volume, Velocity and Variety | |||||
| Wu et al[ | |||||
| South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | PPIs | 59233 | Nationwide retrospective cohort study | Effect of PPIs on thrombotic risk |
| Kim et al[ | |||||
| Volume, Velocity and Variety | |||||
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | Dabigatran | 5041 | Territory-wide retrospective cohort study | Risk factors for dabigatran-associated gastrointestinal bleeding |
| Chan et al[ | |||||
| Volume, Velocity and Variety | |||||
This list is not exhaustive, but serves to provide a few distinct examples of how Big Data analysis can generate high-quality research outputs in the field of gastroenterology and hepatology. 3V: Volume/velocity/variety; PUD: Peptic ulcer disease; H. pylori: Helicobacter pylori; PPIs: Proton pump inhibitors; ESRD: End-stage renal disease; ACS: Acute coronary syndrome.
Examples of studies on inflammatory bowel disease research by utilization of large healthcare datasets
| South Korea | Korean Health Insurance Review and Assessment Service (HIRA) | UC | 11233 | Nationwide retrospective cohort study | Incidence and clinical impact of perianal disease in UC |
| Song et al[ | |||||
| Comparator: general population | |||||
| Volume, Velocity and Variety | |||||
| Taiwan, China | Taiwan National Health Insurance Database (NHID) | IBD | 38039 | Nationwide retrospective cohort study to compare IBD patients with general population to derive SIR | Association between IBD and herpes zoster infection |
| Chang et al[ | |||||
| Hospital based nested case-control study | |||||
| Volume, Velocity and Variety | |||||
| Sweden | Swedish Patient Registry | UC | 63711 | Nationwide retrospective cohort study | Association between appendectomy and UC |
| Myrelid et al[ | |||||
| Volume, Velocity and Variety | |||||
| Swedish Medical Birth Register (child-mother link) | IBD | 827,239 children born between 2006 and 2013 | Nationwide prospective population-based register study | Association between maternal exposure to antibiotics during pregnancy and very early onset IBD | |
| Ortqvist et al[ | |||||
| Volume, Velocity and Variety | |||||
| Swedish Multigeneration Register (child-father link) | |||||
| Swedish Prescribed Drug Register National Patient Register | |||||
| United States | NCBI Gene Expression Omnibus (GEO) | IBD | n.a. | Signature inversion study | Topiramate as a potential therapeutic agent against IBD |
| Dudley et al[ | |||||
| Volume, Velocity and Variety | |||||
| United States | n.a. | IBD | 1585 | Retrospective cohort study Natural language processing | Association between arthralgia and biologics (anti-TNF |
| Cai et al[ | |||||
| Volume, Velocity and Variety | |||||
| n.a. | International IBD Genetics Consortium's Immunochip project | IBD | 53279 | Machine learning algorithm | Predictors of IBD |
| Wei et al[ | |||||
| Volume, Velocity and Variety | |||||
| United States | n.a. | IBD | 575 colonoscopy reports | Retrospective cohort study Natural language processing | Differentiation of surveillance from non-surveillance colonoscopy |
| Hou et al[ | |||||
| Volume, Velocity and Variety | |||||
| United States | n.a. | IBD | 1080 | Retrospective cohort study | Prediction of IBD remission in thiopurine users |
| Waljee et al[ | |||||
| Random Forest machine learning algorithm | |||||
| United States | n.a. | IBD | 20368 | Retrospective cohort study | Prediction of hospitalization and outpatient steroid use |
| Waljee et al[ | |||||
| Random Forest machine learning algorithm | |||||
| n.a. | Phase 3 clinical trial data | IBD | 491 | Retrospective cohort study | Prediction of steroid-free endoscopic remission with vedolizumab in UC |
| Waljee et al[ | |||||
| Random Forest machine learning algorithm | |||||
| Volume, Velocity and Variety | |||||
This list is not exhaustive, but serves to provide a few distinct examples of how Big Data analysis can generate high-quality research outputs in the field of gastroenterology and hepatology. 3V: Volume/velocity/variety; UC: Ulcerative colitis; IBD: Inflammatory bowel disease; SIR: Standardized incidence ratio; anti-TNF: anti-tumour necrosis factor.
Examples of studies on colorectal cancer research by utilization of large healthcare datasets
| Hong Kong, China | Clinical Data Analysis and Reporting System (CDARS) | CRC | 197902 | Territory-wide retrospective cohort study | Epidemiology, characteristics, risk factors and prognosis of postcolonoscopy colorectal cancer in Asians |
| Cheung et al[ | |||||
| Volume, Velocity and Variety | |||||
| CRC | 187897 | Territory-wide retrospective cohort study | Association between statins and CRC | ||
| Cheung et al[ | |||||
| PS matching | |||||
| Volume, Velocity and Variety | |||||
| United States | Nurses’ Health Study II (NHSII) | CRC | 134763 | Prospective cohort study | Association between DM and CRC |
| Ma et al[ | |||||
| Volume and Variety | |||||
| Health Professionals Follow-up Study (HPFS) | |||||
| Nurses’ Health Study (NHS) | CRC | 1660 | Prospective cohort study | Effect of calcium intake, coffee and fibre on survival after CRC diagnosis | |
| Yang et al[ | |||||
| 1599 | |||||
| Volume and Variety | |||||
| Health Professionals Follow-up Study (HPFS) | Hu et al[ | ||||
| 1575 | |||||
| Song et al[ | |||||
| Nurses’ Health Study (NHS) | CRC | 141143 | Prospective cohort study | Risk factors of serrated polyps and conventional adenomas | |
| He et al[ | |||||
| Nurses’ Health Study II (NHSII) | |||||
| de Jong et al[ | |||||
| Volume and Variety | |||||
| Health Professionals Follow-up Study (HPFS) | |||||
| Nurses’ Health Study II (NHSII) | CRC | 85256 | Prospective cohort study | Association between obesity and CRC | |
| Liu et al[ | |||||
| Volume and Variety | |||||
| Netherlands | Dutch Lynch syndrome Registry | Various cancers including CRC | 2788 | Retrospective cohort study | Decrease in CRC-related mortality in Lynch syndrome families by surveillance |
| Volume, Velocity and Variety | |||||
| Netherlands, Germany, Finland | Dutch Lynch syndrome Registry | CRC | 2747 patients with 16327 colonoscopies | Retrospective cohort study | Surveillance interval on CRC incidence and stage |
| Engel et al[ | |||||
| Volume, Velocity and Variety | |||||
| German HNPCC Consortium | |||||
| Finland | |||||
This list is not exhaustive, but serves to provide a few distinct examples of how Big Data analysis can generate high-quality research outputs in the field of gastroenterology and hepatology. 3V: Volume/velocity/variety; CRC: Colorectal cancer; DM: Diabetes mellitus.