
Clinical epidemiology in the era of big data: new opportunities, familiar challenges.

Vera Ehrenstein¹, Henrik Nielsen¹, Alma B Pedersen¹, Søren P Johnsen¹, Lars Pedersen¹.

Abstract

Routinely recorded health data have evolved from mere by-products of health care delivery or billing into a powerful research tool for studying and improving patient care through clinical epidemiologic research. Big data in the context of epidemiologic research means large interlinkable data sets within a single country or networks of multinational databases. Several Nordic, European, and other multinational collaborations are now well established. Advantages of big data for clinical epidemiology include improved precision of estimates, which is especially important for reassuring ("null") findings; the ability to conduct meaningful analyses in subgroups of patients; and rapid detection of safety signals. Big data will also provide new possibilities for research by enabling access to linked information from biobanks, electronic medical records, patient-reported outcome measures, automatic and semiautomatic electronic monitoring devices, and social media. The sheer amount of data, however, does not eliminate and may even amplify systematic error. Therefore, methodologies addressing systematic error, clinical knowledge, and underlying hypotheses are more important than ever to ensure that the signal is discernible behind the noise.


Keywords: electronic health records; healthcare administrative claims; medical record linkage; multicenter studies; validation studies

Year: 2017 PMID: 28490904 PMCID: PMC5413488 DOI: 10.2147/CLEP.S129779

Source DB: PubMed Journal: Clin Epidemiol ISSN: 1179-1349 Impact factor: 4.790

Introduction

Big data has firmly established itself in health research,1,2 as illustrated by publications in high-ranking general-interest biomedical journals, including The New England Journal of Medicine,3 JAMA,4 Journal of Internal Medicine,5 Science,6–9 and Nature.10–13 A basic definition of big data includes “the 3 Vs”: variety (linkage of many data sets from heterogeneous independent sources in a single data set); volume (large number of observations and variables per observation from different sources); and/or velocity (real-time or frequent data updates, often fully or partially automated).14 Other definitions encompass three additional Vs: value (clinically relevant information); variability (eg, seasonal or secular disease trends); and veracity (data quality).2 Routinely recorded health data are large automated data sets stemming from day-to-day activities of health care, such as hospital admissions or claims.15–18 These data have evolved from mere by-products of health care delivery or billing into a powerful tool for improving patient care through preventive, etiologic, and prognostic epidemiologic research.4 A recent article summarizes the 46 most influential studies conducted with big data in health care,1 while a review from 2015 provides multiple examples of the “variety” V in big data for health.2

The notion of applying lessons from the clinical past to the clinical future is “as old as medicine.”19 In a simplified form, evidence-based medical care means that a clinician can use research results in making treatment decisions in his or her clinical practice, often through explicit literature-based treatment guidelines. For a clinician, this means answers to questions such as: “How likely is my patient with atrial fibrillation on oral anticoagulants to develop major bleeding? Does the risk vary by type of anticoagulant or patient characteristics?” or “To what extent does comorbidity affect mortality of patients with hip fracture?” To be answered, a clinical question must first be translated into a precise research question and then back-translated and interpreted for clinical decision making. Therefore, it is essential for clinicians and epidemiologists to understand each other’s language.

For an epidemiologist, an answer to a research question should be a precise and valid estimate of an underlying population parameter such as a mean, risk, incidence rate, or odds ratio. Big data, via the “volume” V, often addresses the precision component but does little to address validity (the “veracity” V in the big-data vocabulary). Plausible hypotheses, expert knowledge, and accurate measurement tools must be available to ensure validity of research findings, since a highly precise but biased result, especially one perceived as credible based on precision alone, is more dangerous when translated into clinical practice than an imprecise biased result.20,21 This paper, using primarily case studies from the Nordic countries, provides a brief overview and examples of the use of big data in clinical epidemiology and outlines the associated advantages and challenges.
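The distinction between precision and validity drawn above can be illustrated with a small calculation. The sketch below is not taken from the paper: the true risk, sensitivity, and specificity are hypothetical values chosen for illustration, and the expected measured risk is used as the estimate, so sampling variability of the estimate itself is ignored. It shows how, under these assumptions, a confidence interval around a misclassified outcome narrows with sample size while remaining centered on the wrong value.

```python
import math

def observed_risk(true_risk, sensitivity, specificity):
    """Expected measured risk when the outcome is recorded with error."""
    return sensitivity * true_risk + (1 - specificity) * (1 - true_risk)

def wald_ci(risk, n, z=1.96):
    """Approximate 95% Wald confidence interval for a proportion."""
    half_width = z * math.sqrt(risk * (1 - risk) / n)
    return risk - half_width, risk + half_width

# All three values below are hypothetical, chosen only for illustration.
TRUE_RISK = 0.030      # true risk of the outcome
SENSITIVITY = 0.80     # 20% of true events are never recorded
SPECIFICITY = 0.999    # 0.1% of non-events are recorded as events

expected = observed_risk(TRUE_RISK, SENSITIVITY, SPECIFICITY)

for n in (1_000, 100_000, 10_000_000):
    low, high = wald_ci(expected, n)
    covers_truth = low <= TRUE_RISK <= high
    print(f"n={n:>10,}  estimate={expected:.4f}  "
          f"CI=({low:.4f}, {high:.4f})  covers true risk: {covers_truth}")
```

With these assumed inputs, only the smallest sample yields an interval wide enough to include the true risk; the larger samples produce narrower intervals that exclude it, which is the sense in which a precise biased result can mislead more than an imprecise one.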

Examples of big data collaborations in epidemiology

Some say that the digitalization of medical records revolutionized the usability of big data in medical research.4 Whether or not this claim is accepted, it is important to be aware that the current development follows a long evolution of using register data for medical research. This evolution started with the establishment of the first National Leprosy Register, in Norway, in 1856 (Figure 1),22,23 and of the Danish Cancer Registry, in 1943.24 Other Nordic registries followed, most of them established between the 1960s and the early 2000s.25,26 Researchers in the Nordic countries were using the volume component of big data before the term was invented: for decades, epidemiologists have been conducting studies based on linkage of routinely collected data from multiple administrative, health, and demographic registries, and the potential of these registries has been recognized at least since the 1990s,27 if not earlier.28

Figure 1

Building that used to house the Norwegian Leprosy Registry, currently home of the Department of Global Public Health and Primary Care, University of Bergen, Norway.

Note: Courtesy: Dr Astrid Lunde.

Estimates of association with narrow confidence intervals often stem from big data analyses of common health outcomes in population-based registry data spanning several decades. When the intervention or the outcome of interest is rare, even data from an entire country may be insufficient, requiring that data from different countries be combined. Several formal or ad hoc collaborative networks in observational epidemiology have arisen, often from the need to study benefits and risks of relatively uncommon pharmacological16,29–31 or surgical32,33 interventions, or vaccines.3,30 Examples of pan-Nordic collaborations using combined data from Denmark, Finland, Iceland, Norway, and Sweden31,34,35 include studies on prenatal exposure to antidepressants and adverse effects in the offspring31,34,35 and the Nordic Arthroplasty Register Association (NARA) database of about 1 million primary hip and knee replacement procedures performed since 1995 in Denmark, Finland, Norway, and Sweden.36 NARA enabled studies of rare risk factors and outcomes, for which single-country data are too sparse.32,33

One clinically relevant question is whether the type of fixation used in total hip replacement (THR) is associated with risk of subsequent revision in patients younger than 55 years of age, since these patients may differ from older patients in mobility, post-THR life expectancy, and compliance with treatment. Only 5% of THR procedures are performed in patients younger than 55 years, and previous studies, including those based on national hip registries, had insufficient sample size to address the fixation issue in younger patients. Pedersen et al37 used NARA to assemble a study population of ~30,000 patients younger than 55 years undergoing THR, with each fixation technique represented by more than 3,000 observations.
The study yielded a clinically relevant message: uncemented implants are associated with a lower long-term risk of aseptic loosening but a higher short-term risk of revision. Thus, the purpose of uncemented implants has been achieved in the long term, but technical issues causing dislocation, periprosthetic fracture, and infection had previously been overlooked in patients younger than 55 years.

Use of routinely collected data for epidemiologic research has also been possible outside the Nordic countries, including general practice-based data in the UK, or claims-based databases and database networks in the USA. In contrast to the typical European health care databases, which are established to fulfill administrative (health services), clinical quality, or surveillance needs, the US claims databases (eg, Medicare, Medicaid, and commercial insurance records) are by-products of medical accounting. Several European database networks, including those encompassing the Nordic data, have been successfully established and have found ways to overcome the challenges posed by differences in the underlying health care systems, languages, data-sharing laws, record-generating mechanisms, and classifications.5,16,30,38,39 Medical data in the Nordic countries are coded using a common basic set of standard classifications (International Classification of Diseases, Nordic Medico-Statistical Committee classification for procedures and causes of injury,40,41 or Anatomical Therapeutic Chemical codes for medications), which makes it easier to establish common algorithms.
In the USA, Medicare and Medicaid provide financial incentives for “meaningful use” of electronic health records.3 The most prominent big data collaborative models in the USA have been the Mini-Sentinel project and the Observational Medical Outcomes Partnership (OMOP).3 Routine records accumulated in systems like Mini-Sentinel or OMOP differ from those in Europe in the structure of the underlying health care system, linkage possibilities, and the availability of complete lifelong follow-up. Thus, certain aspects of big data in the Nordic countries are more diverse than those in many other databases (the “volume” V and the “variety” V of big data), thanks to individual-level linkage to both medical and nonmedical data, including education, income, and residence, and because of lifelong follow-up. In 2013, the Mini-Sentinel project covered 360 million person-years of observation representing 150 million lives.3 In 2014, the Danish Civil Registration System, with its linkable network of national registries, covered 400 million person-years of observation from 9.5 million lives.25 Asian countries are building a linkable registry infrastructure with individual-level linkage mimicking that of the Nordic countries.42

The “variety” V of big data is developing rapidly, whereby previously unused or underused types of data are incorporated into medical research, including electronic medical records, imaging, biobanks, and patient-reported data (including social media and wearables).2,43 Individual linkage may not always be necessary: in a classical ecologic study, hostility of language on Twitter was associated with county-specific mortality from heart disease.44 Pharmacovigilance with social media is already a reality.45 Mobile phones can be used to test and subsequently deliver behavioral interventions such as smoking cessation aid46 or adherence support.47 The type of bias associated with certain types of data may change over time.
For example, in the early days of epidemiologic research, random landline phone surveys tended to select the relatively more affluent, the employed, and the young. Today, these groups are more likely to be accessed via social networks and mobile telephony,2 while use of landline phones may select for older or disadvantaged population segments.

Assembling database networks carries with it technical, logistical, ethical, and legal challenges.48 The last two are often the hardest to overcome because of issues of data access, patient privacy, and potential conflicts of interest. Even in large studies, one has to remain vigilant about patient privacy and the possibility of inadvertently identifying individuals based on a set of rare characteristics. Gini et al16 provide a practical guide to the different models of data networking, defined by the degree of centralization and harmonization of the different analytic processes. It seems practical to designate a single network partner, with adequate resources, as the coordinating analytic hub. The process starts with raw data from each participating database and ends with the statistical output combining results of individual patients from all databases. Between the starting and the end points, there exist different models for the extent of process automation, autonomy, and control enjoyed by each data partner. A global protocol, with flexibility for local adaptations, is usually followed. Depending on the aims of the study, the analysis may entail as little sharing as contributing country-specific odds ratios for a meta-analysis or as much sharing as harmonization and pooling of individual-level data sets.16 Harmonization involves transformations, whereby each partner creates standard input data sets according to an exact specification, a common data model (CDM), which dictates the data set types and structure, variable names and attributes, and definitions of derived variables.
A single statistical analytic program is then run on the CDM-conforming files either by each network partner locally (“one analyst, many outputs”) or centrally by the hub on the combined data set (“one analyst, one output”). By contrast, the “many analysts, many outputs” approach is discouraged because it is prone to error and duplicates work. Whether there is one analyst or many, quality control of programming by another analyst is always necessary.

Health outcomes measured by health care professionals might differ from the outcomes subjectively experienced by patients, and the latter also affect the outcome of treatment. To fill this gap, patient-reported outcome measures (PROMs) are being used increasingly.49 An example of incorporating PROMs in a single-country setting, while capitalizing on the unique data linkage capabilities common to the Nordic settings, is AmbuFlex, the generic infrastructure for collecting PROM data developed in Denmark by Hjollund et al.50 The researchers have successfully implemented flexible paper-based and electronic collection of PROM data in more than 20 projects since 2004. Group-level aggregated PROM data, linked with data from routine registries and clinical databases, can be used to monitor national and regional hospital performance in oncology and cardiology care, psychiatry, neurology, and orthopedics. Patient-level PROM data collected at the clinic level, in combination with electronic health records, can be used to facilitate screening, clinical decisions, patient–doctor communication, and efficient use of resources in cardiology, rheumatology, and oncology. Response rates exceeded 75% in all and 90% in most cases.
A clinical decision support function of PROMs can save clinicians’ time by using an algorithm-based initial identification of patients in need of immediate attention, while presenting data on other patients in a decision-supporting format for clinical judgment.50 AmbuFlex is a unique example of implementation in routine care: it is a generic system integrated with electronic medical records and is used for longitudinal collection of detailed individual-level PROM data to personalize care for the individual patient. This allows the collection of PROM data on large cohorts of chronically ill patients over many years, similar to the systems currently in place for administrative data.
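Returning to the data-networking models described earlier in this section, the least-sharing option, in which each partner contributes a country-specific odds ratio for a meta-analysis, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any specific network: the partner names and 2x2 counts are hypothetical, and fixed-effect inverse-variance pooling with Woolf variances is one common choice among several.

```python
import math

# Hypothetical 2x2 counts per data partner, in the order:
# (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases)
PARTNER_TABLES = {
    "partner_A": (40, 960, 300, 8700),
    "partner_B": (25, 475, 410, 9090),
    "partner_C": (60, 1940, 250, 7750),
}

def log_odds_ratio(a, b, c, d):
    """Return the log odds ratio and its variance (Woolf's method)."""
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance

def pool_fixed_effect(estimates):
    """Inverse-variance fixed-effect pooling of (log_or, variance) pairs."""
    weights = [1 / var for _, var in estimates]
    weighted_sum = sum(w * est for w, (est, _) in zip(weights, estimates))
    pooled = weighted_sum / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Step 1: every partner runs the same program locally and shares only
# aggregate results, never individual-level records.
local_results = {
    name: log_odds_ratio(*counts) for name, counts in PARTNER_TABLES.items()
}

# Step 2: the coordinating hub combines the shared estimates.
pooled_log_or, pooled_se = pool_fixed_effect(list(local_results.values()))
pooled_or = math.exp(pooled_log_or)
ci_low = math.exp(pooled_log_or - 1.96 * pooled_se)
ci_high = math.exp(pooled_log_or + 1.96 * pooled_se)

for name, (log_or, var) in local_results.items():
    print(f"{name}: OR={math.exp(log_or):.2f} (SE of log OR={math.sqrt(var):.3f})")
print(f"pooled: OR={pooled_or:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Because the same function is applied to every partner's table, the sketch corresponds to the “one analyst, many outputs” arrangement; pooling individual-level CDM files centrally would replace step 1 with a single analysis on the combined data set.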

Big data in epidemiology: benefits and challenges

Precision of results is not the only benefit of big data. Observations from large numbers of individuals allow rapid detection of potential risk signals associated with newly marketed therapies, for which risks of rare adverse events are rarely known from Phase III preapproval trials (the “velocity” V of big data).51 A thought experiment showed that having records of 100 million patients for safety monitoring would have allowed the detection of adverse cardiovascular effects of rofecoxib (Merck, Kenilworth, NJ, USA) in 3 months instead of 5 years.5,52

On the other hand, large data sets help convincingly rule out harmful associations, in the so-called “null studies.” One example is the abovementioned Nordic collaboration on safety of antidepressant use in pregnancy. Fewer than 2% of pregnant women use selective serotonin reuptake inhibitors (SSRIs) in pregnancy, while birth defects affect about 3% of live births. Therefore, it took a pan-Nordic study to assemble a study population of >1.5 million pregnancies with ~73,000 malformation cases, including ~33,000 SSRI-exposed pregnancies with >1,300 cases exposed to SSRIs.34 The study convincingly showed a null association between maternal use of SSRIs and major birth defects, providing reassurance to pregnant women with depression and their physicians.

Finally, in analyses based on large data sets, estimates are likely to be “highly statistically significant,” ie, associated with P-values <0.05. This “universal statistical significance” could finally lay to rest reliance on P-values for interpretation of study results, allowing researchers to focus on clinical significance instead.53–55

The perks of big data should not go to our collective heads. Big data does not address the usual epidemiologic challenges related to validity, and may even amplify them.15,56 Accurate measurement of study variables remains imperative in big-data settings.
An advantage of multinational databases is that estimates originating from different databases to address the same research question amount to reproducibility checks of results under varying assumptions about the record-generating mechanisms and the effects of the underlying health care and social structures. At the same time, in multinational database studies, validity concerns increase in proportion to the number of databases, because several valid operational definitions are needed for the same clinical characteristic or event to avoid propagating a systematic error on a large scale.53,56 Validation of algorithms in large secondary databases remains imperative for valid inference.15,56,57 The NARA collaboration has contributed to improved data validity in all four participating countries through regular meetings, where differences in registration practice have been discussed. In addition, various research projects have identified differences in data quality between registries, and national registries have subsequently been changed to achieve uniform data definition, collection, and interpretation.

Large amounts of missing data may cause selection bias and undermine gains in precision afforded by big data, since in multiple regression models, standard statistical software removes observations with missing values. Reverse causation, immortal time bias,58 and healthy user/healthy adherer bias59 are likewise not remedied by large amounts of data and need to be addressed in big-data and small-data studies alike. On a pragmatic level, delays in data delivery and changes in coding practice present additional challenges.
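The “universal statistical significance” mentioned earlier in this section can be made concrete with a short calculation. The sketch below is illustrative only: the two risks are hypothetical, the expected risks stand in for observed proportions, and the P-value comes from a normal-approximation z-test for two equal-sized groups. It shows a fixed, clinically trivial risk difference moving from "not significant" to "highly significant" purely because the sample grows.

```python
import math

def two_sided_p(risk_exposed, risk_unexposed, n_per_group):
    """Two-sided P-value from a z-test comparing two equal-sized groups.

    Uses the normal approximation with a pooled variance estimate.
    """
    pooled = (risk_exposed + risk_unexposed) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (risk_exposed - risk_unexposed) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical risks: a risk ratio of 1.01, unlikely to matter clinically.
RISK_UNEXPOSED = 0.0300
RISK_EXPOSED = 0.0303

p_values = {}
for n in (10_000, 1_000_000, 100_000_000):
    p_values[n] = two_sided_p(RISK_EXPOSED, RISK_UNEXPOSED, n)
    verdict = "P < 0.05" if p_values[n] < 0.05 else "P >= 0.05"
    print(f"n per group={n:>11,}  P={p_values[n]:.3g}  ({verdict})")
```

The effect size is identical in all three rows; only the sample size changes. This is why, in very large data sets, the size and clinical relevance of the estimate, together with its validity, carry more information than the P-value.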

Conclusion

Epidemiologic research, including database research, is an “exercise in measurement,”60 in an effort to maximize the signal-to-noise ratio. The results of big data-based medical research represent a dividend to the public on its investment, made by contributing both data and tax money to routine databases. The advantages of big data are precision of results, including precise “null” findings, the ability to address clinical questions in patient subgroups, and rapid detection of risk signals. In the Nordic countries, big data is collected and maintained by public institutions and operates in the setting of income-independent access to health care and lifelong follow-up. In other settings, such as US claims databases, demographic or economic disadvantages are better represented, while follow-up is not lifelong and health care access may be interrupted. Combining evidence from different settings and countries creates multiple-informant settings, providing built-in cross-validation and addressing a wide array of clinical questions in a single study.

A formal requirement of big data is that the size, complexity, and velocity of the data are too intense for processing and interpretation with existing tools. In the Nordic settings, the volume has been available for some decades, and the variety is increasing rapidly to include data on imaging, behavior, geo-location, ecology, genetics, and patient-reported outcomes. Velocity has not yet reached the real-time update stage, but it is improving, and its value is obvious. Veracity (familiar to epidemiologists as validity) needs to be assured before data can be interpreted. The large amount of data, thus, does not eliminate and may amplify sources of systematic error. To that end, technical expertise, clinical knowledge, and underlying hypotheses are more important than ever to ensure that the signal is not drowned out by noise.

58 in total

Review 1. Utilizing social media data for pharmacovigilance: A review.

Authors: Abeed Sarker; Rachel Ginn; Azadeh Nikfarjam; Karen O'Connor; Karen Smith; Swetha Jayaraman; Tejaswi Upadhaya; Graciela Gonzalez
Journal: J Biomed Inform Date: 2015-02-23 Impact factor: 6.317

2. Drug safety reform at the FDA--pendulum swing or systematic improvement?

Authors: Mark McClellan
Journal: N Engl J Med Date: 2007-04-13 Impact factor: 91.245

3. Big data in epidemiology: too big to fail?

Authors: Arnaud Chiolero
Journal: Epidemiology Date: 2013-11 Impact factor: 4.822

4. Big data in epidemiology: too big to fail?

Authors: Sengwee Toh; Richard Platt
Journal: Epidemiology Date: 2013-11 Impact factor: 4.822

Review 5. The Nordic countries as a cohort for pharmacoepidemiological research.

Authors: Kari Furu; Björn Wettermark; Morten Andersen; Jaana E Martikainen; Anna Birna Almarsdottir; Henrik Toft Sørensen
Journal: Basic Clin Pharmacol Toxicol Date: 2009-12-04 Impact factor: 4.080

6. From "big epidemiology" to "colossal epidemiology": when all eggs are in one basket.

Authors: Miguel A Hernán; David A Savitz
Journal: Epidemiology Date: 2013-05 Impact factor: 4.822

7. The inevitable application of big data to health care.

Authors: Travis B Murdoch; Allan S Detsky
Journal: JAMA Date: 2013-04-03 Impact factor: 56.272

8. US big-data health network launches aspirin study.

Authors: Sara Reardon
Journal: Nature Date: 2014-08-07 Impact factor: 49.962

9. Bioinformatics: Big data versus the big C.

Authors: Neil Savage
Journal: Nature Date: 2014-05-29 Impact factor: 49.962

10. Selective serotonin reuptake inhibitors and venlafaxine in early pregnancy and risk of birth defects: population based cohort study and sibling design.

Authors: Kari Furu; Helle Kieler; Bengt Haglund; Anders Engeland; Randi Selmer; Olof Stephansson; Unnur Anna Valdimarsdottir; Helga Zoega; Miia Artama; Mika Gissler; Heli Malm; Mette Nørgaard
Journal: BMJ Date: 2015-04-17
