Alexander K Smith, John Z Ayanian, Kenneth E Covinsky, Bruce E Landon, Ellen P McCarthy, Christina C Wee, Michael A Steinman.
Abstract
Secondary analyses of large datasets provide a mechanism for researchers to address high-impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium (www.sgim.org/go/datasets). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.
Year: 2011 PMID: 21301985 PMCID: PMC3138974 DOI: 10.1007/s11606-010-1621-5
Source DB: PubMed Journal: J Gen Intern Med ISSN: 0884-8734 Impact factor: 5.128
A Practical Approach to Successful Research with Large Datasets
| Steps | Practical advice |
|---|---|
| (1) Define your research topic and question | (1) Start with a thorough literature review |
| | (2) Ensure that the research question has clinical or policy relevance and is based on sound a priori reasoning. A good question is what makes a study good, not a large sample size |
| | (3) Be prepared to adapt your question to the strengths and limitations of the potential datasets |
| (2) Select a dataset | (1) Use a resource such as the Society of General Internal Medicine’s Online Compendium ( |
| | (2) To increase the novelty of your work, consider selecting a dataset that has not been widely used in your field or link datasets together to gain a fresh perspective |
| | (3) Factor in complexity of the dataset |
| | (4) Factor in dataset cost and time to acquire the actual dataset |
| | (5) Consider selecting a dataset your mentor has used previously |
| (3) Get to know your dataset | (1) Learn the answers to the following questions: |
| | • Why does the database exist? |
| | • Who reports the data? |
| | • What are the incentives for accurate reporting? |
| | • How are the data audited, if at all? |
| | • Can you link your dataset to other large datasets? |
| | (2) Read everything you can about the database |
| | (3) Check to see if your measures have been validated against other sources |
| | (4) Get a close feel for the data by analyzing it yourself or closely reviewing outputs if someone else is doing the programming |
| (4) Structure your analysis and presentation of findings in a way that is clinically meaningful | (1) Think carefully about the clinical implications of your findings |
| | (2) Be cautious when interpreting statistical significance (i.e., p-values). Large sample sizes can yield associations that are highly statistically significant but not clinically meaningful |
| | (3) Consult with a statistician for complex datasets and analyses |
| | (4) Think carefully about how you portray the data. A nice figure sometimes tells the story better than rows of data |
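
The caution about p-values in step (4) can be illustrated with a small calculation. The sketch below is not from the paper: it assumes a simple two-sided z-test with a known common standard deviation, and the blood pressure values, standard deviation, and sample sizes are hypothetical numbers chosen only to show how the same small difference moves from non-significant to highly significant as the sample grows.

```python
import math


def two_sample_z_test(mean_a, mean_b, sd, n_per_group):
    """Two-sided z-test for a difference in means, assuming a known common SD."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = (mean_a - mean_b) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical example: a 0.5 mmHg difference in mean systolic blood
# pressure (SD 15 mmHg), which few clinicians would consider meaningful.
small_study = two_sample_z_test(130.5, 130.0, sd=15.0, n_per_group=200)
large_study = two_sample_z_test(130.5, 130.0, sd=15.0, n_per_group=500_000)

print(f"n=200 per group:     z={small_study[0]:.2f}, p={small_study[1]:.3g}")
print(f"n=500,000 per group: z={large_study[0]:.2f}, p={large_study[1]:.3g}")
```

The effect size is identical in both calls; only the sample size changes. Reporting the absolute difference and its confidence interval alongside the p-value keeps the clinical magnitude visible.
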
Glossary of Terms Used in Secondary Dataset Analysis Research
| Term | Meaning |
|---|---|
| Types of datasets (not mutually exclusive) | |
| Administrative or claims data | Datasets generated from reimbursement claims, such as ICD-9 codes used to bill for clinical encounters, or discharge data such as discharge diagnoses |
| Longitudinal data | Datasets that measure factors of interest within the same subjects over time |
| Clinical registries | Datasets generated from registries of specific clinical conditions, such as regional cancer registries used to create the Surveillance Epidemiology and End Results Program (SEER) dataset |
| Population-based survey | A target population is available and well-defined, and a systematic approach is used to select members of that population to take part in the study. For example, SEER is a population-based survey because it aims to include data on all individuals with cancer cared for in the included regions |
| Nationally representative survey | Survey sample that is designed to be representative of the target population on a national level. Often uses a complex sampling scheme. The Health and Retirement Study (HRS), for example, is nationally representative of community-dwelling adults over age 50 |
| Panel survey | A longitudinal survey in which data are collected in the same panel of subjects over time. As one panel is at the middle or end of its participation, a panel of new participants is enrolled. In the Medical Expenditure Panel Survey (MEPS), for example, individuals in the same household are surveyed several times over the course of 2 years |
| Statistical sampling terms | |
| Clustering | Even simple random samples can be prohibitively expensive for practical reasons such as geographic distance between selected subjects. Identifying subjects within defined clusters, such as geographic regions or subjects treated by the same physicians, reduces cost and improves the feasibility of the study but may decrease the precision of estimates (e.g., wider confidence intervals) |
| Complex survey design | A survey design that is not a simple random selection of subjects. Surveys that incorporate stratification, clustering and oversampling (with patient weights) are examples of complex data. Statistical software is available that can account for complex survey designs and is often needed to generate accurate findings |
| Oversampling | Intentionally sampling a greater proportion of a subgroup, increasing the precision of estimates for that subgroup. For example, in the HRS, African-Americans, Latinos, and residents of Florida are oversampled (see also survey weights) |
| Stratification | In stratification, the target population is divided into relatively homogeneous groups, and a pre-specified number of subjects is sampled from within each stratum. For example, in the National Ambulatory Medical Care Survey physicians are divided by specialty within each geographic area targeted for the survey, and a certain number of each type of physician is then identified to participate and provide data about their patients |
| Survey weights | Weights are used to account for the unequal probability of subject selection due to purposeful over- or under-sampling of certain types of subjects and non-response bias. The survey weight is the inverse probability of being selected. By applying survey weights, the effects of over- and under-sampling of certain types of patients can be corrected such that the data are representative of the entire target population |
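
The glossary entries on oversampling and survey weights can be made concrete with a toy calculation. This is a minimal sketch under stated assumptions, not a method from the paper: the function name `weighted_proportion`, the population sizes, and the selection probabilities are all hypothetical. It produces only a point estimate; as the glossary notes, software that accounts for the complex survey design is often needed to obtain correct standard errors.

```python
def weighted_proportion(records):
    """Estimate a population proportion using inverse-probability weights.

    Each record is (has_condition, selection_probability), where
    has_condition is 1 or 0.
    """
    total_weight = 0.0
    weighted_cases = 0.0
    for has_condition, selection_probability in records:
        weight = 1.0 / selection_probability  # survey weight
        total_weight += weight
        weighted_cases += weight * has_condition
    return weighted_cases / total_weight


# Hypothetical population of 10,000: subgroup A has 1,000 people and is
# oversampled (100 sampled, probability 0.10); subgroup B has 9,000 people
# (90 sampled, probability 0.01). Prevalence is 40% in A and 10% in B.
sample = (
    [(1, 0.10)] * 40 + [(0, 0.10)] * 60   # subgroup A respondents
    + [(1, 0.01)] * 9 + [(0, 0.01)] * 81  # subgroup B respondents
)

unweighted = sum(has_condition for has_condition, _ in sample) / len(sample)
weighted = weighted_proportion(sample)

print(f"Unweighted prevalence: {unweighted:.3f}")
print(f"Weighted prevalence:   {weighted:.3f}")
```

In this toy example the unweighted estimate is pulled toward the oversampled subgroup, while the weighted estimate recovers the population prevalence implied by the assumed subgroup sizes (1,300 of 10,000).
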
Online Compendia of Secondary Datasets
| Compendium | Web address | Description |
|---|---|---|
| Society of General Internal Medicine (SGIM) Research Dataset Compendium | | Designed to assist investigators conducting research on existing datasets, with a particular emphasis on health services research, clinical epidemiology, and research on medical education. Includes information on strengths and weaknesses of datasets and the insights of experienced users about making best use of the data |
| National Information Center on Health Services Research and Health Care Technology (NICHSR) | | This group of sites provides links to a wide variety of data tools and statistics, including research datasets, data repositories, health statistics, survey instruments, and more. It is sponsored by the National Library of Medicine |
| Inter-University Consortium for Political and Social Research (ICPSR) | | World’s largest archive of digital social science data, including many datasets with extensive information on health and health care. ICPSR includes many sub-archives on specific topic areas, including minority health, international data, substance abuse, mental health, and more |
| Partners in Information Access for the Public Health Workforce | | Provides links to a variety of national, state, and local health and public health datasets. Also provides links to sites providing a wide variety of health statistics, information on health information technology and standards, and other resources. Sponsored by a collaboration of US government agencies, public health organizations, and health sciences libraries |
| Canadian Research Data Centres | | Links to datasets available for analysis through Canada’s Research Data Centres (RDC) program |
| Directory of Health and Human Services Data Resources (US Dept. of Health and Human Services) | | This site provides brief information and links to almost all datasets from National Institutes of Health (NIH), Centers for Disease Control and Prevention (CDC), Centers for Medicare and Medicaid Services (CMS), Agency for Healthcare Research and Quality (AHRQ), Food and Drug Administration (FDA), and other agencies of the US Department of Health and Human Services |
| National Center for Health Statistics (NCHS) | | This site links to a variety of datasets from the National Center for Health Statistics, several of which are profiled in Table |
| Medicare Research Data Assistance Center (RESDAC); and Centers for Medicare and Medicaid Services (CMS) Research, Statistics, Data & Systems | | These sites link to a variety of datasets from the Centers for Medicare and Medicaid Services (CMS) |
| Veterans Affairs (VA) data | | A series of datasets using administrative and computerized clinical data to describe care provided in the VA health care system, including information on outpatient visits, pharmacy data, inpatient data, cost data, and more. With some exceptions, use is generally restricted to researchers with VA affiliations (this can include a co-investigator with a VA affiliation) |
Examples of High Value Datasets
| Cost, availability, and complexity | Dataset | Description | Sample publications |
|---|---|---|---|
| Free. Readily available. Population-based survey with cross-sectional design. Does not require special statistical techniques to address complex sampling | Surveillance, Epidemiology and End Results Program (SEER) | Population-based multi-regional cancer registry database. SEER data are updated annually. Can be linked to Medicare claims and files (see Medicare below) | Trends in breast-conserving surgery among Asian Americans and Pacific Islanders, 1992–2000; Treatment and outcomes of gastric cancer among US-born and foreign-born Asians and Pacific Islanders |
| Free. Readily available. Requires statistical considerations to account for complex sampling design and use of survey weights | National Ambulatory Medical Care Survey (NAMCS) & National Hospital Ambulatory Medical Care Survey (NHAMCS) | Nationally-representative serial cross-sectional surveys of outpatient and emergency department visits. Can combine survey years to increase sample sizes (e.g., for uncommon conditions) or evaluate temporal trends. Provides national estimates. The NAMCS and NHAMCS are conducted annually. Do not link to other datasets | Preventive health examinations and preventive gynecological examinations in the US; Primary care physician office visits for depression by older Americans |
| | National Health Interview Survey (NHIS) | Nationally-representative serial cross-sectional survey of individuals and families including information on health status, injuries, health insurance, access and utilization information. The NHIS is conducted annually. Can combine survey years to look at rare conditions. Can be linked to National Center for Health Statistics Mortality Data; Medicare enrollment and claims data; Social Security Benefit History Data; Medical Expenditure Panel Survey (MEPS) data; and National Immunization Provider Records Check Survey (NIPRCS) data from 1997–1999 | Psychological distress in long-term survivors of adult-onset cancer: results from a national survey; Diabetes and Cardiovascular Disease among Asian Indians in the US |
| | Behavioral Risk Factor Surveillance System (BRFSS) | Serial cross-sectional nationally-representative survey of health risk behaviors, preventive health practices, and health care access. Provides national and state estimates. Since 2002, the Selected Metropolitan/Micropolitan Area Risk Trends (SMART) project has also used BRFSS data to identify trends in selected metropolitan and micropolitan statistical areas (MMSAs) with 500 or more respondents. BRFSS data are collected monthly. Does not link to other datasets | Perceived discrimination in health care and use of preventive health services; Use of recommended ambulatory care services: is the Veterans Affairs quality gap narrowing? |
| Free or minimal cost. Readily available. Can do more complex studies by combining data from multiple waves and/or records. Accounting for complex sampling design and use of survey weights can be more complex when using multiple waves; seek support from a statistician. Alternatively, can restrict sample to single waves for ease of use | Nationwide Inpatient Sample (NIS) | The largest US database of inpatient hospital stays that incorporates data from all payers, containing data from approximately 20% of US community hospitals. Sampling frame includes approximately 90% of discharges from US hospitals. NIS data are collected annually. For most states, the NIS includes hospital identifiers that permit linkages to the American Hospital Association (AHA) Annual Survey Database and county identifiers that permit linkages to the Area Resource File (ARF) | Factors associated with patients who leave acute-care hospitals against medical advice; Impact of hospital volume on racial disparities in cardiovascular procedure mortality |
| | National Health and Nutrition Examination Survey (NHANES) | Nationally-representative series of studies combining data from interviews, physical examination, and laboratory tests. NHANES data are collected annually. Can be linked to National Death Index (NDI) mortality data; Medicare enrollment and claims data; Social Security Benefit History Data; Medical Expenditure Panel Survey (MEPS) data; and Dual Energy X-Ray Absorptiometry (DXA) Multiple Imputation Data Files from 1999–2004 | Demographic differences and trends of vitamin D insufficiency in the US population, 1988-2004; Association of hypertension, diabetes, dyslipidemia, and metabolic syndrome with obesity: findings from the National Health and Nutrition Examination Survey, 1999 to 2004 |
| | The Health and Retirement Study (HRS) | A nationally-representative longitudinal survey of adults older than 50 designed to assess health status, employment decisions, and economic security during retirement. HRS data are collected every 2 years. Can be linked to Social Security Administration data; Internal Revenue Service data; Medicare claims data (see Medicare below); and Minimum Data Set (MDS) data | Chronic conditions and mortality among the oldest old; Advance directives and surrogate decision making before death |
| | Medical Expenditure Panel Survey (MEPS) | Serial nationally-representative panel survey of individuals, families, health care providers, and employers covering a variety of topics. MEPS data are collected annually. Can be linked, by request to the Agency for Healthcare Research and Quality, to numerous datasets including the NHIS, Medicare data, and Social Security data | Loss of health insurance among non-elderly adults in Medicaid; Influence of patient-provider communication on colorectal cancer screening |
| Data costs are in the thousands to tens of thousands of dollars. Requires an extensive application and time to acquire data is on the order of months at a minimum. Databases frequently have observations on the order of 100,000 to >1,000,000. Require additional statistical considerations to account for complex sampling design, use of survey weights, or longitudinal analysis. Multiple records per individual. Complex database structure requires a higher degree of analytic and programming skill to create a study dataset efficiently | Medicare claims data (alone), SEER-Medicare, and HRS-Medicare | Claims data on Medicare beneficiaries including demographics and resource utilization in a wide variety of inpatient and outpatient settings. Medicare claims data are collected continually and made available annually. Can be linked to other Medicare datasets that use the same unique identifier numbers for patients, providers, and institutions, for example, the Medicare Current Beneficiary Survey, the Long-Term Care Minimum Data Set, the American Hospital Association Annual Survey, and others. SEER and the HRS offer linkages to Medicare data as well (as described above) | Long-term outcomes and costs of ventricular assist devices among Medicare beneficiaries; Association between the Medicare Modernization Act of 2003 and patient wait times and travel distance for chemotherapy |
| | Medicare Current Beneficiary Survey (MCBS) | Panel survey of a nationally-representative sample of Medicare beneficiaries including health status, health care use, health insurance, socioeconomic and demographic characteristics, and health expenditures. MCBS data are collected annually. Can be linked to other Medicare Data | Cost-related medication nonadherence and spending on basic needs following implementation of Medicare Part D; Medicare beneficiaries and free prescription drug samples: a national survey |
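
Several entries above mention multiple records per individual and linkage through shared identifiers. The sketch below shows one way such a study dataset might be assembled, under loud assumptions: the field names (`person_id`, `paid`, `age`), the helper functions, and all values are hypothetical and do not reflect the actual layout of Medicare, SEER, or HRS files. It also assumes that a respondent with no matching claims truly had none, which may not hold in real data (for example, if the person was not enrolled).

```python
from collections import defaultdict


def summarize_claims(claims):
    """Collapse multiple claim records into one summary row per person."""
    totals = defaultdict(lambda: {"n_claims": 0, "total_paid": 0.0})
    for claim in claims:
        row = totals[claim["person_id"]]
        row["n_claims"] += 1
        row["total_paid"] += claim["paid"]
    return dict(totals)


def link_survey_to_claims(survey_rows, claim_summary):
    """Left-join survey respondents to their claim summary on person_id."""
    linked = []
    for person in survey_rows:
        # Assumption: no matching claims means zero utilization.
        summary = claim_summary.get(
            person["person_id"], {"n_claims": 0, "total_paid": 0.0}
        )
        linked.append({**person, **summary})
    return linked


# Hypothetical records for illustration only.
survey = [
    {"person_id": 1, "age": 71},
    {"person_id": 2, "age": 68},
    {"person_id": 3, "age": 80},
]
claims = [
    {"person_id": 1, "paid": 100.0},
    {"person_id": 1, "paid": 250.0},
    {"person_id": 2, "paid": 75.0},
]

study_dataset = link_survey_to_claims(survey, summarize_claims(claims))
for row in study_dataset:
    print(row)
```

Checking the row count before and after linkage, as a left join preserves it here, is one simple way to get the close feel for the data that the practical advice recommends.
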