Generating Subcounty Health Data Products: Methods and Recommendations From a Multistate Pilot Initiative.

Trang Q Nguyen, Isaac H Michaels, Dulce Bustamante-Zamora, Brian Waterman, Elna Nagasako, Yunshu Li, Marjory L Givens, Keith Gennuso.

Abstract

BACKGROUND: County Health Rankings & Roadmaps (CHR&R) makes data on health determinants and outcomes available at the county level, but health data at subcounty levels are needed. Three pilot projects in California, Missouri, and New York explored multiple approaches for defining measures and producing data at subcounty geographic and demographic levels based on the CHR&R model. This article summarizes the collective technical and implementation considerations from the projects, challenges inherent in analyzing subcounty health data, and lessons learned to inform future subcounty health data projects.
METHODS: The research teams used 12 data sources to produce 40 subcounty measures that replicate or approximate county-level measures from the CHR&R model. Using varying technical methods, the pilot projects followed similar stages: (1) conceptual development of data sources and measures; (2) analysis and presentation of small-area and subpopulation measures for public health, health care, and lay audiences; and (3) positioning the subcounty data initiatives for growth and sustainability. Unique technical considerations, such as degree of data suppression or data stability, arose during the project implementation. A compendium of technical resources, including samples of automated programs for analyzing and reporting subcounty data, was also developed.
RESULTS: The teams summarized the common themes shared by all projects as well as unique technical considerations arising during the project implementation. Furthermore, technical challenges and implementation challenges involved in subcounty data analyses are discussed. Lessons learned and proposed recommendations for prospective analysts of subcounty data are provided on the basis of project experiences, successes, and challenges.
CONCLUSIONS: This multistate pilot project offers 3 successful approaches for creating and disseminating subcounty data products to communities. Subcounty data often are more difficult to obtain than county-level data and require additional considerations such as estimate stability, validating accuracy, and protecting individual confidentiality. We encourage future projects to further refine techniques for addressing these critical considerations.

Year: 2021 PMID: 32332489 PMCID: PMC7690642 DOI: 10.1097/PHH.0000000000001167

Source DB: PubMed Journal: J Public Health Manag Pract ISSN: 1078-4659

In the past decade, county-level data for health-related measures have been made broadly available nationwide.1 However, to further support communities in assessing and effectively reducing local health burden, subcounty data are greatly needed for identifying health disparities within counties that county-level data may not detect.2 Subcounty data are also needed for targeting, monitoring, and evaluating public health interventions.3 Granular knowledge about the distribution of health burdens enables public health practitioners to focus limited resources on the most acutely affected geographies and groups to achieve optimal impact.4 Multiple terms are used to describe subcounty-level data.5,6 In this context, “small-area data” refers to aggregate data for towns, zip code tabulation areas (ZCTAs), census tracts, and other geographies that can be smaller than counties. Likewise, “subpopulation data” refers to aggregate data for age groups, sexes, races, ethnicities, or other groups that, in part, comprise the total populations of counties. 
Hospitals and local health departments are increasingly interested in accessing subcounty data.3 The Patient Protection and Affordable Care Act requires private nonprofit hospitals to regularly conduct community health needs assessments that include implementation plans for addressing identified health issues.7,8 Similarly, the national Public Health Accreditation Board requires accredited health departments and those pursuing accreditation to regularly conduct community health assessments and develop community health improvement plans.9 In past decades, studies have explored small-area data analysis for specific health outcomes and/or small geographical areas.10–12 More recently, public health analysts have explored data at the subcounty level and some have begun to support these assessments and plans by improving the availability of subcounty data.13–20 However, these projects either covered a specific geographic area (eg, one city, one county) or included a limited number of data sources and measures. Since 2010, the County Health Rankings & Roadmaps (CHR&R) have provided overall health measures for nearly every county nationwide.21 To address the need for subcounty data compatible with CHR&R measures that cover a broad range of health factors and outcomes, 3 pilot projects were conducted in 2015 by Washington University in St Louis in partnership with the Missouri Hospital Association, New York State Department of Health, and the California Department of Public Health. The overall aims of the pilot projects were to (1) provide local data from multiple sources for a broad range of measures to support community health needs assessments and development of community health improvement plans and (2) develop analytical capability for subcounty data analyses and presentation to support public health activities. 
Detailed technical methods, analytic techniques, sample data, and programs (SAS and R) from the 3 projects have since been shared as a white paper for data analysts to utilize in future projects.22 This article aims to summarize key considerations and lessons learned from the pilot projects to adopt and adapt the CHR&R model and measures to generate data products for subpopulations and small areas below the county level.

Methods

The pilot teams' processes shared many common steps, despite having varied data sources, measures, and outputs (Figure). Herein is a summary of the key stages and important considerations. The pilot project in New York was approved by the New York State Department of Health Institutional Review Board (IRB reference #15-041). For the pilot project in Missouri, employment of the aggregate data utilized as model inputs was governed by Hospital Industry Data Institute master data use agreements and academic personnel participation was reviewed by the Washington University School of Medicine Human Research Protection Office.

FIGURE

General Model of Subcounty Health Data Development. Used with permission.

Abbreviations: CHR&R, County Health Rankings & Roadmaps; TA, technical assistance.

Conceptual development for data sources and measures

Identifying target audiences

The projects identified target audiences that included local public health departments, partner government agencies, hospital associations, hospital community benefits organizations, health planning organizations, community-based organizations, population health and strategic planning personnel, and academic researchers. The New York project supported the New York State Prevention Agenda, which requires local health departments, hospitals, and other community partners to collaborate on community health assessments and community health improvement plans every 3 years. In California, the measures were produced by the Healthy Communities Data and Indicators Project (HCI) of the Office of Health Equity. The HCI regularly produces subcounty data on the social determinants of health and disparities to inform stakeholders such as the California Health in All Policies Task Force and “Let's Get Healthy California” state health improvement plan. The Missouri ZIP Health Rankings Project23 aimed to assist hospitals, public health departments, community-based organizations, and funders with appropriately targeting scarce community health improvement resources.24

Selecting measures

Criteria for selecting health measures may include the geographical units of data that are available, the volume of the measured events, the extent of data time aggregation necessary to obtain stable estimates, local priorities, the sustainability and longevity of the data source, and the cost. The 3 projects were able to create 22 out of 35 ranked measures from the CHR&R model based on these criteria. The sources included birth, death, and hospitalization administrative data, health survey data, US Census data, and modeled data (see Supplemental Digital Content Table A, available at http://links.lww.com/JPHMP/A652). When possible, the data were further disaggregated by county subpopulations such as race/ethnicity, sex, poverty level, and disability status to reveal disparities. Some ranked measures, such as alcohol-impaired driving deaths, were not selected because of event rarity. Other measures were selected as proxies because data meeting the specifications of the CHR&R measures were unavailable at the subcounty level. For example, the New York project replaced teen births (the original measure) with teen pregnancies and replaced hospitalizations for ambulatory care–sensitive conditions with preventable hospitalizations. The California project adjusted the age groups for unemployment and educational attainment measures according to those available in existing data sources. The Missouri team selected multiple measures that correlated (as proxies) with existing CHR&R measures using a 2-step process that included an examination of face and criterion validity.24

Selecting geographic units

Selection of geographic units of analysis depended on data availability, the time range of data aggregation necessary to obtain stable estimates, and the target audience's familiarity with the units and ability to use them. Several solutions were implemented to improve the stability of estimates with the desired geographic granularity, including the use of state-specific data sources that capture larger samples or populations than those available from national data sources. For example, in contrast to calculating preventable hospitalizations among Medicare patients (aged 65 years and older), New York used a statewide all-payer database to generate a similar measure for all adult patients. For certain measures, pilot projects used more years of data for subcounty measures than CHR&R used for county-level measures. Even within a single state, not all measures were available in the same geographical units. Therefore, New York and California used a variety of small-area geographic units, including zip codes (ZCTAs), minor civil divisions, and census tracts (see Supplemental Digital Content Table A, available at http://links.lww.com/JPHMP/A652). New York aggregated zip codes into minor civil divisions for survey data. In contrast, Missouri developed a methodology to produce all measures at the zip code level.
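The geographic aggregation step can be illustrated with a minimal Python sketch. It is not the New York team's program: the crosswalk, the zip codes, the town names, and the function name `aggregate_to_mcd` are all hypothetical, and survey weighting is omitted for brevity. The sketch only shows the idea of pooling zip-level numerators and denominators into larger units before computing an estimate.

```python
from collections import defaultdict

# Hypothetical crosswalk assigning each zip code to one minor civil division (MCD).
ZIP_TO_MCD = {"12001": "Town A", "12002": "Town A", "12003": "Town B"}


def aggregate_to_mcd(zip_counts, crosswalk):
    """Sum zip-level (events, population) pairs into MCD-level totals."""
    totals = defaultdict(lambda: [0, 0])
    for zip_code, (events, population) in zip_counts.items():
        mcd = crosswalk.get(zip_code)
        if mcd is None:
            # Zip code missing from the crosswalk; a real workflow would log it.
            continue
        totals[mcd][0] += events
        totals[mcd][1] += population
    return {mcd: (events, population) for mcd, (events, population) in totals.items()}


# Illustrative (made-up) zip-level counts: (events, population).
zip_counts = {"12001": (4, 1200), "12002": (9, 2800), "12003": (15, 5000)}
mcd_totals = aggregate_to_mcd(zip_counts, ZIP_TO_MCD)
```

Pooling before estimation raises the event count behind each estimate, which is the mechanism by which aggregation improves stability; the trade-off is coarser geography.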

Analyzing and presenting small-area/subpopulation health measures

Generating subcounty statistics

The 3 projects followed different methods to analyze data and generate estimates for each measure using statistical software (eg, SAS, R) (Table). The New York team aggregated individual records from administrative data (eg, births, deaths, emergency department visits, and hospitalizations) to generate counts, rates, and percentages for selected measures, by county population characteristics (eg, race/ethnicity, age group, Medicaid status, and education levels) and by geographic areas. For survey data, zip codes were aggregated to generate estimates for minor civil divisions. The California team used an application programming interface (API) to automate downloads of American Community Survey tables that were subsequently used to aggregate demographic groups within selected geographies. For some measures, model-based estimates were produced for multiple small areas (including ZCTAs, cities, congressional districts, and assembly districts) via a subcontract award with the California Health Interview Survey Neighborhood Edition. The Missouri team conducted principal components analyses to derive composite zip-level scores corresponding to CHR&R subdomain scores.24
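As an illustration of the record-aggregation approach, the following Python sketch counts individual records within each zip code and subgroup and converts the counts to crude rates per 1,000 population. The projects themselves used SAS and R; the field names (`zip`, `group`), the function name `subcounty_rates`, and all values here are hypothetical, and age adjustment is not shown.

```python
from collections import Counter


def subcounty_rates(records, populations, keys=("zip", "group")):
    """Count records per (zip, group) cell and compute crude rates per 1,000."""
    counts = Counter(tuple(rec[k] for k in keys) for rec in records)
    results = {}
    for cell, population in populations.items():
        n = counts.get(cell, 0)
        # A rate is undefined when the denominator is zero.
        rate = 1000 * n / population if population > 0 else None
        results[cell] = {"count": n, "population": population, "rate_per_1000": rate}
    return results


# Made-up individual records (eg, hospital discharges) and population denominators.
records = [
    {"zip": "12001", "group": "Medicaid"},
    {"zip": "12001", "group": "Medicaid"},
    {"zip": "12001", "group": "Medicaid"},
    {"zip": "12002", "group": "Non-Medicaid"},
]
populations = {
    ("12001", "Medicaid"): 1500,
    ("12002", "Non-Medicaid"): 4000,
    ("12002", "Medicaid"): 800,
}
rates = subcounty_rates(records, populations)
```

Driving the output from the population table rather than from the records means that cells with zero events still appear in the results instead of silently dropping out.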

Applying data suppression

Increased granularity of reporting increases the risk that individuals with certain health outcomes could be identified. To protect individuals' confidential information and to address problems arising from skewed or potentially miscoded underlying data, estimates were sometimes suppressed (ie, not reported). Three types of data suppression were applied: primary, secondary, and tertiary. Primary suppression rules, usually based on minimum volume thresholds for estimate numerators and/or denominators, vary depending on the data source; the criteria used in each pilot project are described in Supplemental Digital Content Table A (available at http://links.lww.com/JPHMP/A652). Secondary suppression was applied when primary suppression affected only one subpopulation or one geographic area in a county; suppressing an additional cell (ie, secondary suppression) prevents the identification of data for the primary-suppressed cell, which could otherwise be calculated by subtracting the sum of the unsuppressed cells from the county's total count. Tertiary suppression was applied to remove outlier estimates that resulted from coding errors (eg, in patients' demographic information) or from skewed age group distributions among cases that caused age adjustment to produce extreme values. In addition, the Missouri team applied Winsorization26 criteria (top coding) to estimates to minimize the risk of identifying individuals.
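The interaction between primary and secondary suppression can be sketched in Python as follows. This is an illustration under stated assumptions, not any team's actual rule set: the threshold of 10 is a placeholder (the projects' real thresholds are in the supplemental table), the choice to suppress the smallest remaining cell is one of several possible conventions, and the function name `suppress_cells` is hypothetical.

```python
def suppress_cells(cell_counts, min_count=10):
    """Apply primary, then secondary, suppression to the cells of one county.

    Returns a dict mapping each cell to its count, or to None if suppressed.
    """
    # Primary suppression: hide any cell below the (assumed) minimum count.
    out = {cell: (n if n >= min_count else None) for cell, n in cell_counts.items()}

    suppressed = [cell for cell, value in out.items() if value is None]
    if len(suppressed) == 1:
        # Secondary suppression: with exactly one hidden cell, its value could be
        # recovered as county total minus the visible cells, so hide one more.
        remaining = [cell for cell, value in out.items() if value is not None]
        if remaining:
            smallest = min(remaining, key=lambda cell: out[cell])
            out[smallest] = None
    return out


# Made-up counts for three subpopulations within one county.
example = suppress_cells({"A": 4, "B": 25, "C": 60})
```

When two or more cells are already hidden by primary suppression, no single hidden value can be recovered by subtraction, so the sketch leaves the remaining cells visible.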

Assessing data stability

Disaggregating data results in equal or smaller (but never larger) counts and in proportions with equal or smaller (but never larger) denominators. Subcounty estimates are therefore less stable than county-level estimates. General guidelines state that, for count measures, estimates with a relative standard error (RSE) greater than 30% should be considered unreliable/unstable.27 This usually occurs when there are fewer than 10 events in the numerator.28 For measures using survey data, guidelines state that an estimate can be considered unreliable/unstable when the width of the 95% confidence interval is greater than 20% or the RSE is greater than 30%.29 The research teams found that most end users preferred data products to include as much data as possible, even when estimates were unstable. Academic and technical users may already understand the limitations of making inferences based on unstable estimates and may prefer to receive all subcounty estimates. Nontechnical users may not have prior understanding of the limitations and therefore may prefer to receive only stable subcounty estimates that can be used in practice. Accordingly, the New York team flagged unstable estimates with asterisks and explained estimate stability in the Methods section of its reports. The California research team provided confidence intervals and RSEs in its data products, along with materials on how to interpret data stability.
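The link between the 30% RSE threshold and small numerators can be made concrete. Under the common assumption that event counts follow a Poisson distribution, the standard error of a count n is the square root of n, so the RSE is 100 divided by the square root of n. A count of 10 then gives an RSE of about 31.6%, and counts of 12 or more fall below 30%. The Python sketch below applies this formula; the Poisson assumption, the function names, and the decision to treat zero counts as unstable are illustrative choices, not rules taken from the cited guidelines.

```python
import math


def rse_percent(events):
    """Relative standard error (%) of a count, assuming Poisson variation."""
    if events <= 0:
        return None  # RSE is undefined for a zero count
    return 100.0 / math.sqrt(events)


def is_unstable(events, threshold=30.0):
    """Flag a count-based estimate whose RSE exceeds the threshold."""
    rse = rse_percent(events)
    # Assumption: an undefined RSE (zero events) is treated as unstable.
    return rse is None or rse > threshold


# Made-up numerators for four small areas.
flags = {n: is_unstable(n) for n in (0, 9, 12, 25)}
```

A flag of this kind supports either presentation strategy described above: unstable estimates can be marked with an asterisk, or the RSE can be reported alongside the estimate.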

Validating results

It is important to assess the distribution of estimate values for each measure to identify outliers and determine whether they are acceptable. Outliers can result not only from truly extreme health disparities in specific communities but also from errors in raw data coding (eg, race/ethnicity miscoded on individual records) or overadjustments by statistical procedures (eg, age-adjusting estimates for small geographies with skewed underlying age distributions). Each team examined outlying estimates carefully to identify and exclude erroneous data (eg, coding errors, differentially adjusted estimates) while avoiding exclusion of “real” data with truly outlying values that reflect true health burdens in the population. The teams reviewed univariate distributions of subcounty estimates for count data. Estimates with values exceeding the respective measure's statewide 90th percentile were compared with county-level estimates. Extreme values were investigated individually and suppressed if deemed invalid (tertiary suppression). For age-adjusted estimates, age distributions of the underlying populations were checked as well. For the composite scores, zip codes were suppressed if 2 or more of their scores were larger than 3 standard deviations and could not be sufficiently explained.
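The two screening rules described above can be sketched in Python. The sketch only selects estimates for manual review; it does not decide validity, which the teams judged case by case. The function names, the zip codes, the use of absolute standardized scores, and the inclusive percentile method are assumptions made for illustration.

```python
import statistics


def flag_above_p90(estimates):
    """Return area IDs whose estimate exceeds the 90th percentile of all areas."""
    # quantiles(..., n=10) returns 9 cut points; the last one is the 90th percentile.
    p90 = statistics.quantiles(estimates.values(), n=10, method="inclusive")[-1]
    return sorted(area for area, value in estimates.items() if value > p90)


def flag_extreme_composites(z_scores, cutoff=3.0, min_extreme=2):
    """Return zip codes with at least `min_extreme` standardized scores beyond the cutoff."""
    return sorted(
        zip_code
        for zip_code, scores in z_scores.items()
        if sum(abs(score) > cutoff for score in scores) >= min_extreme
    )


# Made-up estimates for 11 areas and made-up standardized composite scores.
estimates = {f"z{i}": float(i) for i in range(11)}
for_review = flag_above_p90(estimates)
extreme = flag_extreme_composites(
    {"63101": [3.4, -3.2, 0.1], "63102": [3.5, 0.2, 0.0]}
)
```

Flagged areas would then be compared with county-level estimates and, for age-adjusted measures, with the underlying age distribution before any tertiary suppression is applied.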

Designing data outputs and visualizations

The research teams based their data product designs and media on end users' needs and input. During the development processes, each team conducted active discussion (eg, focus groups) with key stakeholders to collect and review their feedback and suggestions regarding data product designs. It was found that users preferred to receive data products with data visualizations (eg, graphs, maps, trends, data tables) in formats that are easily accessible (eg, PDF, online query). Users also preferred to receive data in formats that they could further manipulate or use to generate their own visualizations (eg, Excel). In addition, teams identified needs to provide users with simple technical guidance and explanations of methods and limitations to support accurate interpretation of data products.

Standardizing and automating report production

It is important to consider output and visualization designs before developing a production process because factors such as data set structure and the formation of technical programs in SAS or R may depend on formats and features of the outputs and visualizations. The 3 research teams incorporated standardization and automation with these considerations in mind. The California and New York teams used SAS and R to automate data processing, including data import, data aggregation, calculation of rates, calculation of reliability measures, data visualization (eg, graphs, maps, tables), final formatting, and export. When available, APIs were used to download data from secondary sources. All measures followed a standardized data output format. The Missouri team worked with partners to develop an online data platform that includes functions such as mapping, report building, and downloadable public use files.30

Disseminating reports and other data products

To support and enhance utilization of final data products, researchers can develop dissemination strategies to reach key audiences, such as Web site publication, direct distribution, in-person presentations, or a combination of methods. Project teams worked with stakeholders and leveraged their networks to share data products with key audiences. In California, data were published on the California Department of Public Health Web site and announced to a large number of stakeholders. In Missouri, the launch of the exploreMOhealth30 data platform was announced via press releases, region-specific fact sheets, presentations at a statewide meeting, and a webinar. In New York, the reports were e-mailed directly to local health departments and then publicly released on the New York State Association of County Health Officials Web site.31 The team also held a webinar for local health departments, hospitals, regional health planning groups, and other partners to introduce the reports and help with interpreting them.

Planning for sustainability

All 3 projects aimed to provide data products to support end-user needs in applied settings after the initial product release. Sustainability requires teams to strategically use available data and resources, have well-documented operational processes, and secure funding. One consideration with implications for sustainability is the ongoing availability and cost of data needed to derive subcounty measures. Most data for the pilot projects were freely available and frequently refreshed to support periodic product updates. However, population counts for small areas sometimes need to be purchased from commercial sources. The Missouri research team transitioned from using commercial data sources to a publicly available source to maintain production without incurring untenable costs. Sustained operations also involve additional costs, including providing user outreach and training, implementing processes to collect user feedback, administering support, and conducting ongoing research and development to improve measures and data products. Developing and documenting standardized processes and workflows to accurately produce small-area estimates were necessary for maintaining ongoing production and assisted in the production of similar projects. For example, the New York team documented technical methods and SAS programs as protocols so that new staff could easily modify and apply them to another subcounty data project.32 Securing budgetary support for sustained operations was a de facto consideration for all 3 research teams. For the Missouri team, pilot project funding did not support ongoing delivery of data and reports. Therefore, the Missouri team identified and negotiated shared funding through 2 foundations to support both ongoing operations and the development of a shared Web-based reporting platform.

Results

One lesson learned from these pilot projects was that greater geographic and demographic granularity comes at the cost of less stable estimates and more suppression. Especially when working with subcounty data, maintaining privacy is essential. Because regulations may vary depending on the data source and where data are to be displayed, it is important for researchers to follow data suppression policies and assess identifiability risks. It is still possible to release granular estimates by applying necessary suppression criteria and including accompanying information about estimate stability and other data limitations.
In addition, proxy measures are often needed because of the lack of availability of measures at the subcounty level. The CHR&R model is a well-known framework for assessing community conditions and health outcomes at the county level. When attempting to replicate this model at the subcounty level, the pilot projects often needed to identify proxy measures to help fill information gaps when subcounty-level data did not match the county-level measure. The pilot projects identified proxies by using the same measure but with a different universe (eg, population 18 years and older in place of the Medi-Cal enrolled population), using a similar measure (eg, teen pregnancy in place of teen births), and using a new measure that correlates with the original measure (eg, hospital utilization rates for mental health in place of the average number of mentally unhealthy days reported in the past 30 days).
Obtaining health outcomes data at the subcounty level is perhaps the biggest technical challenge in subcounty data analyses. In general, there are 3 options for subcounty health-related data, each with differing limitations:
Survey oversample: A limitation of this approach is that it is expensive and self-reported.
Modeled estimates (small-area estimation): A limitation of this approach is that it generalizes population characteristics and will not capture variation due to public programs (eg, local tobacco control programs), which could make model estimates substantially different from direct estimates.33,34
Administrative data (eg, health claims data): A limitation of this approach is that the data are not population based but rather reflect the population receiving health care.
Another technical challenge is validating outliers. Widely accepted criteria for distinguishing real outliers (where health burdens are truly extreme) from erroneous outliers (eg, caused by data entry errors, or skewed adjustment) do not exist. Future research on validation algorithms or best practices could greatly contribute to subcounty data analyses.
These technical challenges also impact user engagement. The general public can be surprised and disappointed if data cannot be provided for their cities or neighborhoods for multiple health outcomes. This issue can be magnified by different users' needs for more granular geographical levels and subpopulation data (ie, race and ethnicity). Geographic areas of interest to users can include, for example, neighborhoods, voting districts, hospital service areas, and school districts. Different users will also require different levels of estimate stability. One known method to increase the stability of small-area estimates is aggregating data from adjacent geographies.35 However, in this case, the analyst's aggregation choices may or may not align geographically with how communities define themselves. In some cases, a less stable estimate with clearly stated limitations will suffice while other users may require increased estimate stability.
Especially as multisector collaborations to improve health increase, finding ways to locate and combine relevant data across multiple sectors and from multiple source types, and to display these data in geographic areas that are meaningful to different stakeholders, is becoming increasingly important. When the unit of analysis is a subcounty area, this work is more challenging.

Implications for Policy & Practice

Future subcounty data projects should have clearly defined frameworks/models, goals, and target audiences. Researchers should consider technical considerations early on and throughout subcounty data projects. Data product designs should be discussed while analyzing data and generating estimates so that, at later steps, results can be organized and structured to streamline the production of data products. Decisions on how or whether to present unstable data should consider the needs of end users, users' ability to interpret unstable estimates, and institutional policy. Automating analyses and production processes can improve the consistency of reported data and the quality of data visualizations. Clearly documenting methods and processes further supports project sustainability, facilitates staff transitions, and enables project methods to be adapted for subsequent work. Technical assistance on how to use the data products and interpret the results should be provided when releasing data products. At all stages, input from key stakeholders can help inform the project considerations. To increase sustainability, investigators interested in pursuing subcounty health data projects should design the project to align with organizational interests. Funders should provide support for multiyear projects to ensure greater impact and sustainability. Locating and analyzing subcounty data across sectors is resource intensive. While there is great interest in the use and availability of these types of data, funding and sustaining services can be challenging. Producing subcounty data on a regular basis requires full-time dedicated staff. However, sustainable funding to produce small-area estimates for health outcomes is often lacking, whether with subcontractors or through the development of in-house capabilities. 
For example, having dedicated staff to work on indicator data projects might be difficult to secure in local or state health departments, as budget earmarks and competing priorities can make it difficult to dedicate staff exclusively to a single project.

Conclusions

Generating, communicating, and sustaining subcounty data are critical steps for advancing health and equity. The generally limited availability of resources for implementing evidence-based public health interventions highlights the need for small-area data so that interventions can be more effectively targeted. This article outlines opportunities and challenges practitioners may face in this work, shares lessons learned, and offers 3 pilot projects that successfully developed and disseminated small-area data as useful models for future projects.

TABLE

Summary of Project Data Sources, Subcounty-Level Estimates and Methods

	State
	California	Missouri	New York
Data sources	Demographic statistics database; health behaviors survey; housing data; aggregate crime reports data	Hospital inpatient, outpatient, and emergency department discharge data; socioeconomic deprivation index database25; census-based data set	Health behaviors survey; vital statistics (birth and death); health care discharge data
Subcounty-level data	Demographic: disability status; poverty level; race/ethnicity; sex. Geographic: assembly districts; census tract; city/town; congressional districts; ZCTA	Demographic: the clinical, SES, and census-based data used were available at both zip code and county levels (see ref 20 Supplemental Digital Content file, available at: http://links.lww.com/JPHMP/A652). Geographic: zip code	Demographic: age group; education; Medicaid status; race/ethnicity. Geographic: minor civil division (where data available); zip code
Methods	Aggregating data from existing sources such as the American Community Survey; eg, aggregating demographic groups (sex or age) within smaller geographies (eg, census tract). Generating model-based estimates for subcounty geographies	Ranking zip code health factor and health outcome composite scores generated from estimates for multiple measures	Aggregating individual record data into rates or percentages for individual measures. For survey data (Behavioral Risk Factor Surveillance System survey), zip codes were aggregated into minor civil divisions

Abbreviations: SES, socioeconomic status; ZCTA, zip code tabulation area.

14 in total

1. Healthy People 2010 criteria for data suppression.

Authors: Richard J Klein; Suzanne E Proctor; Manon A Boudreault; Kathleen M Turczyn
Journal: Healthy People 2010 Stat Notes Date: 2002-07

2. Development of an interactive environmental public health tracking system for data analysis, visualization, and reporting.

Authors: Thomas O Talbot; Sanjaya Kumar; Gwen D Babcock; Valerie B Haley; Steven P Forand; Syni-An Hwang
Journal: J Public Health Manag Pract Date: 2008 Nov-Dec

3. Geographic variation in colorectal cancer survival and the role of small-area socioeconomic deprivation: a multilevel survival analysis of the NIH-AARP Diet and Health Study Cohort.

Authors: Min Lian; Mario Schootman; Chyke A Doubeni; Yikyung Park; Jacqueline M Major; Rosalie A Torres Stone; Adeyinka O Laiyemo; Albert R Hollenbeck; Barry I Graubard; Arthur Schatzkin
Journal: Am J Epidemiol Date: 2011-08-11 Impact factor: 4.897

4. Improving Efficiency in Mobile Data Collection for Place-Based Public Health Research.

Authors: Daniel P Giovenco; Torra E Spillane
Journal: Am J Public Health Date: 2019-02 Impact factor: 9.308

5. National Center for Health Statistics Data Presentation Standards for Proportions.

Authors: Jennifer D Parker; Makram Talih; Donald J Malec; Vladislav Beresovsky; Margaret Carroll; Joe F Gonzalez; Brady E Hamilton; Deborah D Ingram; Kenneth Kochanek; Frances McCarty; Chris Moriarity; Iris Shimizu; Alexander Strashny; Brian W Ward
Journal: Vital Health Stat 2 Date: 2017-08

6. Small-area study of the incidence of neoplasms of the brain and central nervous system among adults in the West Midlands region, 1974-86. Small Area Health Statistics Unit.

Authors: N Eaton; G Shaddick; H Dolk; P Elliott
Journal: Br J Cancer Date: 1997 Impact factor: 7.640

7. A novel framework for validating and applying standardized small area measurement strategies.

Authors: Tanja Srebotnjak; Ali H Mokdad; Christopher Jl Murray
Journal: Popul Health Metr Date: 2010-09-29

8. Measuring Subcounty Differences in Population Health Using Hospital and Census-Derived Data Sets: The Missouri ZIP Health Rankings Project.

Authors: Elna Nagasako; Brian Waterman; Mathew Reidhead; Min Lian; Sarah Gehlert
Journal: J Public Health Manag Pract Date: 2018 Jul/Aug

9. Electronic Health Record (EHR)-Based Community Health Measures: An Exploratory Assessment of Perceived Usefulness by Local Health Departments.

Authors: Karen F Comer; P Joseph Gibson; Jian Zou; Marc Rosenman; Brian E Dixon
Journal: BMC Public Health Date: 2018-05-22 Impact factor: 3.295

10. Making Better Use of Population Health Data for Community Health Needs Assessments.

Authors: Michael A Stoto; Mary V Davis; Abby Atkins
Journal: EGEMS (Wash DC) Date: 2019-08-20