The quality of social determinants data in the electronic health record: a systematic review.

Lily A Cook¹, Jonathan Sachs¹, Nicole G Weiskopf¹.

Abstract

OBJECTIVE: The aim of this study was to collect and synthesize evidence regarding data quality problems encountered when working with variables related to social determinants of health (SDoH).
MATERIALS AND METHODS: We conducted a systematic review of the literature on social determinants research and data quality and then iteratively identified themes in the literature using a content analysis process.
RESULTS: The most commonly represented quality issue associated with SDoH data is plausibility (n = 31, 41%). Factors related to race and ethnicity have the largest body of literature (n = 40, 53%). The first theme, noted in 62% (n = 47) of articles, is that bias or validity issues often result from data quality problems. The most frequently identified validity issue is misclassification bias (n = 23, 30%). The second theme is that many of the articles suggest methods for mitigating the issues resulting from poor social determinants data quality. We grouped these into 5 suggestions: avoid complete case analysis, impute data, rely on multiple sources, use validated software tools, and select addresses thoughtfully.
DISCUSSION: The type of data quality problem varies depending on the variable, and each problem is associated with particular forms of analytical error. Problems encountered with the quality of SDoH data are rarely distributed randomly. Data from Hispanic patients are more prone to issues with plausibility and misclassification than data from other racial/ethnic groups.
CONCLUSION: Consideration of data quality and evidence-based quality improvement methods may help prevent bias and improve the validity of research conducted with SDoH data.

Keywords: Hispanic Americans; bias; data quality; health equity; social determinants of health

Year: 2021; PMID: 34664641; PMCID: PMC8714289; DOI: 10.1093/jamia/ocab199

Source DB: PubMed; Journal: J Am Med Inform Assoc; ISSN: 1067-5027; Impact factor: 4.497

INTRODUCTION

Interest in social determinants of health (SDoH) among clinicians, researchers, and policy-makers has increased in recent years, driven both by a recognition of their role as major contributors to health outcomes and by interest in improving health equity. There are substantial and justifiable concerns, however, regarding the quality of SDoH in clinical data. It is a long-established tenet of information science that poor-quality data lead to poor-quality results. Without attention to the quality of SDoH data, researchers cannot guarantee that results provide valid or useful insights. Our objective was to conduct a review of the literature on SDoH data quality to characterize the issues that impact the use of these data for research and policy. Specifically, our goal was to collect and synthesize available evidence regarding the kinds of quality issues typically encountered with SDoH data, the biases these issues may create during analysis, and any identified methodological solutions to these issues that can be used by researchers to improve social determinants data quality prior to analysis. We were unable to find any prior work that both gathers information about how researchers can improve the quality of clinical SDoH data and also summarizes how specific issues of bias and validity are introduced into research utilizing social determinants variables. To the best of our knowledge, this review is the first to bring together information about a variety of social determinants variables to examine the issues inherent to the field of social informatics.

BACKGROUND

Although data quality (also known as data integrity, data accuracy, or data validation) was initially conceptualized for broad application across information systems regardless of context, there is a growing body of literature devoted to the topic of electronic health record (EHR) data quality. As the field of data science has shifted from collecting data toward processing the massive amounts of data already collected, informatics researchers are addressing the problem of secondary use: how to make clinical documentation usable as a research source. Published in 2016, the Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data provides 3 major concepts that describe the quality of EHR data used for research:

Conformance speaks to whether the dataset’s reported values meet structural standards and formats. Conformance is further broken into 3 subcategories: value conformance, relational conformance, and computational conformance.
Completeness looks at whether or not the data are actually present.
Plausibility asks if the data values are believable and accurate. It can be broken into 3 additional subcategories: uniqueness plausibility, atemporal plausibility, and temporal plausibility.

SDoH are, by their very nature, nonmedical. As such, few SDoH data elements are routinely stored in the medical record. Some, such as race and health insurance status, are commonly collected during patient onboarding and are usually available as demographic or administrative data. However, these elements represent only a small portion of all the nonclinical factors known to influence health, and although there has been a widespread effort to collect more SDoH from patients, these data have remained sparse.
Although an exhaustive list of all methods used by researchers to access a wider array of social determinants factors is beyond the scope of this work, researchers have generally relied on the following:

Diagnostic codes (eg, ICD Z-codes) indicating SDoH such as “Problems relating to housing and economic circumstances” have the benefit of being standardized across systems, but in practice are rarely used by clinicians.
Geocoding patient addresses allows researchers to integrate biomedical data from the EHR with community-level data sources such as the US Census. In 2014, the Institute of Medicine suggested using “neighborhood and community composition” as a proxy for individual-level indicators that cannot be directly collected from patients.
Structured and semistructured tools for clinicians, such as flowsheets, screening tools, and questionnaires, can collect more information from patients than can be found in administrative fields. Although a variety of these tools are now available, including the Protocol for Responding to and Assessing Patients’ Assets, Risks, and Experiences (PRAPARE) and the Epic SDoH Wheel, they have been adopted by a very small number of clinics. The lack of any single, ubiquitously applied clinical tool means extra work for researchers, who would need to extract and harmonize any information gathered from these applications.

Each of these methods has benefits and drawbacks that must be considered when selecting a research dataset. However, little information is available to assist researchers making these choices, and there are no agreed-upon best practices for working with SDoH data.

MATERIALS AND METHODS

Search strategy and screening

We conducted an iterative, deductive, and systematic literature review. Specifics of the workflow are detailed in Figure 1. First, PubMed and Ovid MEDLINE databases were searched for articles focused on social determinants research and data quality, accuracy, validity, or the introduction of bias. Although no date range was specified for articles, the initial results were limited by the date range of the databases themselves. The search terms were constructed to align with the definition of SDoH created by the World Health Organization (WHO) and utilized by major organizations such as the Gravity Project. Nonmodifiable social and economic factors such as race/ethnicity, socioeconomic status, education level, environmental health (proximity to healthy food, walkability, exposure to environmental toxins, etc.), and health insurance status were all considered SDoH for the purpose of this review. However, in keeping with the WHO’s definition, modifiable behaviors such as smoking and exercise were not. Because data linkage is commonly used to enrich clinical social determinants datasets with community-level information, articles about the quality of geocoded patient address data were included if they discussed research focused on linking clinical data to external datasets for clinical research purposes. The Medical Subject Headings (MeSH) database was used to identify appropriate keywords. The initial search was conducted in PubMed, and an adjacency search was performed in Ovid MEDLINE to find articles not indexed in PubMed. Details of the search strategy, including keywords, can be found in Supplementary Appendix SA.

Figure 1.

PRISMA flow diagram.

Eligibility

The results of the 2 database searches were compared in order to identify and remove duplicate articles, and then the remaining articles were screened based on title and abstract. After screening, the first author then manually reviewed the remaining articles to determine whether they met the eligibility criteria described below and summarized in Table 1.

Table 1.

Eligibility criteria for articles

	√ Included:	X Excluded:
Topic/Focus	Original, peer-reviewed research focused on the quality of social determinants of health data.	Reviews; opinion pieces; research that has not been peer-reviewed.
Social Determinants of Health Factors	Race/ethnicity, language preference, health insurance status, country of origin, occupation, socioeconomic status, education level, environmental health (proximity to healthy food, walkability, exposure to environmental toxins, etc.), geocoded patient address data (only included if the article primarily focused on linking clinical data to external datasets for research on social determinants)	Behaviors (eg, smoking and exercise)
Sources of Health Data	Clinical sources within the United States and Canada: EHR, medical registries, administrative databases compiled from EHR data, observational studies using clinical data pulled directly from the medical records of participants	Nonclinical sources: population-level data, mHealth sources, genomic datasets, vital records (ie, birth or death certificate data); Clinical sources outside the United States or Canada
Language	Articles written in English.	Articles in languages other than English.

Inclusion criteria

Articles were eligible for the review if they were in English, used data from healthcare systems in the United States and Canada, and were original research. The research described in the articles had to use patient health data originating from clinical sources (ie, data sourced from electronic health records, disease registries, etc.). Studies examining information from registries were included because this information is often abstracted directly from medical records. Also included were several large databases compiled from EHR data, such as the Biomedical Translational Research Information System repository and the Healthcare Cost and Utilization Project’s State Inpatient Database. Research using datasets from large cohort studies such as the National Birth Defects Prevention Study was included if the study’s dataset was drawn directly from the participants’ medical records.

Exclusion criteria

Articles were ineligible for this review if they described research that used patient health data from nonclinical sources (eg, population-level data, patient-generated data such as mHealth sources, or data from genomic datasets not originating from clinical sources). Vital records such as birth and death certificate data were not included because they often differ significantly in structure and content from other medical records. To identify additional works missed by the initial query, a snowball technique was applied to the citations in the eligible articles.

Data extraction and thematic analysis

Using a deductive approach, content analysis was performed on eligible articles. Each manuscript was categorized by (1) the specific social determinant examined and (2) the primary data quality issue. This information was then abstracted and tabulated. Data abstraction was performed alongside a closer reading of the selected articles, which informed the thematic analysis. An iterative, inductive approach was taken to identify themes in the literature, with a specific focus on themes that could be actionable to researchers using health records for social determinants research. Once themes were identified, the articles were reviewed a final time to abstract and tabulate the prevalence of issues and common approaches to solutions. A complete list of the categories selected for each article can be found in Supplementary Appendix SB.

RESULTS

A total of 76 articles were included in this review. Throughout the literature, the most commonly represented quality issue associated with social determinants data was plausibility, that is, accuracy (n = 31, 41%). Thirty-eight percent (n = 29) of manuscripts focused primarily on the completeness of social determinants data in the medical record—whether or not data were missing. The remaining 21% looked largely at conformance—whether data were compatible (n = 16). A tabulated breakdown by data quality issue and social determinants category is available in Table 2.

Table 2.

Characteristics of studies included in this review

	Primary social determinant of health, n (%)
Primary data quality issue	Race, ethnicity, country of origin	Insurance status	Occupation	General community-level	Environmental	Nonspecific
Completeness (missing data), n = 29	15 (37.5%)	1 (100%)	6 (86%)	2 (12.5%)	1 (20%)	6 (86%)
Conformance (incompatible data), n = 16	0	0	1 (14%)	10 (62.5%)	3 (60%)	0
Plausibility (inaccurate data), n = 31	25 (62.5%)	0	0	4 (25%)	1 (20%)	1 (14%)
Total, n = 76	40	1	7	16	5	7
Typical article title	“Accuracy of Race, Ethnicity, and Language Preference in an Electronic Health Record”	“Primary Payer at DX: Issues with Collection and Assessment of Data Quality”	“Availability and accuracy of occupation in cancer registry data among Florida firefighters”	“Match Rate and Positional Accuracy of Two Geocoding Methods for Epidemiologic Research”	“Residential mobility in early childhood and the impact on misclassification in pesticide exposures”	“Utilization of Social Determinants of Health ICD-10 Z-Codes Among Hospitalized Patients in the United States, 2016–2017”
Usual source for this information within the patient record	administrative or demographic sources			patient address is geocoded to link community-level data		diagnosis codes

Articles about race, ethnicity, country of origin, and language preference were grouped into a single category and had the largest body of data quality literature (40 of 76 articles; 53%). Sixteen articles (21%) addressed the quality of geocoded patient address data, which is frequently used in clinical social determinants research to link individual, patient-level data to community-level datasets to incorporate variables not available in the medical record. Also represented in the literature were occupation (9%), environmental factors (7%), and insurance status (1%). Seven articles (9%) addressed social determinants data generally without focusing on any specific variable. Three of these nonspecific articles discussed the use of International Classification of Diseases (ICD) codes (ie, Z-codes) as a source of social determinants information in the medical record.

Bias

The first theme identified in the thematic analysis was that bias or validity problems were likely to result from data quality concerns. A majority of articles (47 of 76; 62%) either found bias when running test analyses on their datasets or they noted that data quality was differentially poor for certain groups and thus there was a high potential for bias. Twenty-four articles (32%) did not evaluate their datasets for bias and 5 articles (7%) tested for bias but were unable to find any. Sixteen articles (21%) observed that data are differentially incomplete (also referred to as Missing Not at Random); 23 (30%) noted misclassification bias. Results about bias associated with specific social determinants are presented below and summarized in Table 3.

Table 3.

Findings about bias and differential data quality

Bias Finding?	Social determinant	Bias type	Articles reporting that finding, n (%)
Yes	Race/ethnicity	Misclassification	19 (25.0)
		Missing Not at Random (MNAR)	9 (11.8)
		Differentially implausible	2 (2.6)
		Other	1 (1.3)
	Insurance	Missing Not at Random (MNAR)	1 (1.3)
	Occupation	Missing Not at Random (MNAR)	4 (5.3)
	General Community Level	Rural data are problematic	3 (3.9)
	General Community Level	Other	3 (3.9)
	Environmental	Misclassification	3 (3.9)
	Nonspecific	Missing Not at Random (MNAR)	2 (2.6)
Unknown	Did not evaluate for bias		24 (31.6)
No	Evaluated for bias and found none		5 (6.6)

Race/ethnicity/country-of-origin variables

Plausibility

Articles that discussed race, ethnicity, or country-of-origin data were most concerned with plausibility (ie, accuracy), which was discussed in 26 of the 40 articles (65%). Eighty-five percent of these articles (22 of 26) noted the potential for implausible data to cause error or bias in research, most commonly through misclassification. Three studies (12%) looked for bias but did not find any. Misclassification bias, that is, incorrect assignment, was noted as a problem or a potential problem in 18 of the 26 (69%) articles about the plausibility of race/ethnicity data. Further, several studies reported that implausible data and misclassification errors were more likely for certain groups: Fourteen studies reported that Hispanic patients were more likely to be misclassified, either because information about their ethnicity was missing or because they had been mistakenly grouped into the “Other” category. Four of these studies also found that Asian patients were more likely to be missing information identifying their race than white patients. Six studies found disproportionately high rates of misclassification for American Indians in comparison to other racial/ethnic groups; most often, these patients were misidentified as white.

Completeness

The remaining 15 studies that looked at race/ethnicity data primarily examined its completeness (38%). Ten of these (66%) identified that the incomplete data led to validity issues, most commonly that these data were not missing at random and had the potential to introduce bias.

Geocoded patient address data used for linkage to community-level variables

Articles about geocoded patient address data, on the other hand, were largely concerned with relational conformance (ie, linkage match rates for geocoding), which was examined in 10 of the 16 articles (63%). Plausibility, that is, accuracy, was the primary concern of 4 of the articles about geocoded patient address data. Overall, 43% (n = 7) of the studies about geocoded patient address data acknowledged or established a potential for bias in their datasets. The remainder looked at data quality but did not evaluate their datasets for any validity problems that may result. Although race/ethnicity data were mostly plagued with a single type of error (ie, misclassification), geocoded address data linked with community-level data were associated with multiple forms, including cartographic confounding and Type II error (ie, failing to reject a false null hypothesis). Another study characterized the issues encountered with geocoded patient data as the distinction between individual- and community-level variables. The authors found that “the accuracy of the community-level data for identifying patients with and without social risks was 48.0%.” They noted that the use of these data for patient risk stratification “may heighten the risk of ecologic fallacy, wherein incorrect assumptions are made about an individual based on aggregate-level information about a group.” It should also be noted that the quality problems found with geocoded patient address data were not randomly distributed; for example, relational conformance (ie, match rates) tends to be poorer for rural areas and certain parts of the country.
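
As a rough illustration of how differential match rates could be checked before analysis, the following sketch computes the geocoding match rate separately for urban and rural addresses. The records, the field names (`setting`, `matched`), and the helper `match_rate_by_group` are hypothetical and are not taken from any of the reviewed studies.

```python
from collections import defaultdict


def match_rate_by_group(records, group_key="setting", matched_key="matched"):
    """Return {group: share of records whose address was successfully geocoded}."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec[matched_key]:
            matched[group] += 1
    return {group: matched[group] / totals[group] for group in totals}


# Hypothetical geocoding results: 10 urban and 5 rural addresses.
records = (
    [{"setting": "urban", "matched": True}] * 9
    + [{"setting": "urban", "matched": False}] * 1
    + [{"setting": "rural", "matched": True}] * 3
    + [{"setting": "rural", "matched": False}] * 2
)

rates = match_rate_by_group(records)
for setting, rate in sorted(rates.items()):
    print(f"{setting}: {rate:.0%} of addresses matched")
```

A large gap between groups in a check like this would suggest that unmatched addresses are not missing at random and that excluding them could skew the sample.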

Environmental health variables

Because information about exposure to toxins is rarely recorded in the medical record, geocoded patient addresses are used to link health information to data about the environment. We found 5 studies discussing the use of patient addresses for exposure assessment. As with the geocoded community-level variables discussed above, relational conformance was represented in the majority of the articles about the quality of the datasets used for environmental health (60%). Four (80%) of the included articles explored the impact of residential mobility on exposure assessment; that is, whether patients’ moving impacted the results of studies looking at environmental outcomes. In all of these articles, bias was characterized as misclassification of exposure to contaminants, an issue that has been noted elsewhere to be a source of Type II error in environmental health research.

Nonspecific social determinants

Six of the 7 articles that addressed social determinants data as a broad, general category assessed the completeness of this information (86%); one addressed plausibility. Two of the articles mentioned the potential for bias, in both cases due to data missing nonrandomly. Three of the articles looked at the completeness of ICD Z- or V-codes, diagnostic codes that can be used by clinicians to collect SDoH data from patients. All 3 articles concluded that clinicians were utilizing ICD codes to represent SDoH around 2% of the time.

Occupation

Six of the 7 studies (86%) that looked at the quality of occupational data primarily examined completeness, all noting that occupational information is frequently missing from the health record. Four studies found that data were not missing at random and that male patients were more likely to have occupational information in their record.

Recommendations from the literature

The second theme we identified in our analysis is that there are solutions researchers can use to mitigate the issues caused by data quality problems. Forty-seven of the articles (62%) made at least one evidence-based recommendation for researchers seeking to improve the quality of social determinants data after it has been collected. We grouped these recommendations into 5 suggestions, which are detailed below and briefly summarized in Table 4.

Table 4.

Summary of recommendations found in the articles

Five ways to increase data quality
Recommendation	References supporting this recommendation
1. Avoid complete case analysis	23, 50, 65
2. Impute data	14, 22, 29, 30, 36, 37, 45, 46, 50, 61, 65–71
3. Rely on multiple sources	13, 26, 28, 49, 60, 62, 64, 72
4. Use validated software tools	10, 12, 54, 60, 62, 73–77
5. Select addresses thoughtfully	15, 16, 56, 57

Avoid complete case analysis

It is a common practice to exclude incomplete (ie, missing) data from the analysis, a method also known as casewise deletion or complete case analysis. However, 3 studies in our review found that casewise deletion decreased the quality of race/ethnicity data. Grundmeier et al found that using only complete cases “produced highly biased results,” and in fact reversed the odds ratio for the Black subjects in their dataset. Brown et al found that their “race and ethnicity coefficient estimates are often biased downwards either toward zero or more negative when data with missing race and ethnicity is dropped.” In all 3 studies, imputation was recommended as preferable to casewise deletion of missing data. In a study on using patient addresses to determine pesticide exposures, Ling et al noticed that there were significant differences between patients with complete address information and those who were missing information. In particular, Hispanic women born in Mexico and people living in poor neighborhoods were more likely to have missing addresses. For studies that rely on patient address to determine exposure, this means that these groups are more likely to be excluded from the analysis, potentially biasing the results. One additional study about geocoding of patient addresses did not evaluate for bias, but did note that “unmatched addresses tend to be unevenly distributed—more likely to occur in rural areas and newly developed suburban areas, and less likely to occur in inner-city areas.” In other words, complete case analysis would likely exclude a disproportionate number of rural patients.
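
The mechanism behind this kind of bias can be sketched with a small, entirely invented cohort. In the example below, the groups `A` and `B`, the outcome counts, and the share of missing addresses are all assumptions chosen for illustration. Addresses are missing only in group B, which also has the higher outcome rate, so dropping incomplete records lowers the overall estimate.

```python
def outcome_rate(rows):
    """Share of rows in which the outcome occurred."""
    return sum(r["outcome"] for r in rows) / len(rows)


def make_rows(group, events, non_events, has_address):
    """Build hypothetical patient rows for one group and address status."""
    return (
        [{"group": group, "outcome": 1, "has_address": has_address}] * events
        + [{"group": group, "outcome": 0, "has_address": has_address}] * non_events
    )


# Invented cohort: group A has a 20% outcome rate and no missing addresses;
# group B has a 40% outcome rate and half of its addresses are missing.
cohort = (
    make_rows("A", events=20, non_events=80, has_address=True)
    + make_rows("B", events=20, non_events=30, has_address=True)
    + make_rows("B", events=20, non_events=30, has_address=False)
)

# Complete case analysis keeps only the rows with an address.
complete_cases = [r for r in cohort if r["has_address"]]

full_rate = outcome_rate(cohort)
complete_case_rate = outcome_rate(complete_cases)

print(f"Full cohort outcome rate:   {full_rate:.1%}")
print(f"Complete-case outcome rate: {complete_case_rate:.1%}")
```

In this toy cohort the missingness is unrelated to the outcome within each group, yet the pooled estimate still shifts because group B is under-represented among the complete cases.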

Impute data

Several studies looked at the use of imputation, also referred to as indirect estimation, to increase the completeness of datasets and avoid casewise deletion. Imputation is a way to infer missing data, and there are many imputation methods that can be used to generate substitute values to fill in missing data. Most of these studies examined methods for imputing race/ethnicity data, although several looked at imputing geocoded patient addresses, and one looked at imputing occupational data. The most widely researched imputation method was Bayesian Improved Surname Geocoding (BISG). BISG is used to supplement missing race/ethnicity data and provides a probability of a patient belonging to a particular racial or ethnic group based on that patient’s geocoded address and their last name. In their study on the use of BISG, Dembosky et al found that it “did not substantially alter the estimated overall racial/ethnic distribution, but it did modestly increase sample size and statistical power.” Imputation was recommended not only as a solution for missing data but also to improve the accuracy of implausible data. Methods involving Spanish surname coding, a close relative of BISG that uses a patient’s last name to guess their ethnicity, were investigated in several studies to increase the reliability and consistency of data from Hispanic patients. Two articles validated similar, surname-based methods for data from Asian/Pacific Islander patients. All of these studies confirmed that these techniques reduced misclassification errors and enhanced data quality for their respective populations. Finally, 2 additional studies looked at imputing race/ethnicity data with anonymized clinical datasets, a situation where patient identifying information has been removed and it is therefore not possible to use imputation methods that rely on patient surname or geocoded address. Ma et al compared 4 imputation methods and found that conditional multiple imputation “substantially improved statistical inferences for racial health disparities research.”
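
The general idea behind BISG can be sketched as a single application of Bayes’ rule: a surname-based probability for each group is updated with the probability of living in the patient’s geographic area given membership in that group. The sketch below is a simplified illustration rather than the implementation used in any reviewed study; the function name `bisg_posterior`, the three categories, and all probabilities are hypothetical.

```python
def bisg_posterior(p_group_given_surname, p_area_given_group):
    """Combine surname and geography by Bayes' rule.

    posterior(group) is proportional to
    P(group | surname) * P(area | group), normalized to sum to 1.
    """
    unnormalized = {
        group: p_group_given_surname[group] * p_area_given_group[group]
        for group in p_group_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("No group has nonzero probability for this patient.")
    return {group: value / total for group, value in unnormalized.items()}


# Hypothetical inputs for one patient (not real census figures).
surname_prior = {"Hispanic": 0.80, "White": 0.15, "Other": 0.05}
area_likelihood = {"Hispanic": 0.002, "White": 0.001, "Other": 0.001}

posterior = bisg_posterior(surname_prior, area_likelihood)
for group, prob in posterior.items():
    print(f"{group}: {prob:.3f}")
```

The output is a set of probabilities rather than a single label, so downstream analyses can either weight by these probabilities or assign a category only above a chosen threshold.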

Rely on multiple sources

This method compares and links data across an individual’s record in order to fill in or correct missing data fields. It was recommended by several studies as a way to increase completeness, to check the accuracy of implausible data, or both. For example, Smith et al recommended supplementing race/ethnicity data from the EHR with birth certificate data to increase completeness and plausibility. Another study used natural language processing (NLP) to “improve the identification of race and ethnicity in EHR data.” Those researchers used NLP to comb through the unstructured text fields in clinical notes, then used the results to augment race/ethnicity data missing from structured fields.
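
A minimal sketch of this approach, assuming each source has already been linked to the same patient: take the first non-missing value in a chosen order of preference and flag any disagreement between sources for review. The source names, the priority order, the set of values treated as missing, and the helper `resolve_field` are all hypothetical.

```python
# Values treated as missing in this sketch (an assumption, not a standard).
MISSING = {None, "", "Unknown"}


def resolve_field(values_by_source, priority):
    """Return (value, conflict) for one patient and one field.

    value is the first non-missing entry in priority order, or None.
    conflict is True when non-missing sources disagree with each other.
    """
    observed = [
        values_by_source.get(source)
        for source in priority
        if values_by_source.get(source) not in MISSING
    ]
    if not observed:
        return None, False
    return observed[0], len(set(observed)) > 1


# Hypothetical priority order, with self-report preferred when available.
priority = ["self_report", "ehr", "birth_certificate"]

filled = resolve_field(
    {"ehr": "Unknown", "birth_certificate": "Hispanic"}, priority
)
conflicting = resolve_field(
    {"self_report": "Hispanic", "ehr": "White"}, priority
)
empty = resolve_field({"ehr": "Unknown"}, priority)

print(filled)       # value supplied by the birth certificate
print(conflicting)  # self-report wins, but the conflict is flagged
print(empty)        # nothing usable in any source
```

Flagging conflicts rather than silently overwriting them keeps a record of where sources disagree, which is itself a plausibility signal.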

Use validated software tools

Several articles evaluated a specific software tool, usually one developed by the authors, for either assessing or increasing the quality of the data. Articles about geocoding patient address data, particularly, evaluated the ability of specific geocoding tools, such as ArcGIS, to increase data quality by improving match rates or positional accuracy. For researchers seeking to increase the quality of occupation data prior to analysis, the National Institute for Occupational Safety and Health (NIOSH) has created a free, web-based system called the NIOSH Industry and Occupation Computerized Coding System, which was evaluated and recommended by several studies. Some studies tested data quality assessment tools such as the Data Quality Assessment Tool, created to evaluate patient records at Community Health Centers, and the Data Completeness Analysis Package.

Select addresses thoughtfully

Patients change addresses over time, which means that researchers often have decisions to make about which address to use. When confronted with data from patients who may have moved several times over the study period, 3 studies concluded that a single patient address was sufficient; one study recommended using the most recent address, while the other 2 concluded that address at birth was adequate for research on the effect of early exposures. However, a study by Brokamp et al compared the effect of using the most recent patient address, birth address, and an average across various addresses. They found that using the most recent address or address at birth for a mostly urban population monitored over a 7-year period could create a bias toward the null.
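
To see why the choice of address can matter, the sketch below compares three exposure estimates for one invented residential history: exposure at the birth address, exposure at the most recent address, and an average weighted by time spent at each address. Weighting by duration is an assumption made here for illustration and may differ from the averaging used by Brokamp et al; the function name and all values are hypothetical.

```python
def time_weighted_exposure(history):
    """Average exposure across addresses, weighted by years at each address.

    history is a list of (years_at_address, exposure_level) tuples,
    ordered from earliest to most recent address.
    """
    total_years = sum(years for years, _ in history)
    if total_years == 0:
        raise ValueError("Residential history covers zero years.")
    return sum(years * exposure for years, exposure in history) / total_years


# Invented 7-year history: (years at address, exposure level in arbitrary units).
history = [(2.0, 10.0), (4.0, 4.0), (1.0, 8.0)]

birth_address_exposure = history[0][1]
most_recent_exposure = history[-1][1]
average_exposure = time_weighted_exposure(history)

print(f"Birth address:         {birth_address_exposure:.2f}")
print(f"Most recent address:   {most_recent_exposure:.2f}")
print(f"Time-weighted average: {average_exposure:.2f}")
```

In this invented case both single-address estimates overstate the time-weighted average, which is the kind of exposure misclassification the reviewed studies were concerned with.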

DISCUSSION

The data quality issues represented in our review were found in similar proportions to those in Weiskopf and Weng’s review examining the quality of EHR data for secondary use. This similarity both validates our findings and suggests that SDoH data suffer from quality problems similar to other data in the medical record. Our review found that the category of data quality problem varies depending on the variable. Likewise, the kind of error created by these data quality problems also varies based on the social determinant factor in question. Most notably, problems encountered with the quality of SDoH data do not occur randomly. Although many researchers are aware that data “missing not at random,” commonly abbreviated to MNAR, can cause bias during analysis, fewer are aware of the problems associated with other kinds of “data quality not at random.” However, problems with plausibility not at random (for example, the accuracy of data for Hispanic or Latino patients being lower than the accuracy of data from white patients in the same dataset) have similarly profound implications. Namely, when patients from one racial or ethnic group are lost in another group or mistakenly categorized as “Other,” subsequent analysis can cause those groups to be under-represented in research results. Misidentification of the race or ethnicity of groups of patients can inadvertently lead to the erasure of those groups from clinical research. Several articles documented that race/ethnicity/country-of-origin data tend to be recorded inconsistently across a patient’s record, especially for Hispanic patients. Why is data quality so poor for this group? Thirteen studies speculated that this may be due to the “fluid, debatable, and problematic” nature of the definition of race and ethnicity. Race is, after all, not a biological category but a social and political one, and thus its terminology shifts over time.
Pellegrin et al noted that the fluidity of these definitions leads patients to respond inconsistently to questions about their race/ethnicity, thus causing problems with data reliability. Further, the fact that these categories are so broad and poorly defined leads to difficulties with data validity. At the institutional level, several studies speculate that the quality of data about ethnicity has been impaired by variations in how healthcare systems record and handle this information. As one study noted, “inconsistent classification of Hispanics is likely attributable, in part, to differences in the definition of being Hispanic across classification schemes.” Because the gold standard for race/ethnicity is widely considered to be patient self-report, it is possible that the increasing use of dynamic patient-facing data entry tools may allow people to inform and correct their own demographic information, thus helping to improve the quality of race, ethnicity, and country-of-origin data in the future. The quality of patient elements, particularly demographic data, promises to become increasingly important as the efforts to link patient records across multiple institutions are expanded. This is necessary for large-scale research, big data analytics, and continuity of patient care. Privacy-preserving record linkage (PPRL) methods identify when records from different sources belong to the same entity while minimizing the exposure of sensitive personal information. These techniques often rely heavily on patient address along with name and date of birth. When there are errors or missing address data, linkage quality suffers. Many of the social determinants data elements most commonly used in research, such as race, ethnicity, insurance status, and address, were originally collected as demographic data for administrative purposes. Inevitably, data quality issues arise when these elements are used for secondary, retrospective research.
Given the increasing importance of social determinants in health equity research and intervention, it is crucial that healthcare institutions work to improve the quality and availability of these data. Efforts such as the Gravity Project are already underway to create standardized, structured reporting of SDoH. Consistently applied standards for SDoH data collection in the EHR would result in improved data quality, which in turn would lead to more robust research, care coordination, and population health management.

Limitations

The data quality concepts used here are not completely orthogonal or distinct. For example, several studies found that the plausibility of the patient race/ethnicity information in their dataset was questionable because the data were incomplete; in other words, plausibility was low because a patient’s race was reported correctly in one area, but as “other” in another. The snowball sampling approach we used may have caused some research areas to be under-represented. Citation searching is inherently exponential; if our initial search turned up few articles within a certain domain, then that domain may appear to have a smaller body of literature than is in fact present. In addition, publication bias may have affected our findings if authors did not report negative findings when evaluating their datasets for bias. Although some social determinants variables may have more thoroughly documented data quality issues, this does not mean that those variables are of poorer quality. A larger body of research may indicate simply that these variables are more accessible to researchers and therefore easier to study. For example, our finding that articles about race/ethnicity data made up such a large proportion of the literature may reflect that this information is more readily available in the structured fields of the health record than other SDoH variables. Another limitation is that our search for solutions to these data quality problems focused on the needs of researchers using observational databases. Because researchers require ex post facto methods for improving secondary use data, we excluded any recommendations in the literature about improving data collection practices at the point of care.

CONCLUSIONS

The types of quality problems found with SDoH data vary depending on the variable; race/ethnicity data from the health record can be implausible or incomplete, while linked community-level data are prone to problems with nonconformance as well as plausibility. Similarly, data quality problems can lead to corresponding issues of validity and reliability; race/ethnicity data that are implausible or missing not at random may lead to misclassification bias, while problems with geocoding can lead to misclassification, confounding, or ecologic fallacy. Several studies have documented that data from Hispanic patients can be particularly implausible and are especially prone to misclassification bias when compared to data from other racial/ethnic groups. Fortunately, evidence-based solutions are available for researchers who want to improve the quality of social determinants data ex post facto. While complete case analysis has the potential for bias, imputation techniques can help avoid this shortcoming. Consideration of data quality by researchers prior to analysis, along with thoughtfully applied quality improvement methods, may help prevent bias and improve the validity of research conducted with SDoH data.
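The risk of complete case analysis noted above can be illustrated with a small Python sketch. It uses an invented toy dataset, not data from any reviewed study, in which the race/ethnicity field is missing far more often for one hypothetical group ("B") than for another ("A"). Because the true group is known in this toy example, the group's share among all patients can be compared with its share after records with missing values are dropped.

```python
from collections import Counter

# Toy records as (true_group, recorded_group); None marks a missing field.
# All counts are invented for illustration only.
records = (
    [("A", "A")] * 48    # group A, race/ethnicity recorded
    + [("A", None)] * 2  # group A, race/ethnicity missing
    + [("B", "B")] * 30  # group B, race/ethnicity recorded
    + [("B", None)] * 20  # group B, race/ethnicity missing
)


def share(groups, target):
    """Fraction of entries in `groups` equal to `target`."""
    counts = Counter(groups)
    return counts[target] / sum(counts.values())


# Share of group B among all 100 patients (known only because this is a toy example).
true_share_b = share([true for true, _ in records], "B")

# Complete case analysis: keep only records where the field is present.
complete_cases = [recorded for _, recorded in records if recorded is not None]
complete_case_share_b = share(complete_cases, "B")

print(f"true share of B: {true_share_b:.3f}")
print(f"complete-case share of B: {complete_case_share_b:.3f}")
```

Here group B makes up half of the patients but only 30 of the 78 complete cases, so an analysis restricted to complete cases under-represents the group whose data are more often missing.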

FUNDING

LAC has received support from the National Library of Medicine under Award Number T15LM007088. NGW has received funding under the National Library of Medicine Award Numbers K01LM012738 and R21LM013645.

AUTHOR CONTRIBUTIONS

LAC led the study, drafted the manuscript, conducted the literature review, and analyzed the data. JS helped revise the manuscript and drafted significant portions of the “Discussion” section. NGW assisted with the interpretation of the data, contributed to the conception and design of the study and to the structure of the manuscript, and also provided substantial intellectual content and critical revision.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

76 in total

1. Agreement in race-ethnicity coding between a hospital discharge database and another database.

Authors: A P Polednak
Journal: Ethn Dis Date: 2001 Impact factor: 1.847

2. Post office box addresses: a challenge for geographic information system-based studies.

Authors: Susan E Hurley; Theresa M Saunders; Rachna Nivas; Andrew Hertz; Peggy Reynolds
Journal: Epidemiology Date: 2003-07 Impact factor: 4.822

3. Use and abuse of computer-stored medical records.

Authors: J van der Lei
Journal: Methods Inf Med Date: 1991-04 Impact factor: 2.176

4. Underserved populations with missing race ethnicity data differ significantly from those with structured race/ethnicity documentation.

Authors: Evan T Sholle; Laura C Pinheiro; Prakash Adekkanattu; Marcos A Davila; Stephen B Johnson; Jyotishman Pathak; Sanjai Sinha; Cassidie Li; Stasi A Lubansky; Monika M Safford; Thomas R Campion
Journal: J Am Med Inform Assoc Date: 2019-08-01 Impact factor: 4.497

5. Indirect Estimation of Race/Ethnicity for Survey Respondents Who Do Not Report Race/Ethnicity.

Authors: Jacob W Dembosky; Amelia M Haviland; Ann Haas; Katrin Hambarsoomian; Robert Weech-Maldonado; Shondelle M Wilson-Frederick; Sarah Gaillot; Marc N Elliott
Journal: Med Care Date: 2019-05 Impact factor: 2.983

6. A multifaceted comparison of ArcGIS and MapMarker for automated geocoding.

Authors: Sanjaya Kumar; Ming Liu; Syni-An Hwang
Journal: Geospat Health Date: 2012-11 Impact factor: 1.212

7. The expanded racial and ethnic codes in the Medicare data files: their completeness of coverage and accuracy.

Authors: D S Lauderdale; J Goldberg
Journal: Am J Public Health Date: 1996-05 Impact factor: 9.308

8. Accuracy of racial classification of Vietnamese patients in a population-based cancer registry.

Authors: K C Swallen; S L Glaser; S L Stewart; D W West; C N Jenkins; S J McPhee
Journal: Ethn Dis Date: 1998 Impact factor: 1.847

9. Variation in Electronic Health Record Documentation of Social Determinants of Health Across a National Network of Community Health Centers.

Authors: Erika K Cottrell; Katie Dambrun; Stuart Cowburn; Ned Mossman; Arwen E Bunce; Miguel Marino; Molly Krancari; Rachel Gold
Journal: Am J Prev Med Date: 2019-12 Impact factor: 5.043

10. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research.

Authors: Nicole Gray Weiskopf; Chunhua Weng
Journal: J Am Med Inform Assoc Date: 2012-06-25 Impact factor: 4.497
