Shiho Kino, Yu-Tien Hsu, Koichiro Shiba, Yung-Shin Chien, Carol Mita, Ichiro Kawachi, Adel Daoud.
Abstract
BACKGROUND: Machine learning (ML) has spread rapidly from computer science to several disciplines. Given the predictive capacity of ML, it offers new opportunities for health, behavioral, and social scientists. However, it remains unclear how and to what extent ML is being used in studies of social determinants of health (SDH).
Keywords: Machine learning; Review; Social determinants of health
Year: 2021 PMID: 34169138 PMCID: PMC8207228 DOI: 10.1016/j.ssmph.2021.100836
Source DB: PubMed Journal: SSM Popul Health ISSN: 2352-8273
Summary of the included studies.
| Authors | Year | Aim | Theme | Data Type | Country | Number of rows | Outcomes | Algorithm | Main Findings |
|---|---|---|---|---|---|---|---|---|---|
| ( | 2019 | To assess the connection between social vulnerability and its urban and dwelling context by a decision model. | Prediction | Survey | Spain | 5381 | Social vulnerability | ANNs | There is a connection and relationship between demographic and social vulnerability phenomena and the residential configuration of Andalusia. |
| ( | 2020 | To analyze how socio-economic and socio-cultural factors play a role in the initiation and cultivation of addictive behaviors and to use a machine learning approach to predict the early onset of such behaviors. | Prediction | Survey | Global | 176 | Smoking and alcohol habit | Gaussian naïve Bayes, SVM | Logistic regression was found to be the best-performing classifier for predicting both drinking and smoking habits. |
| ( | 2019 | To examine spatial patterns of country-level stillbirth rates and determine the influence of social determinants of health on spatial patterns of stillbirth rates. | Data curation | Survey | Global | 194 | Stillbirth rate | Bayesian networks | The Bayesian network model suggests strong dependencies between stillbirth rate and gender inequality index, geographic regions, and skilled birth attendants during delivery. |
| ( | 2010 | To evaluate the impact of educational attainment on the prevalence of osteoporosis and peripheral fractures and to develop a simple algorithm using a tree-based approach with education level and clinical data. | Prediction | Survey | Morocco | 356 | Bone mineral density | CART | A lower level of education was associated with significantly lower bone mineral densities at the lumbar spine and the hip sites and with a higher prevalence of osteoporosis at these sites in a dose-response manner. |
| ( | 2019 | To explore potential confounders in an adolescent public health dataset of a developing country by using a combination of machine learning methods and graph analysis. | Prediction | Survey | Brazil | 102301 | Health status | Gradient boosting machines | The proposed approach might be a useful tool to obtain novel insights on the association between variables and to identify general factors related to health conditions. |
| ( | 2015 | To examine the relation of neighborhood alcohol outlet density and norms around drunkenness with alcohol | Prediction | Survey | United States | N/A | Alcohol use disorder | SuperLearner algorithm | The neighborhood environment shapes alcohol use disorder. Despite the lack of additive interaction, each exposure had a substantial relationship with alcohol use disorder. Their findings suggest that alteration of outlet density and norms together would likely be more effective than either one alone. |
| ( | 2016 | To examine the relations between childhood adversities and mental disorders by race/ethnicity in the National Comorbidity Survey-Adolescent Supplement | Causal inference, Prediction | Survey | United States | N/A | Mental disorders | SuperLearner algorithm | Among adversities, physical abuse, emotional abuse, and sexual abuse had the strongest associations with mental disorders. Of all outcomes, behavior disorders were most strongly associated with adversities. |
| ( | 2020 | To explore the relationship between individual social capital and functional ability. | Prediction | Survey | China | N/A | Activity function | CART | Subjects with lower social participation and lower social connection had an increased risk of functional disability. |
| ( | 1991 | To apply the classification and regression trees method to classify abstainers and drinkers according to interactions among ten sociodemographic factors. | Prediction | Survey | United States | 5952 | Alcohol use | CART | Low rates of drinking were shown for low-income women with less than high school education. |
| ( | 2017 | To quantify and compare different dimensions of social and economic resilience in Bam and Rudbar with a descriptive-analytical method. | Prediction | Survey | Iran | 660 | Social and economic resilience | Feedforward multilayer perceptron ANN | The social component, namely, social capital, was the most important determinant of resilience. |
| ( | 2013 | To provide critical insights into the vast heterogeneity of disability within India. | Prediction | Survey | Global | 7150 | Disability score | Regression tree analysis | Having two or more symptomatic NCDs |
| ( | 2019 | To develop a model for predicting whether a person with T2DM has uncontrolled diabetes (hemoglobin A1c ≥ 9%), incorporating individual and area-level (census tract) covariates | Prediction | Survey | United States | 1015808 | Type 2 diabetes mellitus | LASSO regression, Ridge regression, RF | Machine learning models improved upon risk prediction. |
| ( | 2020 | To apply Logic regression in a study evaluating the association between occupational history and the risk of amyotrophic lateral sclerosis (ALS), and discuss advantages of the method as well as drawbacks and practical issues relevant for epidemiological research. | Prediction | Survey | Denmark | 37972 | Incidence of ALS | Logic regression | Logic regression may represent a useful methodology in several epidemiological studies dealing with a high number of covariates and is one of the few available approaches to investigate patterns of multiple binary covariates as they relate to a given outcome, which can offer several advantages in terms of both computation and interpretation. |
| ( | 2018 | To examine the cumulative effect of additional years and tenure security of social housing on mental health in a large cohort of lower-income Australians. | Prediction | Survey | Australia | 4777 | Mental health | Marginal structural models with machine learning-generated weights | The more transitions people made in/out of social housing, the greater the impact on mental health and psychological distress. |
| ( | 2019 | To determine if area-level resources, defined as organizations that assist individuals with meeting health-related social needs, are associated with lower levels of cardiometabolic risk factors. | Prediction | Survey | United States | 123355 | Body mass index | RF | Resources associated with lower BMI included more food resources, employment resources, and nutrition resources. |
| ( | 2018 | To assess whether knowledge of neighborhood socioeconomic status improves the prediction of health outcomes. | Prediction | Survey | United States | 90097 | Use of health care services and hospitalizations due to accidents, asthma, influenza, myocardial infarction, and stroke. | Random survival forest | Information on neighborhood socioeconomic status may not contribute much more to risk prediction above and beyond what is already provided by electronic health record data. |
| ( | 2020 | To estimate associations between fruit and vegetable intake relative to total energy intake and adverse pregnancy outcomes. | Causal inference | Survey | United States | 7572 | Adverse pregnancy outcomes | Targeted maximum likelihood estimation, SuperLearner algorithm | The differences in results between Super Learner with TMLE and logistic regression suggest that dietary synergy, which is accounted for in machine learning, may play a role in pregnancy |
| ( | 1991 | To identify potential high users of services among low-income psychiatric outpatients using CART. | Prediction | Survey | United States | 382 | Potential high users of services | CART | Discharge from inpatient psychiatric treatment right before admission to outpatient psychiatric treatment is the most powerful predictor. |
| ( | 2016 | To present and apply a method that makes predictions for trips reported in a household travel survey based on the data from a GPS and accelerometer data collection conducted in the same geographical context. | Data curation, Prediction | Survey, Accelerometer data | Global | 82084 | Transport-related physical activity | RF | The education level had a positive association with transport-related physical activity (T-MVPA). Household income had a negative association with T-MVPA, especially for those people without a motorized vehicle. |
| ( | 2014 | To explore complex interactions between different social determinants and their impact on mental healthcare use. | Prediction | Survey | Canada | 10600 | Mental health visits with a primary care provider or geriatrician | CART | Income adequacy plays an important role among women, while marital status is of greater importance among men for mental health services utilization. |
| ( | 2018 | To use machine learning algorithms to predict life expectancy at birth and then compare health-related characteristics of the under- and overachievers. | Prediction | Survey | Brazil | 3052 | Life expectancy at birth | ANNs, RFs, gradient boosted trees, least squares, Ridge and LASSO regressions, SVMs | Overachievers presented better results regarding primary health care. Underachievers performed more cesarean deliveries and mammographies and had more life-support health equipment. |
| ( | 2017 | To identify patterns of characteristics that distinguish very low food security households in the United States from other households. | Prediction | Survey | United States | 13351 | Food security | CART | Household experiences of VLFS were associated with different predictors for different types of households and often occurred at the intersection of multiple characteristics spanning unmet medical needs, poor health, disability, limitation, depressive symptoms, low income, and food assistance program participation. |
| ( | 2018 | To investigate the probability of suicide death using baseline characteristics and simple medical facility visit history. | Prediction | Survey | South Korea | | Suicidal rate | SVMs, ANNs | Male gender, older age, lower income, medical aid, and disability were linked to increased risk for suicide death at 10-year follow-up. |
| ( | 2019 | To use network modeling to characterize co-occurring psychosocial risks to maternal and child health among at-risk pregnant women. | Prediction | Survey | South Africa | 200 | Distress about pregnancy | Network analysis | Unintended pregnancy was strongly tied to distress about pregnancy. Distress about pregnancy was most central in the network and was connected to antenatal depression and HIV stigma |
| ( | 2019 | To present a new and highly configurable rule-based clinical natural language processing system designed to automatically extract information that requires inferencing from clinical notes. | Algorithmic fairness, Prediction | Text (Electronic health records) | United States | N/A | Housing situation, living alone, and social support | NLP | The algorithm is highly accurate in extracting and classifying the three variables of interest (housing situation, living alone, and social support). |
| ( | 2011 | To identify the complex interplay of area-level factors associated with the high area-specific incidence of Australian priority cancers using a classification and regression tree approach. | Prediction | Survey | Australia | 186075 | Cancer incidence | CART | |
| ( | 2020 | To better understand HL from a linguistic perspective and to open new research areas to enhance population management and individualized care. | Data curation | Text (Electronic health records), Survey | United States | 1050577 | Hospitalization | Discriminant function analysis | The developed model predicts human ratings of HL with ~80% accuracy. Validation indicates that lower HL patients are more likely to be nonwhite and have lower educational attainment. In addition, patients with lower HL suffered more negative health outcomes and had higher healthcare service utilization. |
| ( | 2019 | To predict women's height from their socioeconomic status | Prediction | Survey | Global | N/A | Height | LASSO, Ridge regression, generalized additive model, Bayesian neural network, bagged CART, RF | There were no relevant non-linear relationships between SES and women's height. |
| ( | 2019 | To examine the pathways through which economic austerity propagates through families' living conditions and societies' structural and political characteristics. | Causal inference | Survey | Global | 1940734 | Child poverty | Generalized RF | The International Monetary Fund (IMF) program affects children residing in the middle of the social stratification more than their peers residing at the top and bottom of this stratification; children residing in societies that have selected into IMF programs and have historically spent most on education are at a higher risk of falling into poverty. |
| ( | 2017 | To provide an empirical model for predicting low back pain by considering the occupational, personal, and psychological risk factor interactions in a worker population employed in industrial units using an ANN approach. | Prediction | Survey | Iran | 92 | Low back pain severity | Neural network model | The mean classification accuracy of the developed neural networks for the testing and training phase data was about 88% and 96%, respectively. In addition, the mean classification accuracy of both training and testing data was 92%, indicating much better results compared with other methods. |
| ( | 2020 | To identify predictors of youths' first episode of homelessness during the 12 months after substance use treatment entry. | Prediction | Survey | United States | 20069 | The first episode of homelessness | Lasso machine learning regression | The adolescents who were older, male, reported more victimization experiences, mental health problems, family problems, deviant peer relationships, and substance use problems were more likely to report experiencing homelessness. |
| ( | 2019 | To evaluate the determinants of health in aging using machine learning methods and to compare the accuracy with traditional methods. | Prediction | Survey | United Kingdom | 6209 | Health metrics | RF, deep learning, linear model | Health-trend, physical activity, and personal-fitted variables were the main predictors of health. The performance of the RF method was similar to the traditional linear model, but RF significantly outperformed deep learning. |
| ( | 2019 | To study the relationship between social and economic data and opioid drug abuse. | Prediction | Survey | United States | 8344 | Opioid | Grey relation analysis | The deviation of prediction is reduced from (−10.99%, 22.33%) to (−8.29%, 2.81%), making the modified value closer to the measured value. |
| ( | 2020 | To build and evaluate a novel framework termed Stratified Cascade Learning and use it for predicting the risk of hospitalization. | Prediction | Survey | United States | 14300 | Hospitalization risk | Stratified cascade learning model | The stratified cascade learning model does not improve either the area under the curve or the negative predictive value of the basic classifier but materially improves accuracy and specificity measures at the expense of lowering sensitivity for the “predictable” subset. |
| ( | 2005 | To identify the socioeconomic, sociodemographic, and health-related lifestyle behavior profile of adults who comply with the recommended four or more servings per day of fruit and vegetables. | Prediction | Survey | Ireland | 6539 | Fruit and vegetable consumption | CART | Irish people do not comply with the dietary recommendations, but this varies greatly by social circumstance. |
| ( | 2007 | To investigate the differences in culture, attitudes, and social networks between Australian and Taiwanese men and women and to identify the factors that predict midlife men and women's quality of life in both countries. | Prediction | Survey | Global | 715 | Quality of life | CART | People who had higher levels of horizontal individualism and collectivism, positive attitudes, and better social support had better psychological, social, physical, and environmental health, while it emerged that vertical individualists with competitive characteristics would experience a lower quality of life. |
| ( | 2018 | To use machine learning to identify an optimal set of predictors for urban interpersonal firearm violence rates using a broad set of community characteristics. | Prediction | Survey | United States | N/A | Firearm violence | Random forest analysis | The top 5 covariates with the highest variable importance – the black isolation index, the black segregation index, the percent of households receiving food stamps, the percent of men age 65+ with high school education, and the percent never married – achieved 0.708 average R-squared and average MSE alone. |
| ( | 2020 | To examine the association between community firearm violence and risk of preterm birth. | Causal inference, Prediction | Survey | United States | 2084417 | Preterm birth | SuperLearner algorithm | Firearm violence was associated with the risk of preterm delivery, and this association was partially mediated by infection and substance use. |
| ( | 2019 | To assess the relative associations of demographic, psychological, behavioral, and cognitive variables with body mass index in a nationally representative sample of youth. | Prediction | Survey | United States | 4524 | Body mass index | LASSO regression | Stimulant medications and demographic factors were most strongly associated with body mass index. |
| ( | 2019 | To examine differences in sociodemographic characteristics, health, nutritional status, and food purchasing behavior between new and existing recipients of SNAP after the recession. | Prediction | Survey | United States | 21806 | Household-level nutritional characteristics | LASSO regression | Given that new recipients are generally better off than existing recipients, it may be more impactful from a public health perspective to instead intervene among those existing recipients who may have more long-standing challenging socioeconomic circumstances. |
| ( | 2014 | To determine long-term risk profiles for suicidal ideation among a community sample of older adults using a decision tree approach, with a focus on the role of physical, social, and psychological risk factors, and their interactions | Prediction | Survey | Australia | 2160 | Suicidal ideation | CART | Psychological factors are important for predicting suicidal ideation. Both physical and social factors significantly improved the predictive ability of the model. |
| ( | 2019 | To measure the relative importance of race compared to health care and social factors on prostate cancer-specific mortality. | Prediction | Survey | United States | 514878 | Prostate cancer mortality | RFs | Tumor characteristics at diagnosis were the most important factors for prostate cancer mortality. Across all groups, race was less than 5% as important as tumor characteristics and only more important than health care and social factors in 2 of the 18 groups. Although race had a significant impact, health care and social factors known to be associated with racial disparities had greater or similarly important effects across all ages and stages. |
| ( | 2015 | To develop a computational algorithm for network epidemiology to map structure-activity data of HAART-drugs cocktails over complex networks of AIDS epidemiology and socioeconomic factors. | Prediction | Survey | United States | 131252 | The probability of AIDS could be halted in a county with a HAART cocktail | Linear neural network | The machine-learning algorithms could be useful as an initial form of screening for the prediction of effective drugs in preclinical assays for the treatment of HIV in different populations of U.S. counties with a given AIDS epidemiological prevalence. However, the models did not appear to be effective when using socioeconomic factors to predict the efficacy of the treatment of HIV. |
| ( | 2016 | To characterize cumulative risk associated with co-occurring risk factors for cigarette smoking. | Prediction | Survey | United States | 114426 | Smoking status | CART | The effects associated with common risk factors for cigarette smoking are independent, cumulative, and generally summative. |
| ( | 2017 | To examine risk factors for using full-flavor versus other cigarette types, including socioeconomic disadvantage and other risk factors for tobacco use or tobacco-related adverse health impacts. | Prediction | Survey | United States | 114426 | Tobacco use or tobacco-related adverse health impacts | CART | The use of full-flavor cigarettes is overrepresented in socioeconomically disadvantaged and other vulnerable populations and associated with an increased risk of nicotine dependence. |
| ( | 2009 | To explore the spatial distribution of notified cryptosporidiosis cases and identify major socioeconomic factors associated with the transmission of cryptosporidiosis in Brisbane, Australia. | Data curation | Survey, Digital base map from the Australian Bureau of Statistics | Australia | N/A | Incidence of Cryptosporidiosis infection | Spatial CART | A spatial CART model shows that the relative risk for cryptosporidiosis transmission was 2.4 when the value of the social economic index for areas was over 1028 and the proportion of residents with low educational attainment in a statistical local area exceeded 8.8%. |
| ( | 2010 | To examine the impact of social economic and weather factors on cryptosporidiosis and explore the possibility of developing a predictive model using social economic and weather data in Queensland, Australia. | Prediction | Survey | Australia | N/A | Monthly incidence of cryptosporidiosis | Spatiotemporal CART | Spatiotemporal CART models based on social economic and weather variables can be used for predicting the outbreak of cryptosporidiosis in Queensland, Australia. |
| ( | 2017 | To build a model to predict all-cause 30-day readmission risk, adding block-level census data as proxies for social determinants of health | Prediction | Survey | United States | 323813 | 30-day hospital readmission | ANN | Neural networks are great candidates to capture the complexity and interdependency of various data fields in electronic health records. |
| ( | 2018 | To explore the mutual importance or hierarchy of sociodemographic and lifestyle-related risk factors of being overweight using RF. | Prediction | Survey | Finland | 4757 | Overweight | RF | RF did not demonstrate higher power in variable selection compared to linear regression in our study. The features of RF are more likely to appear beneficial in settings with a larger number of predictors. |
| ( | 2019 | To analyze treatment heterogeneity of maternal agency on severe child undernutrition and how this effect plays out in the context of armed conflict. | Causal inference, Prediction | Survey | Nigeria | 48613 | Severe child malnutrition | Bayesian additive regression tree | Maternal education decreases severe child undernutrition, but only when mothers acquire ten years of education or higher. This protective effect remains even during exposure to armed conflict. |
| ( | 2019 | To evaluate the potential of Google Street View (GSV) as a novel green space measure for epidemiological studies. | Data curation, Prediction | Image | United States | 254 | Green View Index, normalized difference vegetation index | Green screen algorithm | GSV-based measures captured unique information about green space exposures. We further developed a GVI:NDVI ratio, which was associated with the amount of vertical green space in an image. The GVI and GVI:NDVI ratio were weakly related to neighborhood socioeconomic status and are therefore less susceptible to confounding in health studies compared to other green space measures. |
| ( | 2016 | To determine which built environment characteristics contributed to the classification of African American women as having four or more CVD risk factors at optimal levels. | Prediction | Survey | United States | 30 | CVD | CART | The classification and regression trees identified participants with few, low-quality neighborhood physical activity resources and who were older than 55 as least likely to have four or more CVD risk factors at optimal levels |
| ( | 2012 | To present an epidemiologic systems approach for identifying potential determinants of diarrhea in children under five years. | Causal inference, Prediction | Survey | Pakistan | 18202 | Diarrhea | Bayesian network modeling | Only access to a dry pit latrine (protective), access to an atypical water source (protective), and no formal garbage collection (unprotective) were directly dependent on the presence of diarrhea. |
| ( | 2019 | To identify and rank predictors of cardiovascular health at the neighborhood level in the United States. | Prediction | Survey | United States | 27066 | Neighborhood CVD and stroke prevalence | RF | Demographics, health behaviors, and prevention measures explained the vast majority of the variance: 93.2% for CVD and 96.0% for stroke. |
| ( | 2016 | To examine the feasibility of inferring regional health outcomes from socio-demographic data through national censuses and community surveys. | Prediction | Survey | United States | N/A | Angina or CHD, heart attack, stroke, diabetes, hypertension, obesity | Regression with stepwise feature selection, group LASSO, RF, Gaussian process regression | The model had sufficient external validity to make good predictions, based on demographics alone, for areas not included in the model development. |
| ( | 2013 | To evaluate the influence of work environment on the risk of arterial hypertension development in employees of various social groups. | Prediction | Survey | Russia | 3664 | Arterial hypertension | CTA | Quetelet index, age, and occupation are highly significant for arterial hypertension prediction in the working-age population. Occupation is very significant for arterial hypertension prediction in middle‐aged patients. |
| ( | 2019 | To evaluate whether the Operation Peacemaker Fellowship, an innovative firearm violence-prevention program implemented in Richmond, California, was associated with reductions in firearm and non-firearm violence. | Prediction | Survey | United States | N/A | Experiencing non-firearm and firearm violence | Generalization of the synthetic control method | The program was associated with reductions in firearm violence (annually, 55% fewer deaths and hospital visits, 43% fewer crimes) but also unexpected increases in non-firearm violence (annually, 16% more deaths and hospital visits, 3% more crimes). These associations were unlikely to be attributable to chance for all outcomes except non-firearm homicides and assaults in crime data |
| ( | 2011 | To assess the usefulness of the decision trees method as a research method of multidimensional associations between menarche and socioeconomic variables. | Prediction | Survey | Poland | 2354 | Menstruation appearance at the age of 12–14 | CART | The strongest discriminatory power was attributed to the number of children in a family and the mother's and then father's educational level. |
| Meng et al. ( | 2017 | To examine openly shared substance-related tweets to estimate prevalent sentiment around substance use and identify popular substance use activities. | Data curation | Text (tweets), Survey | United States | 79848992 | Substance use behaviors | Maximum entropy text classifier | More convenience stores in a zip code were associated with higher percentages of tweets about alcohol. Larger zip code population size and higher percentages of African Americans and Hispanics were associated with fewer tweets about substance use and underage engagement. Zip code economic disadvantage was associated with fewer alcohol tweets but more drug tweets. |
| ( | 2017 | To evaluate neighborhood influences on physical activity among older adults, analogous, in a genetic context, to a genome-wide association study. | Prediction | Survey | United States | 3497 | Physical activity | LASSO regression, RF | Only neighborhood socioeconomic status and disorder measures were associated with total activity and gardening, whereas a broader range of measures was associated with walking. |
| ( | 2016 | To evaluate health problems in income-based groups. | Prediction | Survey | United States | 3604 | Self-rated health | CART | More risk factors for self-rated health problems and chronic burden indicators were associated with SRH in lower-income groups |
| ( | 2016 | To explore interactions between demographic, tobacco, and psychosocial factors to identify cigarette smokers at highest risk for alternative tobacco product use from a racially/ethnically and socioeconomically diverse sample of adult smokers across the full smoking spectrum. | Prediction | Survey | Global | 2376 | Concurrent alternative tobacco product use | CTA | Alcohol for men and age, race/ethnicity, and discrimination for women increased the probability of alternative tobacco product use. |
| ( | 2016 | To use publicly available, geotagged Twitter data to create neighborhood indicators for happiness, food, and physical activity for three large counties: Salt Lake, San Francisco, and New York. | Data curation | Text (tweets), Survey | United States | 2.8 million | Neighborhood happiness, diet, and physical activity | NLP | Happy tweets, healthy food references, and physical activity references were less frequent in census tracts with greater economic disadvantage and higher proportions of racial/ethnic minorities and youths |
| ( | 2016 | To build, from geotagged Twitter data, a national neighborhood database with area-level indicators of well-being and health behaviors. | Data curation | Text (tweets), Survey | United States | 80 million | Happy, food, and physical activity tweets, Mortality, chronic condition (obesity, diabetes, high cholesterol), self-rated health | NLP | Census tract factors like percentage African American and economic disadvantage were associated with lower census tract happiness. Urbanicity was related to a higher frequency of fast-food tweets. Greater numbers of fast-food restaurants predicted a higher frequency of fast-food mentions. Surprisingly, fitness centers and nature parks were only modestly associated with a higher frequency of physical activity tweets. Greater state-level happiness, positivity toward physical activity, and positivity toward healthy foods, assessed via tweets, were associated with lower all-cause mortality and prevalence of chronic conditions such as obesity and diabetes and lower physical inactivity and smoking, controlling for state median income, median age, and percentage white non-Hispanic. |
| ( | 2017 | To leverage geotagged Twitter data to create national indicators of the social environment, with small-area indicators of prevalent sentiment and social modeling of health behaviors, and to test associations with county-level health outcomes, while controlling for demographic characteristics. | Data curation, Prediction | Text (tweets), Survey | United States | 80 million | National indicators of the social environment | NLP | Twitter indicators of happiness, food, and physical activity were associated with lower premature mortality, obesity, and physical inactivity. Alcohol-use tweets predicted higher alcohol-use–related mortality. |
| Nguyen et al. ( | 2017 | To create zip code level indicators of happiness, food, and physical activity culture from geolocated Twitter data to examine the relationship between these neighborhood characteristics and obesity and diabetes diagnoses. | Data curation, Prediction | Text (tweets), Survey | United States | 1.86 million | Obesity and diabetes prevalences | NLP | Individuals living in zip codes with the greatest percentage of happy and physically active tweets had lower obesity prevalence after accounting for individual age, sex, nonwhite race, Hispanic ethnicity, education, and marital status, as well as zip code population characteristics. More happy tweets and lower caloric density of food tweets in a zip code were associated with the lower individual prevalence of diabetes. |
| ( | 2017 | First, to build a national food environment database from geotagged Twitter and Yelp data. Second, to test associations between state food environment indicators and health outcomes. | Data curation, Prediction | Text (tweets, Yelp listing), Survey | United States | 79,848,992 | Mortality, chronic condition (obesity, diabetes, high cholesterol), self-rated health | NLP | A one standard deviation increase in caloric density of food tweets was related to higher all-cause mortality (+46.50 per 100,000), diabetes (+0.75%), obesity (+1.78%), high cholesterol (+1.40%), and fair/poor self-rated health (+2.01%). More burger Yelp listings were related to a higher prevalence of diabetes (+0.55%), obesity (+1.35%), and fair/poor self-rated health (+1.12%). More alcohol tweets and Yelp bar and pub listings were related to higher state-level binge drinking and heavy drinking, but lower mortality and a lower percentage reporting fair/poor self-rated health. |
| ( | 2017 | To construct built environment indicators using computer vision techniques and publicly available Google Street View images and to examine relationships between neighborhood built environments, demographic characteristics of residents, and health outcomes. | Data curation, Prediction | Map (Google Street View), survey | United States | N/Aa | Obesity and diabetes | Convolutional neural networks | Computer vision models had an accuracy of 86%–93% compared with manual annotations. Individuals living in zip codes with the greenest streets, crosswalks, and commercial buildings/apartments had relative obesity prevalences that were 25%–28% lower and relative diabetes prevalences that were 12%–18% lower than individuals living in zip codes with the least abundance of these neighborhood features. |
| ( | 2019 | To show that a widely used algorithm exhibits significant racial bias in the healthcare context. | Algorithmic fairness | Text (Electronic health records) | United States | N/A | Patients who will derive the greatest benefit from the high-risk care management | Prediction algorithms (unspecified) | Black patients are considerably sicker than White patients, as evidenced by signs of uncontrolled illnesses. Remedying this disparity would increase the percentage of Black patients receiving additional help from 17.7% to 46.5%. The bias arises because the algorithm predicts health care costs rather than illness, but unequal access to care means that less money is spent caring for Black patients than for White patients. |
| ( | 2006 | To evaluate the most important sociodemographic factors on the smoking status of high school students using a broad randomized epidemiological survey. | Prediction | Survey | Turkey | 3304 | Smoking status | CART | The significant factors affecting current smoking in these age groups were larger household size, late birth rank, certain school types, low academic performance, increased second-hand smoking, and stress. |
| ( | 2012 | To examine the health-related quality of life in a cohort of individuals with irritable bowel syndrome and to explore the use of several data-mining methods to identify which socio-demographic and irritable bowel syndrome symptoms are most highly associated with impaired health-related quality of life. | Prediction | Survey | United Kingdom | 494 | Quality of life | CTA, ANN | Psychological morbidity and socio-demographic factors such as marital status and employment status also have a major influence on health-related quality of life in irritable bowel syndrome. |
| ( | 2018 | To examine the associations between 11 childhood adversities and intelligence, using targeted maximum likelihood estimation | Causal inference, Prediction | Survey | United States | 10073 | Nonverbal score from the Kaufman Brief Intelligence Test | SuperLearner algorithm | The largest associations were observed for deprivation-type experiences, including poverty and low parental education, which were related to reduced intelligence. Although lower in magnitude, threat events related to intelligence included physical abuse and witnessing domestic violence. Violence prevention and poverty-reduction measures would likely improve childhood cognitive outcomes. |
| ( | 2019 | To evaluate the efficacy of an SMS-based refill reminder solution using conversational artificial intelligence | Data curation, Prediction | Text | United States | 273356 | Medication refill request | Neural network multilayer perceptron | There are sharp differences in the likelihood to reply to a refill reminder and complete a refill request via SMS based on demographic and socioeconomic factors. We found a strong association between refill request rates and patient language, age, race/ethnicity, and SDOH levels, and these differences may contribute to health disparities and impact health outcomes in Medicare patients. |
| ( | 2019 | To demonstrate how collective use of data mining and prediction algorithms to analyze socioeconomic population health data can stand beside classical correlation analysis in routine data analysis. | Data curation | Text (Website) | United States | N/A | Hospital length of stay | Hyperbolic Dirac net | The combined use of tools and modes of use described in this paper appears capable of adding significant value to the analysis of socioeconomic health data. |
| ( | 2018 | To investigate how machine learning may add to our understanding of social determinants of health using data from the Health and Retirement Study. | Prediction | Survey | United States | N/A | Blood pressure, body mass index, waist circumference, and telomere length | Repeated linear regressions, penalized linear regressions, RFs, neural networks. | Dental visits, current smoking, self-rated health, serial-seven subtractions, probability of receiving an inheritance, probability of leaving an inheritance of at least $10,000, number of children ever born, African-American race, and gender are highly weighted predictors. |
| ( | 2018 | To characterize trauma-related falls in infants and toddlers aged 0–3 years over a 4-year period and develop a risk stratification model of causes of fall injuries. | Prediction | Survey | Israel | 2277 | Trauma-related falls in infants and toddlers | CART | The leading determinants of fall injuries in children below the age of 3 years are age, ethnicity, and low socioeconomic status. |
| ( | 2018 | To empirically test the contribution of social components versus more traditional symptom-related features in the prediction of health outcomes. | Prediction | Survey | United States | 3678 | Asthma patients at risk of a hospital revisit after an initial visit | RF, SVM | Socio-markers in the Memphis study area aggregated on the ZIP code level can be reliable predictors of pediatric asthma patients at risk of hospital revisit within a year. |
| ( | 2019 | To solve health-perspective problems by understanding socioeconomic factors which affect children's health and how they influence malaria and anemia. | Causal inference, Prediction | Survey | Global | 6935 | Malaria and anemia among children | ANNs, SVM, k nearest neighbors, RFs, naive Bayes | ANNs gave the best results, with 94.74% and 84.17% accuracy for malaria and anemia prediction, respectively. |
| ( | 2019 | To apply a deep learning approach to street images for measuring spatial distributions of income, education, unemployment, housing, living environment, health, and crime. | Data curation, Prediction | Image | United Kingdom | 525860 | Income, education, unemployment, housing, living environment, health, and crime | Deep learning (unspecified) | The application of deep learning to street imagery predicted inequalities better in some outcomes (i.e., income, living environment) than others (i.e., crime, self-reported health). |
| ( | 2018 | To estimate the associations between having an adult child migrant and depressive symptoms | Prediction | Survey | United States | 11,806 | Depressive symptoms | Targeted maximum likelihood estimation | Associations between having an adult child migrant and depressive symptoms varied by respondent gender, family size, and the location of the child migrant. |
| ( | 2020 | To evaluate the association between adult child US migration status and change in cognitive performance scores. | Prediction | Survey | United States | 5972 | Cognitive performance | Targeted maximum likelihood estimation | For women, having an adult child in the United States was associated with a steeper decline in verbal memory scores and overall cognitive performance. There were mostly null associations for men. |
| ( | 2017 | To explore the racial disparity in obesity considering not only the individual behavior but also geospatially derived environmental risk factors. | Prediction | Survey | United States | 5240 | Obesity | Multiple additive regression trees | Multiple additive regression trees (MART) performed better than generalized linear models. MART explained a larger proportion of the racial disparity in obesity. However, there remained disparities that cannot be explained by factors collected in this study. |
ANN, artificial neural network; CART, classification and regression trees; CTA, classification tree analysis; CVD, cardiovascular disease; NCD, non-communicable disease; NLP, natural language processing; RF, random forest; SVM, support vector machine.
Fig. 1. Flowchart of literature search for scoping review on the traction of machine learning in the social determinants of health.
Fig. 2. Trend in the number of included studies by themes in machine learning.