Romana Haneef, Mariken Tijhuis, Rodolphe Thiébaut, Ondřej Májek, Ivan Pristaš, Hanna Tolonen, Anne Gallay.
Abstract
BACKGROUND: The capacity to use data linkage and artificial intelligence to estimate and predict health indicators varies across European countries. Estimating health indicators from linked administrative data is challenging for several reasons: variability in data sources and data collection methods, which reduces interoperability at various levels and affects timeliness; the large number of variables available; and a lack of skills and capacity to link and analyze big data. The main objective of this study is to develop methodological guidelines for calculating population-based health indicators, to guide European countries in using linked data and/or machine learning (ML) techniques with new methods.
Keywords: Artificial intelligence; Data linkage; Guidelines; Health indicators; Linked data; Machine learning techniques; Methodological guidelines; Population health research; Statistical techniques
Year: 2022 PMID: 34983651 PMCID: PMC8725299 DOI: 10.1186/s13690-021-00770-6
Source DB: PubMed Journal: Arch Public Health ISSN: 0778-7367
Fig. 1 Flow diagram of search strategy used to identify studies using data linkage and/or machine learning techniques for health surveillance and health care performance to develop the methodological guidelines to estimate population-based indicators, a study performed under InfAct project, May 2021
Methodological guidelines using linked data and/or machine learning techniques to estimate population-based indicators, a study performed under InfAct project, May 2021
| Item number | Checklist item | Description | |
|---|---|---|---|
| | | Define the rationale and objective of the study by adopting PICO criteria to research studies focused on population health. | ☐ |
| | | Select the appropriate study design that could best address the proposed research question. | ☐ |
| | | Select the required linked data sources to answer the proposed research question. | ☐ |
| 4.1 | | Define the inclusion and exclusion criteria of the study population by taking into account age, sex and period of data collection. | ☐ |
| 4.2 | Sample size | State the significance level of alpha and power based on the defined research question to calculate the sample size. | ☐ |
| 5.1 | Main outcomes | Define the main outcomes by taking into account study population, health condition to be studied, exposure (intervention/risk factors, if relevant) and defined period of study. | ☐ |
| 5.2 | Level of estimation | Describe the level of estimation of health outcomes at the lowest possible granularity level (i.e., at community, metropolitan, departmental or regional levels). | ☐ |
| 6.1 | A. Data extraction | Extract the required input variables from the linked data set into a single file or spreadsheet that can be converted to the format required by the statistical software used for data analysis. | ☐ |
| 6.2 | Coding of variables | Code the input variables that are common to the different linked data sets as continuous, categorical or binary variables, as required for the data analysis. | ☐ |
| | B. Data preparation to develop and apply an ML algorithm | | |
| 6.3 | | Identify and define the target groups for a given defined time window based on the outcome of interest. | ☐ |
| 6.4 | | Code the input variables that are common to the different linked data sets as continuous, categorical or binary variables for a given defined time window. | ☐ |
| 6.5 | | Split the final data set into an 80% training data set and a 20% test data set. | ☐ |
| 7.1 | A. Variables selection | Select variables after the removal of all variables with a variance equal to zero. | ☐ |
| 7.2 | | Estimate the ReliefExp score based on the relevance of each variable to the outcome of interest. | ☐ |
| | B. Statistical techniques | | |
| 7.3 | I. Classical statistical techniques | Select an appropriate statistical technique to address the proposed research question according to the study objectives and the available data. | ☐ |
| | II. ML-techniques | | |
| 7.4 | | Train various models and compare their performance in terms of the area under the ROC curve (AUC; binary classifiers only). | ☐ |
| 7.5 | | Validate the model performance using k-fold cross-validation on the training data set first, and then assess the model performance on the test data set. | ☐ |
| 7.6 | | Select the final model based on specific performance metrics including sensitivity, specificity, PPV*, NPV*, F1-score and kappa. | ☐ |
| | C. Sensitivity/uncertainty analysis | | |
| 7.7 | | Perform a sensitivity analysis to identify the most influential parameters for a given output of a model. | ☐ |
| 7.8 | | Select an appropriate method to perform the sensitivity analysis. | ☐ |
| 7.9 | | Calculate the uncertainty in estimates using 95% CI* and describe the sources of uncertainty (if relevant). | ☐ |
| | D. Potential issues during data analysis | | |
| | I. Missing data | | |
| 7.10 | | Identify the missing data in the given data set. | ☐ |
| 7.11 | | Apply an appropriate technique for the imputation of missing values in the given data set. | ☐ |
| 7.12 | II. Imbalanced target group in a given dataset | Apply an appropriate technique to create a balanced data set either using down sampling or over sampling approach. | ☐ |
| 7.13 | III. Bias and variance tradeoff | Find the most generalizable model to keep the balance between bias and variance. | ☐ |
| | | Describe the study limitations related to data sources (i.e., linkage, quality, access and privacy), study design, study population and statistical method used (if relevant). | ☐ |
*PPV: positive predictive value; NPV: negative predictive value; CI: confidence interval
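
Three of the checklist steps above are mechanical enough to sketch in code: removing zero-variance variables (item 7.1), splitting the final data set into 80% training and 20% test data, and computing the performance metrics named in item 7.6. The Python sketch below uses only the standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-per-record layout, and the fixed random seed are all hypothetical choices made for this example.

```python
import random
from statistics import pvariance


def drop_zero_variance(rows, feature_names):
    """Return the features whose variance across all rows is non-zero (item 7.1).

    `rows` is assumed to be a list of dicts holding numeric values.
    """
    return [f for f in feature_names if pvariance([r[f] for r in rows]) > 0]


def split_80_20(rows, seed=0):
    """Shuffle reproducibly, then split into 80% training and 20% test rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # the seed is an arbitrary assumption
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def _ratio(num, den):
    """Divide safely; an empty denominator yields NaN instead of an error."""
    return num / den if den else float("nan")


def binary_metrics(y_true, y_pred):
    """Compute the item 7.6 metrics from a 2x2 confusion matrix (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    n = tp + tn + fp + fn

    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)
    npv = _ratio(tn, tn + fn)
    f1 = _ratio(2 * tp, 2 * tp + fp + fn)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = _ratio(tp + tn, n)
    expected = _ratio((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    kappa = _ratio(observed - expected, 1 - expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Toy records invented for illustration; "site" is constant, so it is dropped.
    records = [{"age": 30 + i, "site": 1, "label": i % 2} for i in range(10)]
    print(drop_zero_variance(records, ["age", "site"]))
    train, test = split_80_20(records)
    print(len(train), len(test))
    print(binary_metrics([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                         [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]))
```

In practice a statistical package would normally supply these steps along with k-fold cross-validation and AUC estimation; writing them out by hand only makes explicit what each checklist item asks for.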