Hansi Zhang, Yi Guo, Mattia Prosperi, Jiang Bian.
Abstract
BACKGROUND: To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs because integrated datasets that contain multi-level, multi-domain RFs are lacking. Further, the lack of consensus and proper guidance on how to systematically identify RFs increases the difficulty of selecting RFs from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, because mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, which threatens transparency and reproducibility.
Keywords: Cancer outcomes research; Integrative data analysis; Ontology; Reporting guideline
Year: 2020 PMID: 33317497 PMCID: PMC7734720 DOI: 10.1186/s12911-020-01270-3
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Review of relevant reporting guidelines in the EQUATOR network
Summary of reporting guidelines based on the data source domains and levels guided by the NIMHD framework
| Domain of influences | | Level of influences | Guidelines |
|---|---|---|---|
| Not specified^a | | Individual level | [ |
| | | Societal level | [ |
| Biological data | Genetics data | Individual level | [ |
| | Immunogenomic data | | [ |
| | Molecular epidemiological data | | [ |
| | Drug safety data from biologics registers | | [ |
| Behavioral data | Crime, violence data | Individual level | [ |
| | Dietary or nutritional data | | [ |
| | Medication adherence | | [ |
| Sociocultural environment | Environmental data | Individual/Community/Societal/Interpersonal | [ |
| Physical environment | | | |
| Healthcare system | Administrative data, electronic health records, claims data, patient or disease registries, quality or safety surveillance databases | Individual level | [ |
^a When reporting data sources or RF variables, these studies did not specify a data domain
Fig. 2 An overview of the reporting guideline for RF variable and data source selection and data integration
ATTEST reporting guideline checklist
| | Item No | Recommendation | Page |
|---|---|---|---|
| Background/rationale | 1 | Explain the scientific background and rationale for the study being reported in one or two sentences | |
| Prespecified hypotheses | 2 | State prespecified hypotheses in one or two sentences | |
| Data sources | 3a | Describe the time coverage | |
| | 3b | Describe the geographic coverage | |
| | 3c | Describe the sample size | |
| | 3d | Describe the demographic distribution | |
| | 3e | Describe the cohort criteria | |
| | 3f | Describe the sources of biases (e.g., sample bias) | |
| | 3g | Describe the data collection approach | |
| Dependent variables | 4a | State the variable definition and variable type (e.g., primary outcome variable, secondary outcome variable) | |
| | 4b | State the data source of dependent variable | |
| | 4c | State the data type (e.g., numerical, categorical, date-time) of dependent variable | |
| | 4d | State descriptive statistics (e.g., min, max, median, value range, percentile) of dependent variable | |
| | 4e | State the NIMHD^a domains and levels of dependent variable | |
| Independent variables | 5a | State the variable definition and variable type (e.g., primary predictor, secondary predictor) | |
| | 5b | State the data source of independent variable | |
| | 5c | State the data type (e.g., numerical, categorical, date-time) of independent variable | |
| | 5d | State descriptive statistics (e.g., min, max, median, value range, percentile) of independent variable | |
| | 5e | State the NIMHD domains and levels of independent variable | |
| Controlled variables | 6a | State the variable type (e.g., numerical, categorical) of controlled variable | |
| | 6b | State the data source of controlled variable | |
| | 6c | State descriptive statistics (e.g., min, max, median, value range, percentile) of controlled variable | |
| | 6d | State the NIMHD domains and levels of controlled variable | |
| Missing data | 7a | For each data source, describe whether any required or expected variable is not present | |
| | 7b | For each variable, describe the method used to handle missing data | |
| | 7c | For each variable, describe the missing rate | |
| Data processing | 8a | Data extraction: for each variable, describe how to process the raw data source to extract the variable | |
| | 8b | Data cleaning: for each variable, describe the method used to detect and correct (or remove) incorrect records, missing values, or outliers | |
| Integration strategy | 9 | Describe the integration strategy for each variable: 1) integrate with variables from the same level, 2) integrate with variables from different levels, and 3) creation of additional computed elements | |
| Integration algorithm | 10 | For each variable, describe the algorithm used to integrate it with variables from other data sources | |
| Variable validation | 11 | For each variable, describe the data validation rule for the selected variable. The rule should identify both the variable and the validation algorithm | |
| Integrated variable | 12 | Describe the variable after integration and basic descriptive statistics (e.g., min, max, median, value range, percentile) | |
Please document the items for each data source and variable separately
^a National Institute on Minority Health and Health Disparities (NIMHD)
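The descriptive statistics asked for in items 4d, 5d, and 6c and the missing rate in item 7c can be computed in a few lines. The following is a minimal sketch using only the Python standard library; the function name `describe_variable`, the choice of quartiles as the reported percentiles, and the sample values are assumptions made for illustration and do not come from the guideline.

```python
import statistics
from typing import Dict, Optional, Sequence


def describe_variable(values: Sequence[Optional[float]]) -> Dict[str, object]:
    """Summarize one numeric variable; None marks a missing record."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values to compute quartiles")
    # Quartile cut points; the 'inclusive' method treats the data as the
    # whole population rather than a sample from a larger one.
    q1, q2, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return {
        "n_records": len(values),
        "missing_rate": (len(values) - len(observed)) / len(values),  # item 7c
        "min": min(observed),
        "max": max(observed),
        "median": statistics.median(observed),
        "value_range": max(observed) - min(observed),
        "percentiles": {"p25": q1, "p50": q2, "p75": q3},
    }


# Hypothetical survival times in months, with one missing record.
summary = describe_variable([12.0, 30.0, None, 18.0, 24.0, 6.0])
print(summary)
```

Reporting the missing rate alongside the statistics makes it clear how many records the summary is actually based on.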
Fig. 3 The class hierarchy of OD-ATTEST
The classes and properties reused or created for OD-ATTEST
| | Label | Internationalized Resource Identifiers (IRIs) | Reference ontology |
|---|---|---|---|
| Classes | objective | iao:0000005 | IAO^b |
| | data source | iao:0000100 | |
| | measurement datum | iao:0000109 | |
| | dependent variable | obi:0000751 | OBI^c |
| | independent variable | obi:0000750 | |
| | controlled variable | obi:0000785 | |
| | data processing | obi:0200000 | |
| | study | ncit:C63536 | NCIt^d |
| | hypothesis | ncit:C28362 | |
| | rationale | ncit:C80263 | |
| | primary outcome | ncit:C142644 | |
| | secondary outcome | ncit:C142680 | |
| | sample size | ncit:C53190 | |
| | missing data | ncit:C142610 | |
| | data validation | ncit:C142500 | |
| | data type | ncit:C42645 | |
| | data collection method | ncit:C103159 | |
| | data analysis | sio:001051 | SIO^e |
| | minimum value | stato:0000150 | STATO^f |
| | maximum value | stato:0000151 | |
| | median | stato:0000574 | |
| | mean | stato:0000573 | |
| | value range | stato:0000035 | |
| | percentile | stato:0000293 | |
| | data distribution | stato:0000161 | |
| | statistical sampling | stato:0000502 | |
| | outlier | stato:0000036 | |
| | primary predictor | od-attest:000015 | OD-ATTEST^g |
| | secondary predictor | od-attest:000016 | |
| | demographic distribution | od-attest:000093 | |
| | outcome variable data source | od-attest:000019 | |
| | predictor data source | od-attest:000094 | |
| | cohort criteria | od-attest:000008 | |
| | descriptive statistic | od-attest:000012 | |
| | missing rate | od-attest:000068 | |
| | data source time coverage | od-attest:000023 | |
| | data source geographic coverage | od-attest:000024 | |
| | sources of bias | od-attest:000051 | |
| | data integration | od-attest:000052 | |
| | data extraction | od-attest:000054 | |
| | data cleaning | od-attest:000055 | |
| | integration strategy | od-attest:000056 | |
| | integrate variables from same level | od-attest:000057 | |
| | integrate variables from different levels | od-attest:000058 | |
| | creation of additional elements | od-attest:000059 | |
| | integration algorithm | od-attest:000060 | |
| | validation strategy | od-attest:000068 | |
| | integrated variable | od-attest:000096 | |
| Properties | is determined by | od-attest:000097 | OD-ATTEST |
| | has rationale | od-attest:000098 | |
| | has objective | od-attest:000099 | |
| | has data source | od-attest:000100 | |
| | has cohort criteria | od-attest:000101 | |
| | has demographic distribution | od-attest:000102 | |
| | has sources of bias | od-attest:000103 | |
| | has controlled variable | od-attest:000104 | |
| | has independent variable | od-attest:000105 | |
| | has dependent variable | od-attest:000106 | |
| | has data type | od-attest:000107 | |
| | has descriptive statistics | od-attest:000108 | |
| | has NIMHD level | od-attest:000109 | |
| | has NIMHD domain | od-attest:000110 | |
| | has data collection approach | od-attest:000111 | |
| | has sample size | od-attest:000112 | |
| | has missing data | od-attest:000113 | |
| | has data integration | od-attest:000114 | |
| | has data processing | od-attest:000115 | |
| | has data validation | od-attest:000116 | |
| | has integration strategy | od-attest:000117 | |
| | extracted from | od-attest:000118 | |
| | has description | od-attest:000119 | |
| | has time coverage | od-attest:000120 | |
| | has geographic coverage | od-attest:000121 | |
^a Prefix: iao:
sio:
^b Information Artifact Ontology
^c Ontology for Biomedical Investigations
^d National Cancer Institute Thesaurus
^e Semanticscience Integrated Ontology
^f Statistics Ontology
^g Ontology for the Documentation of Variable and Data Source Selection and Integration Process
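A table of reused and newly created terms like the one above is easy to sanity-check programmatically. The sketch below is illustrative only: it takes a handful of (label, identifier) pairs copied from the table, groups them by prefix, and reports any identifier assigned to more than one label. As listed, `od-attest:000068` appears for both "missing rate" and "validation strategy". The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A few (label, prefixed identifier) pairs copied from the table; not the full list.
TERMS: List[Tuple[str, str]] = [
    ("data source", "iao:0000100"),
    ("dependent variable", "obi:0000751"),
    ("missing data", "ncit:C142610"),
    ("median", "stato:0000574"),
    ("missing rate", "od-attest:000068"),
    ("integration algorithm", "od-attest:000060"),
    ("validation strategy", "od-attest:000068"),
]


def group_by_prefix(terms: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each prefix (the part before the first colon) to its labels."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for label, identifier in terms:
        prefix = identifier.split(":", 1)[0]
        groups[prefix].append(label)
    return dict(groups)


def find_shared_identifiers(terms: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return identifiers that are assigned to more than one label."""
    labels_by_id: Dict[str, List[str]] = defaultdict(list)
    for label, identifier in terms:
        labels_by_id[identifier].append(label)
    return {i: labels for i, labels in labels_by_id.items() if len(labels) > 1}


by_prefix = group_by_prefix(TERMS)
shared = find_shared_identifiers(TERMS)
print(by_prefix["od-attest"])
print(shared)
```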
An example of two previous mIDA case studies annotated using the ATTEST checklist
| | Item No | Recommendation | Page No Study (1) [ | Page No Study (2) [ |
|---|---|---|---|---|
| Background/rationale | 1 | Explain the scientific background and rationale for the study being reported in one or two sentences | Page 1, section “ | Page 1, section “ |
| Prespecified hypotheses | 2 | State prespecified hypotheses in one or two sentences | Page 2, section “ | N/A |
| Data source | 3a | Describe the time coverage | | |
| | 3b | Describe the geographic coverage | | |
| | 3c | Describe the sample size | | |
| | 3d | Describe the demographic distribution | N/A | |
| | 3e | Describe the cohort criteria | | |
| | 3f | Describe the sources of bias | N/A | N/A |
| | 3g | Describe the data collection approach | N/A | |
| Dependent variable | 4a | State the variable definition and variable type (e.g., primary outcome variable, secondary outcome variable) | | |
| | 4b | State the data source of dependent variable | | |
| | 4c | State the data type (e.g., numerical, categorical, date-time) of dependent variable | | |
| | 4d | State descriptive statistics (e.g., min, max, median, value range, percentile) of dependent variable | Cancer survival: N/A | |
| | 4e | State the NIMHD domain and levels of dependent variable | | |
| Independent variable | 5a | State the variable definition and variable type (e.g., primary predictor, secondary predictor) | | |
| | 5b | State the data type (e.g., numerical, categorical) of independent variable | | |
| | 5c | State the data source of independent variable | Page 5, Table 1 | |
| | 5d | State descriptive statistics (e.g., min, max, median, value range, percentile) of independent variable | Page 4, Table 1 | N/A |
| | 5e | State the NIMHD domain and levels of independent variable | Page 5, Table 1 | |
| Controlled variable | 6a | State the controlled variable and its variable type (e.g., numerical, categorical) | N/A | |
| | 6b | State the data source of controlled variable | Page 2, section “ | N/A |
| | 6c | State descriptive statistics (e.g., min, max, median, value range, percentile) of controlled variable | Page 2, section “ | N/A |
| | 6d | State the NIMHD domain and levels of controlled variable | Page 2, section “ | N/A |
| Missing data | 7a | For each data source, describe whether any required or expected variable is not present | N/A | N/A |
| | 7b | For each variable, describe the method used to handle missing data | N/A | N/A |
| | 7c | For each variable, describe the missing rate | N/A | N/A |
| Data processing | 8a | Data extraction: for each variable, describe how to process the raw data source to extract the variable | N/A | |
| | 8b | Data cleaning: for each variable, describe the method used to detect and correct (or remove) incorrect records, missing values, or outliers | N/A | N/A |
| Integration strategy | 9 | Describe the integration strategy for each variable: 1) integrate with variables from the same level, 2) integrate with variables from different levels, and 3) creation of additional computed elements | | |
| Integration algorithms | 10 | For each variable, describe the algorithm used to integrate it with variables from other data sources | N/A | |
| Variable validation | 11 | For each variable, describe the data validation rule for the selected variable. The rule should identify both the variable and the validation algorithm | N/A | |
| Integrated variable | 12 | Describe the variable after integration and basic descriptive statistics (e.g., min, max, median, value range, percentile) | N/A | Page 18, Table 4 |
FCDS: Florida Cancer Data System
ATSDR: Agency for Toxic Substances and Disease Registry
BRFSS: Behavioral Risk Factor Surveillance System
^a If the reported items for all variables or data sources are described in the same place, the page/section/table information can be listed once. For the integration-related items, we present only variables that have the information (N/A entries are not shown in the table)
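The second integration strategy in the checklist, integrating variables from different levels, can be pictured as attaching a community-level value to each individual-level record through a shared key. The sketch below is only an illustration under stated assumptions: the linkage key (county), the field names, and all values are hypothetical and are not taken from either case study.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IndividualRecord:
    """Individual-level record; all field names are hypothetical."""
    person_id: str
    county: str
    survival_months: float


def integrate_across_levels(
    records: List[IndividualRecord],
    community_values: Dict[str, float],
) -> List[Dict[str, object]]:
    """Attach a community-level value to each record via its county.

    Records whose county has no community-level value get None, so the
    gap stays visible and can be reported as missing data.
    """
    rows: List[Dict[str, object]] = []
    for record in records:
        value: Optional[float] = community_values.get(record.county)
        rows.append({
            "person_id": record.person_id,
            "county": record.county,
            "survival_months": record.survival_months,
            "county_smoking_rate": value,
        })
    return rows


def unmatched_rate(rows: List[Dict[str, object]], column: str) -> float:
    """Share of integrated rows with no value in the given column."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[column] is None) / len(rows)


records = [
    IndividualRecord("p1", "Alpha", 14.0),
    IndividualRecord("p2", "Beta", 30.5),
    IndividualRecord("p3", "Gamma", 8.0),
]
smoking_rate_by_county = {"Alpha": 0.18, "Beta": 0.22}  # hypothetical values

rows = integrate_across_levels(records, smoking_rate_by_county)
print(rows)
print(unmatched_rate(rows, "county_smoking_rate"))
```

Keeping unmatched records with an explicit `None`, rather than dropping them, lets the missing rate of the integrated variable be documented rather than hidden by the join.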
Fig. 4 An OD-ATTEST-annotated report generated based on a mIDA case study
An example of annotated semantic triples represented in RDF format using Turtle syntax
@prefix od-attest: <
@prefix ncit: <
@prefix rdfs: <
@prefix xsd: <

od-attest:30066664 rdf:type ncit:study;
    od-attest:has rationale od-attest:30066664/rationale;
    od-attest:has objective od-attest:30066664/objective.
od-attest:30066664/rationale rdf:type ncit:rationale;
    od-attest:has description "Extant cancer survival analyses have..."^^xsd:string.
od-attest:30066664/objective rdf:type ncit:objective;
    od-attest:has description "built a semantic data integration …"^^xsd:string.
^a Resource Description Framework
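Triples of this shape could also be produced programmatically. The following is a minimal sketch using plain string formatting rather than an RDF library, under several assumptions that are not from the source: the `od-attest` and `ncit` namespace IRIs are `example.org` placeholders (the listing does not give them), property names are written with underscores because Turtle prefixed names cannot contain spaces, and subjects are emitted as full IRIs in angle brackets. The function name `study_to_turtle` is hypothetical.

```python
from typing import Dict

# Namespace IRIs. The od-attest and ncit entries are placeholders (assumptions);
# the rdf and xsd entries are the standard W3C namespaces.
PREFIXES: Dict[str, str] = {
    "od-attest": "http://example.org/od-attest/",
    "ncit": "http://example.org/ncit/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes, and newlines for a Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def study_to_turtle(study_id: str, rationale: str, objective: str) -> str:
    """Serialize a study with its rationale and objective as Turtle text."""
    base = PREFIXES["od-attest"] + study_id
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in PREFIXES.items()]
    lines.append("")
    lines.append(f"<{base}> rdf:type ncit:study ;")
    lines.append(f"    od-attest:has_rationale <{base}/rationale> ;")
    lines.append(f"    od-attest:has_objective <{base}/objective> .")
    for part, text in (("rationale", rationale), ("objective", objective)):
        lines.append(f"<{base}/{part}> rdf:type ncit:{part} ;")
        lines.append(
            f'    od-attest:has_description "{escape_literal(text)}"^^xsd:string .'
        )
    return "\n".join(lines)


turtle = study_to_turtle(
    "30066664",
    "Extant cancer survival analyses have...",
    "built a semantic data integration ...",
)
print(turtle)
```

Building the text by hand keeps the sketch dependency-free; in practice an RDF library would handle escaping and validation more robustly.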