An ontology-based documentation of data discovery and integration process in cancer outcomes research.

Hansi Zhang^1, Yi Guo^1,2, Mattia Prosperi^3, Jiang Bian^4,5.

Abstract

BACKGROUND: To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on systematically identifying RFs increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, which threatens transparency and reproducibility.
METHODS: Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies.
RESULTS: We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection, data source selection, and data integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST.
CONCLUSION: Our ontology-based reporting guideline addresses some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) standardized documentation of the data selection and integration processes powered by an ontology, which enables sharing of mIDA study reports among researchers.

Keywords: Cancer outcomes research; Integrative data analysis; Ontology; Reporting guideline

Year: 2020 PMID: 33317497 PMCID: PMC7734720 DOI: 10.1186/s12911-020-01270-3

Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796

Background

Cancer is a major disease burden worldwide [1]. It is the second leading cause of death in the United States (US), where about 1 in 4 deaths is due to cancer [2]. For 2019, the American Cancer Society (ACS) estimated 1,762,450 new cancer cases and 606,880 cancer deaths in the US [2]. The lifetime probabilities of being diagnosed with cancer are 39.3% and 37.4% for males and females, respectively [3]. However, the risk factors (RFs) behind these high cancer incidence and mortality rates are still not fully understood. Over the past two decades, increasing efforts have been directed toward identifying and understanding cancer RFs using various methods, such as genome-wide association studies (GWAS) [4, 5] or more recent machine learning-based approaches [6, 7]. Nevertheless, emerging evidence suggests that it is the interaction among many risk factors, rather than a single cause, that affects the risk of cancer and cancer outcomes [8]. Further, the RFs involved span different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual level, interpersonal level, and community level). However, there is not yet agreement in the cancer research community on how these multi-level cancer RFs interact with each other. To reach such an understanding, the first and most crucial step is to gain a comprehensive view of potential multi-level RFs associated with various cancer outcomes such as the stage at diagnosis (the most important prognostic factor) and survival.
We surveyed existing research on RFs for late-stage cancer diagnosis and poor survival and found that current studies of RFs for cancer outcomes are mostly single-level analyses that rely mostly on individual patient-level data. For instance, Andrew et al. assessed individual patient characteristics (e.g., age, gender, family history) and lifestyle factors (e.g., education, insurance, and socioeconomic status) to study their association with late-stage colorectal cancer [9]. These individual-level RFs have also been reported for other major types of cancer such as breast and cervical cancers [10-13]. Further, prior studies of cancer RFs often analyzed data from only a single source, such as SEER [14], SEER-Medicare [15], or a state or hospital cancer registry [16]. In these studies, the complex interplay among RFs at different levels is often ignored (e.g., county-level smoking rate vs. individual smoking behavior). These single-level RF analyses (1) lead to biased effect estimates of RFs due to potential confounding from omitted factors, and (2) omit critical cross-level RF interactions, such as race by residence, that could inform multi-level intervention design.
Advances in technology have created new ways to determine and measure disease risk factors across different levels (e.g., from advancements in genome sequencing for genetic markers to better sensors that produce more accurate estimates of environmental pollutants). The availability of such abundant data online in electronic formats enables researchers to pool data on an unprecedented scale and offers a great opportunity to thoroughly examine multi-level RFs in a multi-level integrative data analysis (mIDA), so that confounding effects and cross-level interactions can be studied. However, researchers face significant barriers in doing so, especially because there is a lack of consensus and proper guidance to help them systematically think about and discover these variables from heterogeneous sources.
In 2017, the National Institute on Minority Health and Health Disparities (NIMHD) of the National Institutes of Health (NIH) proposed a research framework [17], an extension of the well-known social ecological model [18], to help investigators systematically study health disparities. The NIMHD framework recognizes that individuals are embedded within the larger social system and constrained by the physical environment they live in. Within this framework, cancer outcomes are influenced by RFs from different levels (i.e., individual, interpersonal, community, and societal) and multiple domains (i.e., biological, behavioral, physical/built environment, sociocultural environment, and healthcare system). In this work, we adopted the NIMHD framework as the guiding theory for risk factor discovery and data source selection.
Further, mIDA for cancer outcomes research requires the integration of data from multiple sources. However, data integration processes in the very limited number of existing mIDA studies [19, 20] are inconsistently performed and poorly documented, which threatens transparency and reproducibility [21, 22]. The data integration processes are often summarized in one or two sentences without explicit documentation of the steps. For example, Guo et al. explored the impact of the relationships among socioeconomic status, individual smoking status, and community-level smoking rate on pharyngeal cancer survival [20]. The multi-level risk factors were obtained and integrated from three different data sources (i.e., Florida Cancer Data System [FCDS], U.S. Census, and Behavioral Risk Factor Surveillance System [BRFSS]), as mentioned in the abstract. However, the rest of the paper does not describe how the individual-level records from FCDS were linked with the county-level smoking rate from BRFSS and the census tract-level poverty rate from the U.S. Census.
Even though the integration process might be as simple as linking these multi-level variables through a geographic code (e.g., county code), it still needs to be standardized and explicitly documented to avoid ambiguity. For example, the paper stated that “regional smoking was measured as the average percentage of adult current smokers at the county level between 1996 and 2010”, and readers might be able to make an educated guess that the regional smoking rates were more likely generated from the BRFSS data than from the FCDS data; however, explicit documentation is needed, as both BRFSS and FCDS data contain individual smoking status. Keegan et al. explored whether breast cancer survival patterns are influenced by factors such as nativity (individual level) and neighborhood socioeconomic status (community level). Similarly, they summarized the integration process in one sentence, stating that each patient was assigned a neighborhood socioeconomic status variable based on census block groups. However, details such as the variable names in each data source, or whether the original geographic variables required pre-processing (e.g., deriving census tracts from zip codes), are not clearly documented [19]. Explicit documentation of these variable selection and data integration processes will not only help readers better understand the study results and benefit other researchers who want to replicate the studies but, more importantly, make it possible for machines to understand and replicate the steps (when the documentation is encoded in a computable format such as an ontology). Further, even though these mIDA studies did not emphasize the need for data integration or integrated datasets, the fact that they could investigate only a handful of variables at a time indicates that support for data integration is lacking but needed.
Even studies on building frameworks or platforms to support or automate the data integration process (especially those related to creating integrated datasets to support cancer research) often ignored the need to document the integration steps to guarantee the transparency and reproducibility of their approaches. For example, semantic data integration approaches, which connect variables across different databases at the semantic level by mapping them to standardized concepts in a global schema (often a global ontology), have been proposed in recent years to support generating integrated datasets for cancer research [23-25]. However, none of these studies mentioned the need for standardizing and documenting their integration steps; for example, most of them did not even discuss the rationale for selecting the specific data sources to integrate. Nevertheless, when reporting mIDA studies, it is critical to document the steps that were followed to select, integrate, and process the data so that others can repeat the same steps and reproduce the findings. To address the challenges above, in this paper, we first developed a reporting guideline to guide and document the RF variable selection, data source selection, and data integration process. The guideline is informed by (1) the NIMHD research framework, which provides guidance and promotes structured thinking on identifying multi-level cancer RFs; and (2) a review of existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network [26]. Then, we proposed an ontology-based approach to annotate and document the RF selection and data integration process in mIDA studies based on the reporting guideline we developed.
To do so, we developed the Ontology for the Documentation of vAriable selecTion and daTa sourcE Selection and inTegration process (OD-ATTEST) so that the RF selection and data integration report can be (1) explicitly modeled with a shared, controlled vocabulary, (2) understandable to humans and computable by machines, and (3) adaptive to changes when the reporting process is refined. In our prior work [27], we proposed a preliminary reporting guideline for RF variable and data source selection based on our own experience of pooling multi-level RFs from different data sources to support mIDAs of cancer survival [28, 29]. In this extended journal paper, we significantly expanded our ontology-based reporting guideline, ATTEST (vAriable selecTion and daTa sourcE Selection and inTegration), in four ways: (1) we conducted a systematic search of existing reporting guidelines from the EQUATOR network to extract reporting elements relevant to variable selection and data integration; (2) we updated our reporting guideline based on the result of the systematic review to include new items regarding data integration (e.g., data processing, data integration strategy, and data validation) as well as variable and data source selection; (3) we completed building OD-ATTEST following best practices in ontology development to provide a formal representation of the reporting guideline with standardized and controlled vocabularies; and (4) we provided an OD-ATTEST-annotated report generated from a prior mIDA study to represent the annotated items and their relationships in the reporting guideline.

Methods

Development of a reporting guideline for risk factor selection, data source selection, and data integration

To develop the reporting guideline, we started by summarizing our previous studies, in which we assessed the effect of data integration on the predictive ability of cancer survival models [28] and created a semantic data integration framework to pool multi-level RFs from heterogeneous data sources to support mIDA [29]. In those studies, we went through the process of RF selection, data source selection, and data integration. To ensure the reproducibility of these studies, a number of intermediate steps need to be documented, as detailed in our previous paper [27]. For example, both rural-urban commuting area (RUCA) codes [30] and the National Center for Health Statistics (NCHS) urban-rural classification scheme [31] are often used to represent a geographic area’s rurality status. The difference between the two lies in the classification granularity: RUCA classifies U.S. census tracts (i.e., ten levels from rural to metropolitan), while the NCHS urban-rural classification scheme classifies U.S. counties (i.e., a hierarchical definition with six levels). Thus, we need to clearly document which rurality definition we used in the data analysis, since different representations of the same variable (i.e., rurality in this case) have different impacts on model results, as shown in our prior work [28]. Further, before integrating RFs from data sources at different levels (e.g., census tract level vs. county level) that covered different time periods, we assumed that area-level characteristics (e.g., social vulnerability index) derived from 2000 U.S. Census data were applicable across different time periods (as our individual-level data from FCDS covered 1996 to 2010). These experiences suggest that we must document such data integration nuances so that other researchers can repeat our data integration and data processing pipeline and reproduce the same results (e.g., the integrated dataset).
In sum, three key items need to be documented: (1) RF selection (e.g., individual- vs. county-level variables), (2) data source selection (e.g., individual-level data from FCDS and contextual-level data from the US Census), and (3) data integration and data preprocessing strategies. Through discussions with expert biostatisticians, data analysts, and cancer outcomes researchers, we summarized the typical mIDA process and found that there is little structured thinking when investigators select and identify risk factors and their data sources. We thus propose to use the NIMHD research framework to provide theory-driven guidance for multi-level and multi-domain RF and data source selection. The NIMHD framework was originally designed to depict a wide range of health determinants (i.e., RFs from different levels and domains) relevant to understanding and addressing minority health and health disparities. The goal of using the NIMHD framework is to help investigators think about and identify relevant RFs and corresponding data sources in their IDA studies in a structured and comprehensive way.
To build upon established reporting guidelines, we searched for and identified relevant reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network, a comprehensive searchable database of guidelines for health research reporting. The EQUATOR network categorizes health research into 13 study types (e.g., quantitative studies, experimental studies, and observational studies), of which reporting guidelines for observational studies are most relevant to our mIDA use case. To further identify relevant reporting guidelines in EQUATOR, we developed a set of screening criteria to determine whether a reporting guideline contains information that can be used to improve our ATTEST reporting guideline: (1) the reporting guideline is designed for secondary data analysis studies; (2) the reporting guideline contains at least one of the following sections: data, outcomes (variables), and methods, as these sections will contain information related to variable selection, data source selection, and data integration methods; (3) the reported data within the guideline must be health related; and (4) the use of the guideline (or at least part of it) can be extended to cancer outcomes research, especially the parts related to variable selection, data source selection, and data integration.
We reviewed all reporting guidelines designed for observational studies and eliminated guidelines that do not involve the tasks of RF and data source selection and integration. We then identified all reporting guidelines that contain the following sections: data, outcomes (variables), and methods. For those that do not have sections clearly marked, we manually reviewed the entire reporting guideline to identify whether it discussed one of the three aspects. We then extracted reporting items in the selected reporting guidelines that are relevant to RF selection, data source selection, and data integration. Two reviewers (HZ and JB) independently extracted these reporting items of interest and resolved conflicts with a third reviewer (YG). We further analyzed the extracted reporting items and discussed them with experts (i.e., biostatisticians, data analysts, and cancer outcomes researchers) to summarize the items needed in our reporting guideline, especially those related to the data integration process.
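The screening criteria above can be read as a conjunction of four predicates. The sketch below is a minimal illustration in Python, not the procedure the reviewers followed (screening was done manually); the `Guideline` record, its field names, and the two example entries are hypothetical.

```python
from dataclasses import dataclass, field

# Sections that, per the screening criteria, signal relevant content.
RELEVANT_SECTIONS = {"data", "outcomes", "methods"}


@dataclass
class Guideline:
    """Hypothetical record describing one reporting guideline."""
    name: str
    secondary_data_analysis: bool                 # criterion 1
    sections: set = field(default_factory=set)    # criterion 2
    health_related: bool = False                  # criterion 3
    extensible_to_cancer_outcomes: bool = False   # criterion 4


def passes_screening(g: Guideline) -> bool:
    """Return True only if all four screening criteria hold."""
    return (
        g.secondary_data_analysis
        and bool(g.sections & RELEVANT_SECTIONS)
        and g.health_related
        and g.extensible_to_cancer_outcomes
    )


# Made-up examples, for illustration only.
candidates = [
    Guideline("guideline A", True, {"data", "methods"}, True, True),
    Guideline("guideline B", True, {"discussion"}, True, True),
]
retained = [g.name for g in candidates if passes_screening(g)]
print(retained)
```

Encoding each criterion as an explicit field makes the reason for every exclusion traceable, which is in the same spirit as the documentation the guideline asks for.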

Construction of an ontology for the documentation of variable and data source selection and integration process (OD-ATTEST)

The ATTEST reporting guideline we developed is used to guide the variable and data source selection and integration process in cancer outcomes research. We propose to use an ontology-based approach to annotate and document the items in the reporting guideline. The goal of the OD-ATTEST ontology is to standardize the terminology used in documenting the selection and integration steps of RF variables and data sources to support mIDA. OD-ATTEST was developed using Protégé 5. We used Basic Formal Ontology (BFO) [32] as the upper-level ontology. We first adopted a top-down approach to enumerate important entities (classes and relations) based on the reporting guideline we developed. Following best practices, we reviewed existing widely accepted ontologies using the National Center for Biomedical Ontology (NCBO) BioPortal [33] to find entities that can be reused in OD-ATTEST. Then, we started with the definitions of the most general concepts in the domain and subsequently specialized them to develop the class hierarchy. We also followed a bottom-up process, in which we started with the definitions of the most specific classes and then grouped similar classes into more general classes. For example, we started by identifying the most specific classes (i.e., the leaf nodes in the ontology hierarchy) for “median”, “maximum value”, “minimum value”, and “percentile”, and then created a common superclass for these classes named “descriptive statistic”. We also examined how the reporting items are associated with each other (e.g., “sample size” is determined by “primary outcome”) and determined what additional classes and relations were needed to fully represent these entities in OD-ATTEST.
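The bottom-up grouping in the example above can be pictured as a set of child-to-parent links. The following Python sketch is only an illustration of that idea; the dictionary representation and the helper functions are assumptions made here, and the ontology itself was built in Protégé rather than in code like this.

```python
# Child -> parent links for the example classes (illustrative only).
SUBCLASS_OF = {
    "median": "descriptive statistic",
    "maximum value": "descriptive statistic",
    "minimum value": "descriptive statistic",
    "percentile": "descriptive statistic",
}


def ancestors(cls: str) -> list:
    """Walk child -> parent links upward and collect every superclass."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def leaves() -> list:
    """Classes that never appear as a parent, i.e. the leaf nodes."""
    parents = set(SUBCLASS_OF.values())
    return sorted(c for c in SUBCLASS_OF if c not in parents)


print(leaves())
print(ancestors("median"))
```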

An OD-ATTEST-annotated report generated based on a mIDA case study following the reporting guideline

To test the developed ATTEST reporting guideline and the OD-ATTEST ontology, we first created an ATTEST report based on our previous mIDA case study, where we explored the impact of the relationships among socioeconomic status, individual smoking status, and community-level smoking rate on pharyngeal cancer survival [20]. To annotate the ATTEST report using OD-ATTEST, we used the following annotation process: 1) identify information related to the reporting items in ATTEST by reviewing the original publication and supplementary materials; 2) annotate the information using the entities in OD-ATTEST; and 3) transform the annotation results into semantic triples in Resource Description Framework (RDF) format using Turtle syntax [34].
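Step 3 of the annotation process can be sketched as follows. This is a minimal illustration, assuming a placeholder namespace (`http://example.org/od-attest#`) and hypothetical property names (`ex:hasDataSource`, `ex:hasLevel`); the actual OD-ATTEST IRIs and properties are not given in this section, and the three annotations are simplified from the case study description.

```python
# Placeholder namespace; the real OD-ATTEST IRIs are not stated in the text.
PREFIX = "@prefix ex: <http://example.org/od-attest#> ."

# (subject, predicate, object) annotations; property names are hypothetical.
annotations = [
    ("ex:regional_smoking_rate", "ex:hasDataSource", "ex:BRFSS"),
    ("ex:regional_smoking_rate", "ex:hasLevel", '"community"'),
    ("ex:individual_smoking_status", "ex:hasDataSource", "ex:FCDS"),
]


def to_turtle(triples):
    """Serialize (subject, predicate, object) tuples as simple Turtle statements."""
    lines = [PREFIX, ""]
    lines += [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(lines)


print(to_turtle(annotations))
```

In practice an RDF library would handle escaping and validation; plain string formatting is used here only to keep the sketch dependency-free.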

Results

The ATTEST reporting guideline for RF variable and data source selection and data integration

We extended our preliminary reporting guideline [27] through a review of existing relevant reporting guidelines published in the EQUATOR network. Fig. 1 shows our review process. We reviewed 94 reporting guidelines designed for observational studies in the EQUATOR network. Out of the 94 reporting guidelines, 30 contain the required data, outcomes (variables), and method sections, which we retained for data extraction. In the data extraction step, for each reporting guideline, we extracted items relevant to RF and data source selection and integration, where the data and outcomes (variables) sections often contain information regarding how RF variables and data sources are selected, while the method section contains information about how data are processed and integrated.

Fig. 1

Review of relevant reporting guidelines in the EQUATOR network

We categorized these reporting guidelines (Table 1) based on the domains and levels of the data sources reported in the guidelines and mapped them to the NIMHD framework. As shown in Table 1, these 29 reporting guidelines cover data sources from all domains and levels of influence. Among them, 9 guidelines focused on providing a general reporting guideline for observational studies without specifying a domain of influence, while the rest are designed for particular domains. For example, the Genetic RIsk Prediction Studies (GRIPS) statement [45] is designed for risk prediction studies using genetic data. Furthermore, most guidelines considered only data sources at the individual level, while 2 of them considered data sources at multiple levels. For example, the Checklist for One Health Epidemiological Reporting of Evidence (COHERE) [55] considered both individual and environmental risk factors when studying a disease.

Table 1

Summary of reporting guidelines based on the data source domains and levels guided by the NIMHD framework

Domain of influences		Level of influences	Guidelines
Not specified^a		Individual level	[35–43]
Not specified^a		Societal level	[44]
Biological data	Genetics data	Individual level	[45, 46]
	Immunogenomic data		[47]
	Molecular epidemiological data		[48, 49]
	Drug safety data from biologics registers		[50, 51]
Behavioral data	Crime, violence data	Individual level	[52]
	Dietary or nutritional data		[53]
	Medication adherence		[54]
Sociocultural environment	Environmental data	Individual/ Community/ Societal/Interpersonal	[55]
Physical environment	Environmental data	Individual/ Community/ Societal/Interpersonal	[55]
Healthcare system	Administrative data, Electronic health records, Claim data, Patient or disease registries, Quality or safety surveillance databases	Individual level	[56–63]

^a When reporting data sources or RF variables, these studies did not specify a specific data domain

In our preliminary reporting guideline [27], we focused only on reporting items relevant to RF variable and data source selection. In this review, we extracted items that can be used to improve our initial reporting guideline, with a focus on documenting the data integration process. In total, three reporting guidelines [57-59] were found to contain information about data integration processes. However, the items in these 3 guidelines focus on data linkage and do not contain enough detail about how to resolve the heterogeneity of data from different sources. For example, when integrating variables across different levels (e.g., combining individual-level patient data and county-level smoking rates), none of the 3 guidelines has items on documenting the cross-level integration choices (e.g., layering the county-level smoking rate onto individuals based on their residence and county code), even though this type of choice is frequently encountered in mIDA studies. Further, data processing steps such as the choices and algorithms used for creating new data elements (e.g., computing a body mass index variable from two separate variables, weight and height) are not documented in existing reporting guidelines. Therefore, we further extended ATTEST to include these important data integration and data processing procedures based on our previous research experience in building a data integration framework [29].
Informed by the NIMHD research framework and consistent with our prior work, the ATTEST reporting guideline consists of two main parts, as shown in Fig. 2: reporting (1) the objective of the study, including explaining the background and rationale for designing the study in one or two sentences and describing the hypothesis of the study; and (2) the study design, covering the variable and data source selection processes and a description of the data along with the data integration and processing strategies.
The variable and data source selection process consists of five key steps.
(1) Define the outcome variables for primary and (if necessary) secondary outcomes.
(2) For each outcome variable, follow an iterative process (see Fig. 2a) to determine the data sources according to the NIMHD framework. After selecting each outcome variable and its data sources, investigators need to consider how to select or consolidate similar outcome variables from the different selected data sources. For example, if the outcome of interest is an individual’s lung cancer risk, we first identify potential data sources (e.g., cancer registries or electronic health records [EHRs]) that contain individual-level patient data in which lung cancer incidence data are available. Then, based on the cohort criteria and other information such as the required sample size and data range (e.g., time coverage and geographic information) of the potential data sources, the investigator can determine the qualified data sources and choose an adequate one based on the objective and design of the study. For example, if 2 data sources, a cancer registry and EHRs, are both available and contain individual-level lung cancer incidence data, the investigator can (a) choose one data source over the other, or (b) link the two data sources and integrate variables from both. If the investigator chooses to link and integrate the two data sources, she needs to explicitly document the linkage and integration processes for each variable, as shown in Fig. 2 (Report – Variables – E, F, G, H), so that others can repeat the processes to generate the same analytical dataset.
(3) Determine the individual-level predictors and covariates of the study.
(4) For each individual-level predictor or covariate, follow loop B in Fig. 2 to identify the different levels/domains of predictors or covariates according to the NIMHD framework. As with the outcome variables, different data sources could contain the same predictor or covariate; thus, it is important to compare and consolidate each new predictor or covariate with those already selected to resolve duplicates. If an investigator chooses to integrate the “duplicate” variables (e.g., choosing smoking status from cancer registry data over EHRs because cancer registry data are manually abstracted and typically have better data quality than raw EHRs), these data integration choices also need to be explicitly documented. Nevertheless, this is often a difficult choice, and the “duplicate” variables might all need to be tested in models before a selection can be made. Regardless, these decisions and data processing steps need to be clearly documented.
(5) After selecting individual-level predictors and covariates, follow a similar process (loop C in Fig. 2) to identify additional contextual-level predictors and covariates and data sources of interest.
In the end, a report of the selected data and data sources as well as the data integration processes is generated, as shown in Fig. 2. The corresponding ATTEST reporting guideline checklist is shown in Table 2.
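The two kinds of integration choices discussed above, layering a county-level smoking rate onto individuals by county code and deriving a body mass index from weight and height, can be sketched in a few lines of Python. This is an illustrative sketch only: the field names, county codes, and values are made up, and the choice to keep unmatched records with a `None` rate is an assumption, which is exactly the kind of decision the guideline asks investigators to document.

```python
# Individual-level records (made-up values; field names are hypothetical).
patients = [
    {"patient_id": 1, "county_code": "12001", "weight_kg": 70.0, "height_m": 1.75},
    {"patient_id": 2, "county_code": "12086", "weight_kg": 82.0, "height_m": 1.80},
    {"patient_id": 3, "county_code": "99999", "weight_kg": 60.0, "height_m": 1.60},
]

# County-level contextual variable keyed by county code (made-up values).
county_smoking_rate = {"12001": 0.18, "12086": 0.15}


def integrate(records, county_rates):
    """Layer a county-level rate onto each individual and add a computed BMI."""
    integrated = []
    for rec in records:
        row = dict(rec)  # copy so the raw records stay untouched
        # Cross-level integration: look up the rate by county of residence.
        # Unmatched counties get None so the gap stays visible.
        row["county_smoking_rate"] = county_rates.get(rec["county_code"])
        # Computed element: BMI = weight (kg) / height (m) squared.
        row["bmi"] = round(rec["weight_kg"] / rec["height_m"] ** 2, 1)
        integrated.append(row)
    return integrated


for row in integrate(patients, county_smoking_rate):
    print(row["patient_id"], row["county_smoking_rate"], row["bmi"])
```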

Fig. 2

An overview of the reporting guideline for RF variable and data source selection and data integration

Table 2

ATTEST reporting guideline checklist

	Item No	Recommendation
Objectives
Background/rationale	1	Explain the scientific background and rationale for the study being reported in one or two sentences
Prespecified hypotheses	2	State prespecified hypotheses in one or two sentences
Study design: data sources selection & variables selection & data integration
Data sources	3a	Describe the time coverage
	3b	Describe the geographic coverage
	3c	Describe the sample size
	3d	Describe the demographic distribution
	3e	Describe the cohort criteria
	3f	Describe the sources of biases (e.g., sample bias)
	3g	Describe the data collection approach
Dependent variables	4a	State the variable definition and variable type (e.g., primary outcome variable, secondary outcome variable)
	4b	State the data source of dependent variable
	4c	State the data type (e.g., numerical, categorical, date-time) of dependent variable
	4d	State descriptive statistics (e.g., min, max, median, value range, percentile) of dependent variable
	4e	State the NIMHD^a domains and levels of dependent variable
Independent variables	5a	State the variable definition and variable type (e.g., primary predictor, secondary predictor)
	5b	State the data source of independent variable
	5c	State the data type (e.g., numerical, categorical, date-time) of independent variable
	5d	State descriptive statistics (e.g., min, max, median, value range, percentile) of independent variable
	5e	State the NIMHD domains and levels of independent variable
Controlled variables	6a	State the variable type (e.g., numerical, categorical) of controlled variable
	6b	State the data source of controlled variable
	6c	State descriptive statistics (e.g., min, max, median, value range, percentile) of controlled variable
	6d	State the NIMHD domains and levels of controlled variable
Missing data	7a	For each data source, describe whether required or expected variable that is not present
	7b	For each variable, describe method of how to handle missing data
	7c	For each variable, describe the missing rate
Data integration
Data processing	8a	Data extraction: for each variable, describe how to process the raw data source to extract the variable
Data processing	8b	Data cleaning: for each variable, describe the method used to detect and correct (or remove) the incorrect records, missing values or outliers
Integration strategy	9	Describe the integration strategy for each variable: 1) Integrate with variables from same level, 2) Integrate with variables from different levels, and 3) Creation of additional computed elements
Integration algorithm	10	For each variable, describe the algorithm used to integrate it with variables from other data sources
Variable validation	11	For each variable, describe data validation rule for the selected variable. Rule should identify both the variable and the validation algorithms
Integrated variable	12	Describe the variable after integration and basic descriptive statistics (e.g., min, max, median, value range, percentile)

Please document the items for each data source and variable separately

^a National Institute on Minority Health and Health Disparities (NIMHD)
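A checklist of this kind lends itself to a simple machine-checkable form. The sketch below is an assumption-laden illustration rather than an official ATTEST tool: it encodes a handful of Table 2 items as a dictionary and lists the items a report leaves empty or marks as N/A. The `report` content and the short item names are hypothetical.

```python
# Item numbers follow Table 2; only a subset is listed to keep the sketch short.
CHECKLIST = {
    "1": "Background/rationale",
    "2": "Prespecified hypotheses",
    "3a": "Data source: time coverage",
    "3f": "Data source: sources of bias",
    "7c": "Missing data: missing rate",
    "8b": "Data processing: data cleaning",
}


def missing_items(report):
    """Return the checklist item numbers that are absent or marked 'N/A'."""
    return [
        item
        for item in CHECKLIST
        if report.get(item, "N/A").strip().upper() == "N/A"
    ]


# Hypothetical report: item number -> where the item is documented.
report = {
    "1": "Page 1, Abstract",
    "2": "Page 2, Introduction",
    "3a": "Page 2, Data source and case selection",
    "3f": "N/A",
}

print(missing_items(report))  # ['3f', '7c', '8b']
```

Because items are documented separately for each data source and variable, a fuller version would key the report by (item, data source or variable) pairs instead of by item alone.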


Development of the OD-ATTEST ontology

Based on the ATTEST reporting protocol above, we identified that 48 classes and 25 properties are needed in OD-ATTEST to represent the ATTEST reporting guideline. Fig. 3 shows the class hierarchy of OD-ATTEST. We reused classes from the following well-known ontologies: Ontology for Biomedical Investigations (OBI), Information Artifact Ontology (IAO), National Cancer Institute Thesaurus (NCIt), Statistics Ontology (STATO), and Semanticscience Integrated Ontology (SIO), as shown in Table 3. Note that very few existing ontologies are designed for documenting the variable and data source selection and data integration process. The limited number of properties in these existing ontologies is not sufficient to represent the elements in the reporting guideline and their relationships, which required us to create a large number of new properties in OD-ATTEST.

Fig. 3

The class hierarchy of OD-ATTEST

Table 3

The classes and properties reused or created for OD-ATTEST

	Label	Internationalized Resource Identifiers (IRIs)^a	Reference ontology
Classes	objective	iao:0000005	IAO^b
	data source	iao:0000100
	measurement datum	iao:0000109
	dependent variable	obi:0000751	OBI^c
	independent variable	obi:0000750
	controlled variable	obi:0000785
	data processing	obi:0200000
	study	ncit:C63536	NCIt^d
	hypothesis	ncit:C28362
	rationale	ncit:C80263
	primary outcome	ncit:C142644
	secondary outcome	ncit:C142680
	sample size	ncit:C53190
	missing data	ncit:C142610
	data validation	ncit:C142500
	data type	ncit:C42645
	data collection method	ncit:C103159
	data analysis	sio:001051	SIO^e
	minimum value	stato:0000150	STATO^f
	maximum value	stato:0000151
	median	stato:0000574
	mean	stato:0000573
	value range	stato:0000035
	percentile	stato:0000293
	data distribution	stato:0000161
	statistical sampling	stato:0000502
	outlier	stato:0000036
	primary predictor	od-attest:000015	OD-ATTEST^g
	secondary predictor	od-attest:000016
	demographic distribution	od-attest:000093
	outcome variable data source	od-attest:000019
	predictor data source	od-attest:000094
	cohort criteria	od-attest:000008
	descriptive statistic	od-attest:000012
	missing rate	od-attest:000068
	data source time coverage	od-attest:000023
	data source geographic coverage	od-attest:000024
	sources of bias	od-attest:000051
	data integration	od-attest:000052
	data extraction	od-attest:000054
	data cleaning	od-attest:000055
	integration strategy	od-attest:000056
	integrate variables from same level	od-attest:000057
	integrate variables from different levels	od-attest:000058
	creation of additional elements	od-attest:000059
	integration algorithm	od-attest:000060
	validation strategy	od-attest:000068
	integrated variable	od-attest:000096
Properties	is determined by	od-attest:000097	OD-ATTEST
	has rationale	od-attest:000098
	has objective	od-attest:000099
	has data source	od-attest:000100
	has cohort criteria	od-attest:000101
	has demographic distribution	od-attest:000102
	has sources of bias	od-attest:000103
	has controlled variable	od-attest:000104
	has independent variable	od-attest:000105
	has dependent variable	od-attest:000106
	has data type	od-attest:000107
	has descriptive statistics	od-attest:000108
	has NIMHD level	od-attest:000109
	has NIMHD domain	od-attest:000110
	has data collection approach	od-attest:000111
	has sample size	od-attest:000112
	has missing data	od-attest:000113
	has data integration	od-attest:000114
	has data processing	od-attest:000115
	has data validation	od-attest:000116
	has integration strategy	od-attest:000117
	extracted from	od-attest:000118
	has description	od-attest:000119
	has time coverage	od-attest:000120
	has geographic coverage	od-attest:000121

^a Prefix: iao: ; obi: ; ncit: ; sio: ; stato: ; od-attest:

^b Information Artifact Ontology

^c Ontology for Biomedical Investigations

^d National Cancer Institute Thesaurus

^e Semanticscience Integrated Ontology

^f Statistics Ontology

^g Ontology for the Documentation of Variable and Data Source Selection and Integration Process
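As a rough illustration of how Table 3 might be consumed by software, the following Python sketch looks up a term by its label and reports its identifier, its reference ontology, and whether it was reused or newly created. Only a few rows of Table 3 are copied in; the `describe` function and the dictionary layout are assumptions made for this example.

```python
# A few label -> identifier rows copied from Table 3.
TERMS = {
    "data source": "iao:0000100",
    "dependent variable": "obi:0000751",
    "study": "ncit:C63536",
    "data analysis": "sio:001051",
    "median": "stato:0000574",
    "integration strategy": "od-attest:000056",
    "has data source": "od-attest:000100",
}

ONTOLOGY_NAMES = {
    "iao": "Information Artifact Ontology",
    "obi": "Ontology for Biomedical Investigations",
    "ncit": "National Cancer Institute Thesaurus",
    "sio": "Semanticscience Integrated Ontology",
    "stato": "Statistics Ontology",
    "od-attest": "OD-ATTEST",
}


def describe(label):
    """Return (identifier, reference ontology, reused?) for a Table 3 label."""
    curie = TERMS[label.strip().lower()]
    prefix = curie.split(":", 1)[0]
    return curie, ONTOLOGY_NAMES[prefix], prefix != "od-attest"


print(describe("Dependent variable"))
# ('obi:0000751', 'Ontology for Biomedical Investigations', True)
print(describe("integration strategy"))
# ('od-attest:000056', 'OD-ATTEST', False)
```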

We annotated two of our previously published mIDA case studies: (1) one study that explored the impact of the relationships among socioeconomic status, individual smoking status, and community-level smoking rate on pharyngeal cancer survival [20], and (2) another study that created a semantic data integration framework to pool multi-level RFs from heterogenous data sources to support mIDA [29]. Table 4 is the completed ATTEST checklist for the two studies. Fig. 4 shows a snippet of the ontology-annotated variable and data source selection and integration process for the second study [29], while the corresponding semantic triples in RDF format using Turtle syntax are shown in Table 5. The RF variables, data sources, and data integration steps, together with their relationships, are explicitly standardized and modeled using the classes and properties from OD-ATTEST.

Table 4

An example of two previous mIDA case studies annotated using ATTEST checklist

	Item No	Recommendation	Page No Study (1) [20]	Page No Study (2) [29]
Objectives
Background/rationale	1	Explain the scientific background and rationale for the study being reported in one or two sentences	Page 1, section “Abstract”, paragraph 1, line 1–7	Page 1, section “Abstract”, paragraph 1, line 1–4
Prespecified hypotheses	2	State prespecified hypotheses in one or two sentences	Page 2, section “Introduction”, paragraph 3, line 1–2	N/A
Study design: data sources selection & variables selection & data integration
Data source	3a	Describe the time coverage	FCDS: Page 2, section “Data source and case selection”, paragraph 1, line 2	FCDS: Page 4, section “Data sources”, paragraph 1, line 11
			BRFSS: Page 2, section “Data source and case selection”, paragraph 1, line 6	BRFSS: N/A
			2000 U.S. census data: Page 2, section “Data source and case selection”, paragraph 1, line 7	United States Census Bureau: Page 4, section “Data sources”, paragraph 1, line 23
				ATSDR: N/A
				County Health Ranking & Roadmaps: N/A
	3b	Describe the geographic coverage	FCDS: Page 2, section “Data source and case selection”, paragraph 1, line 4–5	FCDS: Page 4, section “Data sources”, paragraph 1, line 12–14
			BRFSS: N/A	BRFSS: Page 10, section “Result”, paragraph 2, line 7–8
			2000 U.S. census data: N/A	United States Census Bureau: N/A
				ATSDR: N/A
				County Health Ranking & Roadmaps: N/A
	3c	Describe the sample size	FCDS: Page 2, section “Data source and case selection”, paragraph 2, line 7	FCDS: Page 4, section “Data sources”, paragraph 2, line 6–7
			BRFSS: N/A	BRFSS: N/A
			2000 U.S. census data: N/A	United States Census Bureau: N/A
				ATSDR: N/A
				County Health Ranking & Roadmaps: N/A
	3d	Describe the demographic distribution	FCDS: Page 2, Table 1	N/A
			BRFSS: N/A
			2000 U.S. census data: N/A
	3e	Describe the Cohort criteria	FCDS: Page 2, section “Data source and case selection”, paragraph 2, line 1–5	FCDS: Page 4, section “Data sources”, paragraph 2, line 1–6
			BRFSS: N/A	BRFSS: N/A
			2000 U.S. census data: N/A	United States Census Bureau: N/A
				ATSDR: N/A
				County Health Ranking & Roadmaps: N/A
	3f	Describe the sources of bias	N/A	N/A
	3g	Describe the data collection approach	N/A	FCDS: N/A
				BRFSS: Page 4, section “Data sources”, paragraph 2, line 6–7
				United States Census Bureau: N/A
				ATSDR: N/A
				County Health Ranking & Roadmaps: N/A
Dependent variable	4a	State the variable definition and variable type (e.g., primary outcome variable, secondary outcome variable)	Survival time: Page 2, section “Variable definitions”, line 1–3	Cancer survival: Page 4, section “Data integration use case: The multi-level integrative data analysis of Cancer survival”, paragraph 1, line 1–2
	4b	State the data source of dependent variable	Survival time: Page 2, section “Data source and case selection”, paragraph 1, line 2	Cancer survival: Page 4, section “Data sources”, paragraph 1, line 9–14
	4c	State the data type (e.g., numerical, categorical, date-time) of dependent variable	Survival time: Page 2, section “Variable definitions”, paragraph 1, line 1	Cancer survival: N/A
	4d	State descriptive statistics (e.g., min, max, median, value range, percentile) of dependent variable	Survival time: Page 4, Table 1	Cancer survival: N/A
	4e	State the NIMHD domain and levels of dependent variable	Survival time: Page 2, section “Data source and case selection”, paragraph 1, line 1–2	Cancer survival: Page 4, section “Data sources”, paragraph 2, line 15
Independent variable	5a	State the variable definition and variable type (e.g., primary predictor, secondary predictor)	Socioeconomic status: Page 2, section “Variable definitions”, paragraph 3, line 1–2	Demographic variables: Page 5, Table 1
			Individual smoking: Page 2, section “Data source and case selection”, paragraph 2, line 1–2	Smoking status: Page 10, section “The ontology for Cancer research variables (OCRV)”, paragraph 2, line 13–27
			Regional smoking: Page 3, section “Data source and case selection”, paragraph 2, line 4–6	Marital status: Page 14, section “Type 4: Queries that generate results based on the knowledge encoded in ontology”, paragraph 2, line 7–10
				Insurance payer: Page 5, Table 1
				Residency: Page 5, Table 1
				Age at diagnosis: Page 5, Table 1
				Year of diagnosis: Page 5, Table 1
				Tumor stage: Page 5, Table 1
				Tumor type: Page 5, Table 1
				Treatment procedure: Page 5, Table 1
				Census Tract SVI: Page 14, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 5–16
				Census tract high school completion rates: Page 5, Table 1
				Census tract family poverty rates: Page 5, Table 1
				Census tract rurality status: Page 4, section “Data integration use case: The multi-level integrative data analysis of Cancer survival”, paragraph 1, line 8–11
				County adult mental and physical health status: Page 5, Table 1
				County density of primary care physicians: Page 5, Table 1
				County smoking rate: Page 10, section “The ontology for Cancer research variables (OCRV)”, paragraph 2
				County alcohol consumption rate: Page 5, Table 1
	5b	State the data type (e.g., numerical, categorical) of independent variable	Socioeconomic status: Page 2, section “Variable definitions”, paragraph 3, line 9–10	Demographic variables: N/A
			Individual smoking: Page 2, section “Data source and case selection”, paragraph 2, line 2–3	Smoking status: Page 13, Table 3
			Regional smoking: Page 3, section “Data source and case selection”, paragraph 2, line 4–6	Marital status: Page 14, section “Type 4: Queries that generate results based on the knowledge encoded in ontology”, paragraph 2, line 7–10
				Insurance payer: N/A
				Residency: N/A
				Age at diagnosis: Page 16, Fig. 6
				Year of diagnosis: Page 16, Fig. 6
				Tumor stage: N/A
				Tumor type: Page 4, section “Data sources”, paragraph 2, line 1–6
				Treatment procedure: Page 5, Table 1
				Census Tract SVI: Page 14, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 5–16
				Census tract high school completion rates: N/A
				Census tract family poverty rates: N/A
				Census tract rurality status: N/A
				County adult mental and physical health status: N/A
				County density of primary care physicians: N/A
				County smoking rate: Page 10, section “The ontology for Cancer research variables (OCRV)”, paragraph 2
				County alcohol consumption rate: N/A
	5c	State the data source of independent variable	Socioeconomic status: Page 2, section “Data source and case selection”, paragraph 1, line 6–7	Page 5, Table 1
			Individual smoking: Page 2, section “Data source and case selection”, paragraph 1, line 1–2
			Regional smoking: Page 2, section “Data source and case selection”, paragraph 1, line 7–10
	5d	State descriptive statistics (e.g., min, max, median, value range, percentile) of independent variable	Page 4, Table 1	N/A
	5e	State the NIMHD domain and levels of independent variable	Socioeconomic status: Page 2, section “Data source and case selection”, paragraph 1, line 6	Page 5, Table 1
			Individual smoking: Page 2, section “Data source and case selection”, paragraph 2, line 1
			Regional smoking: Page 3, section “Data source and case selection”, paragraph 2, line 4–6
Controlled variable	6a	State the controlled variable and variable type (e.g., numerical, categorical) of controlled variable	Age of diagnosis: Page 2, section “Variable definitions”, paragraph 1, line 10–13	N/A
			Anatomic site: Page 2, section “Variable definitions”, paragraph 1, line 2–9
			Race-ethnicity: Page 4, Table 1
			Marital status: Page 4, Table 1
			Insurance: Page 4, Table 1
			Year of diagnosis: Page 4, Table 1
			Gender: Page 4, Table 1
			Stage of diagnosis: Page 4, Table 1
			Treatment: Page 4, Table 1
	6b	State the data source of controlled variable	Page 2, section “Data source and case selection”, paragraph 1, line 2^a	N/A
	6c	State descriptive statistics (e.g., min, max, median, value range, percentile) of controlled variable	Page 2, section “Data source and case selection”, paragraph 1, line 2^a	N/A
	6d	State the NIMHD domain and levels of controlled variable	Page 2, section “Data source and case selection”, paragraph 1, line 1–5^a	N/A
Missing data	7a	For each data source, describe whether any required or expected variable is not present	N/A	N/A
	7b	For each variable, describe method of how to handle missing data	N/A	N/A
	7c	For each variable, describe the missing rate	N/A	N/A
Data processing	9a	Data extraction: for each variable, describe how to process the raw data source to extract the variable	N/A	Demographic variables: Page 15, Fig. 5
				Age at diagnosis: Page 16, Fig. 6
				Census Tract SVI: Page 16, Fig. 7
				County smoking rate: Page 17, Fig. 8
				Marital status: Page 18, Fig. 9
	9b	Data cleaning: for each variable, describe the method used to detect and correct (or remove) the incorrect records, missing values or outliers	N/A	N/A
Integration strategy	10	Describe the integration strategy for each variable: 1) Integrate with variables from same level, 2) Integrate with variables from different levels, and 3) Creation of additional computed elements	Socioeconomic status: Page 2, section “Variable definitions”, paragraph 3, line 6–7.	Demographic variables: Page 15, Fig. 5
			Regional smoking: Page 2, section “Variable definitions”, paragraph 2, line 4–5.	Age at diagnosis: Page 16, Fig. 6
				Census Tract SVI: Page 16, Fig. 7
				County smoking rate: Page 17, Fig. 8
				Marital status: Page 18, Fig. 9
				Census tract high school completion rates: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				Census tract family poverty rates: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				Census tract rurality status: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				County adult mental and physical health status: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				County density of primary care physicians: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				County alcohol consumption rate: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
Integration algorithms	11	For each variable, describe the algorithm used to integrate it with variables from other data sources	N/A	Demographic variables: Page 15, Fig. 5
				Age at diagnosis: Page 16, Fig. 6
				Census Tract SVI: Page 16, Fig. 7
				County smoking rate: Page 17, Fig. 8
				Marital status: Page 18, Fig. 9
				Census tract high school completion rates: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				Census tract family poverty rates: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				Census tract rurality status: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				County adult mental and physical health status: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				County density of primary care physicians: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
				County alcohol consumption rate: Page 15, section “Type 3: Queries that are used to link a patient to contextual factors through geographic variables”, paragraph 1, line 1–3
Variable validation	12	For each variable, describe data validation rule for the selected variable. Rule should identify both the variable and the validation algorithms	N/A	Demographic variables: Page 19, section “Data quality and consistency checks of the source data using the ontology”
Integrated variable	13	Describe the variable after integration and basic descriptive statistics (e.g., min, max, median, value range, percentile)	N/A	Page 18, Table 4

FCDS: Florida Cancer Data System

ATSDR: Agency for Toxic Substances and Disease Registry

BRFSS: Behavioral Risk Factor Surveillance System

^a If the reported items for all variables or data sources are described in the same place, the page/section/table information can be listed once. For the integration-related items, we only present variables that have the information (N/A entries are not shown in the table)

Fig. 4

An OD-ATTEST-annotated report generated based on a mIDA case study

Table 5

An example of annotated semantic triples represented in RDF format using Turtle syntax

Prefix

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix od-attest: <http://www.semanticweb.org/od-attest#>.
@prefix ncit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

RDF^a triples

od-attest:30066664
    rdf:type ncit:study;
    od-attest:has rationale od-attest:30066664/rationale;
    od-attest:has objective od-attest:30066664/objective.

od-attest:30066664/rationale
    rdf:type ncit:rationale;
    od-attest:has description "Extant cancer survival analyses have..."^^xsd:string.

od-attest:30066664/objective
    rdf:type ncit:objective;
    od-attest:has description "built a semantic data integration …"^^xsd:string.

^a Resource Description Framework
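The triples in Table 5 use human-readable labels (e.g., "has rationale") in place of identifiers. As a minimal sketch of how such annotations could be serialized into Turtle that a parser would accept, the Python below substitutes the identifiers listed in Table 3 (`ncit:C63536` for study, `ncit:C80263` for rationale, `od-attest:000098`, `od-attest:000099`, and `od-attest:000119` for the three properties). Writing the node IRIs in angle brackets, the `study_to_turtle` function, and the `literal` helper are assumptions of this example, not the authors' tooling; the description strings are the truncated ones from Table 5.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "od-attest": "http://www.semanticweb.org/od-attest#",
    "ncit": "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#",
}


def literal(text):
    """Quote a string as a Turtle literal (a plain literal is an xsd:string)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def study_to_turtle(study_id, rationale, objective):
    """Serialize a study with its rationale and objective descriptions."""
    base = PREFIXES["od-attest"]
    study = f"<{base}{study_id}>"
    rationale_node = f"<{base}{study_id}/rationale>"
    objective_node = f"<{base}{study_id}/objective>"
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in PREFIXES.items()]
    lines += [
        "",
        f"{study} rdf:type ncit:C63536 ;",              # study
        f"    od-attest:000098 {rationale_node} ;",      # has rationale
        f"    od-attest:000099 {objective_node} .",      # has objective
        "",
        f"{rationale_node} rdf:type ncit:C80263 ;",      # rationale
        f"    od-attest:000119 {literal(rationale)} .",  # has description
        "",
        f"{objective_node} od-attest:000119 {literal(objective)} .",
    ]
    return "\n".join(lines)


turtle = study_to_turtle(
    "30066664",
    "Extant cancer survival analyses have...",
    "built a semantic data integration …",
)
print(turtle)
```

A dedicated RDF library would normally handle escaping and prefix management; plain string formatting is used here only to keep the sketch free of dependencies.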


Discussion

In this study, we first developed a reporting guideline, ATTEST, to provide a theory-driven approach to guide the RF variable and data source selection and integration process in cancer outcomes research. We then proposed an ontology-based approach to annotate the items in our reporting guideline so that information relevant to variables, data sources, and data integration in mIDA studies can be explicitly documented. To develop the reporting guideline, we conducted a systematic search to identify useful reporting items to improve our selection and data integration process. We categorized these reporting guidelines based on their reported data source domains and levels according to the NIMHD framework, so that we could identify the items that need to be reported when selecting variables or data sources from different domains and levels. For example, when reporting population-level estimates (variables) [44], information regarding the sources of bias (e.g., selection bias) needs to be documented. Therefore, we updated our previous reporting guideline and added “sources of bias” as a reporting item when documenting data sources. This is important because subsequent data processing steps might be needed to correct the bias.

Further, the use of the NIMHD framework can help researchers think through and structure the variable and data source selection process systematically when considering multi-level RF variables from heterogenous data sources. For example, if an investigator is considering smoking-related risk factors in cancer outcomes research, following the NIMHD framework, one can start with variables in the behavioral domain and then list potential smoking-related variables for each level of influence step by step, such as individual smoking status at the individual level, second-hand smoke exposure at the interpersonal level, county-level smoking rate at the community level, and smoking policies or laws (e.g., federal minimum age to purchase tobacco products) at the societal level. The same process can be applied to select other smoking-related variables from other domains of influence. In this way, investigators can systematically consider and evaluate the confounding effects and cross-level interactions among the selected variables, which are usually ignored in previous cancer outcome studies that use a single data source.

We provided an ATTEST checklist (1) to help researchers clearly document each step of their RF and data source selection and integration process, and (2) to improve the completeness and transparency of their mIDA studies. As shown in Table 4, we used the ATTEST checklist to report two previous mIDA studies. Based on the checklist, we can easily (1) check whether these mIDA studies document the required items that can help other researchers replicate their studies, and (2) compare their variables, data sources, and data integration processes. We found three items that are never discussed in either of the two studies: “sources of bias”, “missing data” for selected variables, and “data cleaning” (i.e., the method used to detect and correct or remove incorrect records, missing values, or outliers). All three items relate to data quality issues, which are rarely discussed or documented in these mIDA studies or, more broadly, in cancer outcomes research. Nevertheless, data quality issues such as missing data can dramatically affect the results of cancer outcomes research (e.g., in cancer survival prediction) [64]. Comparing the two mIDA case studies, the data integration process was not well documented in the first study [20], where most of the items relevant to data integration are blank, whereas in the other study [29] the data processing, data integration, and data validation steps were all clearly documented according to the ATTEST checklist. Therefore, using this checklist, one can improve the completeness of the documentation of the selection and integration process, as shown in Table 4.

The OD-ATTEST ontology provides a way to standardize the documentation of the mIDA study process from variable and data source selection to data integration. The ontology-based annotation of the report is also beneficial because it provides an initial step towards a report that is not only readable and understandable by humans but also potentially executable by machines. After transforming these annotations into semantic triples, the report can be stored in a knowledge base and represented as a knowledge graph (Fig. 4) to facilitate examination and analysis of these mIDA reports, enabling robust sharing and comparison of different mIDA studies.
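To make the idea of comparing studies through stored triples concrete, here is a small, hedged sketch in Python. It holds a few "has data source" statements for the two case studies as plain tuples and asks which data sources both studies use; the data source names come from Table 4, while the study keys `study:20` and `study:29` and the `objects` helper are hypothetical. A real knowledge base would use a triple store and a query language such as SPARQL instead of a Python set.

```python
# (subject, predicate, object) statements; data source names follow Table 4.
triples = {
    ("study:20", "has data source", "FCDS"),
    ("study:20", "has data source", "BRFSS"),
    ("study:20", "has data source", "2000 U.S. census data"),
    ("study:29", "has data source", "FCDS"),
    ("study:29", "has data source", "BRFSS"),
    ("study:29", "has data source", "United States Census Bureau"),
    ("study:29", "has data source", "ATSDR"),
    ("study:29", "has data source", "County Health Ranking & Roadmaps"),
}


def objects(subject, predicate):
    """Return every object linked to the subject through the predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


shared = objects("study:20", "has data source") & objects("study:29", "has data source")
print(sorted(shared))  # ['BRFSS', 'FCDS']
```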

Limitations and future work

Most of the reporting guidelines we reviewed from the EQUATOR network have limited information on how to document the data integration process, indicating a significant gap in existing practice. Nevertheless, we were able to summarize the key elements that need to be reported for the integration process based on 3 existing guidelines and our own previous experience with semantic data integration case studies. As a future study, a systematic review of the data integration literature should be conducted to summarize relevant reporting items and improve the reporting guideline. Meanwhile, we will conduct a yearly review of existing reporting guidelines, following the review process discussed in Fig. 1, to identify new reporting items of interest and keep our framework up to date.

Further, beyond standardized reporting, our ultimate goal is to let computers understand the ontology-annotated report (in RDF triples) regarding (1) how different variables are defined and represented and (2) how different variables are selected and integrated, so that machines can automatically repeat these processes and generate an integrated dataset based on an executable ontology-annotated report. For variable definition and representation, it is important to recognize and be interoperable with existing data standards and common data models (CDMs), such as those that standardize the exchange of EHR data, including the national Patient-Centered Clinical Research Network (PCORnet) CDM, the Observational Medical Outcomes Partnership (OMOP) CDM from the Observational Health Data Sciences and Informatics (OHDSI) network, and the emerging Fast Healthcare Interoperability Resources (FHIR) standard adopted by major EHR system vendors. Developing the ontology against these CDMs, which have already standardized existing data resources, would be critical to assure the generalizability of our framework.

Nevertheless, for modeling the variable selection and integration processes as shown in Fig. 4, more fine-grained information regarding the variables, data sources, and the integration process is currently documented as free-text descriptions. We face challenges in transforming this free-text information into executable algorithms (e.g., a data processing step that calculates BMI using weight and height). Such information is related to the concept of data provenance, “a type of metadata, concerned with the history of data, its origin and changes made to it” [65]. The importance of data provenance is widely recognized, especially for study reproducibility and replicability. More than one-half of the systematic efforts to reproduce computational results across different fields have failed, mainly due to insufficient detail on digital artifacts, such as data, code, and computational workflow [66]. However, descriptions of data provenance are often neglected or inadequate in the scientific literature due to the lack of a tractable, easily operated approach with supporting tools. Future studies are urgently needed that focus on developing easy-to-use tools with a standardized framework to persist end-to-end data provenance with high integrity, including intermediate processes and data products. Further, future development of tools and platforms that automate the documentation process, in which the data elements and associated information (e.g., levels and domains) are also automatically annotated with the standardized ontology, is warranted.
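The BMI example mentioned above gives a sense of what an executable, provenance-aware processing step might look like. The following Python sketch is only one possible shape, under explicit assumptions: weight is in kilograms and height in meters, and the `run_step` wrapper and its record fields are hypothetical rather than part of OD-ATTEST. Each step returns its result together with a provenance record (function name, inputs, output, and a checksum of the record) that could later be attached to the annotated report.

```python
import hashlib
import json


def bmi(weight_kg, height_m):
    """Body mass index; assumes weight in kilograms and height in meters."""
    return weight_kg / height_m ** 2


def run_step(name, func, inputs):
    """Run one data processing step and return (output, provenance record)."""
    output = func(**inputs)
    record = {
        "step": name,
        "function": func.__name__,
        "inputs": inputs,
        "output": output,
    }
    # Checksum over the serialized record, so later edits can be detected.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["checksum"] = hashlib.sha256(payload).hexdigest()
    return output, record


value, record = run_step("compute BMI", bmi, {"weight_kg": 70.0, "height_m": 1.75})
print(round(value, 2))  # 22.86
print(record["function"], sorted(record["inputs"]))
```

Chaining such records across extraction, cleaning, and integration steps would give the end-to-end trail of intermediate processes and data products discussed above.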

Conclusions

In this paper, we have proposed and developed an ontology-based reporting guideline that addresses some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) a standardized documentation of the data selection and integration processes powered by an ontology, thereby enabling the sharing of mIDA study reports among researchers.
