S Asghari1, S Boyd1, J Knight2, J Blackmore1, O Hurley1, J Allison3, L Gilbert4, J Dowden5, P Lundrigan6.
Abstract
INTRODUCTION: Developing a comprehensive cohort of people living with HIV (PLHIV) to help improve healthcare has long been the vision of researchers, clinicians and decision makers. The development of this kind of database is challenging and requires strict adherence to privacy and confidentiality policies. We explored procedures, activities and events in database development.
Year: 2020 PMID: 32935052 PMCID: PMC7473269 DOI: 10.23889/ijpds.v5i1.1144
Source DB: PubMed Journal: Int J Popul Data Sci ISSN: 2399-4908
Figure 1: Iterative Database Development Process. This shows the cycle of an iterative process that repeats until the research questions and objectives can be answered by the data provided.
| Potential Risk and Impact | Challenges |
|---|---|
| Patient privacy & confidentiality | Ensuring patient privacy and confidentiality is not compromised |
| | Ethics standards compliance |
| Slow the progress of data acquisition | Data may have multiple custodians |
| | Processes for requesting data can be cumbersome |
| | Maintaining data integrity |
| | Data availability |
| Data quality | Data consistency |
| | Data accuracy |
| | Missing data |
| | Data comparability |
| Slow the progress of transforming data to information | Data mining may be complex |
| | Staff changes |
| | Optimum data unit (level) |
| | Choosing an optimal analysis approach |

| Challenges | Action(s) Taken | Lessons Learned |
|---|---|---|
| Ensuring patient privacy and confidentiality is not compromised | We limited the data elements requested to those required to achieve the study’s research objectives. The ethics application also required an explanation of the measures taken to ensure patient de-identification. We maintained the data on a secure server to which only the team members conducting the data analyses had access. We avoided citing data in an identifiable way, taking into account, for example, the small numbers of patients residing in certain geographic areas. | Consider each requested variable carefully and take the necessary measures to minimize the probability of patient re-identification: limit the number of data elements requested, safeguard data access and avoid citing data in an identifiable manner. |
| Ethics standards compliance | We reviewed the ethics file regularly to ensure that any changes in the database development were in line with the goals and objectives originally approved by the HREB. We regularly checked that staff, students and researchers who worked with the data were familiar with the ethics application, and implemented ways to ensure the data remained secure. We defined the level of stakeholder access to data/outputs, and data sharing policies, to maintain safeguards. We maintained the ethics file throughout the study period, which is a provincially mandated requirement of conducting research that involves the secondary use of health data. | Develop a timeline, including electronic reminders, for regular review of the ethics file to ensure any changes in the database development are in line with the goals and objectives in the current ethics file. Conduct data access audits and review data/information sharing agreements to prevent unauthorized sharing of information with stakeholder groups. Keep the ethics approval active and up to date until the final publication is released. |

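
The concern about small numbers of patients in certain geographic areas can be illustrated with a small-cell suppression step applied before any counts are reported. This is a minimal sketch, not the authors' procedure: the threshold of 5, the function name `suppress_small_cells` and the region labels are all assumptions made for illustration, and the appropriate threshold depends on the data custodian's disclosure policy.

```python
from collections import Counter

# Hypothetical threshold: counts below this are masked before reporting.
MIN_CELL_SIZE = 5

def suppress_small_cells(regions, min_cell_size=MIN_CELL_SIZE):
    """Count records per region and mask any count below the threshold."""
    counts = Counter(regions)
    return {
        region: (n if n >= min_cell_size else f"<{min_cell_size}")
        for region, n in counts.items()
    }

# Made-up region labels, one per patient record.
records = ["Region A"] * 12 + ["Region B"] * 3 + ["Region C"] * 7
report = suppress_small_cells(records)
```

Masking the count rather than dropping the row keeps the region visible in the output while hiding the exact small number.
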
| Challenges | Action(s) Taken | Lessons Learned |
|---|---|---|
| Data mining may be complex | We trained all individuals before they became involved in data mining (e.g. on the purpose of the database and how to properly link data elements across sources/tables). This minimized the downtime attributed to training/briefing new team members involved in database development. We maintained detailed notes throughout all phases of the database development process to maintain efficiency and to serve as reference material if issues were identified later in the study. These notes were useful whenever a new research team member was introduced to the project. We kept the individuals on the research team who were involved in the data mining process the same for as long as possible. | Studies with relatively long database development periods should try to minimize, as much as possible, the number of different staff involved in the project over time. Keeping a record of the data mining process and detailed notes reduced the impact on the progress and speed of the development process when a change in personnel occurred. |
| Staff changes | Every new staff member on the research team started with a brief training exercise on best practices for data mining and data analyses. We developed a document outlining various procedures for staff to review when applicable. We held regular meetings with staff so that the documents outlining the procedures were kept up to date and relevant to the tasks that needed to be performed. We planned regular training sessions for all staff members to ensure their knowledge of specific tasks was kept up to date. | It is important to have a document outlining procedures related to database development tasks for staff to review when required. Regular training exercises on database development for your staff ensure standardization of database development methods, which should, in turn, improve the overall quality of the end product. |
| Optimal data unit (level) | Some variables were not available at the level requested (e.g. individual or aggregate level). Once we identified the variables that could not be provided at the level requested, we worked with data stakeholders and custodians to consider alternative variables/databases that might contain the desired information. Once alternatives were fully considered, we planned for subsequent data aggregation and identified the finest level of data available for multi-level data analysis. | Multi-level data analysis (individual vs. group level data) may be required if certain variables are not available at the individual level. |
| Choosing an optimal analysis approach | Before conducting any analyses, we discussed the key messages and target audiences with our team, as well as the expected outcomes and the knowledge to be disseminated. We reviewed the analytical approach used to address needs that were defined by the research team. We then selected an optimal approach to data analysis and decided how best to disseminate the results. We reviewed both the format of each variable and the database structure. We broke them down and organized them in a manner that was conducive to answering our research question while also considering the type of expertise required to complete the analysis. We also created simple logic for how to examine PLHIV data, how to manipulate PLHIV data, and how to communicate outputs from PLHIV data analysis. | Throughout all stages of data analysis, you must think about the final product, choose the right platform for your target audience, consider the technology/software required to produce the desired result, and consider your expertise and available resources to ensure the analysis is completed in a timely manner and as intended. |

Figure 2: Diagram of the data sources of PLHIV in NL who were still alive at the end of the study period

| Challenges | Action(s) Taken | Lessons Learned |
|---|---|---|
| Data may have multiple custodians | We developed a data governance team that included independent researchers and members from provincial data custodians and stakeholder groups. The data governance team provided the prerequisite information needed to develop a variable list that would satisfy all research objectives and to identify the repositories that contain the desired health information. We met with each data custodian separately to collect information about their secondary data request process and to determine which approvals would need to be obtained prior to applying for data access. | Establish a data governance team when required data elements are housed across multiple data custodians. Involving data custodians early in the database development process is essential. |
| Processes for requesting data can be cumbersome | The rationale for the inclusion of data elements from different databases and their relation to the study’s research objective was drafted after local stakeholder groups and data custodian representatives were consulted. The rationale included both the preferred method of analysis and the reason each variable was included in our request. Our secondary data request also provided the inclusion dates for each of the variables as well as practical definitions for every variable, especially for disease-specific case definitions. Prior to data abstraction and collection, we made sure all ethics-based prerequisites were met. This included taking oaths of confidentiality, obtaining ethics approval from provincial ethics boards and regional health authorities, completing legislated provincial privacy training and receiving custodian-specific approval. | When requesting data, researchers must first be clear on the purpose for creating the database as well as the research objectives. Once the purpose for the data request and the research objectives have been established and understood by each of the data custodians, researchers should provide details on how certain health conditions are defined and what software will be used to link and analyze each of the databases. |
| Maintaining data integrity | To ensure data was not compromised by human error or unintended transfer errors, we kept a copy of our original database in a separate folder. The server used to store all related databases was backed up daily to avoid issues related to hardware failure, malware, accidental deletion of critical files or data corruption. | It is important to keep a copy of your original database in a secure location that is routinely backed up. Do not assume the data custodian will indefinitely keep a copy of the extracted database. |
| Data availability | When working with secondary data, there are times when some data is not available for research purposes. We regularly reviewed the approved databases with each of the custodians to determine whether new/up-to-date data was available. We discussed with the data custodian the feasibility of adding new databases, updating the database with new years of data over time, and adding data elements, as well as the process of obtaining and linking the new data to the existing database. In some cases, the data was only available in paper-based records and therefore had to be entered into the database manually using electronic forms that were jointly developed with a provincial data custodian. | Regularly assess the availability of any new data and the feasibility of adding new data to your database. Paper-based record data abstraction requires a significant amount of time and effort and should be rigorously scrutinized to meet the data quality standards required for subsequent data analysis. |

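
The practice of keeping an untouched copy of the original database in a separate folder can be paired with a checksum comparison to catch transfer errors at the moment of copying. The following is a sketch under stated assumptions, not the procedure the authors used: the function names `sha256_of` and `archive_original` are hypothetical, and the extract is assumed to be a single file.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def archive_original(source, archive_dir):
    """Copy the extract into archive_dir and confirm the copy is byte-identical."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / Path(source).name
    shutil.copy2(source, target)
    # A digest mismatch means the copy was corrupted or altered in transfer.
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Archived copy of {source} does not match the original")
    return target
```

Recording the digest alongside the archived file would also allow later checks that the stored original has not changed.
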
| Challenges | Action(s) Taken | Lessons Learned |
|---|---|---|
| Data consistency | To ensure the data was usable and consistently formatted, we reviewed each variable’s corresponding data to identify an optimum approach for transferring data between different software (e.g. Excel to Access). We converted the data to the same/similar format where applicable to allow for data transfer/linkage. This approach helped to minimize missing information during data transfer/linkage and helped to avoid issues in subsequent data analysis. We discussed our preferred data format and preferred statistical software (e.g. SAS, STATA) with data custodians to ensure file/variable type compatibility and to reduce the need for subsequent data conversions. | It is important to ensure that each variable is entered into the database in the intended format and that proper data conversion methods are applied. |
| Data accuracy | To ensure the data elements included in our database accurately reflected what they were supposed to measure, we developed a protocol to assess data accuracy throughout the whole process. The protocol included double entry, assessing a random sample of data, reviewing types of data (nominal, ordinal, numeric), and assessing the appropriateness of each variable’s range (e.g. age 1001) and units (e.g. age = 15 months). We compared the values for each variable with the expected range, format, unit type, etc. and compared the results with existing national/provincial reports at every stage. We reviewed the data dictionary and learned about the sampling frame, study population and important considerations for secondary data analysis that were raised by the data custodian. To review the accuracy of each data source, we compared it to the other databases and discussed any inaccuracies with the data custodian to identify the best approach to address them. | When working with data, it is important to ensure the data always reflects what it is supposed to measure. To do so, the feasibility of database reviews and revisions at every stage of the study should be considered. This may involve sensitivity analysis through the comparison of the actual database with the expected product, comparing the database to pre-existing databases (if available), and reviewing early outputs with stakeholders. It is also critical to become familiar with the sampling frame and the target population of each database. Developing or validating case definitions across multiple sources of data requires some measures on the part of the research team to limit the level of discrepancy between sources. |
| Missing data | We discussed any variable that contained a large amount of missing data with each of the custodians during the data request process. We planned statistical analysis to deal with missing data through imputation and sensitivity analysis. We interpreted the results of our analysis within the context of the limitations associated with missing data. | Be prepared for missing data when working with secondary data. Reviewing databases with stakeholders is helpful to minimize the impact of missing data. It is also useful to familiarize yourself with/consult a statistician on approaches for dealing with missing data (e.g. imputation and sensitivity analysis). |
| Data comparability | We evaluated all data elements included in our database to determine whether they conformed to what was originally requested. Content included in databases can change over the course of a study period, sometimes without prior notice. The research team developed a data dictionary based on the original data dictionary provided by the data custodian, with notes documenting the changes made to the original databases during the study period. This more comprehensive data dictionary defined each of the variables (e.g. name, format, description, notes), described missing data, defined acronyms, and provided data ranges, etc. After new analysis was completed, we added information about any changes made to our database (e.g. new variables, composite variables, etc.) and the date of each change to the data dictionary. We regularly reviewed the data dictionary with the research team to ensure it captured all changes made to the originally requested database. | After identifying and obtaining the requested data sources related to the research question, it is very important to develop a data dictionary (metadata) for your own record keeping. Data dictionaries should include any changes made by the research team and an up-to-date list of study variables, and may differ from the one provided alongside the original database by the data custodians. |
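
Two of the accuracy checks named in the table, range checks and double entry, can be sketched in a few lines. This is an illustrative sketch only and does not reproduce the authors' protocol: the field name `age_years`, the plausible range of 0 to 120 and both function names are assumptions, and the two entry passes are assumed to list the same records in the same order.

```python
def check_ranges(rows, rules):
    """Return (row_index, field, value) for every value outside its expected range."""
    problems = []
    for index, row in enumerate(rows):
        for field, (low, high) in rules.items():
            value = row.get(field)
            if value is None:
                continue  # missing values are handled separately
            if not (low <= value <= high):
                problems.append((index, field, value))
    return problems

def compare_double_entry(first_pass, second_pass):
    """Return (row_index, field) wherever two independent entries disagree."""
    mismatches = []
    for index, (a, b) in enumerate(zip(first_pass, second_pass)):
        for field in a:
            if a[field] != b.get(field):
                mismatches.append((index, field))
    return mismatches

# Hypothetical rule and made-up records for illustration.
rules = {"age_years": (0, 120)}
entry_one = [{"age_years": 34}, {"age_years": 1001}, {"age_years": None}]
entry_two = [{"age_years": 34}, {"age_years": 101}, {"age_years": None}]

range_problems = check_ranges(entry_one, rules)
entry_mismatches = compare_double_entry(entry_one, entry_two)
```

Flagged rows are returned rather than corrected, so that any inaccuracy can be discussed with the data custodian before the data is changed.
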