
Publicly Available Health Research Datasets: Opportunities and Responsibilities.

Ahmed S BaHammam, Michael W L Chee.

Year:  2022        PMID: 36199429      PMCID: PMC9527360          DOI: 10.2147/NSS.S390292

Source DB:  PubMed          Journal:  Nat Sci Sleep        ISSN: 1179-1608


“We are moving slowly into an era where big data is the starting point, not the end.” – Pearl Zhu, author of “Digital Master”.

The drive to share scientific data related to human health began with the Open Science movement in the 1990s and has recently been supported by governments; the move aims to advance science and scientific communication and to transform contemporary culture and decision-making. Data sharing also serves to accelerate discovery by offering access to high-value, large, complex datasets that many researchers cannot readily collect themselves. Undoubtedly, this international effort represents one of the most significant developments in evidence-based practices this century.1 Publicly available data have created a fundamental and significant change in how we conduct research, make decisions, develop policy, and assess our actions. It has been proposed that several tasks, including disease surveillance and signal identification, risk prediction, therapeutic intervention targeting, and, last but not least, disease comprehension, may be accomplished with big data.2 Another advantage of publicly available datasets is the ability to utilize, share, and integrate these data with other datasets, which opens up new avenues for scientific interaction and cooperation. The reuse and scrutiny of collected data by parties other than the original researchers enhance transparency and rigor in the documentation and conduct of data collection. Reanalysis of original or aggregated datasets ensures the reproducibility and robustness of the inferences made. In addition, health-related research, including sleep medicine research, is currently moving towards individually tailored therapeutic interventions and the big data era.3,4 Sleep medicine research is well suited to large digital datasets because of the way its data are recorded and collected; for example, polysomnography (PSG) records several physiological signals that aid in clinical research and therapeutic decision-making.
Moreover, wearable devices and self-quantification systems are further sources of large datasets. In addition, a growing number of large sleep-related datasets are publicly accessible for analysis by researchers worldwide, such as PhysioNet, which provides large libraries of recorded physiologic signals that are freely available on the web, and the “Montreal Archive of Sleep Studies” (MASS).3 Furthermore, the “National Sleep Research Resource” (NSRR) is a new “National Heart, Lung, and Blood Institute” site created to give the community of sleep researchers access to very large datasets.5,6 The NSRR is a system, developed with a single point of access, for exchanging and reusing large-scale physiological signals (NSRR; R24HL114473). De-identified sleep medicine data for more than 35,000 patients (at the time of writing) from 22 US-based datasets, including PSGs and links to risk factor and outcome data for research participants, are available on the NSRR’s free and public web platform.7 The functional architecture was created and implemented to allow ongoing data sharing and integration.
It has been effectively adapted and enhanced to allow the collection of prospective data for epidemiological cohort studies supported by the “National Institutes of Health (NIH)”, such as the “Sleep Heart Health Study”, the “MrOS dataset”, whose primary focus is PSG data, “Heart Biomarker Evaluation in Apnea Treatment”, and the “Multi-Ethnic Study of Atherosclerosis (MESA Sleep)”, which included a sleep questionnaire, actigraphy, and a full overnight unattended PSG, and to combine data from independent research groups, such as the “Wisconsin Sleep Cohort”.8 The UK Biobank (data on half a million participants) and the “Adolescent Brain Cognitive Development (ABCD)” study are two prominent examples of government-funded programs developing an interest in sleep.9,10 In addition, scientist-led data-collecting consortia like ENIGMA (50 active ENIGMA working groups) have also recently developed sleep sections.11 The availability of such vast health data, once the exclusive province of high-end, well-funded laboratories, has stimulated a surge in work on automated sleep staging systems and opened new vistas for the early detection of sleep-disordered breathing and other sleep disorders. Although open data provide golden opportunities for scientists and researchers, they also have shortcomings that need to be discussed.
For example, despite all the unprecedented advantages of publicly available datasets, there have been reports of low-quality association studies with limited clinical utility that used public database data.12,13 In addition, the machine learning community has recently discovered a concerning number of potential ethical and legal issues with many of the most widely used image datasets, including representational harms, bias effects, invasions of privacy, and ambiguous or questionable downstream uses.14,15 Moreover, there are public concerns about preserving privacy and confidentiality,16,17 particularly with healthcare and public health databases. However, there is an understanding among the research community that the desire for openness and transparency must be balanced with the requirement to protect confidentiality. Additionally, because few datasets have high visibility, easy access, and usability, there is a risk that researchers may draw on a small, unrepresentative pool of data, which could result in significant biases. Therefore, publicly available data need precise standards to assure transparency about their source, how they were developed, their credibility, their proper analysis, their potential combinability with other datasets, and their weaknesses. Owing to the intrinsic nature of secondary analysis, the available data were not gathered to answer a specific research question or to test a specific hypothesis. As a result, some important variables are frequently unavailable for analysis. Similarly, data may not have been collected for all population groups or geographic areas of interest. Another significant constraint is that the researchers evaluating the data are frequently different from those who took part in the data collection process. As a result, they are probably unaware of details or flaws unique to the study that may affect how certain variables in the dataset should be interpreted.
In addition, users may overlook crucial information if it is not prominently shown in the documentation, since there is sometimes an overwhelming amount of material (primarily designed for sophisticated, extensive surveys conducted by government bodies). Nevertheless, at Nature and Science of Sleep, we recognize the importance of publicly available datasets in advancing public health and sleep medicine research. Therefore, we have proposed guidelines in the Scope and Aims of the journal for authors submitting studies that use publicly available databases, to ensure that valuable and high-quality scientific work derived from these databases reaches the readers of Nature and Science of Sleep and the sleep medicine community. Nature and Science of Sleep aims to ensure that the data used meet the values recognized as pertinent to big data in health and research on a substantive and procedural level, with accepted definitions, utilizing an ethical decision-making framework.18,19 Reports of data analysis must meet certain requirements, including a thorough description of the population being studied, the sampling plan and strategy, the time period of data collection, assessment tools, response rates, and quality control procedures; additionally, authors should specify whether the data analysis follows a “question-driven” or “data-driven” approach and how missing variables were managed. While data imputation techniques can help cover gaps from missing data, variables that are key to testing focused hypotheses may not have been collected.20 Further, the statistical power of large numbers cannot make up for gaps in representation (for example, of racial groups and social status) that limit the generalizability of the data. Moreover, the dataset’s strengths and limitations should be clearly described to the readers. This task is shared between the curators and the researchers.
On the one hand, curators need to be open about the limitations or flaws in the collected data, some of which are uncovered by users or emerge following updates in technology or standards. On the other hand, researchers need to thoroughly study all essential documentation offered to database users to evaluate the data’s internal and external validity and decide whether there are enough cases in the dataset to produce accurate estimates about the topic of interest. Moreover, creators need to oversee the usage of their datasets, make license and documentation adjustments as needed and, if required, restrict access.21 Additionally, before beginning the investigation, researchers should clearly define the exposure variables, outcome variables, covariates, and confounding factors that will be considered in the analysis. A significant shift in the approach to datasets is required going forward,14,21 since the ethical implications of a dataset are challenging to predict and handle at the time of its creation. Peng et al underlined the need for harm reduction and stewardship throughout the dataset’s life cycle, as well as attention to moral and societal standards that may evolve over time. Nature and Science of Sleep shall continue to observe the evolving community standards closely and encourage authors to submit high-quality research papers generated using publicly available research datasets. Furthermore, Nature and Science of Sleep will closely monitor for any articles that employ retracted datasets and pay close attention to data citation and availability statements.
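As an illustration of the reporting expectations discussed above (variables defined before the analysis, a statement of how missing values were managed, and a check on case numbers per subgroup), the following is a minimal Python sketch. It is not a journal requirement or a method taken from any cited dataset: the plan fields, the variable names (`ahi`, `hypertension`, `bmi`, `ethnicity`), and the four synthetic records are all hypothetical, and mean imputation is used only because it is the simplest single-imputation technique to show.

```python
from collections import Counter
from statistics import mean

# Hypothetical pre-specified analysis plan; every name here is illustrative.
PLAN = {
    "approach": "question-driven",
    "exposure": "ahi",
    "outcome": "hypertension",
    "covariates": ["age", "bmi"],
    "subgroup": "ethnicity",
}

def missingness(records, variables):
    """Fraction of records in which each variable is missing (None or absent)."""
    n = len(records)
    return {v: sum(1 for r in records if r.get(v) is None) / n for v in variables}

def subgroup_counts(records, variable):
    """Number of records per subgroup level; missing values are counted as 'unknown'."""
    return Counter(
        r.get(variable) if r.get(variable) is not None else "unknown"
        for r in records
    )

def mean_impute(records, variable):
    """Single (mean) imputation on copies of the records, flagging imputed values."""
    observed = [r[variable] for r in records if r.get(variable) is not None]
    fill = mean(observed)
    imputed_rows = []
    for r in records:
        row = dict(r)  # copy, so the original records stay untouched
        was_missing = row.get(variable) is None
        row[variable + "_imputed"] = was_missing
        if was_missing:
            row[variable] = fill
        imputed_rows.append(row)
    return imputed_rows

# Synthetic toy records, invented purely for this sketch.
records = [
    {"ahi": 12.0, "hypertension": 1, "age": 61, "bmi": 31.0, "ethnicity": "A"},
    {"ahi": 4.0, "hypertension": 0, "age": 55, "bmi": None, "ethnicity": "A"},
    {"ahi": 30.0, "hypertension": 1, "age": 70, "bmi": 29.0, "ethnicity": "B"},
    {"ahi": None, "hypertension": 0, "age": 48, "bmi": 24.0, "ethnicity": None},
]

variables = [PLAN["exposure"], PLAN["outcome"], *PLAN["covariates"]]
report = missingness(records, variables)
counts = subgroup_counts(records, PLAN["subgroup"])
imputed = mean_impute(records, "bmi")

print("Approach:", PLAN["approach"])
print("Missing fraction per variable:", report)
print("Records per subgroup:", dict(counts))
```

Keeping an explicit flag for imputed values lets a reader see which observations were filled in, and the subgroup counts make gaps in representation visible before any estimate is reported. A real analysis would likely need a more defensible approach than mean imputation, such as multiple imputation, and would report the choice in the methods.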
  13 in total

1.  An open-source toolbox for standardized use of PhysioNet Sleep EDF Expanded Database.

Authors:  Syed Anas Imtiaz; Esther Rodriguez-Villegas
Journal:  Conf Proc IEEE Eng Med Biol Soc       Date:  2015

2.  Missing data imputation: focusing on single imputation.

Authors:  Zhongheng Zhang
Journal:  Ann Transl Med       Date:  2016-01

Review 3.  Scaling Up Scientific Discovery in Sleep Medicine: The National Sleep Research Resource.

Authors:  Dennis A Dean; Ary L Goldberger; Remo Mueller; Matthew Kim; Michael Rueschman; Daniel Mobley; Satya S Sahoo; Catherine P Jayapandian; Licong Cui; Michael G Morrical; Susan Surovec; Guo-Qiang Zhang; Susan Redline
Journal:  Sleep       Date:  2016-05-01       Impact factor: 5.849

4.  High-profile coronavirus retractions raise concerns about data oversight.

Authors:  Heidi Ledford; Richard Van Noorden
Journal:  Nature       Date:  2020-06       Impact factor: 49.962

5.  UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age.

Authors:  Cathie Sudlow; John Gallacher; Naomi Allen; Valerie Beral; Paul Burton; John Danesh; Paul Downey; Paul Elliott; Jane Green; Martin Landray; Bette Liu; Paul Matthews; Giok Ong; Jill Pell; Alan Silman; Alan Young; Tim Sprosen; Tim Peakman; Rory Collins
Journal:  PLoS Med       Date:  2015-03-31       Impact factor: 11.069

Review 6.  Big Data's Role in Precision Public Health.

Authors:  Shawn Dolley
Journal:  Front Public Health       Date:  2018-03-07

7.  An Ethics Framework for Big Data in Health and Research.

Authors:  Vicki Xafis; G Owen Schaefer; Markus K Labude; Iain Brassington; Angela Ballantyne; Hannah Yeefen Lim; Wendy Lipworth; Tamra Lysaght; Cameron Stewart; Shirley Sun; Graeme T Laurie; E Shyong Tai
Journal:  Asian Bioeth Rev       Date:  2019-10-01

8.  The National Sleep Research Resource: towards a sleep data commons.

Authors:  Guo-Qiang Zhang; Licong Cui; Remo Mueller; Shiqiang Tao; Matthew Kim; Michael Rueschman; Sara Mariani; Daniel Mobley; Susan Redline
Journal:  J Am Med Inform Assoc       Date:  2018-10-01       Impact factor: 4.497

Review 9.  SleepOMICS: How Big Data Can Revolutionize Sleep Science.

Authors:  Nicola Luigi Bragazzi; Ottavia Guglielmi; Sergio Garbarino
Journal:  Int J Environ Res Public Health       Date:  2019-01-21       Impact factor: 3.390
