
User needs analysis and usability assessment of DataMed - a biomedical data discovery index.

Ram Dixit¹, Deevakar Rogith¹, Vidya Narayana¹, Mandana Salimi¹, Anupama Gururaj¹, Lucila Ohno-Machado², Hua Xu¹, Todd R Johnson¹.

Abstract

OBJECTIVE: To present user needs and usability evaluations of DataMed, a Data Discovery Index (DDI) that allows searching for biomedical data from multiple sources.
MATERIALS AND METHODS: We conducted 2 phases of user studies. Phase 1 was a user needs analysis conducted before the development of DataMed, consisting of interviews with researchers. Phase 2 involved iterative usability evaluations of DataMed prototypes. We analyzed data qualitatively to document researchers' information and user interface needs.
RESULTS: Biomedical researchers' information needs in data discovery are complex, multidimensional, and shaped by their context, domain knowledge, and technical experience. User needs analyses validate the need for a DDI, while usability evaluations of DataMed show that even though aggregating metadata into a common search engine and applying traditional information retrieval tools are promising first steps, there remain challenges for DataMed due to incomplete metadata and the complexity of data discovery.
DISCUSSION: Biomedical data poses distinct problems for search when compared to websites or publications. Making data available is not enough to facilitate biomedical data discovery: new retrieval techniques and user interfaces are necessary for dataset exploration. Consistent, complete, and high-quality metadata are vital to enable this process.
CONCLUSION: While available data and researchers' information needs are complex and heterogeneous, a successful DDI must meet those needs and fit into the processes of biomedical researchers. Research directions include formalizing researchers' information needs, standardizing overviews of data to facilitate relevance judgments, implementing user interfaces for concept-based searching, and developing evaluation methods for open-ended discovery systems such as DDIs.


Keywords: data discovery; information retrieval; metadata; usability; user needs

Year: 2018 PMID: 29202203 PMCID: PMC7378884 DOI: 10.1093/jamia/ocx134

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

INTRODUCTION

As the number, size, and public availability of biomedical datasets grow, so do the opportunities for new forms of research to advance biomedical knowledge. However, the heterogeneous nature of biomedical data, the complexity of data-intensive research, and the lack of data discovery infrastructure pose significant challenges for researchers to take advantage of this opportunity. Fragmented data environments, lack of data standards, and poor documentation are key issues that limit the direction and scope of data-driven research, often in the initial discovery phase. Data must be better organized to facilitate the advancement of biomedical science. In 2013, the National Institutes of Health Big Data to Knowledge (BD2K) initiative issued a call to assemble data from multiple sources into a discovery system termed a Data Discovery Index (DDI). A DDI aims to accelerate data-intensive biomedical science by providing a mechanism for searching publicly available data. A prototype DDI, DataMed, was launched in 2015 by the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) project. DataMed uses a common model (DatA Tag Suite [DATS], described elsewhere) to index metadata from biomedical data repositories, providing researchers with a PubMed-like search engine for discovering datasets relevant to their research interests. Aggregating datasets poses challenges in information retrieval and user interface design to enable the provision of meaningful search results to users. While information seeking in literature indexes such as PubMed has been studied, biomedical researchers’ information needs in data discovery are poorly understood and the conceptualization of data as an information resource in biomedicine is still emerging.
Existing work has partially addressed the complexity and open-endedness of dataset search and evaluation, the diversity of research purposes and expertise, and the importance of context in understanding researchers’ information needs. Additionally, dataset retrieval differs from literature retrieval, because the heterogeneous sources and types of data and metadata make traditional indexing techniques insufficient. A successful DDI must meet researchers’ information needs and fit into their research practices. User-centered design (UCD) ensures that technology meets users’ needs through an iterative process of design and evaluation. UCD aims to produce systems that are useful, usable, and effective. Usable systems complement users’ knowledge, skills, and contexts, improving the effectiveness and satisfaction with which they accomplish their goals. Usability evaluation in UCD consists of both quantitative and qualitative measures. However, existing quantitative and formal methods for evaluating information retrieval systems, such as precision and recall, do not adequately measure the subjective process of “discovering” data. Qualitative methods such as user needs analysis, which involves understanding users’ work goals and priorities, and usability testing, which consists of simulating representative tasks by representative users on system prototypes, are thus well suited for guiding the design and development of data discovery systems. In this paper, we present biomedical researchers’ information needs as discovered through user needs analyses and a qualitative usability assessment of DataMed. We share these insights to benefit researchers and system developers working to facilitate biomedical data science, highlighting the importance of understanding biomedical researchers’ needs for the success of the systems they build, and suggesting promising areas of future research for the development of successful biomedical data discovery systems.
The protocols for our studies are included as appendices to the paper and can be repurposed or used as a starting point for evaluating systems for searching or exploring biomedical data.

MATERIALS AND METHODS

We conducted user analyses in 2 phases to understand researchers’ needs and guide the development of DataMed (Figure 1). In Phase 1, we interviewed researchers to understand why and how they look for online biomedical datasets. In Phase 2, we conducted “think-aloud” usability evaluations to assess DataMed’s effectiveness at facilitating data discovery.

Figure 1.

Diagram showing UCD process for DataMed. Phase 1 research was conducted prior to the development of DataMed; Phase 2 evaluations were conducted on versions 0.5 and 2.0 of DataMed.

Potential users of DataMed were defined as researchers involved in biomedical research with experience in biomedical data analysis. Our sampling plan included researchers at various levels of expertise (graduate students, postdoctoral researchers, and faculty members) and various research domains to capture broad patterns of data discovery. Demographic characteristics such as age, gender, and ethnicity were noted during study sessions but not used as selection criteria. Our study protocols were deemed exempt from review by the University of Texas Health Science Center Committee for the Protection of Human Subjects, since we did not collect personally identifiable information and the study did not involve vulnerable populations. Participants were compensated for their time.

Phase 1: user needs analysis

Participants in the user needs study were recruited by an e-mail sent to various universities affiliated with the BD2K group and Texas Medical Center. Thirteen researchers responded to our call and were approved for the user needs analysis (Table 1).

Table 1.

Characteristics of participants in the Phase 1 user needs analysis for DataMed

Research Domain	Position	Count
Clinical Translational Science	Professor	1
Cardiology	Professor	1
Genomics	Postdoctoral Researcher	1
Biomedical Informatics	Professor	4
Biomedical Informatics	Postdoctoral Researcher	1
Molecular Biology	Postdoctoral Researcher	1
Neuroscience	Professor	1
Mobile Health	Postdoctoral Researcher	1
Public Health	PhD Student	1
Anesthesiology	Professor	1
Total		13

Author VN conducted interviews, in person or remotely depending on the participant’s location and preference, lasting from 30 min to an hour. Interview questions were developed by authors VN and TJ as prompts to discuss researchers’ current data discovery practices and to identify user needs for a DDI (study protocol available in Appendix Phase 1). During the interview, participants were introduced to the bioCADDIE project and were asked to describe their research area and their experience with 4 aspects of data discovery: searching for data, metadata, data formats, and data visualization. Additional questions were asked to clarify and probe topics brought up by participants. Author VN took detailed notes on responses to each aspect of data discovery, and then coded them manually using standard word processing software to identify existing practices for data discovery, challenges or areas of difficulty, and design ideas. The coded interviews were summarized across participants to inform the DataMed development.

Phase 2: usability evaluations

We conducted usability evaluations of DataMed versions 0.5 and 2.0 following their releases. Participants were recruited by e-mail and through flyers from universities affiliated with BD2K projects. Interested researchers filled out a survey sharing their research area and prior experience with DataMed to ensure that those who had taken part in previous studies or had previously used DataMed were excluded. Eight qualified researchers responded to the call for the DataMed 0.5 study and 11 for the DataMed 2.0 study (Table 2).

Table 2.

Characteristics of participants in Phase 2 usability evaluation of DataMed versions 0.5 and 2.0

DataMed Version	Research Domain	Position	Count
0.5	Molecular Biology	Postdoctoral Researcher	2
		Data-related Professional	1
		PhD Student	1
	Chemistry	Professor	1
	Biomedical Informatics	PhD Student	1
	Library Science	Data-related Professional	2
	Total (Version 0.5)		8
2.0	Cancer Biology and Genetics	MD, PhD Student	1
	Cancer Biology and Genetics	PhD Student	1
	Cancer Genomics	PhD Student	1
	Public Health	Professor	1
	Public Health	PhD Student	1
	Genetic Epidemiology	Professor	1
	Systems Biology	Postdoctoral Researcher	2
	Data Curation	Data-related Professional	1
	Medical Library	Medical Librarian	1
	Neuroscience	Postdoctoral Researcher	1
	Total (Version 2.0)		11

Authors MS and RD conducted the moderated “think-aloud” usability tests for versions 0.5 and 2.0, respectively. The semistructured usability test plans (see Appendix Phase II) were developed by each author, with feedback from TJ, to simulate representative tasks on the interface and gather feedback on specific design features. The sessions lasted 1 h and were held in person or remotely, depending on the participant’s location and preference. Participants were given a brief introduction to DataMed, then asked to search for datasets related to their research area or interest, stopping when they had found relevant data or could not proceed further. Additional questions were asked to clarify and probe specific topics or issues. Finally, participants were asked to complete a usability questionnaire (a standard usability questionnaire for DataMed 0.5, the System Usability Scale for DataMed 2.0) and to answer open-ended reflection questions about their experience with DataMed. Both authors took detailed notes on participants’ use of the system and feedback during each session; the notes for each session were coded using standard word processing software to identify participants’ information needs while exploring DataMed, as well as usability issues that arose during their use of DataMed and suggestions for improvement. These categories were synthesized into recommendations for DataMed’s features and user interface. To better understand participants’ data discovery processes, several trade-offs were made in the analysis. Given the formative and exploratory nature of these studies, we focused on a detailed qualitative analysis of volunteer researchers. This allowed us to comprehensively analyze participants’ interaction with the system; however, this also meant that the quantitative questionnaire results lacked statistical power and did not play much of a role in our analysis.
For these reasons, we have omitted the questionnaire results and focus on the qualitative insights from these studies in the following section. Additionally, many participants were from the biological sciences (10 out of 19), which affects the representativeness of our findings for other fields of biomedicine. However, this also reflects the distribution of data indexed in DataMed and suggests that data-intensive biomedical practices using publicly available data may be skewed toward fields such as molecular biology and genetics. As we report and discuss our qualitative findings, we emphasize the perspectives of participants in the translational, clinical, and public health fields as well.

RESULTS

Phase 1: user needs in biomedical data discovery

Participants reported significant effort and difficulty in finding and evaluating relevant data online. One informatics researcher mentioned, “For all of [our] studies, we would like to integrate relevant studies from other sources. One challenge is knowing what data out there is relevant to what we are doing.” Common reasons for searching online for data included to validate their own work (such as the effect of an intervention on a cell line), enrich or guide their analyses (such as translational research combining -omics and clinical data), or access data they could not generate on their own (such as hospital data). While many had established strategies such as searching Google or specific data repositories, they found it challenging to know what potentially relevant data was available. A common frustration was the lack of information in metadata describing datasets. Metadata often contain only partial descriptions of crucial information, such as the samples and techniques used in generating the data. One molecular biology researcher commented, “The metadata that I would like but usually don’t get from my metadata sources is a clean description of tissues that the experimental results came from, the condition of that tissue, the methods of analysis for the phenotypes of interest that were being studied when the experiment was done.” Additionally, the variety of terminologies used to describe data, the lack of definitions, and poor documentation about the context of the data collection made it difficult to assess its potential usefulness. In these situations, researchers either had to download the dataset to inspect its contents or forgo using it altogether. Further, accessing data through various sources was often fraught with poor documentation of processes required to download the data. This was especially troublesome for those interested in clinical data due to ambiguous institutional review board approval processes.
As a neuroscience researcher put it, “We usually know what we want, but to get access is really like diving into the ocean and trying to reach the other side of the world.” The variability of data formats and levels of processing also required significant work to wrangle the data into formats compatible with their processes and tools. An example given by a neuroscience researcher about the limitations of data archives was, “If you had a collection of image data and genetics… and you wanted to say, find me all the people who have this particular brain difference and also have these two SNPs, you can’t do that at all.” As with metadata, standard information and proper documentation of how the data was generated – its provenance – was necessary for evaluating the potential utility of a dataset. Participants also recognized visualization as a useful way to provide an overview of datasets, variables, and analytic results. However, few online data sources provided such visualization tools, and current visualization techniques were often not flexible or scalable enough to meet their needs, often requiring costly investment in custom visualizations. Our analysis, summarized in Table 3, validated the concept of a DDI and highlighted key issues with metadata, data standards, and visualization in discovering biomedical data. These findings indicated that researchers would benefit from a centralized source and complete metadata documentation for finding and assessing potentially relevant datasets. Additionally, they need clear protocols to download the data, the ability to download in multiple formats, and a means to visually explore datasets.

Table 3.

Summary of user needs analysis for biomedical data discovery

Topic	Difficulties	User Needs
Searching for Data	Time and effort spent finding relevant data for research purposes	Centralized source for available data and tools for finding research-related data
Searching for Data	Poor documentation and protocols for accessing data	Standard documentation and protocols for data access
Metadata	Assessing validity and utility of dataset for secondary use	Standard metadata, vocabularies, and documentation of datasets
Metadata	Incomplete, inconsistent, and poor-quality metadata	Tools and guidelines for authors to create metadata
Data Format	Data wrangling and compatibility with analytic methods	Documentation of data provenance
Data Format	Availability of data at various degrees of processing: raw to summarized	Availability of data for compatibility with analytic tools
Visualization	Manual work required for creating custom overviews of data	Online visualization of datasets
Visualization	Limitation of current methods for visualizing and exploring large datasets	New techniques for representing and exploring large datasets

These results informed the initial development of DataMed. While researchers currently use generic search engines such as Google, DataMed is intended to aggregate metadata across multiple repositories to provide both broad coverage and effective data-specific retrieval tools, accelerating researchers’ exploration of and exposure to potentially relevant data. The DATS model was implemented as a common metadata standard to address the disparity of metadata across repositories. Information retrieval techniques were applied to address the variability in terminology and provide an efficient user interface.

Phase 2: DataMed evaluation

This section describes the combined results of usability evaluations of DataMed versions 0.5 and 2.0 (Figure 2) following their public release.

Figure 2.

The homepage of DataMed version 2.0 as of May 8, 2017.

Participants encountered difficulty generating queries in the system to describe their information needs. While the search bar on the homepage suggested an intuitive search interaction like PubMed or Google, it was not clear how this interaction would work for complex queries. One researcher commented, “I would love to search for phenotypes.… For instance, you could search for headache, but I’m not just interested in headache, I’m interested in genes.… How do you search for both?” Participants’ information needs were thus multidimensional, layered depending on whether they had specific research questions they wanted to find data about or were exploring what datasets were available in a domain or for an analytic technique. Participants also faced difficulties assessing the relevance of datasets returned in DataMed (Figure 3). The most significant problem in assessing the utility of a dataset was inconsistent, incomplete, and poor-quality metadata. Upon searching for a specific cancer, one researcher said of a returned result, “This doesn’t give you any information.… I wouldn’t even know what this is about at all.… What does that mean?” Participants looking for combinations of biomedical concepts and data provenance items found the information about most datasets in DataMed insufficient to determine whether they would be useful. When asked what metadata they would need to evaluate the relevance of a dataset, participants mentioned items included in their information needs, and also added characteristics such as ownership, research organization, and publications based on the data. Common metadata needs across participants are summarized in Table 4.

Figure 3.

An example of search results for the query “MRI patients Parkinsons” in DataMed version 2.0.

Table 4.

Summary and examples of participants’ expressed metadata needs in searching DataMed

Metadata Field	Examples
Biomedical Concepts	De Novo Acute Myeloid Leukemia
Data Type	Gene Expression, Clinical Outcomes
Data Collection Technique	Survey, Magnetic Resonance Imaging
Data Format	Text, Comma-separated Values, Digital Imaging and Communications in Medicine
Data Processing	Raw Data, Abstracted Data, Secondary Data
Sample Description	Number of Samples, Species, Population
Intervention/Study Design	Case-Control, Cohort
Date of Collection	January 2010 to January 2015
Variables	Cell Lines, Hormone Levels, Gene Knockouts
Instructions for Data Usage	Data Processing Tools, Algorithms, Tutorials
Permissions and Ownership	Protected Health Information, Institutional Review Board, Commercial or Academic Research
Research Organization and Principal Investigator	University, Private Institute, International Data
Publications Based on Data	Citations, Papers, Related Items

Overall, while participants found the concept of a DDI valuable, they faced difficulty in understanding the scope of DataMed; it was not immediately clear to them who or what DataMed was intended for and whether it contained data relevant for their research topics. One public health researcher’s initial thoughts upon seeing DataMed were: “This makes me think it’s more of a bioinformatics type thing.… The numbers make me scared.… Can I do this, can I not do this?… Do I need a person from a bioinformatics side?” Users were confused about whether returned results were data or publications (“I don’t know what I’m looking at.… [Is this] a paper, or a grant, or a project?”). Additionally, information retrieval tools such as query expansion, faceted filtering, and advanced search did not support participants as they explored results due to metadata inconsistency across datasets in indexed repositories. Finally, even when potentially relevant search results were identified, participants had to investigate related publications, the dataset repository’s data description, or the data itself to gather additional information or understand the terminology necessary to determine its relevance. Major suggestions for improving DataMed included embedding domain knowledge and concepts into the organization of the system. Many suggested providing support for query generation, through interactive ontologies or a “conceptual map,” to help them understand how DataMed works and express their information needs more explicitly. They also suggested organizing results by biomedical concepts to provide an overview of returned results, expose DataMed’s search process, and improve the utility of faceted filtering for narrowing down relevant results.
They suggested more consistent navigation, menus, and customization of metadata fields to improve navigation. Preliminary analyses or summaries of the data itself were mentioned as potentially useful in identifying what findings or associations were discovered in the data and what it could be used for. Finally, researchers appreciated links to other relevant data or publications that indicated other researchers’ use of the data and its utility.

DISCUSSION

Our study provides unique insight into researchers’ needs in biomedical data discovery through the development and evaluation of DataMed. While aggregating metadata into a common search engine and applying information retrieval tools is a promising first step, there remain challenges for researchers in finding useful data due to the complexity of data discovery, the lack of metadata standards, and variability in needs across domains and levels of expertise. Here we identify these challenges and suggest promising areas of future research for information retrieval systems to support data discovery.

Challenges for information retrieval in data discovery

Supporting exploratory search

Search and discovery in open-ended information systems such as DataMed have no predefined goal; rather, they are exploratory processes motivated by complex information needs directed toward items with characteristics that may or may not exist in the system. While DataMed provides users with access to a vast number of datasets, the heterogeneity of the information space and ineffectiveness of retrieval tools make it difficult for researchers to find and evaluate relevant items. These results corroborate previous work and highlight gaps in current techniques for supporting exploratory search. Specifically, while user interface elements such as the initial search bar are seemingly intuitive, researchers were uncertain about what they were searching, what constraints they could enter in the search field, and the degree of specificity needed to express their information needs. Additionally, they were unable to adjust their query or search strategy based on the returned results due to the unclear presentation of results, their perceived ineffectiveness of retrieval methods, and the opaqueness of the search process. Indexing systems and user interfaces that organize the complex multidimensional space of biomedical data and support users in navigating it are interrelated technical and design challenges for DDIs.

Evaluating data as an information resource

Our study of DataMed also shows that the relevance of a dataset as an information resource cannot be determined solely from summaries such as keywords, title, or an abstract, as has been suggested may be the case for publications or websites. Evaluating the contents and utility of a dataset requires significant context about the data’s provenance – its original purpose, characteristics, processing history, and insights derived – to determine whether it could be reused in a different context. Researchers looking for combinations of biomedical concepts and data characteristics in DataMed results encountered difficulty due to incomplete metadata (missing information), poor-quality metadata (incoherent descriptions), and inconsistency in metadata (available fields and terminology) across datasets. Metadata issues also limited the effectiveness of query expansion, faceted filtering, and advanced search features in providing researchers with overviews and navigation tools for identifying and navigating to items of interest. Thus, adherence to standards for metadata representation and quality is another challenge DDIs must address.
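The effect of inconsistent and missing metadata on faceted filtering can be illustrated with a small sketch. The records, the `data_type` field name, and the `NORMALIZE` mapping below are hypothetical and are not taken from DataMed or DATS; the sketch only assumes that a facet is a count of distinct values of one metadata field.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical records: the same data type is described three different ways,
# and one record has no value at all.
RECORDS: List[Dict[str, Optional[str]]] = [
    {"id": "a", "data_type": "Gene Expression"},
    {"id": "b", "data_type": "gene expression"},
    {"id": "c", "data_type": "Expression profiling"},
    {"id": "d", "data_type": None},
    {"id": "e", "data_type": "Clinical Outcomes"},
]

# Hypothetical mapping from lowercased variants to one preferred label.
NORMALIZE: Dict[str, str] = {
    "gene expression": "Gene Expression",
    "expression profiling": "Gene Expression",
    "clinical outcomes": "Clinical Outcomes",
}


def facet_counts(records, field, mapping=None) -> Counter:
    """Count records per facet value, optionally normalizing terminology."""
    counts: Counter = Counter()
    for record in records:
        value = record.get(field)
        if not value:
            # Incomplete metadata: the record cannot be reached through this facet.
            counts["(missing)"] += 1
            continue
        if mapping is not None:
            value = mapping.get(value.strip().lower(), value)
        counts[value] += 1
    return counts


raw = facet_counts(RECORDS, "data_type")
normalized = facet_counts(RECORDS, "data_type", NORMALIZE)
print(dict(raw))
print(dict(normalized))
```

Without normalization, the three gene expression datasets appear under three separate facet values, so selecting any one of them hides the other two. The record with no value stays unreachable through the facet either way, which is why normalization alone does not address incomplete metadata.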

Variability across research domains and levels of expertise

Participants’ experience and information needs in DataMed also varied with their research domain and level of domain and technical expertise. Novices in technical domains were intimidated by the amounts of data available, while domain experts mentioned that the interface should be more intelligent to understand their information needs. Metadata issues were common across all participants: even advanced researchers could not interpret datasets that had poor metadata. Supporting users from a variety of research and technical backgrounds and with a variety of purposes is an ongoing challenge for any DDI.

Future research areas for biomedical data discovery

Structured metadata abstract

Our study of DataMed shows us that making data available is not enough – researchers must be able to easily evaluate the relevance of a dataset and access the data. Consistent, complete, and high-quality metadata is vital to any data discovery system, and our documentation of researchers’ common metadata needs outlines essential information for effective data discovery. To further this area of research and support the development of effective discovery systems, a common model for describing and evaluating data as an information resource (eg, a “structured data abstract”) needs to be used. A structured data abstract containing common metadata fields such as those we have documented in this study would facilitate data discovery by providing consistent descriptions of the characteristics and context of data to enable the effective use of information retrieval tools and allow researchers to easily evaluate datasets for their utility. Similar efforts in bioscience, such as the Information Sharing Architecture framework, are converging on a common model; such models must be expanded to encompass all domains of biomedicine and refined to enable complex human and computer interpretation. The DATS metadata model is a first step toward this goal, and has evolved considerably since the first prototypes of DataMed were released.
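As an illustration only, a structured data abstract could be sketched as a record whose fields follow the metadata needs summarized in Table 4. The class name, field names, and the completeness measure below are assumptions made for this sketch; they are not the DATS schema and not part of DataMed.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class StructuredDataAbstract:
    """Hypothetical record; field names paraphrase Table 4, not the DATS model."""
    biomedical_concepts: Optional[List[str]] = None
    data_type: Optional[List[str]] = None
    collection_technique: Optional[str] = None
    data_format: Optional[str] = None
    data_processing: Optional[str] = None
    sample_description: Optional[str] = None
    study_design: Optional[str] = None
    collection_dates: Optional[str] = None
    variables: Optional[List[str]] = None
    usage_instructions: Optional[str] = None
    permissions: Optional[str] = None
    organization: Optional[str] = None
    publications: Optional[List[str]] = None


def missing_fields(abstract: StructuredDataAbstract) -> List[str]:
    """Return the names of fields that are unset or empty."""
    return [f.name for f in fields(abstract) if not getattr(abstract, f.name)]


def completeness(abstract: StructuredDataAbstract) -> float:
    """Fraction of fields that carry a value, from 0.0 to 1.0."""
    total = len(fields(abstract))
    return (total - len(missing_fields(abstract))) / total


# Example values are taken from the Examples column of Table 4.
example = StructuredDataAbstract(
    biomedical_concepts=["De Novo Acute Myeloid Leukemia"],
    data_type=["Gene Expression"],
    data_format="Comma-separated Values",
    study_design="Cohort",
)
print(missing_fields(example))
print(f"{completeness(example):.2f}")
```

A fixed set of fields makes gaps explicit: a repository or curator could report which items are missing for a dataset rather than leaving researchers to discover the gap while judging relevance. Counting filled fields says nothing about the quality of what they contain, so this measure covers only the "complete" part of consistent, complete, and high-quality metadata.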

Information retrieval for discovery

Researchers’ difficulty in using DataMed reflects a mismatch between current retrieval methods based on specific terms and researchers’ needs related to biological or data concepts. These findings corroborate previous work regarding open-ended information systems and have similar implications for the development of new methods and user interfaces for exploratory search. Traditional search methods such as document-level keyword matching do not adequately provide relevant results for datasets – the information participants are looking for about data is more complex and may be distributed across multiple metadata fields, datasets, or associated publications. Researchers’ suggestion to embed biomedical concepts into DataMed is akin to using exploratory search guidelines to leverage the semantics of indexed items to organize search and allow conceptual navigation and discovery in the context of biomedical knowledge. Developing retrieval methods that more closely match the semantics of users’ information needs is necessary to facilitate data discovery.
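The contrast between term matching and concept matching can be sketched with toy data. Everything below is hypothetical: the three dataset records are invented, and `CONCEPT_SYNONYMS` is a hand-written stand-in for an ontology or controlled vocabulary. The sketch does not describe how DataMed retrieves results.

```python
import re
from typing import Dict, List, Set

# Hypothetical synonym table standing in for an ontology lookup.
CONCEPT_SYNONYMS: Dict[str, Set[str]] = {
    "mri": {"mri", "magnetic resonance imaging"},
    "parkinson disease": {"parkinson disease", "parkinsons", "parkinson's disease"},
}

# Invented records; the relevant information is spread across several fields.
DATASETS: List[Dict[str, str]] = [
    {"id": "ds1", "title": "Brain imaging cohort",
     "technique": "Magnetic Resonance Imaging",
     "description": "Scans of patients with Parkinson's disease."},
    {"id": "ds2", "title": "MRI phantom calibration",
     "technique": "MRI",
     "description": "Scanner quality control images."},
    {"id": "ds3", "title": "Parkinsons gene expression",
     "technique": "RNA-seq",
     "description": "Expression profiles from blood."},
]


def title_keyword_search(terms: List[str]) -> List[str]:
    """Return ids whose title contains every literal term."""
    hits = []
    for ds in DATASETS:
        title = ds["title"].lower()
        if all(term.lower() in title for term in terms):
            hits.append(ds["id"])
    return hits


def concept_search(concepts: List[str]) -> List[str]:
    """Return ids where every concept matches some synonym in any metadata field."""
    hits = []
    for ds in DATASETS:
        text = " ".join(v for k, v in ds.items() if k != "id").lower()
        if all(
            any(re.search(r"\b" + re.escape(syn) + r"\b", text)
                for syn in CONCEPT_SYNONYMS[concept])
            for concept in concepts
        ):
            hits.append(ds["id"])
    return hits


print(title_keyword_search(["MRI", "Parkinsons"]))
print(concept_search(["mri", "parkinson disease"]))
```

In this toy collection the literal title search finds nothing, because no single title contains both terms, while the concept search finds the one dataset that combines the imaging technique and the disease, even though those appear in different fields and under different wording.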

User interface design and visualization

In addition to retrieval techniques, another challenge is designing multifaceted, expressive, and concept-based interfaces that allow users with varying backgrounds to learn from interactions and form a clear mental model of the system. Findings from this study and previous work indicate that support for and control over query expressiveness, transparency about the search process and the format of the results, and guidance on search strategies that provide an overview of search results and foster exploratory behavior can support discovery in open-ended information systems. Visualization techniques also have the potential to support navigation and analysis of datasets, but current methods do not support this kind of interaction at scale and must be enhanced to support data discovery and dataset exploration. The design space for dynamic search interfaces that incorporate domain knowledge and biomedical concepts to support exploration of datasets is an exciting area for future research.

Evaluation of open-ended discovery systems

Finally, our study is limited in its short-term evaluation of researchers’ discovery practices, the small sample of biomedical research domains, and the evaluation of early-stage prototypes of DataMed. While our qualitative and semistructured evaluations of DataMed helped us identify crucial challenges for data discovery, these methods are limited in scope and scale, capturing rich details, high-level descriptions, and short-term interactions with only a few individuals who may or may not be good representatives of the biomedical science community. Data discovery is a continuous process unfolding over time in research contexts; more study is needed to understand data-intensive research practices across domains of biomedicine and to clarify notions of “discoverability” and “relevance.” New quantitative or formal methods must be developed to investigate the evolution of data discovery and to complement qualitative methods for analyzing users’ search behaviors and system performance in the user-centered design process for DDIs.

CONCLUSION

This paper presents findings from a user needs analysis and usability evaluations conducted during the development of DataMed. The user needs analysis validated the need for a DDI and highlighted issues of metadata, data standards, and visualization in data discovery. Usability evaluations of DataMed further validated the concept of a DDI, provided insight into researchers’ information needs, and identified unique challenges in designing a DDI. Making data available is not enough to facilitate data discovery: new information retrieval techniques and user interfaces are necessary for dataset exploration. Consistent, complete, and high-quality metadata are vital to enable this process. We emphasize the importance of understanding researchers’ information needs in designing data infrastructures to support biomedical data discovery. While available data and researchers’ information needs are complex and heterogeneous, a successful DDI must meet these needs and fit the processes of biomedical researchers.

Funding

This work was supported by the National Institutes of Health, grant number U24AI117966.

Competing interests

The authors have no competing interests to declare.

Contributors

LO-M and HX provided the overall conceptual design for DataMed, including the initial target user groups. HX also identified the scope and user groups for the user studies reported here and participated in the design of the study protocols. TRJ, in collaboration with DR, AG, RD, VN, and MS, designed the detailed protocols and oversaw data collection and evaluation. VN, RD, and MS collected data from the study participants. RD, in collaboration with TRJ, wrote the first draft, incorporating a draft on the first study written by VN. All authors reviewed and offered critical comments and/or revisions of the final version and approved its publication.
