
The Online Bioinformatics Resources Collection at the University of Pittsburgh Health Sciences Library System--a one-stop gateway to online bioinformatics databases and software tools.

Yi-Bu Chen, Ansuman Chattopadhyay, Phillip Bergen, Cynthia Gadd, Nancy Tannery.

Abstract

To bridge the gap between the rising information needs of biological and medical researchers and the rapidly growing number of online bioinformatics resources, we have created the Online Bioinformatics Resources Collection (OBRC) at the Health Sciences Library System (HSLS) at the University of Pittsburgh. The OBRC, containing 1542 major online bioinformatics databases and software tools, was constructed using the HSLS content management system built on the Zope Web application server. To enhance the output of search results, we further implemented the Vivísimo Clustering Engine, which automatically organizes the search results into categories created dynamically based on the textual information of the retrieved records. As the largest online collection of its kind and the only one with advanced search results clustering, OBRC is aimed at becoming a one-stop guided information gateway to the major bioinformatics databases and software tools on the Web. OBRC is available at the University of Pittsburgh's HSLS Web site (http://www.hsls.pitt.edu/guides/genetics/obrc).

Year: 2006 PMID: 17108360 PMCID: PMC1669712 DOI: 10.1093/nar/gkl781

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In the past decade, the emergence and rapid advance of genomic and proteomic technologies have generated unprecedented amounts of genomic and proteomic data. As the genomes of 294 model organisms have been sequenced with 1206 more on the way (1), the amount of nucleotide sequence data alone nearly doubles every year. Such explosive growth of data has spawned hundreds of Web-based, publicly available bioinformatics resources, including databases and software tools, in various fields of biological sciences. The number of online databases listed in the Nucleic Acids Research (NAR) Molecular Biology Database Collection alone has increased more than 14-fold, from 58 in 1996 to 858 in 2006 (2). The majority of these newly emerged online resources are specialized databases and Web servers that provide not only sequence information, but also data on gene expression, macromolecular structures, genotype and phenotype of model organisms, as well as computational tools for analyzing macromolecular sequences/structures and global gene expression. Representing the best state of knowledge in the corresponding fields, these expert-curated databases and specialized software tools may greatly assist researchers in designing their own experiments, as well as in interpreting and validating their results. Although the proliferation of bioinformatics databases is a manifestation of collective efforts by the life science community to help individual researchers cope with the phenomenal growth of biological data and information, many researchers find themselves struggling to keep up-to-date with the research in their fields (3,4). The situation is further exacerbated by the fact that locating such large numbers of online resources is anything but an easy task (5).
The problem stems from the fact that the information about these online resources is scattered in various life science journals and around the Web, and that few web sites currently provide a guided access point with searchable links to a majority of these resources. Studies suggested that locating bioinformatics resources through literature searches is often very difficult (6–8). One study reported that >50% of the participating researchers use the Web to search for bioinformatics resources (9). However, searches using popular Web search engines, such as Google, are often ineffective. This is because Web search engines rank web sites by popularity rather than relevance and do not discriminate between reliable and unreliable web sites. The lack of standard search terms and the fact that Web search engines lump all hits together regardless of the nature of each hit, as long as they all contain the searched terms, further reduce the usefulness of Web search engines as a means to locate bioinformatics resources (5). The urgent need to organize the bioinformatics resources has recently been raised (5,10). Among the existing efforts to solve the problem are the Molecular Biology Database Collection compiled by the NAR (2), the Bioinformatics Links Directory (11,12), the Expasy Life Sciences Directory, the DBcat (13), the Database of Databases (14) and the Pathguide (15). Although these projects are highly valuable, their sole reliance on categorical content structure, limitations in annotation and coverage, and the lack of sophisticated search features may affect their usability and appeal to a wide audience. For example, the output of search results from the Bioinformatics Links Directory is a scrollable list spanning multiple pages, which may require users to examine the entire list in order to find the results relevant to their queries.
There is also no ranking of the results or indication of any relationships that may exist among the results. Such limitations may pose even bigger problems as the number of bioinformatics resources is expected to keep growing at a rapid pace. Different approaches, such as using document clustering techniques (16) to organize search results, may enable users to quickly navigate through a large number of search results (17,18). In order to help biomedical researchers quickly find the most relevant bioinformatics resources for their specific information needs, we sought to develop a concrete and innovative search strategy as a part of a fledgling library-based molecular biology information service at the University of Pittsburgh (19). For this purpose, we constructed the Online Bioinformatics Resources Collection (OBRC) at the Health Sciences Library System (HSLS), University of Pittsburgh. This collection currently includes 1542 online bioinformatics databases and software tools, most of which have been published by NAR or listed in its Molecular Biology Database Collection (2). In addition, we implemented the Vivísimo Clustering Engine® in OBRC to help users navigate through their search results.

METHODOLOGY

The new search strategy consists of two major components: a centralized collection of the curated information on major online bioinformatics databases and software tools, and the implementation of the Vivísimo Clustering Engine® to enhance the output of search results.

Source materials

The primary sources of OBRC are the databases and software tools published by the NAR. Specifically, the source materials were mainly the databases published in the NAR Annual Database Issues from 2001 to 2006, and the software tools published in the NAR Annual Web Server Issues from 2004 to 2006. Other databases listed in the NAR Molecular Biology Database Collection, including those published by NAR before 2001 and those not published by the NAR, were also selected. Selected databases and software tools described in other peer-reviewed journals, such as Bioinformatics and BMC Bioinformatics, were included in the collection. In addition, a number of unpublished but popular online software tools were also entered.

Collection construction, organization and maintenance

Information on each resource was entered using the HSLS content management system built on the Zope® Web application server. For each entry, the information for the following fields was entered: URL to the resource; name of the resource; a one-sentence description of the major functions; URL to the relevant PubMed abstract(s); last modification date of the entry; highlights of the resource; and keywords. The title, description and highlights for each entry were generated based on the PubMed abstract(s), as well as the content and scope of the resource. Together with the keywords, the textual information in these fields is automatically indexed by the Zope® ZCatalog and subsequently processed by the Zope®-based search engine. As a major part of the curation effort, keywords were generated based on the information in the PubMed abstract(s), the MeSH terms of the abstract(s), the information posted on the corresponding web site, as well as domain knowledge in molecular biology. Standard terminologies, commonly used by researchers in their publications, were used. The main types of keywords include biological concepts, entities, organism names, widely studied gene and protein names, and common molecular biology tasks. Whenever possible, common synonyms of the most important keywords were included as a conscious effort to improve the recall. We implemented a categorical structure and basic classification scheme that were derived from those used in the NAR Molecular Biology Database Collection (2). To make OBRC easier to browse, we consolidated the category structure and limited it to three levels. We also expanded the category names to make them more self-evident. To ensure that each entry is current and still running, we perform link analysis and content verification at least every 6 months. The results are used to update the URLs and remove the entries that are no longer available.
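The record structure and field indexing described above can be illustrated with a minimal Python sketch. It is not the HSLS content management system and does not use the Zope® ZCatalog API; the `Entry` class, the helper functions and the two sample records are hypothetical and were invented for illustration, and only the field list is taken from the text.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One collection record; the fields mirror the list given in the text."""
    url: str
    name: str
    description: str
    pubmed_urls: list = field(default_factory=list)
    last_modified: str = ""
    highlights: str = ""
    keywords: list = field(default_factory=list)

    def indexed_text(self) -> str:
        # Name, description, highlights and keywords are the searchable text fields.
        return " ".join([self.name, self.description, self.highlights, *self.keywords])


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(entries):
    """Map each token to the positions of the entries that contain it."""
    index = defaultdict(set)
    for position, entry in enumerate(entries):
        for token in tokenize(entry.indexed_text()):
            index[token].add(position)
    return index


def search(index, query):
    """Return positions of entries containing every query token (Boolean AND)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


# Hypothetical records, invented for illustration only.
entries = [
    Entry(url="http://example.org/tfdb", name="ExampleTFDB",
          description="A database of transcription factor binding sites.",
          keywords=["transcription factors", "TF", "regulatory sites"]),
    Entry(url="http://example.org/fold", name="ExampleFold",
          description="A server that predicts RNA secondary structure.",
          keywords=["RNA structure", "folding"]),
]
index = build_index(entries)
hits = search(index, "transcription factor")
```

Because the synonym "TF" is stored as a keyword of the first record, a query for "TF" reaches that record even though the abbreviation never appears in its description, which is the recall benefit the curated synonyms are meant to provide.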

Vivísimo Clustering Engine® implementation

The Vivísimo Clustering Engine® is based on a novel, intricate three-pass algorithm that is augmented with hundreds of special processing heuristics and endowed with thousands of specific facts and general patterns of English and other languages. It automatically organizes large numbers of search results into different groups and enables users to quickly survey and identify relevant groups. The Vivísimo Clustering Engine® has been successfully applied on the Web by search engines such as Clusty and ClusterMed™. Queries can be formed with basic Boolean operators. Queries are first processed by the Zope®-based search engine, which leverages Zope® search tools. The results are then processed on-the-fly by the Vivísimo Clustering Engine® using the textual information from a set of fields selected from the following: title, description, highlights and keywords. The search results organized by the Vivísimo Clustering Engine® are finally presented to the users.
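The idea of grouping search results into labeled clusters after retrieval can be illustrated with a deliberately simple sketch. This is not the Vivísimo algorithm, whose three-pass design and heuristics are not specified in the text; it swaps in a much cruder technique that groups results by the keywords they share. The function name, the `min_size` parameter and the sample results are assumptions made for illustration.

```python
from collections import Counter


def cluster_by_shared_keywords(results, min_size=2):
    """Group results under every keyword shared by at least `min_size` of them.

    `results` maps a result name to its list of keywords. A result may appear
    in more than one cluster; results that share no keyword go under "Other".
    """
    # Count in how many results each keyword occurs.
    counts = Counter(kw for kws in results.values() for kw in set(kws))
    # Largest clusters first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    labels = [kw for kw, n in ranked if n >= min_size]

    clusters = [
        (label, sorted(name for name, kws in results.items() if label in kws))
        for label in labels
    ]
    clustered = {name for _, members in clusters for name in members}
    leftover = sorted(set(results) - clustered)
    if leftover:
        clusters.append(("Other", leftover))
    return clusters


# Hypothetical search results, invented for illustration only.
results = {
    "ToolA": ["transcription factors", "binding sites"],
    "ToolB": ["transcription factors", "promoters"],
    "ToolC": ["binding sites", "transcription factors"],
    "ToolD": ["rna structure"],
}
clusters = cluster_by_shared_keywords(results)
```

In this toy version the cluster labels come straight from curated keywords, whereas the text describes categories created dynamically from the textual content of the retrieved records, so the sketch only conveys the shape of the output, not how the labels are derived.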

RESULTS

Figure 1 shows a sample record display of OBRC.

Figure 1

The screenshot of a sample record display of OBRC.

There are a total of 1542 unique online bioinformatics resources in the current version of OBRC. The databases (475) and software tools (397) published in NAR Annual Database Issues (2001–2006) and Web Server Issues (2004–2006) contribute ∼30.8% and 25.7% of the total entries in OBRC, respectively. The resources published in other journals (488) contribute ∼31.6%. In addition, all the valid databases listed in the latest NAR Molecular Biology Database Collection (2) are included. Organized with a three-level hierarchical category classification, OBRC was divided into 13 major categories, 40 secondary categories and 12 tertiary categories to help users browse the entire collection (Supplementary Table 1). The top five main categories are ‘DNA Sequence Databases and Analysis Tools’ (325), ‘Protein Sequence Databases and Analysis Tools’ (306), ‘Genomic Databases and Analysis Tools’ (270), ‘Structure Databases and Analysis Tools’ (244) and ‘RNA Databases and Tools’ (130). The top five specific topics are ‘Protein structures’ (214), ‘Regulatory sites and transcription factors’ (112), ‘Protein sequence motifs, active or functional sites, and functional annotations’ (77), ‘Human mutations and diseases’ (76) and ‘General protein sequence databases, sequence similarity search, analysis, and alignment tools’ (68). Some resources were listed in multiple categories.
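The reported shares follow directly from the counts given above, as this short check shows. The variable names are ours; the final figure is simple subtraction and is not a number stated in the text.

```python
TOTAL = 1542  # unique resources in the collection

counts = {
    "NAR Database Issues (2001-2006)": 475,
    "NAR Web Server Issues (2004-2006)": 397,
    "Other journals": 488,
}

# Share of the collection contributed by each source, in percent.
shares = {source: round(100 * n / TOTAL, 1) for source, n in counts.items()}

# Entries not covered by the three groups above (derived, not reported).
remaining = TOTAL - sum(counts.values())

for source, share in shares.items():
    print(f"{source}: {share}%")
print(f"Remaining entries: {remaining}")
```

The three groups account for 1360 entries, so the rest of the collection would have to come from the other sources listed under Source materials.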

DISCUSSION

Studies have shown that the clustered results display is more efficient and user friendly than the traditional sequential search results display (20,21). Applying the Vivísimo Clustering Engine® to the search results not only offers users a quick overview of all the search results with little scrolling, but also shows how the search results are related to each other, as represented by the themes (Figure 2). This advantage becomes compelling in cases where a large number of search results are returned, as the clustered results display drastically reduces the effort needed to navigate through the results set in order to locate the most relevant ones. The sequential display, as employed by popular Web search engines, requires users to scroll down page by page in order to find the results specific to their needs. Another benefit brought by the Vivísimo Clustering Engine® is that users can use relatively broad query terms and may still be able to find specific results quickly. This could be particularly helpful to users during their searches as it may reduce the effort spent on query reformulation. Furthermore, with Vivísimo's document clustering, there is little need for the expensive and laborious tasks of creating a controlled vocabulary and/or extensively indexing or pre-labeling the documents.

Figure 2

(a) The screenshot of the first page of results for the testing query ‘transcription factor or factors’ from searching the OBRC using the Zope®-based search engine coupled with the Vivísimo Clustering Engine®. (b) The expanded view of the major clusters of the search results.

Our preliminary evaluation study suggests that the OBRC search strategy performs much better than a Web search engine based strategy, an advantage largely attributable to its centralized collection and curated keywords (data not shown). However, the recall and precision are still imperfect. A close examination of the search results indicates that the false negatives, which lower the recall, are primarily due to the synonym problems that have long plagued information retrieval in the biological literature (22). Another main cause is the mismatch between singular and plural forms of terms. Such problems can be largely circumvented by implementing a special online thesaurus or synonym mapping protocol in OBRC. The false positives, which lower the precision, are mainly attributed to the fact that the Zope®-based search engine searches all the text fields of each OBRC entry, and sometimes words in some of the fields match the queries despite their irrelevance to the major content/function of the corresponding database/software tool. Such false positives could be entirely eliminated if the Zope®-based search engine searched only the keyword field of each OBRC entry. The tradeoff of such a strategy is that the keywords are generated to represent only the main concepts, contents and functions of the underlying database/software tool; restricting the search to the keyword field may therefore lower the recall, as the less relevant databases/software tools are likely to be left out.
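The tradeoff between searching all text fields and searching keywords only, and the effect of synonym and plural handling, can be made concrete with a small sketch. Everything here is hypothetical: the three records, the relevance judgments, the `SYNONYMS` table and the naive plural folding are assumptions chosen for illustration and do not reproduce the unpublished evaluation or any thesaurus used in OBRC.

```python
# Hypothetical synonym table mapping variants to a preferred term.
SYNONYMS = {"tf": "transcription factor"}


def normalize(term):
    """Lowercase, fold a trailing plural 's' (naively), then map synonyms."""
    term = term.lower().strip()
    # Naive plural folding; skips endings such as 'analysis' or 'class'.
    if term.endswith("s") and not term.endswith(("ss", "is", "us")):
        term = term[:-1]
    return SYNONYMS.get(term, term)


def precision_recall(retrieved, relevant):
    """Precision = TP / retrieved, recall = TP / relevant."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def search_keywords_only(entries, query):
    q = normalize(query)
    return {name for name, e in entries.items()
            if q in {normalize(k) for k in e["keywords"]}}


def search_all_fields(entries, query):
    q = normalize(query)
    in_text = {name for name, e in entries.items() if q in e["text"].lower()}
    return in_text | search_keywords_only(entries, query)


# Hypothetical records and relevance judgments, for illustration only.
entries = {
    "A": {"keywords": ["transcription factors"],
          "text": "database of transcription factor binding sites"},
    "B": {"keywords": ["protein structure"],
          "text": "structures, including transcription factor complexes"},
    "C": {"keywords": ["gene expression"],
          "text": "expression atlas; highlights cite a transcription factor study"},
}
relevant = {"A", "B"}  # B is relevant but only peripherally

all_fields = precision_recall(search_all_fields(entries, "transcription factors"), relevant)
keywords_only = precision_recall(search_keywords_only(entries, "TFs"), relevant)
```

With these invented records, the all-field search retrieves the irrelevant record C and so loses precision, while the keyword-only search removes that false positive at the cost of missing the peripherally relevant record B.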

CONCLUSIONS

We have created the OBRC, covering the most widely used and authoritative open source bioinformatics databases and software tools on the Web. The implementation of the Vivísimo Clustering Engine® in OBRC enhances the output of search results and may help users to navigate through large numbers of results with ease. The rich content in OBRC coupled with the advanced search features represents a novel search solution for online bioinformatics resources that will benefit biomedical researchers at large. Its aggregated content may also be useful as part of an integrated biological information system. A future direction will be to continue to expand OBRC to include databases and software tools published in other journals. We will also explore new methods, such as constructing an embedded synonym mapping protocol and implementing the Vivísimo domain-specific controlled vocabularies, to further boost the recall and precision and to enhance the results clustering process. Additionally, we will improve the usability of OBRC by studying user experiences and implementing other features, such as an RSS feed and user/curator preferences/ratings of each resource. We welcome any comments and suggestions on further improvement of OBRC.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

15 in total

Review 1. The extraction of useful information from the biomedical literature.

Authors: R Kostoff
Journal: Acad Med Date: 2001-12 Impact factor: 6.893

2. Mining the bibliome: searching for a needle in a haystack? New computing tools are needed to effectively scan the growing amount of scientific literature for useful information.

Authors: Les Grivell
Journal: EMBO Rep Date: 2002-03 Impact factor: 8.807

3. Bioinformatics leads charge by publishing more Internet addresses in abstracts than any other journal.

Authors: Lisa M Schilling; Jonathan D Wren; Robert P Dellavalle
Journal: Bioinformatics Date: 2004-07-01 Impact factor: 6.937

4. 404 not found: the stability and persistence of URLs published in MEDLINE.

Authors: Jonathan D Wren
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937

Review 5. Mining the biomedical literature in the genomic era: an overview.

Authors: Hagit Shatkay; Ronen Feldman
Journal: J Comput Biol Date: 2003 Impact factor: 1.479

6. Design and implementation of a library-based information service in molecular biology and genetics at the University of Pittsburgh.

Authors: Ansuman Chattopadhyay; Nancy Hrinya Tannery; Deborah A L Silverman; Phillip Bergen; Barbara A Epstein
Journal: J Med Libr Assoc Date: 2006-07

7. Pathguide: a pathway resource list.

Authors: Gary D Bader; Michael P Cary; Chris Sander
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. The Molecular Biology Database Collection: 2006 update.

Authors: Michael Y Galperin
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide.

Authors: Konstantinos Liolios; Nektarios Tavernarakis; Philip Hugenholtz; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. The Bioinformatics Links Directory: a compilation of molecular biology web servers.

Authors: Joanne A Fox; Stefanie L Butland; Scott McMillan; Graeme Campbell; B F Francis Ouellette
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971
