A E Bandrowski, J Cachat, Y Li, H M Müller, P W Sternberg, P Ciccarese, T Clark, L Marenco, R Wang, V Astakhov, J S Grethe, M E Martone.
Abstract
The breadth of information resources available to researchers on the Internet continues to expand, particularly in light of recently implemented data-sharing policies required by funding agencies. However, the nature of dense, multifaceted neuroscience data and the design of contemporary search engine systems make efficient, reliable and relevant discovery of such information a significant challenge. This challenge is especially pertinent for online databases, whose dynamic content is 'hidden' from search engines. The Neuroscience Information Framework (NIF; http://www.neuinfo.org) was funded by the NIH Blueprint for Neuroscience Research to address the problem of finding and utilizing neuroscience-relevant resources such as software tools, data sets, experimental animals and antibodies across the Internet. From the outset, NIF sought to provide an accounting of available resources while developing technical solutions for finding, accessing and utilizing them. The curators, therefore, are tasked with identifying and registering resources, examining data, writing configuration files to index and display data, and keeping the contents current. In the initial phases of the project, all aspects of the registration and curation processes were manual. However, as the number of resources grew, manual curation became impractical. This report describes our experiences and successes with developing automated resource discovery and semiautomated type characterization with text-mining scripts that help the curation team discover, integrate and display new content. We also describe the DISCO framework, a suite of automated web services that significantly reduces the manual curation effort needed to periodically check for resource updates. Lastly, we discuss DOMEO, a semiautomated annotation tool that improves the discovery and curation of resources that are not necessarily website-based (e.g. reagents, software tools).
Although the ultimate goal of automation was to reduce the workload of the curators, it has also yielded valuable analytic by-products that address the accessibility, use and citation of resources, and these can now be shared with resource owners and the larger scientific community. DATABASE URL: http://neuinfo.org.
Year: 2012 PMID: 22434839 PMCID: PMC3308161 DOI: 10.1093/database/bas005
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. NIF Resource Landscape. Background: each point on the map represents a global location that houses one or more resources registered with the NIF (via NeuroLex). Red points represent NIF Registry entries and blue points represent databases and data sets incorporated into the Data Federation. Foreground: the blue line plots the number of federated data sources over time, and the green line plots the number of records in the NIF Data Federation over the same period (note that these records come only from the blue points and that the scale is logarithmic). The DISCO protocols and automated resource crawling were integrated into the NIF system in November 2009 and led to growth in NIF holdings. In mid-2011, significant enhancements to the DISCO protocols, allowing greater automation of data ingestion, were implemented along with the current resource discovery pipeline (RDP).
Figure 2. A high-level overview of the NIF system. This figure shows where the inputs and outputs of the NIF lie in relation to some of NIF's tools. Red arrows represent human steps, blue arrows represent automated steps and green boxes represent places in the system where community interactions are likely. Data are input using a suite of tools including NeuroLex (the first step for all data ingestion), DISCO (for deep data registration), LinkOut (linking data to PubMed and PubMed Central (PMC) literature), DOMEO (for literature annotation) and the RDP, an automated text-mining resource discovery pipeline that recognizes resources and recommends them to curators for possible inclusion in the NIF Registry. The creation of indices is informed by the ontology, as are the search tools and public web services. Note that all data move through a process in which a resource is recommended, registered in NeuroLex, included in the NIF Registry index and then made available to DISCO tools for deeper content integration.
The NIF curation process
| NIF curation process | Details | Status | NIF system integration |
|---|---|---|---|
| (1) Recommend a resource | Is it relevant to neuroscience? yes—register with NeuroLex, receives wiki page and unique id | Semiautomated | NIF Registry—RDP |
| (2) Determine resource type | What type of resource is it? How is it accessible? What is its structure? Can users add data? Inputs? Outputs? | Semiautomated | NIF Registry—RDP |
| (3) Check periodically for resource validity and updates | Includes automated validity check of resource status (is the web site responding), and neuroscience content value (what is the neuroscience score from extracted text), as well as contacting resource providers to review their pages | Semiautomated | NIF Registry—RDP |
| (4) Write configuration files (Interop) | Display and index properly, integrate with existing resources (normalize), weights and keywords | Manual/curators or resource owners | NIF Data Federation—DISCO |
| (5) Check periodically for data validity and updates | Includes use of DISCO tool suite to automatically crawl database content, compare to the previous version and assign a date modified stamp, manual approval from curators of new data, and contacting resource providers to review their NIF representation | Semiautomated | NIF Data Federation—DISCO |
| (6) Annotate literature | Using the DOMEO tool, curators and community members can add data (e.g. unique identifiers of reagents or proteins) to literature that can be used to enhance search capabilities of NIF and other search engines. | Manual/curators or community | NIF Literature |
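Step 5 in the table above (compare crawled content to the previous version, assign a date-modified stamp, and hold new data for curator approval) can be illustrated with a minimal Python sketch. This is not the DISCO implementation: the function names `fingerprint` and `check_for_update`, the record layout, and the use of a SHA-256 digest for comparison are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def fingerprint(content: str) -> str:
    """Digest of the crawled content, ignoring whitespace-only differences."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def check_for_update(previous: Optional[dict], content: str,
                     now: Optional[datetime] = None) -> dict:
    """Compare a fresh crawl against the previous record (hypothetical layout).

    Unchanged content keeps its old date-modified stamp; new or changed
    content gets a fresh stamp and is marked as awaiting curator approval.
    """
    now = now or datetime.now(timezone.utc)
    digest = fingerprint(content)
    if previous is not None and previous["digest"] == digest:
        return dict(previous, changed=False)
    return {
        "digest": digest,
        "date_modified": now.isoformat(),
        "changed": True,
        "approved": False,  # assumption: a curator flips this after review
    }
```

A scheduler would call `check_for_update` once per crawl and pass the returned record back in as `previous` on the next run.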
Example of evaluation of a potential resource found in PMC archive
| RDP actions | Example result |
|---|---|
| Use pattern matching to find in full text | URL: |
| Use script to download the page and extract the page title | Name: NBase, Neisseria Meningitidis Online Database |
| Submit extracted text to public NIFSTD web service (tags indicate type of the annotated words) | Partial page content: NBase, a database of Neisseria genomes created by the Jordan Lab in the School of Biology, GIT, funded by Centers for Disease Control and Prevention to advance research into the genetic causes of virulence in N. meningitidis Neisseria meningitidis is a gram-negative encapsulated bacterium that is the leading cause of bacterial meningitis worldwide<br>Annotated content: ‘NBase A <span class="nifAnnotation" data-nif="Database, nlx_res_20090405, resource\|Databases, nlx_res_20090405, resource">database</span> of Neisseria <span class="nifAnnotation" data-nif="Genomics, nlx_inv_100629,\| genome,SO_0001026,\| Genomics,nlx_200905_433,"> genomes </span> created by the Jordan <span class="nifAnnotation" data-nif="Lab,652930, gene-invertebrates\|Lab, 100147702, gene-invertebrates\|lab,40817, gene-invertebrates">Lab</span> in the School of <span class="nifAnnotation" data-nif="Biology, nlx_inv_100612,">Biology</span>, GIT, <span class="nifAnnotation" data-nif="Funding, nlx_res_20090107,resource">funded</span> by <span class="nifAnnotation" data-nif="Centers for disease control and prevention, nlx_inv_1005036, institution\|disease, DOID_4,\|Disease, birnlex_11013,">Centers for Disease Control and </span>Prevention to advance research into the … meningitidis Neisseria meningitidis is a gram-negative encapsulated bacterium that is the <span class="nifAnnotation" data-nif="lead,, toxin\|Lead,,toxin"> leading </span> cause of <span class="nifAnnotation" data-nif="bacterial meningitis, DOID_9470, disease\| meningitis, DOID_9471, disease\|Meninges, nlx_anat_090204, anatomical_structure">bacterial meningitis</span> worldwide’ |
| Extract all nlx_res_ tags and count them | Possible type for the entire page: Databases=3 Data=2 Funding=1 |
The web service discussed above is linked to and documented in the developers' section of the NIF site: http://neuinfo.org/developers.
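The RDP actions in the table above (pattern matching for URLs, page-title extraction, and counting of nlx_res_ tags) can be sketched in Python as follows. This is a hedged illustration rather than the actual RDP scripts: the function names, the URL regular expression, and the counting rule (each distinct nlx_res_ identifier counted once per annotated span) are assumptions. The page HTML and the annotated text are assumed to have been retrieved already, so no call to the NIFSTD web service is shown.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed pattern: http(s) URLs up to whitespace, angle brackets or brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def find_urls(full_text):
    """Find candidate resource URLs in article text, trimming trailing punctuation."""
    return [url.rstrip(".,;:") for url in URL_PATTERN.findall(full_text)]


def extract_title(page_html):
    """Return the whitespace-normalized <title> of a downloaded page, or None."""
    match = TITLE_PATTERN.search(page_html)
    return " ".join(match.group(1).split()) if match else None


class _NifAnnotationParser(HTMLParser):
    """Collects nlx_res_ identifiers from <span class="nifAnnotation"> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag != "span" or attr_map.get("class") != "nifAnnotation":
            return
        ids_in_span = set()
        # data-nif holds "label, id, type" candidates separated by "|".
        for candidate in (attr_map.get("data-nif") or "").split("|"):
            fields = [field.strip() for field in candidate.split(",")]
            if len(fields) >= 2 and fields[1].startswith("nlx_res_"):
                ids_in_span.add(fields[1])
        self.counts.update(ids_in_span)


def count_resource_tags(annotated_html):
    """Count resource-type (nlx_res_) identifiers across all annotated spans."""
    parser = _NifAnnotationParser()
    parser.feed(annotated_html)
    return parser.counts
```

The most frequent identifier returned by `count_resource_tags` could then serve as the suggested resource type that a curator confirms or rejects.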
Figure 3. The NIF Registration Pipeline. The NIF registration pipeline starts at a wiki page for each resource (i); the example shown is the public wiki page for the ModelDB resource. Anyone can nominate a resource, and the curators will standardize the entry. The resource owner can change the description by clicking the edit button and adding information to the form, and can sign up to watch the page in order to be notified of any changes. When the description is adequate, the curators change the curation status to ‘curated’ and the ‘click here to generate sitemap’ link becomes visible. This link activates the DISCO system to generate a sitemap file using the text from the stable version of the resource in the NeuroLex wiki (ii). The event tracking system is activated, generating an email to the resource-provider tracking group in NIF, and instructions prompt the user to download the DISCO interop file (iii) and place it into the root directory of the resource. When this is complete, the DISCO dashboard updates and a new page is generated for the resource (iv). This page allows the curators or the resource owner to regenerate or edit the files that were created, schedule a crawl frequency and add further files that allow deeper interoperation with NIF, such as including data in the Data Federation.