Trevor Cohen, Kirk Roberts, Anupama E Gururaj, Xiaoling Chen, Saeid Pournejati, George Alter, William R Hersh, Dina Demner-Fushman, Lucila Ohno-Machado, Hua Xu.
Abstract
Database URL: https://biocaddie.org/benchmark-data
Year: 2017 PMID: 29220453 PMCID: PMC5737202 DOI: 10.1093/database/bax061
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Repositories harvested to generate the corpus of dataset metadata
| Repository (number of datasets) | Description |
|---|---|
| Arrayexpress (60 881) | ArrayExpress Archive of Functional Genomics Data stores data from high-throughput functional genomics experiments, and provides these data for reuse to the research community. |
| Bioproject (155 850) | A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. |
| The cancer imaging archive (63) | The Cancer Imaging Archive (TCIA) is a large archive of medical images of cancer accessible for public download. All images are stored in DICOM file format. The images are organized as ‘Collections’, typically patients related by a common disease (e.g. lung cancer), image modality (MRI, CT, etc) or research focus. |
| Clinicaltrials (192 500) | ClinicalTrials.gov is a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world. |
| Clinical trials network (46) | A repository of data from completed CTN clinical trials to be distributed to investigators in order to promote new research, encourage further analyses, and disseminate information to the community. Secondary analyses produced from data sharing multiply the scientific contribution of the original research. |
| Cardiovascular research Grid ( | The CardioVascular Research Grid (CVRG) project is creating an infrastructure for sharing cardiovascular data and data analysis tools. CVRG tools are developed using the Software as a Service model, allowing users to access tools through their browser, thus eliminating the need to install and maintain complex software. |
| Dataverse (60 303) | A Dataverse repository is the software installation, which then hosts multiple dataverses. Each dataverse contains datasets, and each dataset contains descriptive metadata and data files (including documentation and code that accompany the data). As an organizing method, dataverses may also contain other dataverses. |
| Dryad (67 455) | DataDryad.org is a curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable. |
| Gemma (2285) | Gemma is a web site, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles. Gemma contains data from thousands of public studies, referencing thousands of published papers. |
| Gene expression omnibus (105 033) | Gene Expression Omnibus is a public functional genomics data repository supporting MIAME-compliant submissions of array- and sequence-based data. Tools are provided to help users query and download experiments and curated gene expression profiles. |
| Mouse phenome database (235) | The Mouse Phenome Database (MPD) has characterizations of hundreds of strains of laboratory mice to facilitate translational discoveries and to assist in selection of strains for experimental studies. |
| Neuromorpho (34 082) | NeuroMorpho.Org is a centrally curated inventory of digitally reconstructed neurons associated with peer-reviewed publications. It contains contributions from over 80 laboratories worldwide and is continuously updated as new morphological reconstructions are collected, published, and shared. |
| Nuclear receptor signaling atlas (NURSA) (389) | The Nuclear Receptor Signaling Atlas (NURSA) was created to foster the development of a comprehensive understanding of the structure, function, and role in disease of nuclear receptors (NRs) and coregulators. NURSA seeks to elucidate the roles played by NRs and coregulators in metabolism and the development of metabolic disorders (including type 2 diabetes, obesity, osteoporosis, and lipid dysregulation), as well as in cardiovascular disease, oncology, regenerative medicine and the effects of environmental agents on their actions. |
| Openfmri ( | OpenfMRI.org is a project dedicated to the free and open sharing of functional magnetic resonance imaging (fMRI) datasets, including raw data. The focus of the database is on task fMRI data. |
| Peptideatlas (76) | PeptideAtlas is a multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments. Mass spectrometer output files are collected for human, mouse, yeast, and several other organisms, and searched using the latest search engines and protein sequences. |
| Phenodisco (dbGaP) (429) | Phendisco is derived from the database of Genotypes and Phenotypes (dbGaP), with additional metadata ( |
| Physiobank (70) | PhysioBank is a large and growing archive of well-characterized digital recordings of physiologic signals and related data for use by the biomedical research community. PhysioBank currently includes databases of multi-parameter cardiopulmonary, neural, and other biomedical signals from healthy subjects and patients with a variety of conditions with major public health implications, including sudden cardiac death, congestive heart failure, epilepsy, gait disorders, sleep apnea, and aging. |
| Protein data bank (113 493) | The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids found in all organisms including bacteria, yeast, plants, flies, other animals, and humans. |
| ProteomeXchange (1716) | The ProteomeXchange consortium has been set up to provide a single point of submission of MS proteomics data to the main existing proteomics repositories, and to encourage the data exchange between them for optimal data dissemination. |
| Yale protein expression database ( | The Yale Protein Expression Database (YPED) is an open source system for storage, retrieval, and integrated analysis of large amounts of data from high throughput proteomic technologies. YPED currently handles LCMS, MudPIT, ICAT, iTRAQ, SILAC, 2D Gel and DIGE, Label Free Quantitation (Progenesis), Label Free Quantitation (Skyline), MRM analysis and SWATH. This repository contains data sets which have been released for public viewing and downloading by the responsible Primary Investigators. |
| Total (794 992) |
The numbers in parentheses indicate the number of datasets in each repository when the corpus was constructed.
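As a quick consistency check on the table, the legible per-repository counts can be summed and compared with the reported total. The sketch below uses only the figures that appear in full above. The three repositories whose counts are cut off in this copy (Cardiovascular Research Grid, OpenfMRI, Yale Protein Expression Database) are left out rather than guessed, so the remainder is simply what the total implies for those three combined. The variable names are illustrative.

```python
# Counts exactly as they appear in the table; repositories whose counts are
# truncated in this copy are deliberately omitted rather than guessed.
listed_counts = {
    "ArrayExpress": 60_881,
    "BioProject": 155_850,
    "The Cancer Imaging Archive": 63,
    "ClinicalTrials": 192_500,
    "Clinical Trials Network": 46,
    "Dataverse": 60_303,
    "Dryad": 67_455,
    "Gemma": 2_285,
    "Gene Expression Omnibus": 105_033,
    "Mouse Phenome Database": 235,
    "NeuroMorpho": 34_082,
    "NURSA": 389,
    "PeptideAtlas": 76,
    "Phenodisco (dbGaP)": 429,
    "PhysioBank": 70,
    "Protein Data Bank": 113_493,
    "ProteomeXchange": 1_716,
}
reported_total = 794_992  # "Total" row of the table

listed_sum = sum(listed_counts.values())
# Datasets attributable to the repositories with truncated counts, combined.
unlisted_remainder = reported_total - listed_sum

print(f"sum of listed counts: {listed_sum}")
print(f"remainder for repositories with truncated counts: {unlisted_remainder}")
```

A small non-negative remainder suggests the legible counts are consistent with the reported total.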
Sample Dataset in XML format
Main fields are in boldface, and subfields are in italics.
Figure 1. Overview of construction of the reference standard.
Examples of queries, showing keyword and expanded forms
| Curator query | Constraint types | Keyword query | Expanded keyword query |
|---|---|---|---|
| ‘Find data on the NF-KB signaling pathway in MG (Myasthenia gravis) patients’ | Biological process Disease | NF KB signaling pathway Myasthenia gravis MG | NF KB signaling pathway Myasthenia gravis MG Immunoglobulin Enhancer Binding Protein Transcription Factor NF kB Ig EBP 1 Enhancer Binding Protein Immunoglobulin kappa B Enhancer Binding Protein kappaB Nuclear Factor kappa B Immunoglobulin Enhancer Binding Protein Factor Myasthenia Gravis Ocular Myasthenia Gravis Generalized Erb Goldflam disease Myasthenia gravis disorder MG Myasthenia gravis |
| ‘Search for all data types related to gene TP53INP1 in relation to p53 activation across all databases’. | Gene Biological process | TP53INP1 p53 activation | tumor protein p53 inducible nuclear protein 1 TP53INP1 Teap FLJ22139 DKFZp434M1317 SIP TP53INP1AP53DINP1 TP53INP1B Gene p53 TP53 LFS1 tumor protein p53 Li Fraumeni syndrome |
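The relationship between the keyword and expanded forms in the table can be illustrated with a minimal sketch: synonyms for recognized terms are appended to the keyword query. This is not the authors' implementation. The `expand_query` helper and the `toy_synonyms` dictionary are hypothetical, and the synonym entries are a small hand-picked subset of the terms shown in the table, standing in for whatever terminology resource was actually used.

```python
def expand_query(keyword_query, synonyms):
    """Append the synonyms of every known phrase that occurs in the query.

    `synonyms` maps a phrase to a list of alternative names. Matching is a
    simple case-insensitive substring test, which is an assumption made for
    this sketch only.
    """
    expanded = [keyword_query]
    lowered = keyword_query.lower()
    for phrase, alternatives in synonyms.items():
        if phrase.lower() in lowered:
            expanded.extend(alternatives)
    return " ".join(expanded)


# Hypothetical toy terminology, built from a few terms visible in the table.
toy_synonyms = {
    "Myasthenia gravis": ["Myasthenia Gravis Ocular", "Erb Goldflam disease"],
    "TP53INP1": ["tumor protein p53 inducible nuclear protein 1", "Teap", "SIP"],
}

expanded = expand_query("TP53INP1 p53 activation", toy_synonyms)
print(expanded)
```

Appending synonyms rather than replacing the original terms keeps the curator's wording in the query, so the expanded form can only broaden what is matched.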
Top 5 datasets retrieved in response to second query in Table 3 (DataMed, 6/10/17)
| Query: keywords | Repository | Dataset title |
|---|---|---|
| ‘Search for all data types related to gene TP53INP1 in relation to p53 activation across all databases’ | ArrayExpress | A large intergenic non-coding RNA induced by p53 mediates global gene repression in the p53 transcriptional response |
| | Uniprot:Swiss-prot | |
| | BioProject | |
| | Uniprot:Swiss-prot | T53I2_RAT (Tumor protein p53-inducible nuclear protein 2) |
| | Sequence Read Archive (SRA) | |
Information Retrieval systems for the initial pooling experiments
| System | Description and key algorithms. Algorithms deployed to generate the pooled results for the reference standard appear in |
|---|---|
| Apache Lucene ( | Underlies the ElasticSearch implementation used for the bioCADDIE prototype. |
| Indri ( | |
| Terrier ( | |
| Semantic Vectors ( | Extends Apache Lucene. Implicit query expansion via term similarity - Random Indexing ( |
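Pooling, in the general information-retrieval sense, means merging the top-ranked results of several systems into one set of candidates for relevance judgment. The sketch below shows that general idea only. The pool depth, the dataset identifiers, and the `pool_results` helper are hypothetical, since this excerpt does not state how the pooled results were actually assembled.

```python
def pool_results(runs, depth):
    """Return the union of the top-`depth` dataset IDs from each ranked list."""
    pool = set()
    for ranked_ids in runs.values():
        pool.update(ranked_ids[:depth])
    return pool


# Hypothetical ranked outputs for one query, keyed by system name.
runs = {
    "lucene": ["d1", "d2", "d3", "d4"],
    "indri": ["d2", "d5", "d1", "d6"],
    "terrier": ["d7", "d1", "d2", "d8"],
}

pool = pool_results(runs, depth=2)
print(sorted(pool))
```

Because the pool is a set, a dataset retrieved by several systems is judged once, which is what makes pooling cheaper than judging every run separately.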
Figure 2. Venn diagram showing the overlap in ‘definitely relevant’ results across systems with (right) and without (left) terminology-based query expansion. Produced using Venny (36).
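The quantities a Venn diagram of this kind displays are set intersections over each system's relevant results. A minimal sketch with hypothetical dataset identifiers and a hypothetical `pairwise_overlap` helper follows. It does not reproduce the figure's data.

```python
from itertools import combinations


def pairwise_overlap(relevant_by_system):
    """Map each pair of system names to the relevant results they share."""
    overlaps = {}
    for a, b in combinations(sorted(relevant_by_system), 2):
        overlaps[(a, b)] = relevant_by_system[a] & relevant_by_system[b]
    return overlaps


# Hypothetical 'definitely relevant' dataset IDs retrieved by each system.
relevant = {
    "lucene": {"d1", "d2", "d3"},
    "indri": {"d2", "d3", "d4"},
    "terrier": {"d3", "d5"},
}

overlaps = pairwise_overlap(relevant)
shared_by_all = set.intersection(*relevant.values())

for pair, shared in overlaps.items():
    print(pair, sorted(shared))
print("shared by all systems:", sorted(shared_by_all))
```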