María Pérez, Rafael Berlanga, Ismael Sanz, María José Aramburu.
Abstract
BACKGROUND: Open metadata registries are a fundamental tool for researchers in the Life Sciences trying to locate resources. While most current registries assume that resources are annotated with well-structured metadata, evidence shows that most resource annotations consist simply of informal free text. This reality must be taken into account in order to develop effective techniques for resource discovery in the Life Sciences.
Year: 2013 PMID: 23635042 PMCID: PMC3698192 DOI: 10.1186/2041-1480-4-12
Source DB: PubMed Journal: J Biomed Semantics
Comparative table of registries of web resources in Life Sciences
| Registry | Query type | Search criteria | Resource annotation | Composition |
| Feta | Keywords from ontology | Input, output, operation type, task | Manually | N/A |
| BioMoby | Keywords | Resource type, I/O | Resource type, object type | N/A |
| EMBRACE | Keywords | String matching | Syntactically annotated with BioXSD | N/A |
| BioCatalogue | Keywords | String matching, categories, filters | Categories, some tags | N/A |
| SSWAP | Keywords, Resource Query Graph | RDF | Third-party ontologies, reasoning | N/A |
| Magallanes | Keywords | String matching in data type, resource type | N/A | Yes |
| myExperiment | Keywords | String matching, filters | Tags | N/A |
| Taverna | Keywords | String matching, ontology concepts | BioMoby metadata | Workflow composition |
| SADI | SPARQL | RDF | Third-party ontologies | Yes |
This comparative table presents the main characteristics of the most popular registries in Life Sciences.
Figure 1. Overview of the proposed approach. The approach is divided into three phases: requirements specification, normalization and web resource discovery.
Figure 2. Requirements model. This figure shows the requirements model defined by a user who wants to compare specific genes in different organisms.
Figure 3. Information of a normalized task. This figure shows the information of the user-defined tasks once they have been normalized. For each task, it shows the facet values, the semantic annotations and the selected resource.
Results for the case study
| Task | Input / Output | Selected resource | Rank |
| Retrieve gene sequence | Input: gene; Output: sequence | getColiCardIDs_by_InteractingPartnersResource | 1 |
| Search similar sequences | Input: sequence; Output: blast report | Database of Protein Subcellular Localization Resource | 1 |
| Predict gene structure | Input: gene; Output: gene model | GlimmerResource | 1 |
| Align protein sequences | Input: protein sequence; Output: sequence alignment | T-Coffee | 1 |
| Build phylogenetic trees | Input: sequence alignment; Output: phylogenetic tree | INB:runCreateTreeFromClustalw | 5 |
| Analyze domains | Input: protein sequence | INB:inb.bsc.es:parseRulesFromMotif | 1 |
This table shows the selected resources for each user-defined task.
Example of extraction patterns for facets
| Facet | Pattern | Trigger terms |
| Input | Inputs? | (is | are) |
| Input | - | Given, Taking |
| Input | (((of)? E)+ | it) | gets |
| Output | Outputs? | (is | are) |
| Output | (((of)? E)+ | it) | Constructs, Finds, Retrieves, Calculates, Contains, Produces, Extracts, Returns |
| Method | (((of)? E)+ | it) | Maps, Executes, Performs, Implements, Applies, Applying, Runs, Running, Based on, Computes, Carries out, Processes |
This table shows some of the extraction patterns specified for the input, output and method facets.
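Patterns of this kind translate naturally into regular expressions. Below is a minimal sketch, not the paper's implementation: a few of the table's trigger terms rendered as Python regexes, with the entity E approximated as the remainder of the clause (a real system would first run entity recognition and then normalize the captured text).

```python
import re

# Hedged sketch: a few extraction patterns from the table as regexes.
# "Outputs? (is|are)" -> the facet value follows the verb.
# Trigger verbs such as "Returns" or "Implements" precede the value.
PATTERNS = {
    "output": [
        re.compile(r"\boutputs?\s+(?:is|are)\s+(?P<e>[^.;]+)", re.I),
        re.compile(r"\breturns\s+(?P<e>[^.;]+)", re.I),
    ],
    "method": [
        re.compile(r"\b(?:performs|implements|computes)\s+(?P<e>[^.;]+)", re.I),
    ],
}

def extract_facets(description: str) -> dict:
    """Map each facet to the candidate values found in a free-text description."""
    found = {}
    for facet, pats in PATTERNS.items():
        hits = [m.group("e").strip() for pat in pats for m in pat.finditer(description)]
        if hits:
            found[facet] = hits
    return found

print(extract_facets("This service implements BLAST and returns a similarity report."))
```

The matching stays deliberately shallow; in the paper's approach the extracted text is then annotated against the semantic sources (UMLS, myGRID, EDAM) rather than kept as raw strings.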
BioUSeR evaluation
| Query topic | P@5 | P@10 | P@20 | Precision | Recall | F-measure |
| Search proteins with a functional domain | 0.79 | 0.76 | 0.73 | 0.45 | 0.63 | 0.53 |
| Search similar sequences | 0.91 | 0.85 | 0.77 | 0.22 | 0.72 | 0.33 |
| Analyze transgenic model organism | 0.93 | 0.94 | 0.89 | 0.58 | 0.89 | 0.7 |
| Find genes with functional relationships | 0.78 | 0.75 | 0.66 | 0.34 | 0.39 | 0.36 |
| Predict structure | 0.91 | 0.91 | 0.85 | 0.47 | 0.29 | 0.36 |
| Analyze phylogeny | 0.8 | 0.8 | 0.79 | 0.52 | 0.36 | 0.43 |
| Align sequences | 0.73 | 0.76 | 0.74 | 0.62 | 0.3 | 0.41 |
This table shows the precision, recall and F-measure of the results obtained for the queries of the query pool, associated with the 7 base tasks of the gold standard (GS). It also includes the precision for the top-5, top-10 and top-20 results.
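The F-measure reported here is the standard harmonic mean of precision and recall; for example, the "Analyze transgenic model organism" row gives F = 2·0.58·0.89/(0.58+0.89) ≈ 0.7. A quick check:

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reproduce the "Analyze transgenic model organism" row: P = 0.58, R = 0.89
print(round(f_measure(0.58, 0.89), 2))  # matches the table's 0.7
```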
Facets gold standard
| Facet | Tags in GS | Tagged resources |
| Input | 52 | 48 |
| Output | 47 | 48 |
| Method | 135 | 434 |
| Disease | 7 | 5 |
| Species | 27 | 61 |
This table shows, for each facet, the number of BioCatalogue tags in the facets gold standard (GS) and the number of resources that are tagged with them.
Facets extraction evaluation
| Facet | Extracted concepts | Annotated resources | Precision | Recall | F-measure |
| Input | 259 | 399 | 0.69 | 0.93 | 0.73 |
| Output | 266 | 274 | 0.6 | 0.94 | 0.64 |
| Method | 136 | 210 | 0.44 | 0.53 | 0.35 |
| Disease | 142 | 144 | - | - | - |
| Species | 292 | 287 | - | - | - |
This table shows the number of concepts that our approach automatically extracted for each facet and the number of resources annotated with those concepts. Moreover, for the input, output and method facets, the precision, recall and F-measure are shown. These measures were calculated for the automatically extracted information with respect to the GS. The disease and species facets were not evaluated because of their poor representation in the GS.
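These measures can be computed set-wise: the automatically extracted annotations of a resource are compared against its GS annotations, counting shared items as true positives. A minimal sketch (the sample tag sets below are invented for illustration, not taken from the GS):

```python
def precision_recall(extracted: set, gold: set) -> tuple:
    """Set-based precision/recall of extracted annotations against a gold standard."""
    if not extracted or not gold:
        return 0.0, 0.0
    tp = len(extracted & gold)  # true positives: annotations found in both sets
    return tp / len(extracted), tp / len(gold)

# Toy example with invented tags:
p, r = precision_recall({"gene", "sequence", "blast"},
                        {"gene", "sequence", "alignment", "tree"})
print(p, r)  # 2/3 precision, 2/4 recall
```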
BioCatalogue keyword search evaluation
| Query topic | P@5 | P@10 | P@20 | Precision | Recall | F-measure | Edition cost | Avg. keywords |
| Search proteins with a functional domain | 0.4 | 0.41 | 0.41 | 0.41 | 0.02 | 0.04 | 2.45 | 3.4 |
| Search similar sequences | 0.4 | 0.4 | 0.4 | 0.36 | 0.07 | 0.12 | 2.87 | 3.8 |
| Analyze transgenic model organism | 0.74 | 0.71 | 0.71 | 0.71 | 0.17 | 0.27 | 3.25 | 2.94 |
| Find genes with functional relationships | 0.27 | 0.26 | 0.27 | 0.26 | 0.04 | 0.07 | 3.15 | 2.13 |
| Predict structure | 0.67 | 0.66 | 0.65 | 0.64 | 0.04 | 0.07 | 3.27 | 2.93 |
| Analyze phylogeny | 0.18 | 0.2 | 0.2 | 0.18 | 0.01 | 0.02 | 2.8 | 2.56 |
| Align sequences | 0.72 | 0.72 | 0.75 | 0.69 | 0.07 | 0.13 | 2.48 | 4.16 |
This table shows the evaluation of the keyword-based search of BioCatalogue. The evaluation was made using the queries in our GS, and precision, recall and F-measure were calculated for each topic. The edition cost is the cost of translating the requirements into keyword queries, measured as the number of failed queries executed before getting results. Finally, the keywords column gives the average number of keywords in the successful queries.
BioCatalogue navigational search evaluation
| Query topic | P@5 | P@10 | P@20 | Precision | Recall | F-measure | Depth/siblings |
| Search proteins with a functional domain | 0.92 | 0.92 | 0.82 | 0.75 | 0.15 | 0.25 | 2.67/3.3 |
| Search similar sequences | 1 | 1 | 1 | 1 | 0.3 | 0.46 | 2.0/4.25 |
| Analyze transgenic model organism | 0.8 | 0.9 | 0.95 | 0.94 | 0.4 | 0.56 | 0.03/10.77 |
| Find genes with functional relationships | 0.91 | 0.95 | 0.89 | 0.89 | 0.26 | 0.4 | 1.0/3.0 |
| Predict structure | 0.87 | 0.93 | 0.96 | 0.9 | 0.1 | 0.18 | 2.29/3.42 |
| Analyze phylogeny | 0.8 | 0.88 | 0.88 | 0.88 | 0.03 | 0.06 | 0.0/11.0 |
| Align sequences | 0.98 | 0.99 | 0.99 | 0.99 | 0.06 | 0.11 | 2.94/2.1 |
This table shows the evaluation of the BioCatalogue navigational search based on categories. The evaluation was made using the queries in our GS, and the average precision, recall and F-measure were calculated for each topic. The edition cost of a query is represented by the depth of the selected category in the category taxonomy and the number of siblings of that category (depth/siblings).
Concept reference formats
| Source | Reference format | Description |
| UMLS | UMLS:C<number>:STypes | STypes are the semantic types associated with UMLS concepts (e.g. Disease, Protein, etc.) |
| Wikipedia | Wiki:W<number>:Categs | Categs are the categories associated with the page entry of the referred concept. |
| myGRID | myGR:D<number>: | These concepts are extracted from the myGRID ontologies. |
| EDAM | EDAM_<number>: | These concepts are extracted from the EDAM ontology. |
This table shows the concept reference formats used for the different semantic sources. The generic format for a reference is Source:ConceptID:SemanticTags.
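A reference in the generic Source:ConceptID:SemanticTags layout splits on its first two colons. A minimal parser sketch (the field names and the sample reference are illustrative, not from the paper):

```python
def parse_reference(ref: str) -> dict:
    """Split a concept reference of the form Source:ConceptID:SemanticTags.
    Field names are illustrative; the paper only fixes the overall layout."""
    parts = ref.split(":", 2)
    return {
        "source": parts[0],
        "concept_id": parts[1] if len(parts) > 1 else "",
        # Sources such as myGRID and EDAM carry no semantic tags.
        "semantic_tags": parts[2].split(",") if len(parts) > 2 and parts[2] else [],
    }

# Example: a UMLS reference carrying one semantic type (sample ID for illustration)
print(parse_reference("UMLS:C0012634:Disease"))
```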
Figure 4. Semantic annotation of a task and a resource description. This figure shows the semantic annotation and the corresponding semantic vector of the task “build phylogenetic trees” and a fragment of the description of the resource “Blast”. We have used the IeXML notation [33] to show the generated annotations.