Jinbo Chen, Uwe Scholz, Ruonan Zhou, Matthias Lange.
Abstract
Full text search is a widely applied query interface for accessing and filtering the content of life-science databases. But its high flexibility and intuitiveness are paid for with potentially imprecise and incomplete query results. To reduce this drawback, query assistance systems suggest those combinations of keywords with the highest potential to match most of the relevant data records. Widespread approaches are syntactic query corrections that avoid misspellings and support the expansion of words by suffixes and prefixes. Synonym expansion approaches apply thesauri, ontologies, and query logs. All of these need laborious curation and maintenance. Furthermore, access to query logs is in general restricted. Approaches that infer related queries from a user's query profile, such as research field, geographic location, co-authorship, or affiliation, require user registration and publicly accessible profiles, which conflicts with privacy concerns. To overcome these drawbacks, we implemented LAILAPS-QSM, a machine learning approach that reconstructs possible linguistic contexts of a given keyword query. The context is inferred from the text records stored in the databases that are going to be queried or, for general-purpose query suggestion, from PubMed abstracts and UniProt data. The supplied tool suite enables the pre-processing of these text records and the subsequent computation of customized distributed word vectors. The latter are used to suggest alternative keyword queries. The query suggestion quality was evaluated for plant science use cases. Locally present experts enabled a cost-efficient quality assessment in the categories trait, biological entity, taxonomy, affiliation, and metabolic function, which was performed using ontology term similarities. The mean information content similarity of LAILAPS-QSM for 15 representative queries is 0.70, and 34% have a score above 0.80.
In comparison, the information content similarity for query suggestions made by human experts is 0.90. The software is available either as a tool set to build and train dedicated query suggestion services or as an already trained general-purpose RESTful web service. The service uses open interfaces so that it can be seamlessly embedded into database frontends. The Java implementation uses highly optimized data structures and streamlined code to provide fast and scalable responses to web service calls. The source code of LAILAPS-QSM is available under the GNU General Public License version 2 in a Bitbucket Git repository: https://bitbucket.org/ipk_bit_team/bioescorte-suggestion.
Year: 2018 PMID: 29529024 PMCID: PMC5871001 DOI: 10.1371/journal.pcbi.1006058
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Example of word vector representations computed by a feedforward neural network.
The word vector representations estimate the influence of a word in the context of the semantic relationship expressed by the particular word vector. The matrix has 2 × 4 entries: a vocabulary size of 4 and a vector dimension (the number of expected relationships) of 2. The word vector w1 could represent the relationship “yield” and w2 the relationship “lipid source”.
| word | w1 | w2 |
|---|---|---|
| produce | 0.83 | 0.52 |
| yield | 0.99 | 0.34 |
| vegetable | 0.62 | 0.86 |
| oil | 0.41 | 0.92 |
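
To illustrate how such vectors could be turned into keyword suggestions, the following minimal sketch ranks the four example words by cosine similarity, using the two vector components listed for each word. Cosine similarity is an assumption made for illustration; the text here does not state which similarity measure LAILAPS-QSM applies. The names `WORD_VECTORS`, `cosine_similarity`, and `suggest` are hypothetical and not part of the tool suite.

```python
import math

# Example word vectors copied from the table (two components per word).
WORD_VECTORS = {
    "produce": (0.83, 0.52),
    "yield": (0.99, 0.34),
    "vegetable": (0.62, 0.86),
    "oil": (0.41, 0.92),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def suggest(word, vectors=WORD_VECTORS, top_n=2):
    """Return the top_n other words ranked by similarity to `word`.

    Assumption: closeness in vector space is used as a proxy for
    relatedness of keywords; this is an illustrative choice only.
    """
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    for keyword in ("yield", "oil"):
        print(keyword, "->", [w for w, _ in suggest(keyword)])
```

With these example values, "produce" ranks closest to "yield" and "vegetable" ranks closest to "oil", which matches the intuition behind the two illustrative relationships.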
Test set for query suggestion benchmarking. Based on query log analysis, five major classes of query can be identified: trait, biological entity, taxonomy, metabolic function.
For each class, the most frequent query objectives (second column) were randomly selected from the query log.
| query class | subquery class | query |
|---|---|---|
| trait | stress response | salt stress |
| trait | stress response | drought tolerance |
| trait | agronomic traits | grain yield |
| trait | phenotypic traits | male sterility |
| biological entity | protein name/id | alcohol dehydrogenase |
| biological entity | substance name/id | dextrins |
| taxonomy | species name | wheat |
| taxonomy | subspecies name | zea mays |
| taxonomy | cultivar name | oryza glaberrima |
| metabolic function | catalytic process | sucrose synthase |
| metabolic function | transport process | sucrose transporter |
| metabolic function | primary metabolism | photosynthesis |
| metabolic function | secondary metabolism | terpene synthase |
| metabolic function | metabolic diseases | leaf rust |
| metabolic function | metabolic engineering | acetolactate synthase |