Maria Esch, Jinbo Chen, Christian Colmsee, Matthias Klapperstück, Eva Grafahrend-Belau, Uwe Scholz, Matthias Lange.
Abstract
With the number of sequenced plant genomes growing, the number of predicted genes and functional annotations is also increasing. The association between genes and phenotypic traits is currently of great interest. Unfortunately, the information available today is widely scattered over a number of different databases. Information retrieval (IR) has become an all-encompassing bioinformatics methodology for extracting knowledge from complex, heterogeneous and distributed databases, and therefore can be a useful tool for obtaining a comprehensive view of plant genomics, from genes to traits. Here we describe LAILAPS (http://lailaps.ipk-gatersleben.de), an IR system designed to link plant genomic data in the context of phenotypic attributes to support detailed forward genetic research. LAILAPS comprises around 65 million indexed documents, encompassing >13 major life science databases with around 80 million links to plant genomic resources. The LAILAPS search engine allows fuzzy querying for candidate genes linked to specific traits over a loosely integrated system of indexed and interlinked genome databases. Query assistance and an evidence-based annotation system enable time-efficient and comprehensive information retrieval. An artificial neural network incorporating user feedback and behavior tracking allows relevance sorting of results. We fully describe LAILAPS's functionality and capabilities by comparing this system's performance with other widely used systems and by reporting both a validation in maize and a knowledge discovery use-case focusing on candidate genes in barley.
Keywords: Functional gene annotation; Information retrieval; Integrative search engine; Plant genomics resources; Traits
Year: 2014 PMID: 25480116 PMCID: PMC4301746 DOI: 10.1093/pcp/pcu185
Source DB: PubMed Journal: Plant Cell Physiol ISSN: 0032-0781 Impact factor: 4.927
Fig. 1. Screenshots of the LAILAPS web interface: (A) illustrates the search and results page, where the user logs in (1) and starts the search with a request (2). Spelling correction and an estimation of the expected results are shown. All hits, including a short excerpt of relevant text positions (3) and a list of annotation links to related information (4), are provided. Links can be direct (green) or indirect (red). Special filter options such as data sources or synonyms are located on the left side of the result page (5). All results can be downloaded as a Microsoft Excel sheet (6). Document links open a new tab (B) showing the database entry and a rating system (7) where the user can validate the obtained results.
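
The spelling correction mentioned in the caption can be illustrated with a minimal Python sketch. It is not the LAILAPS implementation: the vocabulary, the `correct_query` helper and the similarity cutoff are assumptions chosen for illustration, and the matching relies on the standard library's `difflib`.

```python
import difflib

# Hypothetical vocabulary; a real system would derive it from its index.
VOCABULARY = ["barley", "maize", "yield", "salt", "stress", "photosynthesis"]

def correct_query(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each query term with its closest vocabulary term, if any.

    Assumption: terms are whitespace-separated and compared in lowercase.
    Terms without a sufficiently similar match are kept unchanged.
    """
    corrected = []
    for term in query.lower().split():
        matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else term)
    return " ".join(corrected)

suggestion = correct_query("barly salt stres")
```

A gene symbol such as `WUS` has no close neighbor in this toy vocabulary, so it passes through unchanged apart from lowercasing.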
Number of data entries for indexed and linked databases
| Indexed database | Records | Linked database | Linked IDs |
|---|---|---|---|
| garlic shallot core collection | 176 | BMRF4Arabidopsis | 110,788 |
| gene_ontology | 40,730 | BMRF4Glycinemax | 637,480 |
| genebank information system of the ipk gatersleben | 146,420 | BMRF4Medicagotruncatula | 418,129 |
| gramene taxonomy ontology | 58,585 | BMRF4Oryzasativa | 907,405 |
| ncbi taxonomy | 1,139,973 | BARLEX | 75,257 |
| pdb | 98,894 | BioModels | 34,597 |
| pfam | 14,831 | CR-EST | 2,066,967 |
| plant_ontology | 1,691 | EnsemblPlants | 8,549,929 |
| taxonomic allium reference collection | 3,871 | PlantsDB | 131,928 |
| trait ontology | 1,327 | PolapgenDB | 10 |
| uniprot_sprot | 542,782 | PubMed | 64,766,473 |
| uniprot_trembl | 54,247,468 | ensembl | 4,198,205 |
| gnpis | 1,007,056 | | |
| Metacrop (link to conversions) | 981 | | |
| Metacrop (link to substance) | 253 | | |
| Optimas-DW | 916,911 | | |
Bargsten et al. (2014).
http://plants.ensembl.org/.
Steinbach et al. (2013).
Summarized features of common plant search portals
| Portal | Age of data | Data range | Ranking | Pertinence | Data cart | Query assistance | Interactive result filtering | Linking related data |
|---|---|---|---|---|---|---|---|---|
| LAILAPS | Updated quarterly | Resources and annotations | Neural network | User Login and personalized search | Result download as Excel sheet | Query correction and suggestion | Filtering by data source | Linking to related data |
| EB-eye | Automatically updates and re-indexes data on a daily basis | Data resources hosted at the EMBL-EBI | Sort order is based on the proximity of the terms in the entries | Not provided | Possible via EMBL-EBI resource websites | Apache Lucene query syntax allowing query refinement through adding additional terms to the query | Result filter for sources | Explore related information |
| | Real-time upload | Entire web | Feature rank-based | Personalized search | Not provided | Query correction and suggestion | Filtering by many different criteria | Not provided |
| UniProt/UniProt Beta | Updated and distributed every 4 weeks | UniProtKB, UniRef, UniParc, Supporting data | Sort by score (descending) | UniProt Beta provides a basket system | Data download possible in different formats | ‘Did you mean’ function for small spelling errors in UniProt | Filtering by source | Cross-references |
| NCBI Entrez/GQuery (PubMed) | Depending on NCBI services (updated when new publications available) | All NCBI data | Sorting by relevance is possible | Login to NCBI provided | Different download formats provided | Automatic correction and query suggestion | Interactive result filtering is provided | Related citations in PubMed |
| Ensembl Plants | On demand | Genome information of different plants | Not provided | Personal configurations via login | Download of different file formats possible | Not provided | Filtering by species | External references |
| DBGET Search (Kegg) | Suited for maintaining large daily updated databases | Major databases: GenBank, EMBL, SWISS-PROT, PDB, PROSITE, EPD, PIR, PRF, KEGG Genes | Not provided | Not provided | Download RDF | Not provided | Not provided | Links to other DBs like UniProt, MIPS and more |
| IntegromeDB | Re-loaded on a quarterly basis | | Relevance score | Not provided | Result download as CSV and RDF | Provides related suggestions | Not provided | Relations to query, synonyms and related information |
| AmiGO 2 | Built at regular intervals | Gene Ontology (GO) data | Alphabetical sorting | Not provided | Result table downloadable as txt file | Query suggestion | Interactive filtering by different criteria | Link to related internal data |
| MIPS PlantsDB | Updated regularly, if new data available | Hosting of databases for different plant species | Not provided | Not provided | Genetic element download possible | No assistance provided | Automatically filtered by organism. No interactive filtering provided. | References provided |
List of 20 traits and their expression as keyword queries
| Query class | Subquery class | Query |
|---|---|---|
| Trait | Stress response | Salt stress |
| Trait | Agronomic traits | Yield |
| Trait | Morphological/phenotypic traits | Ear emergence |
| Trait | Stress response | Barley salt stress |
| Trait | Agronomic traits | Barley yield |
| Biological entity | Protein name/ID | WUS protein |
| Biological entity | Gene name/ID | WUS |
| Biological entity | Gene name/ID | WUS Arabidopsis |
| Taxonomy | Cultivar name | Barley Morex |
| Taxonomy | Geography | Barley fertile crescent |
| Taxonomy | Subspecies name | Hordeum vulgare spontaneum seed |
| Affiliation | Institute name | MIPS muenchen |
| Affiliation | Institute name | Barley IPK |
| Metabolic function | Catalytic process | Sucrose synthase |
| Metabolic function | Primary metabolism | Photosynthesis barley leaf |
| Metabolic function | Metabolic engineering | Rice phytoene synthase |
| Metabolic function | Secondary metabolism | GABA barley |
| Regulatory function | Regulation of enzyme activity | Regulation of starch synthase activity |
| Regulatory function | Regulation of process | WUS regulation |
| Regulatory function | Regulation of process | WUS meristem |
Fig. 2. Comparison between the TF–IDF-based relevance ranking of Apache Lucene’s information retrieval API (left side) and LAILAPS neural network-based relevance prediction (right side). There are five classes of document relevance ranging from ‘no relevance’ to ‘fully agree’. A biological expert evaluated the relevance of a document. The boxplots show an improved ranking that separates relevant from non-relevant results using LAILAPS compared with Apache Lucene’s API.
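
For readers unfamiliar with the TF–IDF baseline named in the caption, the following Python sketch scores documents with the textbook formulation tf × log(N/df). It is an illustration only: Lucene's actual scoring adds normalization and boosting terms that are not modeled here, and the toy documents and the `tfidf_scores` function are assumptions, not LAILAPS data or code.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document against the query with textbook TF-IDF.

    Assumptions: whitespace tokenization, non-empty documents, and
    idf = log(N / df). This is not Lucene's exact formula.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term]:
                score += (tf[term] / len(tokens)) * math.log(n_docs / df[term])
        scores.append(score)
    return scores

# Hypothetical toy corpus for illustration.
docs = [
    "salt stress response in barley roots",
    "barley yield under drought",
    "photosynthesis in rice leaf",
]
scores = tfidf_scores("salt stress", docs)
# Document indices ordered from highest to lowest score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

A score of this kind depends only on term statistics, which is why a ranking learned from expert relevance judgments can separate relevant from non-relevant hits more cleanly.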
Fig. 3. Overview of the LAILAPS architecture and workflow. On the client side (web browser), the user makes a request, which is sent to the server. Information resources and annotations are stored using different backend systems. Processing and search modules are used to find documents that are related to a request. All results are received by the client and can be investigated on the web browser or downloaded for later investigation.
Neural network features and feature descriptions
| Feature | Description |
|---|---|
| Attribute | Attribute in which the query term was found |
| Database | Database in which the database entry is included |
| Frequency | Frequency of all query terms in the database entry and attribute |
| Co-occurrence | Closeness and order of the query terms within the document |
| Keyword | Provides information regarding whether good or bad keywords are present near the query terms |
| Organism | Organism the database entry relates to |
| Sequence length | Length of the sequence described by the database entry |
| Text position | Portion of the attribute that is covered by the query term |
| Synonym | Provides information regarding whether the hit was produced by an automatic synonym expansion |
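
To make the feature table concrete, the sketch below shows how the nine features might be encoded as a numeric vector and passed through a small feed-forward network to obtain a relevance score. Everything beyond the feature names is an assumption made for illustration: the numeric encoding, the hidden-layer size and the random weights are hypothetical, since the source does not specify the LAILAPS network architecture or its trained parameters.

```python
import math
import random

# Feature names taken from the table; the order is an arbitrary choice.
FEATURES = ["attribute", "database", "frequency", "co_occurrence", "keyword",
            "organism", "sequence_length", "text_position", "synonym"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """One hidden layer with sigmoid activations; returns a score in (0, 1)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

# Hypothetical untrained weights; a real ranker would learn them from
# user feedback and expert ratings.
rng = random.Random(0)
n_hidden = 4
w_hidden = [[rng.uniform(-1, 1) for _ in FEATURES] for _ in range(n_hidden)]
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]

# Hypothetical hit with every feature already scaled to [0, 1].
hit = {"attribute": 1.0, "database": 0.5, "frequency": 0.3,
       "co_occurrence": 0.8, "keyword": 1.0, "organism": 1.0,
       "sequence_length": 0.2, "text_position": 0.4, "synonym": 0.0}
x = [hit[name] for name in FEATURES]
score = forward(x, w_hidden, w_out)
```

With trained weights, sorting hits by `score` in descending order would give a relevance ranking of the kind compared against TF–IDF in Fig. 2.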