Michael J Gilchrist, Mikkel B Christensen, Richard Harland, Nicolas Pollet, James C Smith, Naoto Ueno, Nancy Papalopulu.
Abstract
BACKGROUND: Non-sequence gene data (images, literature, etc.) can be found in many different public databases. Access to these data is mostly by text-based methods using gene names; however, gene annotation is neither complete nor fully systematic between organisms, and is generally not stable over time. This poses challenges for text-based access, especially for cross-species searches. We propose a method for non-sequence data retrieval based on sequence similarity, which removes dependence on annotation and text searches. This work was motivated by the need to provide better access to large numbers of in situ images, and by the observation that such image data were usually associated with a specific gene sequence. Sequence similarity searches are found in existing gene-oriented databases, but mostly give indirect access to non-sequence data via navigational links.
Year: 2008 PMID: 18928517 PMCID: PMC2587480 DOI: 10.1186/1471-2105-9-442
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Analysis of access methods used by other image data providers
| Database | Image search tool | Search methods |
|---|---|---|
| FlyBase | ImageBrowse/Fly Express | gene name, anatomy, or development stage |
| Allen Brain Atlas | | gene name, accession numbers and other IDs, anatomy, or markers |
| EMAP | EMAGE | gene, anatomy or development stage |
| MGI | | gene, anatomical structure, developmental stage, GO terms, assay type |
| 4DXpress | | gene names, pre-computed orthologs, ontologies |
| Xenbase | | found on Gene pages |
| UCSC | VisiGene | gene name or key word |
| NIBB | WISH Photo Browser | development stage, view or clone name |
| WormBase | Expression Pattern Search | cell, cell group, or life stage |
| ANISEED | Expression Search Tools | development stage, or molecule ID |
| ZFIN | Search for Gene Expression Data | gene name, anatomy, or development stage, and other more specific terms, indirect via BLAST |
Summary of search methods used by available public image databases for accessing images, found at the time of writing. Data gathered by visiting each database and reading associated publications.
Figure 1. Generic application logic used in indirect sequence similarity search for gene data. (1) The user pastes a gene sequence into the browser window and sends it to the search engine; (2) the gene sequence is searched with BLAST against the database of sequences associated with the gene data; (3) IDs of matching sequences are returned to the search engine; (4) the matching sequence IDs are used to query the local managing database for available gene data; (5) a list of matching gene data and descriptive text is returned to the search engine; (6) an HTML-formatted page containing the retrieved gene data and descriptive text is returned to the user's browser.
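Steps 3 to 6 of this logic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the BLAST step has already been run with tabular output (BLAST+ `-outfmt 6`, default columns), and the table name `gene_data`, the function names, the E-value cutoff and the sample rows are all hypothetical.

```python
import html
import sqlite3

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore")

def parse_hits(tabular_text, max_evalue=1e-10):
    """Step 3: matching sequence IDs, best bit score first, no duplicates.
    The E-value cutoff is an illustrative assumption."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(OUTFMT6_FIELDS, line.split("\t")))
        if float(row["evalue"]) <= max_evalue:
            hits.append((float(row["bitscore"]), row["sseqid"]))
    hits.sort(key=lambda hit: -hit[0])
    ordered = []
    for _, seq_id in hits:
        if seq_id not in ordered:
            ordered.append(seq_id)
    return ordered

def fetch_gene_data(conn, seq_ids):
    """Steps 4-5: query the managing database (hypothetical schema)."""
    if not seq_ids:
        return []
    placeholders = ",".join("?" * len(seq_ids))
    rows = conn.execute(
        "SELECT seq_id, item, description FROM gene_data "
        f"WHERE seq_id IN ({placeholders})", seq_ids).fetchall()
    rank = {seq_id: i for i, seq_id in enumerate(seq_ids)}
    return sorted(rows, key=lambda row: rank[row[0]])  # keep BLAST ranking

def render_page(rows):
    """Step 6: a bare HTML page listing the retrieved data."""
    items = "\n".join(
        f"<li>{html.escape(s)}: {html.escape(i)} ({html.escape(d)})</li>"
        for s, i, d in rows)
    return f"<html><body><ul>\n{items}\n</ul></body></html>"

# Hypothetical data standing in for the managing database and a BLAST result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_data (seq_id TEXT, item TEXT, description TEXT)")
conn.executemany("INSERT INTO gene_data VALUES (?, ?, ?)", [
    ("seqA", "imgA1.jpg", "hypothetical image, stage 12"),
    ("seqB", "imgB1.jpg", "hypothetical image, stage 14"),
])
blast_output = "\n".join("\t".join(fields) for fields in [
    ["query", "seqB", "91.2", "300", "26", "0", "1", "300", "5", "304", "1e-80", "410"],
    ["query", "seqA", "99.0", "500", "5", "0", "1", "500", "1", "500", "0.0", "900"],
    ["query", "seqC", "80.0", "40", "8", "0", "1", "40", "1", "40", "0.5", "30"],
])
page = render_page(fetch_gene_data(conn, parse_hits(blast_output)))
```

Keeping the hit parsing separate from the database lookup mirrors the figure: the search engine only passes sequence IDs between the BLAST database and the managing database, so no gene names or annotation are involved at any step.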
Contributing image collections for quickImage
| Coordinator | Institution | Notes |
|---|---|---|
| Richard Harland | University of California, Berkeley | |
| Nancy Papalopulu | University of Manchester, UK | |
| Nicolas Pollet | Universite Paris-Sud, Orsay, France | |
| Jim Smith | Gurdon Institute, Cambridge, UK | morpholino screen with |
| Naoto Ueno | NIBB, Japan | |
| Patrick Lemaire | IBDML, Marseille | |
The image collections and the key individuals responsible for coordinating the transfer of data to the image search database. A URL is given where the images are available as an existing public resource.
Figure 2. Example output of quickImage. The query sequence was X. tropicalis myf5, used to retrieve image data for this and related genes. The upper panel shows alignment and similarity between the query sequence and the matching image source sequences. The first three sets of retrieved images are shown; for each set, the accession number of the image source sequence and the best BLAST matches against human, mouse and Xenopus proteins are provided for identification purposes, as well as the originating image collection and species. Images marked A and B show highly similar expression of myf5 in the two frog species at the same development stage. The image marked C shows an interestingly similar expression pattern for the related gene myod/myf3 at a slightly later stage.
Figure 3. Example output of quickLit. The query sequence was X. tropicalis brachyury, used to retrieve literature references for this and related genes. The retrieved references are shown for the first few matching sequences. The retrieved data show a high degree of apparent relevance, as indicated by the title of each paper, and a clear organisation of references by species. Reference summaries and associated sequence data were downloaded from NCBI GenBank and various model organism databases.
Figure 4. Example output of quickGene. The query sequence was X. tropicalis brachyury, used to search gene name data from Entrez Gene. Note the variable nature of the retrieved gene names for this set of related genes.
Comparison of text based and sequence based retrieval methods for image data for an arbitrary set of genes
| Gene name | Query sequence | Image sets (sequence search) | Image sets (text: gene symbol) | Image sets (text: name or alias) | Notes |
|---|---|---|---|---|---|
| chordin | NM_001088309 | 3 | 0 | 3 | |
| hairy and enhancer of split 1 | NM_001085917 | 1 | 1 | 1 | |
| noggin | NM_001085644 | 1 | 0 | 1 | |
| homeobox protein SIX1 | NM_001088558 | 1 | 1 | 1 | |
| BMP and activin membrane-bound inhibitor | NM_001008193 | 2 | 2 | 2 | |
| bone morphogenetic protein 4 | Xt7.1-XZT65619.5.5 | 3 | 1 | 3 | mRNA from Entrez Gene appears to be truncated, used EST-based contig sequence instead |
| fibroblast growth factor 8 | NM_001008162 | 1 | 0 | 1 | |
| LIM homeobox 1 | NM_001100228 | 2 | 0 | 2 | |
| SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily d, member 1 | NM_001004862 | 1 | 0 | 0 | probe design sequences were in 3'UTR so there were no BLAST hits for text identification |
| SRY (sex determining region Y)-box 2 | NM_213704 | 4 | 0 | 4 | alias gene symbol 'sox-2' worked better than 'sox2' |
| T, brachyury homolog | NM_001008138 | 6 | !! | 6 | a large number of protein descriptions contain the letter 't' |
| tumor protein p53 | NM_001001903 | 2 | 0 | 2 | older alias gene symbol 'p53' retrieved both image sets |
Image sets are defined by their associated sequence and source collection. Each associated sequence was searched with BLAST against the NCBI protein databases, retaining the best match for human, mouse and the two frog species. Text-based retrieval used simple text matching (allowing wildcards) against the protein description returned by BLAST. Sequence-based retrieval used BLASTn against a database of the image-associated sequences. For each gene, the number of image sets retrieved by the sequence method, using the full-length mRNA, was noted. Text searches with various combinations of the gene symbol, exact full name, and more commonly used names confirmed that the sequence method appeared to have retrieved all image sets for the target gene in each case. Where images for other genes were retrieved along with the target gene images, care was taken to disambiguate search results by inspection, on percent identity or protein description as appropriate.
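The text-search pitfalls noted in the table (a one-letter symbol such as 't', or an alias such as 'sox-2' versus 'sox2') can be illustrated with a minimal sketch of wildcard matching against protein descriptions. The `text_search` helper and the description strings below are hypothetical examples, not the authors' code or data; Python's standard `fnmatch` module stands in for whatever matching the original system used.

```python
from fnmatch import fnmatchcase

def text_search(pattern, descriptions):
    """Return IDs of image sets whose protein description matches a
    shell-style wildcard pattern, ignoring case."""
    lowered = pattern.lower()
    return [set_id for set_id, text in descriptions.items()
            if fnmatchcase(text.lower(), lowered)]

# Hypothetical protein descriptions attached to four image sets.
descriptions = {
    "set1": "T, brachyury homolog",
    "set2": "transcription factor SOX-2",
    "set3": "noggin",
    "set4": "chordin",
}

too_broad = text_search("*t*", descriptions)       # symbol 't' matches unrelated sets
missed = text_search("*sox2*", descriptions)       # symbol spelled without the hyphen
found = text_search("*sox-2*", descriptions)       # alias spelling used in the description
```

A sequence-based query avoids both problems because the match depends on similarity to the image-associated sequence rather than on how a name happens to be spelled in the annotation.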