Jean-Fred Fontaine, Florian Priller, Adriano Barbosa-Silva, Miguel A Andrade-Navarro.
Abstract
Biomedical literature is traditionally used as a way to inform scientists of the relevance of genes in relation to a research topic. However, many genes, especially from poorly studied organisms, are not discussed in the literature. Moreover, a manual and comprehensive summarization of the literature attached to the genes of an organism is in general impossible due to the high number of genes and abstracts involved. We introduce the novel Génie algorithm, which overcomes these problems by evaluating the literature attached to all genes in a genome and to their orthologs according to a selected topic. Génie showed high precision (up to 100%) and the best performance in comparison to other algorithms in most of the benchmarks, especially when high sensitivity was required. Moreover, the prioritization of zebrafish genes involved in heart development, using human and mouse orthologs, showed high enrichment in differentially expressed genes from microarray experiments. The Génie web server supports hundreds of species and millions of genes, and offers novel functionalities. Common run times below a minute, even when analyzing the human genome with hundreds of thousands of literature records, allow the use of Génie in routine lab work. AVAILABILITY: http://cbdm.mdc-berlin.de/tools/genie/.
Year: 2011 PMID: 21609954 PMCID: PMC3125729 DOI: 10.1093/nar/gkr246
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Flow chart of the Génie web tool and algorithm. As an example, a user could query human genes related to a disease or a molecular pathway using chicken and rat orthologs. Usage of orthology information is optional. Data are extracted from four NCBI databases: Taxonomy, Gene, MEDLINE and HomoloGene. As the retrieved literature associated with the topic may not be complete, it is used to train a text mining classifier that will select relevant gene literature. The output gene list (human genes in the given example) is ranked using Fisher’s statistics.
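The ranking step named in the Figure 1 caption can be illustrated with a minimal sketch. The caption states only that genes are ranked using Fisher's statistics; the contingency layout used here (topic-relevant abstracts among a gene's abstracts, compared with all gene-linked abstracts), the function names `fisher_right_tail` and `rank_genes`, and the example counts are assumptions made for illustration, not the published implementation.

```python
from math import comb

def fisher_right_tail(k, n, K, N):
    """One-sided Fisher's exact test p-value, P(X >= k).

    Assumed layout (hypothetical, for illustration only):
      N = all abstracts linked to any gene
      K = abstracts among those N classified as topic-relevant
      n = abstracts linked to the gene of interest
      k = topic-relevant abstracts among the gene's n abstracts
    """
    total = comb(N, n)
    upper = min(n, K)
    # Sum the hypergeometric probabilities for k, k+1, ..., upper.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def rank_genes(gene_counts, K, N):
    """Sort genes by ascending p-value (most enriched first)."""
    scored = [(gene, fisher_right_tail(k, n, K, N))
              for gene, (k, n) in gene_counts.items()]
    return sorted(scored, key=lambda item: item[1])

# Made-up counts: gene -> (relevant abstracts, total abstracts for the gene)
example = {"geneA": (8, 10), "geneB": (1, 10), "geneC": (3, 5)}
ranked = rank_genes(example, K=100, N=1000)
for gene, p_value in ranked:
    print(f"{gene}\t{p_value:.3g}")
```

A one-sided test is used here because only over-representation of topic-relevant abstracts should raise a gene's rank; whether Génie uses a one-sided or two-sided test is not stated in this record.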
Manual evaluation of the top genes for five species and five topics
| Species | Topics | Evaluated genes | True positives | Precision (%) |
|---|---|---|---|---|
| | Host–pathogen interactions | 50 | 50 | 100 |
| | Cell cycle | 50 | 49 | 98 |
| | Pain measurement and knockout mice | 50 | 48 | 96 |
| | Alzheimer's disease | 50 | 47 | 94 |
| | Planar cell polarity | 24 | 22 | 92 |
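The precision column follows directly from the two count columns. As a quick check, the short sketch below recomputes it as true positives divided by evaluated genes; the helper name `precision_percent` is hypothetical, and the counts are copied from the table.

```python
def precision_percent(true_positives, evaluated):
    """Precision as a percentage: 100 * TP / number of evaluated genes."""
    return 100.0 * true_positives / evaluated

# (topic, evaluated genes, true positives) as listed in the table
rows = [
    ("Host-pathogen interactions", 50, 50),
    ("Cell cycle", 50, 49),
    ("Pain measurement and knockout mice", 50, 48),
    ("Alzheimer's disease", 50, 47),
    ("Planar cell polarity", 24, 22),
]

rounded = [round(precision_percent(tp, n)) for _, n, tp in rows]
for (topic, n, tp), pct in zip(rows, rounded):
    print(f"{topic}: {tp}/{n} = {pct}%")
```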
Figure 2. Benchmarks. (a) Génie confidence scores versus log2-fold expression changes for all up-regulated probes (at least 2-fold expression change) in a zebrafish microarray data set between hearts from 3-day-old zebrafish embryos and whole body tissue. All probes with a positive confidence score were selected by Génie using orthology to zebrafish, mice and humans (red diamonds and black crosses). Probes also selected by Génie using only zebrafish-related abstracts are plotted with black crosses. Genes not selected by Génie have a score equal to zero (blue circles). The scores and gene expression fold changes for each gene are available as Supplementary Table S6. (b) Precision when predicting differentially expressed genes using gene ranks given by Génie. From the zebrafish microarray data analysis, differentially expressed genes are selected by an FDR < 0.01 and a minimum 2-fold expression change between heart and body samples. (c) These precision–recall plots show the performance of Génie (red curves), Fable (blue curves) and PolySearch (black curves) when ranking genes from eight randomly chosen KEGG pathways. The three tools were used with default parameters. PolySearch returned no results for two pathways: drug metabolism cytochrome P450 and fructose mannose metabolism (see Supplementary Methods). Génie was run without using orthology expansion of the literature.
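Precision–recall curves such as those in panel (c) are built by walking down a ranked gene list and comparing each prefix with a reference gene set. The sketch below shows that computation under simple assumptions: the function name `precision_recall_points`, the gene identifiers and the reference set are hypothetical, and the exact evaluation protocol used for the KEGG benchmark is described in the paper's methods, not here.

```python
def precision_recall_points(ranked_genes, positives):
    """Return (precision, recall) after each rank in a ranked gene list.

    ranked_genes: genes ordered from best to worst score
    positives:    reference set (for example, the members of one pathway)
    """
    positives = set(positives)
    if not positives:
        raise ValueError("the reference set must not be empty")
    points = []
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            hits += 1
        # precision = hits among the top `rank` genes; recall = hits / all positives
        points.append((hits / rank, hits / len(positives)))
    return points

# Made-up ranking and reference set; "g9" is a positive the tool never returned.
ranking = ["g1", "g2", "g3", "g4"]
reference = {"g1", "g3", "g9"}
points = precision_recall_points(ranking, reference)
for rank, (precision, recall) in enumerate(points, start=1):
    print(f"rank {rank}: precision={precision:.2f} recall={recall:.2f}")
```

Because genes missing from the ranking still count in the recall denominator, a tool that returns few genes cannot reach high recall, which is one reason curves from different tools can end at different points.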
Features comparison
| Feature | Génie | Fable | PolySearch |
|---|---|---|---|
| Abstracts database | Medline | Medline + ‘Publisher Status’ from PubMed | PubMed |
| Updates | Weekly | Several times a year | Direct access (web services) |
| Concepts to query | Unlimited | Unlimited | Unlimited |
| Query input | Text, PMIDs or MeSH terms | Keywords | Keywords |
| Running timeᵃ | 2–25 s | 2–10 s | 1 min to hours |
| Species | 4418 | 1 | 1 |
| Orthology species extension | Yes, 23 species | No | No |
| Gene-concept association method | Naïve Bayesian classifier | Co-occurrence statistics | Co-occurrence statistics |
| Gene ranking method | Fisher's exact test | Frequency | |
| Gene name extraction | Manual | Trained probabilistic model | Dictionary based |
| Synonymous resolution | Manual | Dictionary based | Dictionary based |
| Gene names ambiguity problem | No | Yes | Yes |
ᵃ The run-time depends strongly on the parameters for Génie (here, queries without orthology information) and PolySearch.
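The table lists a naïve Bayesian classifier as Génie's gene-concept association method. To make that idea concrete, here is a minimal word-based sketch with add-one smoothing. It is an assumption-laden illustration: the tokenizer, the function names `train` and `log_odds`, the smoothing choice and the toy abstracts are all hypothetical, and the features and training procedure actually used by Génie are not described in this record.

```python
import math
from collections import Counter

def tokenize(text):
    """Very simple tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()

def train(topic_docs, background_docs):
    """Count word frequencies in topic-related and background abstracts."""
    pos = Counter(w for doc in topic_docs for w in tokenize(doc))
    neg = Counter(w for doc in background_docs for w in tokenize(doc))
    return {
        "pos": pos,
        "neg": neg,
        "vocab": set(pos) | set(neg),
        "prior": math.log(len(topic_docs) / len(background_docs)),
    }

def log_odds(model, text):
    """Log-odds that an abstract is topic-related; values above 0 favour the topic."""
    vocab_size = len(model["vocab"])
    pos_total = sum(model["pos"].values())
    neg_total = sum(model["neg"].values())
    score = model["prior"]
    for word in tokenize(text):
        if word not in model["vocab"]:
            continue  # words never seen in training carry no evidence
        # Add-one (Laplace) smoothing avoids zero probabilities.
        p_topic = (model["pos"][word] + 1) / (pos_total + vocab_size)
        p_background = (model["neg"][word] + 1) / (neg_total + vocab_size)
        score += math.log(p_topic / p_background)
    return score

# Toy training data (made up)
model = train(
    ["heart development in zebrafish", "cardiac valve development"],
    ["kidney stone treatment", "bone fracture healing"],
)
print(log_odds(model, "zebrafish heart valve"))
print(log_odds(model, "kidney fracture"))
```

In a pipeline like the one in Figure 1, abstracts scored as topic-related by such a classifier would then feed the per-gene counts used in the ranking step.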