Theodoros G Soldatos, Seán I O'Donoghue, Venkata P Satagopam, Lars J Jensen, Nigel P Brown, Adriano Barbosa-Silva, Reinhard Schneider.
Abstract
Life scientists are often interested in comparing two gene sets to gain insight into differences between two distinct, but related, phenotypes or conditions. Several tools have been developed for comparing gene sets, most of which find Gene Ontology (GO) terms that are significantly over-represented in one gene set. However, such tools often return GO terms that are too generic or too few to be informative. Here, we present Martini, an easy-to-use tool for comparing gene sets. Martini is based not on GO but on keywords extracted from Medline abstracts; it also supports a much wider range of species than comparable tools. To evaluate Martini, we created a benchmark based on the human cell cycle and tested several comparable tools (CoPub, FatiGO, Marmite and ProfCom). Martini had the best benchmark performance, delivering a more detailed and accurate description of function. Martini also gave the best or equal performance with three other datasets (related to Arabidopsis, melanoma and ovarian cancer), suggesting that Martini represents an advance in the automated comparison of gene sets. In agreement with previous studies, our results further suggest that literature-derived keywords are a richer source of gene-function information than GO annotations. Martini is freely available at http://martini.embl.de.
Year: 2009 PMID: 19858102 PMCID: PMC2800231 DOI: 10.1093/nar/gkp876
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Martini keyword output for the Arabidopsis dataset. All significantly enhanced keywords are shown first as a ‘keyword cloud’, where the size of each keyword is proportional to its statistical significance. The keywords assigned to input sets A or B are colored blue or black, respectively. Below the keyword cloud, the significant keywords are shown again in tabular form, giving the number of times each keyword occurs in each set, the enhancement factor (i.e. the ratio of these two counts) and an adjusted p-value, which estimates the probability that the given level of keyword enhancement occurred by chance. Note that the total number of genes or abstracts shown in this table may be slightly less than the number in the user-defined input. This may happen for two reasons: first, depending on the user’s choice of genes or abstracts as input, Martini will remove common items; second, some abstracts may not have been indexed in AKS2, and hence are not counted.
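The two bookkeeping steps mentioned in this caption, removing items common to both inputs and forming the enhancement factor as a ratio of per-set counts, can be illustrated with a minimal sketch. This is not Martini's implementation: the input shape (a dict from item identifier to keyword list), the function name `enhancement_table` and the choice to count each keyword once per item are assumptions made for illustration, and the adjusted p-value is omitted because the caption does not say which test is used.

```python
from collections import Counter

def enhancement_table(set_a, set_b):
    """Return (keyword, count_a, count_b, enhancement_factor) rows.

    set_a, set_b: dict mapping an item id (gene or abstract) to its list of
    keywords. This input shape is hypothetical, chosen for illustration only.
    """
    # Items present in both inputs are removed before counting.
    common = set_a.keys() & set_b.keys()
    a = {k: v for k, v in set_a.items() if k not in common}
    b = {k: v for k, v in set_b.items() if k not in common}

    # Assumption: a keyword is counted once per item in which it occurs.
    count_a = Counter(kw for kws in a.values() for kw in set(kws))
    count_b = Counter(kw for kws in b.values() for kw in set(kws))

    rows = []
    for kw in sorted(count_a.keys() | count_b.keys()):
        na, nb = count_a[kw], count_b[kw]
        # Enhancement factor as the plain ratio of the two counts;
        # infinite when the keyword never occurs in set B.
        factor = na / nb if nb else float("inf")
        rows.append((kw, na, nb, factor))
    return rows

# Toy data, not taken from the paper.
set_a = {"g1": ["mitosis", "spindle"], "g2": ["mitosis"], "shared": ["kinase"]}
set_b = {"g3": ["mitosis"], "g4": ["apoptosis"], "shared": ["kinase"]}
for row in enhancement_table(set_a, set_b):
    print(row)
```

In this toy input, the item `shared` appears in both sets, so it is dropped and its keyword never reaches the table, which is one way the totals can end up smaller than the user-defined input.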
Martini performance
| Total input | Keyword enhancement time |
|---|---|
| 100 abstracts | 3 s |
| 100 genes | 2 min |
This table can be used to estimate the time required for a Martini analysis, assuming linear scaling with total input size. For example, a keyword enhancement using two sets of 500 genes (a total input of 1000 genes) takes ∼20 min, i.e. 10 times longer than for 100 genes. The estimates given here are for genes with nine Medline abstracts each (the median number for human genes). Scaling can, however, be highly non-linear; for example, including well-studied genes can take much longer. In practice, the actual time taken is often less than the time estimated from this table.
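The linear estimate described above amounts to scaling the reference timings in the table by the total input size. A minimal sketch follows; the function name `estimate_seconds` and the `REFERENCE` dictionary are illustrative, and the result inherits the caveat that real scaling can be non-linear.

```python
# Reference timings from the performance table: kind -> (input size, seconds).
REFERENCE = {
    "abstracts": (100, 3.0),   # 100 abstracts take about 3 s
    "genes": (100, 120.0),     # 100 genes take about 2 min
}

def estimate_seconds(total_input, kind):
    """Linear estimate of keyword enhancement time for a given total input."""
    ref_size, ref_seconds = REFERENCE[kind]
    return total_input * ref_seconds / ref_size

# Two sets of 500 genes give a total input of 1000 genes.
print(estimate_seconds(1000, "genes") / 60)  # → 20.0 (minutes)
```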
Figure 2. Keywords found by Martini from cell-cycle genes. The figure shows all keywords found by Martini using 600 cell-cycle-regulated genes that have been experimentally assigned to specific time points within the human cell cycle. Percentage numbering indicates cell-cycle progress, with cell division occurring at 100% or 12 o’clock. The arc spanned by each keyword shows the exact region where it is statistically significant. The radius of each keyword is determined by word length. The left portion of the figure shows keywords that describe biological processes and functions; these keywords cluster into three distinct phases: M-, S- and a pre-replication phase. The right portion of the figure is a close-up of the pre-replication and S-phase regions, showing keywords that specify genes, proteins or complexes. The keywords shown and their positions match, with surprising accuracy and precision, the sub-phases, processes and entities known to occur in the cell cycle (see also Table 3).
Table 3. Cell-cycle benchmark and score-card
| Cell-cycle phases, processes and components | Synonyms/related genes | Martini | Marmite | CoPub | FatiGO | ProfCom |
|---|---|---|---|---|---|---|
| 1. G1-Phase | Gap 1 | |||||
| 2. S-Phase | DNA metabolism, synthesis; synthesis phase | 1 | 1 |||
| 1 | |||||
| (i) Replication initiation | Chromatin silencing; Hyperphosphorylation | |||||
| (ii) DNA replication | DNA methylation, synthesis, recombination | 1 | 1 | 1 ||
| (iii) DNA repair | Base-excision repair; DNA damage response, unwinding; double-strand break repair; mismatch repair; nucleotide-excision repair; post-replication repair; Telomere maintenance | 1 | 1 |||
| (i) Origin of replication complex | Claspin; ORC | |||||
| (ii) Mini-chromosome maintenance complex | MCM2-7 | |||||
| (iii) Replication fork | CHL12; DNA ligase; DNA polymerase; DNA replication factor; Helicase; holoenzyme; lagging strand; leading strand; Okazaki fragments; PCNA; PCNA-binding protein; pre-replication complex; primase; processivity; replication protein A (RP-A); replication factor C (RFC); single-stranded DNA; ssDNA-binding proteins; Strand displacement; Topoisomerase. | 1 | 1 |||
| 3. DNA repair proteins | Ataxia Telangiectasia mutated gene (ATM), ATR, ATR-interacting proteins; ATRIP; CHK1 kinase, CHK2 kinase; HUS1 | 1 ||||
| 4. G2-Phase | Gap 2 | |||||
| 5. M-Phase | Cell division; ‘Karyokinesis and cytokinesis’; Mitosis; Mitotic division; ‘Not interphase’ | 1 | 1 | 1 ||
| 1 | 1 | ||||
| (i) Prophase | Envelope breakdown. | |||||
| (ii) Prometaphase | Chromosome condensation; spindle assembly; spindle elongation | 1 | 1 |||
| (iii) Metaphase | BubR1; chromosome alignment; hyperphosphorylation; metaphase–anaphase transition; mitotic checkpoint; mitotic exit checkpoint; mitotic spindle checkpoint; spindle stabilization | 1 | 1 |||
| (iv) Anaphase | APC/C; centrosome separation; chromatid separation; chromosome segregation; sister chromatid separation | 1 | 1 |||
| (v) Telophase | Multinuclear | 1 | 1 |||
| (vi) Cytokinesis | Abscission | 1 ||||
| 1 | 1 | 1 | |||
| (i) Mitotic spindle | Aster; centrosomes; centriole pair; Kinetochore; microtubules; mitotic center | |||||
| (ii) Metaphase plate | Midzone | 1 ||||
| (iii) Cleavage furrow | Contractile ring; non-muscle myosin II+actin filaments | |||||
| True Positives (non-redundant) | | 12 | 0 | 12 | 5 | 0 |
| False Negatives | | 8 | 20 | 8 | 15 | 20 |
| False Positives | | 0 | 0 | 58 | 6 | 0 |
| Recall | | 60% | 0% | 60% | 25% | 0% |
| Precision | | 100% | – | 17% | 46% | – |
This table lists 20 key features of the human cell cycle, and uses this feature list as a benchmark to compare the performance of different tools based on their output (Table 2).
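The recall and precision rows of the score-card follow from the true-positive, false-negative and false-positive counts by the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP). The sketch below recomputes them from the counts printed in the table; the names `SCORES`, `recall` and `precision` are illustrative. Precision is left undefined (`None`) when a tool returns no terms at all, matching the ‘–’ entries.

```python
# (true positives, false negatives, false positives) per tool,
# copied from the score-card summary rows.
SCORES = {
    "Martini": (12, 8, 0),
    "Marmite": (0, 20, 0),
    "CoPub": (12, 8, 58),
    "FatiGO": (5, 15, 6),
    "ProfCom": (0, 20, 0),
}

def recall(tp, fn):
    """Fraction of the 20 benchmark features that the tool recovered."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of returned terms that are correct; None if nothing was returned."""
    return tp / (tp + fp) if (tp + fp) else None

for tool, (tp, fn, fp) in SCORES.items():
    r = recall(tp, fn)
    p = precision(tp, fp)
    p_text = "–" if p is None else f"{100 * p:.1f}%"
    print(f"{tool}: recall {100 * r:.0f}%, precision {p_text}")
```

With these counts, the formula gives a precision of about 45.5% for FatiGO (5 of 11), roughly one percentage point below the 46% printed in the table; the other entries match.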
Table 2. Cell-cycle keywords and GO terms
| Tool | G1 | S | G2 | M |
|---|---|---|---|---|
| Martini | – | – | ||
| Marmite | – | – | – | – |
| CoPub | ||||
| FatiGO | – | |||
| ProfCom | – | – |
The output of different tools applied to four gene sets, corresponding to the cell-cycle phases G1, S, G2 and M. Each output term was categorized as ‘true positive’, ‘false positive’ or ‘uninformative’. Superscripts indicate matches to the cell-cycle benchmark (Table 3). Qualitatively, Martini gave the best performance.
Keywords for metastatic versus primary melanoma
| Tool | Keywords/GO terms |
|---|---|
| Martini | |
| Marmite | |
| FatiGO | |
| ProfCom | — |
In this table, different tools have been used to compare a set of genes associated with primary melanoma with a second set of genes associated with metastatic melanoma. Each keyword or GO term found has been classified as mitosis-related, uninformative or ‘not mitosis-related’. Compared with the other tools, Martini found more keywords and more specific keywords, but also more uninformative ones.