Nan Qiao, Yi Huang, Hammad Naveed, Christopher D Green, Jing-Dong J Han.
Abstract
A routine approach to inferring functions for a gene set is function enrichment analysis based on GO, KEGG or other curated terms and pathways. However, such analysis requires overlapping genes between the query gene set and those annotated by GO/KEGG. Furthermore, GO/KEGG databases maintain only a very restricted vocabulary. Here, we have developed a tool called "CoCiter", based on literature co-citations, to address these limitations of conventional function enrichment analysis. Co-citation analysis is widely used in ranking articles and predicting protein-protein interactions (PPIs). Our algorithm can further assess the co-citation significance of a gene set with any other user-defined gene set, or with free terms. We show that, compared with the traditional approaches, CoCiter is a more accurate and flexible function enrichment analysis method. CoCiter is freely available at www.picb.ac.cn/hanlab/cociter/.
Year: 2013 PMID: 24086311 PMCID: PMC3781068 DOI: 10.1371/journal.pone.0074074
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Schematic view of the functions in CoCiter.
The Gene-Gene and Gene-Term association analysis functions (asterisked) include a significance test for the results.
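To make the idea of a co-citation significance test concrete, the following is a minimal sketch of one generic way to test it by permutation. It is not CoCiter's actual algorithm, which this record does not describe: the function names `cocitation_count` and `permutation_p_value`, the gene-to-paper mapping, the choice of statistic (papers citing at least one gene from each set) and the toy data are all assumptions made for illustration.

```python
import random


def cocitation_count(genes_a, genes_b, gene_to_papers):
    """Count distinct papers that cite at least one gene from each set."""
    papers_a = set().union(*(gene_to_papers.get(g, set()) for g in genes_a))
    papers_b = set().union(*(gene_to_papers.get(g, set()) for g in genes_b))
    return len(papers_a & papers_b)


def permutation_p_value(genes_a, genes_b, gene_to_papers, n_perm=1000, seed=0):
    """Empirical p-value: how often a random gene set of the same size as
    genes_a is co-cited with genes_b at least as often as observed."""
    rng = random.Random(seed)
    universe = sorted(gene_to_papers)  # hypothetical background: all known genes
    observed = cocitation_count(genes_a, genes_b, gene_to_papers)
    hits = 0
    for _ in range(n_perm):
        random_a = rng.sample(universe, len(genes_a))
        if cocitation_count(random_a, genes_b, gene_to_papers) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy literature index (hypothetical): gene symbol -> set of paper IDs.
gene_to_papers = {"A1": {1, 2}, "A2": {2, 3}, "B1": {1, 3}, "B2": {2}}
gene_to_papers.update({f"N{i}": {100 + i} for i in range(20)})

observed, p_value = permutation_p_value(["A1", "A2"], ["B1", "B2"], gene_to_papers)
print(observed, p_value)
```

In this toy index the two query sets share three papers, while most random gene pairs share none with the second set, so the empirical p-value comes out small. Unlike an overlap-based test, the statistic does not require the two gene sets to have any genes in common.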
Unique features of CoCiter compared with those of existing gene function analysis tools.
| Category | Name | Input | Species | Type | Gene function enrichment analysis | Compare to user defined gene sets | Compare to user defined term sets |
| | | Genes | 3 | ArrayTrack plug-in | √ | | |
| | | Genes | 24 | Cytoscape plug-in | √ | | |
| | | Genes | 14 | Cytoscape plug-in | √ | | |
| | | Genes | 6 | Web + API | √ | | |
| | | Genes | Many | Web + API | √ | | |
| | | Genes | 10 | Web | √ | √ | |
| | | Genes | 8 | Web | | | |
| | | Genes and terms | Many | Web + API | | | |
| | | Genes and terms | Many | Web | | | |
| | | Genes | Many | Web | | | |
| | | Genes and terms | Many | Standalone | √ | | |
| | | Genes and terms | 3 | Web + API | √ | | |
| | | Genes | Many | Web | √ | √ | |
| | | Genes and terms | Many | Web + API | √ | √ | √ |
| | | Genes and terms | Many | Web + API | √ | √ | √ |
by utilizing GO enrichment analysis.
limited to a small set of predefined terms.
only finds the differences between two conditions (conditions could be genes or terms) based on key words.
Performance of CoCiter, FatiGO, Martini and Marmite on the disease resistance and randomly selected gene datasets.
| Tools | Comparison | Time | P | CI | Description |
| CoCiter | Disease resistance gene vs. Disease resistance term | 4 sec | <0.001 | 6.3576 | |
| CoCiter | Random genes vs. Disease resistance terms | 7 sec | 0.337 | 4.1699 | |
| CoCiter | Disease resistance genes vs. random genes | 9 sec | 0.979 | 10.3859 | |
| FatiGO | Disease resistance gene vs. Disease resistance term background | 4 min | <0.001 | defense response, immune response | Unable to accept user defined terms |
| FatiGO | Random genes vs. Disease resistance term background | 4 min | NA | Nothing enriched in these genes | Unable to accept user defined terms |
| FatiGO | Disease resistance gene vs. randomly picked gene | 2 min | <0.001 | response to stress, defense response, immune response | |
| Martini | Disease resistance gene vs. Disease resistance term | NA | NA | NA | Too many entries to carry on |
| Martini | Random selected genes vs. Disease resistance terms | NA | NA | NA | Too many entries to carry on |
| Martini | Disease resistance genes vs. random genes | <30 sec | <0.001 | disease, resistance, avirulent, pathogen, plant diseases | |
| Marmite | Disease resistance gene vs. Disease resistance term | NA | NA | NA | Unable to accept user defined terms |
| Marmite | Random gene vs. Disease resistance term | NA | NA | NA | Unable to accept user defined terms |
| Marmite | Disease resistance gene vs. randomly picked gene | <30 sec | NA | NA | No entities found |
Strike-through fonts indicate unavailable functions.
Performance of CoCiter, FatiGO, Martini and Marmite on the nuclear and plasma membrane protein-coding gene datasets.
| Tools | Comparison | Time | P | CI | Description |
| CoCiter | Plasma membrane protein-coding genes vs. plasma membrane term | 1 min 15 sec | <0.001 | 12.3923 | |
| CoCiter | Nuclear protein-coding genes vs. nucleus term | 1 min 50 sec | <0.001 | 12.9814 | |
| CoCiter | Plasma membrane protein-coding genes vs. nuclear protein-coding genes | 5 min 15 sec | 0.072 | 14.4227 | |
| FatiGO | Plasma membrane protein-coding genes vs. plasma membrane term background | 18 min | <0.001 | membrane fraction, extrinsic to plasma membrane, lateral plasma membrane | Unable to accept user defined terms |
| FatiGO | Nuclear protein-coding genes vs. nucleus term background | 18 min | <0.01 | chromosome, ribonucleoprotein complex, pronucleus, membrane fraction, endomembrane system, extracellular space | Unable to accept user defined terms |
| FatiGO | Plasma membrane protein-coding genes vs. nuclear protein-coding genes | 13 min | <0.001 | extracellular matrix, extracellular region part, extracellular space, basolateral plasma membrane, lateral plasma membrane, extrinsic to plasma membrane, Golgi apparatus, organelle membrane, endomembrane system, nuclear part, chromosome | |
| Martini | Plasma membrane protein-coding genes vs. plasma membrane term | NA | NA | NA | Too many entries to carry on |
| Martini | Nuclear protein-coding genes vs. nucleus term | NA | NA | NA | Too many entries to carry on |
| Martini | Plasma membrane protein-coding genes vs. nuclear protein-coding genes | 1 day 8 hours | <0.001 | transmembrane, membrane, plasma membrane, subnucleus, nucleus | |
| Marmite | Plasma membrane protein-coding genes vs. plasma membrane term | NA | NA | NA | Unable to accept user defined terms |
| Marmite | Nuclear protein-coding genes vs. nucleus term | NA | NA | NA | Unable to accept user defined terms |
| Marmite | Plasma membrane protein-coding genes vs. nuclear protein-coding genes | <1 min | 0.057 | Cancer | |
Strike-through fonts indicate unavailable functions.
Figure 2. ROC curves of CoCiter and GO enrichment analysis by Fisher's exact test.
The analysis was based on 2097 gold standard positives (GSP) and 603 gold standard negatives (GSN) selected from the overlapping and non-overlapping GO and KEGG annotations, respectively (Supplemental Methods in File S1). The curve for CoCiter_Gene_Gene association function was obtained by using the KEGG genes and GO genes as input, while that for CoCiter_Gene_Term association function was obtained using the KEGG pathway keywords as terms and GO genes as input.
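As an illustration of how an ROC curve can be derived from per-pair p-values and gold standard labels, here is a minimal sketch in plain Python. The helper names `roc_points` and `auc` and the six toy p-values are assumptions for illustration only; they are not the paper's data or code. A pair is called positive when its p-value falls at or below a threshold, and the threshold is swept over every distinct p-value.

```python
def roc_points(p_values, is_gsp):
    """Return (false positive rate, true positive rate) points.

    p_values: one p-value per gene-set pair (smaller = stronger association).
    is_gsp:   True for gold standard positives, False for gold standard negatives.
    """
    n_pos = sum(is_gsp)
    n_neg = len(is_gsp) - n_pos
    ranked = sorted(zip(p_values, is_gsp))  # most significant pairs first
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Pairs with tied p-values cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example (hypothetical values): three GSP pairs and three GSN pairs.
toy_p = [0.001, 0.004, 0.02, 0.3, 0.5, 0.9]
toy_labels = [True, True, False, True, False, False]
toy_curve = roc_points(toy_p, toy_labels)
print(toy_curve)
print(round(auc(toy_curve), 4))  # 0.8889
```

In the toy data, eight of the nine positive-negative pairs are ranked in the right order, which is why the area comes out as 8/9.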
Significance of association for 10 GSP and GSN gene and term sets detected by CoCiter or the GO-based analysis.
| ID | Type | CoCiter gene-gene analysis | CoCiter gene-term analysis | Fisher's exact test |
| hsa00071: Fatty acid metabolism VS. GO:0000038∼ very-long-chain fatty acid metabolic process | GSP | 0 | 0.002 | 0.010020764 |
| hsa04310: Wnt signaling pathway VS. GO:0017147∼ Wnt-protein binding | GSP | 0 | 0 | 0.010598356 |
| hsa04070: Phosphatidylinositol signaling system VS. GO:0008526∼ phosphatidylinositol transporter activity | GSP | 0.004 | 0.003 | 0.011084552 |
| hsa00252: Alanine and aspartate metabolism VS. GO:0009067∼ aspartate family amino acid biosynthetic process | GSP | 0.014 | 0.006 | 0.011268153 |
| hsa04540: Gap junction VS. GO:0005243∼ gap junction channel activity | GSP | 0.002 | 0 | 0.011605198 |
| hsa04010: MAPK signaling pathway VS. GO:0043409∼ negative regulation of MAPKKK cascade | GSP | 0 | 0.006 | 0.012145396 |
| hsa04020: Calcium signaling pathway VS. GO:0051925∼ regulation of calcium ion transport via voltage-gated calcium channel | GSP | 0 | 0.008 | 0.012581597 |
| hsa04910: Insulin signaling pathway VS. GO:0032868∼ response to insulin stimulus | GSP | 0.004 | 0.004 | 0.013070018 |
| hsa04210: Apoptosis VS. GO:0042771∼ DNA damage response, signal transduction by p53 class mediator resulting in induction of apoptosis | GSP | 0 | 0 | 0.013908051 |
| hsa04510: Focal adhesion VS. GO:0051895∼ negative regulation of focal adhesion formation | GSP | 0 | 0 | 0.014279408 |
| hsa05030: Amyotrophic lateral sclerosis (ALS) VS. GO:0008624∼ induction of apoptosis by extracellular signals | GSP | 0 | 0.009 | 0.016153439 |
| hsa05219: Bladder cancer VS. GO:0044444∼ cytoplasmic part | GSN | 0 | 0 | 0.000597928 |
| hsa04664: Fc epsilon RI signaling pathway VS. GO:0046456∼ icosanoid biosynthetic process | GSN | 0.001 | 0 | 0.001236482 |
| hsa05040: Huntington's disease VS. GO:0030685∼ nucleolar preribosome | GSN | 0 | 0.017 | 0.002211231 |
| hsa04012: ErbB signaling pathway VS. GO:0050877∼ neurological system process | GSN | 0 | 0 | 0.00252409 |
| hsa00561: Glycerolipid metabolism VS. GO:0010033∼ response to organic substance | GSN | 0.019 | 1 | 0.00258624 |
| hsa04614: Renin-angiotensin system VS. GO:0004245∼ neprilysin activity | GSN | 0 | 1 | 0.002828214 |
| hsa04130: SNARE interactions in vesicular transport VS. GO:0006906∼ vesicle fusion | GSN | 0.002 | 1 | 0.005413028 |
| hsa00252: Alanine and aspartate metabolism VS. GO:0016885∼ ligase activity, forming carbon-carbon bonds | GSN | 0 | 0.104 | 0.005649569 |
| hsa05212: Pancreatic cancer VS. GO:0050801∼ ion homeostasis | GSN | 0 | 0.007 | 0.008080875 |
| hsa04350: TGF-beta signaling pathway VS. GO:0045687∼ positive regulation of glial cell differentiation | GSN | 0 | 1 | 0.008538547 |
These GSP and GSN pairs lie at the border of the Fisher's exact test significance level (p = 0.01). The full table is shown in Table S8 in File S2.
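For readers unfamiliar with the overlap-based test in the last column of the table above, the following is a minimal sketch of a one-sided Fisher's exact test (equivalently, the hypergeometric tail) for the overlap between two gene sets. The function name `fisher_enrichment_p` and the set sizes used below are assumptions for illustration, and the paper's exact settings (background size, sidedness) are not given in this record.

```python
from math import comb


def fisher_enrichment_p(overlap, size_a, size_b, universe):
    """One-sided p-value for observing at least `overlap` shared genes.

    Under the null hypothesis, set B is a random draw of `size_b` genes from
    a background of `universe` genes, of which `size_a` belong to set A.
    """
    max_overlap = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(comb(size_a, k) * comb(universe - size_a, size_b - k)
               for k in range(overlap, max_overlap + 1))
    return tail / total


# Hypothetical sizes: 20 background genes, two sets of 5 genes each.
print(fisher_enrichment_p(0, 5, 5, 20))  # 1.0
print(fisher_enrichment_p(5, 5, 5, 20))  # 1/15504, about 6.4e-05
```

With no shared genes the p-value is exactly 1, whatever the sizes of the sets. This is the limitation noted in the abstract: an overlap-based test cannot detect a relationship between two gene sets that have no genes in common.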