Patrick Glenisson, Bert Coessens, Steven Van Vooren, Janick Mathys, Yves Moreau, Bart De Moor.
Abstract
We implemented a framework called TXTGate that combines literature indices of selected public biological resources in a flexible text-mining system designed towards the analysis of groups of genes. By means of tailored vocabularies, term- as well as gene-centric views are offered on selected textual fields and MEDLINE abstracts used in LocusLink and the Saccharomyces Genome Database. Subclustering and links to external resources allow for in-depth analysis of the resulting term profiles.
Year: 2004 PMID: 15186494 PMCID: PMC463076 DOI: 10.1186/gb-2004-5-6-r43
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Conceptual overview of TXTGate. We indexed two different sources of textual information about genes (LocusLink and SGD) using different domain vocabularies (offline process). These indices are used online for textual gene profiling and clustering of interesting gene groups. TXTGate's link-out feature to external databases makes it possible to investigate the profiles in more detail.
Overview of the indexed resources of textual information in TXTGate
| Resource | Information fields | Domain vocabularies used |
| LocusLink | Linked MEDLINE abstracts | GO, MeSH, eVOC, OMIM, HUGO gene symbols |
| | GeneRIF annotations | GO |
| | Functional summaries | GO |
| | GO annotations | GO |
| SGD | Linked MEDLINE abstracts | GO-pruned, SGD gene symbols |
| | GO annotations | GO-pruned |
In the second column we specify which fields of the resource were used. The third column lists the domain vocabularies with which the information was indexed.
Overview of the domain vocabularies in TXTGate
| Domain vocabulary | Number of terms |
| Term-centric | |
| GO | 17,965 |
| GO-pruned (yeast) | 3,867 |
| MeSH | 27,930 |
| OMIM | 2,969 |
| eVOC | 1,553 |
| Gene-centric | |
| HUGO gene symbols (human) | 26,511 |
| SGD gene symbols (yeast) | 11,319 |
The vocabularies are named after the resource they stem from.
Significance of coherence score C
| Gene groups | Size | Coherence score |
| Cell-cycle control | 19 | 1.01E-167 |
| DNA repair | 3 | 3.91E-61 |
| Fatty acids/lipids | 25 | 4.28E-08 |
| Glycosylation | 7 | 6.29E-06 |
| Methionine | 5 | 9.88E-28 |
| Mitotic exit | 9 | 1.50E-82 |
| Nutrition | 19 | 1.76E-18 |
| Pseudohyphae | 10 | 2.79E-05 |
| Secretion | 13 | 1.11E-06 |
| Sporulation | 16 | 1.11E-01 |
The significance is calculated with respect to 100-fold randomization for 10 cell-cycle-related functional groups selected from Figure 7 in Spellman et al. [37]. All groups are functionally coherent according to our score, except for the sporulation group.
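The randomization idea behind this table can be illustrated with a minimal sketch. It assumes a score where lower values mean a more coherent group, and it estimates an empirical p-value by comparing the observed score with the scores of randomly drawn groups of the same size. The function name `empirical_p_value`, the toy `spread` score and the integer "genes" are hypothetical and only serve the illustration. An empirical p-value from 100 draws cannot fall below roughly 0.01, so the far smaller values in the table must come from a calculation that is not described in this excerpt and that this sketch does not reproduce.

```python
import random


def empirical_p_value(group, all_genes, score, n_random=100, seed=0):
    """Fraction of random same-size groups scoring at least as well as `group`.

    `score` is any function mapping a list of genes to a number,
    where LOWER means more coherent (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = score(group)
    hits = 0
    for _ in range(n_random):
        random_group = rng.sample(all_genes, len(group))
        if score(random_group) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_random + 1)


# Toy data: each hypothetical "gene" is an integer and a group is
# coherent when its members lie close together.
def spread(genes):
    return max(genes) - min(genes)


all_genes = list(range(100))
p_tight = empirical_p_value([10, 11, 12], all_genes, spread)
p_loose = empirical_p_value([0, 50, 99], all_genes, spread)
```

Here `p_tight` should be small because few random triples are as compact as `[10, 11, 12]`, while `p_loose` equals 1.0 because no group can spread further than `[0, 50, 99]`.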
TXTGate profiling of cluster E from Eisen et al. [39]
| Gene symbol | Cluster terms in Blaschke et al. [16] | Terms from TXTGate |
| Subcluster E1 | glyceraldehyde-3-phosphate* | glyceraldehyd_3_phosphat_dehydrogenas |
| | glyceraldehyde-3-phosphate dehydrogenase* | glycolyt |
| | phosphoglycerate kinase* | glucos |
| | phosphoglycerate* | enzym |
| | mutase* | glycolysi |
| | dehydrogenase | carbon |
| | enolase | pyruv_kinas |
| | glycerol-3-phosphate dehydrogenase | ethanol |
| | osmotic stress | phosphoglycer_kinas |
| | phospoglycerate | growth |
| Subcluster E2 | alcohol* | pyruv_decarboxylas |
| | transketolase* | pyruv |
| | catabolite repression | glucos |
| | decarboxylase | enzym |
| | ethanol | alcohol |
| | glucose | decarboxyl |
| | glucose repression | ethanol |
| | hexokinases | ferment |
| | pyruvate | thiamin |
| | pyruvate decarboxylase | decarboxylas |
Profiling is by subclustering (k = 2). High-scoring terms are shown for each subcluster E1 and E2. We also show the terms (excluding gene names) resulting from a similar analysis conducted by Blaschke et al. [16]. *Terms that were labeled specific to a subcluster by Blaschke et al. Although several of their settings are different from ours (because of the differences in MEDLINE corpus, textual analysis and the cluster algorithm used), a comparison of the term profiles in both analyses shows that TXTGate also identifies E1 as related to glycerol, whereas E2 is more related to pyruvate metabolism and ethanol fermentation. Complete data can be found in Additional data file 1.
TXTGate profiling of clusters a, b, c, and d from Chaussabel and Sher [6] (GO vocabulary)
| Gene symbol | Cluster terms in Chaussabel and Sher [6] | Terms from TXTGate | ||
| Cluster a | Lipoprotein | |||
| Density | lipas | |||
| Cholesterol | ldl | |||
| Lipid | ldl_receptor | |||
| Adipose | ||||
| hdl | ||||
| scaveng_receptor | ||||
| high_densiti_lipoprotein | ||||
| low_densiti_lipoprotein_receptor | ||||
| low_densiti_lipoprotein | ||||
| Cluster b | Invasive | Collagenase | ||
| Invasion | Collagen | |||
| Metastasis | Matrix | metalloendopeptidas | ||
| UPAR | MMP | |||
| UPA | Metalloproteinase | extracellular_matrix | ||
| Plasminogen | Molecule-1 | alpha | ||
| Urokinase-type | Adhesion | |||
| Urokinase | Vascular | |||
| Plasmin | Endothelial | interstiti | ||
| Activator | ||||
| Cluster c | Adenosine | purinerg | ||
| A2A | ||||
| A1 | deaminas | |||
| Antagonist | p2 | |||
| Agonist | p2x | |||
| NM | p1 | |||
| receptor | ||||
| adenosin_receptor | ||||
| ada | ||||
| Cluster d | Interferon | tumor_necrosi_factor | ||
| IFN-alpha | cytokin | |||
| IFN | induc | |||
| Interferon-gamma | ||||
| IFN-gamma | inflammatori | |||
| Inducible | antigen | |||
| lymphocyt_activ | ||||
| stimul | ||||
| chemokin | ||||
| monocyt | ||||
Corresponding terms in Chaussabel and Sher [6] and TXTGate are in bold. TXTGate's profiles are comparably informative. Complete data can be found in Additional data file 2.
Comparison of the terms in cluster e found by Chaussabel and Sher [6] with those found by TXTGate (OMIM vocabulary)
| Gene symbol | Cluster terms in Chaussabel and Sher [6] | Terms from TXTGate |
| Cluster e | ||
| Population | deaminas | |
| Frequency | ||
| Allele | creatin | |
| Unrelated | lipoprotein | |
| Families | ||
| Recessive | ||
| Autosomal | ||
| Disorder | bear | |
| Severe | leukodystrophi | |
| Patient | receptor | |
| Deficiency | down | |
| hdl | ||
| nucleosid | ||
| retinoblastoma | ||
| junction | ||
| adhesion | ||
| congenit_heart_defect | ||
Because the member genes are related to a diverse set of diseases, the relevant terms display high variance rather than high mean. The terms that were also found by Chaussabel and Sher [6] after manual investigation are marked in bold. Complete data can be found in Additional data file 2.
Various perspectives on textual information in TXTGate
| GO | OMIM | MeSH | eVOC |
| mismatch_repair | colorect | colorect_neoplasm | colorect |
| tumor | colorect_cancer | mismatch | tumour |
| dna_repair | tumor | cancer | malign_tumour |
| mismatch | kinas | colorect | colon |
| pair | colon | mutat | growth |
| tumor_suppressor | hereditari | repair | cell |
| apc | cancer | dna_repair | carcinoma |
| kinas | colon_cancer | colon | metabol |
| somat | associ | neoplasm_protein | fibroblast |
| ra | on | tumor | chain |
Here we show how term-centric vocabularies based on GO, OMIM, MeSH and eVOC profile a group of genes involved in colon and colorectal cancer.
Co-linkage analysis of genes with gene-centric vocabularies
| Gene name | Description |
| hnpcc | Hereditary nonpolyposis colon cancer |
| apc | Adenomatous polyposis coli protein |
| p53 | Cellular tumor antigen P53 (tumor suppressor P53) |
| mlh1 | DNA mismatch repair protein MLH1 (mutL protein homolog 1) |
| | E. coli mismatch repair gene mutS |
| | Cyclin-dependent kinase inhibitor 1A |
| msh2 | DNA mismatch repair protein MSH2 (mutS protein homolog 2) |
| bax | BAX protein, cytoplasmic isoform delta |
| | Wingless-type MMTV integration site family members |
| pms2 | DNA mismatch repair protein PMS2 |
| src | Proto-oncogene tyrosine protein kinase SRC |
| dcc | Tumor suppressor protein DCC precursor (colorectal cancer suppressor) |
| mcc | Colorectal mutant cancer protein MCC |
| braf | Proto-oncogene serine/threonine protein kinase B-RAF |
| fgfr3 | Fibroblast growth factor receptor 3 precursor |
| hcc | Hepatocellular carcinoma |
| dra | Chloride anion exchanger DRA |
| axin2 | AXIS inhibition protein 2 |
| pms1 | DNA mismatch repair protein PMS1 |
| | Abelson murine leukemia viral oncogene homolog 1 |
| bub1 | Mitotic checkpoint serine/threonine protein kinase BUB1 |
| | Protein tyrosine phosphatase family |
| bcl10 | B cell lymphoma/leukemia 10 |
| | Protein tyrosine phosphatase family with C-terminal PEST-motif |
| | PDGF-receptor beta-like tumor suppressor |
This table shows the top 25 co-linked gene symbols in the pool of abstracts of the colon and colorectal cancer case. Genes that were not in the query list are indicated in bold.
Textual profile of a gene group from a mouse model for human benign tumors of the salivary glands
| Terms sorted by mean | Terms sorted by variance |
| organ | organ |
| intern | intern |
| normal | growth |
| red | development |
| male | fibroblast |
| femal | tumour |
| visual | red |
| capillari | nucleu |
| system | normal |
| optic | embryo |
| retina | tera |
| viral | depend |
| bacteri | stem_cell |
| adult | kidnei |
| chain | epithelium |
| cell | visual |
| growth | multipl |
| tissu | skin |
| development | muscl_cell |
| metabol | system |
| embryo | capillari |
| fibroblast | mammari |
| tumour | type_ii |
| depend | bacteri |
| genet | male |
This table shows the 25 top-ranking terms (for both mean and variance) of the textual profile of a group of 350 genes that were upregulated in a mouse model for human benign tumors of the salivary glands processed with the eVOC domain vocabulary.
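Ranking the terms of a group profile by mean and by variance can be sketched as follows. The sketch assumes each gene is represented by a list of term weights aligned with a shared vocabulary; the weighting scheme, the function name `rank_terms` and the toy numbers are hypothetical and are not taken from TXTGate itself.

```python
from statistics import mean, pvariance


def rank_terms(gene_profiles, vocabulary, top=25):
    """Return (terms sorted by mean, terms sorted by variance), highest first.

    `gene_profiles` holds one weight vector per gene, and
    `vocabulary[i]` names the term behind column i.
    """
    columns = list(zip(*gene_profiles))  # one tuple of weights per term
    stats = [(term, mean(col), pvariance(col))
             for term, col in zip(vocabulary, columns)]
    by_mean = [t for t, _, _ in sorted(stats, key=lambda s: s[1], reverse=True)]
    by_variance = [t for t, _, _ in sorted(stats, key=lambda s: s[2], reverse=True)]
    return by_mean[:top], by_variance[:top]


# Toy profile: three genes, three terms (weights are made up).
vocabulary = ["organ", "growth", "tumour"]
gene_profiles = [
    [0.5, 0.9, 0.0],
    [0.5, 0.1, 0.0],
    [0.5, 0.2, 0.9],
]
by_mean, by_variance = rank_terms(gene_profiles, vocabulary)
```

In this toy case "organ" leads the mean ranking because every gene carries it equally, whereas "tumour" leads the variance ranking because only one gene carries it strongly. This mirrors why the two columns of the table differ.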
Figure 2. Background distributions for cluster incoherence. Cluster incoherence is defined as the median distance in vector space between the mean cluster profile and all individual gene profiles. Probability density functions (pdf) are shown for random clusters of size 350 (blue curve) and random clusters of random size (blue bars). For randomly sized clusters, the cumulative distribution function (cdf) is also shown (red curve).
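The definition of cluster incoherence given in this caption can be written out as a short sketch. The caption does not state which distance is used, so cosine distance is an assumption here, and the function names and toy vectors are hypothetical.

```python
import math
from statistics import median


def cosine_distance(u, v):
    """1 - cosine similarity; assumed metric, not confirmed by the source."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty profile as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def mean_profile(profiles):
    """Component-wise mean of the gene profiles in a cluster."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def cluster_incoherence(profiles, distance=cosine_distance):
    """Median distance between the mean cluster profile and each gene profile."""
    centre = mean_profile(profiles)
    return median(distance(centre, p) for p in profiles)


# Toy clusters: identical profiles versus orthogonal profiles.
tight = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
loose = [[1.0, 0.0], [0.0, 1.0]]
tight_score = cluster_incoherence(tight)
loose_score = cluster_incoherence(loose)
```

A cluster of identical profiles scores 0, and the score grows as the member profiles diverge from their mean. Any other vector-space distance can be passed through the `distance` parameter.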