Monica Chagoyen, Pedro Carmona-Saez, Hagit Shatkay, Jose M Carazo, Alberto Pascual-Montano.
Abstract
BACKGROUND: Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.
Year: 2006 PMID: 16438716 PMCID: PMC1386711 DOI: 10.1186/1471-2105-7-41
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Method overview. Schematic overview of the method and corresponding gene representation.
Biological processes in test data set (according to GO Slim annotations in "Biological Process Category")
| GO ID | GO term | Genes | Genes shared with other categories |
| GO:0007049 | cell cycle | 77 | 5 DNA metabolism; 2 response to stress |
| GO:0007047 | cell wall organization and biogenesis | 32 | 3 signal transduction |
| GO:0006259 | DNA metabolism | 146 | 5 cell cycle; 1 transport |
| GO:0006629 | lipid metabolism | 34 | 1 response to stress |
| GO:0042158 | protein biosynthesis | 49 | - |
| GO:0006950 | response to stress | 63 | 2 cell cycle; 1 signal transduction; 4 transport; 1 lipid metabolism |
| GO:0007165 | signal transduction | 39 | 3 cell wall organization and biogenesis; 1 response to stress |
| GO:0006810 | transport | 152 | 1 DNA metabolism; 4 response to stress |
Example semantic features (SGD8 dataset). Top 10 terms in the k = 8 semantic features obtained for an NMF experiment (ordered by decreasing importance). Labels show topical interpretations provided by experts (including more concrete topics in parentheses).
| F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 |
| replic | repair | glucos | actin | spindl | mitochondri | transport | translat |
| pcna | telomer | fatti | swi | cyclin | preprotein | vesicl | mrna |
| dna | dsb | heat | nucleosom | kinetochor | mitochondria | vacuolar | trna |
| ner | recombin | stress | snf | hsp90 | inner | vacuol | alpha |
| damag | mismatch | endoplasm | histon | chaperon | transloc | membran | gcn4 |
| checkpoint | dna | reticulum | chromatin | scf | outer | nitrogen | beta |
| rfc | rad52 | proteasom | elong | anaphas | membran | secretori | gtp |
| pol | excis | phosphatas | mate | mitosi | matrix | autophagi | phosphoryl |
| polymeras | rad51 | atpas | silenc | centromer | oxid | cytoplasm | exchang |
| rad6 | endonucleas | sphingolipid | polar | mitot | translocas | sort | kinas |
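A ranked term list like the columns above can be read off a semantic feature as in the following minimal sketch. It assumes each feature is a vector of non-negative term weights aligned with a vocabulary of stemmed terms. The helper `top_terms` and the toy weights are hypothetical and are not taken from the paper.

```python
def top_terms(weights, vocabulary, n=10):
    """Return the n terms with the largest weights, in decreasing order."""
    ranked = sorted(zip(vocabulary, weights), key=lambda tw: tw[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Toy vocabulary and weights, invented for illustration only.
vocabulary = ["replic", "pcna", "dna", "actin", "vesicl"]
feature = [0.9, 0.7, 0.6, 0.05, 0.0]

print(top_terms(feature, vocabulary, n=3))  # ['replic', 'pcna', 'dna']
```

Applying the same function to each of the k features would give one column per feature.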
Figure 2. Example semantic profiles. Genes are represented as semantic profiles (linear combinations of semantic features). Profiles of some of the genes in the SGD8 dataset are shown, using semantic features (F1 to F8) in Table 2.
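The idea of a profile as a linear combination of features can be illustrated with a small sketch. It assumes each semantic feature is a vector of term weights and each gene profile holds one non-negative coefficient per feature. The names `reconstruct` and `dominant_feature` and all numbers are illustrative assumptions, not the authors' code or data.

```python
def reconstruct(profile, features):
    """Approximate a gene's term vector as sum_j profile[j] * features[j]."""
    n_terms = len(features[0])
    return [sum(c * f[t] for c, f in zip(profile, features)) for t in range(n_terms)]


def dominant_feature(profile):
    """Index of the feature with the largest coefficient in the profile."""
    return max(range(len(profile)), key=lambda j: profile[j])


# Two toy features over three terms, and one toy gene profile (assumed values).
features = [
    [1.0, 0.5, 0.0],  # feature 0
    [0.0, 0.2, 1.0],  # feature 1
]
profile = [2.0, 0.5]

approx = reconstruct(profile, features)
print(approx)
print(dominant_feature(profile))  # 0
```

Here the gene is mostly described by feature 0, which is how a profile plot can be read: the tallest bar marks the topic that dominates the literature linked to that gene.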
Figure 3. SGD8 dataset gene clustering. Two-way hierarchical clustering of both gene-documents and semantic features of the SGD8 set allows determination of gene clusters and corresponding significant factors.
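Two-way clustering means grouping the rows (genes) and the columns (semantic features) of the same profile matrix. The sketch below uses average linkage with Euclidean distance, which is an assumption: this excerpt does not state the linkage or distance used. The function names and the toy matrix are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(rows, n_clusters):
    """Agglomerative clustering; returns clusters as lists of row indices."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(euclidean(rows[i], rows[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def two_way(matrix, n_row_clusters, n_col_clusters):
    """Cluster the rows and, independently, the columns of the same matrix."""
    columns = [list(col) for col in zip(*matrix)]
    return (average_linkage(matrix, n_row_clusters),
            average_linkage(columns, n_col_clusters))


# Toy gene-by-feature profile matrix (4 genes, 3 features), invented values.
genes = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
]
row_clusters, col_clusters = two_way(genes, 2, 2)
print(row_clusters, col_clusters)
```

Reading the two groupings together shows which features are characteristic of which gene cluster.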
SGD8 gene clusters. Clusters obtained from the SGD8 dataset.
| 'DNA metabolism' (103 genes) | The rest of the genes in the cluster (5) are also related to DNA repair and replication processes taking into account their functional annotations in SGD. | |
| 'lipid metabolism' (32 genes) | Genes annotated with other Slim categories (6) also have functional annotations in SGD that indicate their involvement in lipid metabolism. | |
| 'response to stress' (23 genes) | Among genes with other Slim categories there are genes involved in the ubiquitin-dependent protein catabolism ( | |
| 'transport' (37 genes) | Most genes in the cluster (40) are annotated with membrane related localizations in GO cell component category: 'plasma membrane' (35 genes), 'periplasmic space' (4 genes) and 'membrane fraction' (1 gene). | |
| 'transport' (20 genes) | 13 genes correspond to hydrogen-transporting V-type ATPases (namely | |
| 'transport' (42 genes) | Non-transport genes are related to vacuole organization and inheritance ( | |
| 'transport' (24 genes) | Contains genes located in mitochondria. Transport genes: members of the mitochondrial protein translocase family ( | |
| 'DNA metabolism' (28 genes) | All genes contain chromatin related GO annotations in SGD. Contains 5 genes with other Slim categories related to chromatin. | |
| 'cell cycle' (33 genes) | Also contains 'transport' genes and two 'signal transduction' Slim genes. Transport genes are: | |
| 'protein biosynthesis' (40 genes) | Other genes in the cluster include translation elongation and translation initiation factors as well as those involved in mRNA processing like mRNA catabolism, mRNA-nucleus export or the RNA polymerase II transcription machinery (e.g. regulators like | |
| 'cell wall organization and biogenesis' (25 genes) | Among them a significant number is related to cell shape and structure (cell wall and cytoskeleton), as well as events and processes related to morphological changes in the cellular envelope (cell budding, sporulation, conjugation with cellular fusion, endocytosis). |
Semantic features (Reelin dataset clusters). Top 10 terms of semantic features representing the four clusters obtained for the Reelin dataset. An average semantic feature has been calculated from the characteristic features in each cluster obtained by two-way hierarchical clustering.
| p53 | notch | app | tgf-beta |
| egfr | sonic | abeta | reelin |
| c-myc | notch1 | amyloid | tau |
| breast | presenilin | gamma-secretas | fyn |
| tumor | tgf-beta | alzheim | egfr |
| cancer | limb | presenilin | phosphoryl |
| neu | bud | apo | apo |
| tgf-beta | ventral | beta-amyloid | src |
| p21 | mesenchym | amyloid-beta | neuron |
| vegf | patch | plaqu | apolipoprotein |
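The averaging step described in the caption can be sketched as follows, assuming the characteristic features of a cluster are vectors of term weights over a shared vocabulary. The function names, vocabulary, and weights are hypothetical and only illustrate the computation.

```python
def average_feature(features):
    """Element-wise mean of several semantic features (lists of term weights)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def cluster_top_terms(features, vocabulary, n=10):
    """Top-n terms of the average feature, ordered by decreasing weight."""
    avg = average_feature(features)
    ranked = sorted(zip(vocabulary, avg), key=lambda tw: tw[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Two toy features assumed to be characteristic of the same cluster.
vocabulary = ["p53", "tumor", "notch", "amyloid"]
cluster_features = [
    [0.8, 0.5, 0.1, 0.0],
    [0.6, 0.5, 0.0, 0.1],
]

print(cluster_top_terms(cluster_features, vocabulary, n=2))  # ['p53', 'tumor']
```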
Figure 4. MAPK signaling pathway mapping. Subset of genes in cluster K (colored in pink) mapped onto the MAPK signaling pathway diagram for S. cerevisiae (04010sce pathway), as provided in the KEGG PATHWAY database [37].
Semantic features (SGD8 dataset clusters). Top 10 terms of semantic features representing the eleven clusters obtained for the SGD8 dataset. An average semantic feature has been calculated from the characteristic features in each cluster obtained by two-way hierarchical clustering.
| repair, dna, replic, telomere, checkpoint, pcna, damag, dsb, recombin, mismatch | |
| sterol, fatti, lipid, ergosterol, synthetas, synthas, biosynthesi, actin, heat, sphingolipid | |
| mitochondri, hsp70, ubiquitin, dna, oxid, chaperon, shock, rad52, heat, camp | |
| transport, membran, vesicle, golgi, outer, receptor, copii, export, snare, vacuolar | |
| vacuolar, v-atpas, membran, vacuol, vesicl, transport, golgi, glucose, transloc, cytosol | |
| transport, uptake, vesicle, vacuolar, vacuole, membran, permeas, nitrogen, ubiquitin, glucos | |
| mitochondria, mitochondri, mitochondrion, inner, ino1, matur, translat, transles, membrane-associ, membran | |
| nucleosom, histon, swi, snf, chromatin, remodel, arrai, transcript, silenc, acetyl | |
| spindl, cyclin, kinetochor, checkpoint, anaphase, mitosi, mitot, sister, chromosom, replic | |
| translat, mrna, ribosom, rna, poli, swi, snf, elong, transcript, atpas | |
| actin, kinas, wall, phosphoryl, pheromone, mate, phosphates, cytoskeleton, glucose, polar |
Figure 5. Reelin dataset gene clustering. A) Two-way hierarchical clustering of the Reelin set corresponding to 10 NMF factorizations with k = 7. Four-cluster selection. B) Detailed view, where the semantic feature common to Notch signaling genes and Alzheimer cluster is highlighted.
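One plausible reading of "10 NMF factorizations with k = 7" is that NMF is run several times from different random starts and the resulting features are pooled before clustering. The sketch below illustrates that reading with standard multiplicative updates for the Euclidean loss. The NMF variant, the matrix orientation (genes as rows, terms as columns), and every name and number here are assumptions, not details confirmed by this excerpt.

```python
import random


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def nmf(v, k, n_iter=100, seed=0, eps=1e-9):
    """Multiplicative updates for V ~ W H with non-negative W (n x k), H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def pooled_features(v, k, n_runs):
    """Collect the k features (rows of H) from n_runs random restarts."""
    features = []
    for run in range(n_runs):
        _, h = nmf(v, k, seed=run)
        features.extend(h)
    return features


# Toy gene-by-term matrix (invented); the caption's setting would be k=7, n_runs=10.
v = [
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 4.0, 2.0, 0.0],
    [0.0, 1.0, 3.0, 3.0, 0.0],
]
pooled = pooled_features(v, k=2, n_runs=3)
print(len(pooled))  # 6 features: 3 runs x 2 features each
```

With the caption's parameters, the pooled set would contain 70 features, which could then be clustered to find features that recur across runs.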