Bin Zheng, David C McLean, Xinghua Lu.
Abstract
BACKGROUND: Biomedical literature, e.g., MEDLINE, contains a wealth of knowledge regarding the functions of proteins. Major recurring biological concepts within such text corpora represent the domains of this body of knowledge. The goal of this research is to identify the major biological topics/concepts from a corpus of protein-related MEDLINE titles and abstracts by applying a probabilistic topic model.
Year: 2006 PMID: 16466569 PMCID: PMC1420333 DOI: 10.1186/1471-2105-7-58
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Representing concepts with word distributions. Two hypothetical topics are depicted. The bar lengths indicate each topic's word usage preference, expressed as probabilities.
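As a minimal sketch of the idea in Figure 1, a topic can be held as a mapping from words to probabilities, and its most characteristic words are then the ones with the highest probability. The two topics, their words, and all probabilities below are invented for illustration; they are not values from the paper, and `top_words` is a hypothetical helper.

```python
# Two hypothetical topics: each maps a (stemmed) word to its probability
# under that topic. All numbers are made up for illustration.
topic_a = {"kinas": 0.40, "phosphoryl": 0.30, "serin": 0.20, "receptor": 0.10}
topic_b = {"receptor": 0.50, "ligand": 0.30, "agonist": 0.15, "kinas": 0.05}


def top_words(topic, n=3):
    """Return the n highest-probability words of a topic."""
    ranked = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


for name, topic in (("A", topic_a), ("B", topic_b)):
    # A word distribution must sum to 1 (up to floating-point error).
    assert abs(sum(topic.values()) - 1.0) < 1e-9
    print(name, top_words(topic))
```

The same word (here `kinas` or `receptor`) may appear in both topics with different probabilities, which is what lets a topic model separate different usages of one word.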
Figure 2. Bayesian model selection. The means of the approximated evidence for different models are plotted; standard error bars are within the symbols.
The ten most common topics from a trained LDA model
| Topic | Topic words |
| 51 | receptor coupl ligand agonist subtype pharmacolog antagonist orphan adrenerg desensit |
| 156 | kinas phosphoryl serin threonin pkc autophosphoryl casein akt catalyt ste20 |
| 136 | cerevisia saccharomyc strain yeast plasmid multicopi lacz floccul auxotroph gal1 |
| 67 | famili member belong multigen subfamily mrg dalton cabp28k heterogen transmembran |
| 154 | patient syndrom diseas disord autosom inherit recess ref caus clinic |
| 124 | cdna librari clone probe screen isol lambda obtain oligonucleotid gtl1 |
| 37 | neuron axon migrat motor glial spinal cord neurit dendrite outgrowth |
| 229 | mutant defect doubl phenotyp fail rescu restor impair pleiotrop unable |
| 112 | exon intron genom kb flank region span upstream bp start |
| 172 | nuclear nucleu export cytoplasm nuclei pore ran hnrnp envelop import |
Figure 3. Semantic analysis of a MEDLINE abstract (PMID 9989411). The topic associated with each word was inferred by the LDA model and is shown as a superscript number next to the word. Words from topics #73 and #147 are highlighted in blue and red, respectively.
Figure 4. Determining the biological relevance of the topics. Histogram of human-assigned biological relevance scores. A score of 0 indicates no biological relevance, while scores of 1 through 5 indicate increasing biological relevance and coherence. Relationship between the human-assigned biological relevance score and the topic-GO mutual information (MI).
Examples of topic-GO associations
| Topic | GO ID | MI | GO category | GO term | Topic words |
| 278 | GO:0005730 | 0.001439 | Component | nucleolus | ribosom rrna pre deplet process small nucleolar biogenesi accumul nucleolu |
| 267 | GO:0005681 | 0.001193 | Component | spliceosome complex | splice altern pre snrnp mrna spliceosom u2 step sap snrna |
| 105 | GO:0005816 | 0.00119 | Component | spindle pole body | microtubul spindl mitot tubulin kinetochor mitosi centrosom pole centromer bodi |
| 236 | GO:0006935 | 0.00186 | Process | chemotaxis | lymphocyt macrophag chemokin monocyt neutrophil inflammatori leukocyt peripher mcp cd8 |
| 156 | GO:0006468 | 0.001514 | Process | protein amino acid phosphorylation | kinas phosphoryl serin threonin pkc autophosphoryl casein akt catalyt ste20 |
| 267 | GO:0000398 | 0.001404 | Process | nuclear mRNA splicing | splice altern pre snrnp mrna spliceosom u2 step sap snrna |
| 156 | GO:0004674 | 0.001148 | Function | protein serine/threonine kinase activity | kinas phosphoryl serin threonin pkc autophosphoryl casein akt catalyt ste20 |
| 267 | GO:0008248 | 0.001463 | Function | pre-mRNA splicing factor activity | splice altern pre snrnp mrna spliceosom u2 step sap snrna |
| 236 | GO:0008009 | 0.001093 | Function | chemokine activity | lymphocyt macrophag chemokin monocyt neutrophil inflammatori leukocyt peripher mcp cd8 |
| 224 | GO:0015671 | 5.05E-06 | Process | oxygen transport | ha uniqu characterist featur extens character typic possess unusu exhibit |
| 227 | GO:0015213 | 5.00E-06 | Function | uridine transporter activity | function defin unknown perform wide thei tissu repres consist creat |
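The record does not say how the topic-GO association score was estimated. As a hedged sketch only, the following computes the mutual information between two binary indicators from a 2x2 count table, for example "the topic is used in a protein's abstracts" and "the protein is annotated with the GO term". The function name, the choice of binary indicators, and the counts in the usage lines are assumptions made for illustration, not the authors' procedure.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in nats) of two binary variables X and Y.

    n11: X and Y both present, n10: only X, n01: only Y, n00: neither.
    """
    total = n11 + n10 + n01 + n00
    # Each cell: (joint count, marginal count of X value, marginal count of Y value).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, x_marginal, y_marginal in cells:
        if joint > 0:  # the term 0 * log(0) is taken to be 0
            mi += (joint / total) * math.log(joint * total / (x_marginal * y_marginal))
    return mi


# Hypothetical counts: the first table shows an association, the second none.
associated = mutual_information(40, 10, 10, 940)
independent = mutual_information(25, 25, 25, 25)
print(associated > independent)
```

Under this definition an uninformative topic paired with an unrelated GO term scores close to zero, which matches the pattern of the last two rows of the table, where generic topic words are paired with very small scores.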
Figure 5. A directed acyclic graphical representation of the LDA model in plate notation.
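The plate diagram corresponds to a generative story for each document. The sketch below follows the standard LDA generative process with the topics held fixed: draw a per-document topic mixture from a Dirichlet prior, then for each word position draw a topic and then a word from that topic. The toy topics, the hyperparameter values, and the function names are assumptions for illustration; the paper's own model settings are not given in this record.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document under LDA with fixed topics.

    topics: list of word distributions (dict word -> probability), one per topic.
    alpha:  Dirichlet hyperparameters, one per topic.
    Returns the topic mixture and a list of (word, topic index) pairs.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    doc = []
    for _ in range(length):
        # Choose the topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose the word according to that topic's word distribution.
        words = list(topics[z])
        w = rng.choices(words, weights=[topics[z][x] for x in words])[0]
        doc.append((w, z))
    return theta, doc


# Toy topics with invented probabilities.
toy_topics = [
    {"kinas": 0.6, "phosphoryl": 0.4},
    {"receptor": 0.7, "ligand": 0.3},
]
theta, doc = generate_document(toy_topics, [0.5, 0.5], 20, random.Random(0))
print([round(p, 3) for p in theta])
print(doc[:5])
```

Each generated word carries the index of the topic that produced it, which is the quantity shown as a superscript in the Figure 3 annotation; inference in LDA runs this story in reverse, recovering the topic assignments from the observed words.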