Minsuk Lee, Weiqing Wang, Hong Yu.
Abstract
BACKGROUND: Topic detection is the task of automatically identifying topics (e.g., "biochemistry" and "protein structure") in scientific articles based on their information content. Topic detection will benefit many other natural language processing tasks, including information retrieval, text summarization, and question answering, and is a necessary step toward building an information system that gives biologists an efficient way to seek information in an ocean of literature.
Year: 2006 PMID: 16539745 PMCID: PMC1472693 DOI: 10.1186/1471-2105-7-140
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The number of OMIM entries as a function of the number of topics.
Accuracy (Acc., in percent) when applying naïve Bayes (NB) to detect topics in OMIM with different learning features.
| NB | S | S+M | S+M+T | S+M+T+A | M | T+A | GM | GM+M |
| Acc. | 54.4 | 62.2 | 63.3 | 66.4 | 62.6 | 65.9 | 47.4 | 62.0 |
S = semantic types
M = MeSH terms
T = title
A = abstract
GM = general MeSH terms
Figure 2. Topic clustering in OMIM. The cost of detection of single-pass (A) and group-wise-average (B) clustering with different features: semantic types only (S); combined semantic types and MeSH terms (S+M); semantic types with MeSH terms and title (S+M+T); all four features (S+M+T+A); MeSH terms alone (M); and combined title and abstract (T+A). (C) Comparison of the cost of detection between single-pass and group-wise-average clustering with MeSH terms alone as features and similarity threshold τ = 0.5.
Figure 3. Topic clustering as a function of the number of topics (similarity threshold τ = 0.5).
Figure 4. Topic clustering of articles cited in biological review articles. (A) The cost of detection for Group C with different features: semantic types only (S); combined semantic types and MeSH terms (S+M); semantic types with MeSH terms and title (S+M+T); all four features (S+M+T+A); MeSH terms alone (M); and combined title and abstract (T+A). (B) The cost of detection for three groups (A, B and C) with semantic types as the feature.
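The figure captions above refer to single-pass clustering with a similarity threshold τ. As a rough illustration only, the following is a minimal sketch of a generic single-pass procedure: each document joins the most similar existing cluster if the similarity reaches τ, and otherwise starts a new cluster. The use of cosine similarity, the summed-count centroids, and the names `cosine` and `single_pass` are assumptions made for this sketch; the record does not state how the authors computed similarity or represented clusters.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(docs, tau=0.5):
    """Cluster documents in one pass over the input.

    docs: list of feature lists (e.g. MeSH terms per article; hypothetical input).
    tau:  similarity threshold; 0.5 is the value quoted in the captions.
    Returns one cluster label per document.
    """
    centroids = []  # summed feature counts per cluster (assumed representation)
    labels = []
    for features in docs:
        vec = Counter(features)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= tau:
            centroids[best].update(vec)  # fold the document into the cluster
            labels.append(best)
        else:
            centroids.append(Counter(vec))  # start a new cluster
            labels.append(len(centroids) - 1)
    return labels


# Toy input with made-up feature lists, not data from the study.
toy_docs = [
    ["Neoplasms", "Mutation"],
    ["Neoplasms", "Mutation", "Humans"],
    ["Chromosome Mapping"],
]
print(single_pass(toy_docs, tau=0.5))  # [0, 0, 1]
```

Because cosine similarity ignores vector length, keeping summed counts as the centroid is equivalent to keeping their mean. Raising τ makes the procedure split documents into more, smaller clusters.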
OMIM topics and the number of documents that have been assigned to each topic.
| Topics | Total Number | No_dup* | Topics | Total Number | No_dup* |
| CLONING | 7719 | 2397 | ANIMAL MODEL | 879 | 261 |
| MAPPING | 7487 | 1760 | CYTOGENETICS | 770 | 222 |
| MOLECULAR GENETICS | 7139 | 2347 | OTHER FEATURES | 349 | 90 |
| CLINICAL FEATURES | 6917 | 3044 | HETEROGENEITY | 298 | 62 |
| GENE FUNCTION | 6469 | 2960 | HISTORY | 131 | 42 |
| GENE STRUCTURE | 2444 | 165 | EVOLUTION | 126 | 16 |
| INHERITANCE | 1513 | 385 | ALLELIC VARIANTS | 122 | 80 |
| DIAGNOSIS | 1300 | 193 | NOMENCLATURE | 99 | 9 |
| POPULATION GENETICS | 1163 | 213 | GENOTYPE | 91 | 12 |
| PATHOGENESIS | 1062 | 216 | GENE FAMILY | 83 | 25 |
| PHENOTYPE | 1034 | 270 | GENE THERAPY | 51 | 21 |
| BIOCHEMICAL FEATURES | 993 | 314 | GENETIC VARIABILITY | 47 | 17 |
| CLINICAL MANAGEMENT | 974 | 284 | | | |
* The number of references that have been assigned to only one topic, not others, within an OMIM entry.
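The No_dup definition in the footnote can be made concrete with a small sketch. It assumes, hypothetically, that each OMIM entry is available as a mapping from topic name to the list of reference identifiers cited under that topic; the function name `count_single_topic_refs` and the toy identifiers are illustrative and not taken from the source. It also assumes that a reference cited several times under the same topic is counted once.

```python
from collections import Counter


def count_single_topic_refs(entries):
    """Return (total, no_dup) reference counts per topic.

    entries: list of dicts, one per entry, mapping topic -> list of reference ids
             (an assumed input layout, not the OMIM file format).
    total[topic]:  references assigned to the topic, summed over entries.
    no_dup[topic]: references assigned to that topic and to no other topic
                   within the same entry.
    """
    total = Counter()
    no_dup = Counter()
    for topic_refs in entries:
        # Number of distinct topics that cite each reference within this entry.
        topics_per_ref = Counter(
            ref for refs in topic_refs.values() for ref in set(refs)
        )
        for topic, refs in topic_refs.items():
            unique = set(refs)
            total[topic] += len(unique)
            no_dup[topic] += sum(1 for ref in unique if topics_per_ref[ref] == 1)
    return total, no_dup


# Toy entries with made-up reference ids.
toy_entries = [
    {"CLONING": [1, 2], "MAPPING": [2, 3]},
    {"CLONING": [4]},
]
total, no_dup = count_single_topic_refs(toy_entries)
print(total["CLONING"], no_dup["CLONING"])  # 3 2
print(total["MAPPING"], no_dup["MAPPING"])  # 2 1
```

In the toy data, reference 2 appears under both CLONING and MAPPING in the first entry, so it contributes to both totals but to neither No_dup count.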