Zhiguo Yu, Thang Nguyen, Ferdinand Dhombres, Todd Johnson, Olivier Bodenreider.
Abstract
Extracting and understanding information, themes and relationships from large collections of documents is an important task for biomedical researchers. Latent Dirichlet Allocation (LDA) is an unsupervised topic modeling technique based on the bag-of-words assumption that has been applied extensively to unveil hidden thematic information within large sets of documents. In this paper, we added MeSH descriptors to the bag-of-words representation to generate 'hybrid topics', which are mixed vectors of words and descriptors. We evaluated this approach on the quality and interpretability of topics in both a general corpus and a specialized corpus. Our results demonstrated that the coherence of 'hybrid topics' is higher than that of regular bag-of-words topics in the specialized corpus. We also found that the proportion of topics that are not associated with MeSH descriptors is higher in the specialized corpus than in the general corpus.
Keywords: Data Interpretation; Medical Subject Headings; Models; Statistical; Statistical Data
Year: 2017; PMID: 29295179; PMCID: PMC5875427
Source DB: PubMed; Journal: Stud Health Technol Inform; ISSN: 0926-9630
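The abstract describes adding MeSH descriptors to each document's bag of words before topic modeling. A minimal sketch of that document representation is given here, assuming descriptors are kept apart from ordinary words by a `*` prefix (the convention used in the topic table). The function name `hybrid_bag`, the lowercasing step, and the toy document are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter


def hybrid_bag(words, mesh_descriptors, marker="*"):
    """Count ordinary words together with marked MeSH descriptors.

    The marker keeps a descriptor such as "Brain" distinct from the
    plain word "brain" (assumed convention, not taken from the paper).
    """
    tokens = [w.lower() for w in words]
    tokens += [marker + d.lower() for d in mesh_descriptors]
    return Counter(tokens)


# Hypothetical toy document: body words plus the MeSH descriptors indexed for it.
doc_words = ["brain", "cortex", "activity", "cortex"]
doc_mesh = ["Brain"]
bag = hybrid_bag(doc_words, doc_mesh)
```

The resulting counts can be passed to any LDA implementation that accepts term-count vectors; topics learned from such input may then mix words and descriptors.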
Generated ‘bag-of-MeSH&word’ topics (* marks MeSH descriptors)
| Topic 1 | Topic 2 | Topic 3 |
|---|---|---|
| model | *brain | motor |
| predict | cortex | visual |
| value | region | *movement |
| prediction | functional | *face |
| analysis | cortical | right |
| predictive | activity | response |
| regression | neural | *hand |
| datum | network | processing |
| estimate | change | object |
| predictor | area | stimuli |
| 1-0 mapping | 1-1 mapping | 1-many mapping |
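The last row of the table appears to label each topic by how many MeSH descriptors occur among its top terms. Under that reading, the labels can be reproduced with a short sketch; the function name `mapping_type` is hypothetical, and the term lists are copied from the table.

```python
def mapping_type(top_terms, marker="*"):
    """Label a topic by the number of marked MeSH descriptors in its top terms."""
    n = sum(1 for term in top_terms if term.startswith(marker))
    if n == 0:
        return "1-0 mapping"
    if n == 1:
        return "1-1 mapping"
    return "1-many mapping"


# Top terms copied from the table above.
topics = {
    "Topic 1": ["model", "predict", "value", "prediction", "analysis",
                "predictive", "regression", "datum", "estimate", "predictor"],
    "Topic 2": ["*brain", "cortex", "region", "functional", "cortical",
                "activity", "neural", "network", "change", "area"],
    "Topic 3": ["motor", "visual", "*movement", "*face", "right",
                "response", "*hand", "processing", "object", "stimuli"],
}
labels = {name: mapping_type(terms) for name, terms in topics.items()}
```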
Figure 1. Comparison of mean TC-W2V topic coherence scores for different numbers of topics k, generated from the general corpus
Figure 2. Comparison of mean TC-W2V topic coherence scores for different numbers of topics k, generated from the specialized corpus
Figure 3. Plot of mean TC-W2V topic coherence scores for different numbers of topics k, generated from the general corpus
Figure 4. Plot of mean TC-W2V topic coherence scores for different numbers of topics k, generated from the specialized corpus
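The figures report mean TC-W2V coherence per model. This record does not define the measure; as it is commonly described, a topic's TC-W2V score is the mean pairwise cosine similarity between the word-embedding vectors of its top terms, and a model's score is the mean over its topics. A minimal sketch under that assumption follows, with hypothetical two-dimensional toy vectors standing in for real word2vec embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def tc_w2v(top_terms, vectors):
    """Mean pairwise cosine similarity over a topic's top terms (needs >= 2 terms)."""
    pairs = list(combinations(top_terms, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def mean_coherence(topics, vectors):
    """Mean TC-W2V score over all topics of one model."""
    return sum(tc_w2v(terms, vectors) for terms in topics) / len(topics)


# Hypothetical toy embeddings; real use would load trained word vectors.
toy_vectors = {"brain": [1.0, 0.0], "cortex": [1.0, 0.0], "model": [0.0, 1.0]}
score = mean_coherence([["brain", "cortex"], ["brain", "model"]], toy_vectors)
```

Computing this score for models trained with different k and picking the k with the highest mean is one way to arrive at an optimal number of topics, as the figures suggest.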
Number of topics with 0, 1, or n MeSH descriptors (MD; n > 1)
| Data set | Optimal number of topics (k) | Topics with 0 MD | Topics with 1 MD | Topics with n MD |
|---|---|---|---|---|
| General corpus | 200 | 6 (3%) | 16 (8%) | 178 (89%) |
| Specialized corpus | 22 | 9 (41%) | 6 (27%) | 7 (32%) |
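The percentages in the table follow directly from the counts and the number of topics in each model. A short check is sketched below; the helper name `shares` is illustrative, and the numbers are copied from the table.

```python
def shares(total, parts):
    """Return each part as a whole-number percentage of the total."""
    if sum(parts) != total:
        raise ValueError("counts do not add up to the number of topics")
    return [round(100 * p / total) for p in parts]


# (number of topics, [topics with 0, 1, n MeSH descriptors]) from the table.
general = shares(200, [6, 16, 178])
specialized = shares(22, [9, 6, 7])
```

With these inputs, the share of topics without any MeSH descriptor is 3% for the general corpus and 41% for the specialized corpus, which matches the contrast stated in the abstract.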