Yan Yan, Xu-Cheng Yin, Sujian Li, Mingyuan Yang, Hong-Wei Hao.
Abstract
High-level abstractions such as semantic representations are vital for document classification and retrieval. However, how to learn a semantic representation of a document remains an open question in information retrieval and natural language processing. In this paper, we propose a new Hybrid Deep Belief Network (HDBN) that uses a Deep Boltzmann Machine (DBM) on the lower layers together with a Deep Belief Network (DBN) on the upper layers. The advantage of the DBM is that it uses undirected connections when training the weight parameters, which allows the states of the nodes on each layer to be sampled more successfully; it is also an effective way to remove noise from different types of document representation. The DBN enhances the extraction of abstract features from the document in depth, so that the model learns a sufficient semantic representation. We also explore different input strategies for semantic distributed representation. Experimental results show that our model performs better when it takes word embeddings rather than single words as input.
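As a rough illustration of the building block that both DBNs and DBMs share, the sketch below trains a small stack of Bernoulli restricted Boltzmann machines with one-step contrastive divergence (CD-1) and greedy layer-wise stacking. It is not the authors' HDBN: it omits the joint undirected training that distinguishes a DBM, and the function names (`cd1_step`, `train_stack`), layer sizes, epoch count, and toy binary data are assumptions made for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr, rng):
    """One CD-1 update for a Bernoulli RBM on a batch v0 of shape (n, n_vis)."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    n = v0.shape[0]
    # Approximate gradient: data statistics minus reconstruction statistics.
    W = W + lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis = b_vis + lr * (v0 - p_v1).mean(axis=0)
    b_hid = b_hid + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

def train_stack(data, layer_sizes, lr=0.01, epochs=5, seed=0):
    """Greedy layer-wise pretraining: each RBM is trained on the
    hidden activations of the layer below it."""
    rng = np.random.default_rng(seed)
    layers, x = [], data
    for n_hid in layer_sizes:
        n_vis = x.shape[1]
        W = 0.01 * rng.standard_normal((n_vis, n_hid))
        b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
        for _ in range(epochs):
            W, b_vis, b_hid = cd1_step(x, W, b_vis, b_hid, lr, rng)
        layers.append((W, b_vis, b_hid))
        x = sigmoid(x @ W + b_hid)  # input for the next layer
    return layers, x

# Toy binary "documents" (hypothetical data): 8 documents, 20 features.
toy_rng = np.random.default_rng(0)
docs = (toy_rng.random((8, 20)) < 0.3).astype(float)
layers, codes = train_stack(docs, layer_sizes=[16, 4])
```

Here `codes` holds one low-dimensional vector per document, which plays the role of the "output units" whose number is varied in the result tables; the learning rate default of 0.01 simply mirrors one of the values listed there.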
Year: 2015 PMID: 25878657 PMCID: PMC4386712 DOI: 10.1155/2015/650527
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Hidden layer nodes construction of DBN and DBM.
Figure 2. Hybrid Deep Belief Network.
Figure 3. The 50-dimensional word embedding.
Document classification accuracy and document retrieval precision.
| Dataset | Model | Learning rate (L_r) | Classification (50 units) | Classification (100 units) | Classification (128 units) | Classification (512 units) | Classification (1000 units) | Retrieval (50 units) | Retrieval (100 units) | Retrieval (128 units) | Retrieval (512 units) | Retrieval (1000 units) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 Newsgroups | RSM | 0.01 | 63.11 | 66.82 | 67.67 | 66.52 | 69.31 | 62.82 | 67.80 | 68.64 | 65.10 | 67.48 |
| 20 Newsgroups | RSM | 0.001 | 60.94 | 62.91 | 71.55 | 73.68 | 73.05 | 56.10 | 58.32 | 69.17 | 69.65 | 69.34 |
| 20 Newsgroups | RSM | 0.0001 | 36.59 | 62.75 | 63.33 | 70.40 | 71.38 | 28.46 | 58.56 | 59.39 | 66.37 | 65.14 |
| 20 Newsgroups | DBN | 0.01 | 72.77 | 72.35 | 72.49 | 72.56 | 72.39 | 67.52 | 67.78 | 68.90 | 69.27 | 69.81 |
| 20 Newsgroups | DocNADE | 0.01 | 73.83 | 73.40 | 73.54 | 74.75 | 74.11 | 68.50 | 68.78 | 70.17 | 70.66 | 70.82 |
| 20 Newsgroups | HDBN | 0.01 | | | | | | | | | | |
| BBC data | RSM | 0.01 | 94.04 | 94.65 | 95.41 | 95.11 | 95.87 | 93.44 | 93.29 | 92.98 | 93.90 | 94.81 |
| BBC data | RSM | 0.001 | 95.11 | 96.48 | 96.64 | 94.65 | 96.64 | 94.66 | 95.34 | 96.18 | 95.79 | 96.03 |
| BBC data | RSM | 0.0001 | 60.40 | 94.65 | 94.80 | 96.79 | 94.95 | 70.88 | 91.76 | 94.05 | 95.88 | 95.42 |
| BBC data | DBN | 0.01 | 95.31 | 94.81 | 96.03 | 95.06 | 95.05 | 93.18 | 94.10 | 94.30 | 94.13 | 94.69 |
| BBC data | DocNADE | 0.01 | 95.60 | 96.77 | 96.93 | 96.89 | 96.93 | 94.94 | 95.44 | 96.28 | 95.98 | 96.13 |
| BBC data | HDBN | 0.01 | | | | | | | | | | |
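The record does not say how document retrieval precision is computed. One common definition, shown here purely as an assumed sketch, treats each document as a query, ranks all other documents by cosine similarity of their learned codes, and counts the fraction of the top `k` results that share the query's label. The function name `retrieval_precision`, the cutoff `k`, and the toy vectors are illustrative choices, not taken from the paper.

```python
import numpy as np

def retrieval_precision(codes, labels, k):
    """Mean precision at k, using every document as a query against the rest."""
    codes = np.asarray(codes, dtype=float)
    labels = np.asarray(labels)
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(codes, axis=1, keepdims=True)
    unit = codes / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)          # a query never retrieves itself
    top = np.argsort(-sim, axis=1)[:, :k]   # indices of the k most similar documents
    hits = labels[top] == labels[:, None]   # True where the retrieved label matches
    return hits.mean()

# Toy example (hypothetical data): two well-separated classes in 2-D.
toy_codes = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_labels = np.array([0, 0, 1, 1])
p_at_1 = retrieval_precision(toy_codes, toy_labels, k=1)
p_at_3 = retrieval_precision(toy_codes, toy_labels, k=3)
```

With `k=1` every query retrieves its same-class neighbour first, while with `k=3` each query retrieves all three other documents, only one of which shares its label, so precision drops as the cutoff grows.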
Document classification accuracy and document retrieval precision of six categories on 20 Newsgroups.
| Dataset | Model | Learning rate (L_r) | Classification (50 units) | Classification (100 units) | Classification (128 units) | Classification (512 units) | Classification (1000 units) | Retrieval (50 units) | Retrieval (100 units) | Retrieval (128 units) | Retrieval (512 units) | Retrieval (1000 units) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 Newsgroups | RSM | 0.01 | 76.13 | 79.60 | 80.63 | 80.24 | 80.61 | 75.40 | 79.83 | 80.24 | 78.64 | 79.66 |
| 20 Newsgroups | RSM | 0.001 | 74.11 | 76.20 | 81.88 | 81.03 | 81.34 | 68.23 | 70.43 | 80.89 | 81.08 | 81.07 |
| 20 Newsgroups | RSM | 0.0001 | 57.88 | 76.89 | 76.92 | 83.34 | 82.93 | 58.66 | 71.98 | 72.64 | 80.45 | 78.93 |
| 20 Newsgroups | DBN | 0.01 | 76.52 | 79.77 | 81.03 | 82.94 | 82.11 | 77.36 | 80.21 | 80.07 | 80.99 | 79.01 |
| 20 Newsgroups | DocNADE | 0.01 | 78.82 | 81.37 | 82.70 | 83.67 | 83.76 | 78.13 | 81.01 | 81.46 | 81.81 | 81.15 |
| 20 Newsgroups | HDBN | 0.01 | | | | | | | | | | |
Document classification accuracy and retrieval precision using keyword word embedding.
| Dataset | Classification (50 units) | Classification (100 units) | Classification (128 units) | Classification (512 units) | Classification (1000 units) | Retrieval (50 units) | Retrieval (100 units) | Retrieval (128 units) | Retrieval (512 units) | Retrieval (1000 units) |
|---|---|---|---|---|---|---|---|---|---|---|
| 20 Newsgroups | 81.57 | 81.68 | 82.35 | 82.09 | 82.29 | 81.91 | 76.89 | 78.99 | 78.50 | 78.91 |
| BBC data | 98.41 | 98.84 | 99.35 | 98.82 | 97.76 | 97.26 | 96.97 | 98.05 | 97.52 | 98.58 |
Figure 4. 20 Newsgroups and BBC documents representation visualization.