Asmaa M El-Said, Ali I Eldesoky, Hesham A Arafat.
Abstract
The tremendous growth in the number of textual documents has created a daily need for effective ways to explore, analyze, and discover knowledge from them. Conventional text mining and management systems mainly use the presence or absence of keywords to discover and analyze useful information in textual documents. However, simple word counts and frequency distributions of term appearances do not capture the meaning behind the words, which limits the ability to mine the texts. This paper proposes an efficient methodology for constructing a hierarchy/graph-based text organization and representation scheme based on semantic annotation and Q-learning. The methodology relies on semantic notions to represent the text in documents, to infer unknown dependencies and relationships among concepts in a text, to measure the relatedness between text documents, and to apply mining processes using the representation and the relatedness measure. The representation scheme reflects the existing relationships among concepts and facilitates accurate relatedness measurements that result in better mining performance. An extensive experimental evaluation is conducted on real datasets from various domains, indicating the importance of the proposed approach.
Year: 2015 PMID: 25685832 PMCID: PMC4313059 DOI: 10.1155/2015/136172
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: The proposed semantic text organization/representation scheme for information access and knowledge acquisition processes.
Figure 2: A framework for constructing a semantic hierarchy/graph text representation scheme.
Figure 3: The hierarchy-based structure of the textual document.
Algorithm 1: TRN algorithm.
Algorithm 2: C-SynD algorithm.
Algorithm 3: SALA_links algorithm.
Algorithm 4: IOAC-QL algorithm.
Figure 4: The IOAC-QL algorithm implementation.
Action values table with the SALA decision.
| State | FVO1 | FVO1 | FVO1 | FVO1 |
|---|---|---|---|---|
| Action | Similarity | Contiguity | Contrast | Causality |
| Action result state | FVO1 FVO7 FVO9 FVO22 FVO75 FVO92 FVO112 FVO125 | FVO103 FVO17 FVO44 | FVO89 | FVO63 FVO249 |
| Total reward | 184 | 74 | 22 | 25 |
| | 94 | 61 | 32.9 | 42.9 |
| Decision | Yes | No | No | No |
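The table does not state the rule that turns rewards into a decision. The sketch below assumes, purely for illustration, that the action with the highest total reward is accepted and all others are rejected; under that assumption the values in the table reproduce the Decision row. The function name `decide` is hypothetical and not taken from the paper.

```python
# Total rewards copied from the action values table for state FVO1.
total_reward = {
    "Similarity": 184,
    "Contiguity": 74,
    "Contrast": 22,
    "Causality": 25,
}


def decide(rewards):
    """Assumed rule: accept only the action with the highest total reward."""
    best = max(rewards, key=rewards.get)
    return {action: ("Yes" if action == best else "No") for action in rewards}


decisions = decide(total_reward)
for action, decision in decisions.items():
    print(f"{action}: {decision}")
```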
Contingency table presents human judgment and system judgment.
| | Human judgment: True | Human judgment: False |
|---|---|---|
| System judgment: True | TP | FP |
| System judgment: False | FN | TN |
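The four cells of the contingency table are enough to derive the usual evaluation scores. The following is a minimal sketch assuming the standard definitions of precision, recall, F-measure, and accuracy; the class name `Contingency` and the example counts are hypothetical and do not come from the paper. The sketch does not guard against empty denominators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    tp: int  # system True, human True
    fp: int  # system True, human False
    fn: int  # system False, human True
    tn: int  # system False, human False

    def precision(self) -> float:
        # Share of system-accepted items that humans also accept.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Share of human-accepted items (TP + FN) that the system finds.
        return self.tp / (self.tp + self.fn)

    def f_measure(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)

    def accuracy(self) -> float:
        # Share of all items on which system and humans agree.
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)


# Hypothetical counts, chosen only to exercise the formulas.
example = Contingency(tp=80, fp=20, fn=10, tn=90)
print(f"precision={example.precision():.3f} recall={example.recall():.3f}")
print(f"f_measure={example.f_measure():.3f} accuracy={example.accuracy():.3f}")
```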
Dataset details for the experimental setup.
| Dataset | Experiment | Dataset name | Description |
|---|---|---|---|
| DS1 | Experiment 1: content-based evaluation | Miller and Charles (MC)¹ | RG consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators. |
| DS2 | Experiment 1: content-based evaluation | Microsoft Research Paraphrase Corpus (MRPC)² | The corpus consists of 5,801 sentence pairs collected from newswire articles, 3,900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4,076 pairs, of which 2,753 are true) and a test subset (1,725 pairs, of which 1,147 are true). |
| DS3 | Experiment 2: coselection-based evaluation (closest-synonym detection) | British National Corpus (BNC)³ | BNC is a carefully selected collection of 4,124 contemporary written and spoken English texts; it is a 100-million-word corpus of samples of written and spoken English, with near-synonym collocations. |
| DS4 | Experiment 2: coselection-based evaluation (closest-synonym detection) | SN (semantic neighbors)⁴ | SN relates 462 target terms (nouns) to 5,910 relatum terms with 14,682 semantic relations (7,341 are meaningful and 7,341 are random). SN contains synonyms from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database. |
| DS5 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | BLESS⁶ | BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8,625 relatum terms with 26,554 semantic relations (14,440 are meaningful (correct) and 12,154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random. |
| DS6 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | TREC⁵ | TREC includes 1,437 sentences annotated with entities and relations, each containing at least one relation. There are three types of entities: person (1,685), location (1,968), and organization (978); in addition, a fourth type, other (705), indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location, work for (394) indicates that a person works for an organization, OrgBased in (451) indicates that an organization is based in a location, live in (521) indicates that a person lives at a location, and kill (268) indicates that a person killed another person. There are 17,007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which therefore significantly outnumbers the other relations. |
| DS7 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | IJCNLP 2011-New York Times (NYT)⁶ | NYT contains 150 business articles from the New York Times. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset. |
| DS8 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | IJCNLP 2011-Wikipedia⁸ | The Wikipedia personal/social relation dataset was previously used in Culotta et al. [ |
| DS9 | Experiment 3: task-based evaluation | Reuters-21578⁷ | Reuters-21578 contains 21,578 documents (12,902 are used) categorized into 10 categories. |
| DS10 | Experiment 3: task-based evaluation | 20 Newsgroups⁸ | The 20 Newsgroups dataset contains 20,000 documents (18,846 are used) categorized into 20 categories. |
¹ Available at http://www.cs.cmu.edu/~mfaruqui/suite.html.
² Available at http://research.microsoft.com/en-us/downloads/.
³ Available at http://corpus.byu.edu/bnc/.
⁴ Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv.
⁵ Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp.
⁶ Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip.
⁷ Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection.
⁸ Available at http://www.csmining.org/index.php/id-20-newsgroups.html.
Correlation coefficients between human judgment (M&C) and several relatedness measures, including the proposed semantic relatedness measure.
| Measure | Relevance correlation with M&C |
|---|---|
| Distance-based measures | |
| Rada | 0.688 |
| Wu and Palmer | 0.765 |
| Information-based measures | |
| Resnik | 0.77 |
| Jiang and Conrath | 0.848 |
| Lin | 0.853 |
| Hybrid measures | |
| T. Hong and D. Smith | 0.879 |
| Zili Zhou | 0.882 |
| Information/feature-based measures | |
| The proposed MFsSR | 0.937 |
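The record does not say which correlation coefficient is reported. As a rough illustration, the sketch below assumes Pearson's coefficient between human ratings and the scores of a relatedness measure; the function name `pearson` and the rating values are hypothetical and are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two sequences around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up ratings for four word pairs: human scores vs. a measure's scores.
human_ratings = [3.9, 3.1, 1.2, 0.4]
measure_scores = [0.95, 0.70, 0.35, 0.10]
r = pearson(human_ratings, measure_scores)
print(f"correlation with human judgment: {r:.3f}")
```

A value close to 1 means the measure ranks and spaces the word pairs much as the human annotators do.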
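Accuracy (Acc) and acceptance rate (AR) of the proposed semantic relatedness measure compared with Islam and Inkpen [7].
| MRPC dataset | Relatedness threshold | Human judgment (TP + FN) | Islam and Inkpen [7]: Acc | Islam and Inkpen [7]: AR | Proposed MFsSR: Acc | Proposed MFsSR: AR |
|---|---|---|---|---|---|---|
| Training subset (4,076) | 0.1 | 2,753 true | 0.67 | 1 | 0.68 | 1 |
| Training subset (4,076) | 0.2 | 2,753 true | 0.67 | 1 | 0.68 | 1 |
| Training subset (4,076) | 0.3 | 2,753 true | 0.67 | 1 | 0.68 | 1 |
| Training subset (4,076) | 0.4 | 2,753 true | 0.67 | 1 | 0.68 | 1 |
| Training subset (4,076) | 0.5 | 2,753 true | 0.69 | 0.98 | 0.68 | 1 |
| Training subset (4,076) | 0.6 | 2,753 true | 0.72 | 0.89 | 0.68 | 1 |
| Training subset (4,076) | 0.7 | 2,753 true | 0.68 | 0.78 | 0.70 | 0.98 |
| Training subset (4,076) | 0.8 | 2,753 true | 0.56 | 0.4 | 0.72 | 0.86 |
| Training subset (4,076) | 0.9 | 2,753 true | 0.37 | 0.09 | 0.60 | 0.49 |
| Training subset (4,076) | 1 | 2,753 true | 0.33 | 0 | 0.34 | 0.02 |
| Test subset (1,725) | 0.1 | 1,147 true | 0.66 | 1 | 0.67 | 1 |
| Test subset (1,725) | 0.2 | 1,147 true | 0.66 | 1 | 0.67 | 1 |
| Test subset (1,725) | 0.3 | 1,147 true | 0.66 | 1 | 0.67 | 1 |
| Test subset (1,725) | 0.4 | 1,147 true | 0.66 | 1 | 0.67 | 1 |
| Test subset (1,725) | 0.5 | 1,147 true | 0.68 | 0.98 | 0.67 | 1 |
| Test subset (1,725) | 0.6 | 1,147 true | 0.72 | 0.89 | 0.66 | 1 |
| Test subset (1,725) | 0.7 | 1,147 true | 0.68 | 0.78 | 0.66 | 0.98 |
| Test subset (1,725) | 0.8 | 1,147 true | 0.56 | 0.4 | 0.64 | 0.85 |
| Test subset (1,725) | 0.9 | 1,147 true | 0.38 | 0.09 | 0.49 | 0.5 |
| Test subset (1,725) | 1 | 1,147 true | 0.33 | 0 | 0.34 | 0.03 |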
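A threshold sweep of this kind can be sketched as follows. The code assumes that a pair is accepted when its relatedness score is at or above the threshold, and that the acceptance rate is the fraction of all pairs accepted; neither definition is given in this record, and the scores and labels are invented. One check against the table is possible without any assumptions: when every pair is accepted (AR = 1), accuracy equals the share of true pairs, 2,753/4,076 ≈ 0.675 for the training subset and 1,147/1,725 ≈ 0.665 for the test subset, which agrees with the 0.66 to 0.68 reported at low thresholds.

```python
def sweep(scores, labels, thresholds):
    """Return (threshold, accuracy, acceptance_rate) for each threshold."""
    rows = []
    for t in thresholds:
        # Assumed rule: a pair is judged related when its score reaches t.
        accepted = [s >= t for s in scores]
        # Accuracy: share of pairs where system and human judgment agree.
        acc = sum(a == y for a, y in zip(accepted, labels)) / len(labels)
        # Assumed definition: share of all pairs the system accepts.
        ar = sum(accepted) / len(accepted)
        rows.append((t, acc, ar))
    return rows


# Invented relatedness scores and human labels for six sentence pairs.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.15]
labels = [True, True, True, False, True, False]

rows = sweep(scores, labels, [0.1, 0.5, 0.9])
for t, acc, ar in rows:
    print(f"threshold={t:.1f} acc={acc:.2f} ar={ar:.2f}")
```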
Results of the C-SynD algorithm based on MFsSR, compared with PMI and ICA, for detecting the closest synonym.
| Dataset | PMI: Precision | PMI: Recall | PMI: F-measure | ICA: Precision | ICA: Recall | ICA: F-measure | C-SynD: Precision | C-SynD: Recall | C-SynD: F-measure |
|---|---|---|---|---|---|---|---|---|---|
| SN | 60.6% | 60.6% | 61% | 79.5% | 71.6% | 75.3% | 85% | 80% | 82% |
| BNC | 74.5% | 67.9% | 71% | 76% | 67.8% | 72% | 82% | 71.9% | 77% |
Results of the IOAC-QL algorithm based on AVWs, compared with the CRFs and TFICF approaches.
| Dataset | CRFs: Precision | CRFs: Recall | CRFs: F-measure | TFICF: Precision | TFICF: Recall | TFICF: F-measure | IOAC-QL: Precision | IOAC-QL: Recall | IOAC-QL: F-measure |
|---|---|---|---|---|---|---|---|---|---|
| TREC | 75.08% | 60.2% | 66.28% | 89.3% | 71.4% | 78.7% | 89.8% | 88.1% | 88.6% |
| BLESS | 73.04% | 62.66% | 67.03% | 73.8% | 69.5% | 71.6% | 95.0% | 83.5% | 88.9% |
| NYT | 68.46% | 54.02% | 60.38% | 86.0% | 65.0% | 74.0% | 90.0% | 74.0% | 81.2% |
| Wikipedia | 56.0% | 42.0% | 48.0% | 64.6% | 54.88% | 59.34% | 70.0% | 44.0% | 54.0% |
Results of the graph-based approach compared with the term-based and concept-based approaches using the K-means and HAC clustering algorithms.
| Dataset | Algorithm | Term-based (FC) | Concept-based (FC) | Graph-based (FC) |
|---|---|---|---|---|
| Reuters-21578 | HAC | 73% | 85.7% | 90.2% |
| Reuters-21578 | K-means | 51% | 84.6% | 91.8% |
| 20 Newsgroups | HAC | 53% | 82% | 89% |
| 20 Newsgroups | K-means | 48% | 79.3% | 87% |