Ming Che Lee, Jia Wei Chang, Tung Cheng Hsieh.
Abstract
This paper presents a grammar- and semantic-corpus-based similarity algorithm for natural language sentences. Natural language, as opposed to "artificial language" such as computer programming languages, is the language used by the general public for daily communication. Traditional information retrieval approaches, such as vector models, LSA, HAL, or even the ontology-based approaches that compare concept similarity instead of co-occurring terms/words, may not always find the best match when there is no obvious relation or concept overlap between two natural language sentences. This paper proposes a sentence similarity algorithm that takes advantage of corpus-based ontology and grammatical rules to overcome these problems. Experiments on two well-known benchmarks demonstrate that the proposed algorithm achieves a significant performance improvement on sentences/short texts with arbitrary syntax and structure.
Year: 2014 PMID: 24982952 PMCID: PMC4005080 DOI: 10.1155/2014/437162
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: Linkage structures produced by link grammar.
Selected linkages used in the algorithm (columns: Links, Subtypes, Descriptions); superscript ξ denotes the optional linking types.
Algorithm 1: Linking types.
Algorithm 2: Semantic sentence similarity.
Algorithm 3: Grammar matrix.
Figure 2: Diagram of grammar matrices and grammar vectors.
Figure 3: Linkages of sentence A.
Figure 4: Linkages of sentence B.
Figure 5: Linkages of sentence C.
The GMs of sentence A versus sentence B.
| Subtypes and words | revenue | revenue- | revenue- | with- | over- | of- | from- | the- | 's- | the- |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue | 1 | — | — | — | — | — | — | — | — | — |
| revenue- | — | 1 | — | — | — | — | — | — | — | — |
| revenue-in | — | — | 1 | — | — | — | — | — | — | — |
| quarter-of | — | — | 0.33 | — | — | — | — | — | — | — |
| period-earlier | — | — | 0.33 | — | — | — | — | — | — | — |
| in-quarter | — | — | — | 0.33 | 0.67 | 0.83 | 0.91 | — | — | — |
| from-period | — | — | — | 0.33 | 0.36 | 0.91 | 1 | — | — | — |
| of-year | — | — | — | 0.31 | 0.77 | 1 | 0.91 | — | — | — |
| the-quarter | — | — | — | — | — | — | — | 0.33 | 0.67 | 0.91 |
| the-year | — | — | — | — | — | — | — | 0.31 | 0.77 | 0.91 |
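The table above is block-structured: similarity values appear only where a link pair of sentence A and a link pair of sentence B fall in the same group. As a rough sketch of how Figure 2's grammar matrices might relate to grammar vectors, one block can be stored as a small matrix and reduced to a vector by keeping the best match in each row. The row-maximum reduction and the names `gm_block` and `row_max_vector` are assumptions made for illustration; this listing does not reproduce Algorithm 3.

```python
# One block of the A-versus-B grammar matrix, copied from the table above.
# Rows are link pairs of sentence A; columns are the "with-", "over-",
# "of-" and "from-" link pairs of sentence B.
gm_block = {
    "in-quarter":  [0.33, 0.67, 0.83, 0.91],
    "from-period": [0.33, 0.36, 0.91, 1.0],
    "of-year":     [0.31, 0.77, 1.0, 0.91],
}


def row_max_vector(block):
    """Assumed reduction: keep the best-matching column value per row."""
    return {row: max(values) for row, values in block.items()}


vector = row_max_vector(gm_block)
print(vector)  # {'in-quarter': 0.91, 'from-period': 1.0, 'of-year': 1.0}
```

Under this assumption, each link pair of sentence A is represented by its closest counterpart in sentence B, so a row that contains an exact match contributes 1.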
The GMs of sentence A versus sentence C.
| Subtypes and words | result | for-employees | over-years | the-result | an-package |
|---|---|---|---|---|---|
| revenue | 0.31 | — | — | — | — |
| in-quarter | — | 0.73 | 0.83 | — | — |
| from-period | — | 0.18 | 0.91 | — | — |
| of-year | — | 0.22 | 0.9 | — | — |
| the-quarter | — | — | — | 0.4 | 0.4 |
| the-year | — | — | — | 0.33 | 0.55 |
The GMs of sentence B versus sentence C.
| Subtypes and words | revenue | with- | over- | of- | from- | the- | 's- | the- | the- |
|---|---|---|---|---|---|---|---|---|---|
| result | 0.31 | — | — | — | — | — | — | — | — |
| for-employees | — | 0 | 0.22 | 0.22 | 0.11 | — | — | — | — |
| over-years | — | 0.31 | 0.33 | 0.9 | 0.91 | — | — | — | — |
| the-result | — | — | — | — | — | 0.71 | 0.33 | 0.5 | 0.33 |
| an-package | — | — | — | — | — | 0.33 | 0.55 | 0.4 | 0.55 |
Li's Dataset [8].
| Number | Word pair | Raw sentences | Human Sim. |
|---|---|---|---|
| 1 | cord : smile | (1) Cord is strong, thick string. | 0.0100 |
| 2 | rooster : voyage | (1) A rooster is an adult male chicken. | 0.0050 |
| 3 | noon : string | (1) Noon is 12 o'clock in the middle of the day. | 0.0125 |
| 4 | fruit : furnace | (1) Fruit or a fruit is something which grows on a tree or bush and which contains seeds or a stone covered by a substance that you can eat. | 0.0475 |
| 5 | autograph : shore | (1) An autograph is the signature of someone famous which is specially written for a fan to keep. | 0.0050 |
| 6 | automobile : wizard | (1) An automobile is a car. | 0.0200 |
| 7 | mound : stove | (1) A mound of something is a large rounded pile of it. | 0.0050 |
| 8 | grin : implement | (1) A grin is a broad smile. | 0.0050 |
| 9 | asylum : fruit | (1) An asylum is a psychiatric hospital. | 0.0050 |
| 10 | asylum : monk | (1) An asylum is a psychiatric hospital. | 0.0375 |
| 11 | graveyard : madhouse | (1) A graveyard is an area of land, sometimes near a church, where dead people are buried. | 0.0225 |
| 12 | glass : magician | (1) Glass is a hard transparent substance that is used to make things such as windows and bottles. | 0.0075 |
| 13 | boy : rooster | (1) A boy is a child who will grow up to be a man. | 0.1075 |
| 14 | cushion : jewel | (1) A cushion is a fabric case filled with soft material, which you put on a seat to make it more comfortable. | 0.0525 |
| 15 | monk : slave | (1) A monk is a member of a male religious community that is usually separated from the outside world. | 0.0450 |
| 16 | asylum : cemetery | (1) An asylum is a psychiatric hospital. | 0.0375 |
| 17 | coast : forest | (1) The coast is an area of land that is next to the sea. | 0.0475 |
| 18 | grin : lad | (1) A grin is a broad smile. | 0.0125 |
| 19 | shore : woodland | (1) The shores or shore of a sea, lake, or wide river is the land along the edge of it. | 0.0825 |
| 20 | monk : oracle | (1) A monk is a member of a male religious community that is usually separated from the outside world. | 0.1125 |
| 21 | boy : sage | (1) A boy is a child who will grow up to be a man. | 0.0425 |
| 22 | automobile : cushion | (1) An automobile is a car. | 0.0200 |
| 23 | mound : shore | (1) A mound of something is a large rounded pile of it. | 0.0350 |
| 24 | lad : wizard | (1) A lad is a young man or boy. | 0.0325 |
| 25 | forest : graveyard | (1) A forest is a large area where trees grow close together. | 0.0650 |
| 26 | food : rooster | (1) Food is what people and animals eat. | 0.0550 |
| 27 | cemetery : woodland | (1) A cemetery is a place where dead people's bodies or their ashes are buried. | 0.0375 |
| 28 | shore : voyage | (1) The shores or shore of a sea, lake, or wide river is the land along the edge of it. | 0.0200 |
| 29 | bird : woodland | (1) A bird is a creature with feathers and wings, females lay eggs, and most birds can fly. | 0.0125 |
| 30 | coast : hill | (1) The coast is an area of land that is next to the sea. | 0.1000 |
| 31 | furnace : implement | (1) A furnace is a container or enclosed space in which a very hot fire is made, for example to melt metal, burn rubbish or produce steam. | 0.0500 |
| 32 | crane : rooster | (1) A crane is a large machine that moves heavy things by lifting them in the air. | 0.0200 |
| 33 | hill : woodland | (1) A hill is an area of land that is higher than the land that surrounds it. | 0.1450 |
| 34 | car : journey | (1) A car is a motor vehicle with room for a small number of passengers. | 0.0725 |
| 35 | cemetery : mound | (1) A cemetery is a place where dead people's bodies or their ashes are buried. | 0.0575 |
| 36 | glass : jewel | (1) Glass is a hard transparent substance that is used to make things such as windows and bottles. | 0.1075 |
| 37 | magician : oracle | (1) A magician is a person who entertains people by doing magic tricks. | 0.1300 |
| 38 | crane : implement | (1) A crane is a large machine that moves heavy things by lifting them in the air. | 0.1850 |
| 39 | brother : lad | (1) Your brother is a boy or a man who has the same parents as you. | 0.1275 |
| 40 | sage : wizard | (1) A sage is a person who is regarded as being very wise. | 0.1525 |
| 41 | oracle : sage | (1) In ancient times, an oracle was a priest or priestess who made statements about future events or about the truth. | 0.2825 |
| 42 | bird : crane | (1) A bird is a creature with feathers and wings, females lay eggs, and most birds can fly. | 0.0350 |
| 43 | bird : cock | (1) A bird is a creature with feathers and wings, females lay eggs, and most birds can fly. | 0.1625 |
| 44 | food : fruit | (1) Food is what people and animals eat. | 0.2425 |
| 45 | brother : monk | (1) Your brother is a boy or a man who has the same parents as you. | 0.0450 |
| 46 | asylum : madhouse | (1) An asylum is a psychiatric hospital. | 0.2150 |
| 47 | furnace : stove | (1) A furnace is a container or enclosed space in which a very hot fire is made, for example, to melt metal, burn rubbish, or produce steam. | 0.3475 |
| 48 | magician : wizard | (1) A magician is a person who entertains people by doing magic tricks. | 0.3550 |
| 49 | hill : mound | (1) A hill is an area of land that is higher than the land that surrounds it. | 0.2925 |
| 50 | cord : string | (1) Cord is strong, thick string. | 0.4700 |
| 51 | glass : tumbler | (1) Glass is a hard transparent substance that is used to make things such as windows and bottles. | 0.1375 |
| 52 | grin : smile | (1) A grin is a broad smile. | 0.4850 |
| 53 | serf : slave | (1) In former times, serfs were a class of people who had to work on a particular person's land and could not leave without that person's permission. | 0.4825 |
| 54 | journey : voyage | (1) When you make a journey, you travel from one place to another. | 0.3600 |
| 55 | autograph : signature | (1) An autograph is the signature of someone famous which is specially written for a fan to keep. | 0.4050 |
| 56 | coast : shore | (1) The coast is an area of land that is next to the sea. | 0.5875 |
| 57 | forest : woodland | (1) A forest is a large area where trees grow close together. | 0.6275 |
| 58 | implement : tool | (1) An implement is a tool or other pieces of equipment. | 0.5900 |
| 59 | cock : rooster | (1) A cock is an adult male chicken. | 0.8625 |
| 60 | boy : lad | (1) A boy is a child who will grow up to be a man. | 0.5800 |
| 61 | cushion : pillow | (1) A cushion is a fabric case filled with soft material, which you put on a seat to make it more comfortable. | 0.5225 |
| 62 | cemetery : graveyard | (1) A cemetery is a place where dead people's bodies or their ashes are buried. | 0.7725 |
| 63 | automobile : car | (1) An automobile is a car. | 0.5575 |
| 64 | midday : noon | (1) Midday is 12 o'clock in the middle of the day. | 0.9550 |
| 65 | gem : jewel | (1) A gem is a jewel or stone that is used in jewellery. | 0.6525 |
Benchmark number and the results compared with Li et al. [8], LSA [54], STS Meth. [55], SyMSS [56], Omiotis [57], and our grammar-based approach.
| R&G number | Human | Li_McLean | LSA | STS Meth. | SyMSS | Omiotis | LG |
|---|---|---|---|---|---|---|---|
| 1 | 0.01 | 0.33 | 0.51 | 0.06 | 0.32 | 0.11 | 0.22 |
| 5 | 0.01 | 0.29 | 0.53 | 0.11 | 0.28 | 0.10 | 0.06 |
| 9 | 0.01 | 0.21 | 0.51 | 0.07 | 0.27 | 0.10 | 0.35 |
| 13 | 0.10 | 0.53 | 0.53 | 0.16 | 0.27 | 0.30 | 0.32 |
| 17 | 0.13 | 0.36 | 0.58 | 0.26 | 0.42 | 0.30 | 0.41 |
| 21 | 0.04 | 0.51 | 0.53 | 0.16 | 0.37 | 0.24 | 0.44 |
| 25 | 0.07 | 0.55 | 0.60 | 0.33 | 0.53 | 0.30 | 0.07 |
| 29 | 0.01 | 0.34 | 0.51 | 0.12 | 0.31 | 0.11 | 0.20 |
| 33 | 0.15 | 0.59 | 0.81 | 0.29 | 0.43 | 0.49 | 0.07 |
| 37 | 0.13 | 0.44 | 0.58 | 0.20 | 0.23 | 0.11 | 0.07 |
| 41 | 0.28 | 0.43 | 0.58 | 0.09 | 0.38 | 0.11 | 0.02 |
| 47 | 0.35 | 0.72 | 0.72 | 0.30 | 0.24 | 0.22 | 0.25 |
| 48 | 0.36 | 0.64 | 0.62 | 0.34 | 0.42 | 0.53 | 0.79 |
| 49 | 0.29 | 0.74 | 0.54 | 0.15 | 0.39 | 0.57 | 0.38 |
| 50 | 0.47 | 0.69 | 0.68 | 0.49 | 0.35 | 0.55 | 0.07 |
| 51 | 0.14 | 0.65 | 0.73 | 0.28 | 0.31 | 0.52 | 0.39 |
| 52 | 0.49 | 0.49 | 0.70 | 0.32 | 0.54 | 0.60 | 0.84 |
| 53 | 0.48 | 0.39 | 0.83 | 0.44 | 0.52 | 0.5 | 0.18 |
| 54 | 0.36 | 0.52 | 0.61 | 0.41 | 0.33 | 0.43 | 0.32 |
| 55 | 0.41 | 0.55 | 0.70 | 0.19 | 0.33 | 0.43 | 0.38 |
| 56 | 0.59 | 0.76 | 0.78 | 0.47 | 0.43 | 0.93 | 0.62 |
| 57 | 0.63 | 0.7 | 0.75 | 0.26 | 0.50 | 0.61 | 0.82 |
| 58 | 0.59 | 0.75 | 0.83 | 0.51 | 0.64 | 0.74 | 0.94 |
| 59 | 0.86 | 1 | 1 | 0.94 | 1 | 1 | 1 |
| 60 | 0.58 | 0.66 | 0.83 | 0.6 | 0.63 | 0.93 | 0.89 |
| 61 | 0.52 | 0.66 | 0.63 | 0.29 | 0.39 | 0.35 | 0.08 |
| 62 | 0.77 | 0.73 | 0.74 | 0.51 | 0.75 | 0.73 | 0.94 |
| 63 | 0.59 | 0.64 | 0.87 | 0.52 | 0.78 | 0.79 | 0.95 |
| 64 | 0.96 | 1 | 1 | 0.93 | 1 | 0.93 | 1 |
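One common way to summarise a table like this is the Pearson correlation between a method's scores and the human ratings. The sketch below is only an illustration of that calculation: the helper name `pearson` is hypothetical, and the inputs are just the last five rows of the table (R&G numbers 60 to 64), so the resulting value is not a result reported by the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Human ratings and LG scores for R&G numbers 60-64, copied from the table.
human = [0.58, 0.52, 0.77, 0.59, 0.96]
lg = [0.89, 0.08, 0.94, 0.95, 1.00]
r_subset = pearson(human, lg)
```

Passing the full Human and LG columns instead of the five-row subset would give the correlation over the whole benchmark selection.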
Results of the grammar-based and competitive methods on the Microsoft Research Paraphrase Corpus.
| Category | Metric | Accuracy | Precision | Recall | F-measure |
|---|---|---|---|---|---|
| Corpus-based | PMI-IR | 69.90 | 70.20 | 95.20 | 81.00 |
| | LSA | 68.40 | 69.70 | 95.20 | 80.50 |
| | STS Meth. | 72.64 | 74.65 | 89.13 | 81.25 |
| | SyMSS_JCN | 70.87 | 74.70 | 84.17 | 79.02 |
| | SyMSS_Vector | 70.82 | 74.15 | 90.32 | 81.44 |
| | Omiotis | 69.97 | 70.78 | 93.40 | 80.52 |
| Lexicon-based | JC | 69.30 | 72.20 | 87.10 | 79.00 |
| | LC | 69.50 | 72.40 | 87.00 | 79.00 |
| | Lesk | 69.30 | 72.40 | 86.60 | 78.90 |
| | L | 69.30 | 71.60 | 88.70 | 79.20 |
| | W&P | 69.00 | 70.20 | 92.10 | 80.00 |
| | R | 69.00 | 69.00 | 96.40 | 80.40 |
| | M | 70.30 | 69.60 | 97.70 | 81.30 |
| Machine learning-based | Wan et al. [ | 75.00 | 77.00 | 90.00 | 83.00 |
| | Z&P | 71.90 | 74.30 | 88.20 | 80.70 |
| | Qiu et al. [ | 72.00 | 72.50 | 93.40 | 81.60 |
| Baselines | Random | 51.30 | 68.30 | 50.00 | 57.80 |
| | VSM | 65.40 | 71.60 | 79.50 | 75.30 |
| | LG | 71.02 | 73.90 | 91.07 | 81.59 |
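The last numeric column of the table above is consistent with the F-measure, the harmonic mean of precision and recall. As a small check under that assumption, the sketch below recomputes the value for the LG row from its precision and recall; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the LG row, in percent.
lg_f = f_measure(73.90, 91.07)
print(round(lg_f, 2))  # 81.59
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method with very high recall but modest precision still ends up with an F-measure close to the other rows.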
Figure 6: Precision versus similarity threshold curves of STS and LG for eleven different similarity thresholds.
Figure 7: Recall versus similarity threshold curves of STS and LG for eleven different similarity thresholds.
Figure 8: Accuracy versus similarity threshold curves of STS and LG for eleven different similarity thresholds.
Figure 9: F-measure versus similarity threshold curves of STS and LG for eleven different similarity thresholds.