Fanghuai Hu, Zhiqing Shao, and Tong Ruan.
Abstract
Constructing an ontology manually is a time-consuming, error-prone, and tedious task. We present SSCO, a self-supervised-learning-based Chinese ontology that contains about 255 thousand concepts, 5 million entities, and 40 million facts. We explore the three largest online Chinese encyclopedias for ontology learning and describe how to transfer their structured knowledge, including article titles, category labels, redirection pages, taxonomy systems, and InfoBox modules, into ontological form. To avoid the errors in the encyclopedias and to enrich the learnt ontology, we also apply several machine-learning-based methods. First, we show statistically and experimentally that self-supervised machine learning is practicable for Chinese relation extraction (at least for synonymy and hyponymy), and we train self-supervised models (SVMs and CRFs) for synonymy extraction, concept-subconcept relation extraction, and concept-instance relation extraction; the advantage of our methods is that all training examples are generated automatically from the structural information of the encyclopedias and a few general heuristic rules. Finally, we evaluate SSCO with respect to scale and precision: manual evaluation shows that the ontology has excellent precision, comparison with other well-known ontologies and knowledge bases indicates high coverage, and the experimental results show that the self-supervised models clearly enrich SSCO.
Year: 2014 PMID: 24715819 PMCID: PMC3970055 DOI: 10.1155/2014/848631
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
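As a sketch of the abstract's idea of transferring structured encyclopedia knowledge into ontological form, the snippet below turns redirection-page records (alias title mapped to canonical article title) into synonym facts. The record layout and function name are illustrative assumptions, not the paper's actual data format.

```python
def redirections_to_synonym_facts(redirects):
    """Turn (alias, canonical_title) redirection records into synonym facts.

    Each encyclopedia redirection maps an alternative title to the canonical
    article title, which serves as a high-precision source of synonymy.
    """
    facts = set()
    for alias, canonical in redirects:
        if alias != canonical:  # skip self-redirects
            facts.add(("synonym", alias, canonical))
    return facts

# Example: two distinct redirects yield two synonym facts
facts = redirections_to_synonym_facts([("Hu", "Shanghai"), ("Diannao", "Jisuanji")])
print(sorted(facts))
# → [('synonym', 'Diannao', 'Jisuanji'), ('synonym', 'Hu', 'Shanghai')]
```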
Figure 1: Useful structured information in encyclopedias.
Figure 2: The framework of our encyclopedia-based ontology learning system.
Figure 3: Two InfoBox examples extracted from Wikipedia; the labels in the green solid-line boxes are attribute names, and the labels in the red dashed-line boxes are attribute values.
Lexico-syntactic patterns of synonyms in the sample sentences.
| Pattern | Example |
|---|---|
| X is commonly called as Y | Jisuanji (computer) is commonly called as Diannao |
| X is abbreviated as Y | Shanghai is abbreviated as Hu |
| X is also named as Y | Hehua (lotus) is also named as Lianhua |
| X is also called as Y | Oceania is also called as Australia |
| X is originally called as Y | Laoshe is originally called as Shu Qingchun |
| X is anciently named as Y | Xi'an is anciently named as Chang'an |
| X is Y's synonym | Like is love's synonym |
| X is Y's abbreviation | Hu is Shanghai's abbreviation |
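A minimal sketch of matching such lexico-syntactic synonym patterns. The paper matches Chinese templates; the English-gloss regexes below (a subset of the table's patterns) are illustrative stand-ins, and the function name is an assumption.

```python
import re

# A subset of the table's patterns; the paper's templates are Chinese,
# and these English-gloss regexes are illustrative stand-ins.
SYNONYM_PATTERNS = [
    re.compile(r"(?P<a>\w[\w' ]*) is commonly called as (?P<b>\w[\w' ]*)"),
    re.compile(r"(?P<a>\w[\w' ]*) is abbreviated as (?P<b>\w[\w' ]*)"),
    re.compile(r"(?P<a>\w[\w' ]*) is also named as (?P<b>\w[\w' ]*)"),
    re.compile(r"(?P<a>\w[\w' ]*) is also called as (?P<b>\w[\w' ]*)"),
]

def extract_synonyms(sentence):
    """Return (term, synonym) pairs matched by any sample pattern."""
    pairs = []
    for pat in SYNONYM_PATTERNS:
        for m in pat.finditer(sentence):
            pairs.append((m.group("a").strip(), m.group("b").strip()))
    return pairs

print(extract_synonyms("Shanghai is abbreviated as Hu"))
# → [('Shanghai', 'Hu')]
```

Pattern matches like these serve as the seed positives from which the self-supervised SR-SVM and SR-CRF models are trained, without manual annotation.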
Lexico-syntactic patterns of hyponymy in the sample sentences.
| Pattern | Example |
|---|---|
| X is a type of Y | Tablet is a type of computer |
| X is a Y | China is a country |
| X is a kind of Y | Human is a kind of mammal |
| Y such as X1 and X2 | Animals such as lion and tiger |
| Y there are X1, X2, and so on | Dynasties there are Han, Tang, and so on |
| X is a (modifier) Y | China is a developing country |
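The hyponymy patterns can be matched the same way, yielding (hypernym, hyponym) pairs; the "such as" pattern produces one pair per listed hyponym. Again, the English-gloss regexes below stand in for the paper's Chinese templates, and the helper name is an assumption.

```python
import re

# English glosses from the table stand in for the paper's Chinese templates.
TYPED_PATTERNS = [
    re.compile(r"(?P<hypo>[\w' ]+?) is a type of (?P<hyper>[\w' ]+)"),
    re.compile(r"(?P<hypo>[\w' ]+?) is a kind of (?P<hyper>[\w' ]+)"),
]
LIST_PATTERN = re.compile(r"(?P<hyper>[\w' ]+?) such as (?P<list>[\w', ]+)")

def extract_hyponymy(sentence):
    """Return (hypernym, hyponym) pairs matched by the sample patterns."""
    pairs = []
    for pat in TYPED_PATTERNS:
        for m in pat.finditer(sentence):
            pairs.append((m.group("hyper").strip(), m.group("hypo").strip()))
    for m in LIST_PATTERN.finditer(sentence):
        hyper = m.group("hyper").strip()
        for item in re.split(r",| and ", m.group("list")):
            if item.strip():  # one pair per listed hyponym
                pairs.append((hyper, item.strip()))
    return pairs

print(extract_hyponymy("Animals such as lion and tiger"))
# → [('Animals', 'lion'), ('Animals', 'tiger')]
```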
Figure 4: A labeled example of SR-CRF.
Information crawled from online encyclopedias.
| Encyclopedia | Article | Category | InfoBox | Redirection |
|---|---|---|---|---|
| Baidu-Baike | 4,729,672 | 635,424 | 171,532 | 150,003 |
| Hudong-Baike | 3,680,773 | 213,176 | 641,251 | 258,720 |
| Chinese-Wikipedia | 462,653 | 107,233 | 74,293 | 322,922 |
The detailed size of each element of SSCO; hyponymy relations are divided into concept-subconcept and concept-instance relations.
| Element | Size |
|---|---|
| Concept | 264,940 |
| Instance | 4,902,180 |
| Synonym relation | 812,239 |
| Concept-subconcept relation | 711,232 |
| Concept-instance relation | 25,960,023 |
| Attributes | 6,926,942 |
| Total facts | 39,577,556 |
Concept extraction results.
| Concept source | Quantity | Precision | Opt |
|---|---|---|---|
| From taxonomy system | 104,803 | 1.000 | Select |
| From category labels | | | |
| In Chinese-Wikipedia | 7,622 | 0.990 | Select |
| In Hudong-Baike | 82,328 | 0.824 | Discard |
| In Baidu-Baike | 430,951 | 0.672 | Discard |
| In Hudong-Baike and Baidu-Baike | 76,923 | 0.972 | Select |
| Subtotal (category labels) | 84,545 | | |
| From hyponymy relations | 75,592 | 0.964 | Select |
| Total | 264,940 | 0.981 | |
Figure 5: The synonym relations extracted from the redirection pages and InfoBox modules of the encyclopedias.
Figure 6: Precision-recall curves of SR-CRF and SR-SVM.
The detailed results of synonymy extraction.
| Relation source | Quantity | Precision |
|---|---|---|
| Redirection page and InfoBox module | 262,822 | 1.000 |
| SR-CRF | 357,009 | 0.912 |
| SR-SVM | 80,182 | 0.900 |
| Remaining relations found by both SR-CRF and SR-SVM | 112,226 | 0.932 |
| Total | 812,239 | 0.943 |
Hyponymy relation extraction results.
| Relation source | Quantity | Precision |
|---|---|---|
| From taxonomy system | 442,894 | 1.000 |
| From category labels | | |
| In Chinese-Wikipedia | 1,549,103 | 0.978 |
| In Hudong-Baike and Baidu-Baike | 13,902,238 | 0.944 |
| From HY-CRF | 9,723,819 | 0.916 |
| From HY-SVM | 723,076 | 0.900 |
| From the remaining relations of HY-CRF and HY-SVM | 330,125 | 0.928 |
| Total | 26,671,255 | 0.935 |
Size of other ontologies.
| Ontology | Entities | Facts |
|---|---|---|
| KnowItNow | N/A | 25,860 |
| KnowItAll | N/A | 29,835 |
| SUMO | 20,000 | 60,000 |
| OpenCyc | 47,000 | 306,000 |
| Cyc | 500,000 | 5,000,000 |
| WordNet | 155,287 | 479,887 |
| YAGO | 1,056,638 | 5,000,000 |
| YAGO2 | 9,800,000 | 120,000,000 |
| HowNet | 191,924 | 462,433 |
| CCD | 66,590 | N/A |
| CCE | 2,000,000 | N/A |
| The ontology in […] | 822,135 | 5,237,520 |