Jie Hu, Shaobo Li, Yong Yao, Liya Yu, Guanci Yang, Jianjun Hu.
Abstract
Many text mining tasks such as text retrieval, text summarization, and text comparison depend on the extraction of representative keywords from the main text. Most existing keyword extraction algorithms are based on discrete bag-of-words representations of the text. In this paper, we propose a patent keyword extraction algorithm (PKEA) based on the distributed Skip-gram model for patent classification. We also develop a set of quantitative performance measures for keyword extraction evaluation based on information gain and on cross-validated Support Vector Machine (SVM) classification, which are valuable when human-annotated keywords are not available. We used a standard benchmark dataset and a self-built patent dataset to evaluate the performance of PKEA. Our patent dataset includes 2500 patents from five distinct technological fields related to autonomous cars (GPS systems, lidar systems, object recognition systems, radar systems, and vehicle control systems). We compared our method with Frequency, Term Frequency-Inverse Document Frequency (TF-IDF), TextRank, and Rapid Automatic Keyword Extraction (RAKE). The experimental results show that our proposed algorithm provides a promising way to extract keywords from patent texts for patent classification.
Keywords: deep learning; information gain; keyword extraction; patent classification
Year: 2018 PMID: 33265195 PMCID: PMC7512597 DOI: 10.3390/e20020104
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
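The core scoring step of a Skip-gram-based extractor in the spirit of PKEA can be sketched as ranking vocabulary words by cosine similarity to a category vector built from seed words. This is only an illustrative sketch, not the paper's implementation: the function names, the toy 3-dimensional vectors, and the single seed word are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_keywords(word_vectors, category_seed_words, top_n=3):
    """Score every vocabulary word by similarity to the centroid of the
    category's seed-word vectors; return the top-n as candidate keywords."""
    dims = len(next(iter(word_vectors.values())))
    centroid = [0.0] * dims
    for w in category_seed_words:
        centroid = [c + x for c, x in zip(centroid, word_vectors[w])]
    centroid = [c / len(category_seed_words) for c in centroid]
    ranked = sorted(word_vectors,
                    key=lambda w: cosine(word_vectors[w], centroid),
                    reverse=True)
    return ranked[:top_n]

# Toy 3-d "embeddings" (hypothetical values, for illustration only).
vecs = {
    "gps":       [0.9, 0.1, 0.0],
    "satellite": [0.8, 0.2, 0.1],
    "camera":    [0.1, 0.9, 0.1],
    "image":     [0.0, 0.8, 0.2],
}
top = rank_keywords(vecs, ["gps"], top_n=2)  # -> ['gps', 'satellite']
```

In a real setting the vectors would come from a Skip-gram model trained on the patent corpus, and the seed words would be drawn from the pre-defined category corpus mentioned in Table 1.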
Comparison of different keyword extraction approaches.
| Categories | Methods | Advantages | Drawbacks | Application Scenarios |
|---|---|---|---|---|
| Supervised | Machine learning approaches (e.g., Decision Tree) | High readability; great flexibility to include a wide variety of arbitrary, non-independent features of the input. Can find new terms that did not appear in the training data. | Needs a labeled corpus. | News, scientific articles, etc. Widely applied. |
| Unsupervised | TF-IDF | No labeled corpus needed. Easy to implement; widely applied. | Cannot extract semantically meaningful words. The keywords are not comprehensive and not accurate enough. | News, scientific articles, etc. Widely applied. |
| | TextRank | No labeled corpus needed. Transfers well to texts on other topics. | Ignores the semantic relevance of keywords. Extraction of low-frequency keywords is poor. High computational complexity. | Small-scale keyword extraction tasks. |
| | LDA | No labeled corpus needed. Can obtain semantic keywords and handle polysemy. Easy to apply to various languages. | Tends to extract general keywords that do not represent the topic of the corresponding text well. | Various languages. |
| | RAKE | No corpus needed. Very fast, with low complexity. Easy to implement. | Cannot extract semantically meaningful words. | Extracting key phrases from texts. |
| | PKEA (our approach) | Extracts both semantic and discriminative keywords. No annotated corpus needed. Low computational complexity. High performance at extracting discriminative keywords. Easy to implement and to apply to other text types. | Needs a pre-defined category corpus. | Specially designed for extracting keywords from patent texts; easy to extend to other scientific articles. |
Figure 1. The overall process of keyword extraction and its evaluation measures.
Figure 2. The architecture of the Skip-gram model [20].
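The Skip-gram model shown in Figure 2 trains word vectors by making each word predict its context words within a fixed-size window. A minimal sketch of the (center, context) training-pair generation — the example sentence and function name are illustrative assumptions, and real training would feed these pairs to a Skip-gram implementation such as gensim's Word2Vec with `sg=1`:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by the
    Skip-gram objective: each word predicts its neighbours within
    a fixed-size window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical sentence from a patent text.
sent = ["the", "lidar", "sensor", "detects", "objects"]
pairs = skipgram_pairs(sent, window=1)
# Each interior word yields two pairs; the boundary words one each.
```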
Number of documents and of author- and reader-assigned key phrases in the training and test datasets.
| Dataset | Documents | Categories | Author Key Phrases | Reader Key Phrases | Combined Key Phrases |
|---|---|---|---|---|---|
| Training | 144 | 4 | 559 | 1824 | 2223 |
| Test | 100 | 4 | 387 | 1217 | 1482 |
Top 10 keywords in each patent category extracted by the patent keyword extraction algorithm (PKEA).
| GPS System | Object Recognition | Vehicle Control System | Radar System | Lidar System |
|---|---|---|---|---|
| GPS | Camera | Automobile | Radar | Lidar |
| Satellite | Environment | Controller | Trajectory | Laser |
| Altitude | Image | Communication | Operation | Detection |
| Position | ORC | Assistance | Present-azimuth | Three-axis |
| Synchronization | GUI | Speed | Prior-azimuth | Microwave |
| Wavelength | Visibility | Guidance | Radiation | Receiver |
| Telecommunication | Autonomous | Acceleration | Path | Luminescence |
| Geo-mobile | Surrounding | Acquisition | Plurality | Reflection |
| GPS-enabled | Video | Remote | Reference-location | Speedometer |
| MS (communication device) | Multi-target | Roadway | Radar-sensor | Collision |
Keyword distribution across patent documents.
| Patent Document | GPS | Image | Camera | Vehicle | Category |
|---|---|---|---|---|---|
| Patent 1 | 1 | 0 | 0 | 1 | A |
| Patent 2 | 0 | 1 | 1 | 0 | B |
| Patent 3 | 0 | 0 | 1 | 1 | B |
| Patent 4 | 0 | 0 | 1 | 1 | A |
| Patent 5 | 1 | 1 | 0 | 1 | A |
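The information gain used as one of the evaluation measures can be computed directly from a keyword-presence matrix like the one above. A minimal pure-Python sketch over the five patents listed (variable names are illustrative; the data are taken from the table):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(presence, labels):
    """IG(keyword) = H(C) - sum over v in {0,1} of P(v) * H(C | keyword = v)."""
    n = len(labels)
    ig = entropy(labels)
    for v in (0, 1):
        subset = [lab for p, lab in zip(presence, labels) if p == v]
        if subset:
            ig -= (len(subset) / n) * entropy(subset)
    return ig

# Binary keyword-presence matrix from the table above (Patents 1-5).
labels = ["A", "B", "B", "A", "A"]
matrix = {
    "GPS":     [1, 0, 0, 0, 1],
    "Image":   [0, 1, 0, 0, 1],
    "Camera":  [0, 1, 1, 1, 0],
    "Vehicle": [1, 0, 1, 1, 1],
}
ig = {k: information_gain(v, labels) for k, v in matrix.items()}
# GPS and Camera are the most discriminative (IG ~ 0.420);
# Image carries almost no class information (IG ~ 0.020).
```

This makes the intuition behind the table concrete: "GPS" and "Camera" split the two categories almost cleanly, while "Image" is nearly uninformative.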
Figure 3. The total number of keywords extracted by the five algorithms.
Figure 4. Precision scores obtained by the Support Vector Machine (SVM) classifier using five keyword extraction algorithms.
Figure 5. Recall scores obtained by the SVM classifier using five keyword extraction algorithms.
Figure 6. F1 scores obtained by the SVM classifier using five keyword extraction algorithms.
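Figures 4-6 report precision, recall, and F1 of the SVM classifier. A minimal sketch of these metrics under a macro-averaging assumption (the averaging mode is not stated here), with hypothetical labels:

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Hypothetical true and predicted patent categories.
y_true = ["gps", "gps", "lidar", "radar"]
y_pred = ["gps", "lidar", "lidar", "radar"]
prec, rec, f1 = macro_prf(y_true, y_pred)
```

In practice scikit-learn's `precision_recall_fscore_support` would be used instead of hand-rolled loops; the sketch only makes the definitions explicit.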
Statistics of F1 scores achieved by the five algorithms.
| Pair | Method | Mean | N | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|---|
| Pair 1 | PKEA | 0.8199 | 10 | 0.03040 | 0.00961 |
| | Frequency | 0.7897 | 10 | 0.02141 | 0.00677 |
| Pair 2 | PKEA | 0.8199 | 10 | 0.03040 | 0.00961 |
| | TF-IDF | 0.7972 | 10 | 0.03500 | 0.01107 |
| Pair 3 | PKEA | 0.8199 | 10 | 0.03040 | 0.00961 |
| | RAKE | 0.7851 | 10 | 0.03826 | 0.01210 |
| Pair 4 | PKEA | 0.8199 | 10 | 0.03040 | 0.00961 |
| | TextRank | 0.7905 | 10 | 0.04965 | 0.01570 |
Paired t-test results on F1 scores for the five algorithms.
| Pair | Comparison | Mean Diff. | Std. Deviation | Std. Error Mean | 95% CI Lower | 95% CI Upper | t | df | Sig. (2-Tailed) |
|---|---|---|---|---|---|---|---|---|---|
| Pair 1 | PKEA-Frequency | 0.03015 | 0.01052 | 0.00333 | 0.02263 | 0.03768 | 9.065 | 9 | 0.000 |
| Pair 2 | PKEA-TFIDF | 0.02271 | 0.00878 | 0.00278 | 0.01642 | 0.02899 | 8.175 | 9 | 0.000 |
| Pair 3 | PKEA-RAKE | 0.03482 | 0.01177 | 0.00372 | 0.02640 | 0.04325 | 9.354 | 9 | 0.000 |
| Pair 4 | PKEA-TextRank | 0.02937 | 0.02150 | 0.00680 | 0.01399 | 0.04475 | 4.319 | 9 | 0.002 |
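The reported t statistics can be cross-checked from the summary statistics in the table above, since for a paired t-test t = (mean of differences) / (std of differences / sqrt(n)) with n = 10 cross-validation folds:

```python
import math

def paired_t_from_summary(mean_diff, std_diff, n):
    """t statistic of a paired t-test recovered from summary statistics:
    t = mean_diff / (std_diff / sqrt(n))."""
    return mean_diff / (std_diff / math.sqrt(n))

# (comparison, mean difference, std of differences, reported t) from the table.
rows = [
    ("PKEA-Frequency", 0.03015, 0.01052, 9.065),
    ("PKEA-TFIDF",     0.02271, 0.00878, 8.175),
    ("PKEA-RAKE",      0.03482, 0.01177, 9.354),
    ("PKEA-TextRank",  0.02937, 0.02150, 4.319),
]
for name, mean_d, std_d, t_reported in rows:
    t = paired_t_from_summary(mean_d, std_d, 10)
    # Agrees with the reported t up to rounding of the summary statistics.
```

With raw per-fold F1 scores one would instead call `scipy.stats.ttest_rel` directly; only the summary statistics are available here.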
Performance comparison of the algorithms on the SemEval-2010 dataset.
| Method | Assigned by | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@15 | R@15 | F1@15 |
|---|---|---|---|---|---|---|---|---|---|---|
| TF-IDF | R | 17.8% | 7.4% | 10.4% | 13.9% | 11.5% | 12.6% | 11.6% | 14.5% | 12.9% |
| | C | 22.0% | 7.5% | 11.2% | 17.7% | 12.1% | 14.4% | 14.9% | 15.3% | 15.1% |
| NB | R | 16.8% | 7.0% | 9.9% | 13.3% | 11.1% | 12.1% | 11.4% | 14.2% | 12.7% |
| | C | 21.4% | 7.3% | 10.9% | 17.3% | 11.8% | 14.0% | 14.5% | 14.9% | 14.7% |
| ME | R | 16.8% | 7.0% | 9.9% | 13.3% | 11.1% | 12.1% | 11.4% | 14.2% | 12.7% |
| | C | 21.4% | 7.3% | 10.9% | 17.3% | 11.8% | 14.0% | 14.5% | 14.9% | 14.7% |
| PKEA | R | 20.0% | 8.5% | 11.9% | 15.8% | 12.7% | 14.1% | 13.2% | 15.8% | 14.4% |
| | C | 24.6% | 8.6% | 12.8% | 19.4% | 12.9% | 15.5% | 16.1% | 15.6% | 15.9% |
| HUMB | R | 30.4% | 12.6% | 17.8% | 24.8% | 20.6% | 22.5% | 21.2% | 26.4% | 23.5% |
| | C | 39.0% | 13.3% | 19.8% | 32.0% | 21.8% | 26.0% | 27.2% | 27.8% | 27.5% |

P, R, and F1 denote precision, recall, and F1 at the top 5, 10, and 15 candidate key phrases; R and C in the "Assigned by" column denote reader-assigned and combined author- and reader-assigned key phrases, respectively.
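The F1 columns follow from the harmonic mean of precision and recall; a quick consistency check against a few rows of the table (values in %; the tolerance allows for rounding of the reported P and R):

```python
def f1(p, r):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Spot-check rows of the table (percentage values as reported).
assert abs(f1(17.8, 7.4) - 10.4) < 0.1   # TF-IDF, R, top 5
assert abs(f1(24.6, 8.6) - 12.8) < 0.1   # PKEA, C, top 5
assert abs(f1(32.0, 21.8) - 26.0) < 0.1  # HUMB, C, top 10
```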