Automating the search for a patent's prior art with a full text similarity search
Lea Helmers, Franziska Horn, Franziska Biegler, Tim Oppermann, Klaus-Robert Müller.
Abstract
More than ever, technical inventions are the symbol of our society's advance. Patents guarantee their creators protection against infringement. For an invention to be patentable, its novelty and inventiveness have to be assessed. Therefore, a search for published work describing inventions similar to a given patent application needs to be performed. Currently, this so-called search for prior art is executed with semi-automatically composed keyword queries, which is not only time consuming but also prone to errors. In particular, errors may systematically arise from the fact that different disciplines may use different keywords for the same technical concepts. In this paper, a novel approach is proposed in which the full text of a given patent application is compared to existing patents using machine learning and natural language processing techniques to automatically detect inventions similar to the one described in the submitted document. Various state-of-the-art approaches for feature extraction and document comparison are evaluated. In addition, the quality of the current search process is assessed based on the ratings of a domain expert. The evaluation results show that our automated approach, besides accelerating the search process, also improves the quality of the search results for prior art.
Year: 2019 PMID: 30830911 PMCID: PMC6398827 DOI: 10.1371/journal.pone.0212103
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Illustration of the presented novel approach to the search for a patent's prior art.
First, a dataset of patent applications is obtained from a patent database using a few manually selected seed patents and recursively including the patent applications they cite. Then, the patent texts are transformed into feature vectors, and the similarity between two documents is computed based on these feature vectors. Finally, patents that are considered very similar to a new target patent application are returned as possible prior art. An appropriate similarity measure for this process should assign high similarity scores to related patents (e.g. where one patent was cited in the search report of the other) and low scores to unrelated (randomly paired) patents. We compare different similarity measures by quantifying the overlap between the respective similarity score distributions of pairs of related documents and randomly paired patents using the AUC score.
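As a rough illustration of this pipeline, here is a minimal sketch in Python (assuming scikit-learn; the corpus, pair lists, and variable names are hypothetical placeholders, not the paper's setup): it builds BOW/tf-idf vectors, scores document pairs by cosine similarity, and computes the AUC separating cited from random pairs.

```python
# Hypothetical sketch of the Fig 1 pipeline: tf-idf (BOW) features,
# cosine similarity between patent pairs, and the AUC measuring how well
# the scores separate related (cited) pairs from random pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import roc_auc_score

texts = [
    "a device for wireless data transmission between mobile units",
    "an apparatus transmitting data over short-range radio links",
    "a chemical process for refining crude oil fractions",
]
X = TfidfVectorizer(sublinear_tf=True).fit_transform(texts)
sims = cosine_similarity(X)                       # pairwise similarity matrix

cited_pairs = [(0, 1)]                            # pairs cited in a search report
random_pairs = [(0, 2), (1, 2)]                   # randomly paired patents
scores = [sims[i, j] for i, j in cited_pairs + random_pairs]
labels = [1] * len(cited_pairs) + [0] * len(random_pairs)
print("AUC:", roc_auc_score(labels, scores))
```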
Table 1. Evaluation results on the cited/random dataset.
| Features | Full text | Abstract | Claims |
|---|---|---|---|
| BOW | — | 0.8620 | 0.8656 |
| LSA | 0.9361 | 0.8579 | 0.8561 |
| KPCA | 0.9207 | 0.8377 | 0.8250 |
| word2vec | 0.9410 | 0.8618 | 0.8525 |
| doc2vec | 0.9314 | — | — |
AUC values when computing the cosine similarity with BOW, LSA, KPCA, word2vec, and doc2vec features constructed from different patent sections of the cited/random dataset.
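For the dimensionality-reduced variants compared in this table, LSA features can be obtained, for example, as a truncated SVD of the tf-idf matrix. A minimal sketch (assuming scikit-learn; the toy corpus and the number of components are illustrative choices, not the paper's settings):

```python
# Illustrative LSA features: truncated SVD of a tf-idf matrix, then cosine
# similarity in the reduced space. Corpus and dimensionality are toy choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "a device for wireless data transmission between mobile units",
    "an apparatus transmitting data over short-range radio links",
    "a chemical process for refining crude oil fractions",
]
X = TfidfVectorizer().fit_transform(texts)

svd = TruncatedSVD(n_components=2, random_state=0)  # tiny value for a tiny corpus
X_lsa = svd.fit_transform(X)                        # dense low-dimensional vectors
print(cosine_similarity(X_lsa))
```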
Fig 2. Distributions of cosine similarity scores.
Similarity scores for the patent pairs are computed using BOW feature vectors generated either from the full texts (left) or only from the claims sections (right). The scale on the y-axis is irrelevant and was therefore omitted.
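A plot of this kind can be reproduced for one's own scores roughly as follows (hedged sketch assuming matplotlib; the synthetic samples merely stand in for the real score distributions):

```python
# Toy comparison of score distributions for related vs. random pairs, as in Fig 2.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
related = rng.normal(0.6, 0.15, 500).clip(0, 1)   # cited pairs: higher scores
random_ = rng.normal(0.2, 0.10, 500).clip(0, 1)   # random pairs: lower scores

plt.hist(related, bins=40, alpha=0.6, density=True, label="cited pairs")
plt.hist(random_, bins=40, alpha=0.6, density=True, label="random pairs")
plt.xlabel("cosine similarity")
plt.yticks([])                 # the y-scale is irrelevant, as in the figure
plt.legend()
plt.show()
```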
Table 2. Confusion matrix for the dataset subsample.
| | cited | random |
|---|---|---|
| relevant | 65 | 18 |
| irrelevant | 86 | 281 |
The original cited/random labelling is compared to the more accurate relevant/irrelevant labels.
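Such a cross-tabulation of the two labellings can be computed, for instance, with pandas (hedged sketch; the toy labels below are placeholders, not the paper's rated pairs):

```python
# Cross-tabulating expert relevance ratings against the original
# cited/random labels (toy data; pandas assumed).
import pandas as pd

cited_random = ["cited", "cited", "random", "random", "cited"]
relevant_irr = ["relevant", "irrelevant", "irrelevant", "relevant", "relevant"]
print(pd.crosstab(pd.Series(relevant_irr, name="expert rating"),
                  pd.Series(cited_random, name="original label")))
```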
Table 3. Correlations between labels and similarity scores on the dataset subsample.
| | cited/random | relevant/irr. |
|---|---|---|
| cosine similarity | 0.501 | 0.652 |
| relevant/irr. | 0.592 | — |
Spearman’s ρ between the cosine similarity (calculated with BOW feature vectors) and the relevant/irrelevant and cited/random labellings.
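Spearman's ρ between similarity scores and a binary labelling can be computed with scipy (sketch; the numbers are illustrative, not taken from the paper):

```python
# Rank correlation between cosine similarity scores and binary relevance labels.
from scipy.stats import spearmanr

sim_scores = [0.91, 0.73, 0.42, 0.18, 0.65]
relevant = [1, 1, 0, 0, 1]                 # 1 = relevant, 0 = irrelevant
rho, p = spearmanr(sim_scores, relevant)
print(f"Spearman's rho = {rho:.3f} (p = {p:.3f})")
```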
Fig 3. Score correlation for the patent with ID US20150018885.
A false negative (ID US20110087291) caught by the cosine similarity is circled in gray.
Table 4. Summary of evaluation results.
| Features | AUC relevant (subsample) | AUC cited (subsample) | AUC cited (full) | AP relevant (subsample) | AP cited (subsample) | AP cited (full) |
|---|---|---|---|---|---|---|
| BOW | 0.8118 | 0.8063 | — | 0.5274 | 0.7095 | — |
| LSA | 0.7798 | 0.7075 | 0.9361 | 0.4787 | 0.5921 | 0.3257 |
| KPCA | 0.7441 | 0.6740 | 0.9207 | 0.4721 | 0.5832 | 0.2996 |
| word2vec | — | — | 0.9410 | 0.4019 | — | — |
| doc2vec | 0.7658 | 0.8138 | 0.9314 | 0.4749 | 0.6829 | 0.3121 |
AUC and average precision (AP) scores for the different feature extraction methods on the dataset subsample with cited/random and relevant/irrelevant labelling, as well as on the full dataset.
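Both metrics in this table can be computed from similarity scores and binary labels with scikit-learn (hedged sketch with toy values, not the paper's data):

```python
# ROC AUC and average precision (AP) for one set of similarity scores.
from sklearn.metrics import roc_auc_score, average_precision_score

labels = [1, 0, 1, 0, 0, 1]      # 1 = cited/relevant, 0 = random/irrelevant
scores = [0.80, 0.30, 0.65, 0.45, 0.20, 0.90]
print("AUC:", roc_auc_score(labels, scores))
print("AP: ", average_precision_score(labels, scores))
```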