Sunhye Kim, Inchae Park, Byungun Yoon.
Abstract
In natural-language processing, the subject-action-object (SAO) structure is used to convert unstructured textual data into structured textual data comprising subjects, actions, and objects. This structure is suitable for analyzing the key elements of technology, as well as the relationships between these elements. However, analysis using the existing SAO structure requires a substantial number of manual processes because this structure does not represent the context of the sentences. Thus, we introduce the concept of SAO2Vec, in which SAO is used to embed the vectors of sentences and documents, for use in text mining in the analysis of technical documents. First, the technical documents of interest are collected, and SAO structures are extracted from them. Then, sentence vectors are extracted through the Doc2Vec algorithm and are updated using word vectors in the SAO structure. Finally, SAO vectors are drawn using an updated sentence vector with the same SAO structure. In addition, document vectors are derived from the document's SAO vectors. The results of an experiment in the Internet of Things field indicate that the SAO2Vec method produces 3.1% better accuracy than the Doc2Vec method and 115.0% better accuracy than SAO frequency alone. This proves that the proposed SAO2Vec algorithm can be used to improve grouping and similarity analysis by including both the meanings and the contexts of technical elements.
Year: 2020 PMID: 32023289 PMCID: PMC7001927 DOI: 10.1371/journal.pone.0227930
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
List of patents registered by CPC, 2017 [28].
| Rank | CPC code | Register Num (2017) | Register Num (2016) | Growth Rate |
|---|---|---|---|---|
| 1 | G06F | 48935 | 47919 | 2.12% |
| 2 | H04L | 33575 | 31414 | 6.88% |
| 3 | H01L | 26989 | 26741 | 0.93% |
| 4 | H04W | 21258 | 19390 | 9.63% |
| Average Growth Rate | | | | 2% |
a. CPC code “G06F” is about electrical digital data processing.
b. CPC code “H04L” is about transmission of digital information.
c. CPC code “H01L” is about semiconductor devices; electric solid state devices.
d. CPC code “H04W” is about wireless communications networks.
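The growth rates in the table can be checked with a short calculation. The sketch below assumes the growth rate is defined as the change from 2016 to 2017 divided by the 2016 count, which reproduces the listed percentages; the names `registrations` and `growth_rate` are illustrative and do not come from the source.

```python
# Register counts copied from the table above: (2017, 2016).
registrations = {
    "G06F": (48935, 47919),
    "H04L": (33575, 31414),
    "H01L": (26989, 26741),
    "H04W": (21258, 19390),
}


def growth_rate(current: int, previous: int) -> float:
    """Year-over-year growth as a percentage of the previous year's count.

    ASSUMPTION: this definition is inferred from the table values,
    not stated in the source.
    """
    return (current - previous) / previous * 100


growth = {
    code: round(growth_rate(n_2017, n_2016), 2)
    for code, (n_2017, n_2016) in registrations.items()
}
print(growth)
```

The 2% average in the last row is not the mean of these four rates, so it presumably covers more CPC codes than the four shown here.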
Fig 1. Doc2Vec models: DM model and DBOW model [14].
Fig 2. Research framework: the framework of this study has three steps.
The first step is to collect data and extract the SAO structures, which are used in the later steps. The second step is to extract word and sentence vectors using the Doc2Vec algorithm and to update the sentence vectors. In the final step, SAO vectors and sentence vectors are obtained.
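The data flow of the later steps can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the source does not give the update formula or the aggregation rule, so the sketch assumes that a sentence vector is updated by adding the mean of its S, A, and O word vectors, that an SAO vector is the mean of the updated sentence vectors sharing a label within a patent, and that a document vector is the mean of the patent's SAO vectors. All function names and the toy 2-dimensional data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def update_sentence_vector(sentence_vec, sao_word_vecs):
    # ASSUMPTION: the update adds the mean of the S, A, and O word vectors
    # to the Doc2Vec sentence vector; the source does not state the formula.
    sao_mean = mean_vector(sao_word_vecs)
    return [s + w for s, w in zip(sentence_vec, sao_mean)]


def sao_vectors(records):
    """records: iterable of (patent_num, label, updated_sentence_vec)."""
    grouped = defaultdict(list)
    for patent_num, label, vec in records:
        grouped[(patent_num, label)].append(vec)
    # ASSUMPTION: one SAO vector per (patent, label), the mean of its
    # updated sentence vectors.
    return {key: mean_vector(vecs) for key, vecs in grouped.items()}


def document_vectors(sao_vecs):
    by_patent = defaultdict(list)
    for (patent_num, _label), vec in sao_vecs.items():
        by_patent[patent_num].append(vec)
    # ASSUMPTION: a document vector is the mean of the document's SAO vectors.
    return {patent: mean_vector(vecs) for patent, vecs in by_patent.items()}


# Toy 2-dimensional data (hypothetical; the vectors in the study have 200 dimensions).
updated = update_sentence_vector([1.0, 0.0], [[1.0, 1.0], [0.0, 1.0], [2.0, 1.0]])
records = [(1, 32, updated), (1, 32, [0.0, 1.0]), (1, 93, [4.0, 4.0])]
docs = document_vectors(sao_vectors(records))
print(docs)
```

Keying the SAO vectors by patent and label means that the same label can carry a different vector in each patent.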
Fig 3. Example of SAO extraction.
Fig 4. Word and sentence vector embedding and updating.
Fig 5. An algorithm for determining activeness of each document.
Collect data.
| DB | US Patent |
|---|---|
| Keyword | IoT: device-to-device, D2D, machine-to-machine, M2M, internet and things, IoT; Transmission and wireless communications networks: transmission, transfer, data, information, signal |
| Period | 20080101 ~ 20180101 |
| Results | 941 |
| USPTO link |
Example of prepared learning data in the IoT field.
| Patent num | Sentence num | Sentence |
|---|---|---|
| 2 | 102 | transmit power level control for the discovery signal transmission |
| 4 | 218 | the non-transitory computer-readable storage device of claim, wherein the instructions further cause the use to perform operations to: encode the discovery signal for the additional transmissions using additional discovery resources from at least a second sub-discovery resource pool of the plurality of sub-discovery resource pools |
a. Patent num is the index number of a patent.
b. Sentence num is the index number of the sentence from a patent.
Example of SAO structures extracted from the learning data.
| Patent num | Sentence num | SAO num | S | A | O |
|---|---|---|---|---|---|
| 2 | 218 | 475 | instructions | perform | operation |
| 2 | 218 | 476 | operation | encode | discovery signal |
| 2 | 218 | 477 | additional transmission | use | additional discovery resource |
a. Patent num is the index number of a patent.
b. Sentence num is the index number of the sentence from a patent.
c. SAO num is the index number of the SAO structure from a patent.
d. S is the subject word in the SAO structure.
e. A is the action word in the SAO structure.
f. O is the object word in the SAO structure.
Fig 6. Visualization of word vector ‘technique’.
Top 5 words most similar to “technique” and “transmission” in the Doc2Vec result, ranked by cosine similarity.
| Word | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| Technique | fusing | Technology | converging | Procedure | disclosure |
| transmission | communication | transmitting | resource | subframe | wireless |
a. “Top x” is the word with the xth-highest cosine similarity to the given word.
Example of the average similarity between sentence vector and word vector.
| Avg_similarity | S vector | A vector | O vector |
|---|---|---|---|
| Sen vector | -0.03385 | 0.00255 | -0.01029 |
| FE sen vector | 0.02575 | 0.00946 | 0.04026 |
a. Avg_similarity is the average similarity over all pairs of vectors.
b. S vector is the average similarity over all pairs of Subject vectors.
c. A vector is the average similarity over all pairs of Action vectors.
d. O vector is the average similarity over all pairs of Object vectors.
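For reference, the averages in this table can be written out as follows. The notation is an assumption made here for illustration: the cosine similarity is the standard definition, $\mathbf{s}_i$ stands for a sentence vector, $\mathbf{w}_i$ for the S, A, or O word vector paired with it, and $N$ for the number of such pairs. The source does not state the exact pairing.

```latex
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert},
\qquad
\overline{\mathrm{sim}} = \frac{1}{N} \sum_{i=1}^{N} \cos(\mathbf{s}_i, \mathbf{w}_i)
```

Under this reading, the shift from slightly negative to positive averages in the S and O columns means that the FE sentence vectors point, on average, closer to their own subject and object words than the original sentence vectors do.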
Fig 7. Visualization of the 5285th sentence vector.
Fig 8. Visualization of the 5285th FE sentence vector.
Example of labeled SAO structures: labels 32 and 93.
| Label | Patent num | Sentence num | SAO num | S | A | O |
|---|---|---|---|---|---|---|
| 32 | 2 | 218 | 475 | instructions | perform | operation |
| 32 | 2 | 230 | 496 | instructions | perform | task |
| 32 | 405 | 14671 | 34778 | - | perform | operation |
| 93 | 2 | 218 | 476 | operation | encode | discovery signal |
| 93 | 4 | 206 | 444 | operation | encode | information |
| 93 | 267 | 9736 | 22707 | operation | comprise | signal |
A total of 5,901 labels were derived from 79,303 SAO structures. Label 32 is a set of 106 SAO structures and label 93 is a set of 149 SAO structures; the table shows three examples of each.
a. Label is the index of the unique SAO structure.
b. Patent num is the index number of a patent.
c. Sentence num is the index number of the sentence from a patent.
d. SAO num is the index number of the SAO structure from a patent.
e. S is the subject word in the SAO structure.
f. A is the action word in the SAO structure.
g. O is the object word in the SAO structure.
Example of SAO vectors: sentences 1134, 2261, and 206 in patent 218.
| Patent_num | Sentence_num | SAO vector |
|---|---|---|
| 218 | 1134 | (0.491999, 1.288587, 0.317469, …, -0.432819, 0.162475, 0.000578) |
| 218 | 2261 | (0.530694, 0.459372, 0.197969, …, -1.144108, -0.387236, -0.423902) |
| … | … | … |
| 218 | 206 | (0.634885, 1.22471, 0.558733, …, -0.544525, -0.155806, 0.17755) |
Patent No. 218 contains sentences with 35 SAO structures, three of which are represented in the table.
a. Patent num is the index number of a patent.
b. Sentence num is the index number of the sentence from a patent.
c. SAO vector is the 200-dimensional vector of the SAO structure.
Uses of an SAO vector with the same label.
| Patent_num | Label | SAO vector |
|---|---|---|
| 294 | 32 | (0.597186, 0.546221, 0.097256, -0.112597, 0.121704, 0.521426, …) |
| 303 | 32 | (0.595341, 0.474635, 0.180812, 0.745539, 0.436642, 0.22292, …) |
| 520 | 32 | (0.58665, 0.508009, -0.08149, 0.07969, 0.273633, 0.452418, …) |
Label 32 appeared in a total of 73 patents; the table shows three of them, each with the SAO vector of label 32 in that patent.
a. Patent num is the index number of a patent.
b. Label is the index of the unique SAO structure.
c. SAO vector is the 200-dimensional vector of the SAO structure.
Example of document vectors: patents 1, 2, and 941.
| Patent_num | Document vector |
|---|---|
| 1 | (0.166322247, 0.4541985, 0.055633331, …, -0.023517468, 0.156254383, -0.128338305) |
| 2 | (0.051742255, 0.193958759, 0.32022656, …, -0.05796744, 0.433337986, 0.312557369) |
| 941 | (0.302176, 0.811281897, 0.201013195, …, -0.176340402, -0.392004782, -0.195887195) |
Document vectors were calculated for a total of 941 documents, and the table shows three examples.
a. Patent num is the index number of a patent.
b. Document vector is the 200-dimensional vector of the document.
Example of document similarity matrix using SAO frequency.
| similarity | doc1 | doc2 | doc3 | doc4 | doc5 |
|---|---|---|---|---|---|
| doc1 | 1 | 0.027 | 0.032 | 0.030 | 0.033 |
| doc2 | - | 1 | 0.034 | 0.033 | 0.036 |
| doc3 | - | - | 1 | 0.041 | 0.048 |
| doc4 | - | - | - | 1 | 0.043 |
| doc5 | - | - | - | - | 1 |
Example of document similarity matrix using Doc2Vec.
| similarity | doc1 | doc2 | doc3 | doc4 | doc5 |
|---|---|---|---|---|---|
| doc1 | 1 | 0.106 | 0.089 | 0.048 | 0.178 |
| doc2 | - | 1 | 0.148 | 0.576 | 0.184 |
| doc3 | - | - | 1 | 0.424 | 0.304 |
| doc4 | - | - | - | 1 | 0.307 |
| doc5 | - | - | - | - | 1 |
Example of document similarity matrix using SAO2Vec.
| similarity | doc1 | doc2 | doc3 | doc4 | doc5 |
|---|---|---|---|---|---|
| doc1 | 1 | 0.367 | 0.056 | 0.054 | 0.127 |
| doc2 | - | 1 | 0.199 | 0.684 | 0.579 |
| doc3 | - | - | 1 | 0.460 | 0.354 |
| doc4 | - | - | - | 1 | 0.429 |
| doc5 | - | - | - | - | 1 |
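The three matrices above share one layout: an upper-triangular table of pairwise document similarities. The sketch below shows how such a matrix could be built from document vectors. It assumes cosine similarity as the measure, as used elsewhere in this record; the function names and the toy 2-dimensional vectors are hypothetical and are not the study's data.

```python
import math


def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(doc_vectors):
    """Upper-triangular similarity matrix; cells below the diagonal stay None."""
    n = len(doc_vectors)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # a document is identical to itself
        for j in range(i + 1, n):
            matrix[i][j] = cosine_similarity(doc_vectors[i], doc_vectors[j])
    return matrix


# Toy document vectors (hypothetical; the vectors in the study have 200 dimensions).
toy_docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
for row in similarity_matrix(toy_docs):
    print(["-" if cell is None else round(cell, 3) for cell in row])
```

Only the upper triangle is filled because cosine similarity is symmetric, which matches the "-" cells in the tables.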
Patent clustering using spectral clustering.
| | SAO2Vec | Doc2Vec | SAO frequency |
|---|---|---|---|
| Accuracy | 875/941 = 0.93 | 849/941 = 0.90 | 407/941 = 0.43 |
The results of the patent clustering show that the vector based on SAO frequency performed the worst and that the vector based on SAO2Vec performed the best.
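The relative improvements quoted in the abstract (3.1% over Doc2Vec and 115.0% over SAO frequency) follow from these counts. The check below assumes that all three methods were scored on the same 941 patents and that "better accuracy" means relative improvement; the variable names are illustrative.

```python
N_PATENTS = 941  # ASSUMPTION: the same 941 patents were used for every method

# Correctly clustered patents per method, taken from the accuracy table.
correct = {"SAO2Vec": 875, "Doc2Vec": 849, "SAO frequency": 407}
accuracy = {method: count / N_PATENTS for method, count in correct.items()}


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


gain_over_doc2vec = relative_gain(accuracy["SAO2Vec"], accuracy["Doc2Vec"])
gain_over_frequency = relative_gain(accuracy["SAO2Vec"], accuracy["SAO frequency"])

print({method: round(value, 2) for method, value in accuracy.items()})
print(round(gain_over_doc2vec, 1), round(gain_over_frequency, 1))
```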