Wei Ren, Hengwei Zhang, Ming Chen.
Abstract
Currently, there is no domain dictionary for electric vehicle disassembly, and dictionary construction algorithms from other domains do not accurately extract terminology from disassembly text because the terminology is complex and variable. Constructing a domain dictionary for the disassembly of electric vehicles therefore has important research significance. Extracting high-quality keywords from text and categorizing them is widely used in information mining and is the basis of named entity recognition, relation extraction, knowledge question answering and other information recognition and extraction tasks in the disassembly domain. In this paper, we propose a supervised learning dictionary construction algorithm based on multi-dimensional features, which combines different features to extract candidate keywords from the text of each scientific study. Keyword recognition is treated as a binary classification problem in which a LightGBM model filters each candidate keyword; the domain dictionary is then expanded based on the pointwise mutual information (PMI) value between each keyword and its category. We use Chinese disassembly manuals, patents and papers to establish a general corpus of disassembly information and then use our model to mine keywords for disassembly parts, disassembly tools, disassembly methods, disassembly processes and other categories. The experiments show that our algorithm significantly improves extraction and categorization performance over traditional algorithms in the disassembly domain. We also investigate the performance of the algorithms and attempt to describe them. Our work sets a benchmark for domain dictionary construction in the field of electric vehicle disassembly, based on a newly developed dataset and multi-class terminology classification.
Keywords: LightGBM; PMI; domain dictionary; keyword extraction; terminology
Year: 2022 PMID: 35327874 PMCID: PMC8947409 DOI: 10.3390/e24030363
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. Flow chart of the model.
Position Feature.
| Description |
|---|
| First occurrence of word in a text |
| Last occurrence of word in a text |
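The two position features can be illustrated with a minimal sketch. The equations from the original table did not survive extraction, so the normalization by text length and the function name `position_features` are assumptions, not the authors' definitions.

```python
def position_features(tokens, word):
    """Return (first, last) relative positions of `word` in `tokens`.

    Positions are 1-based and divided by the text length so they fall in (0, 1].
    This normalization is an assumption; the source equations are not available.
    Returns None when the word does not occur.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    if not positions:
        return None
    n = len(tokens)
    return (positions[0] + 1) / n, (positions[-1] + 1) / n


# Toy tokenized sentence (illustrative only, not from the paper's corpus).
tokens = ["remove", "the", "battery", "pack", "then", "battery", "cover"]
first, last = position_features(tokens, "battery")
```

A word that occurs once gets identical first and last positions, which a classifier can use alongside the frequency features.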
Feature type and description.
| Type | Equation | Description |
|---|---|---|
| TF | | |
| IDF | | |
| TF-IDF | | |
| TTF | | |
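Since the equations in this table were lost, the sketch below uses the textbook definitions of TF, IDF and TF-IDF as a stand-in. These are assumptions and may differ from the exact variants used in the paper; TTF is not covered because its definition cannot be recovered from the table.

```python
import math
from collections import Counter


def tf(term, doc):
    """Term frequency: occurrences of `term` divided by document length."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(N / df); 0.0 if the term never occurs."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)


# Toy corpus of tokenized documents (illustrative only).
docs = [
    ["battery", "pack", "battery"],
    ["remove", "bolt"],
    ["battery", "cover"],
]
score = tf_idf("battery", docs[0], docs)
```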
The parameter settings of the CBOW model.
| Parameter | Size | Window | Min_Count | CBOW_Mean | Sample |
|---|---|---|---|---|---|
| Value | 100 | 10 | 5 | 1 | 0.0001 |
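For readers who want to reproduce this configuration, the sketch below maps the table onto gensim's `Word2Vec` keyword arguments. The source does not say which library was used, so gensim (4.x naming) is an assumption, as is the helper name `train_cbow`.

```python
# Table values expressed as gensim 4.x Word2Vec keyword arguments.
# Assumption: gensim is the library; the source does not name one.
CBOW_PARAMS = {
    "vector_size": 100,  # "Size" in the table (called `size` in gensim 3.x)
    "window": 10,
    "min_count": 5,
    "cbow_mean": 1,      # average the context vectors instead of summing them
    "sample": 1e-4,      # downsampling threshold for frequent words
    "sg": 0,             # 0 selects CBOW, 1 would select skip-gram
}


def train_cbow(sentences):
    """Train a CBOW model on an iterable of token lists."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    return Word2Vec(sentences=sentences, **CBOW_PARAMS)
```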
Comparison of results for different k.
| k | Precision | Recall | F1 |
|---|---|---|---|
| 3 | 0.36 | 0.64 | 0.47 |
| 4 | 0.41 | 0.65 | 0.48 |
| 5 | 0.45 | 0.62 | 0.53 |
| 6 | 0.47 | 0.58 | 0.52 |
| 7 | 0.54 | 0.49 | 0.50 |
| 8 | 0.55 | 0.46 | 0.49 |
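F1 is conventionally the harmonic mean of precision and recall. The helper below is a minimal sketch of that definition for checking rows of such a table. The source does not state how its F1 values were aggregated (for example, per document before averaging), so recomputed values need not match the reported column.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for the k = 6 row from its precision and recall.
f1_k6 = f1_score(0.47, 0.58)
```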
Parameters required for LightGBM.
| Parameter | Value |
|---|---|
| Boosting_Type | Gbdt |
| Objective | Binary |
| Metric | Binary_Logloss, Auc |
| Num_Leaves | 5 |
| Max_Depth | 6 |
| Min_Data_In_Leaf | 450 |
| Learning_Rate | 0.1 |
| Feature_Fraction | 0.9 |
| Bagging_Fraction | 0.95 |
| Bagging_Freq | 5 |
| Reg_Alpha | 1 |
| Reg_Lambda | 0.001 |
| Min_Gain_To_Split | 0.2 |
| Verbose | 5 |
| Is_Unbalance | TRUE |
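The settings above can be written as a LightGBM parameter dictionary, as in the sketch below. Parameter names are lower-cased to match the library's spelling. The helper `train_keyword_filter` and its `num_boost_round` default are hypothetical: the source does not report the number of boosting rounds or the training code.

```python
# Table values as a LightGBM parameter dict (names lower-cased for the library).
LGB_PARAMS = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": ["binary_logloss", "auc"],
    "num_leaves": 5,
    "max_depth": 6,
    "min_data_in_leaf": 450,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.95,
    "bagging_freq": 5,
    "reg_alpha": 1,
    "reg_lambda": 0.001,
    "min_gain_to_split": 0.2,
    "verbose": 5,
    "is_unbalance": True,
}


def train_keyword_filter(features, labels, num_boost_round=100):
    """Fit a binary keyword / non-keyword classifier.

    `num_boost_round=100` is an assumed value; the source does not state it.
    """
    import lightgbm as lgb  # imported lazily; requires lightgbm

    train_set = lgb.Dataset(features, label=labels)
    return lgb.train(LGB_PARAMS, train_set, num_boost_round=num_boost_round)
```

With `min_data_in_leaf` set to 450, a training set needs at least several hundred candidate keywords before any split can be made.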
Figure 2. Comparison of results for different features.
Comparison of results for different keyword extraction algorithms.
| Num | Algorithm | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | TFIDF | 0.66 | 0.55 | 0.65 |
| 2 | TextRank | 0.35 | 0.42 | 0.47 |
| 3 | YAKE | 0.41 | 0.49 | 0.43 |
| 4 | TFIDF-TextRank | 0.75 | 0.65 | 0.64 |
| 5 | BERT | 0.83 | 0.71 | 0.69 |
| 6 | CBOW | 0.63 | 0.64 | 0.58 |
| 7 | Skipgram | 0.64 | 0.66 | 0.61 |
| 8 | Multi-Dimensional Features | 0.95 | 0.61 | 0.78 |
Results for different classification algorithms.
| Classification Algorithm | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| SVM | 93.65 | 92.60 | 92.59 |
| K neighbors | 93.65 | 91.63 | 92.66 |
| Random forest | 94.96 | 92.94 | 93.95 |
| LightGBM | 99.69 | 99.54 | 99.42 |
Results for different PMI values.
| Terms | PMI | Size |
|---|---|---|
| Parts | 0 | 410 |
| Processes | 0 | 118 |
| Methods | 0 | 94 |
| Tools | 0 | 195 |
| Other | 0 | 48 |
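As a minimal sketch of how PMI can assign a keyword to a category, the code below computes pointwise mutual information from co-occurrence counts and picks the best-scoring category. The count-based estimate, the base-2 logarithm, the default threshold of 0.0 and the function names are all assumptions; the source does not give its exact formulation.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from raw counts."""
    if count_xy == 0:
        return float("-inf")  # the pair never co-occurs
    p_xy = count_xy / total
    return math.log2(p_xy / ((count_x / total) * (count_y / total)))


def best_category(keyword_count, cooccurrence, category_counts, total, threshold=0.0):
    """Return the category with the highest PMI, or None if none exceeds `threshold`."""
    scores = {
        cat: pmi(cooccurrence.get(cat, 0), keyword_count, category_counts[cat], total)
        for cat in category_counts
    }
    cat = max(scores, key=scores.get)
    return cat if scores[cat] > threshold else None


# Toy counts (illustrative only): a keyword seen 20 times in 1000 text windows.
category = best_category(
    keyword_count=20,
    cooccurrence={"Parts": 10, "Tools": 1},
    category_counts={"Parts": 50, "Tools": 100},
    total=1000,
)
```

A positive PMI means the keyword and the category co-occur more often than independence would predict, which is why a threshold at or above zero is a natural cut-off in this sketch.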
Results for different model algorithms.
| Extraction Algorithm | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| TFIDF + SVM + PMI | 90.12 | 89.63 | 91.65 |
| TFIDF + random forest + PMI | 90.55 | 90.15 | 90.57 |
| TFIDF + LightGBM + PMI | 90.88 | 90.79 | 90.68 |
| TextRank + SVM + PMI | 92.23 | 92.45 | 91.57 |
| TextRank + random forest + PMI | 92.56 | 92.89 | 91.67 |
| TextRank + LightGBM + PMI | 93.01 | 92.50 | 92.60 |
| BERT + SVM + PMI | 94.05 | 93.44 | 92.87 |
| BERT + random forest + PMI | 94.36 | 92.14 | 92.73 |
| BERT + LightGBM + PMI | 94.89 | 93.58 | 92.91 |
| Word2Vec + SVM + PMI | 93.16 | 92.57 | 92.12 |
| Word2Vec + random forest + PMI | 93.46 | 93.01 | 91.25 |
| Word2Vec + LightGBM + PMI | 93.89 | 92.45 | 91.45 |
| Our Model | 98.02 | 95.55 | 95.83 |