Kelle Fortunato Costa1, Fabrício Almeida Araújo2,3, Jefferson Morais4, Carlos Renato Lisboa Frances1, Rommel T J Ramos5.
Abstract
Antimicrobial resistance is a significant public health problem worldwide. In recent years, the scientific community has intensified efforts to combat this problem: many experiments have been conducted and many articles published in this area. However, the growing volume of biological literature makes biocuration more difficult because of the cost and time it requires. Modern text mining tools that adopt artificial intelligence can assist this research. In this article, we propose a text mining model capable of identifying and ranking (prioritizing) scientific articles in the context of antimicrobial resistance. We retrieved scientific articles from the PubMed database, adopted machine learning techniques to generate vector representations of the retrieved articles, and measured their similarity to the context. As a result of this process, we obtained a dataset labeled "Relevant" and "Irrelevant" and used it to train a supervised learning algorithm to classify new records. The model's overall accuracy reached 90%, and the F-measure (the harmonic mean of precision and recall) reached 82% for the positive class and 93% for the negative class, showing quality in the identification of scientific articles relevant to the context. The dataset, scripts and models are available at https://github.com/engbiopct/TextMiningAMR.
Keywords: Antimicrobial resistance; Biological literature; Machine learning; Text mining
Year: 2022 PMID: 35539017 PMCID: PMC9080439 DOI: 10.7717/peerj.13351
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 3.061
Figure 1. Proposed TM model.
Steps (A) and (B) include retrieving the information. Steps (C) and (D) include the recognition of entities and the discovery of knowledge, resulting in a metric (cosine similarity) responsible for determining the binary classification performed in step (E).
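The role of cosine similarity in steps (C) to (E) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy vectors, and the `threshold` value of 0.5 are assumptions introduced here, since the source does not state the cut-off used to separate the two labels.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def label_by_similarity(article_vec, context_vec, threshold=0.5):
    """Binary label from similarity to the context vector.

    `threshold` is a hypothetical cut-off, not a value taken from the source.
    """
    similarity = cosine_similarity(article_vec, context_vec)
    return "Relevant" if similarity >= threshold else "Irrelevant"


# Toy 3-dimensional vectors standing in for document embeddings.
context = [1.0, 0.0, 1.0]
near = [0.9, 0.1, 0.8]
far = [0.0, 1.0, 0.0]

print(label_by_similarity(near, context))
print(label_by_similarity(far, context))
```

In practice the vectors would be the document embeddings produced in the earlier steps, and the threshold would need to be tuned against expert-labeled examples.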
E-Search parameters for PubMed Central.
| Parameters | Value |
|---|---|
| URL | |
| db | PMC (full text articles) |
| Term | (“drug resistance, microbial”[MeSH Terms] OR (“drug”[All Fields] AND “resistance”[All Fields] AND “microbial”[All Fields]) OR “microbial drug resistance”[All Fields] OR (“drug”[All Fields] AND “resistance”[All Fields] AND “microbial”[All Fields]) OR “drug resistance, microbial”[All Fields]) |
| Free text articles | Open access |
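A query of this kind could be assembled as in the sketch below. Several parts are assumptions rather than details from the source: the URL value is not given in this copy of the table, so the standard public NCBI E-utilities ESearch endpoint is used; the `"open access"[filter]` clause is one plausible way to express the open-access restriction; and `retmax`, `retmode`, and the helper name `build_esearch_url` are illustrative. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: standard public ESearch endpoint (the table's URL cell is empty).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search term from the table, written with straight quotes.
TERM = (
    '("drug resistance, microbial"[MeSH Terms] OR '
    '("drug"[All Fields] AND "resistance"[All Fields] AND "microbial"[All Fields]) OR '
    '"microbial drug resistance"[All Fields] OR '
    '("drug"[All Fields] AND "resistance"[All Fields] AND "microbial"[All Fields]) OR '
    '"drug resistance, microbial"[All Fields])'
)


def build_esearch_url(term, db="pmc", retmax=100, open_access=True):
    """Return an ESearch URL for `term`; no network request is made."""
    if open_access:
        # Assumption: restrict to the open-access subset with a filter clause.
        term = f'{term} AND "open access"[filter]'
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url(TERM)
print(url)
```

The returned URL can be fetched with any HTTP client; the response lists the matching record identifiers, which a separate fetch step would then resolve to full text.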
Doc2Vec algorithm parameters.
| Parameters | Value | Description |
|---|---|---|
| | 300 | Dimensionality of the feature vectors |
| | 0.025 | The initial learning rate |
| | 0.00025 | Learning rate will linearly drop to |
| | 18 | Use these many worker threads to train the model (= faster training with multicore machines) |
| | 3 | Ignores all words with total frequency lower than this |
| | 30 | Number of iterations (epochs) over the |
| | 1 | Defines the training algorithm. If |
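The values in this table could be passed to a Doc2Vec implementation roughly as follows. This sketch assumes the gensim library, and the mapping of each value to a keyword name (`vector_size`, `alpha`, `min_alpha`, `workers`, `min_count`, `epochs`, `dm`) is inferred from the descriptions, because the parameter names are missing from this copy of the table. The helper `train_doc2vec` is hypothetical and imports gensim only when called.

```python
# Assumption: keyword names inferred from the table's descriptions.
DOC2VEC_PARAMS = {
    "vector_size": 300,    # dimensionality of the feature vectors
    "alpha": 0.025,        # initial learning rate
    "min_alpha": 0.00025,  # value the learning rate drops to during training
    "workers": 18,         # worker threads
    "min_count": 3,        # ignore words with lower total frequency
    "epochs": 30,          # iterations over the corpus
    "dm": 1,               # 1 selects PV-DM, as named in the Figure 4 caption
}


def train_doc2vec(tokenized_docs, params=DOC2VEC_PARAMS):
    """Train a model on a list of token lists (requires gensim)."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document gets its position in the list as its tag.
    corpus = [
        TaggedDocument(words=tokens, tags=[index])
        for index, tokens in enumerate(tokenized_docs)
    ]
    model = Doc2Vec(**params)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

A trained model can then embed unseen text with `model.infer_vector(tokens)`, which yields the kind of vector compared by cosine similarity in the labeling step.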
Figure 2. Evaluation of the proposed method.
Figure 3. Performance of predictions with automatically labeled data.
Figure 4. SVM classifier confusion matrix for dataset_1 (PV-DM).
Figure 5. SVM classifier confusion matrix for dataset_1 (Bag of Words).
Classifier performance assessment.
| Class | Precision | Recall | F1-score | Support | Accuracy |
|---|---|---|---|---|---|
| 0 | 0.74 | 0.93 | 0.82 | 15 | 0.90 |
| 1 | 0.98 | 0.89 | 0.93 | 47 | |
| 0 | 0.21 | 0.60 | 0.32 | 15 | 0.37 |
| 1 | 0.70 | 0.30 | 0.42 | 47 | |
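The relationship between the columns of this table can be checked with a few lines of arithmetic. The sketch below recomputes the F1-score as the harmonic mean of precision and recall, and the overall accuracy as the support-weighted mean of per-class recall. The function names are illustrative, and because the inputs are already rounded to two decimals, a recomputed value may differ from the reported one in the last digit (the row reporting 0.32 comes out near 0.31).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy_from_recall(recalls, supports):
    """Overall accuracy as the support-weighted mean of per-class recall."""
    correct = sum(r * n for r, n in zip(recalls, supports))
    return correct / sum(supports)


# First pair of rows in the table (classes 0 and 1).
print(round(f1_score(0.74, 0.93), 2))
print(round(f1_score(0.98, 0.89), 2))
print(round(accuracy_from_recall([0.93, 0.89], [15, 47]), 2))

# Second pair of rows.
print(round(f1_score(0.70, 0.30), 2))
print(round(accuracy_from_recall([0.60, 0.30], [15, 47]), 2))
```

The accuracy identity holds because recall times support gives the number of correctly classified items in each class.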
Results of labeling and classification steps vs experts.
| | Relevant (%) | Irrelevant (%) |
|---|---|---|
| Labeling | | |
| Dataset_1 | 80 | 68 |
| Dataset_2 | 66 | 34 |
| Classification | | |
| SVM_1 | 93 | 89 |
| SVM_2 | 60 | 29 |