| Literature DB >> 34585729 |
Kamran Karimi, Sergei Agalakov, Cheryl A Telmer, Thomas R Beatman, Troy J Pells, Bradley I Arshinoff, Carolyn J Ku, Saoirse Foley, Veronica F Hinman, Charles A Ettensohn, Peter D Vize.
Abstract
A keyword-based search of comprehensive databases such as PubMed may return irrelevant papers, especially if the keywords are used in multiple fields of study. In such cases, domain experts (curators) need to verify the results and remove the irrelevant articles. Automating this filtering process will save time, but it has to be done well enough to ensure that few relevant papers are rejected and few irrelevant papers are accepted. A good solution would be fast, work with the limited amount of data freely available (the full paper body may be missing), handle ambiguous keywords and be as domain-neutral as possible. In this paper, we evaluate a number of classification algorithms for identifying a domain-specific set of papers about echinoderm species and show that the resulting tool satisfies most of the above-mentioned requirements. Echinoderms comprise a number of very different organisms, including brittle stars, sea stars (starfish), sea urchins and sea cucumbers. While their taxonomic identifiers are specific, the common names are used in many other contexts, creating ambiguity and making a keyword search prone to error. We try classifiers using Linear, Naïve Bayes, Nearest Neighbor, Tree, SVM, Bagging, AdaBoost and Neural Network learning models and compare their performance. We show how effective the resulting classifiers are in filtering irrelevant articles returned from PubMed. The methodology used depends mainly on a good selection of training data and is a practical solution that can be applied to other fields of study facing similar challenges. Database URL: The code and data reported in this paper are freely available at http://xenbaseturbofrog.org/pub/Text-Topic-Classifier/.
Year: 2021 PMID: 34585729 PMCID: PMC8588847 DOI: 10.1093/database/baab062
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Sklearn classifiers used in our experiments
| Classifiers | Model |
|---|---|
| RidgeClassifier, SGDClassifier, PassiveAggressiveClassifier, LogisticRegression | Linear |
| MultinomialNB, ComplementNB, BernoulliNB | Naïve Bayes |
| DecisionTreeClassifier, RandomForestClassifier | Tree |
| BaggingClassifier | Bagging |
| AdaBoostClassifier | AdaBoost |
| KNeighborsClassifier | K Nearest Neighbors |
| SVC | SVM |
| MLPClassifier | Neural Network |
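The record above names the scikit-learn estimators but not the training pipeline. A minimal sketch of how such models could be compared on abstract text, assuming TF-IDF features; the corpus, labels and vectorizer settings below are illustrative stand-ins, not the paper's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus: 1 = relevant (echinoderm biology),
# 0 = irrelevant use of an ambiguous common name.
docs = [
    "sea urchin embryo gene regulatory network development",
    "starfish larvae regeneration in echinoderm species",
    "sea cucumber gut regeneration transcriptome analysis",
    "brittle star nervous system echinoderm evolution",
    "hollywood starfish celebrity gossip movie review",
    "urchin street slang for a poor city child",
    "sea cucumber salad recipe with vinegar dressing",
    "star fish shaped toy for children bath time",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A subset of the estimators listed in the table above.
models = {
    "RidgeClassifier": RidgeClassifier(),
    "SGDClassifier": SGDClassifier(random_state=0),
    "ComplementNB": ComplementNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    # TF-IDF features feeding each classifier, wrapped in one pipeline.
    clf = make_pipeline(TfidfVectorizer(stop_words="english"), model)
    clf.fit(docs, labels)
    acc = clf.score(docs, labels)  # training accuracy only, for illustration
    print(f"{name}: {acc:.2f}")
```

With real data the toy lists would be replaced by the title/abstract text and curator labels, and the held-out evaluation described in the tables below would be used instead of training accuracy.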
Accuracy values for the training and test data. Best accuracies are highlighted in bold
| Classifier | 20-fold CV average accuracy (%) | 20-fold CV standard deviation | Accuracy for test data (%) |
|---|---|---|---|
| RidgeClassifier | | 4.8 | 88.4 |
| SGDClassifier | 88.2 | 5.1 | 88.6 |
| PassiveAggressiveClassifier | 88.0 | 4.7 | 89.3 |
| LogisticRegression | 87.4 | 5.6 | 86.6 |
| MultinomialNB | 82.4 | 7.9 | 79.5 |
| ComplementNB | 84.1 | 7.0 | 82.0 |
| BernoulliNB | 82.1 | 7.6 | 83.7 |
| DecisionTreeClassifier | 86.4 | 5.9 | 92.7 |
| RandomForestClassifier | 87.4 | 6.0 | 88.0 |
| BaggingClassifier | 87.5 | 5.1 | |
| KNeighborsClassifier | 82.1 | 5.4 | 81.1 |
| AdaBoostClassifier | 84.0 | 6.1 | 91.1 |
| SVC | 88.0 | 4.5 | 89.1 |
| MLPClassifier | 87.1 | 5.7 | 88.2 |
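The 20-fold cross-validation protocol behind the table above can be sketched with scikit-learn's `cross_val_score`. The synthetic feature matrix below is a stand-in for the vectorized abstracts (illustrative only, not the paper's data):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import RidgeClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the vectorized paper abstracts
# (400 samples, 50 features; the real matrix would come from TF-IDF text).
X, y = make_classification(n_samples=400, n_features=50, random_state=0)

# 20-fold CV, as in the table: report mean accuracy and its standard deviation.
scores = cross_val_score(RidgeClassifier(), X, y, cv=20)
print(f"mean={scores.mean() * 100:.1f}%  std={scores.std() * 100:.1f}")
```

For a classifier target, `cross_val_score` uses stratified folds by default, so each of the 20 folds preserves the relevant/irrelevant class balance.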
Precision and recall values for the test dataset. Best results are highlighted in bold
| Classifier | Irrelevant precision (%) | Irrelevant recall (%) | Irrelevant F-score (%) | Relevant precision (%) | Relevant recall (%) | Relevant F-score (%) |
|---|---|---|---|---|---|---|
| RidgeClassifier | 91.6 | 84.1 | 87.7 | 85.8 | 92.6 | 89.1 |
| SGDClassifier | 91.6 | 84.5 | 87.9 | 86.2 | 92.6 | 89.3 |
| PassiveAggressiveClassifier | 91.7 | 78.6 | 85.2 | 82.1 | 94.3 | 87.8 |
| LogisticRegression | 93.0 | 78.6 | 85.2 | 82.1 | 94.3 | 87.8 |
| MultinomialNB | 95.7 | 60.9 | 74.4 | 72.2 | 97.4 | 82.9 |
| ComplementNB | 92.6 | 68.6 | 78.9 | 75.9 | 94.8 | 84.3 |
| BernoulliNB | 96.2 | 69.5 | 80.7 | 76.9 | 97.4 | 85.9 |
| DecisionTreeClassifier | 93.9 | 91.5 | 92.4 | 90.9 | 94.3 | 92.9 |
| RandomForestClassifier | 95.1 | 79.5 | 86.5 | 83.0 | 96.1 | 89.1 |
| BaggingClassifier | | | | | | |
| KNeighborsClassifier | 89.0 | 70.0 | 78.4 | 76.1 | 91.7 | 83.2 |
| AdaBoostClassifier | 94.1 | 87.3 | 90.6 | 88.6 | 94.8 | 91.6 |
| SVC | 88.8 | 86.8 | 87.8 | 87.6 | 89.5 | 88.6 |
| MLPClassifier | 94.9 | 84.1 | 89.2 | 86.2 | 95.6 | 90.7 |
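Per-class precision, recall and F-scores like those tabulated above can be computed with scikit-learn's `precision_recall_fscore_support`; the test-set labels and predictions below are hypothetical:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical test-set labels: 0 = irrelevant paper, 1 = relevant paper.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

# Per-class precision, recall and F-score, as reported in the table above.
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)
for cls, name in [(0, "irrelevant"), (1, "relevant")]:
    print(f"{name}: P={prec[cls] * 100:.1f}% "
          f"R={rec[cls] * 100:.1f}% F={f1[cls] * 100:.1f}%")
# → irrelevant: P=75.0% R=75.0% F=75.0%
# → relevant: P=83.3% R=83.3% F=83.3%
```

Reporting both classes separately, as the table does, matters here: the curators care about rejecting few relevant papers (relevant-class recall) as well as admitting few irrelevant ones (irrelevant-class recall).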