Hayda Almeida1, Marie-Jean Meurs2, Leila Kosseim1, Greg Butler3, Adrian Tsang2.
Abstract
This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine and Logistic Model Trees). The results show that the most fitting model to handle the imbalanced datasets of the triage classification task is obtained by using domain relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.
Year: 2014 PMID: 25551575 PMCID: PMC4281078 DOI: 10.1371/journal.pone.0115892
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. System Workflow.
mycoSORT Training and Testing processes.
mycoSet Corpus Statistics.
| Attribute | Quantity |
| Total # of instances | 7,583 (100%) |
| Total # of abstracts with text content | 6,898 (90.96%) |
| Negative instances | 6,834 (90.12%) |
| Positive instances | 749 (9.88%) |
| # of words in paper abstracts | 43,598 |
| # of words in paper titles | 12,388 |
| # of annotations in paper abstracts | 50,866 |
| # of annotations in paper titles | 8,172 |
| # of EC numbers | 12,272 |
Statistics of the mycoSet corpus.
Figure 2. Corpora Under-sampling.
Number of Instances and Balances across all Training Sets.
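The record above does not spell out how the under-sampling factor (USF) is computed, so the following is only a minimal sketch of random under-sampling of the majority (negative) class. It assumes, purely for illustration, that the factor is the fraction of negative instances discarded; the function names `undersample_negatives` and `positive_ratio` and the toy label list are hypothetical and are not taken from mycoSORT.

```python
import random


def undersample_negatives(labels, fraction_removed, seed=0):
    """Return sorted indices kept after randomly dropping negatives.

    Assumption (not from the source): `fraction_removed` is the share of
    negative instances (label 0) that is discarded. Positives are all kept.
    """
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = len(negatives) - int(round(len(negatives) * fraction_removed))
    kept_negatives = rng.sample(negatives, n_keep)
    return sorted(positives + kept_negatives)


def positive_ratio(labels, indices):
    """Share of positive instances among the selected indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy data: 10 positives and 90 negatives, roughly the 1:9 balance of the corpus.
toy_labels = [1] * 10 + [0] * 90
kept = undersample_negatives(toy_labels, fraction_removed=0.4)
print(len(kept), positive_ratio(toy_labels, kept))
```

Dropping negatives raises the share of positives in the training set without touching the test set, which is the usual reason for applying the technique to imbalanced triage data.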
mycoSet Bio-entities.
| Entity | Span | Entity | Span |
| AccessionNumber | entity | Glycosylation | sentence |
| ActivityAssayConditions | sentence | Kinetics | sentence |
| Assay | entity | Laccase | entity |
| Buffer | entity | Lipase | entity |
| Characterization | entity | Peroxidase | entity |
| Enzyme | entity | pH | sentence |
| Expression | sentence | ProductAnalysis | sentence |
| Family | entity | Temperature | sentence |
| Fungus | entity | SpecificActivity | sentence |
| Gene | entity | Substrate | entity |
| GlycosideHydrolase | entity | SubstrateSpecificity | sentence |
Bioentities and Spans in the mycoSet Corpus Annotated by the mycoMINE Text Mining System.
mycoSet Feature Vector Representation.
| ligninase | Trametes versicolor | synthetic | substrate | specificity | three | fungus | enzyme | … |
| 2 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | … |
Feature Occurrence Represented in the Feature Vector.
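As an illustration of the occurrence-count representation in the table above, the sketch below maps a list of extracted features to a count vector over a fixed vocabulary. The helper name `to_feature_vector` and the `extracted` list are hypothetical; the list was constructed only so that the counts match the example row, and it does not come from an actual mycoSet abstract.

```python
from collections import Counter


def to_feature_vector(extracted, vocabulary):
    """Count how often each vocabulary entry occurs among the extracted features."""
    counts = Counter(extracted)
    # Counter returns 0 for entries that never occur, so absent features are 0.
    return [counts[term] for term in vocabulary]


# Vocabulary in the column order of the table above.
vocabulary = [
    "ligninase", "Trametes versicolor", "synthetic", "substrate",
    "specificity", "three", "fungus", "enzyme",
]

# Hypothetical features extracted from one abstract (annotations and words).
extracted = [
    "ligninase", "ligninase",
    "Trametes versicolor", "Trametes versicolor",
    "synthetic", "substrate", "specificity", "three",
    "fungus", "fungus", "enzyme",
]

vector = to_feature_vector(extracted, vocabulary)
print(vector)  # [2, 2, 1, 1, 1, 1, 2, 1]
```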
mycoSet Confusion Matrix.
| | Predicted Positive | Predicted Negative |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Confusion Matrix of a Binary Classification.
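The Precision, Recall and MCC columns reported in the result tables are all functions of the four confusion-matrix cells. A minimal sketch using the standard definitions follows; the counts in the example are made up for illustration and are not results from the paper.

```python
import math


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, in [-1, 1]; 0.0 if undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a binary classifier on 200 instances.
tp, fn, fp, tn = 40, 10, 20, 130
print(round(precision(tp, fp), 3))    # 0.667
print(round(recall(tp, fn), 3))       # 0.8
print(round(mcc(tp, fp, fn, tn), 3))  # 0.63
```

Unlike precision and recall, MCC also uses the true negatives, which makes it informative when the classes are as imbalanced as they are in a triage corpus.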
mycoSORT Results - Set of Features F1.
| Under-sampling (USF) | Classifier | Precision | Recall | F-measure | MCC | F-2 |
| Training set with USF 0% | Naive Bayes | 0.286 | 0.227 | 0.253 | 0.182 | 0.240 |
| Training set with USF 0% | LMT | 0.492 | 0.207 | 0.291 | 0.274 | 0.230 |
| Training set with USF 0% | LibSVM | 0.714 | 0.033 | 0.064 | 0.140 | 0.040 |
| Training set with USF 5% | Naive Bayes | 0.294 | 0.280 | 0.287 | 0.210 | 0.280 |
| Training set with USF 5% | LMT | 0.461 | 0.233 | 0.310 | 0.278 | 0.260 |
| Training set with USF 5% | LibSVM | 0.645 | 0.133 | 0.221 | 0.264 | 0.160 |
| Training set with USF 10% | Naive Bayes | 0.269 | 0.307 | 0.287 | 0.202 | 0.300 |
| Training set with USF 10% | LMT | 0.376 | 0.213 | 0.272 | 0.226 | 0.230 |
| Training set with USF 10% | LibSVM | 0.470 | 0.207 | 0.287 | 0.264 | 0.230 |
| Training set with USF 15% | Naive Bayes | 0.301 | 0.347 | 0.322 | 0.241 | 0.340 |
| Training set with USF 15% | LMT | 0.352 | 0.413 | 0.380 | 0.307 | 0.400 |
| Training set with USF 15% | LibSVM | 0.387 | 0.287 | 0.330 | 0.271 | 0.300 |
| Training set with USF 20% | Naive Bayes | 0.263 | 0.340 | 0.297 | 0.209 | 0.320 |
| Training set with USF 20% | LMT | 0.348 | 0.480 | 0.403 | 0.331 | 0.450 |
| Training set with USF 20% | LibSVM | 0.353 | 0.353 | 0.353 | 0.281 | 0.350 |
| Training set with USF 25% | Naive Bayes | 0.243 | 0.353 | 0.288 | 0.197 | 0.320 |
| Training set with USF 25% | LMT | 0.286 | 0.547 | 0.375 | 0.301 | 0.460 |
| Training set with USF 25% | LibSVM | 0.282 | 0.413 | 0.335 | 0.251 | 0.380 |
| Training set with USF 30% | Naive Bayes | 0.277 | 0.440 | 0.340 | 0.257 | 0.390 |
| Training set with USF 30% | LMT | 0.291 | 0.627 | 0.397 | 0.334 | 0.510 |
| Training set with USF 30% | LibSVM | 0.258 | 0.480 | 0.336 | 0.252 | 0.410 |
| Training set with USF 35% | Naive Bayes | 0.242 | 0.440 | 0.312 | 0.223 | 0.380 |
| Training set with USF 35% | LMT | 0.233 | 0.620 | 0.338 | 0.266 | 0.470 |
| Training set with USF 35% | LibSVM | 0.210 | 0.633 | 0.316 | 0.241 | 0.450 |
| Training set with USF 40% | Naive Bayes | 0.254 | 0.467 | 0.329 | 0.243 | 0.400 |
| Training set with USF 40% | LMT | 0.269 | 0.660 | 0.382 | 0.321 | 0.510 |
| Training set with USF 40% | LibSVM | 0.196 | 0.667 | 0.303 | 0.229 | 0.450 |
Results of Positive Class on Feature Setting #1, Using Only Bio-entities as Features.
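The F-measure and F-2 columns can be recomputed from the Precision and Recall columns with the general F-beta formula, where beta = 2 weights recall more heavily than precision. The sketch below checks this against one row of the table above (LMT, USF 40%: precision 0.269, recall 0.660); it assumes the F-2 column is rounded to two decimals, which is what the tabulated values suggest.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    return (1 + b2) * precision * recall / denominator if denominator else 0.0


# Row "Training set with USF 40% | LMT" from the F1 feature-set table.
p, r = 0.269, 0.660
f1 = f_beta(p, r, beta=1.0)
f2 = f_beta(p, r, beta=2.0)
print(round(f1, 3))  # 0.382, matches the F-measure column
print(round(f2, 2))  # 0.51, matches the F-2 column (0.510)
```

Because triage aims to avoid missing relevant papers, a recall-weighted score such as F-2 is a natural companion to the balanced F-measure.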
mycoSORT Results - Set of Features F1+F4.
| Under-sampling (USF) | Classifier | Precision | Recall | F-measure | MCC | F-2 |
| Training set with USF 0% | Naive Bayes | 0.285 | 0.380 | 0.326 | 0.242 | 0.360 |
| Training set with USF 0% | LMT | 0.516 | 0.107 | 0.177 | 0.202 | 0.130 |
| Training set with USF 0% | LibSVM | 1.000 | 0.020 | 0.039 | 0.134 | 0.020 |
| Training set with USF 5% | Naive Bayes | 0.273 | 0.373 | 0.315 | 0.230 | 0.350 |
| Training set with USF 5% | LMT | 0.426 | 0.173 | 0.246 | 0.224 | 0.200 |
| Training set with USF 5% | LibSVM | 0.833 | 0.033 | 0.064 | 0.155 | 0.040 |
| Training set with USF 10% | Naive Bayes | 0.268 | 0.427 | 0.329 | 0.243 | 0.380 |
| Training set with USF 10% | LMT | 0.412 | 0.233 | 0.298 | 0.255 | 0.260 |
| Training set with USF 10% | LibSVM | 0.688 | 0.073 | 0.133 | 0.203 | 0.090 |
| Training set with USF 15% | Naive Bayes | 0.268 | 0.427 | 0.329 | 0.243 | 0.380 |
| Training set with USF 15% | LMT | 0.398 | 0.300 | 0.342 | 0.284 | 0.320 |
| Training set with USF 15% | LibSVM | 0.604 | 0.193 | 0.293 | 0.306 | 0.220 |
| Training set with USF 20% | Naive Bayes | 0.275 | 0.440 | 0.338 | 0.255 | 0.390 |
| Training set with USF 20% | LMT | 0.322 | 0.393 | 0.354 | 0.276 | 0.380 |
| Training set with USF 20% | LibSVM | 0.471 | 0.327 | 0.386 | 0.338 | 0.350 |
| Training set with USF 25% | Naive Bayes | 0.258 | 0.507 | 0.342 | 0.260 | 0.420 |
| Training set with USF 25% | LMT | 0.321 | 0.520 | 0.397 | 0.324 | 0.460 |
| Training set with USF 25% | LibSVM | 0.364 | 0.420 | 0.390 | 0.318 | 0.410 |
| Training set with USF 30% | Naive Bayes | 0.237 | 0.540 | 0.329 | 0.248 | 0.430 |
| Training set with USF 30% | LMT | 0.328 | 0.500 | 0.396 | 0.322 | 0.450 |
| Training set with USF 30% | LibSVM | 0.323 | 0.473 | 0.384 | 0.308 | 0.430 |
| Training set with USF 35% | Naive Bayes | 0.227 | 0.513 | 0.315 | 0.229 | 0.410 |
| Training set with USF 35% | LMT | 0.267 | 0.587 | 0.367 | 0.295 | 0.470 |
| Training set with USF 35% | LibSVM | 0.251 | 0.573 | 0.349 | 0.274 | 0.460 |
| Training set with USF 40% | Naive Bayes | 0.244 | 0.520 | 0.332 | 0.250 | 0.420 |
| Training set with USF 40% | LMT | 0.267 | 0.707 | 0.388 | 0.334 | 0.530 |
| Training set with USF 40% | LibSVM | 0.217 | 0.613 | 0.321 | 0.245 | 0.450 |
Results of Positive Class on Feature Setting #2, Using Bio-entities and EC Numbers as Features.
mycoSORT Results - Set of Features F5.
| Under-sampling (USF) | Classifier | Precision | Recall | F-measure | MCC | F-2 |
| Training set with USF 0% | Naive Bayes | 0.307 | 0.720 | 0.430 | 0.382 | 0.570 |
| Training set with USF 0% | LMT | 0.656 | 0.420 | 0.512 | 0.485 | 0.450 |
| Training set with USF 0% | LibSVM | 0.833 | 0.033 | 0.064 | 0.155 | 0.040 |
| Training set with USF 5% | Naive Bayes | 0.310 | 0.733 | 0.436 | 0.390 | 0.580 |
| Training set with USF 5% | LMT | 0.600 | 0.500 | 0.545 | 0.503 | 0.520 |
| Training set with USF 5% | LibSVM | 0.703 | 0.173 | 0.278 | 0.319 | 0.200 |
| Training set with USF 10% | Naive Bayes | 0.307 | 0.760 | 0.438 | 0.396 | 0.590 |
| Training set with USF 10% | LMT | 0.574 | 0.567 | 0.570 | 0.523 | 0.570 |
| Training set with USF 10% | LibSVM | 0.704 | 0.333 | 0.452 | 0.449 | 0.370 |
| Training set with USF 15% | Naive Bayes | 0.309 | 0.793 | 0.445 | 0.410 | 0.600 |
| Training set with USF 15% | LMT | 0.458 | 0.693 | 0.552 | 0.504 | 0.630 |
| Training set with USF 15% | LibSVM | 0.596 | 0.413 | 0.488 | 0.451 | 0.440 |
| Training set with USF 20% | Naive Bayes | 0.314 | 0.793 | 0.450 | 0.415 | 0.610 |
| Training set with USF 20% | LMT | 0.422 | 0.653 | 0.513 | 0.460 | 0.590 |
| Training set with USF 20% | LibSVM | 0.545 | 0.527 | 0.536 | 0.485 | 0.530 |
| Training set with USF 25% | Naive Bayes | 0.312 | 0.780 | 0.446 | 0.408 | 0.600 |
| Training set with USF 25% | LMT | 0.399 | 0.673 | 0.501 | 0.449 | 0.590 |
| Training set with USF 25% | LibSVM | 0.481 | 0.580 | 0.526 | 0.470 | 0.560 |
| Training set with USF 30% | Naive Bayes | 0.288 | 0.767 | 0.418 | 0.377 | 0.580 |
| Training set with USF 30% | LMT | 0.388 | 0.727 | 0.506 | 0.461 | 0.620 |
| Training set with USF 30% | LibSVM | 0.460 | 0.687 | 0.551 | 0.503 | 0.630 |
| Training set with USF 35% | Naive Bayes | 0.302 | 0.780 | 0.435 | 0.397 | 0.590 |
| Training set with USF 35% | LMT | 0.359 | 0.807 | 0.497 | 0.465 | 0.650 |
| Training set with USF 35% | LibSVM | 0.369 | 0.800 | 0.505 | 0.472 | 0.650 |
| Training set with USF 40% | Naive Bayes | 0.303 | 0.773 | 0.435 | 0.396 | 0.590 |
| Training set with USF 40% | LMT | 0.344 | 0.840 | 0.488 | 0.463 | 0.650 |
| Training set with USF 40% | LibSVM | 0.338 | 0.840 | 0.482 | 0.456 | 0.650 |
Results of Positive Class on Feature Setting #3, Using Only Bag-of-Words as Features.
mycoSORT Results - Set of Features F1+F2+F3+F4.
| Under-sampling (USF) | Classifier | Precision | Recall | F-measure | MCC | F-2 |
| Training set with USF 0% | Naive Bayes | 0.355 | 0.727 | 0.477 | 0.431 | 0.600 |
| Training set with USF 0% | LMT | 0.685 | 0.420 | 0.521 | 0.498 | 0.460 |
| Training set with USF 0% | LibSVM | 0.867 | 0.087 | 0.158 | 0.257 | 0.110 |
| Training set with USF 5% | Naive Bayes | 0.365 | 0.740 | 0.489 | 0.446 | 0.610 |
| Training set with USF 5% | LMT | 0.585 | 0.480 | 0.527 | 0.484 | 0.500 |
| Training set with USF 5% | LibSVM | 0.729 | 0.287 | 0.411 | 0.424 | 0.330 |
| Training set with USF 10% | Naive Bayes | 0.349 | 0.787 | 0.484 | 0.448 | 0.630 |
| Training set with USF 10% | LMT | 0.552 | 0.600 | 0.575 | 0.526 | 0.590 |
| Training set with USF 10% | LibSVM | 0.670 | 0.420 | 0.516 | 0.491 | 0.450 |
| Training set with USF 15% | Naive Bayes | 0.342 | 0.787 | 0.477 | 0.441 | 0.620 |
| Training set with USF 15% | LMT | 0.478 | 0.647 | 0.550 | 0.498 | 0.600 |
| Training set with USF 15% | LibSVM | 0.607 | 0.473 | 0.532 | 0.491 | 0.490 |
| Training set with USF 20% | Naive Bayes | 0.342 | 0.793 | 0.478 | 0.443 | 0.630 |
| Training set with USF 20% | LMT | 0.425 | 0.640 | 0.511 | 0.456 | 0.580 |
| Training set with USF 20% | LibSVM | 0.521 | 0.587 | 0.552 | 0.500 | 0.570 |
| Training set with USF 25% | Naive Bayes | 0.322 | 0.787 | 0.457 | 0.421 | 0.610 |
| Training set with USF 25% | LMT | 0.389 | 0.747 | 0.511 | 0.469 | 0.630 |
| Training set with USF 25% | LibSVM | 0.474 | 0.667 | 0.554 | 0.504 | 0.620 |
| Training set with USF 30% | Naive Bayes | 0.336 | 0.773 | 0.469 | 0.430 | 0.610 |
| Training set with USF 30% | LMT | 0.398 | 0.780 | 0.527 | 0.490 | 0.650 |
| Training set with USF 30% | LibSVM | 0.459 | 0.673 | 0.546 | 0.496 | 0.620 |
| Training set with USF 35% | Naive Bayes | 0.304 | 0.800 | 0.440 | 0.406 | 0.600 |
| Training set with USF 35% | LMT | 0.343 | 0.760 | 0.473 | 0.433 | 0.610 |
| Training set with USF 35% | LibSVM | 0.357 | 0.793 | 0.493 | 0.458 | 0.640 |
| Training set with USF 40% | Naive Bayes | 0.295 | 0.780 | 0.428 | 0.389 | 0.590 |
| Training set with USF 40% | LMT | 0.361 | 0.847 | 0.506 | 0.481 | 0.670 |
| Training set with USF 40% | LibSVM | 0.331 | 0.793 | 0.468 | 0.433 | 0.620 |
Results of Positive Class on Feature Setting #4, Using Bio-entities, Content and EC Numbers as Features.
Figure 3. mycoSORT F-measure scores.
Results of the Best Classifiers for Each Classification Model.
Figure 4. mycoSORT F-2 scores.
Results of the Best Classifiers for Each Classification Model.