David Chen, Hans-Michael Müller, Paul W Sternberg.
Abstract
BACKGROUND: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.
Year: 2006 PMID: 16893465 PMCID: PMC1559726 DOI: 10.1186/1471-2105-7-370
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An example of the clustering results from the Sex Determination category. An intuitive interface allows users to quickly locate the topic of interest. The topics listed were generated automatically during the phrase-based clustering step.
Distribution of training examples and SVM output among the nine main categories.
| Category | Number of examples in training set | Number assigned |
| Genetics | 36 | 769 |
| Molecular Biology | 31 | 702 |
| Cellular Biology | 45 | 1532 |
| Sex Determination | 23 | 475 |
| Developmental Control | 42 | 1198 |
| Signal Transduction | 28 | 912 |
| Neurobiology and Behavior | 27 | 1214 |
| Ecology and Evolution | 21 | 505 |
| WormMethods | 39 | 1286 |
Figure 2. Overview of the classification process. Full-text papers are taken from the Textpresso corpus and processed via SVM and phrase-based clustering. The end result is a large set of HTML files displaying the paper taxonomy.
Effect of boosting and of using XML markup on SVM performance, measured with 10-fold cross-validation on the training set
| Input (boost level) | Average precision (sigma) | Average recall (sigma) | F1 score |
| XML (boost 0) | 0.6328 (0.0913) | 0.5836 (0.0961) | 0.6072 |
| Untrimmed (boost 0) | 0.6387 (0.0953) | 0.5875 (0.1028) | 0.6120 |
| Trimmed (boost 0) | 0.6305 (0.1005) | 0.5874 (0.1016) | 0.6082 |
| XML (boost 1) | 0.6527 (0.0970) | 0.6173 (0.0877) | 0.6345 |
| Untrimmed (boost 1) | 0.6531 (0.0864) | 0.6214 (0.0918) | 0.6369 |
| Trimmed (boost 1) | 0.6351 (0.0965) | 0.6173 (0.0996) | 0.6261 |
| XML (boost 2) | 0.6673 (0.0790) | 0.6486 (0.0775) | 0.6578 |
| Untrimmed (boost 2) | 0.6566 (0.0655) | 0.6316 (0.0761) | 0.6438 |
| Trimmed (boost 2) | 0.6533 (0.0618) | 0.6314 (0.0834) | 0.6422 |
| XML (boost 3) | 0.6800 (0.0770) | 0.6556 (0.0755) | 0.6676 |
| Untrimmed (boost 3) | 0.6722 (0.0546) | 0.6419 (0.0622) | 0.6567 |
| Trimmed (boost 3) | 0.6472 (0.0478) | 0.6315 (0.0593) | 0.6393 |
| XML (boost 4) | 0.6780 (0.0857) | 0.6414 (0.0897) | 0.6592 |
| Untrimmed (boost 4) | 0.6843 (0.0624) | 0.6522 (0.0745) | 0.6678 |
| Trimmed (boost 4) | 0.6571 (0.0640) | 0.6241 (0.0766) | 0.6402 |
| XML (boost 5) | 0.6820 (0.0908) | 0.6456 (0.1035) | 0.6633 |
| Untrimmed (boost 5) | 0.6957 (0.0746) | 0.6517 (0.0655) | 0.6730 |
| Trimmed (boost 5) | 0.6708 (0.0817) | 0.6207 (0.0863) | 0.6448 |
| XML (boost 6) | 0.6994 (0.0781) | 0.6594 (0.0758) | 0.6788 |
| Untrimmed (boost 6) | 0.6926 (0.0859) | 0.6485 (0.0798) | 0.6698 |
| Trimmed (boost 6) | 0.6680 (0.0929) | 0.6172 (0.0966) | 0.6416 |
| XML (boost 7) | 0.6863 (0.0737) | 0.6382 (0.0784) | 0.6614 |
| Untrimmed (boost 7) | 0.6865 (0.0851) | 0.6415 (0.0864) | 0.6632 |
| Trimmed (boost 7) | 0.6732 (0.0869) | 0.6207 (0.0987) | 0.6459 |
| XML (boost 8) | 0.6703 (0.0709) | 0.6176 (0.0886) | 0.6429 |
| Untrimmed (boost 8) | 0.6817 (0.0682) | 0.6276 (0.0704) | 0.6535 |
| Trimmed (boost 8) | 0.6843 (0.0759) | 0.6245 (0.0939) | 0.6530 |
| XML (boost 9) | 0.6801 (0.0748) | 0.6142 (0.0906) | 0.6455 |
| Untrimmed (boost 9) | 0.6807 (0.0722) | 0.6167 (0.0724) | 0.6471 |
| Trimmed (boost 9) | 0.6749 (0.0775) | 0.6070 (0.0963) | 0.6392 |
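The F1 column in the table above is consistent with the harmonic mean of the average precision and average recall in each row. The following is a minimal sketch of that arithmetic, not the authors' code; the function name `f1_score` and the choice of rows are illustrative assumptions, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (average precision, average recall, reported F1), copied from the table above.
rows = {
    "XML (boost 0)": (0.6328, 0.5836, 0.6072),
    "Untrimmed (boost 5)": (0.6957, 0.6517, 0.6730),
    "XML (boost 6)": (0.6994, 0.6594, 0.6788),
}

for name, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f}")
```

Because the score is computed from fold-averaged precision and recall, it can differ slightly from an average of per-fold F1 scores; the table does not say which was intended, so treat this as one plausible reading.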
Comparison of the number of assigned documents with and without Germline available as a category
| Category | With Germline: training examples | With Germline: documents assigned | Without Germline: training examples | Without Germline: documents assigned |
| Genetics | 25 | 736 | 25 | 722 |
| Molecular Biology | 20 | 611 | 20 | 683 |
| Cellular Biology | 21 | 1101 | 21 | 1277 |
| Sex Determination | 18 | 205 | 34 | 398 |
| Developmental Control | 22 | 796 | 22 | 721 |
| Signal Transduction | 20 | 922 | 20 | 757 |
| Neurobiology and Behavior | 20 | 1431 | 20 | 1401 |
| Ecology and Evolution | 18 | 495 | 18 | 470 |
| WormMethods | 23 | 1070 | 23 | 1154 |
| Germline | 18 | 118 | | |
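To make the comparison easier to read, the per-category change in assigned documents can be computed directly from the table. This is a small illustrative sketch using only the counts above; the variable names are hypothetical, and Germline is left out of the per-category differences because it has no counterpart in the run without it.

```python
# Documents assigned per category, copied from the table above.
with_germline = {
    "Genetics": 736, "Molecular Biology": 611, "Cellular Biology": 1101,
    "Sex Determination": 205, "Developmental Control": 796,
    "Signal Transduction": 922, "Neurobiology and Behavior": 1431,
    "Ecology and Evolution": 495, "WormMethods": 1070, "Germline": 118,
}
without_germline = {
    "Genetics": 722, "Molecular Biology": 683, "Cellular Biology": 1277,
    "Sex Determination": 398, "Developmental Control": 721,
    "Signal Transduction": 757, "Neurobiology and Behavior": 1401,
    "Ecology and Evolution": 470, "WormMethods": 1154,
}

# Change in assigned documents when Germline is removed as a category.
delta = {
    category: without_germline[category] - with_germline[category]
    for category in without_germline
}

for category, change in sorted(delta.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: {change:+d}")
```

The largest increases fall on Sex Determination and Cellular Biology. The sketch only restates the table's counts and does not show which individual documents moved between categories.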