Khadim Dramé, Fleur Mougin, Gayo Diallo.
Abstract
BACKGROUND: With the large and growing volume of textual data, automated methods that identify significant topics in order to classify documents have received increasing interest. Despite many efforts in this direction, the task remains a real challenge. The problem is further complicated because full texts are not always freely available. Annotating documents using only partial information is therefore promising, but remains very ambitious.
Keywords: Biomedical text classification; Explicit semantic analysis; Information extraction; Machine learning; Multi-label classification; Semantic indexing; k-nearest neighbours
Year: 2016 PMID: 27312781 PMCID: PMC4911685 DOI: 10.1186/s13326-016-0073-1
Source DB: PubMed Journal: J Biomed Semantics
Importance of each feature for the prediction according to the Information Gain measure
| Feature | Description | Information gain |
|---|---|---|
| Feature 1 | Number of neighbours to which the label is assigned | 0.16 |
| Feature 2 | Sum of the similarity scores between the document and all neighbouring documents in which the label appears | 0.17 |
| Feature 3 | Whether all constituent tokens of the label appear in the target document | 0.01 |
| Feature 4 | Whether one of the label entries appears in the target document | 0.03 |
| Feature 5 | Frequency of the label if it is contained in the document | 0.03 |
| Feature 6 | Whether the label is contained in the document title | 0.02 |
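The six features in the table above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Neighbour` class, the `label_features` function, the regular-expression tokenisation, and the reading of "label entries" as a list of synonym strings are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Neighbour:
    """Hypothetical container for one retrieved neighbouring document."""
    labels: set        # labels assigned to the neighbour
    similarity: float  # similarity between the neighbour and the target document


def label_features(label, entries, title, text, neighbours):
    """Compute the six table features for one candidate label (illustrative only)."""
    doc = f"{title} {text}".lower()
    doc_tokens = set(re.findall(r"\w+", doc))  # assumed tokenisation
    label_lc = label.lower()
    # Neighbours that carry the candidate label
    holders = [n for n in neighbours if label in n.labels]
    return {
        "f1_neighbour_count": len(holders),
        "f2_similarity_sum": sum(n.similarity for n in holders),
        "f3_all_tokens_present": all(
            tok in doc_tokens for tok in re.findall(r"\w+", label_lc)
        ),
        "f4_entry_present": any(e.lower() in doc for e in entries),
        "f5_label_frequency": doc.count(label_lc),
        "f6_in_title": label_lc in title.lower(),
    }


neighbours = [
    Neighbour({"Humans", "Postoperative Care"}, 0.8),
    Neighbour({"Humans"}, 0.5),
    Neighbour({"Female"}, 0.4),
]
features = label_features(
    "Postoperative Care",
    ["Postoperative Care", "Care, Postoperative"],
    "Failures in postoperative care",
    "Postoperative care was observed prospectively.",
    neighbours,
)
```

In this reading, Features 1 and 2 depend only on the neighbours, while Features 3 to 6 depend only on the target document, which is consistent with the table reporting much higher information gain for the two neighbour-based features.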
Fig. 1. The process of the Explicit Semantic Analysis (ESA) based approach. The approach consists of two steps: the indexing step and the classification step
Results of our kNN-based system and of the best participating systems in the BioASQ challenge on the five tests of batch 3
| Test | Number of documents | System | EBP | EBR | EBF |
|---|---|---|---|---|---|
| Test 1 | 2,961 | kNN-Classifier | 0.55 | 0.48 | 0.49 |
| | | Best | 0.59 | 0.62 | 0.58 |
| Test 2 | 5,612 | kNN-Classifier | 0.52 | 0.50 | 0.48 |
| | | Best | 0.62 | 0.60 | 0.60 |
| Test 3 | 2,698 | kNN-Classifier | 0.55 | 0.49 | 0.49 |
| | | Best | 0.64 | 0.63 | 0.62 |
| Test 4 | 2,982 | kNN-Classifier | 0.49 | 0.55 | 0.49 |
| | | Best | 0.63 | 0.62 | 0.62 |
| Test 5 | 2,697 | kNN-Classifier | 0.50 | 0.53 | 0.48 |
| | | Best | 0.64 | 0.61 | 0.61 |
Results of the kNN-Classifier according to the classifier and strategy used for fixing N: a) 0.5 as the minimal confidence score threshold, b) the average size of the sets of labels collected from the neighbours and c) the cut-off method. A training set of 20,000 documents is used
| Strategy | Classifier | EBP | EBR | EBF |
|---|---|---|---|---|
| a) | NB | 0.58 | 0.49 | 0.49 |
| | RF | 0.74 | 0.34 | 0.43 |
| b) | NB | 0.51 | 0.54 | 0.51 |
| | RF | 0.52 | 0.54 | 0.52 |
| c) | NB | 0.56 | 0.52 | 0.51 |
| | RF | 0.61 | 0.52 | 0.53 |
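The three strategies for fixing N (the number of labels kept per document) can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the scores are toy values, and the cut-off method is not defined in this record, so `select_by_cutoff` implements only one possible reading (cutting at the largest drop between consecutive scores).

```python
def _ranked(scored):
    """Sort (label, score) pairs by decreasing score (stable for ties)."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def select_by_threshold(scored, threshold=0.5):
    """Strategy a): keep labels whose confidence reaches the threshold."""
    return [label for label, score in _ranked(scored) if score >= threshold]


def select_by_average_size(scored, neighbour_label_sets):
    """Strategy b): N is the average size of the neighbours' label sets."""
    sizes = [len(s) for s in neighbour_label_sets]
    n = round(sum(sizes) / len(sizes))
    return [label for label, _ in _ranked(scored)[:n]]


def select_by_cutoff(scored):
    """Strategy c), ASSUMED reading: cut at the largest gap between
    consecutive scores. The source does not specify the method."""
    ranked = _ranked(scored)
    if len(ranked) < 2:
        return [label for label, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [label for label, _ in ranked[:cut]]


# Toy scores, chosen only for illustration
scored = [
    ("Humans", 0.99), ("Postoperative care", 0.75), ("Female", 0.60),
    ("Male", 0.60), ("Middle aged", 0.32), ("Adult", 0.26),
]
neighbour_label_sets = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}]

by_threshold = select_by_threshold(scored)
by_average = select_by_average_size(scored, neighbour_label_sets)
by_cutoff = select_by_cutoff(scored)
```

A fixed threshold tends to return few, high-confidence labels, which fits the high precision and low recall reported for strategy a) with RF, whereas the other two strategies adapt N to each document.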
Results of the kNN-Classifier according to the classifier using the cut-off method with a training set of 50,000 documents
| Classifier | EBP | EBR | EBF | Acc |
|---|---|---|---|---|
| NB | 0.59 | 0.54 | 0.54 | 0.39 |
| RF | 0.62 | 0.54 | 0.55 | 0.41 |
| C4.5 | 0.63 | 0.52 | 0.54 | 0.39 |
| MLP | 0.64 | 0.46 | 0.51 | 0.36 |
Labels generated by the kNN-Classifier, with their corresponding relevance scores, for the document with PMID 23044786
| Labels | Relevance | Manual validation |
|---|---|---|
| Humans | 0.99 | Yes |
| Postoperative care | 0.75 | Yes |
| Female | 0.60 | Yes |
| Male | 0.60 | Yes |
| Middle aged | 0.32 | Yes |
| General surgery | 0.32 | Yes |
| Medical errors | 0.32 | Yes |
| Patient care team | 0.32 | No |
| Postoperative complications | 0.32 | No |
| Adult | 0.26 | Yes |
| Safety management | 0.26 | No |
| Aged | 0.25 | Yes |
| Prospective studies | 0.21 | Yes |
| Length of stay | 0.21 | No |
| Patient safety | 0.20 | Yes |
| Surgical procedures, operative | 0.20 | No |
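As a worked example on the table above, the following sketch computes the precision of the generated labels at different relevance thresholds. It assumes that "Yes" in the manual validation column means the label is correct; the `precision_at` helper is hypothetical and introduced only for illustration.

```python
# (label, relevance score, manually validated) copied from the table above
predictions = [
    ("Humans", 0.99, True),
    ("Postoperative care", 0.75, True),
    ("Female", 0.60, True),
    ("Male", 0.60, True),
    ("Middle aged", 0.32, True),
    ("General surgery", 0.32, True),
    ("Medical errors", 0.32, True),
    ("Patient care team", 0.32, False),
    ("Postoperative complications", 0.32, False),
    ("Adult", 0.26, True),
    ("Safety management", 0.26, False),
    ("Aged", 0.25, True),
    ("Prospective studies", 0.21, True),
    ("Length of stay", 0.21, False),
    ("Patient safety", 0.20, True),
    ("Surgical procedures, operative", 0.20, False),
]


def precision_at(threshold):
    """Fraction of validated labels among those scoring at least `threshold`."""
    kept = [valid for _, score, valid in predictions if score >= threshold]
    return sum(kept) / len(kept) if kept else 0.0


p_strict = precision_at(0.5)  # 4 labels kept, 4 validated
p_medium = precision_at(0.3)  # 9 labels kept, 7 validated
p_all = precision_at(0.0)     # 16 labels kept, 11 validated
```

For this document, all four labels scoring at least 0.5 are validated, while keeping all 16 labels gives 11 validated labels out of 16.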
Example of a PubMed® citation (PMID 23044786), consisting of a title and an abstract, with the MeSH descriptors manually selected by human indexers to annotate it
| Field | Content |
|---|---|
| Title | An observational study of the frequency, severity, and etiology of failures in postoperative care after major elective general surgery |
| Abstract | Objective: |
| MeSH descriptors assigned manually to the citation | Adult, Aged, Aged, 80 and over, Digestive System Surgical Procedures*, Elective Surgical Procedures*, Female, General Surgery, Hospitals, Teaching, Urban, Humans, Interprofessional Relations, London, Male, Medical Errors, Middle Aged, Outcome and Process Assessment (Health Care)*, Patient Safety, Postoperative Care, Prospective Studies |
Results of the ESA-based approach according to the association score
| Association score | EBF | Acc |
|---|---|---|
| Jaccard coefficient | 0.26 | 0.16 |
| TF.ICF | 0.22 | 0.13 |
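The two association scores compared above are not defined in this record, so the sketch below relies on common definitions and should be read as an assumption: the Jaccard coefficient is computed between the set of documents containing a term and the set of documents annotated with a label, and TF.ICF is taken as term frequency times inverse concept frequency, by analogy with TF-IDF. The toy corpus and function names are hypothetical.

```python
import math

# Toy annotated corpus: (tokens of the document, labels assigned to it)
corpus = [
    (["postoperative", "care", "surgery"], {"Postoperative Care", "General Surgery"}),
    (["surgery", "errors"], {"General Surgery", "Medical Errors"}),
    (["patient", "care"], {"Postoperative Care"}),
]


def jaccard(term, label, corpus):
    """|docs with term AND label| / |docs with term OR label|."""
    with_term = {i for i, (tokens, _) in enumerate(corpus) if term in tokens}
    with_label = {i for i, (_, labels) in enumerate(corpus) if label in labels}
    union = with_term | with_label
    return len(with_term & with_label) / len(union) if union else 0.0


def tf_icf(term, label, corpus):
    """ASSUMED definition: frequency of the term in documents annotated with
    the label, weighted by log(#labels / #labels whose documents contain it)."""
    all_labels = {l for _, labels in corpus for l in labels}
    tf = sum(tokens.count(term) for tokens, labels in corpus if label in labels)
    cf = sum(
        1
        for l in all_labels
        if any(term in tokens for tokens, labels in corpus if l in labels)
    )
    return tf * math.log(len(all_labels) / cf) if cf else 0.0


j_same = jaccard("care", "Postoperative Care", corpus)
j_partial = jaccard("care", "General Surgery", corpus)
w_errors = tf_icf("errors", "Medical Errors", corpus)
```

Under these definitions, a term that occurs exactly in the documents carrying a label gets a Jaccard score of 1, and a term spread across many labels is down-weighted by the ICF factor.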
Comparison of the kNN-Classifier we used in the challenge with the best systems and the MTI baseline on the test set of week 2 of batch 3, consisting of 3,009 documents. The measures used are example-based precision (EBP), example-based recall (EBR), example-based f-measure (EBF) and micro f-measure (MiF) (Source: BioASQ 2014)
| Systems | EBP | EBR | EBF | MiF |
|---|---|---|---|---|
| Antinomyra | 0.59 | 0.62 | 0.59 | 0.60 |
| L2R | 0.59 | 0.60 | 0.58 | 0.59 |
| Hippocrates | 0.59 | 0.60 | 0.57 | 0.59 |
| MTI | 0.59 | 0.58 | 0.56 | 0.57 |
| kNN-Classifier | 0.55 | 0.49 | 0.49 | 0.51 |
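The measures reported in these tables can be computed as in the following sketch, which uses the usual definitions for multi-label classification: example-based measures are computed per document and then averaged, while the micro f-measure pools true positives, false positives and false negatives over all documents. The `evaluate` function and the toy label sets are illustrative assumptions, not the challenge's evaluation code.

```python
def evaluate(gold, predicted):
    """Return EBP, EBR, EBF and MiF for parallel lists of label sets."""
    n = len(gold)
    ebp = ebr = ebf = 0.0
    tp = fp = fn = 0
    for y, z in zip(gold, predicted):
        common = len(y & z)
        # Per-document precision, recall and F-measure
        ebp += common / len(z) if z else 0.0
        ebr += common / len(y) if y else 0.0
        ebf += 2 * common / (len(y) + len(z)) if (y or z) else 0.0
        # Pooled counts for the micro-averaged F-measure
        tp += common
        fp += len(z) - common
        fn += len(y) - common
    mif = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"EBP": ebp / n, "EBR": ebr / n, "EBF": ebf / n, "MiF": mif}


# Toy example with two documents
gold = [{"Humans", "Female"}, {"Adult"}]
predicted = [{"Humans"}, {"Adult", "Aged"}]
scores = evaluate(gold, predicted)
```

Because EBF averages the per-document F-measures, it is generally not the harmonic mean of the reported EBP and EBR; in the toy example EBP and EBR are both 0.75 while EBF is 2/3.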