Antonio Jose Jimeno Yepes, Laura Plaza, Jorge Carrillo-de-Albornoz, James G Mork, Alan R Aronson.
Abstract
BACKGROUND: Research in biomedical text categorization has mostly used the bag-of-words representation. Other more sophisticated representations of text based on syntactic, semantic and argumentative properties have been less studied. In this paper, we evaluate the impact of different text representations of biomedical texts as features for reproducing the MeSH annotations of some of the most frequent MeSH headings. In addition to unigrams and bigrams, these features include noun phrases, citation meta-data, citation structure, and semantic annotation of the citations.
Year: 2015 PMID: 25887792 PMCID: PMC4407321 DOI: 10.1186/s12859-015-0539-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
MeSH descriptors in the evaluation collection and their citation frequencies in the training set
| MeSH heading | Citations | MeSH heading | Citations |
|---|---|---|---|
| Humans | 66612 | Disease models, Animal | 2203 |
| Male | 39007 | Rats, Sprague-Dawley | 2160 |
| Female | 38793 | Sensitivity and specificity | 2155 |
| Animals | 25529 | Cell proliferation | 2124 |
| Adult | 21471 | Biological markers | 2088 |
| Middle aged | 20867 | Cohort studies | 2072 |
| Young adult | 9512 | Risk assessment | 2049 |
| Adolescent | 8869 | Brain | 2035 |
| Mice | 7980 | Mutation | 2025 |
| Treatment outcome | 6749 | Mice, Inbred C57BL | 2005 |
| Aged, 80 and over | 6015 | Cell line | 1947 |
| Child | 5759 | Apoptosis | 1901 |
| Rats | 5610 | Infant, Newborn | 1865 |
| Risk factors | 4896 | Tomography, X-Ray computed | 1862 |
| Prospective studies | 3178 | RNA, Messenger | 1843 |
| Questionnaires | 3064 | Age factors | 1763 |
| Signal transduction | 2925 | Algorithms | 1698 |
| Cell line, Tumor | 2911 | Models, Molecular | 1692 |
| Molecular sequence data | 2695 | Antineoplastic agents | 1681 |
| Pregnancy | 2672 | Gene expression regulation | 1669 |
| Infant | 2551 | Dose-response relationship, Drug | 1627 |
| Magnetic resonance imaging | 2545 | Amino acid sequence | 1625 |
| Cells, Cultured | 2451 | Genotype | 1561 |
| Prognosis | 2450 | Neoplasms | 1521 |
| Case-Control studies | 2383 | Phylogeny | 1518 |
Feature comparison over all results
| Feature | SVMLight | SVM-perf | AdaBoostM1 | Ada Over |
|---|---|---|---|---|
| Unigram | 0.418 | 0.492 | 0.420 | 0.471 |
| Bigram | 0.406 | 0.513* | 0.420 | 0.477* |
| Argumentative | 0.403 | 0.479 | 0.415 | 0.464 |
| Noun phrases | 0.222 | 0.329 | 0.222 | 0.271 |
| Concepts | 0.409 | 0.497* | 0.427 | 0.480* |
| CUIs | 0.398 | 0.496 | 0.422 | 0.475 |
| MTI predictions | 0.513* | 0.531* | 0.478* | 0.501* |
| MTI MMI | 0.398 | 0.454 | 0.367 | 0.382 |
| MTI PRC | 0.481* | 0.502 | 0.430 | 0.453 |
| First level taxonomy | 0.300 | 0.456 | 0.351 | 0.429 |
| Second level taxonomy | 0.222 | 0.424 | 0.329 | 0.393 |
| Third level taxonomy | 0.173 | 0.383 | 0.285 | 0.341 |
| Journal | 0.115 | 0.193 | 0.126 | 0.208 |
| Affiliation | 0.046 | 0.064 | 0.045 | 0.044 |
| Author | 0.062 | 0.137 | 0.081 | 0.084 |
Results are reported as F-measure, using a binary representation of features. The learning algorithms used include SVMLight, SVM-perf, AdaBoostM1 and AdaBoostM1 with oversampling of positive instances (Ada Over). For each column, results significantly better than unigram (p < 0.05) are indicated with *. For each pair of methods (SVMLight/SVM-perf and AdaBoostM1/Ada Over), statistical differences are highlighted using †.
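The note above mentions oversampling of positive instances for AdaBoostM1 but does not say how it was done. The following is a minimal sketch of one common variant, random duplication of positives until both classes are the same size. The function name `oversample_positives`, the balancing target, and the random duplication strategy are all assumptions made for illustration, not the authors' procedure.

```python
import random


def oversample_positives(instances, labels, seed=0):
    """Duplicate randomly chosen positive instances until classes are balanced.

    Assumption: labels are 1 (positive) or 0 (negative), and "balanced"
    means equal counts. The source does not specify either detail.
    """
    rng = random.Random(seed)
    positives = [x for x, y in zip(instances, labels) if y == 1]
    negatives = [x for x, y in zip(instances, labels) if y == 0]

    # Nothing to do if there are no positives or they are not the minority.
    if not positives or len(positives) >= len(negatives):
        return list(instances), list(labels)

    # Sample with replacement from the positives to fill the gap.
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    return list(instances) + extra, list(labels) + [1] * len(extra)
```

For rare headings, where positives are a small fraction of the training citations, this kind of rebalancing typically trades precision for recall.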
Results of the best-performing features (unigrams, bigrams, concept names and CUIs, and first level taxonomy) keeping the source of tokens (either title or abstract), using SVM-perf and a binary representation of features
| Method | Precision | Recall | F1 |
|---|---|---|---|
| SVM-perf unigram | 0.395 | 0.654 | 0.492 |
| SVM-perf bigram | 0.414 | 0.675 | 0.513* |
| SVM-perf concepts | 0.404 | 0.646 | 0.497* |
| SVM-perf CUIs | 0.404 | 0.643 | 0.496* |
| SVM-perf first level taxonomy | 0.351 | 0.653 | 0.456 |
| SVM-perf TIAB unigram | 0.398 | 0.659 | 0.496* |
| SVM-perf TIAB bigram | 0.408 | 0.685 | 0.512* |
| SVM-perf TIAB Concepts | 0.405 | 0.656 | 0.501* |
| SVM-perf TIAB CUIs | 0.407 | 0.655 | 0.502* |
| SVM-perf TIAB first level taxonomy | 0.376 | 0.610 | 0.465 |
Results significantly better than unigram (p < 0.05) are indicated with *.
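The TIAB rows in the table above keep track of whether a token came from the title or the abstract. A minimal sketch of that idea follows, assuming whitespace tokenization and a hypothetical `TI:`/`AB:` prefix convention. Neither detail is given in the source.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return the list of n-grams, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tiab_features(title, abstract, n=1):
    """Build a set of n-gram features tagged with their source field.

    The "TI:" and "AB:" prefixes are hypothetical labels; they make the
    same n-gram in the title and in the abstract two distinct features.
    """
    features = set()
    for source, text in (("TI", title), ("AB", abstract)):
        for gram in ngrams(tokenize(text), n):
            features.add(f"{source}:{gram}")
    return features
```

With this tagging, "brain" in a title and "brain" in an abstract can receive different weights from the learner, at the cost of roughly doubling the feature space.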
Binary versus term frequency features using SVMLight and SVM-perf on unigrams and bigrams
| Method | Binary Prec | Binary Rec | Binary F1 | Term frequency Prec | Term frequency Rec | Term frequency F1 |
|---|---|---|---|---|---|---|
| SVMLight Unigram | 0.678 | 0.302 | 0.418* | 0.694 | 0.269 | 0.387 |
| SVMLight Bigram | 0.711 | 0.284 | 0.406 | 0.708 | 0.273 | 0.394 |
| SVMLight TIAB unigram | 0.678 | 0.302 | 0.418* | 0.700 | 0.263 | 0.383 |
| SVMLight TIAB bigram | 0.730 | 0.294 | 0.420* | 0.715 | 0.268 | 0.389 |
| SVM-perf Unigram | 0.395 | 0.654 | 0.492 | 0.390 | 0.686 | 0.497 |
| SVM-perf Bigram | 0.414 | 0.675 | 0.513 | 0.442 | 0.594 | 0.507 |
| SVM-perf TIAB unigram | 0.398 | 0.659 | 0.496* | 0.401 | 0.609 | 0.483 |
| SVM-perf TIAB bigram | 0.408 | 0.685 | 0.512 | 0.428 | 0.611 | 0.503 |
For each row, significantly better results (p < 0.05) are indicated with *.
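The two feature weightings compared above differ only in the value stored for each n-gram: its count in the citation (term frequency) or a 0/1 presence flag (binary). The sketch below illustrates the difference. The function names and the whitespace tokenization are assumptions for illustration only.

```python
from collections import Counter


def term_frequency_features(text, n=1):
    """Map each n-gram to the number of times it occurs in the text."""
    tokens = text.lower().split()  # simplifying assumption: whitespace tokens
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return dict(Counter(grams))


def binary_features(text, n=1):
    """Map each n-gram that occurs at least once to 1."""
    return {gram: 1 for gram in term_frequency_features(text, n)}
```

For example, `term_frequency_features("mice and mice")` gives `{"mice": 2, "and": 1}`, while `binary_features` maps both n-grams to 1.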
MTI results and individual performance of its components MMI (MetaMap + Restrict-to-MeSH) and PRC (PubMed Related citations)
| Method | Precision | Recall | F1 |
|---|---|---|---|
| MTI system | 0.612 | 0.499 | 0.549 |
| MMI | 0.556 | 0.212 | 0.307 |
| PRC | 0.602 | 0.356 | 0.447 |
| MMI+PRC | 0.600 | 0.393 | 0.475 |
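As a quick consistency check on the table above, the F-measure can be recomputed as the harmonic mean of precision and recall. The sketch below assumes the balanced F1 definition. Because the reported figures are rounded, and may be averaged over MeSH headings, the recomputed values are only expected to agree with the table to about 0.001.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the MTI table above.
MTI_ROWS = {
    "MTI system": (0.612, 0.499, 0.549),
    "MMI": (0.556, 0.212, 0.307),
    "PRC": (0.602, 0.356, 0.447),
    "MMI+PRC": (0.600, 0.393, 0.475),
}

for name, (prec, rec, reported) in MTI_ROWS.items():
    recomputed = f_measure(prec, rec)
    print(f"{name}: recomputed={recomputed:.3f} reported={reported:.3f}")
```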
Feature combination results
| Features | SVM-perf Prec | SVM-perf Rec | SVM-perf F1 | Ada Over Prec | Ada Over Rec | Ada Over F1 |
|---|---|---|---|---|---|---|
| Unigram | 0.395 | 0.654 | 0.492 | 0.528 | 0.425 | 0.471 |
| Unigram+CUI | 0.409 | 0.657 | 0.504* | 0.529 | 0.437 | 0.479* |
| Unigram+Meta | 0.387 | 0.672 | 0.491 | 0.550 | 0.405 | 0.466 |
| Unigram+NP | 0.382 | 0.701 | 0.495* | 0.535 | 0.424 | 0.473 |
| Unigram+Taxo | 0.403 | 0.660 | 0.500* | 0.531 | 0.432 | 0.477 |
| Unigram+mti | 0.448 | 0.679 | 0.540* | 0.586 | 0.477 | 0.526* |
| Unigram+mmi+prc | 0.445 | 0.677 | 0.537* | 0.583 | 0.474 | 0.523* |
| Unigram+all | 0.452 | 0.689 | 0.546* | 0.600 | 0.476 | 0.531* |
| TIAB+bigram | 0.408 | 0.685 | 0.512 | 0.556 | 0.421 | 0.479 |
| TIAB+bigram+CUI | 0.439 | 0.688 | 0.536 | 0.556 | 0.435 | 0.488* |
| TIAB+bigram+Meta | 0.408 | 0.689 | 0.513 | 0.581 | 0.406 | 0.478 |
| TIAB+bigram+NP | 0.417 | 0.686 | 0.518* | 0.560 | 0.422 | 0.481 |
| TIAB+bigram+Taxo | 0.418 | 0.679 | 0.518* | 0.554 | 0.412 | 0.472 |
| TIAB+bigram+mti | 0.451 | 0.701 | 0.549* | 0.604 | 0.475 | 0.532* |
| TIAB+bigram+mmi+prc | 0.448 | 0.699 | 0.546* | 0.607 | 0.466 | 0.528* |
| TIAB+bigram+all | 0.470 | 0.682 | 0.557* | 0.629 | 0.380 | 0.474 |
Results are reported as Precision (Prec), Recall (Rec) and F-measure (F1). Unigrams, and bigrams with feature source (either title or abstract, TIAB+bigram), are combined with concept identifiers (+CUI), meta-data (+Meta), noun phrases (+NP), hypernyms (+Taxo), MTI predictions (+mti), MTI components (+mmi+prc) and all the features (+all). For each column, results significantly better (p < 0.05) than the ones obtained with unigram or TIAB+bigram are indicated with *.
Figure 1. Classification performance per MeSH heading. The figure shows the F-measure for each MeSH heading when the best combination of features (TIAB+bigram+all) is used for classification with the best-performing ML algorithm (SVM-perf).