Xiangying Jiang1, Martin Ringwald2, Judith Blake2, Hagit Shatkay1.
Abstract
The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. Database URL: www.informatics.jax.org
Year: 2017 PMID: 28365740 PMCID: PMC5467553 DOI: 10.1093/database/bax017
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Table 1. The datasets used for training and testing of our biomedical document classification
| Dataset | Positive examples | Negative examples | Total examples |
|---|---|---|---|
| GXD | 12 966 | 12 354 | 25 320 |
| GXD-caption | 1630 | 1696 | 3326 |
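As a quick sanity check on the class balance reported in the table above, the following minimal sketch recomputes the totals and the accuracy that a trivial majority-class predictor would reach on each dataset. The counts are copied from the table; the names `DATASETS` and `majority_baseline` are illustrative and do not come from the original work.

```python
# Counts copied from the dataset table above: (positive, negative) examples.
DATASETS = {
    "GXD": (12966, 12354),
    "GXD-caption": (1630, 1696),
}


def majority_baseline(positive: int, negative: int) -> float:
    """Accuracy obtained by always predicting the larger class."""
    return max(positive, negative) / (positive + negative)


for name, (pos, neg) in DATASETS.items():
    total = pos + neg
    print(f"{name}: total={total}, positive share={pos / total:.3f}, "
          f"majority baseline={majority_baseline(pos, neg):.3f}")
```

Both datasets are close to balanced, so a majority-class baseline sits near 0.51, which gives a reference point for the accuracy values reported in the result tables.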
Table 2. Classification evaluation measures for our classifiers on the GXD dataset using different cross-validation settings
| Classifiers | Precision | Recall | F-measure | Accuracy | Utility-10 | Utility-20 |
|---|---|---|---|---|---|---|
| NB5 | 0.892 (0.005) | 0.957 (0.003) | 0.923 (0.003) | 0.917 (0.004) | 0.876 (0.006) | 0.881 (0.005) |
| RF5 | 0.908 (0.006) | 0.921 (0.005) | 0.915 (0.004) | 0.912 (0.005) | 0.895 (0.007) | 0.899 (0.007) |
| NB10 | 0.891 (0.007) | 0.957 (0.006) | 0.923 (0.004) | 0.917 (0.005) | 0.875 (0.008) | 0.881 (0.008) |
| RF10 | 0.908 (0.008) | 0.922 (0.008) | 0.915 (0.006) | 0.912 (0.007) | 0.894 (0.009) | 0.899 (0.009) |
| RF-H-and-H | 0.905 | 0.925 | 0.915 | 0.913 | 0.896 | 0.900 |
NB denotes Naïve Bayes classifier; RF denotes Random Forest classifier. The suffix 5 indicates using 5 complete runs of 5-fold cross validation; the suffix 10 indicates using 10 complete runs of 10-fold cross validation. H-and-H represents using half of the GXD dataset for training and the other half for testing.
Figure 1. Performance of our classifiers, measured on the GXD dataset according to the different performance metrics, calculated over the various cross-validation settings. NB denotes Naïve Bayes; RF denotes Random Forest classifier. The suffix 5 denotes the average over 5 complete runs of 5-fold cross validation (25 runs in total); the suffix 10 denotes the average over 10 complete runs of 10-fold cross validation (100 runs in total). Half-training-half-testing represents runs in which half of the GXD dataset was used for training and the other half for testing.
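The run counts quoted above (5 × 5 = 25 and 10 × 10 = 100) follow directly from repeating k-fold cross validation with a fresh partition each time. The sketch below illustrates that bookkeeping only. It uses plain (non-stratified) splitting and a seeded shuffle, both of which are assumptions made for illustration; the function name `repeated_kfold` and the toy size of 100 items are hypothetical and are not taken from the original experiments.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(n_items: int, k: int, repeats: int,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for `repeats` complete runs of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)  # a fresh random partition for every complete run
        # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
        folds = [indices[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test


splits_5x5 = list(repeated_kfold(n_items=100, k=5, repeats=5))
splits_10x10 = list(repeated_kfold(n_items=100, k=10, repeats=10))
print(len(splits_5x5), len(splits_10x10))
```

Each yielded pair would be used to train one classifier and score it on the held-out fold; the reported means and standard deviations are then taken over all 25 or 100 such scores.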
Table 3. Classification results obtained over the GXD-caption dataset using different sets of features
| Classifiers | Precision | Recall | F-measure | Accuracy | Utility-10 | Utility-20 |
|---|---|---|---|---|---|---|
| NB_AB | 0.874 (0.023) | 0.656 (0.027) | 0.749 (0.022) | 0.783 (0.017) | 0.866 (0.024) | 0.872 (0.023) |
| RF_AB | 0.779 (0.016) | 0.768 (0.024) | 0.773 (0.017) | 0.802 (0.015) | 0.765 (0.018) | 0.776 (0.017) |
| NB_CAP | 0.831 (0.018) | 0.766 (0.024) | 0.797 (0.019) | 0.808 (0.017) | 0.820 (0.019) | 0.828 (0.018) |
| RF_CAP | 0.855 (0.019) | 0.758 (0.017) | 0.804 (0.015) | 0.817 (0.014) | 0.846 (0.021) | 0.853 (0.020) |
| NB_AB&CAP | 0.816 (0.018) | 0.846 (0.014) | 0.853 (0.013) | |||
| RF_AB&CAP | 0.876 (0.019) | 0.869 (0.020) | 0.875 (0.019) |
AB indicates using features from titles/abstracts only. CAP indicates using features from captions alone. AB&CAP indicates using features from both captions and titles/abstracts.
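The F-measure column can be related to the precision and recall columns with a short worked check. The sketch below assumes the F-measure is the balanced F1 score, that is, the harmonic mean of precision and recall; the source does not spell out the formula here. Only the rows with complete values in the table above are included, and the names `f_measure` and `ROWS` are illustrative. Because the reported figures are averages over 25 runs, the F1 of the averaged precision and recall need not match the averaged F-measure exactly.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for the rows with complete values above.
ROWS = {
    "NB_AB": (0.874, 0.656, 0.749),
    "RF_AB": (0.779, 0.768, 0.773),
    "NB_CAP": (0.831, 0.766, 0.797),
    "RF_CAP": (0.855, 0.758, 0.804),
}

for name, (p, r, reported) in ROWS.items():
    print(f"{name}: computed={f_measure(p, r):.3f} reported={reported:.3f}")
```

For these four rows the recomputed values agree with the reported ones to within rounding, which is consistent with the F1 assumption.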
Figure 2. Comparison of classification results obtained over the GXD-caption dataset using different sets of features. NB denotes Naïve Bayes; RF denotes Random Forest classifier. AB indicates using text-features from titles/abstracts only; CAP indicates using features from captions alone; AB_CAP indicates using features from both captions and titles/abstracts. The results shown are averaged over five complete runs of 5-fold cross validation (25 runs in total). Standard deviations are shown in Table 3.