| Literature DB >> 30717659 |
Christian Simon, Kristian Davidsen, Christina Hansen, Emily Seymour, Mike Bogetofte Barnkob, Lars Rønn Olsen.
Abstract
BACKGROUND: Scientific data and research results are being published at an unprecedented rate. Many database curators and researchers use data and information from the primary literature to populate databases, form hypotheses, or serve as the basis for analyses and validation of results. These efforts largely rely on manual literature surveys, and while repositories such as PubMed enable keyword queries over the vast body of literature, filtering relevant articles from such query results can be a non-trivial and highly time-consuming task.
Keywords: Biological databases; Database curation; Document classification; Literature survey; Machine learning; PubMed; Text mining
Year: 2019 PMID: 30717659 PMCID: PMC7394276 DOI: 10.1186/s12859-019-2607-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
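The background above notes that keyword querying is enabled by repositories such as PubMed. As a minimal sketch of that first step, the snippet below uses Biopython's Entrez module to retrieve abstracts for a keyword search; the e-mail address and search term are placeholders, not values from the paper.

```python
from Bio import Entrez

# NCBI requires a contact address with every E-utilities request.
Entrez.email = "curator@example.org"  # hypothetical address

# Keyword query against PubMed, as described in the abstract.
handle = Entrez.esearch(db="pubmed", term="tumor antigen T cell epitope", retmax=100)
record = Entrez.read(handle)
handle.close()
pmids = record["IdList"]

# Fetch the matching abstracts as plain text for downstream filtering.
handle = Entrez.efetch(db="pubmed", id=",".join(pmids), rettype="abstract", retmode="text")
abstracts = handle.read()
handle.close()
print(f"Retrieved {len(pmids)} records")
```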
Fig. 1 Histogram of the 5-year impact factor of biomedical articles published in 2010. Data were retrieved from http://opencitations.net/
Fig. 2 Workflow of a typical database curation process involving data extraction from the primary literature. First, an initial search using a publication search engine such as PubMed is performed, after which corpora of both relevant and irrelevant articles are defined. These corpora are then used to train a text mining classifier, which is applied in subsequent searches to minimize time spent reading irrelevant articles. With each iteration of data extraction, the size of the corpora increases, thus increasing the performance of the classification algorithm
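A minimal sketch of the classify-and-rank step in this workflow, assuming scikit-learn and a simple bag-of-words model rather than BioReader's actual implementation; the toy abstracts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical corpora: abstracts already judged in earlier curation rounds.
relevant = [
    "Mapping of CD8 T cell epitopes in the influenza nucleoprotein",
    "Novel HLA class I restricted epitopes from Mycobacterium tuberculosis",
]
irrelevant = [
    "Annual review of hospital administration costs",
    "Survey of laboratory information management systems",
]
X = relevant + irrelevant
y = [1] * len(relevant) + [0] * len(irrelevant)

# A bag-of-words classifier stands in for the text mining classifier in Fig. 2.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X, y)

# Rank the next batch of search results so likely relevant articles are read first.
new_abstracts = [
    "T cell epitope discovery in dengue virus proteins",
    "Budget planning for research institutions",
]
scores = model.predict_proba(new_abstracts)[:, 1]
for score, abstract in sorted(zip(scores, new_abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```

With each curation iteration the judged corpora grow, so retraining on the enlarged corpora improves the ranking, as the caption describes.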
Fig. 3 Results pertaining to classification of articles relating to infectious diseases vs. non-infectious diseases (allergy, autoimmunity, cancer, etc.) using a glmnet classifier. a) BioReader learning curve for five-fold cross-validation with glmnet on corpora ranging from 50 to 1500 abstracts in intervals of 10 abstracts (average over 100 iterations). b) ROC curves of performance of BioReader and MedlineRanker trained with 1500 abstracts and evaluated on 500 abstracts excluded from the training. c) BioReader F1 scores for positive and negative abstract classification at varying proportions of training set size (total 750 abstracts) for each category in intervals of 10 abstracts (average over 100 iterations). The classifier was applied to a balanced test set of 500 abstracts
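A rough analogue of the cross-validated evaluation in panel a, assuming scikit-learn's elastic-net penalized logistic regression as a stand-in for glmnet and a public newsgroup corpus in place of the paper's disease abstracts:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups  # stand-in corpus, not the paper's data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Two newsgroups stand in for the infectious vs. non-infectious corpora.
data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])

# Elastic-net penalized logistic regression approximates the glmnet classifier.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000),
)

# Five-fold cross-validated F1, analogous to the evaluation in panels a and c.
scores = cross_val_score(clf, data.data, data.target, cv=5, scoring="f1")
print(f"Mean F1 over 5 folds: {np.mean(scores):.3f}")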
Feature comparison of BioReader, MedlineRanker, and MScanner
| Feature | BioReader | MedlineRanker | MScanner |
|---|---|---|---|
| Positive class input | Yes | Yes | Yes |
| Negative class input | Yes | Yes | No |
| Classification list input | Yes | Yes | No |
| Training features | All words (stemmed to consolidate counts; see the sketch after this table), MeSH, journal, authors | Nouns | MeSH, journal |
| Classification algorithm(s) | Support vector machine, elastic-net regularized generalized linear model, maximum entropy, supervised latent Dirichlet allocation, bagging, boosting, random forest, k-nearest neighbor, regression tree, and naïve Bayes classifiers | Naïve Bayes classifier | Naïve Bayes classifier |
| Output | Ranked lists, term signature (positive and negative), separation visualization (PCA), performance metrics | Ranked lists, term signature (positive), performance metrics | Ranked list |
| Standalone source code available | Yes | No (but offers API) | Yes |
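As a sketch of the stemmed-word feature extraction named in the "Training features" row, paired with the naïve Bayes baseline shared by all three tools (assuming NLTK and scikit-learn; the two abstracts are invented):

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

stemmer = PorterStemmer()

def stem_tokenize(text):
    # Stemming consolidates counts of inflected forms,
    # e.g. "infection", "infections", "infected" -> "infect".
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

# Two invented abstracts standing in for a labeled training corpus.
docs = [
    "Chronic viral infections drive T cell exhaustion",
    "Tumor microenvironment shapes responses to cancer immunotherapy",
]
labels = [1, 0]  # 1 = relevant, 0 = irrelevant

# Word counts over stemmed tokens, per the "Training features" row.
vectorizer = CountVectorizer(tokenizer=stem_tokenize, token_pattern=None)
X = vectorizer.fit_transform(docs)

# Naïve Bayes on these counts is the common baseline across the three tools.
clf = MultinomialNB().fit(X, labels)
print(vectorizer.get_feature_names_out())
```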