| Literature DB >> 30717659 |
Christian Simon, Kristian Davidsen, Christina Hansen, Emily Seymour, Mike Bogetofte Barnkob, Lars Rønn Olsen.
Abstract
BACKGROUND: Scientific data and research results are being published at an unprecedented rate. Many database curators and researchers use data and information from the primary literature to populate databases, form hypotheses, or serve as the basis for analyses and validation of results. These efforts largely rely on manual literature surveys, and while repositories such as PubMed enable keyword queries over the vast body of literature, filtering relevant articles from such query results can be a non-trivial and highly time-consuming task.
Keywords: Biological databases; Database curation; Document classification; Literature survey; Machine learning; PubMed; Text mining
Year: 2019 PMID: 30717659 PMCID: PMC7394276 DOI: 10.1186/s12859-019-2607-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
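The background above notes that keyword querying is enabled by repositories such as PubMed. As a minimal sketch of that first step, the snippet below uses Biopython's Entrez module to retrieve abstracts for a keyword search; the e-mail address and search term are placeholders, not values from the paper.

```python
from Bio import Entrez

# NCBI requires a contact address with every E-utilities request.
Entrez.email = "curator@example.org"  # hypothetical address

# Keyword query against PubMed, as described in the abstract.
handle = Entrez.esearch(db="pubmed", term="tumor antigen T cell epitope", retmax=100)
record = Entrez.read(handle)
handle.close()
pmids = record["IdList"]

# Fetch the matching abstracts as plain text for downstream filtering.
handle = Entrez.efetch(db="pubmed", id=",".join(pmids), rettype="abstract", retmode="text")
abstracts = handle.read()
handle.close()
print(f"Retrieved {len(pmids)} records")
```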
Fig. 1 Histogram of the 5-year impact factor of biomedical articles published in 2010. Data were retrieved from http://opencitations.net/
Fig. 2 Workflow of a typical database curation process involving data extraction from the primary literature. First, an initial search using a publication search engine such as PubMed is performed, after which corpora of both relevant and irrelevant articles are defined. These corpora are then used to train a text mining classifier, which is applied in subsequent searches to minimize time spent reading irrelevant articles. With each iteration of data extraction, the size of the corpora increases, thus increasing the performance of the classification algorithm
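A minimal sketch of the classify-and-rank step in this workflow, assuming scikit-learn and a simple bag-of-words model rather than BioReader's actual implementation; the toy abstracts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical corpora: abstracts already judged in earlier curation rounds.
relevant = [
    "Mapping of CD8 T cell epitopes in the influenza nucleoprotein",
    "Novel HLA class I restricted epitopes from Mycobacterium tuberculosis",
]
irrelevant = [
    "Annual review of hospital administration costs",
    "Survey of laboratory information management systems",
]
X = relevant + irrelevant
y = [1] * len(relevant) + [0] * len(irrelevant)

# A bag-of-words classifier stands in for the text mining classifier in Fig. 2.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X, y)

# Rank the next batch of search results so likely relevant articles are read first.
new_abstracts = [
    "T cell epitope discovery in dengue virus proteins",
    "Budget planning for research institutions",
]
scores = model.predict_proba(new_abstracts)[:, 1]
for score, abstract in sorted(zip(scores, new_abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```

With each curation iteration the judged corpora grow, so retraining on the enlarged corpora improves the ranking, as the caption describes.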
Fig. 3 Results pertaining to classification of articles relating to infectious diseases vs. non-infectious diseases (allergy, autoimmunity, cancer, etc.) using a glmnet classifier. a) BioReader learning curve for five-fold cross-validation with glmnet on corpora ranging from 50 to 1500 abstracts in intervals of 10 abstracts (average over 100 iterations). b) ROC curves of performance of BioReader and MedlineRanker trained with 1500 abstracts and evaluated on 500 abstracts excluded from the training. c) BioReader F1 scores for positive and negative abstract classification at varying proportions of training set size (total 750 abstracts) for each category in intervals of 10 abstracts (average over 100 iterations). The classifier was applied to a balanced test set of 500 abstracts
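A rough analogue of the cross-validated evaluation in panel a, assuming scikit-learn's elastic-net penalized logistic regression as a stand-in for glmnet and a public newsgroup corpus in place of the paper's disease abstracts:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups  # stand-in corpus, not the paper's data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Two newsgroups stand in for the infectious vs. non-infectious corpora.
data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])

# Elastic-net penalized logistic regression approximates the glmnet classifier.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000),
)

# Five-fold cross-validated F1, analogous to the evaluation in panels a and c.
scores = cross_val_score(clf, data.data, data.target, cv=5, scoring="f1")
print(f"Mean F1 over 5 folds: {np.mean(scores):.3f}")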
Feature comparison of BioReader, MedlineRanker, and MScanner
| Feature | BioReader | MedlineRanker | MScanner |
|---|---|---|---|
| Positive class input | Yes | Yes | Yes |
| Negative class input | Yes | Yes | No |
| Classification list input | Yes | Yes | No |
| Training features | All words (stemmed to consolidate counts; see the sketch after this table), MeSH, journal, authors | Nouns | MeSH, journal |
| Classification algorithm(s) | Support vector machine, elastic-net regularized generalized linear model, maximum entropy, supervised latent Dirichlet allocation, bagging, boosting, random forest, k-nearest neighbor, regression tree, and naïve Bayes classifiers | Naïve Bayes classifier | Naïve Bayes classifier |
| Output | Ranked lists, term signature (positive and negative), separation visualization (PCA), performance metrics | Ranked lists, term signature (positive), performance metrics | Ranked list |
| Standalone source code available | Yes | No (but offers API) | Yes |
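As a sketch of the stemmed-word feature extraction named in the "Training features" row, paired with the naïve Bayes baseline shared by all three tools (assuming NLTK and scikit-learn; the two abstracts are invented):

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

stemmer = PorterStemmer()

def stem_tokenize(text):
    # Stemming consolidates counts of inflected forms,
    # e.g. "infection", "infections", "infected" -> "infect".
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

# Two invented abstracts standing in for a labeled training corpus.
docs = [
    "Chronic viral infections drive T cell exhaustion",
    "Tumor microenvironment shapes responses to cancer immunotherapy",
]
labels = [1, 0]  # 1 = relevant, 0 = irrelevant

# Word counts over stemmed tokens, per the "Training features" row.
vectorizer = CountVectorizer(tokenizer=stem_tokenize, token_pattern=None)
X = vectorizer.fit_transform(docs)

# Naïve Bayes on these counts is the common baseline across the three tools.
clf = MultinomialNB().fit(X, labels)
print(vectorizer.get_feature_names_out())
```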