An effective biomedical document classification scheme in support of biocuration: addressing class imbalance.

Xiangying Jiang, Martin Ringwald, Judith A Blake, Cecilia Arighi, Gongbo Zhang, Hagit Shatkay.

Abstract

Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
© The Author(s) 2019. Published by Oxford University Press.
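
The abstract names cluster-based under-sampling but does not spell out the procedure. The following is a minimal sketch of one common reading of that idea, not the authors' implementation: the majority (not relevant) class is split into k clusters, and each cluster is paired with the full minority (relevant) class to form a roughly balanced training subset for one base classifier in a meta-classification setup. The tiny k-means, its evenly spaced initialization, the function names `kmeans` and `balanced_subsets`, and the toy 2-D points are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means on lists of floats (illustrative only)."""
    # Assumption: initialise centroids with k evenly spaced points so the
    # sketch is deterministic; real systems usually use random restarts.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def balanced_subsets(majority, minority, k):
    """Pair each majority-class cluster with the whole minority class."""
    labels = kmeans(majority, k)
    subsets = []
    for c in range(k):
        cluster = [p for p, lab in zip(majority, labels) if lab == c]
        if cluster:
            X = cluster + list(minority)                   # feature vectors
            y = [0] * len(cluster) + [1] * len(minority)   # 0 = not relevant, 1 = relevant
            subsets.append((X, y))
    return subsets

# Toy data standing in for document feature vectors (hypothetical values).
majority = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]]
minority = [[2.0, 2.0], [2.1, 2.0]]

subsets = balanced_subsets(majority, minority, k=3)
for X, y in subsets:
    print(len(X), y)
```

Each subset keeps every minority example while seeing only a slice of the majority class, so no majority document is discarded overall. In a meta-classification scheme, one base classifier would be trained per subset and a second-level classifier would combine their outputs; those stages are omitted here.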

Year: 2019    PMID: 31032839    PMCID: PMC6482935    DOI: 10.1093/database/baz045

Source DB: PubMed    Journal: Database (Oxford)    ISSN: 1758-0463    Impact factor: 3.451

