Integrating image caption information into biomedical document classification in support of biocuration.

Xiangying Jiang¹, Pengyuan Li¹, James Kadin², Judith A Blake², Martin Ringwald², Hagit Shatkay¹.

Abstract

Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012-2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task. 
Moreover, our classifier's performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL.
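The four scores quoted in the abstract are standard functions of a binary confusion matrix. The sketch below is illustrative only and is not the authors' evaluation code: the function names and the toy counts are hypothetical, and only the precision and recall values (0.698 and 0.784) come from the abstract. It shows how the f-measure follows from precision and recall, and why the Matthews correlation coefficient suits an imbalanced dataset such as this one, since it also accounts for true negatives.

```python
import math


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, f-measure and MCC from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC uses all four cells, so a large true-negative pool cannot
    # hide poor performance on the rare "relevant" class.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure(precision, recall),
        "mcc": mcc,
    }


# Consistency check on the reported scores: the harmonic mean of the
# rounded precision and recall should land within 0.001 of the
# reported f-measure of 0.738.
f_from_reported = f_measure(0.698, 0.784)
print(f"f-measure from reported P and R: {f_from_reported:.4f}")

# Hypothetical counts, NOT taken from the paper, with roughly the same
# 1:10 class imbalance as the dataset described in the abstract.
toy = confusion_metrics(tp=80, fp=30, fn=20, tn=870)
print({name: round(value, 3) for name, value in toy.items()})
```

Recomputing the f-measure from the rounded precision and recall gives a value that differs from the reported one only in the third decimal place, as expected when the published score was derived from unrounded inputs.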

Year: 2020 PMID: 32294192 PMCID: PMC7159034 DOI: 10.1093/database/baaa024

Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
