
Analyzing the Moving Parts of a Large-Scale Multi-Label Text Classification Pipeline: Experiences in Indexing Biomedical Articles.

Anthony Rios, Ramakanth Kavuluru.

Abstract

Medical Subject Headings (MeSH) is a controlled hierarchical vocabulary used by the National Library of Medicine (NLM) to index biomedical articles. The 2014 version of the MeSH terminology contains a total of 27,149 terms. Librarians at the NLM tag each biomedical article to be indexed for the PubMed literature search system with terms from MeSH. That is, human indexers read each article's full text and index it with a small set of descriptors, 13 on average, chosen from the more than 27,000 descriptors available in MeSH. Many recent attempts to automate this process have used the article title and abstract text to predict MeSH terms for the corresponding article. There has also been an open automated biomedical indexing challenge, BioASQ [1], which started in 2013. The best general supervised learning framework in these challenges has been a pipeline with four components: (1) pre-processing and feature extraction; (2) employing the binary relevance and/or nearest neighbor approaches to select a set of candidate terms; (3) ranking these candidate terms using corresponding informative features; and (4) applying label calibration to dynamically predict the number of top terms to include in the final selection for the current instance. The specific details of how each of these components is implemented determine the performance variations among the entries in the challenge. In this paper, we analyze these moving parts of the MeSH indexing multi-label classification pipeline with experiments involving different combinations. Our best combination achieves a ≈ 1% increase in micro F-score compared with the top-performing team across the five weeks of the final batch of the BioASQ 2014 challenge. The main takeaway from our efforts is that small improvements or modifications to different components of the pipeline can offer moderate improvements to the overall performance of the method. Our experiences show that, at least thus far, top performances have resulted mostly from these improvements rather than from drastic changes to the core methodology.
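
The four-component pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words features, Jaccard similarity, similarity-weighted voting, and mean-label-count calibration below are simplifying assumptions chosen for brevity, and all function names and the sample data are hypothetical. The `micro_f1` helper shows how a micro-averaged F-score pools true positives, false positives, and false negatives over all articles before computing precision and recall.

```python
from collections import Counter


def tokenize(text):
    """Step 1 (assumed): lowercase bag-of-words features."""
    return set(text.lower().split())


def jaccard(a, b):
    """Set-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbors(query, training, k):
    """Return the k (similarity, labels) pairs most similar to the query."""
    scored = [(jaccard(query, feats), labels) for feats, labels in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def rank_candidates(neighbors):
    """Steps 2-3 (assumed): pool neighbor labels as candidates and
    rank them by similarity-weighted votes (ties broken alphabetically)."""
    scores = Counter()
    for sim, labels in neighbors:
        for label in labels:
            scores[label] += sim
    return sorted(scores, key=lambda label: (-scores[label], label))


def calibrate(neighbors):
    """Step 4 (assumed): keep as many labels as the neighbors carry on average."""
    counts = [len(labels) for _, labels in neighbors]
    return max(1, round(sum(counts) / len(counts)))


def predict(text, training, k=2):
    """Run the toy pipeline end to end for one article."""
    neighbors = nearest_neighbors(tokenize(text), training, k)
    ranked = rank_candidates(neighbors)
    return set(ranked[:calibrate(neighbors)])


def micro_f1(gold, predicted):
    """Micro-averaged F-score: counts are pooled over all instances."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical training articles: (features, assigned MeSH-like labels).
training = [
    (tokenize("insulin therapy in diabetic patients"),
     {"Diabetes Mellitus", "Insulin"}),
    (tokenize("tumor growth in lung cancer patients"),
     {"Lung Neoplasms", "Humans"}),
    (tokenize("insulin resistance and obesity"),
     {"Insulin Resistance", "Obesity", "Insulin"}),
]

if __name__ == "__main__":
    print(predict("insulin dosing in diabetic children", training, k=2))
```

In a realistic setting each step would be far richer (for example, per-label binary classifiers for candidate selection and a learned ranker), which is exactly where the abstract locates the performance differences between challenge entries.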


Year:  2015        PMID: 28758165      PMCID: PMC5530873          DOI: 10.1109/ICHI.2015.6

Source DB:  PubMed          Journal:  IEEE Int Conf Healthc Inform        ISSN: 2575-2626


  10 in total

1.  The NLM Indexing Initiative's Medical Text Indexer.

Authors:  Alan R Aronson; James G Mork; Clifford W Gay; Susanne M Humphrey; Willie J Rogers
Journal:  Stud Health Technol Inform       Date:  2004

2.  The effect of feature representation on MEDLINE document classification.

Authors:  Meliha Yetisgen-Yildiz; Wanda Pratt
Journal:  AMIA Annu Symp Proc       Date:  2005

3.  Optimal training sets for Bayesian prediction of MeSH assignment.

Authors:  Sunghwan Sohn; Won Kim; Donald C Comeau; W John Wilbur
Journal:  J Am Med Inform Assoc       Date:  2008-04-24       Impact factor: 4.497

4.  Unsupervised Medical Subject Heading Assignment Using Output Label Co-occurrence Statistics and Semantic Predications.

Authors:  Ramakanth Kavuluru; Zhenghao He
Journal:  Nat Lang Process Inf Syst       Date:  2013-06

5.  Leveraging output term co-occurrence frequencies and latent associations in predicting medical subject headings.

Authors:  Ramakanth Kavuluru; Yuan Lu
Journal:  Data Knowl Eng       Date:  2014-09-18       Impact factor: 1.992

6.  Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors:  Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal:  Bioinformatics       Date:  2009-03-20       Impact factor: 6.937

7.  Recommending MeSH terms for annotating biomedical articles.

Authors:  Minlie Huang; Aurélie Névéol; Zhiyong Lu
Journal:  J Am Med Inform Assoc       Date:  2011-05-25       Impact factor: 4.497

8.  An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.

Authors:  George Tsatsaronis; Georgios Balikas; Prodromos Malakasiotis; Ioannis Partalas; Matthias Zschunke; Michael R Alvers; Dirk Weissenborn; Anastasia Krithara; Sergios Petridis; Dimitris Polychronopoulos; Yannis Almirantis; John Pavlopoulos; Nicolas Baskiotis; Patrick Gallinari; Thierry Artiéres; Axel-Cyrille Ngonga Ngomo; Norman Heino; Eric Gaussier; Liliana Barrio-Alvers; Michael Schroeder; Ion Androutsopoulos; Georgios Paliouras
Journal:  BMC Bioinformatics       Date:  2015-04-30       Impact factor: 3.169

9.  Context-driven automatic subgraph creation for literature-based discovery.

Authors:  Delroy Cameron; Ramakanth Kavuluru; Thomas C Rindflesch; Amit P Sheth; Krishnaprasad Thirunarayan; Olivier Bodenreider
Journal:  J Biomed Inform       Date:  2015-02-07       Impact factor: 6.317

10.  PubMed related articles: a probabilistic topic-based model for content similarity.

Authors:  Jimmy Lin; W John Wilbur
Journal:  BMC Bioinformatics       Date:  2007-10-30       Impact factor: 3.169

  1 in total

1.  Predicting mental conditions based on "history of present illness" in psychiatric notes with deep neural networks.

Authors:  Tung Tran; Ramakanth Kavuluru
Journal:  J Biomed Inform       Date:  2017-06-10       Impact factor: 6.317

