
Screening PubMed abstracts: is class imbalance always a challenge to machine learning?

Corrado Lanera, Paola Berchialla, Abhinav Sharma, Clara Minto, Dario Gregori, Ileana Baldi.

Abstract

BACKGROUND: The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques with data preprocessing for class imbalance to identify the best-performing strategy for screening articles in PubMed for inclusion in systematic reviews.
METHODS: We trained four binary text classifiers (support vector machines, k-nearest neighbor, random forest, and elastic-net regularized generalized linear models) in combination with four techniques for class imbalance: random undersampling and random oversampling, each with 50:50 and 35:65 positive-to-negative class ratios, plus no resampling as a benchmark. We used textual data from 14 systematic reviews as case studies. The difference between the cross-validated area under the receiver operating characteristic curve (AUC-ROC) with and without preprocessing (delta AUC) was estimated within each systematic review, separately for each classifier. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and strategy.
RESULTS: Cross-validated AUC-ROC for machine learning techniques (excluding k-nearest neighbor) without preprocessing was predominantly above 90%. Except for k-nearest neighbor, machine learning techniques achieved the best improvement in conjunction with random oversampling 50:50 and random undersampling 35:65.
CONCLUSIONS: Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
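
ILLUSTRATION: The resampling and pooling steps named in METHODS can be sketched as follows. This is a minimal, standard-library sketch and not the authors' code: the function names `resample_to_ratio` and `fixed_effect_pool`, the 0/1 label encoding, and the example numbers are all assumptions made for illustration. It also assumes the positive class is the minority, and that resampling would be applied to training folds only, which the abstract does not state.

```python
import random
from math import sqrt


def resample_to_ratio(labels, pos_share, mode, seed=0):
    """Return row indices resampled so positives make up about `pos_share`.

    Hypothetical helper. Labels are 1 (positive) or 0 (negative), and
    positives are assumed to be the minority class.
    mode="under": randomly drop negatives.
    mode="over":  randomly duplicate positives.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if mode == "under":
        # Keep every positive; keep just enough negatives for the target share.
        n_neg = round(len(pos) * (1 - pos_share) / pos_share)
        neg = rng.sample(neg, min(n_neg, len(neg)))
    elif mode == "over":
        # Keep every negative; draw extra positives with replacement.
        n_pos = round(len(neg) * pos_share / (1 - pos_share))
        extra = max(n_pos - len(pos), 0)
        pos = pos + [rng.choice(pos) for _ in range(extra)]
    else:
        raise ValueError("mode must be 'under' or 'over'")
    idx = pos + neg
    rng.shuffle(idx)
    return idx


def fixed_effect_pool(deltas, std_errors):
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * d for w, d in zip(weights, deltas)) / sum(weights)
    return pooled, sqrt(1.0 / sum(weights))


# Made-up screening set: 10 relevant citations among 200.
labels = [1] * 10 + [0] * 190
under_35_65 = resample_to_ratio(labels, 0.35, "under")
over_50_50 = resample_to_ratio(labels, 0.50, "over")
print(len(under_35_65), len(over_50_50))

# Made-up delta AUCs and standard errors from two hypothetical reviews.
print(fixed_effect_pool([0.01, 0.03], [0.01, 0.01]))
```

Undersampling shrinks the training set while oversampling enlarges it, which is consistent with the computational preference for random undersampling 35:65 stated in CONCLUSIONS.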


Keywords:  Classification; Indexed search engine; Machine learning; Text mining; Unbalanced data; Systematic review

Year:  2019        PMID: 31810495     DOI: 10.1186/s13643-019-1245-8

Source DB:  PubMed          Journal:  Syst Rev        ISSN: 2046-4053


