BACKGROUND: Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which different subject areas of biomedicine are characterised by different linguistic behaviour. While variation at a coarser domain level, such as that between newswire and biomedical text, is well studied and known to affect the portability of NLP systems, we are the first to conduct an extensive investigation into more fine-grained levels of variation.

RESULTS: Using the large OpenPMC text corpus, which spans the many subdomains of biomedicine, we investigate variation across a number of lexical, syntactic, semantic and discourse-related dimensions. These dimensions are chosen for their relevance to the performance of NLP systems. We use clustering techniques to analyse commonalities and distinctions among the subdomains.

CONCLUSIONS: We find that while patterns of inter-subdomain variation differ somewhat from one feature set to another, robust clusters can be identified that correspond to intuitive distinctions such as that between clinical and laboratory subjects. In particular, subdomains relating to genetics and molecular biology, which are the most common sources of material for training and evaluating biomedical NLP tools, are not representative of all biomedical subdomains. We conclude that an awareness of subdomain variation is important when considering the practical use of language processing applications by biomedical researchers.
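To make the idea of clustering subdomains by linguistic features concrete, the following is a minimal sketch and not the method used in the paper. The abstract does not name a distance measure or clustering algorithm, so the choice of Jensen-Shannon divergence over unigram distributions and of average-linkage agglomerative clustering is an assumption, as are the four toy subdomain texts and all function names.

```python
import math
from collections import Counter
from itertools import combinations


def unigram_dist(text):
    """Relative frequency of each lowercased whitespace token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2), which lies in [0, 1]."""
    def kl_to_mixture(a):
        # KL(a || m), where m is the even mixture of p and q.
        return sum(
            pa * math.log2(pa / (0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0)))
            for w, pa in a.items() if pa > 0
        )
    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def average_linkage(names, dist):
    """Agglomerative clustering; returns the merges in the order they occur."""
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            a, b = pair
            total = sum(dist[frozenset((x, y))] for x in a for y in b)
            return total / (len(a) * len(b))
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


# Hypothetical toy texts standing in for subdomain corpora (assumption).
SUBDOMAINS = {
    "genetics": "gene expression was measured in mutant cells and the gene was sequenced",
    "molecular_biology": "protein expression in cells was measured and the protein was purified",
    "cardiology": "patients were treated and patients reported chest pain after surgery",
    "nursing": "patients reported pain and nurses treated patients after surgery",
}

dists = {name: unigram_dist(text) for name, text in SUBDOMAINS.items()}
pairwise = {
    frozenset((a, b)): js_divergence(dists[a], dists[b])
    for a, b in combinations(SUBDOMAINS, 2)
}
merges = average_linkage(list(SUBDOMAINS), pairwise)

for left, right, distance in merges:
    print(left, "+", right, round(distance, 3))
```

On these toy texts the last merge joins a laboratory-style group with a clinical-style group, which mirrors the kind of distinction the abstract reports. A real analysis would use far larger samples and the richer lexical, syntactic, semantic and discourse features the authors describe.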