
Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora.

Irina P Temnikova, William A Baumgartner, Negacy D Hailu, Ivelina Nikolova, Tony McEnery, Adam Kilgarriff, Galia Angelova, K Bretonnel Cohen.

Abstract

Sublanguages are varieties of language that form "subsets" of the general language, typically exhibiting particular lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora in order to determine the extent to which they are sublanguages or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.
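As an illustration of what a lexical closure assessment can look like, the sketch below tracks how many distinct word types have been seen as more tokens of a corpus are read. A curve that flattens suggests a closed vocabulary, as expected of a sublanguage, while a curve that keeps rising suggests open-ended general language. This is not SubCAT's code: the function name `type_growth_curve`, the `step` parameter, and the lowercasing of tokens are assumptions made for this example only.

```python
from typing import Iterable, List, Tuple


def type_growth_curve(tokens: Iterable[str], step: int = 1000) -> List[Tuple[int, int]]:
    """Return (tokens_seen, distinct_types_seen) pairs sampled every `step` tokens.

    Hypothetical helper for illustration; not part of SubCAT.
    """
    seen = set()   # distinct word types observed so far
    curve = []     # sampled points of the growth curve
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen.add(token.lower())  # assumption: case-insensitive types
        if count % step == 0:
            curve.append((count, len(seen)))
    # Record the final partial window so the end of the corpus is not lost.
    if count % step != 0:
        curve.append((count, len(seen)))
    return curve


if __name__ == "__main__":
    sample = "the patient was given the drug the patient was discharged".split()
    for n_tokens, n_types in type_growth_curve(sample, step=5):
        print(n_tokens, n_types)
```

Plotting the second value against the first for two corpora of similar size gives a quick visual comparison of their closure behaviour. The same counting scheme could in principle be applied to morphological tags or sentence types instead of word forms.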

Keywords:  corpus linguistics; sublanguage characterisation; sublanguage recognition

Year:  2014        PMID: 29568819      PMCID: PMC5860848     

Source DB:  PubMed          Journal:  LREC Int Conf Lang Resour Eval


