Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora.

Irina P Temnikova¹, William A Baumgartner², Negacy D Hailu², Ivelina Nikolova³, Tony McEnery⁴, Adam Kilgarriff⁵, Galia Angelova³, K Bretonnel Cohen².

Abstract

Sublanguages are varieties of language that form "subsets" of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.
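Lexical closure is usually understood as the tendency of the number of distinct word types to level off as more tokens of a corpus are read. The following is a minimal sketch of that idea, not SubCAT's own implementation: the function names `lexical_closure_curve` and `tail_growth_rate`, the `step` parameter, and the lowercasing of tokens are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def lexical_closure_curve(
    tokens: Iterable[str], step: int = 1000
) -> List[Tuple[int, int]]:
    """Return (tokens_seen, distinct_types_seen) pairs sampled every `step` tokens.

    Hypothetical helper: case folding and sampling interval are illustrative choices.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    seen = set()
    curve: List[Tuple[int, int]] = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        if count % step == 0:
            curve.append((count, len(seen)))
    # Record a final point when the corpus length is not a multiple of `step`.
    if count and count % step != 0:
        curve.append((count, len(seen)))
    return curve


def tail_growth_rate(curve: List[Tuple[int, int]]) -> Optional[float]:
    """New types per token over the last sampled interval.

    A value near 0 suggests the vocabulary has stopped growing, which is
    the behaviour expected of a lexically closed sublanguage.
    """
    if len(curve) < 2:
        return None
    (n0, t0), (n1, t1) = curve[-2], curve[-1]
    return (t1 - t0) / (n1 - n0)
```

Plotting the curve for a suspected sublanguage corpus next to one for a general-language sample of the same size gives a rough visual comparison: the flatter curve is the one closer to closure.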


Keywords: corpus linguistics; sublanguage characterisation; sublanguage recognition

Year: 2014 PMID: 29568819 PMCID: PMC5860848

Source DB: PubMed Journal: LREC Int Conf Lang Resour Eval
