Çağrı Çöltekin, A. Seza Doğruöz, Özlem Çetinoğlu.
Abstract
This paper presents a comprehensive survey of corpora and lexical resources available for Turkish. We review a broad range of resources, focusing on the ones that are publicly available. In addition to providing information about the available linguistic resources, we present a set of recommendations, and identify gaps in the data available for conducting research and building applications in Turkish Linguistics and Natural Language Processing.
Keywords: Corpora; Lexical resources; Linguistics; NLP; Turkish
Year: 2022 PMID: 36060268 PMCID: PMC9417072 DOI: 10.1007/s10579-022-09605-4
Source DB: PubMed Journal: Lang Resour Eval ISSN: 1574-020X Impact factor: 1.835
A summary of currently available Turkish treebanks
| Treebank | Type | Sentences | Tokens |
|---|---|---|---|
| METU-Sabancı (Oflazer et al.) | dep | 5 635 | 56 396 |
| ITU Web (Pamay et al.) | dep | 5 009 | 43 191 |
| UD-GB (Çöltekin) | dep | 2 880 | 16 803 |
| UD-PUD (Zeman et al.) | dep | 1 000 | 16 536 |
| UD-BOUN (Türk Utku et al.) | dep | 9 761 | 121 214 |
| TWT (Kayadelen et al.) | dep | 4 851 | 66 466 |
| Turkish-Penn-CS (Yıldız et al.) | con | 9 560 | 81 419 |
| UD-Turkish-Penn | dep | 9 560 | 87 367 |
| UD-Tourism | dep | 19 750 | 92 200 |
| UD-Kenet | dep | 18 700 | 178 700 |
| UD-FrameNet | dep | 2 700 | 19 221 |
The numbers in the table are based on our own counts on the most recent versions of the datasets. Not all of this information is reported in the respective papers, and the numbers reported in the papers may differ from those in the released datasets.
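As an illustration of how sentence and token counts like these could be reproduced for a dependency treebank distributed in CoNLL-U format, the following is a minimal sketch, not the counting procedure used for the table. It assumes that a "token" is a syntactic word (a line whose ID is a plain integer), so multiword-token ranges and empty nodes are skipped; a different definition of token would give different totals. The function name `count_conllu` and the toy input `SAMPLE` are made up for this example.

```python
from typing import Iterable, Tuple


def count_conllu(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentences, tokens) for CoNLL-U formatted lines.

    Assumption: a token is a syntactic word, i.e. a line whose ID field
    is a plain integer. Multiword-token ranges (e.g. "1-2") and empty
    nodes (e.g. "1.1") are not counted.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        token_id = line.split("\t", 1)[0]
        if token_id.isdigit():
            tokens += 1
        in_sentence = True
    if in_sentence:
        sentences += 1  # input did not end with a blank line
    return sentences, tokens


# Toy input (hypothetical, most fields left as "_"): two sentences,
# the first containing a multiword token split into two syntactic words.
SAMPLE = (
    "# text = Evdeyim\n"
    "1-2\tEvdeyim\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tEvde\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tyim\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# text = Geldi\n"
    "1\tGeldi\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    print(count_conllu(SAMPLE.splitlines()))  # prints (2, 3)
```

To count a real treebank, the same function can be given an open file object, since it only iterates over lines.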
A summary of WSD resources
| Resource | Type | Additional | Samples | Sent. |
|---|---|---|---|---|
| METU (Orhan et al.) | Lexical sample | morph, dep | 26 | 5 385 |
| ITU (İlgen et al.) | Lexical sample | – | 35 | 3 616 |
| Işık (Akçakaya & Yıldız) | All-words | morph, con | 7 595 | 83 474 |
The ‘Additional’ column mentions additional annotations, namely, morph: POS tags and morphology, dep: dependency, con: constituency
A selection of parallel corpora available for Turkish
| Corpus | Text type | Languages | Sentences |
|---|---|---|---|
| Bianet (Ataman) | News | English, Kurdish | 61 472 |
| Bible | Religious | Multiple (102) | 48 500 |
| EU book shop | EU texts | Multiple (48) | 33 398 |
| GlobalVoices | News | Multiple (92) | 8 796 |
| JW300 (Agić & Vulić) | Religious | Multiple (380) | 535 353 |
| OpenSubtitles | Subtitles | Multiple (62) | 173 215 360 |
| QED (Abdelali et al.) | Educational | Multiple (225) | 753 343 |
| SETimes (Tyers & Alperen) | News | Balkan (10) | 1 776 431 |
| TED talks | Subtitles | English | 746 857 |
| Tanzil | Religious | Multiple (42) | 105 597 |
| Tatoeba | Misc | Multiple (359) | 746 857 |
| Wikipedia (Wołk & Marasek) | Wikipedia | English, Polish | 175 972 |
| infopakki | Informational | Multiple (12) | 50 909 |
The third column lists the languages in each corpus (the counts include Turkish). For massively parallel corpora, Turkish may not be aligned to all languages. The number of sentences is the number of Turkish sentences in the corpus; the number of sentences actually aligned varies with the target language. All numbers are based on the corpora as available from the OPUS parallel corpora collection (http://opus.nlpl.eu/).
The statistics for Turkish large-scale lexicons
| Lexicon | Lexemes | Additional |
|---|---|---|
| TELL (Inkelas et al.) | 30 000 | phonemic transcriptions, roots, inflected forms, etymo. |
| LC-STAR (Fersøe et al.) | 104 513 | phonetic transcriptions |
| BabelNet (Navigli & Ponzetto) | ? | translations, semantic relations |
| Panlex (Kamholz et al.) | 242 635 | translations |
The ‘Additional’ column mentions additional annotations. ‘etymo.’ stands for etymological source
Turkish PropBanks and their basic statistics. ‘Avg. arg/prd’ stands for average arguments per predicate
| PropBank | Sentences | Avg. arg/prd |
|---|---|---|
| Turkish PropBank (Şahin & Adalı) | 5635 | 1.80 |
| Turkish PropBank (Ak et al.) | 9560 | – |
| TRopBank (Kara et al.) | ? | 1.68 |
The statistics for Turkish sentiment lexicons. For SentiTurkNet, each synset member is counted as one token
| Sentiment Lexicon | Tokens | Polarity |
|---|---|---|
| Tr SentiStrength (Vural) | 1366 | Pos (1-5), Neg (1-5) |
| Multilingualsentiment (Chen and Skiena) | 2500 | Pos, Neg |
| SentiTurkNet (Dehkharghani et al.) | 21623 | Pos (0-7), Neg (0-7), Neut |