Robert Forkel, Johann-Mattis List, Simon J Greenhill, Christoph Rzymski, Sebastian Bank, Michael Cysouw, Harald Hammarström, Martin Haspelmath, Gereon A Kaiping, Russell D Gray.
Abstract
The amount of available digital data for the languages of the world is constantly increasing. Unfortunately, most of the digital data are provided in a large variety of formats and are therefore not amenable to comparison and re-use. The Cross-Linguistic Data Formats initiative proposes new standards for two basic types of data in historical and typological language comparison (word lists, structural datasets) and a framework to incorporate more data types (e.g. parallel texts and dictionaries). The new specification for cross-linguistic data formats comes with a software package for validation and manipulation, a basic ontology which links to more general frameworks, and usage examples of best practices.
Year: 2018 PMID: 30325347 PMCID: PMC6190742 DOI: 10.1038/sdata.2018.205
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Basic rules of data coding, taking cognate coding in wordlists as an example.
(a) Illustrates why long tables[53] should be favored throughout all applications. (b) Underlines the importance of anticipating multiple tables along with metadata indicating how they should be linked[44].
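The preference for long tables described in this caption can be illustrated with a minimal Python sketch. The wide-format input, its column names, and the cognate labels are hypothetical illustrations, not data taken from the figure; the sketch only assumes that each wide cell holds a form followed by a cognate class in parentheses.

```python
import csv
import io

# Hypothetical wide-format word list: one column per language, with the
# cognate class squeezed into the same cell as the form.
WIDE = """concept,German,English
hand,Hand (1),hand (1)
tree,Baum (2),tree (3)
"""


def wide_to_long(text):
    """Return one row per (language, concept) pair, one value per cell."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        concept = record.pop("concept")
        for language, cell in record.items():
            # Split "Hand (1)" into the form "Hand" and the cognate label "1".
            form, _, cognate = cell.partition(" (")
            rows.append({
                "ID": f"{language}-{concept}",
                "Language": language,
                "Concept": concept,
                "Form": form,
                "Cognate_Set": cognate.rstrip(")"),
            })
    return rows


long_rows = wide_to_long(WIDE)
for row in long_rows:
    print(row)
```

In the long layout, adding a language or a second cognate judgment means adding rows rather than columns, so the header stays stable across datasets.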
Examples of popular databases produced within the CLLD framework.
| World Atlas of Language Structures | wals.info | Grammatical survey of more than 2000 languages world-wide. |
| Comparative Siouan Dictionary | csd.clld.org | Etymological dictionary of Siouan languages. |
| Phoible | phoible.org | Cross-linguistic survey of sound inventories for more than 2000 languages world-wide. |
| Glottolog | glottolog.org | Reference catalogue of language names, geographic locations, and affiliations. |
| Concepticon | concepticon.clld.org | Reference catalogue of word meanings and concepts used in cross-linguistic surveys and psycholinguistic studies. |
Figure 2. Using CSVW metadata to describe the files making up a CLDF dataset.
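As a rough sketch of what such a metadata description might look like, the following Python snippet builds a small CSVW-style table group and checks that every foreign key points at a table the metadata actually describes. The file names (`languages.csv`, `forms.csv`), the column names, and the helper `dangling_references` are assumptions made for illustration; only generic CSVW keys (`tables`, `url`, `tableSchema`, `primaryKey`, `foreignKeys`) are used, and the content of the figure itself is not reproduced.

```python
import json

# Hypothetical dataset: two CSV files linked through a foreign key.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "tables": [
        {
            "url": "languages.csv",
            "tableSchema": {
                "columns": [
                    {"name": "ID", "datatype": "string"},
                    {"name": "Name", "datatype": "string"},
                ],
                "primaryKey": "ID",
            },
        },
        {
            "url": "forms.csv",
            "tableSchema": {
                "columns": [
                    {"name": "ID", "datatype": "string"},
                    {"name": "Language_ID", "datatype": "string"},
                    {"name": "Form", "datatype": "string"},
                ],
                "primaryKey": "ID",
                "foreignKeys": [
                    {
                        "columnReference": "Language_ID",
                        "reference": {
                            "resource": "languages.csv",
                            "columnReference": "ID",
                        },
                    }
                ],
            },
        },
    ],
}


def dangling_references(meta):
    """List (table, target) pairs whose target table is not described."""
    described = {table["url"] for table in meta["tables"]}
    return [
        (table["url"], fk["reference"]["resource"])
        for table in meta["tables"]
        for fk in table["tableSchema"].get("foreignKeys", [])
        if fk["reference"]["resource"] not in described
    ]


print(json.dumps(metadata, indent=2))
print("dangling references:", dangling_references(metadata))
```

Keeping the links in a separate JSON file leaves the CSV files themselves plain, so they remain editable in a spreadsheet or text editor.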
Practical demands regarding cross-linguistic data formats.
| P | PEP 20 | “Simple things should be simple, complex things should be possible” (cf. |
| R | Referencing | If entities and parameters can be linked to reference catalogues such as Glottolog or Concepticon, this should be preferred to duplicating information. |
| A | Aggregability | Data should be simple to concatenate, merge, and aggregate in order to guarantee their reusability. |
| H | Human- and machine-readability | Data should be both editable |
| T | Text | Data should be encoded as UTF-8 text files or in formats that provide full support for UTF-8. |
| I | Identifiers | Identifiers should be resolvable HTTP URLs where possible; if they are not, this should be documented in the metadata. |
| C | Compatibility | Compatibility with existing tools, standards, and practices should always be kept in mind and never easily given up. |
| E | Explicitness | One row should only store one data point, and each cell should only have one type of data, unless specified in the metadata. |
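Two of the demands in this table, Text and Aggregability, can be sketched in a few lines of Python. The two sources, their identifiers, and the helper names `read_rows` and `aggregate` are hypothetical; the sketch assumes long-format tables that share one header and carry a unique `ID` column.

```python
import csv
import io

# Two hypothetical long-format word lists sharing one header (illustrative data).
SOURCE_A = "ID,Language_ID,Concept,Form\na-1,lang-a,hand,Hand\na-2,lang-a,tree,Baum\n"
SOURCE_B = "ID,Language_ID,Concept,Form\nb-1,lang-b,hand,hand\n"


def read_rows(data):
    """Decode UTF-8 bytes (the Text rule) and parse them as CSV rows."""
    return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))


def aggregate(*sources):
    """Concatenate long tables (the Aggregability rule), rejecting duplicate IDs."""
    merged, seen = [], set()
    for data in sources:
        for row in read_rows(data):
            if row["ID"] in seen:
                raise ValueError(f"duplicate ID: {row['ID']}")
            seen.add(row["ID"])
            merged.append(row)
    return merged


merged = aggregate(SOURCE_A.encode("utf-8"), SOURCE_B.encode("utf-8"))
print(len(merged), "rows after aggregation")
```

Because every row is one data point with its own identifier, merging reduces to concatenation plus a uniqueness check.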