Timo Rantanen1, Harri Tolvanen1, Meeli Roose1, Jussi Ylikoski2, Outi Vesakoski3,4.
Abstract
Despite remarkable progress in digital linguistics, extensive databases of geographical language distributions are missing. This hampers both studies on language spatiality and public outreach on language diversity. We present best practices for creating and sharing digital spatial language data by collecting and harmonizing Uralic language distributions as a case study. Language distribution studies have utilized various methodologies, and the results are often available only as printed maps or written descriptions. In order to analyze language spatiality, the information must be digitized into geospatial data, which contain location, time and other parameters. When compiled and harmonized, these data can be used to study changes in the distribution of languages and can be combined with, for example, population and environmental data. We also utilized the knowledge of language experts to adjust previous and new information on language distributions into state-of-the-art maps. The extensive database, including the distribution datasets and detailed map visualizations of the Uralic languages, is introduced alongside this article and is freely available.
Year: 2022 PMID: 35675367 PMCID: PMC9176854 DOI: 10.1371/journal.pone.0269648
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Recommended contents of geospatial datasets presenting the distribution of languages, with the benefits of each recommendation and the solutions selected in the case study of the Uralic languages.
| Characteristic | Advisable types/features | Benefit/comment | Selected in the case study |
|---|---|---|---|
| Data type | Vector data | Enables representing the exact location and shape of the object | Vector data |
| Geometry type | Polygon, point | Polygon: works for areal data, presenting the object’s boundaries; Point: works for presenting a point-like distribution or an extensive distribution in general | Polygon whenever possible, point in a few exceptions |
| File format | Interoperable, up-to-date format, e.g. SHP, GeoJSON, WKT | SHP: widely used, easy to convert; GeoJSON, WKT: open source-based, new technology | SHP |
| Coordinate system | WGS84 | Standard in digital map services, works for global and continental-wide data, compatible with other spatio-linguistic and interdisciplinary data | WGS84 |
| Attribute data | ID/FID, language, dialect, branch, time period, sources, Glottocode, ISO code, other information | Increases information on the identity, usability and sharing | All suggested |
| Temporal division | Data-specific, e.g. exact date, division by centuries or a more general approach when appropriate | Exact date: when dating is well known (present-day data); Division by centuries: well-known historical data; More general division: imprecise historical data | More general: division into traditional and current |
| Metadata description and file naming | Comprehensive description of data content including at least: file format, data type, coordinate system, data sources, temporal extent and ownership; executed with logical file naming | Enhances the systematicity, transparency and usability of the data | All suggested, plus points of contact and maintenance frequency |
Geospatial data consist of spatial data (location: coordinates, place name, etc.) and attribute data (features: name or ID of language, name or ID of dialect, etc.). To achieve the best possible structure and operability for each dataset, a data-specific approach is recommended.
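The recommendations in the table can be illustrated with a minimal sketch of a single language-area record. It is written as GeoJSON (one of the formats the table lists) rather than SHP (the format selected in the case study), only because GeoJSON can be built with the Python standard library. The function name `make_feature`, the attribute keys, the coordinates and all attribute values are hypothetical placeholders, not entries from the published database.

```python
import json
import re

# Attribute fields suggested in the table; the key names are assumptions.
REQUIRED_ATTRIBUTES = (
    "id", "language", "dialect", "branch",
    "time_period", "sources", "glottocode", "iso_code",
)


def make_feature(ring, attributes):
    """Build one GeoJSON polygon Feature from (lon, lat) pairs in WGS84."""
    missing = [key for key in REQUIRED_ATTRIBUTES if key not in attributes]
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    # Glottocodes consist of four alphanumeric characters and four digits.
    if not re.fullmatch(r"[a-z0-9]{4}\d{4}", attributes["glottocode"]):
        raise ValueError("glottocode has an unexpected format")
    for lon, lat in ring:
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            raise ValueError(f"coordinate outside WGS84 range: {(lon, lat)}")
    # GeoJSON requires the first and last position of a ring to be identical.
    closed = list(ring) if ring[0] == ring[-1] else list(ring) + [ring[0]]
    return {
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [[list(position) for position in closed]],
        },
        "properties": dict(attributes),
    }


# Hypothetical placeholder record; none of these values come from the database.
example = make_feature(
    [(65.0, 60.0), (70.0, 60.0), (70.0, 63.0), (65.0, 63.0)],
    {
        "id": 1,
        "language": "ExampleLanguage",
        "dialect": "ExampleDialect",
        "branch": "ExampleBranch",
        "time_period": "traditional",
        "sources": "Example source (year)",
        "glottocode": "abcd1234",
        "iso_code": "abc",
    },
)
print(json.dumps(example, indent=2))
```

Validating the attribute set when a record is created is one way to keep every file in a collection consistent with the same schema, which is what makes later harmonization and sharing straightforward.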
Fig 1. Geographical overlap of different source materials concerning the distribution of the Khanty language(s) at the beginning of the 20th century.
The original sources, Zsirai [34], Haarmann [41], Lytkin et al. [35], Grünthal & Salminen [33] and Abondolo [42], are visualized using the boundary of each polygon. The solid green area was created by merging the distributions of all Khanty sources, and it indicates the area where Khanty could have been spoken. Basemap datasets from Natural Earth [43], Digital Chart of the World [44] and ESRI [45].
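The idea of merging several source polygons and inspecting where they agree can be sketched in a few lines. The example below is not the procedure used for the figure, which would normally rely on a polygon union in GIS software. It substitutes a coarse grid approximation: each grid cell centre is tested against every source polygon with a ray-casting point-in-polygon test, and the number of covering sources is counted. The source names, the rectangles and the grid extent are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def source_agreement(sources, lon_min, lat_min, lon_max, lat_max, step):
    """Map each grid cell index (i, j) to the number of sources covering its centre."""
    counts = {}
    nx = int(round((lon_max - lon_min) / step))
    ny = int(round((lat_max - lat_min) / step))
    for i in range(nx):
        for j in range(ny):
            lon = lon_min + (i + 0.5) * step
            lat = lat_min + (j + 0.5) * step
            n = sum(point_in_polygon(lon, lat, ring) for ring in sources.values())
            if n:
                counts[(i, j)] = n
    return counts


# Two hypothetical, partly overlapping source polygons as (lon, lat) rings.
sources = {
    "source_a": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "source_b": [(2.0, 2.0), (6.0, 2.0), (6.0, 6.0), (2.0, 6.0)],
}
counts = source_agreement(sources, 0.0, 0.0, 6.0, 6.0, 1.0)

union_cells = len(counts)  # cells covered by at least one source
agreed_cells = sum(1 for n in counts.values() if n == len(sources))
print(union_cells, agreed_cells)  # prints: 28 4
```

Cells with a non-zero count correspond to the merged area in which the language could have been spoken according to at least one source, while cells covered by every source mark the core on which the sources agree. Treating degrees as planar units is a simplification that is acceptable only for a rough illustration.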
Fig 2. Workflow for best practices in handling language family data. The workflow includes three separate phases: (I) processing and harmonization of the spatial data collection, a path from analog and digital source data to a consistent geospatial database; (II) visualization combined with queries from experts in the case of lesser-studied languages, and creation of improved new maps based on updated information; and (III) data sharing.
The outcomes of the best practices increase research opportunities and general understanding of language distributions. Original data and output are shown as rectangles, processing as ovals and overall benefits as hexagons. Details of the workflow are described in Section ‘Methods’.
The number of dataset files, divided into distributions from the original published studies (original) and expert-modified distributions (expert), for the two overall time periods.
| Time period | Original | Expert | Sum |
|---|---|---|---|
| Traditional | 148 | 55 | 203 |
| Current | 3 | 20 | 23 |
| Sum | 151 | 75 | 226 |
Fig 3. Geographical distribution of the Uralic languages at the beginning of the 20th century.
The uncontroversial branches of the family are presented without overlapping areas. A list of original sources is available in S2 Appendix. Basemap datasets from Natural Earth [43], Digital Chart of the World [44] and ESRI [45].
Fig 4. Samoyedic languages at the beginning of the 20th century.
Languages are presented without overlapping areas. Original sources: Soviet Census of 1926 [54], Popov [55], Dolgikh [38], Dolgikh & Fajnberg [56], Dolgikh [57], Verbov [58], Grünthal & Salminen [33], Helimski [59], Tuchkova et al. [60], Siegl [61], Brykina & Gusev [62]. Basemap datasets from Natural Earth [43] and Digital Chart of the World [44].
Fig 5. Traditional (a) and current (b) distribution of Selkup. A comparison of the maps demonstrates the changes in language and dialectal distribution over time. Original sources for the traditional distribution are Grünthal & Salminen [33] and Tuchkova et al. [60]; for the current distribution, Tuchkova et al. [60] and Kazakevich [63]. Basemap datasets from Natural Earth [43] and Digital Chart of the World [44].