Best practices for spatial language data harmonization, sharing and map creation-A case study of Uralic.

Timo Rantanen¹, Harri Tolvanen¹, Meeli Roose¹, Jussi Ylikoski², Outi Vesakoski³,⁴.

Abstract

Despite remarkable progress in digital linguistics, extensive databases of geographical language distributions are missing. This hampers both studies on language spatiality and public outreach on language diversity. We present best practices for creating and sharing digital spatial language data by collecting and harmonizing Uralic language distributions as a case study. Language distribution studies have utilized various methodologies, and the results are often available as printed maps or written descriptions. In order to analyze language spatiality, the information must be digitized into geospatial data, which contains location, time and other parameters. When compiled and harmonized, this data can be used to study changes in languages' distribution, and combined with, for example, population and environmental data. We also utilized the knowledge of language experts to adjust previous and new information on language distributions into state-of-the-art maps. The extensive database, including the distribution datasets and detailed map visualizations of the Uralic languages, is introduced alongside this article and is freely available.

Year: 2022 PMID: 35675367 PMCID: PMC9176854 DOI: 10.1371/journal.pone.0269648

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Introduction

Language geography has recently gained new attention from the growing interest in human history research, which draws evidence from genetic, cultural, and linguistic studies [1-4]. The data from different disciplines studying the human past often includes spatial and/or temporal dimensions, i.e. information about the location and time of each observation. These parameters can be utilized when fusing diverse data on human history as spatial information. Although the usage of geographical distances, for example in gene–language correlation studies, has become more common since a seminal paper by Creanza et al. [5], spatial data and methods have much untapped potential in studies of human history. Linguistic research has often concentrated on non-spatial aspects [6], and geographical inventories grew in number as late as in the 19th century [7]. Many of the first maps depicting the distribution of languages (often labeled as language area or speaker area) were actually illustrations of the locations of different ethnic groups [8-10], at a time when ethnic and linguistic identity were strongly connected. Throughout the history of linguistic cartography, language distributions have often been presented on published maps as non-overlapping regions, or simply as a text label over an approximate location. Occasionally, the distribution of languages has been documented only in written text, especially at the dialectal level. For the sake of cartographic clarity, language maps have also been commonly created from a monolingual perspective or using political mapping units concurrently concealing the regional diversity of languages [11, 12]. The spatial accuracy of the location information in the original studies varies greatly, as some sources aim at giving an overview of the whole language family, whereas others provide a detailed view of individual languages or dialects. 
In addition, a systematic description of the original mapping methods is often lacking, which complicates the comparability of the data sources. There are about 7000 languages in the world [13, 14], and except for language isolates, none of the language families are comprehensively and uniformly represented as digital spatial data. Linguistic databases are often focused on linguistic (grammatical, lexical) data instead of exact non-linguistic data (the location of the speakers and speaker communities). Many online linguistic databases, such as Ethnologue [14], The World Atlas of Language Structures (WALS) [15] and Glottolog [16], contain general spatial information on languages’ locations, branches (subgroups within a language family) and families as geographic points (geographic coordinates), but many of these services are not targeted to provide language areas (polygons) or to study their possible overlap. Historical spatial language data is often diverse and scattered across analog publications that may even be difficult to find and obtain. We applied geographic information systems (GIS), which enable combining, analyzing and visualizing spatial data, in research on language geography, using Uralic-language areas as a case study. We introduced best practices for collecting and converting such data from the original sources into a harmonized and comparable digital form to create a spatial database of language distributions. This serves a wider purpose in the linguistic domain to promote data interoperability and sharing, with e.g. new standards for cross-linguistic data formats [17]. A geographical approach in linguistic studies has been promoted in several projects e.g. [1, 18–24], which utilize GIS to enable spatial visualization and easy updates. The Uralic languages, spoken in Northwestern Eurasia, provide a compact case for developing a consistent methodology for the collection and harmonization of diverse spatial language data. 
The Uralic language family is one of the most studied language families, but it presents a less complicated case than, for example, the globally spread Indo-European language family. Depending on the linguistic (structural and sociological) criteria chosen, there are about 30–50 individual Uralic languages with a total of 20 million speakers [25]. Most of the languages are minority languages with only tens to tens of thousands of speakers on both sides of the Ural Mountains in the Russian Federation, while Hungarian, Finnish and Estonian are majority languages in their respective regions, having more than one million speakers each. In terms of size it is among the largest language families, but with around 40 languages, the amount of spatial information is still manageable, providing an excellent test case for compiling a database of language distributions for the whole family with uniform criteria. The spatial data of Uralic languages widens the recently published digital linguistic material on the Uralic basic vocabulary with cognate coding [26-28] and linguistic typology [29]. Our aim was to develop best practices for converting historical and current language-distribution information into digital spatial data, which is comparable to other spatial data and accessible to a wide audience. To achieve this, we compiled and published the first comprehensive spatial database of the Uralic languages. The ultimate goal was to promote the usage of spatial data in linguistic studies, as well as to improve opportunities for multidisciplinary spatio-temporal research. The best practices cover different work phases from data compilation, digitization and harmonization to visualization and verification of language-distribution information through a structured expert evaluation process including also data sharing of the database as open data. 
In addition, to illustrate the state-of-the-art on the historical geography of the Uralic language family, we created and published a comprehensive collection of historical and current maps based on the datasets. The data and maps are freely available in the Zenodo data repository and Uralic Historical Atlas (URHIA) under a Creative Commons license.

Methods

Methodological considerations

The amount and quality of spatial information on languages vary between language families. Instead of digital data, the distribution of the languages is often available only on analog maps and in text sources. To be able to use spatial language data in new map visualizations or in research with other spatial linguistic or historical datasets, the information needs to be transformed into digital format and stored in the same database. Determination of a language distribution is complex. Without a unified method for defining languages on the map, the process includes many subjective cartographical and linguistic elements, such as how to take into account variations in population density, ethnic groups’ mobility within their living environment and the occurrence of bi- or multilingualism, as well as the very definition of a language itself on the one hand, and of the speakers of the language on the other. The lack of a systematic description of the mapping methods used also complicates the comparability of different historical source materials. However, the development of an actual standard for language areas is beyond the scope of this work, and the distributions of the languages are presented exactly as they were defined in the original publications, i.e. the spatial extent of the languages remains unchanged in our process. In addition, structured expert evaluations are used to reduce the existing uncertainties in the original publications as well as to increase the harmony between the past and present information on language distributions. In the following sections we introduce the developed guideline on how heterogeneous spatial language data can be converted to consistent geospatial data by taking into account the standards of linguistics and geographic information science. The workflow consists of ‘Data collection and harmonization’, ‘Creating state-of-the-art maps based on the digitized data and new expert opinions’, and ‘Aspects of data sharing and licensing’. 
The Uralic language family works as a test case in this study, but the methodological guideline to create consistent geospatial data and database also applies to languages spoken in other geographical regions.

Data collection and harmonization

The existing information about past and present distributions of languages is seldom available as digital spatial data. This was also the case with the Uralic languages, for which most of the spatial information was available only as printed maps and text descriptions published since the end of the 19th century, starting from Donner [30, 31]. In addition, information on language distribution was scattered across numerous publications, and the mapping methods used in these studies were highly variable. For example, some studies presented the geographical distribution of the whole language family e.g. [30-33], while others concentrated on individual branches e.g. [34] for (Ob-)Ugric, [35] for Permic and Ugric, [36] for Saami and [37] for Finnic. The pioneering map by Donner [30] did not include the Samoyedic branch, which at the time was not unanimously considered a part of the Uralic family. Donner used the terms Finno-Ugric and Uralic synonymously, whereas the subsequent tradition has often regarded Uralic as consisting of Samoyedic and the remaining Finno-Ugric languages. Spatial language information was most often published in individual language maps with dialect divisions, which were the most detailed mapped information. The spatial accuracy between different languages varied because of the different amounts of available information at the time of the original studies. However, the spatial information of languages was most commonly represented as areas on the maps. To be able to make uniform and comparable representations of different language distributions, we represented them as areas in GIS. In practice, we digitized the data as vector polygons (closed areas including the boundaries making up the areas) instead of points, lines or raster surfaces, which are other options for visualizing where languages are spoken on a digital map. The use of polygons allows the presentation of the exact shape and location of the objects depicting the language distributions. 
However, in cases where spatial information of languages needs to be presented at a more general level, or the occurrence of a language is point-like, such as one village, the usage of the point geometry type can be equally reasonable. The basis of the digitization process was to define each language area as precisely as possible while avoiding overly detailed information in the map visualizations. There were several sources for each language where geographical distribution was provided as analog maps. In many cases, the opinions on the exact location and spatial extent of a language varied between the sources. Thus, we compiled the different distributions for each language, covering 1–8 sources per language. We collected the information concerning the time period before the extensive changes in Uralic language areas during the 20th century. Therefore, the mapped distribution approximately depicted the situation at the beginning of the 20th century, which is seen as the maximum distribution of the Uralic languages in general. This period is labeled as traditional. For the Sayan Samoyedic languages (Kamas, Mator), which became extinct in the 20th century at the latest [38], the traditional distribution refers to the beginning of the 19th century. We also collected the language distributions corresponding to the current situation, covering approximately the first two decades of the 21st century. The current geographical distribution of the languages was collected using the same principles of spatial generality and accuracy as with the past distributions. This decision ensures the comparability of the data from different time periods, and enables their use in map visualizations and spatial analysis. The original spatial information was transformed into geospatial data using consistent methods. First, the original maps were scanned and saved in a digital image format suitable for GIS software (see a more detailed description of the digitization process in e.g. [39, 40]). 
Second, the scanned and electronic language maps were georeferenced, i.e. tied to a geographic coordinate system using reference basemaps (such as OpenStreetMap, Google Maps) and properly selected ground control points. As a coordinate system we used the World Geodetic System 1984 (WGS84), since it is a widely used standard coordinate system for global and regional level data (on average larger than a nationwide geographical area), also in linguistic databases such as Glottolog and WALS. Third, the language distributions were digitized into vector shapefiles (shp), i.e. the geographic information was created as georeferenced polygon objects from the maps (the language area was determined exactly as in the original publication). At this point, the text descriptions of the language distributions were also digitized into polygon objects. In some rare cases, especially at the dialectal level, text descriptions were the only information available on distribution, and it should be noted that the transformation from written descriptions into polygons is more vague and subjective than digitizing printed maps. After processing the spatial component of the source data, we added the ID, name of the language and dialect, and names of the branches they belong to, together with an indication of the time period that the language distribution corresponds to (Table 1). We also included references to the original source(s) in order to distinguish between different source materials. In addition, we included the respective language’s Glottocode (language ID produced by Glottolog) and ISO 639-3 code (another ID for languages produced by the International Organization for Standardization) within the attribute table. Glottocodes and ISO codes were developed for identifying languages, and they can be used, for example, for identification in cases where languages have several alternative names.
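The georeferencing step can be illustrated with a minimal sketch. It assumes exactly three ground control points, each pairing a pixel position on the scanned map with a WGS84 longitude and latitude, and solves the affine transform that maps pixels to coordinates. The function names and control-point values are hypothetical, and this is not the procedure the authors ran: GIS software typically uses more control points with a least-squares fit.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(a, b):
    """Solve the 3x3 linear system a * x = b with Cramer's rule."""
    d = _det3(a)
    if abs(d) < 1e-12:
        raise ValueError("ground control points must not be collinear")
    solution = []
    for i in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][i] = b[r]
        solution.append(_det3(m) / d)
    return solution


def fit_affine(gcps):
    """Return a pixel -> (lon, lat) function from three control points.

    gcps: three ((col, row), (lon, lat)) pairs (hypothetical input layout).
    """
    if len(gcps) != 3:
        raise ValueError("this sketch expects exactly three control points")
    a = [[float(px[0]), float(px[1]), 1.0] for px, _ in gcps]
    lon_coef = _solve3(a, [geo[0] for _, geo in gcps])
    lat_coef = _solve3(a, [geo[1] for _, geo in gcps])

    def to_wgs84(col, row):
        lon = lon_coef[0] * col + lon_coef[1] * row + lon_coef[2]
        lat = lat_coef[0] * col + lat_coef[1] * row + lat_coef[2]
        return lon, lat

    return to_wgs84


# Illustrative control points only; they do not come from any source map.
to_wgs84 = fit_affine([
    ((0, 0), (60.0, 66.0)),
    ((1000, 0), (70.0, 66.0)),
    ((0, 500), (60.0, 61.0)),
])
digitized_vertex = to_wgs84(500, 250)
```

An affine fit only handles shift, scale, rotation and shear. Old printed maps in other projections may need a higher-order transformation, which is outside the scope of this sketch.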

Table 1

Recommendations for the suitable contents of the geospatial datasets presenting the distribution of languages including the benefits of each, and our solutions (selected in the case study) concerning the Uralic languages.

Character	Advisable types/features	Benefit/comment	Selected in the case study
Data type	Vector data	Enables the exact location and shape of the object	Vector data
Geometry type	Polygon, point	Polygon: Works for areal data, presenting the object’s boundaries	Polygon whenever possible, point in few exceptions
Geometry type	Polygon, point	Point: Works for presenting the point-like distribution or extensive distribution in general	Polygon whenever possible, point in few exceptions
File format	Interoperable, up-to-date format, e.g. SHP, GEOJSON, WKT	SHP: Widely used, easy to convert	SHP
File format	Interoperable, up-to-date format, e.g. SHP, GEOJSON, WKT	GEOJSON, WKT: open source-based, new technology	SHP
Coordinate system	WGS84	Standard in digital map services, works for global and continental-wide data, compatible with other spatio-linguistic and interdisciplinary data	WGS84
Attribute data	ID/FID, language, dialect, branch, time period, sources, Glottocode, ISO code, other information	Increases information on the identity, usability and sharing	All suggested
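Temporal divisioning	Data-specific, e.g. exact date, division by centuries or more general approach when appropriate	Exact date: When dating is well-known (present-day data)	More general: Division to traditional–current
Temporal divisioning	Data-specific, e.g. exact date, division by centuries or more general approach when appropriate	Division by centuries: Well-known historical data	More general: Division to traditional–current
Temporal divisioning	Data-specific, e.g. exact date, division by centuries or more general approach when appropriate	More general divisioning: Imprecise historical data	More general: Division to traditional–current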
Metadata description and file naming	Comprehensive description of data content including at least: file format, data type, coordinate system, data sources, temporal extent and ownership; executed with a logical file naming	Enhances the data’s systematicity, transparency and usability	All suggested, plus points of contact and maintenance frequency
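The attribute recommendations in Table 1 can be sketched as a single validated record. The example below builds a GeoJSON-style polygon feature (GeoJSON is one of the formats listed in Table 1; the case study itself used SHP, which the Python standard library cannot write). The field names, the rectangle and the source label are hypothetical placeholders, and the Glottocode and ISO 639-3 values shown for North Saami should be checked against Glottolog and the ISO registry before reuse.

```python
import json
import re

# Assumed code shapes: a Glottocode is four alphanumeric characters plus
# four digits; an ISO 639-3 code is three lowercase letters.
GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")
ISO_639_3 = re.compile(r"^[a-z]{3}$")
PERIODS = ("traditional", "current")


def make_feature(ring, attrs):
    """Build a GeoJSON-style polygon feature after basic sanity checks."""
    if len(ring) < 4 or tuple(ring[0]) != tuple(ring[-1]):
        raise ValueError("polygon ring must be closed (first vertex == last)")
    for lon, lat in ring:
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            raise ValueError("expected WGS84 lon/lat in degrees")
    if not GLOTTOCODE.match(attrs["glottocode"]):
        raise ValueError("malformed Glottocode")
    if not ISO_639_3.match(attrs["iso_639_3"]):
        raise ValueError("malformed ISO 639-3 code")
    if attrs["period"] not in PERIODS:
        raise ValueError("period must be 'traditional' or 'current'")
    return {
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [[list(vertex) for vertex in ring]],
        },
        "properties": dict(attrs),
    }


# Placeholder rectangle, NOT a real language area.
example_ring = [(24.0, 69.0), (26.0, 69.0), (26.0, 70.0), (24.0, 70.0), (24.0, 69.0)]
example_attrs = {
    "fid": 1,
    "language": "North Saami",
    "dialect": "",
    "branch": "Saami",
    "period": "traditional",
    "sources": "placeholder source label",
    "glottocode": "nort2671",   # illustrative; verify against Glottolog
    "iso_639_3": "sme",         # illustrative; verify against ISO 639-3
}
feature = make_feature(example_ring, example_attrs)
feature_json = json.dumps(feature, ensure_ascii=False)
```

Validating at creation time keeps malformed identifiers and unclosed rings out of the database, which matters when several distributions per language are later overlaid.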

Geospatial data consist of spatial (location: coordinates, place name, etc.) and attribute data (features: name or ID of language, name or ID of dialect, etc.). To achieve the best possible structure and operability for each dataset, a data-specific approach is recommended. We created the geospatial data to be compatible with the existing linguistic data, as well as with data from other disciplines. We therefore aimed to utilize the data formats and practices previously used in research into human history. To make the data findable, trackable and transparent, and to improve the data’s usability, we paid special attention to describing the contents of the data. The content description, i.e. the metadata, provides information about e.g. the file format, data type, data sources, coordinate system, temporal extent, points of contact, ownership, metadata author and maintenance frequency. The metadata management plan also focused on the systematic naming of the dataset files in the catalog (naming conventions for the filesystem directories that hold the data), which is especially important when there are several distributions for one language. Consistent naming facilitates computer-aided search and provides information about a dataset file’s contents without opening the dataset file itself. In addition, the datasets within the database are structured based on the general linguistic classifications of the Uralic languages. In general, when creating geospatial data to serve a wide range of users, it is not justified to limit data feature options strictly. The selection of different solutions during the data creation should be data-dependent, but the diversity of the end users (GIS vs. computational users) and their expected different working methods can also be taken into consideration. Therefore, we decided to allow flexibility when recommending the different practices for geospatial data creation and harmonization (Table 1). 
For example, in the case of file format selection the recommendation is to emphasize interoperability and convertibility, for which there can be several suitable formats. Concerning the spatial representation of language distribution, polygons should be the primary option, even though limitations in the amount and quality of spatio-linguistic data sometimes advocate using points alongside the polygons. In the historical context, exact dates are often not realistically achievable, especially when going further back in history. Therefore, temporal divisioning should be done as precisely as possible, but in many cases a more general division can be preferred to achieve consistent spatio-temporal datasets. In conclusion, systematic spatial linguistic data processing with comprehensive descriptions of the data contents is crucial when targeting harmonized geospatial data.
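A systematic file-naming convention of the kind described above can be sketched as follows. The scheme shown (branch, language, period and source joined by underscores) is an assumption made for illustration and is not the naming convention actually used in the published database.

```python
import re
import unicodedata

PERIODS = ("traditional", "current")


def _slug(text):
    """Lowercase ASCII slug: strip diacritics, collapse other characters to '-'."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def dataset_filename(branch, language, period, source_key, ext="shp"):
    """Compose a predictable dataset file name (hypothetical scheme)."""
    if period not in PERIODS:
        raise ValueError("period must be 'traditional' or 'current'")
    parts = [_slug(branch), _slug(language), period, _slug(source_key)]
    if not all(parts):
        raise ValueError("every name component must be non-empty")
    return "_".join(parts) + "." + ext


example_name = dataset_filename("Saami", "North Saami", "traditional", "Grünthal & Salminen")
```

Because every component sits at a fixed position, files for one language sort together and the period and source can be read from the name without opening the dataset.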

Creating state-of-the-art maps based on the digitized data and new expert opinions

After harmonizing digital spatial data as coherent geospatial datasets they can further be used in map visualizations and spatial analysis. For example, further comparisons of geographical extents from different sources are easy to execute by overlaying separate layers in GIS. The possibility to visualize several layers simultaneously on the map enables a visual inspection of how the language boundaries have been drawn in different sources. It also allows the creation of updated language distributions and thereby improved language maps based on all collected data and basemap features (e.g. information about land and water areas, topography, other natural environment, settlements and administrative boundaries, as well as place names) relevant to understanding the geographical context of a particular language. An updated map visualization can be based on one source depicting the geographical distribution of a language, or the use of several sources. The reliance on only one language extent is straightforward in cases where the distribution of a language is unambiguous. However, in many cases, the geographical distribution of a language is not unambiguous, as different sources present spatially variable views of the distribution (Fig 1). Thus, a new, optimized distribution map of the particular language can be created by examining the different overlapping layers simultaneously, and creating criteria where different characteristics are weighted (see e.g. [24]). For example, a new distribution for a language can be delimited using the common extent occurring in all source materials, and leaving out the areas that occur only in some of the sources. The features can also be prioritized related to the original mapping method, spatial accuracy or reliability. The novelty of the original sources can also be one of the factors regarding the determination of the new distribution of a language.
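The overlay logic described above, delimiting the extent common to all sources versus the merged extent of any source, can be approximated without a GIS library by sampling a regular grid. This is a minimal sketch under stated assumptions: the two squares stand in for source polygons and are not real language areas, the function names are hypothetical, and desktop GIS would normally compute exact polygon intersections and unions instead.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices (closing vertex optional)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def coverage_counts(polygons, bbox, step):
    """For each grid-cell centre in bbox, count how many polygons contain it."""
    min_x, min_y, max_x, max_y = bbox
    nx = int(round((max_x - min_x) / step))
    ny = int(round((max_y - min_y) / step))
    counts = {}
    for i in range(nx):
        for j in range(ny):
            cx = min_x + (i + 0.5) * step
            cy = min_y + (j + 0.5) * step
            counts[(i, j)] = sum(point_in_polygon(cx, cy, p) for p in polygons)
    return counts


# Two placeholder "sources" that partly overlap.
source_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
source_b = [(2, 2), (6, 2), (6, 6), (2, 6)]
counts = coverage_counts([source_a, source_b], bbox=(0, 0, 6, 6), step=1.0)

common_extent = {cell for cell, c in counts.items() if c == 2}   # in all sources
merged_extent = {cell for cell, c in counts.items() if c >= 1}   # in any source
```

The per-cell counts also support intermediate criteria, such as keeping cells covered by a majority of sources. Note that cells of equal size in degrees do not have equal ground area in WGS84, so the counts should not be read as areas.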

Fig 1

Geographical overlap of different source materials concerning the distribution of the Khanty language(s) at the beginning of the 20th century.

Original sources Zsirai [34], Haarmann [41], Lytkin et al. [35], Grünthal & Salminen [33] and Abondolo [42] have been visualized using boundaries of each polygon. A solid green area has been created merging the distributions of all Khanty sources, and it is indicating the area where Khanty could have been spoken. Basemap datasets from Natural Earth [43], Digital Chart of the World [44] and ESRI [45].

In our case, it was obvious that different opinions about the language distributions vary notably between the different sources by language. On the other hand, information about the present-day distributions was insufficient. To be able to create spatially consistent state-of-the-art maps for the past and present distributions, we developed a structured expert evaluation process instead of examining the geographical distribution presented in original sources by ourselves. This methodology is particularly applicable for well-known language families, which are being actively investigated. In practice, we utilized a comprehensive database of compiled language distributions. We also collaborated with professional Uralic linguists in the process in order to gain new spatial knowledge on the individual languages, which was unavailable in existing published material. The utilization of expert reviews was useful also because they included an assessment of the previously produced material and evaluated its accuracy in relation to new information. First, we visualized all distributions of each language on draft maps. Second, we designed a query, including the output of visualizations and a set of customized questions to gather structured expert knowledge about each of the Uralic languages (see a more detailed explanation in Rantanen et al. [46], S1 Appendix). The experts consisted of the authors of The Oxford Guide to the Uralic Languages (2022) [47], the most comprehensive handbook of the Uralic family ever produced. 
Each expert or group of experts (in cases where responsibility for a particular language chapter was shared between more than one author) provided a consensus opinion on the draft map regarding the language of their expertise. They were queried about which of the original sources corresponded most precisely to their understanding of the language distributions at the beginning of the 20th century, and if none of the sources agreed with the current understanding, where and how the boundaries should be edited (S1 Appendix). We also inquired about the 21st-century distributions of the languages, which is information that was almost totally missing on the preexisting maps. Simultaneously we inquired about the relevant place names (settlements, administrative units, water bodies, natural environment) in the correct spelling to put on the map. Because the queries were assigned only to the responsible author(s) of a particular language chapter, we avoided the possible inconsistencies the language experts may have on the distribution of the languages. In a way, the pool of experts was a preexisting natural set of specialists who had been selected by the handbook editors before the cooperation project. These approximately 30 experts, in turn, consulted dozens of other specialists and speakers of the languages of their expertise. The expert survey yielded a significant amount of new information concerning the past and present distributions of the Uralic languages, and created an excellent basis for the production of the new state-of-the-art Uralic language maps. All state-of-the-art Uralic language maps were complemented by the expert evaluations, but the amount of new information varies among the languages and time periods. In some cases, the presented past distributions strictly followed earlier studies, but in others there were notable changes. 
The information on the current distributions was received almost entirely from the experts, and as an exception to the overall usage of the polygon type, it was reasonable to use points alongside polygons in some map visualizations. In sum, new distributions for all languages were determined in accordance with the opinion of the experts. The sources that were used to create a new distribution for the languages are comprehensively presented in figure captions. For this publication, we created three types of visualizations: 1) individual language maps, 2) maps for the main branches of the Uralic languages, and 3) an overall map of the whole language family. The maps present the most recent and precise information on the geographical distribution of each Uralic language. All maps in each category were based on the same datasets, but the most detailed information, including dialect areas, was usually presented in individual language maps. To achieve visual consistency and clarity among the collection of maps, we decided not to present overlapping areas of different languages. At the same time, we did not indicate the areas of bilingualism or multilingualism on the map, even though bi- and multilingualism commonly occur in the overlapping areas. Suitable accuracy and spatial scale were selected separately according to the purpose of each map. We also provided the created map drafts to The Oxford Guide to the Uralic Languages [47] in return. The visually modified versions of the maps presented here are published there alongside each text chapter, where they serve to introduce the Uralic languages.

Aspects of data sharing and licensing

Spatial data platforms play an important role in making it easier for users to publish and access scientific geospatial information. To maximize the accessibility of the Uralic language spatial data, we first stored all the compiled and harmonized datasets (shp) and map visualizations as images (png) in the same spatial database called the Geographical database of the Uralic Languages [48]. Then all the data were stored in the Zenodo data repository, and published under the Creative Commons CC BY 4.0 license, allowing flexible possibilities to manage the data (e.g. data uploads without waiting time, as well as usage statistics and DOI (Digital Object Identifier) citation). The permanent DOI link enables effortless citation of the data and eliminates the problem of ever-changing web addresses. Whenever it is necessary to edit or update the uploaded dataset files, Zenodo registers every new version number (e.g. v.1.0, v.1.1), so that it is also possible to track the evolution of the database. Not all human history researchers or lay audiences can be expected to master geospatial techniques [49, 50]. Therefore, the full benefits of the published database can be difficult to achieve. For example, to be able to create one’s own map visualizations based on the datasets, a basic understanding of the usage of desktop GIS is required. To serve especially the audience who are not fluent GIS users, we published the new Uralic language maps in Zenodo as images in PNG format. The database with the datasets and maps is also available in the Uralic Historical Atlas (URHIA) [51], which is an interactive spatial data platform with a map view [52], enabling visual inspection in a web browser without the need to download the datasets. The URHIA map interface also enables the creation of customized map visualizations and offers the possibility of downloading them in multiple file formats such as SHP, CSV or GEOJSON.

Results

Practices for spatial language data harmonization, visualization and sharing

To improve the opportunities to carry out spatial historical research from linguistic and interdisciplinary perspectives, we introduce a methodological guideline for unifying and presenting the geographical information of language distributions (Fig 2). We operated in the context of the Uralic language family, but the workflow is applicable to other language families or geographical areas as well. As a result, we suggest a three-step process, using the Uralic language data to exemplify the workflow: I) all the spatial source material is digitized into geospatial data using a systematized procedure for data collection, where spatial and attribute data is processed into a comparable and consistent form, which is stored in a database with uniform settings; II) the language distribution data is verified by experts in the particular languages, resulting in new and updated information on past and current language distributions, and state-of-the-art maps are created based on the expert review; and III) open data sharing ensures the usability of datasets in research. It should be noted that the three-step process can be used to digitize, harmonize and upgrade all kinds of historical spatial data from diverse analog sources. The developed guideline can also be applied without step II (the expert evaluation) in cases where a particular language has no experts to evaluate the distribution based on the different presented opinions. In these cases, some other well-reasoned method to generate state-of-the-art distributions should be used (different options are presented in ‘Creating state-of-the-art maps based on the digitized data and new expert opinions’).

Fig 2

The workflow for best practices in handling language family data includes three separate phases: I) processing and harmonization of the spatial data collection, a path from analog and digital source data to a consistent geospatial database; II) visualization combined with queries to experts in the case of lesser-studied languages, and creation of improved new maps based on updated information; and III) data sharing.

The outcomes of the best practices increase research opportunities and general understanding of language distributions. Original data and output are shown as rectangles, processing as ovals and overall benefits as hexagons. Details of the workflow are described in Section ‘Methods’.

Geographical database of the Uralic languages–geospatial datasets

The Geographical database of the Uralic languages [48], published in Zenodo (S2 Appendix), covers the geographical distribution of all Uralic languages (S1 and S2 Tables) in roughly two time periods: 1) the beginning of the 20th century, indicating approximately the widest known distribution of the Uralic languages and labeled traditional in what follows, and 2) a current distribution covering approximately the beginning of the 21st century up to the present day. There are 1–8 traditional and 0–2 current distributions available for each language, compiled initially from published sources and secondarily updated and improved by experts in these languages (S1 Table). The database follows a hierarchical structure presenting the individual branches of the family (e.g. Saami, Finnic, Samoyedic), the individual languages within those branches (e.g. Saami: North Saami, Skolt Saami, Kildin Saami), and some dialectal divisions within individual languages (e.g. North Saami: Torne, Western Inland, Eastern Inland, Sea). Note that the hierarchical structure takes no position on how to taxonomically position the (uncontroversial) branches within the family or the individual languages within the branches they belong to. The total number of datasets is 226 (Table 2). Each dataset records the spatial location of a language either as polygons, the principal geometry type (222 cases), or as data points, used in a few well-reasoned exceptions (4 cases). All datasets are available as shapefiles in the WGS84 coordinate system. The attribute information consists of the FID (feature ID), the language/dialect name, the branch it belongs to, the time period, the original sources, the Glottocode (language ID) and the ISO 639-3 code (another language ID), all according to international linguistic standards. The language IDs allow the datasets to be merged with existing language data that use the same codes. Because the datasets are constructed uniformly, they can also be readily combined with other kinds of spatial data, such as D-Place [53], which provides a vast amount of cultural and environmental information. In addition, comprehensive metadata descriptions were created to document the data collection methods and the data characteristics.
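The merging by language ID described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the database: the field names `glottocode` and `iso639_3`, the helper `join_on_language_id` and the external table are assumptions made for the example, and the codes shown should be checked against Glottolog and ISO 639-3 before any real use.

```python
# Hypothetical attribute rows using the kinds of fields listed in the text.
# The codes are illustrative and should be verified against Glottolog / ISO 639-3.
distributions = [
    {"FID": 0, "language": "North Saami", "branch": "Saami",
     "period": "traditional", "glottocode": "nort2671", "iso639_3": "sme"},
    {"FID": 1, "language": "Skolt Saami", "branch": "Saami",
     "period": "traditional", "glottocode": "skol1241", "iso639_3": "sms"},
]

# Hypothetical external table keyed by Glottocode (placeholder values only).
external = {"nort2671": {"external_id": "demo-001"}}


def join_on_language_id(rows, lookup, key="glottocode"):
    """Left-join external attributes onto distribution rows via a shared language ID."""
    merged = []
    for row in rows:
        extra = lookup.get(row[key], {})
        # Copy the row so the input data is left untouched.
        merged.append({**row, **extra, "matched": row[key] in lookup})
    return merged


merged = join_on_language_id(distributions, external)
for row in merged:
    print(row["language"], row["matched"], row.get("external_id"))
```

A left join keeps every distribution row even when the external source has no entry for that code, which makes unmatched languages easy to spot.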

Table 2

The number of dataset files divided into the original published studies (original) and expert-modified distributions (expert) with two overall time periods.

Time period	Original	Expert	Sum
Traditional	148	55	203
Current	3	20	23
Sum	151	75	226
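As a quick consistency check, the marginal sums of Table 2 can be recomputed from the four cell counts and compared with the totals given in the text (226 datasets, of which 222 are polygons and 4 are points). The variable names below are chosen for this sketch only.

```python
# Cell counts copied from Table 2.
table2 = {
    "Traditional": {"original": 148, "expert": 55},
    "Current": {"original": 3, "expert": 20},
}

# Row sums (per time period) and column sums (per data origin).
row_sums = {period: sum(counts.values()) for period, counts in table2.items()}
col_sums = {
    origin: sum(counts[origin] for counts in table2.values())
    for origin in ("original", "expert")
}
total = sum(row_sums.values())

# Geometry counts given in the text: 222 polygon datasets and 4 point datasets.
geometry_counts = {"polygon": 222, "point": 4}

print(row_sums, col_sums, total)
print(sum(geometry_counts.values()) == total)
```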

Geographical database of the Uralic languages–state-of-the-art language maps

In addition to the geospatial data, the database (S2 Appendix) includes 45 maps with colors depicting the past and present distributions of the Uralic languages (S2 Table). The maps are divided into the following categories, each illustrated in this paper with an example map that introduces the hierarchical structure of the database and the temporal dimension: 1) an overall map of the whole language family (Fig 3), 2) maps for the nine uncontroversial main branches of Uralic (Fig 4), and 3) individual language maps (Fig 5). All map levels are based on the same original source data, but the most detailed information is found in the individual language maps (Fig 5), which predominantly also include dialect distributions and thus form an optional fourth level in the hierarchy. In some cases, past and present distributions are shown as separate layers on the same map (see examples in S2 Appendix), whereas in others there are separate panels for the time periods (Fig 5A and 5B). The layout of each map has been customized independently, emphasizing the environmental (lakes, rivers, topography), cultural (settlements, nomenclature) and political (administrative borders) features that facilitate an understanding of the spatial context of a particular language. To achieve visually clear and easily comprehensible illustrations, overlapping languages are not shown on the maps.

Fig 3

Geographical distribution of the Uralic languages at the beginning of the 20th century.

The uncontroversial branches of the family are presented without overlapping areas. A list of original sources is available in S2 Appendix. Basemap datasets from Natural Earth [43], Digital Chart of the World [44] and ESRI [45].

Fig 4

Samoyedic languages at the beginning of the 20th century.

Languages are presented without overlapping areas. Original sources: Soviet Census of 1926 [54], Popov [55], Dolgikh [38], Dolgikh & Fajnberg [56], Dolgikh [57], Verbov [58], Grünthal & Salminen [33], Helimski [59], Tuchkova et al. [60], Siegl [61], Brykina & Gusev [62]. Basemap datasets from Natural Earth [43] and Digital Chart of the World [44].

Fig 5

Traditional (a) and current (b) distribution of Selkup.

A comparison of the maps demonstrates the changes in language and dialectal distribution over time. Original sources for traditional distribution are Grünthal & Salminen [33], Tuchkova et al. [60] and for current distribution Tuchkova et al. [60], Kazakevich [63]. Basemap datasets from Natural Earth [43] and Digital Chart of the World [44].

Discussion

Best practices for the processing of spatial language data were developed in the course of digitization, harmonization and sharing of cartography on the distribution of the Uralic languages. However, the suggested process is applicable to other current and historical spatial data, including other areas and language families, as well as data from other disciplines, such as archaeology and genetics. The benefits of consistent practices are apparent: the language distribution data created is coherent and comparable to other geospatial data, and the data is uniformly described. The data is stored in one database, allowing customized map visualizations, visual comparisons, and further spatial analyses of the linguistic data, such as phylogeographical modeling of language spread. The availability of the data is secured through open-access publication of the Geographical database of the Uralic languages (S2 Appendix). The database includes finalized state-of-the-art maps for each language, and therefore it is not necessary to master GIS methods to use the geospatial information. We also offer easy access to map processing via spatial data platform URHIA [51]. Bringing the data on language distribution into the digital realm not only enables a review of the massive amount of work done so far in historical linguistics, but also opens new horizons for bridging the knowledge to linguistic research and teaching in general, as well as to interdisciplinary holistic studies of human history. The geographical approach allows location-based studies of language areas (which are increasingly desired) in parallel with, for example, archaeological, genetic and environmental data [64-66]. It must be noted, however, that the accuracy of language-distribution information is higher for modern times than for historical or prehistorical eras. 
It must also be noted that languages’ distribution may have changed significantly; for example, the Saami languages have been present in most of Lapland for less than 1500 years [67]. The temporal dynamics of the language distributions are reflected in the data as time layers, as far as the original sources allow. Even though the time frame of the documented Uralic-language distribution data does not extend far back in history, the temporal dimensions provide insight into the spatio-temporal dynamics of these languages. In the context of the Uralic languages, the original yet sometimes unsubstantiated representations of language distributions have often been accepted as such, and the presented distribution boundaries have been perpetuated in maps to the present day. This basic setting affects the possibilities to create sophisticated visualizations of a historical language area in digital cartography. However, the conversion of historical data into digital spatial data, operating with polygons and points, remarkably improves the possibilities to use language-distribution data innovatively, for example by simultaneously visualizing multiple map layers. Using separate layers for comparing different original sources expands the possibilities to create new information on past distributions, which were previously not presented on maps. In many cases, also in the history of the Uralic languages, different interpretations of languages’ distributions at the same time periods have been presented by different authors. In the process of creating the Uralic languages’ distributions as geospatial data, we turned to expert opinions in order to calibrate and harmonize the source data. This was to assure, for example, that the relationship between the source data and other knowledge concerning the population and cultural history of the region is accounted for. 
The decisions made by the original investigators, the descriptions of their methods, the geographical scale, and temporal coverage all have an impact on the data itself, but a careful expert review helps to unify these factors to a degree. The main challenge concerning the mapping of language distribution, in general, is related to the definition of a language area. There is no established standard to determine the distribution of a language on the map [11, 46, 68–70], i.e. where the boundaries of language distributions should be drawn. Mapping methods have varied among the inventories, for example, according to the amount of existing data and the ultimate purpose of the map. Also, personal preferences may affect the visualization output, even though maps should be neutral and realistic [70]. Using a structured expert-evaluation process during the digitization of the source material is a feasible way to mitigate and adapt to the issue of how a language area is defined. Also other issues, such as the uncertain definition of a speaker, difficulties in distinguishing between dialects and languages, variation in ethnic groups’ mobility within their living environment (sedentary vs. nomadic lifestyle), regionally unevenly distributed populations and migration to new territories have complicated the interpretation of the geographical extent of the languages and emphasized the subjectivity of depicting language-distribution boundaries through history. In addition, systematic descriptions of the chosen methods are often lacking in the historical sources, making it challenging to assess the source data’s quality and repeat the original methodology. Luebbering [71] presents an illustrative list of caveats that customarily accompany language maps. For further discussion about the history, challenges and suggestions for future work concerning the mapping of languages’ distribution, see e.g. [46, 70, 72–74]. 
Our solutions to these challenges in the case of the Uralic languages are documented in ‘Methods’. A common challenge faced when illustrating language distributions is that several languages are often spoken in one region, or even within one population. When languages and dialects are represented as individual objects in spatial data, this is not a problem, since overlap can easily be analyzed and visualized in GIS. There is therefore no need to adhere to the classical cartographic representation of regional monolingualism, and we created each language distribution polygon of the Uralic languages individually. Thus, any area can include as many languages in the data as needed, and the polygons can and do overlap where multiple languages have been observed. However, when multiple such data layers are visualized on the same map, the overlaps need to be handled adequately by using a clear classification for overlap areas. Including time as one component of spatial language data allows the dynamics in the development of language and dialect areas to be analyzed. At its simplest, overlaying distribution maps of different time periods shows how the distribution of one language has evolved during the known historical period (see Fig 5 for an example; [21]). Combining this with, for example, environmental data opens further possibilities to analyze the spatial interaction between the speaker populations’ migrations and changes in the environment. To our knowledge, this is the first time that an entire language family has been mapped and visualized as one harmonized database. The database, including the distribution maps for the Uralic languages, is available in Zenodo [48], and the data has also been published in the Uralic Historical Atlas (URHIA) [51], which enables online visualization of spatial data in a map interface, together with other data from the region.
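The idea that individually stored polygons may overlap, and that overlap areas then need an explicit classification, can be illustrated with a small standard-library sketch. It is not the authors' GIS workflow: the two rectangular "language areas", the function names and the three class labels are hypothetical, and the coordinates are treated as planar for simplicity. A real analysis would more likely use a GIS package and a suitable projection.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon given by ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def languages_at(x, y, layers):
    """Return the names of all language polygons that contain the point."""
    return [name for name, ring in layers.items() if point_in_polygon(x, y, ring)]


def classify(names):
    """Hypothetical overlap classes for map display."""
    if not names:
        return "none"
    return "monolingual" if len(names) == 1 else "multilingual"


# Two hypothetical, partly overlapping language areas (planar coordinates).
layers = {
    "Language A": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "Language B": [(2, 1), (6, 1), (6, 4), (2, 4)],
}

for point in [(1, 1), (3, 2), (5, 5)]:
    found = languages_at(*point, layers)
    print(point, found, classify(found))
```

Because each language keeps its own polygon, the overlap class is derived at query or display time and never has to be baked into the stored geometries.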

S1 Appendix

A query of the North Khanty language.

A similar query was sent to the responsible author(s) of each language chapter of The Oxford Guide to the Uralic Languages. These expert evaluations were carried out to collect up-to-date information on the past and present geographical distribution of the Uralic languages. (PDF)

S2 Appendix

Rantanen T, Vesakoski O, Ylikoski J, Tolvanen H. Geographical database of the Uralic languages. Version v1.0.; 2021. Database: Zenodo [Internet]. Available from: http://doi.org/10.5281/zenodo.4784188. (DOCX)

S1 Table

Number of distributions per language and time period in the geospatial datasets.

Language distributions are based on the published studies (original) and separate expert evaluations (expert) done in collaboration with the authors of The Oxford Guide to the Uralic Languages. Some original studies do not separate subgroups of language branches, which is why branch or general names, for example ‘Mordvin’ or ‘Khanty’, are used in the ‘Language’ column (labelled in italics) in some cases. Branches: Saami (I), Finnic (II), Mordvin (III), Mari (IV), Permic (V), Mansi (VI), Khanty (VII), Hungarian (VIII), Samoyedic (IX). *Skolt Saami: Distribution after resettlement in 1950s. **Livonian: Medieval and 1900s distributions. (DOCX)

S2 Table

List of maps containing information on each Uralic language.

The table shows how many dialects are presented per language, as well as information on temporal coverage. The total number of language maps is 45. The main branches of the Uralic languages are indicated by Roman numerals. *Meänkieli and Kven are treated here as separate languages, whereas in S1 Table both are included under Finnish. **Lule Saami is divided into three proper and three transitional dialects. ***The Mordvin branch consists of five Erzya dialects and four Moksha dialects. (DOCX)

17 Jan 2022

PONE-D-21-40182

Best practices for spatial language data harmonization, sharing and map creation – a case study of Uralic

PLOS ONE Dear Dr. Rantanen, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Although both reviewers judge the revisions needed to be minor, they address different aspects, so the combined amount of suggested revisions would be more than minor. Both reviewers point to some cases where a particular practice might be improved or where a suggestion is not perceived as being optimal. No doubt, there is room for different opinions about what is a best practice, so more discussion of alternatives could be added. And sometimes there is still room for improvement in your own practice; it may actually strengthen the paper to admit this possibility where relevant.

Please submit your revised manuscript by Mar 03 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Søren Wichmann, PhD Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. 
We note that Figures 1, 2, 4, 5, and 6 in your submission contain [map/satellite] images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission: a. You may seek permission from the original copyright holder of Figures 1, 2, 4, 5, and 6 to publish the content specifically under the CC BY 4.0 license. We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text: “I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.” Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission. 
In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].” b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only. The following resources for replacing copyrighted map figures may be helpful: USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/ The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/ Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/ Landsat: http://landsat.visibleearth.nasa.gov/ USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/# Natural Earth (public domain): http://www.naturalearthdata.com/ [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. 
Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: N/A Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. 
(Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The data and practice behind this paper are extensively and quite well-described but lacks the most important part, namely what they consider the "language area" (that is to be mapped). After noting this problem (pp 19-20) they offer no definition nor or declaration of how they have dealt with it for the data at hand, except to say they've consulted experts. Unfortunately the paper also lacks any detail who is to be considered an expert, with what instructions they have consulted the experts and how they weed out inconsistencies across experts. These are relevant questions. Presumably the experts are the same kind of people who write papers about the same languages, and as the authors duly note, they have "different opinions" and "vary notably". For example, for Nganasan the maps show a continuous area including the area to the coast east of Lake Taimyr and north of Nordvik island with sources "Dolgix 1963, Popov 1948, Brykina & Gusev 2015, Wagner-Nagy 2018:3". None of these sources sanction this chunk as Nganasan settled. In particular, the extremely detailed maps by Dolgix (temporally differentiated) and Popov (summer-winter differentiated) --- based on their own extensive fieldwork --- stop just east of lake Tajmyr, not including the NE area up to the coast. You could of course say that area might have been hunting grounds and certainly no other ethnic group lay claim to it, but then the entire northern Tajmyr could just as well be mapped as Nganasan as hunting grounds (Golovnev 1999 says this explicitly). Also, following Popov, the settlements were always along drainages, and the interior areas hunting grounds, but the maps show continuous areas but not including *all* hunting grounds. What is the actual intent? 
Another example is South Saami whose northern dialect on the map includes Södra Tärna with the sources "Hasselbrink 1981–1985, Rydving 2008: 360–361, Rydving 2016, Maja Lisa Kappfjell & Jussi Ylikoski (p.c.)" but Rydving is explicitly arguing for South Tärna to be counted as Ume Saami. What was the reasoning here and what information was asked of or provided by Ylikoski? The bottom line is, there are two choices: (i) Either the paper stays without a definition of the intended areas to map and/or a methodology for using the experts, but then the authors have not defined a "practice", let alone a "best practice", and the title should be changed accordingly, or (ii) such a "best practice" is defined and included in a revision. This is the only important point, everything else is essentially fine and the resource as such is excellent. Non-important issues: * Rephrase: "In our case, it was obvious that different opinions about the language distributions vary notably between the different sources by language." * "but the workflow is applicable to other language families or geographical areas as well" There are no experts for every language available for most families and areas so this workflow is not applicable in the same form for others. * It would be nice if more detail (or a reference) were given on how to best digitize printed maps * Imperatorskoe Russkoe Geograficheskoe, Obshchestvo -> remove comma * Imperatorskogo Russkogo Geograficheskogo Obshchestva -> why genitive? * Some Russian sources are in Roman, some in Russian majuscule and some in Russian minuscule. Harmonize. Golovnev, Andrei V. (1999) The Nia (Nganasan). In Richard B. Lee & Richard Daly (eds.), The Cambridge Encyclopedia of Hunters and Gatherers, 166-169. Cambridge: Cambridge University Press. Reviewer #2: The idea of publishing a best practice example of handling spatio-linguistic data is quite good and the relevance of this publication should not be underestimated. 
Good practice guidelines for data handling are available in all natural sciences and also in some disciplines of the humanities. A general good practice approach should be part of the digital humanities but also requires specific components for each topic. This article combines general aspects of handling geospatial data and specific aspects of linguistics. The current review focuses on the first part. Though the workflow employs rather general steps, it can be helpful for considering all steps; after all, overly narrow standards tend to prevent a concept from being applied. In particular, pointing at the importance of open licences (CC) and open platforms (Zenodo) is an important consideration for up-to-date research. Another important point is to acknowledge certain redundancies, such as offering the geospatial data as well as the ready-made maps in order to allow benefits for different target groups. The part on geospatial data can be further improved. - It seems clear why polygons should be used, but why are four point data sets part of the database? It seems that this fact destroys the consistency of the database. - I understand why WGS84 is used but for mapping and geospatial analysis projected coordinate systems are usually preferred. It would be helpful to explain the decision. - The decision for shapefiles is not understandable at all. It is true that shapefiles are still widespread but they are based on a completely outdated technology. A new good practice guideline and a new workflow should be based on up-to-date and sustainable technology. For small data as in this case, a text and WKT or EWKT based format such as CSV would have the benefit of being easy to integrate into fully reproducible research workflows. Furthermore, these formats are as software-independent as possible. For larger data, SpatiaLite databases and derivatives such as GeoPackage would be a good choice. 
It would be helpful to discuss this point and to refer to newer developments in geography and geospatial technology.
- An example of overlapping data would be helpful, and a more detailed explanation of how to combine data and handle the overlapping would be appreciated.

Another possible improvement could address the handling of chronological information. Perhaps a more general approach would make sense. Categories such as "traditional" might make sense in a specific case but are rather confusing and imprecise in general. Examples of how to handle temporal data are known from temporal databases and from archaeology.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?

Reviewer #1: Yes: Harald Hammarström

Reviewer #2: No

25 Apr 2022

Reviewer #1:

Comment: The data and practice behind this paper are extensively and quite well described but lack the most important part, namely what they consider the "language area" (that is to be mapped). After noting this problem (pp 19-20) they offer no definition or declaration of how they have dealt with it for the data at hand, except to say they've consulted experts.

Response: This comment was valuable. The “language area” was not very clearly defined in the original manuscript. We made the following changes: we added a paragraph to the ‘Methodological considerations’ chapter, which describes the difficulty of determining the distribution of languages and explains how language areas are presented in this work: “the languages are presented exactly as they were defined in the original publications, i.e. the spatial extent of languages remain unchanged in our process” (lines 122–125). In addition, we also referred to this aspect in ‘Discussion’ (lines 523–524) and added a reference to the in-depth discussion in our chapter in The Oxford Guide to the Uralic Languages (line 523).

C: Unfortunately the paper also lacks any details on who is to be considered an expert, with what instructions they have consulted the experts and how they weed out inconsistencies across experts. These are relevant questions. Presumably the experts are the same kind of people who write papers about the same languages, and as the authors duly note, they have "different opinions" and "vary notably".

R: The experts were indeed the researchers who wrote the particular language chapters for The Oxford Guide to the Uralic Languages. We now note this more clearly in lines 288–292. We also explain in more detail how the expert evaluation process actually proceeded (changes in lines 276–279, 296, 299–304, 313–316, 319–320).
To clarify the instructions we assigned to the experts, a new ‘S1 Appendix’ file was added to the manuscript. It includes an example of a query assigned to the North Khanty expert, indicating the overall style of all the queries in this research. Concerning the comment on how we dealt with the inconsistencies across experts, we clarified the phrasing in the text (lines 299–304). In brief, there were no inconsistencies across different experts because we asked the opinion of only one expert or one group of experts representing the responsible authors of each language chapter of The Oxford Guide to the Uralic Languages. We also recognize that there are and will be different opinions about the exact boundaries of particular languages (at least until a standard for language mapping is developed). In this case, the process began by collecting different opinions from published sources into the same database. Then, new state-of-the-art maps (consensus maps) were created on their basis. Experts evaluated the distribution based on these earlier sources as well as complementary information they had. After evaluating all the information simultaneously, the expert (or group of experts) made the decision on how to define the updated distribution on the map draft. These updated opinions ended up on the state-of-the-art maps as such. Further, updated opinions were also exported to the Geographical database of the Uralic languages, ‘S2 Appendix’ (former ‘S1 Appendix’), as ‘Expert distributions’ alongside ‘Original distributions’ (earlier sources). This process is now explained in more detail in the manuscript (lines 275–316). Notably, this is only one way to produce consensus maps, and because not all languages have experts, we have explained the other relevant possibilities for creating such maps at a general level (lines 257–264).
C: For example, for Nganasan the maps show a continuous area including the area to the coast east of Lake Taimyr and north of Nordvik island with sources "Dolgix 1963, Popov 1948, Brykina & Gusev 2015, Wagner-Nagy 2018:3". None of these sources sanction this chunk as Nganasan settled. In particular, the extremely detailed maps by Dolgix (temporally differentiated) and Popov (summer-winter differentiated), based on their own extensive fieldwork, stop just east of Lake Tajmyr, not including the NE area up to the coast. You could of course say that area might have been hunting grounds and certainly no other ethnic group lay claim to it, but then the entire northern Tajmyr could just as well be mapped as Nganasan as hunting grounds (Golovnev 1999 says this explicitly). Also, following Popov, the settlements were always along drainages, and the interior areas hunting grounds, but the maps show continuous areas but not including *all* hunting grounds. What is the actual intent?

R: As mentioned above, there really are many opinions about the “right” distribution of each language, and Nganasan is a good example of how much different sources vary. Nganasan speakers have traditionally relied on nomadism, which means a mobile way of living compared to the sedentary lifestyle common in the southern regions, which are also more densely populated. These differences in ethnic groups’ mobility and in population densities have been mapped with very variable methods through history. In this work, the idea was first to collect different opinions into the same database and then plot them on the map, so that experts could evaluate their accuracy concerning the past and present distributions using the complementary knowledge they had. Based on all the collected information, they decided how to define a particular language on the map.
The sources used as a basis for a new state-of-the-art map are listed alongside the figures. In general, if a source is mentioned, it has been used in some way for creating the new map, even though it can be inconsistent with another source in the same source list. The selected expert(s) decided which sources should be listed in each case. To indicate how different sources have been used to create state-of-the-art maps, we added a short note in the manuscript (lines 314–316). When we release a new version of S2 Appendix (Geographical database of the Uralic languages) in Zenodo, we will also explain this aspect in more detail. Lastly, we also added a reference to the Oxford Guide to the Uralic Languages chapter (in print) where we introduce the map-making process in more depth (line 288). To validate its use as a reference here and to ensure that the reviewers and editor have access to it, we provided a pre-print of this chapter in the Academia.edu repository: https://www.academia.edu/70120154/Mapping_the_distribution_of_the_Uralic_languages

C: Another example is South Saami, whose northern dialect on the map includes Södra Tärna with the sources "Hasselbrink 1981–1985, Rydving 2008: 360–361, Rydving 2016, Maja Lisa Kappfjell & Jussi Ylikoski (p.c.)", but Rydving is explicitly arguing for South Tärna to be counted as Ume Saami. What was the reasoning here and what information was asked of or provided by Ylikoski?

R: This is an important remark from the reviewer: the reference to Rydving 2016 is unfortunately misleading. The South Saami map was created based on these sources, but we now realize that Rydving 2016 is misleadingly included in the list (it did contribute to our work with this map, but we do not yet fully agree with it). We will remove Rydving 2016 from the source list when releasing a new version of the database (S2 Appendix) in Zenodo.
As to the subject matter itself, it may be noted that Rydving 2016 explicitly argues against the traditional view on the South Saami – Ume Saami border, which our map describes. While we (Ylikoski) sympathize with Rydving’s arguments, we consider the question still unsettled, and will try to describe the border as less definite in future releases. From the description of ‘S2 Appendix’: “The significant impact of expert(s) have been indicated with personal communication (p.c.) label in the sources. Personal communications were used when creating state-of-the-art distributions for the languages. In these cases, the received information is often unpublished, and it is implemented to the new language distributions in different ways.”

C: The bottom line is, there are two choices: (i) Either the paper stays without a definition of the intended areas to map and/or a methodology for using the experts, but then the authors have not defined a "practice", let alone a "best practice", and the title should be changed accordingly, or (ii) such a "best practice" is defined and included in a revision. This is the only important point, everything else is essentially fine and the resource as such is excellent.

R: The intended areas and the expert evaluation process are now explained in much more detail in the manuscript (see the changes above). We also sharpened the message that “the development of the actual standardization of language area is beyond the scope of this work, and the distributions of the languages are presented exactly as they were defined in the original publications” (lines 122–124). In practice, this means that language area standardization is not part of the “Best practices workflow”. Instead, the actual content of the “Best practices workflow” is comprehensively described in the final paragraph of ‘Introduction’, throughout ‘Methods’, in the first part of ‘Results’, and in ‘Discussion’. Therefore, we preserve the original title for this article.
C: Non-important issues:
* Rephrase: "In our case, it was obvious that different opinions about the language distributions vary notably between the different sources by language."
* "but the workflow is applicable to other language families or geographical areas as well" There are no experts for every language available for most families and areas so this workflow is not applicable in the same form for others.

R: It is true that expert evaluation as a method to produce state-of-the-art language areas and maps is not applicable as such to all language families or geographical areas. However, we are not claiming that this methodological guideline is applicable to all other language families or geographical areas. Further, this sentence refers to the whole Best practices workflow, not only to the part where we introduced the state-of-the-art map procedure. We did, however, add clarifications to describe the applicability of the expert evaluation process in ‘Methods’ (lines 278–279) and ‘Results’ (lines 372–377).

C: * It would be nice if more detail (or a reference) were given on how to best digitize printed maps

R: We made a few clarifications to the paragraphs where digitization was presented (lines 184–190) and added two references for a more detailed description of the process (line 181).

C:
* Imperatorskoe Russkoe Geograficheskoe, Obshchestvo -> remove comma
* Imperatorskogo Russkogo Geograficheskogo Obshchestva -> why genitive?

R: Both of the sources have been corrected -> Imperatorskoe Russkoe Geograficheskoe Obshchestvo

C: * Some Russian sources are in Roman, some in Russian majuscule and some in Russian minuscule. Harmonize.

R: All Russian sources have been harmonized in the ‘References’. The whole reference list is also updated.

Reviewer #2:

Comment: The idea of publishing a best practice example of handling spatio-linguistic data is quite good and the relevance of this publication should not be underestimated.
Good practice guidelines for data handling are available in all natural sciences and also in some disciplines of the humanities. A general good practice approach should be part of the digital humanities but also requires specific components for each topic. This article combines general aspects of handling geospatial data and specific aspects of linguistics. The current review focusses on the first part. Though the workflow employs rather general steps, it can be helpful for considering all steps and, after all, too narrow standards rather prevent the concept from being applied. In particular, pointing at the importance of open licences (CC) and open platforms (Zenodo) is an important consideration for up-to-date research. Another important point is to acknowledge certain redundancies, such as offering the geospatial data as well as the ready-made maps, in order to allow benefits for different target groups.

Response: This comment is a good summary of why different parts of the Best practices guideline are relevant in this context. The comment also helped us to understand that the ‘Data collection and harmonization’ chapter was focused on the Uralic case study instead of being a guideline in general. Therefore, we made the following changes: We updated Table 1 to cover our recommendations for geospatial data production more widely instead of concentrating only on the Uralic case study (changes in Table 1 are highlighted). We separated the columns for the general level (Advisable types/features) and for the types and formats used in the Uralic case study (Selected in the case study). ‘Temporal divisioning’ and ‘Metadata description and file naming’ are new features in Table 1, and are thereby included as a part of the ‘Data collection and harmonization’ standard. We updated the table caption (lines 205–210). We also added a new paragraph where we explain these steps on a more general level (lines 225–240).
In practice, this means that we now introduce, for example, a wider palette of geometry types, data formats and ways to execute temporal divisioning when harmonizing the geospatial data. This ensures that the developed practice takes the differences between language families and geographical regions into account more broadly and makes the overall process more flexible and applicable. The comment was very valuable and made us think about different parts of the practice from multiple angles. We truly feel that it helped us to further develop the whole practice section.

C: The part on geospatial data can be further improved. - It seems clear why polygons should be used, but why are four point data sets part of the database? It seems that this fact destroys the consistency of the database.

R: This comment was also very useful and made us think about the presentation of language distributions in general and from different angles. It is true that polygons should be the primary option for presenting areal information on languages. However, there are some exceptions where the use of points is well reasoned, for example in cases where spatial information on languages needs to be presented at a very general level (e.g. for political or privacy reasons) or where the occurrence of a language is strongly concentrated in settlement centers such as villages. In addition, much earlier published data on languages is represented with coordinates, which is why the point geometry type should not be rejected outright. Concerning the structure of the database, we do not see that different geometry types destroy its consistency; rather, we see them as complementing each other. In summary, we made the following changes to the manuscript:
- Better justification for the use of point data (lines 159–161, Table 1 -> Changes in ‘Geometry type’).
- Decision to recommend polygons as the primary option to present language distributions, with the possibility to use points alongside them in exceptional cases (lines 232–235, Table 1 -> changes in ‘Geometry type’).

C: - I understand why WGS84 is used, but for mapping and geospatial analysis projected coordinate systems are usually preferred. It would be helpful to explain the decision.

R: The decision to select the WGS84 coordinate system is based on the fact that it is a standard in digital databases and map user interfaces in linguistics and other disciplines studying human history. Compatibility and usability were the main factors we emphasized when making this decision, but the other options were also considered in our data production process. The use of geographic coordinate systems is justified when the study area is large (global or continent-wide), which is often the case with language families. Projected coordinate systems could be an alternative if the data cover smaller geographical regions (e.g. the area of a small country), but in general this seldom happens. As a result of the comment, we better justified the selection of WGS84 in the manuscript (lines 184–187, Table 1).

C: - The decision for shapefiles is not understandable at all. It is true that shapefiles are still widespread, but they are based on a completely outdated technology. A new good practice guideline and a new workflow should be based on up-to-date and sustainable technology. For small data, as in this case, a text and WKT or EWKT based format such as CSV would have the benefit of being easy to integrate into fully reproducible research workflows. Furthermore, these formats are as software-independent as possible. For larger data, SpatiaLite databases and derivatives such as GeoPackage would be a good choice. It would be helpful to discuss this point and to refer to newer developments in geography and geospatial technology.
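The text-based option raised in this comment, a CSV file with one WKT geometry column, could look roughly like the sketch below. It is only an illustration of the general idea, not the format used in the Uralic database: the column names, the placeholder language name and the coordinates are all hypothetical, and the small parser only handles the single-ring polygons written by this same script.

```python
import csv
import io


def polygon_to_wkt(ring):
    """Format one exterior ring [(lon, lat), ...] as a WKT POLYGON string."""
    # WKT rings must be closed: repeat the first vertex at the end if needed.
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON (({coords}))"


def wkt_to_polygon(wkt):
    """Parse the simple single-ring POLYGON strings produced above."""
    inner = wkt[wkt.index("((") + 2 : wkt.rindex("))")]
    return [tuple(float(v) for v in pair.split()) for pair in inner.split(",")]


# Hypothetical record: placeholder name and made-up WGS84 lon/lat coordinates.
rows = [
    {
        "language": "example_language",
        "period": "20th century",
        "wkt": polygon_to_wkt(
            [(86.0, 72.0), (100.0, 72.0), (100.0, 75.0), (86.0, 75.0)]
        ),
    }
]

# Write the table as plain CSV; the csv module quotes the WKT field
# automatically because it contains commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["language", "period", "wkt"])
writer.writeheader()
writer.writerows(rows)

# Read it back to show that the geometry survives the round trip.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
ring = wkt_to_polygon(restored[0]["wkt"])
```

Because the result is plain text, it can be versioned and diffed like any other file, which is the reproducibility benefit the comment points to. Real data with holes or multi-part geometries would need a proper WKT parser rather than the string handling shown here.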
R: In the field of geoinformatics there is a lot of discussion about “right and wrong” data formats: some prefer one format and others prefer another. Further, geospatial techniques are developing rapidly and there is definitely no consensus on which format is the best and which should be avoided. However, we took the criticism concerning the use of shapefiles seriously and carefully examined the pros and cons of the different data formats. From some points of view shapefiles can be considered outdated technology, and compared to newer formats they may have some deficiencies; for example, the comment about software independence was valid. On the other hand, the shapefile really is a widely used and easily convertible format, which can be used with all common GIS software. It also usually works well as such in software developed for geostatistical analyses, and if not, shapefiles are easy to convert to some other format. However, when the aim is to produce a “Best practice” for the data format, one that will also rest on the best possible technology in the future, the recommendation should be considered carefully. Therefore we decided to modify our practice to be more permissive. Instead of selecting only one format to be used, we decided to recommend any format which fulfills the requirements of being interoperable, up-to-date and convertible, while emphasizing widespread use and familiarity in our procedure. After careful review we decided to mention only a few exemplary formats as a recommendation, including those which have a long tradition of use in GIS and those which are based on newer technology and fit with OGC (Open Geospatial Consortium) standards: “e.g. SHP, GEOJSON and WKT” (Table 1). The same content was added to the text (lines 225–232, especially 230–232). Finally, we wanted to emphasize the possibility of also using the developed spatial data platform URHIA for file format conversion.
Therefore we added a mention of the possibility to download the datasets in a variety of formats (lines 353–354).

C: - An example of overlapping data would be helpful, and a more detailed explanation of how to combine data and handle the overlapping would be appreciated.

R: An example of overlapping data is presented in Fig 1 (former Fig 2), titled “Geographical overlap of different source materials concerning the distribution of the Khanty language(s) at the beginning of the 20th century”. This figure shows how differently separate sources on the Khanty language delimit the language distribution on the map. It also illustrates the overlap quite clearly. In addition, we have explained the different options for handling overlap when creating consensus maps 1) at a general level (lines 248–264, with a more detailed explanation in Haynie & Gavin 2019 [24]), and 2) concerning our solution in the Uralic case study, in the chapter ‘Creating state-of-the-art maps based on the digitized data and new expert opinions’. In our case, consensus information on languages was collected using an expert evaluation process in which an expert or group of experts defined the distribution of a particular language based on earlier sources and other information they had collected or had available. Note that the expert evaluation process is now explained in more detail following the comments from reviewer 1. See also S1 Appendix, a new Supplementary material exemplifying the query format sent to the expert at the beginning of the process.

C: Another possible improvement could address the handling of chronological information. Perhaps a more general approach would make sense. Categories such as "traditional" might make sense in a specific case but are rather confusing and imprecise in general. Examples of how to handle temporal data are known from temporal databases and from archaeology.
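One general way to handle chronological information, in the spirit of the temporal databases mentioned in this comment, is to store every dataset's dating as an explicit year interval. The sketch below is an assumption-laden illustration rather than anything taken from the manuscript: the `Period` class and `century_of` helper are hypothetical names, and the convention that a century runs from year 01 to year 00 (e.g. 1801 to 1900) is one choice among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """Closed interval of calendar years [start, end] attached to a dataset."""

    start: int
    end: int

    def overlaps(self, other: "Period") -> bool:
        # Two closed intervals overlap when each starts before the other ends.
        return self.start <= other.end and other.start <= self.end


def century_of(year: int) -> Period:
    """Return the century containing a CE year, e.g. 1897 -> 1801..1900.

    Assumes the convention that the 19th century spans 1801-1900.
    """
    first = ((year - 1) // 100) * 100 + 1
    return Period(first, first + 99)


# A precisely dated source keeps its exact year as a one-year interval,
# while an inexact source falls back to century level.
exact = Period(1926, 1926)
coarse = century_of(1926)
comparable = exact.overlaps(coarse)
```

With intervals of this kind, a precisely dated dataset and a century-level one remain comparable through a simple overlap test, whereas a free-text class such as "traditional" would first have to be given an explicit, documented interval.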
R: The reviewer is right, especially concerning the more general approach to handling temporal divisions in historical geospatial datasets. We edited our manuscript concerning the standards of temporal divisioning (lines 235–238, Table 1). We now take into consideration the variability of different datasets: data can be temporally precise or inexact, and different classifications can each be well reasoned. In practice, we recommend exact dating for the data whenever possible (often present-day data), but a more general division when the source materials do not allow more (just as in our case study and with many other historical or prehistorical data). Even then, it would be good to aim for century level (e.g. 19th century, 20th century or 1800, 1900 etc.) rather than more abstract or general classes in the temporal structure. The data is then more likely to be compatible with data from other disciplines.

Academic Editor:

Comment: Although both reviewers judge the revisions needed to be minor, they address different aspects, so the combined amount of suggested revisions would be more than minor. Both reviewers point to some cases where a particular practice might be improved or where a suggestion is not perceived as being optimal. No doubt, there is room for different opinions about what is a best practice, so more discussion of alternatives could be added. And sometimes there is still room for improvement in your own practice; it may actually strengthen the paper to admit this possibility where relevant.

Response: We see that the comments really helped us to understand the deficiencies in the manuscript, and they led to the following quite substantial edits to the text: We explained the key concepts (e.g. language area, who was seen as an expert) in more detail. We also added several clarifications concerning the digitization and expert evaluation processes.
Importantly, we made notable changes to ‘Data collection and harmonization’ to balance the general aspect and the Uralic case study in geospatial data creation, including wider and more flexible recommendations for suitable file formats, geometry types and temporal divisions (see Table 1 and lines 225–240). The different aspects of the Best practices are now also better taken into account; they are covered most comprehensively in ‘Methods’, which from our point of view was the optimal section in which to deepen these aspects. However, some clarifications were also made to ‘Results’ and ‘Discussion’.

Submitted filename: Response to Reviewers.docx

26 May 2022

Best practices for spatial language data harmonization, sharing and map creation – a case study of Uralic

PONE-D-21-40182R1

Dear Dr. Rantanen,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.
Kind regards,
Søren Wichmann, PhD
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here.
Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #1: Yes
Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: N/A
Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #1: Yes
Reviewer #2: Yes

6. Review Comments to the Author
Reviewer #1: Comments have been adequately addressed.
Reviewer #2: Many thanks for the revised version. All comments have been addressed and all issues are solved.

7. Do you want your identity to be public for this peer review?
Reviewer #1: Yes: Harald Hammarström
Reviewer #2: Yes: Oliver Nakoinz

30 May 2022

PONE-D-21-40182R1

Best practices for spatial language data harmonization, sharing and map creation – a case study of Uralic

Dear Dr. Rantanen:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Søren Wichmann
Academic Editor
PLOS ONE

4 in total

1. A comparison of worldwide phonemic and genetic variation in human populations.

Authors: Nicole Creanza; Merritt Ruhlen; Trevor J Pemberton; Noah A Rosenberg; Marcus W Feldman; Sohini Ramachandran
Journal: Proc Natl Acad Sci U S A Date: 2015-01-20 Impact factor: 11.205

2. D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity.

Authors: Kathryn R Kirby; Russell D Gray; Simon J Greenhill; Fiona M Jordan; Stephanie Gomes-Ng; Hans-Jörg Bibiko; Damián E Blasi; Carlos A Botero; Claire Bowern; Carol R Ember; Dan Leehr; Bobbi S Low; Joe McCarter; William Divale; Michael C Gavin
Journal: PLoS One Date: 2016-07-08 Impact factor: 3.240

3. Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics.

Authors: Robert Forkel; Johann-Mattis List; Simon J Greenhill; Christoph Rzymski; Sebastian Bank; Michael Cysouw; Harald Hammarström; Martin Haspelmath; Gereon A Kaiping; Russell D Gray
Journal: Sci Data Date: 2018-10-16 Impact factor: 6.444

4. Triangulation supports agricultural spread of the Transeurasian languages.

Authors: Martine Robbeets; Remco Bouckaert; Matthew Conte; Alexander Savelyev; Tao Li; Deog-Im An; Ken-Ichi Shinoda; Yinqiu Cui; Takamune Kawashima; Geonyoung Kim; Junzo Uchiyama; Joanna Dolińska; Sofia Oskolskaya; Ken-Yōjiro Yamano; Noriko Seguchi; Hirotaka Tomita; Hiroto Takamiya; Hideaki Kanzawa-Kiriyama; Hiroki Oota; Hajime Ishida; Ryosuke Kimura; Takehiro Sato; Jae-Hyun Kim; Bingcong Deng; Rasmus Bjørn; Seongha Rhee; Kyou-Dong Ahn; Ilya Gruntov; Olga Mazo; John R Bentley; Ricardo Fernandes; Patrick Roberts; Ilona R Bausch; Linda Gilaizeau; Minoru Yoneda; Mitsugu Kugai; Raffaela A Bianco; Fan Zhang; Marie Himmel; Mark J Hudson; Chao Ning
Journal: Nature Date: 2021-11-10 Impact factor: 49.962
