
Using lexical language models to detect borrowings in monolingual wordlists.

John E Miller1, Tiago Tresoldi2, Roberto Zariquiey3, César A Beltrán Castañón1, Natalia Morozova2, Johann-Mattis List2.   

Abstract

Lexical borrowing, the transfer of words from one language to another, is one of the most frequent processes in language evolution. In order to detect borrowings, linguists make use of various strategies, combining evidence from various sources. Despite the increasing popularity of computational approaches in comparative linguistics, automated approaches to lexical borrowing detection are still in their infancy, disregarding many aspects of the evidence that is routinely considered by human experts. One example of this kind of evidence is phonological and phonotactic clues, which are especially useful for the detection of recent borrowings that have not yet been adapted to the structure of their recipient languages. In this study, we test how these clues can be exploited in automated frameworks for borrowing detection. By modeling phonology and phonotactics with the support of Support Vector Machines, Markov models, and recurrent neural networks, we propose a framework for the supervised detection of borrowings in monolingual wordlists. Based on a substantially revised dataset in which lexical borrowings have been thoroughly annotated for 41 different languages from different families, featuring a large typological diversity, we use these models to conduct a series of experiments to investigate their performance in monolingual borrowing detection. While the general results appear largely unsatisfying at first glance, further tests show that the performance of our models improves with increasing amounts of attested borrowings and in those cases where most borrowings were introduced by one donor language alone. Our results show that phonological and phonotactic clues derived from monolingual language data alone are often not sufficient to detect borrowings when used in isolation. Based on our detailed findings, however, we express hope that they could prove to be useful in integrated approaches that take multi-lingual information into account.

Entities:  

Year:  2020        PMID: 33296372      PMCID: PMC7725347          DOI: 10.1371/journal.pone.0242709

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Problem and motivation

Lexical borrowing (i.e., the direct transfer of words from one language to another) is one of the most frequent processes of language evolution [1]. We can easily observe the process in real time, especially regarding vocabulary from religion or technology, since words are often transferred along with cultural practices or innovations. While it took scientists a long time to find out that languages constantly change [2], it was already clear in ancient times that languages acquire lexical material from their neighbors [3], as evidenced in Plato’s Kratylos dialog (409d-10a) [4], where Socrates discusses the problem that lexical borrowings impose on studies in etymology. Nonetheless, detecting borrowings is still one of the outstanding problems in historical linguistics, specifically when it comes to computational approaches [5]. Discriminating between inherited and borrowed words (or loanwords) is crucial for the successful application both of the comparative method in historical linguistics [2], which seeks to identify genetically related languages and reconstruct their ancestral stages that are not recorded in written sources, and of phylogenetic reconstruction, which seeks to identify the most plausible phylogenies (often represented by a family tree) by which the languages in a given family evolved into their current shape [6]. Lexical borrowing is a very peculiar process that cannot be directly compared with other processes of language change. While sound change, for example, proceeds in a surprisingly regular manner, tending to impact all words in the lexicon of a given spoken language in which the sound occurs in a given phonetic environment, lexical borrowing crucially depends on the initial contact situation. This also means that language contact will impact particular languages differently, depending on their geographic location or on the interactions of their speakers with people from other cultures.
For this reason, it has also proven very difficult to come up with general statistics on lexical borrowing, and although scholars seem to agree in general that some words are less likely to be borrowed, depending on the meanings they express [7, 8], it is extremely difficult to derive generalizations, given that individual language histories provide many surprises [9]. When linguists try to detect borrowings, they use an arsenal of different techniques which essentially aim at detecting conflicts in the data for individual words [10]. English mountain, for example, which was borrowed from Old French, exhibits a phylogenetic conflict: since English is a Germanic language, the word’s similarity to the words denoting ‘mountain’ in the Romance languages (cf. Italian monte, Spanish montaña), as opposed to its lack of similarity with words in Germanic languages (cf. German Berg, Dutch berg), is in conflict with the phylogeny. As another kind of conflict, German Damm ‘dam’ is obviously similar to its translational equivalent dam in English, but the similarity is not expected: since English words starting in d tend to have related words starting in t in German, we observe a conflict with established sound correspondences. While computational approaches to borrowing detection have tried to detect at least some of these conflicts systematically [10], there is another type of conflict that has not yet been explored. This conflict is reflected in the fact that, at least in some languages and under specific circumstances, borrowed words may still retain certain foreign properties when they have just entered a given language. These characteristics include both specific phonological properties (‘foreign sounds’) and specific phonotactic properties (‘foreign distributions of sounds’), and they tend to disappear with time, due to the process of loanword nativization [11, p. 200].
Although masked with time, language-internal evidence for borrowing can be observed in many languages from different families. In many Hmong-Mien languages, for example, some Chinese words are borrowed with a very specific tone that only occurs in Chinese words [12]. Similarly, it is easy for German speakers to identify job as a loan from English, since the grapheme j is pronounced as [dʒ] in German only in borrowed words. Along the same lines, but in a radically different context, speakers of Iskonawa, an obsolescent Panoan language spoken in the Central Peruvian Amazonia, can easily identify loanwords from Shipibo-Konibo, the dominant language in the area, due to straightforward phonological features. For instance, Iskonawa has dropped word-initial [h]; thus, forms like [hana] ‘tongue’ or [huni] ‘man’, which retain the initial [h], are easily detected as loanwords from Shipibo-Konibo. Apart from specific sounds and tones, language-internal evidence for borrowing may include peculiar constructions, specific phonotactic elements (such as certain consonant clusters or vowel combinations), unusual stress patterns [13, 14], or even specific semantics or morphology. For instance, in Spanish, words like análisis ‘analysis’ or curriculum ‘curriculum’ are easily identified as loans due to their irregular plural forms, análisis (invariant) and curricula (with a final a), respectively. However, speakers adapt borrowed words to the phonological conditions of their language from the moment the words enter it, and the more time has passed since a word was first borrowed, the harder it is to detect it from its external characteristics alone [15]. Although we expect some limitations of language-internal evidence for lexical borrowing, it seems worthwhile to test to what degree it could be employed for automated borrowing detection approaches in computational comparative linguistics.
Assuming that the strongest language-internal evidence for lexical borrowings can be found in the phonology and phonotactics of borrowed words, all that we need to do in a computational approach to mono-lingual borrowing detection is to derive computational models of phonology and phonotactics from annotated wordlists of a given language and then calculate to which degree a word resembles a typically inherited or a typically borrowed word. To model the phonology and the phonotactics of a language, we make use of different lexical language models. Assuming that a language model refers to “any system trained only on the task of string prediction, whether it operates over characters, words or sentences, and sequentially or not” [16], our lexical language models are specific cases of language models derived from lexical data typically provided in the form of a wordlist, with words being represented by phonetic transcriptions. Having trained lexical language models for inherited and borrowed words with the help of a given annotated wordlist representing a given language variety in a supervised learning setting, we can then try to measure to which degree words that were not used to train a given model can be classified as either being inherited or borrowed. In this study, we test how well three different lexical language models—one non-sequential model based on a support vector machine, and two sequential models, one based on Markov chains and one based on recurrent neural networks—perform in detecting borrowed words. We apply our models to the World Loanword Database [17], a large, cross-linguistic sample of wordlists in which borrowed words are annotated, which we considerably improved by adding harmonized phonetic transcriptions instead of the original orthographic representations of word forms. We perform a series of experiments, some with real data and some with artificially augmented data.
While we find borrowing detection results are unsatisfying for many language varieties, they become more promising (a) with increased amounts of borrowings, and (b) when a higher rate of borrowings goes back to a single donor language. Overall, the recurrent neural network method performs better, although the differences in comparison with the other methods are small.

State of the art

Although the detection of borrowed words is one of the major tasks in historical language comparison, the classical, non-computational techniques which linguists use to identify borrowings have never been properly formalized or explicitly described [10]. As mentioned before, classical linguists make extensive use of proxies to assess whether or not a given word has been borrowed. While most of the evidence linguists employ to detect borrowed words is based on the comparison of several languages, conflicts in phonology and phonotactics are also routinely used for borrowing detection, specifically when dealing with recent borrowing events. Similar to the prevalence of multilingual approaches to borrowing detection in classical historical linguistics, most recent attempts to detect borrowings automatically have also been based on comparative rather than monolingual evidence. Various authors have tried to detect borrowings by searching for phylogenetic conflicts [18-24]. Other approaches identify similar words in unrelated languages [25-27]. Occasionally, authors have tried to detect borrowings by relying on the idea that some words can be more easily borrowed, because of the meanings they express [28]. While the detection of words borrowed between unrelated languages seems to work relatively well [27], all other approaches that have been proposed in the past have never been rigorously tested. In contrast to multilingual approaches to borrowing detection, monolingual approaches, in which borrowings are identified by relying on the (annotated) data of one language alone, have rarely been applied so far, and the rare exceptions we know of involve very particular settings for individual languages, as opposed to generic approaches that could be generally applied [29, 30].
Although—to our knowledge—language models have not yet been used to identify borrowings in exclusively monolingual wordlists, the idea of using lexical language models for specific tasks in comparative linguistics is not new. Language identification, for example, which seeks to identify the natural language in which a given document is written [31], shows certain similarities with the task of monolingual borrowing detection. Distinguishing foreign words within a paragraph or sentence is similar to the problem of detecting recently borrowed words in a wordlist.

Materials and methods

Materials

We use the multilingual wordlist collection provided by the World Loanword Database (WOLD) [17], which we modified by adding harmonized phonetic transcriptions. Each of the 41 wordlists in this collection provides translation equivalents for 1,460 distinct concepts (see the Concepticon resource for details on this concept list [32]). Since translations may be missing, or one concept may be represented by more than one word form, the resulting wordlists comprise between 956 and 2,558 word forms. While word forms were provided in orthographic form or phonological transcriptions in the original data, we added phonetic transcriptions which follow the unified Broad IPA transcription system proposed by the Cross-Linguistic Transcription Systems reference catalog [33, 34] with the help of orthography profiles [35] manually compiled by reading the relevant literature for each language. Orthography profiles can be best thought of as a specific look-up table, which allows one to convert transcriptions from one orthography into another (compare the presentation in Wu et al. [36] for details); while such assisted transcription can introduce noise in the data, no comparable lexical database with transcriptions and loanword annotation was available. Each word form is given a so-called borrowed score, indicating, on a five-point scale, a linguistic expert’s rating of whether the item was borrowed. To make sure that we only consider clear-cut borrowings in our tests, we treated as borrowed only the words which were labeled as clearly borrowed. The derived database with phonetic transcriptions for all 41 wordlists was curated with the help of the CLDFBench toolkit [37], which allows for a convenient, test-based data curation workflow in which the resulting dataset is offered in the formats recommended by the Cross-Linguistic Data Formats initiative (CLDF, https://cldf.clld.org [38]).
These format specifications have proven very useful in the past, as they allow not only for a quick aggregation of data from different sources [39], but also for their convenient integration in computational workflows [36]. An illustrative subset of the data, as stored in memory for training and evaluation, is provided in Table 1.
Table 1

Illustrative subset with the most salient information in the data.

ID    Language          Concept   Value     Segments      Borrowed
1     Swahili           WORLD     dunia     ɗ u n i a     True
45    Tarifiyt Berber   VALLEY    tizi      θ i z i       False
481   English           CALM      calm      k ɑ: m        True
992   Mapudungun        FOAM      tronün    tȿ o n ɨ n    False
For testing purposes, we created an additional German wordlist, taken from an etymological dictionary of German [40], with phonetic transcriptions added with modifications from the CELEX database [41]. While the enhanced WOLD database has been curated on GitHub (https://github.com/lexibank/wold) and archived with Zenodo [42], the German wordlist, available as a stand-alone tabular file in the package we wrote for monolingual borrowing detection, represents an older version of a refined wordlist that, combining additional sources [43], has been published separately [44].

Lexical language models

For the purpose of testing how well borrowed words in a wordlist can be detected through language-internal information alone, we employ three different lexical models which reflect unique characteristics of phonological and phonotactic clues which can be used to identify borrowings. The Bag of Sounds method represents words internally as a set of the sounds of which they consist, the Markov Model represents words by their sound n-grams, and the Neural Network represents words in the form of sequences of learned vector representations of sounds. We perform borrowing detection on each wordlist individually, modeling word expectedness with Bag of Sounds, Markov Model [45], and Neural Network [46] methods. The Bag of Sounds is a baseline method, which uses a support vector machine to directly detect borrowings based only on the set of sounds. The Markov Model and Neural Network produce sequential sound segment probability estimates, which we transform into word entropies and use to predict borrowed words. The Markov Model serves as the standard approach and the Neural Network as an improved alternative to borrowing detection with entropy methods. The Markov Model and Neural Network methods focus on phonotactics, while the Bag of Sounds method focuses on phonology.

Bag of Sounds

Since the word forms in our data are available as harmonized phonetic transcriptions, it is straightforward to represent each word form in a given language as a vector indicating the presence and absence of distinct sound segments. Since the order of these sound segments is not important, and neither is their frequency considered, this vector can be thought of as a simple bag of sounds, in which the sounds making up a given word form are represented as a set. The task of distinguishing borrowed from inherited words can then be pursued with the help of a support vector machine with a linear kernel [47, 48]. The support vector machine identifies the plane which optimally separates inherited from borrowed words based on the set of sound segments. The Bag of Sounds method does not consider the order or the frequency of elements in a given sound sequence, and we did not expect it to perform extraordinarily well in all languages in our sample. The advantage of the model is that it is simple and fast in application. It also provides a baseline for those cases where peculiar sounds provide enough information to identify a given borrowed word.
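As an illustration of this encoding (a minimal sketch, not the actual PyBor implementation; the toy segment lists are invented for the example), each word becomes a presence/absence vector that could then be fed to a linear-kernel support vector machine such as scikit-learn’s LinearSVC:

```python
def bag_of_sounds(words):
    """Encode each word (a list of sound segments) as a multi-hot
    presence/absence vector over the inventory of distinct segments.
    Neither the order nor the frequency of segments is retained."""
    inventory = sorted({seg for word in words for seg in word})
    vectors = []
    for word in words:
        present = set(word)
        vectors.append([1 if seg in present else 0 for seg in inventory])
    return inventory, vectors

# Toy example; the segment lists are invented for illustration.
inventory, vectors = bag_of_sounds([["ɗ", "u", "n", "i", "a"],
                                    ["k", "ɑ:", "m"]])
```

The resulting vectors, paired with the borrowed/inherited labels, form the training input for the support vector machine.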

Markov Model

An n − 1 order Markov Model emits a sound segment with a probability that depends on the n − 1 previous sound segments (an n-gram model). The product of the sound segment probabilities estimated by the Markov Model is transformed into a per sound segment word entropy, which is then used in borrowing detection. We use a second order Markov Model, a 3-gram model, from the Natural Language Toolkit (NLTK) [49]. In the second order model, the emission probability, P(s_i | s_{i-2}, s_{i-1}), is conditioned on the previous 2 sound segments. The second order Markov Model is local, with longer range effects resulting from the second order probabilistic process. We can approximate the probability of a sequence of n sound segments s_1, …, s_n that make up a word by the product of the n second order conditional probabilities: P(w) ≈ ∏_{i=1}^{n} P(s_i | s_{i-2}, s_{i-1}). We transform word probabilities to a per sound segment word entropy, H(w) = −(1/n) log2 P(w), which typically exhibits a smooth distribution with moderate right skew for wordlists. The second order model with a sound segment vocabulary size V requires V³ probability parameters for sound segment emission probabilities conditioned on the previous two sound segments. With wordlists of just 1,000 to 2,500 word forms and a typical sound segment vocabulary size of V ≈ 50, estimating 50³ = 125,000 parameters by maximum likelihood would result in sparse parameter estimation, with problems of both undefined conditional probabilities and overfitting. We use interpolated Kneser-Ney smoothing to accommodate unseen tri-grams, reduce overfitting, and reduce the number of estimated parameters to fewer than the V³ required under maximum likelihood.
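The per-segment entropy computation can be sketched as follows. This simplified illustration substitutes additive (Laplace-style) smoothing for the interpolated Kneser-Ney smoothing of the NLTK model used in the study; the boundary symbols and the smoothing constant k are choices made for the example:

```python
import math
from collections import Counter

def train_trigram(words, k=0.5):
    """Fit a second order (trigram) model over sound segments.
    Additive smoothing with constant k stands in for the paper's
    interpolated Kneser-Ney smoothing in this sketch."""
    tri, bi, vocab = Counter(), Counter(), set()
    for word in words:
        seq = ["<s>", "<s>"] + list(word) + ["</s>"]
        vocab.update(seq)
        for i in range(2, len(seq)):
            tri[tuple(seq[i - 2:i + 1])] += 1
            bi[tuple(seq[i - 2:i])] += 1
    V = len(vocab)

    def prob(context, segment):
        # P(s_i | s_{i-2}, s_{i-1}) with additive smoothing
        return (tri[context + (segment,)] + k) / (bi[context] + k * V)

    return prob

def per_segment_entropy(word, prob):
    """-(1/n) * sum of log2 P(s_i | s_{i-2}, s_{i-1}) over the padded word."""
    seq = ["<s>", "<s>"] + list(word) + ["</s>"]
    logp = sum(math.log2(prob(tuple(seq[i - 2:i]), seq[i]))
               for i in range(2, len(seq)))
    return -logp / (len(seq) - 2)
```

In the dual-model setting described below, one such entropy function would be trained on inherited words and one on borrowed words, and a test word is scored by both.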

Recurrent Neural Network

Recurrent Neural Networks provide word-length order conditioning via the recurrent layer with memory. Word probabilities are expected to be better estimated, i.e., to better approximate human performance, than for the Markov Model, as we can infer from early work on language modeling [46] and more recent work with transformer language models [50]. Conditional sound segment emission probabilities are dependent on and estimated from all earlier sound segments of the current word: P(s_i | s_1, …, s_{i-1}) = f(c_{i-1}, …, c_1). We can approximate the probability of a sequence of n sound segments s_1, …, s_n that make up a word by the product of the n corresponding conditional probabilities: P(w) ≈ ∏_{i=1}^{n} P(s_i | s_1, …, s_{i-1}). Word probabilities are again transformed to a per sound segment word entropy. The challenge and advantage of the recurrent Neural Network method lies in the estimation of the conditional sound segment probabilities with the function f(c_{i-1}, …, c_1), using a more complex architecture but with fewer parameters (Fig 1) than the second order Markov model. Sparse indicator vectors c, representing sound segments, are transformed into dense real-valued input vectors x. In the recurrent layer, input vectors x and prior hidden state vectors h are linearly transformed and passed through a tanh activation function to produce the current hidden state vector h and output vector o. The resulting output vectors are linearly transformed in a dense output layer into logits y, representing possible output segments. The softmax activation function transforms the logit values y into sound segment probability estimates.
Fig 1

Recurrent Neural Network—Lexical model.

(A) Configuration parameters. (B) Model architecture.

While the recurrent Neural Network model requires a high baseline number of parameters given its embedding length and recurrent layer length, the growth in the number of parameters is only linear in the vocabulary size. As a result, the number of parameters in the Neural Network is on the order of 10,000, and this does not change much with the vocabulary size. Furthermore, the number of parameters does not increase with word length in sound segments, even though the conditioning is on all previous sound segments. We implement our recurrent Neural Network in TensorFlow 2.2 [51] and parameterize the model to permit ready changes in architecture, regularization, and fitting parameters during experimentation. The configuration used in this study is shown in Fig 1. Neural network models, even with just thousands of parameters, may suffer from substantial variance between training and test performance due to overfitting, especially when the amount of training data is comparatively small, as in this case. We apply dropout and L2 regularization to reduce overfitting.
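The forward pass described above can be sketched in plain NumPy (the study’s actual model is implemented in TensorFlow 2.2 with dropout and L2 regularization; the layer sizes, start symbol, and random, untrained weights here are illustrative only, so the resulting probabilities are not meaningful):

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 50, 16, 32                 # vocabulary, embedding, hidden sizes (illustrative)
emb = rng.normal(0, 0.1, (V, E))     # dense embeddings of sparse indicator vectors c
W_xh = rng.normal(0, 0.1, (E, H))    # input -> hidden
W_hh = rng.normal(0, 0.1, (H, H))    # hidden -> hidden (recurrence)
b_h = np.zeros(H)
W_hy = rng.normal(0, 0.1, (H, V))    # hidden -> output logits y
b_y = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(segment_id, h):
    """One recurrent step: tanh state update, then softmax over logits."""
    x = emb[segment_id]
    h = np.tanh(x @ W_xh + h @ W_hh + b_h)
    return softmax(h @ W_hy + b_y), h

def word_entropy(segment_ids, start_id=0):
    """Per-segment entropy: -(1/n) * sum of log2 P(s_i | s_1, ..., s_{i-1})."""
    h, logp, prev = np.zeros(H), 0.0, start_id
    for s in segment_ids:
        p, h = step(prev, h)
        logp += np.log2(p[s])
        prev = s
    return -logp / len(segment_ids)
```

Because the hidden state h is carried forward at every step, each emission probability is conditioned on all earlier segments of the word without any growth in parameter count.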

Decision procedures

Models are trained on labeled data and then used to predict whether unlabeled test words are inherited or borrowed. The Bag of Sounds method directly decides which test words are borrowed. Both the Markov Model and Neural Network methods estimate test word entropies from dual models trained on inherited and borrowed words separately. We assume that for a model trained on inherited words, the entropy estimates for unobserved inherited words will be less than those for borrowed words. Similarly, for a model trained on borrowed words, entropy estimates for unobserved borrowed words will be less than those for inherited words. Words are designated as inherited or borrowed depending on which of the models has the lesser entropy. The choice of the model with the lesser entropy can be expressed as the difference of entropies compared to a critical value, in this case zero: a word w is classified as borrowed if H_inherited(w) − H_borrowed(w) > 0, and as inherited otherwise.
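Assuming two per-segment entropy functions obtained from the dual models (the names below are illustrative), the decision rule reduces to the sign of the entropy difference:

```python
def detect_borrowed(word, entropy_inherited, entropy_borrowed):
    """Classify by the sign of the entropy difference: a word is called
    borrowed when the inherited-trained model finds it more surprising
    (higher entropy) than the borrowed-trained model does."""
    return entropy_inherited(word) - entropy_borrowed(word) > 0
```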

Assessing detection performance

We assess detection performance using precision, recall, and their harmonic mean (F1 score), as well as an accuracy measure, all based on the frequency counts of borrowing detection by true borrowing status defined in Table 2. Following [52], precision is the proportion of true positive borrowings out of all detected positives, precision = tp / (tp + fp); recall is the proportion of true positive borrowings out of all borrowings, recall = tp / (tp + fn); the F1 score is the harmonic mean of precision and recall, F1 = 2 · precision · recall / (precision + recall); and accuracy is the proportion of all detections that are correct, accuracy = (tp + tn) / (tp + fp + fn + tn). We consider F1 the primary measure, since it combines both precision and recall. Accuracy does not specifically focus on borrowing detection and is of secondary importance.
Table 2

Frequency counts of borrowing detection by true borrowing status.

                       True borrowing status
Borrowing detection    Borrowed               Inherited
Positive               tp = true positive     fp = false positive
Negative               fn = false negative    tn = true negative
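The four measures follow directly from the confusion counts in Table 2; a minimal sketch:

```python
def detection_scores(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```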

Implementation

Methods for borrowing detection and evaluation have been implemented in the form of the Python PyBor package and have been published along with the supplemental information accompanying this study [53]. The Python package contains the code, access to the data, and examples that replicate all of the studies presented here and illustrate how to perform new analyses.

Experiments with results

We run several experiments as follows. First, we simulate the detection of recent borrowings by artificially seeding wordlists with various proportions of words from a foreign language. Second, we test borrowed word detection more realistically by using wordlists without alteration. Third, we perform correlational and regression analyses to diagnose performance as a function of the proportion of borrowed words and of phonological variables. Fourth, we stratify wordlists by the number of borrowed words and the presence of a dominant donor language and analyze borrowed word detection by strata. Last, we examine entropy distributions for a few exemplary wordlists to see how the entropy method works.

Detection of artificially seeded borrowings

To simulate a situation in which foreign words have recently entered a language without being modified by borrowed word nativization processes, we designed an experiment in which the wordlists in our base datasets were artificially mixed with words from another wordlist which was not part of the original WOLD collection. The idea of using “artificially seeded” borrowings instead of borrowings attested in actual language was originally proposed for evaluating methods for lateral gene transfer detection in biology [54], and later tested on linguistic data in order to assess the power of phylogenetic methods for borrowing detection across multiple languages [23]. The advantage of this procedure is that it creates simulated data without requiring the efforts of detailed simulation experiments. Artificial borrowings were seeded into a wordlist in three steps. We first removed all borrowed words from the wordlist to guarantee that no recent borrowings from other languages could influence the results. We then added inherited words from the additional German list, which we created for testing purposes. Here, we tested three different proportions of borrowed words, 5%, 10%, and 20%, in order to compare different degrees of contact. In a final step, we then split the resulting wordlist into a training and a test set (reserving 80% of the data for training and 20% for testing) and ran the three methods for monolingual borrowing detection, Bag of Sounds, Markov Model, and Neural Network. The results of this experiment are given in Table 3, where the borrowing detection results are provided in the form of precision, recall, and F1 scores for the three different borrowing rates. Fig 2 presents plots for the 5% and 10% borrowing rates. Accuracy results, not shown, were all above 0.95 and varied little over methods and rates. Individual results indicating the scores achieved by method and borrowing rate for each language are provided in S1 Table.
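A minimal sketch of this three-step seeding procedure (the function name, the interpretation of the borrowing rate relative to the retained inherited words, and the fixed random seed are our assumptions for illustration, not the paper’s exact implementation):

```python
import random

def seed_borrowings(wordlist, donor_words, rate, seed=42):
    """wordlist: list of (segments, borrowed_flag) pairs.
    Step 1: drop attested borrowings; step 2: add donor words as
    simulated borrowings at the given rate (assumed here to be
    relative to the retained inherited words); step 3: shuffle and
    return an 80/20 train/test split."""
    rng = random.Random(seed)
    inherited = [(w, False) for w, b in wordlist if not b]
    n_seed = round(len(inherited) * rate)
    seeded = [(w, True) for w in rng.sample(donor_words, n_seed)]
    data = inherited + seeded
    rng.shuffle(data)
    cut = int(0.8 * len(data))
    return data[:cut], data[cut:]
```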
Table 3

Borrowing detection results for artificially seeded borrowings.

Method           Rate%   Prec.   Recall   F1
Bag of Sounds    5       0.80    1.00     0.88
Markov Model     5       0.96    0.67     0.76
Neural Network   5       0.97    0.84     0.90
Bag of Sounds    10      0.87    0.99     0.92
Markov Model     10      0.96    0.87     0.91
Neural Network   10      0.97    0.93     0.95
Bag of Sounds    20      0.91    0.99     0.94
Markov Model     20      0.97    0.94     0.95
Neural Network   20      0.99    0.97     0.98

Results averaged over all languages for each method and borrowing rate.

Fig 2

Borrowing detection results for artificially seeded borrowings.

(A) 5% borrowing rate. (B) 10% borrowing rate.

As can be seen from the results, all methods perform well when artificially seeded borrowings amount to 20%. With a borrowing rate of 10%, all methods still achieve F1 scores of more than 0.90, with the Bag of Sounds showing the lowest precision and the Markov Model showing the lowest recall. When borrowings only amount to 5%, we can observe the same trend of low precision for the Bag of Sounds and low recall for the Markov Model. However, while the Bag of Sounds still comes close to the performance of the Neural Network with respect to the F1 score (0.88 vs. 0.90), the Markov Model shows a drastic drop here, resulting from the dramatic loss in recall (0.67).

Borrowing detection on real language data

Our experiment on artificially seeded borrowings simulated an ideal situation of language contact in which new words were recently introduced into a given language without being adjusted to the recipient language’s target phonology. While this experiment provided high scores in our evaluation, it does not allow us to estimate how well the three borrowing detection methods will perform when exposed to “real” data. For this reason, we designed a second experiment on the WOLD data in their original form. Given that the wordlists are quite small, while the Markov Model and Neural Network language models in particular tend to require larger amounts of data, we used cross validation techniques, in which the data are repeatedly partitioned into training and test data and evaluation results are measured for each trial and later summarized. We employed ten-fold cross validation for this experiment, where each wordlist was partitioned into 10 parts, and over 10 successive trials, one part was successively designated the test set while the remaining nine parts were designated the training set. This resulted in 10 separate estimates of borrowing detection performance, with each word appearing once in test sets and nine times in training sets. Table 4 shows the averages and standard deviations of the results (precision, recall, F1 score, accuracy) of this experiment for each of our three methods. Fig 3 summarizes the averaged results. Individual results indicating the scores achieved by method for each language are provided in S2 Table.
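The ten-fold partitioning can be sketched generically as follows (a generic k-fold splitter written for illustration, not the PyBor implementation):

```python
import random

def k_fold(items, k=10, seed=42):
    """Yield (train, test) pairs so that each item appears in exactly
    one test fold and in k-1 training folds."""
    data = list(items)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```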
Table 4

Borrowing detection results of the cross validation experiment.

Method           Statistic     Prec.   Recall   F1      Acc.
Bag of Sounds    Mean          0.286   0.578    0.349   0.843
                 Language SD   0.250   0.287    0.268   0.081
                 Pooled SD     0.078   0.226    0.088   0.030
Markov Model     Mean          0.678   0.521    0.578   0.828
                 Language SD   0.136   0.181    0.170   0.060
                 Pooled SD     0.114   0.088    0.082   0.034
Neural Network   Mean          0.697   0.546    0.603   0.844
                 Language SD   0.164   0.191    0.181   0.062
                 Pooled SD     0.100   0.082    0.072   0.030

Mean and standard deviation over languages, and pooled standard deviation within languages for each method over all languages.

Fig 3

Results of the cross validation experiment.

Averaged for each model over all languages in our sample.

As can be seen from the table and the figure, the Neural Network marginally outperforms the Markov Model, while both clearly outperform the Bag of Sounds. The strength of the entropy-based methods lies in their high precision, while the Bag of Sounds shows the highest recall but an extremely low precision. When examining the individual results achieved by each method for each individual language in our sample, one finds considerable variation, ranging from results which one may consider satisfying (such as the performance of the Neural Network on Zinacantán Tzotzil, with an F1 score of 0.81) down to extremely poor results (such as the performance of all methods on Mandarin Chinese, with F1 scores below 0.02). The reasons for the underwhelming results on Mandarin Chinese are twofold. On the one hand, the language barely borrows words directly, but rather resorts to loan translation, by which new concepts are rendered with the help of the lexical material of the target language. As a result, Mandarin has the lowest number of direct borrowings in our sample. On the other hand, Mandarin Chinese (as well as all Chinese dialects and many languages from Southeast Asia) has an extremely restricted syllable structure that makes it impossible to render most foreign words faithfully [55]. As a result, words are usually adjusted directly to Chinese phonotactics when being borrowed and are also written with existing Chinese characters, which further masks their foreign origin [56]. However, this very specific situation also makes it difficult, if not impossible, for most Mandarin Chinese speakers to identify borrowings when considering phonotactic criteria alone.

Factors that influence borrowing detection

Given that the performance of our supervised borrowing detection methods varied substantially, from poor (F1 scores below 0.5) through average (F1 scores between 0.5 and 0.8) to acceptable (F1 scores above 0.8), we performed analyses to assess to which degree certain factors might influence the borrowing detection methods. Concretely, we computed specific characteristics of each language variety in our sample and then checked to which degree these characteristics correlated with test performance. As characteristics, we chose the proportion of borrowed words in a given wordlist, since statistical and machine learning methods perform better with sufficient representation, and the proportions of unique sounds in borrowed words and in inherited words, as potential contributors to prediction performance. A higher proportion of borrowed words corresponded moderately to a lower proportion of unique sounds in inherited words; otherwise, the characteristics were independent. Statistical analyses (correlation study, matrix plots, and regression) were performed with Minitab® Statistical Software [57]. The correlation results, based on all wordlists in our sample taken from the WOLD database, are reported in Table 5 and accompanied by detailed plots in Figs 4, 5 and 6. We focus on the strength of the relationship between characteristics and borrowing detection rather than on statistical significance.
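The three characteristics can be computed directly from an annotated wordlist. A minimal sketch, assuming each word comes as a list of sound segments plus a borrowed/inherited flag, and assuming the sound proportions are taken relative to the language's full sound inventory (the denominator is not stated explicitly in the text, so this is our assumption):

```python
def wordlist_characteristics(words):
    """Characteristics used in the correlation study (our sketch).

    words: list of (segments, is_borrowed) pairs, where segments is a
    list of sound tokens (e.g. IPA segments).

    Returns: (proportion of borrowed words,
              proportion of sounds occurring only in borrowed words,
              proportion of sounds occurring only in inherited words),
    with the sound proportions relative to the full inventory."""
    borrowed = [segs for segs, flag in words if flag]
    inherited = [segs for segs, flag in words if not flag]
    b_sounds = {s for segs in borrowed for s in segs}
    i_sounds = {s for segs in inherited for s in segs}
    inventory = b_sounds | i_sounds
    return (len(borrowed) / len(words),
            len(b_sounds - i_sounds) / len(inventory),
            len(i_sounds - b_sounds) / len(inventory))
```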
Table 5

Correlations between phonological characteristics and performance of borrowing detection methods.

Proportion of      Bag of Sounds              Markov Model               Neural Network
                   Prec.   Recall  F1         Prec.   Recall  F1         Prec.   Recall  F1
Borrowed words     0.584   0.337   0.539      0.387   0.736   0.654      0.399   0.690   0.600
Borrowed sounds    0.185   0.345   0.199      0.345   0.274   0.297      0.377   0.268   0.301
Inherited sounds  -0.006  -0.010  -0.004      0.035  -0.330  -0.263     -0.075  -0.178  -0.148

All correlations with |r| ≥ 0.33 are significant at p < 0.05.

Fig 4

Determining characteristics that influence the performance of the Bag of Sounds.

Fig 5

Determining characteristics that influence the performance of the Markov Model.

Fig 6

Determining characteristics that influence the performance of the neural network.

As can be seen from the correlations and the plots, there is a moderately strong to strong positive correlation between the proportion of borrowed words and the evaluation scores for all tests. The effect of the proportion of borrowed words appears non-linear for the entropy methods: with less than 5% borrowings, borrowing detection is much worse than the linear correlation in Figs 5 and 6 would suggest. For the other factors, the proportion of sounds occurring exclusively in borrowed words and the proportion of sounds occurring exclusively in inherited words, the results are less clear. While we observe a moderate correlation between the proportion of exclusively borrowed sounds and the recall of the Bag of Sounds, there is an equal or higher correlation with the precision of the other two methods. To further investigate the influence of the three factors on borrowing detection performance, we fit a multiple regression model to them. Our major goal was to check whether the exclusively borrowed and exclusively inherited sound proportions can help explain the methods' performance beyond the overall proportion of borrowed words in each wordlist. We fit a second-order regression model to predict F1 scores from our three characteristics, using Minitab's forward information criteria for model selection. Regression results are reported in Table 6. Almost 50% of the variability in performance is explained for the Markov Model and the Neural Network. In both cases, the proportion of borrowed words and the proportion of exclusively borrowed sounds contribute strongly to F1 performance. For the Neural Network, exclusively inherited sounds also have a minor impact on F1 performance. The negative coefficients for the squared proportion of borrowed words and the squared proportion of exclusively borrowed sounds serve to flatten out F1 performance at 0.8 and higher.
Almost 30% of the variability in performance is explained for the Bag of Sounds, most of it due to the proportion of borrowed words, with a minor impact from exclusively inherited sounds.
Table 6

Regression analysis on phonological characteristics that influence borrowing detection F1 scores.

Method           Regression model                                                 pred-R²
Bag of Sounds    F1 = −0.040 + 1.53 bw + 0.76 is                                  29.9%
Markov Model     F1 = 0.141 + 2.66 bw + 2.05 bs − 3.38 bw² − 5.05 bs²             48.8%
Neural Network   F1 = 0.032 + 3.12 bw + 2.43 bs + 0.43 is − 3.93 bw² − 6.35 bs²   49.9%

bw = proportion of borrowed words; bs = proportion of sounds occurring exclusively in borrowed words; is = proportion of sounds occurring exclusively in inherited words.
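To make the fitted equations concrete, the Markov Model row of Table 6 can be evaluated directly. A minimal sketch (the function name is ours; bw and bs are the proportions of borrowed words and of exclusively borrowed sounds):

```python
def predict_f1_markov(bw: float, bs: float) -> float:
    """Predicted F1 for the Markov Model, from the second-order
    regression in Table 6.  bw = proportion of borrowed words,
    bs = proportion of sounds occurring only in borrowed words.
    (Function name is ours, not the authors'.)"""
    return 0.141 + 2.66 * bw + 2.05 * bs - 3.38 * bw ** 2 - 5.05 * bs ** 2

# A wordlist with 25% borrowings and 10% exclusively borrowed sounds:
print(round(predict_f1_markov(0.25, 0.10), 3))  # 0.749
```

The negative quadratic terms are what flatten the predicted F1 as the two proportions grow, matching the plateau described in the text.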

Detecting borrowings from a single donor language

Testing our lexical language models on the WOLD data in their entirety could be considered unfair to the methods, given that we know well that monolingual phonotactic evidence for borrowing may easily get lost, and that the WOLD database was never restricted to recent borrowings alone. Another problem of the data is that the distinction between inherited words on the one hand and borrowings on the other is itself a simplifying assumption, since we know that in intensive contact situations borrowings come from a specific donor language. As a result, it seems justified to test the three methods for monolingual borrowing detection with more specific experiments, in which the task consists in detecting borrowings when there is a single or dominant donor language, as in intensive contact situations, versus the case when no donor language dominates. To test whether our methods show improved performance when there is a dominant donor language, as opposed to detecting borrowed words per se, we first created two subsets of the WOLD database, one containing languages with 300 or more borrowed words (17 language varieties), and one containing languages with 100 or more borrowed words (37 language varieties). We then searched for "dominant donor languages" in all wordlists in each sample, with dominant donor languages being defined as those donor languages (as identified in the WOLD database) that account for two-thirds of all borrowings identified for a given language variety. For our sample of language varieties with 300 or more borrowings, this yielded a partition into 8 language varieties for which a dominant donor could be identified and 9 for which none could be found. For the sample of language varieties with 100 or more borrowings, the partition yielded 20 language varieties with a dominant donor and 17 without.
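The partitioning criterion can be sketched as follows, assuming donor annotations are available as a flat list of donor-language names, one per borrowed word (the function name and data layout are ours, not the authors'):

```python
from collections import Counter

def dominant_donor(donors, threshold=2 / 3):
    """Return the dominant donor language of a wordlist, if any.

    donors: list of donor-language names, one entry per borrowed word
    (as annotated in WOLD).  A donor is 'dominant' if it accounts for
    at least `threshold` of all borrowings; otherwise return None."""
    if not donors:
        return None
    donor, count = Counter(donors).most_common(1)[0]
    return donor if count / len(donors) >= threshold else None

print(dominant_donor(["Spanish"] * 80 + ["Quechua"] * 20))  # Spanish
print(dominant_donor(["French"] * 50 + ["Norse"] * 50))     # None
```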
We were able to reuse the results of the 10-fold cross-validation study, previously applied to all language varieties in the WOLD database, for these two subsets of the data. To test whether the observed differences between the dominant donor and no dominant donor categories were significant, we also performed randomization resampling tests of 5,000 iterations each, using Student's independent t statistic with unequal variances as our test statistic. We report p-values from the empirical distribution of t statistics calculated under the hypothesis of no difference due to a dominant donor, i.e., that the dominant and no dominant categories are exchangeable. As can be seen from the results in Table 7, the performance of all borrowing detection methods improves when the vast majority of the borrowings come from a single donor language. The performance also improves, as we saw previously, with more borrowed words. While performing worse than the other two methods, the Bag of Sounds method shows a strong increase in performance, mostly owed to a strong increase in precision, when most borrowings come from a single donor language.
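The randomization test just described can be sketched as a permutation of group labels. This is our re-implementation of the stated procedure, not the authors' code; we assume a one-sided comparison on Welch's t statistic:

```python
import random
import statistics

def welch_t(a, b):
    """Student's independent t statistic with unequal variances
    (Welch's t) for two samples of per-language scores."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (
        (va / len(a) + vb / len(b)) ** 0.5)

def permutation_p(a, b, iterations=5000, seed=42):
    """Randomization p-value under the hypothesis that the two groups
    (dominant / no dominant donor) are exchangeable: reshuffle the
    pooled scores, recompute t, and count statistics at least as
    extreme as the observed one (one-sided)."""
    rng = random.Random(seed)
    observed = welch_t(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        if welch_t(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return hits / iterations
```

With clearly separated groups of F1 scores, the returned p-value is small; with overlapping groups, it approaches the nominal level expected by chance.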
Table 7

10-fold cross validation—Dominant versus no dominant donor.

Borrowed  Method          Donor              Precision  p<       Recall  p<       F1      p<
≥ 300     Bag of Sounds   Dominant (8)       0.536      .0300    0.739   .0200    0.588   .0400
                          No dominant (9)    0.308               0.672            0.390
          Markov Model    Dominant           0.785      .0030    0.722   .0020    0.749   .0030
                          No dominant        0.672               0.585            0.622
          Neural Network  Dominant           0.810      .0002    0.722   .0070    0.760   .0030
                          No dominant        0.690               0.606            0.642
≥ 100     Bag of Sounds   Dominant (20)      0.418      .0030    0.737   .0020    0.490   .0010
                          No dominant (17)   0.192               0.498            0.252
          Markov Model    Dominant           0.762      .0002    0.600   .0300    0.661   .0060
                          No dominant        0.639               0.505            0.558
          Neural Network  Dominant           0.787      .0002    0.619   .0200    0.685   .0060
                          No dominant        0.655               0.523            0.567

Comparing entropy distributions

The Markov Model and Neural Network methods estimate word entropy on a per-sound basis, given the inherited or borrowed words on which they are trained. Models trained on inherited words should estimate lower entropies for inherited words, and models trained on borrowed words should estimate lower entropies for borrowed words. However, since words are borrowed over time, and potentially from various donor languages, a single language model for borrowed words is not always optimal. Our decision procedure for the Markov Model and Neural Network methods compares, for a given word, the competing entropies assigned by the lexical language model derived from inherited words and by the lexical language model derived from borrowed words. If the difference between the entropies is greater than zero, we designate the word as borrowed; if it is smaller than or equal to zero, we designate the word as inherited. To investigate the discriminative force of this procedure, it is useful to compare the entropy difference distributions of inherited and borrowed words for a given language variety. The distributions for training and test data from the English wordlist in the WOLD database are shown in Fig 7. While there is a certain overlap between the entropy difference distributions for inherited and borrowed words, the problem of discriminating between them based on entropy differences seems tractable, and we can assume that improvements in entropy estimation would have an immediate benefit for prediction.
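The decision procedure reduces to a sign test on the entropy difference. A minimal sketch, where we assume the difference is the inherited-model entropy minus the borrowed-model entropy (so that a positive difference means the borrowed-word model fits better; the function names are ours):

```python
def classify(word, h_inherited, h_borrowed):
    """Sketch of the entropy-difference decision rule.

    h_inherited and h_borrowed are callables returning the word's
    per-sound entropy under the lexical language model trained on
    inherited words and on borrowed words, respectively.  We assume
    the 'entropy difference' is inherited-model entropy minus
    borrowed-model entropy; ties go to 'inherited', as in the text."""
    delta = h_inherited(word) - h_borrowed(word)
    return "borrowed" if delta > 0 else "inherited"
```

Plotting these deltas separately for attested inherited and borrowed words yields exactly the difference distributions discussed for Figs 7, 8 and 9.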
Fig 7

Distribution of entropy differences for English—Neural Network method.

(A) Training (85%) entropy deltas. (B) Testing (15%) entropy deltas.

Since both the Markov Model and the Neural Network performed quite well on Imbabura Quechua, a Quechua language spoken in Ecuador, with F1 scores above 0.8, it is not surprising that we find a good separation between the entropy difference distributions for inherited and borrowed words, as shown in Fig 8.
Fig 8

Distribution of entropy differences for Imbabura Quechua—Neural Network method.

(A) Training (85%) entropy deltas. (B) Testing (15%) entropy deltas.

Neither method performed very well on Oroqen, a Northern Tungusic language spoken in Inner Mongolia in the People's Republic of China, with F1 scores below 0.36. Consequently, as can be seen in Fig 9, the entropy difference distributions for inherited and borrowed words are not well separated.
Fig 9

Distribution of entropy differences for Oroqen—Neural Network method.

(A) Training (85%) entropy deltas. (B) Testing (15%) entropy deltas.

This strong relationship between the distribution of entropy differences and borrowing detection indicates a tactic for improving monolingual lexical borrowing detection: increase the separation of the difference distributions for inherited versus borrowed words. An examination of our sample cases reveals that (1) English and Imbabura Quechua, despite substantial borrowings, show reduced separation between the inherited and borrowed word difference distributions at test time, resulting in reduced discriminative power, and (2) Oroqen, with few borrowings, shows almost no separation between the distributions at test time, resulting in little discriminative power. Identifying these problems opens the way to solving them, for instance through improved training controls for the Neural Networks, or by obtaining more borrowings, real or simulated, for training.

Discussion

Artificially seeded borrowings

In our artificially seeded borrowings experiment, we simulated very close, intensive, and recent language contact in which borrowed words were transferred without alteration. All methods performed well when the proportion of artificially borrowed words was high, and degraded in different ways as the proportion decreased. While the Bag of Sounds outperformed the other two methods in recall, the Markov Model and, especially, the Neural Network outperformed the Bag of Sounds in precision. Since the core strategy of the Bag of Sounds lexical language model is to identify borrowed words by their specific sounds, while the order of sounds is ignored, it is not surprising that the method performs better at identifying artificially seeded borrowings, i.e., achieves better recall: the direct transfer of words from one wordlist to another, as done in our experiment, will always introduce a number of sounds that were not present in the recipient wordlist prior to the transfer. In contrast, taking the order of sounds into account gives the Markov Model and the Neural Network a tremendous advantage at ruling out unseen sound sequences in borrowed words, resulting in uniformly high precision. In our real language borrowing detection experiment, we performed a 10-fold cross-validation of borrowing detection across unaltered WOLD wordlists. Here, the Neural Network performed marginally better than the Markov Model, and both performed much better than the Bag of Sounds method. A major factor favoring the Neural Network seems to be that it includes conditional dependencies on all previous sound segments without having to explicitly estimate numerous extra parameters for this dependency. As in the previous experiment, the Bag of Sounds method showed a high recall but suffered from a low precision.
So while the Bag of Sounds flags a considerable number of words as borrowings, it does not always pick the right ones and shows a rather high rate of false positives, as can be seen from its low precision. In contrast, the Markov Model and Neural Network methods show a lower recall but a much higher precision; they are therefore much more conservative than the Bag of Sounds method. When the overall proportion of borrowed words in a wordlist is small, all models perform poorly. This is not surprising, since low borrowing proportions make it difficult to learn the phonotactics or phonology of borrowed words, if these can be identified at all. It is also not clear to which degree trained linguists, given only monolingual information, would be able to identify borrowed words in the respective languages, even less so over entire wordlists rather than just recent borrowings.

Factors influencing borrowing detection

Given the disappointing results with real language data, we tried to determine the major factors that influence the performance of borrowing detection methods. Besides the proportion of borrowings, we expected that the proportions of sounds occurring exclusively in borrowed words and exclusively in inherited words might impact borrowing detection, especially for the Bag of Sounds method, which explicitly deals with sounds while ignoring phonotactic aspects. While the effect of the proportion of borrowed words was remarkable, showing a strong linear increase in performance for all methods once the proportion of borrowed words reached 5% or more, the impact of the proportions of exclusively borrowed and exclusively inherited sounds surprised us. Based on the regression model, an increased frequency of sounds occurring exclusively in borrowed words improved the performance of both the Markov Model and Neural Network methods, but not of the Bag of Sounds method, while the frequency of exclusively inherited sounds had little impact on any of the methods. It seems that modeling phonotactics with the Markov Model and Neural Network methods also exploits the simple occurrence of borrowed sounds in words. With respect to the Bag of Sounds method, we may have overestimated the evidence such sounds provide: even if a given language has many sounds occurring exclusively in borrowed words, this does not mean that these sounds occur in each and every borrowed word. Thus, while the presence of specific sounds may be a powerful indicator of a borrowed or an inherited word, this evidence may be too sparse in comparison with the full lexicon of a given language. Since we create lexical language models for borrowed and inherited words, it is natural to ask why our basic approach treats all borrowed words as if they came from a single donor language.
While it may hold for specific contact situations that a given language is heavily influenced by one single, dominant donor language, it is also possible that borrowings form distinct layers in the lexicon of a given language, reflecting borrowings from different donor languages at different times. If the majority of the borrowings attested in a given language stem from a single donor, however, we would expect our lexical language model approaches to monolingual borrowing detection to perform better, since the donor language, accessed through the recipient language, would provide a much more coherent and consistent picture than a mix of words from different donor languages. We therefore systematically tested whether the performance of our methods would increase for those wordlists in our sample for which a dominant donor language could be identified. Our assumption that the methods should show increased performance for languages with a dominant donor language was largely confirmed, as reflected in substantially increased F1 scores of ≈0.75 for the Markov Model and Neural Network methods in cases of high contact with more than 300 borrowings. While we still consider the overall performance of monolingual borrowing detection disappointing, this experiment shows the importance of having a consistent sample of the donor language when dealing with monolingual borrowing detection. Our final evaluation was intended to demonstrate how the Markov Model and Neural Network methods discriminate between inherited and borrowed words. We showed how plots of the distribution of entropy differences between the competing inherited and borrowed word models served to explain borrowing detection results. When comparing the distributions of entropy differences, we found that in those cases where the proportion of borrowings was small, the discriminative force of the word entropy differences dropped drastically at test time.
Even when the proportion of borrowings seemed adequate for training, we saw a reduction in discriminative force at test time due to reduced separation of the inherited and borrowed word entropy difference distributions. This provides additional evidence that monolingual borrowing detection depends heavily on the presence of a large enough proportion of borrowed words, and also that modest improvements might be possible with improved training controls.

Conclusion

We presented three supervised methods, Bag of Sounds, Markov Models, and Neural Networks, for the detection of borrowings in monolingual wordlists. These methods are based on lexical language models and are intended to model specific aspects of phonology and phonotactics in the lexicon of spoken languages. Assuming that the phonological and phonotactic properties of words in the lexicon of a spoken language can provide enough clues to identify borrowings by language-internal comparison of words alone, we designed workflows in which the lexical language models are trained on monolingual wordlists with annotated borrowings and then used to detect borrowings among previously unobserved words. While tests on artificially seeded borrowings showed promising results, tests on real wordlists taken from the WOLD database revealed a rather disappointing performance for all three methods. Subsequent attempts to identify the potential reasons for this mediocre performance revealed two main factors that considerably influence how well the methods perform, namely (1) the amount of borrowings in a given language variety, and (2) the uniformity of the borrowings in a given language variety, as reflected in the presence of a dominant donor language. While the first factor reflects the importance of having enough training data when working in supervised learning frameworks, the second reflects more specific linguistic conditions of monolingual borrowing detection. Our methods identify borrowings primarily from phonological and phonotactic clues, and perform better in those cases where the words' properties are coherent and consistent. This is generally the case for inherited words, and also for words that were borrowed uniformly from the same donor language.
While our results do not recommend any of the three methods presented here as a replacement for previously proposed methods for borrowing detection, we believe that they offer a valuable and promising baseline for the further exploration of monolingual approaches to borrowing detection. We are even convinced that they may be useful in some concrete applications. Given that our methods rely heavily on sufficiently large samples of training data, they may be useful especially for studies in which borrowed words or sentences need to be identified in large amounts of data, preferably in situations where the borrowings are comparatively recent. Here, in particular, larger linguistic corpora could be analyzed and tagged for inherited and borrowed words. But even if future research shows that our attempts to model phonology and phonotactics with language models in a supervised framework for monolingual borrowing detection cannot be improved any further, we consider it worthwhile to share our results along with the software and the data we used to create them, since this may, in the worst case, save colleagues who want to test the same idea some precious time. Additionally, given that no single method for borrowing detection proposed so far exhibits satisfactory performance, our methods add to the growing pool of automated approaches to borrowing detection, which could ideally be combined later into an integrated workflow in which evidence from multilingual sources forms a unified picture of language contact.

Detection results by language for seeded borrowings.

Borrowing rates of 5%, 10%, and 20%.

Ten-fold cross validation of detection results by language for WOLD wordlists.

Cross-validation means and standard deviations. (PDF) Click here for additional data file. 1 Oct 2020 PONE-D-20-27304 Using lexical language models to detect borrowings in monolingual wordlists PLOS ONE Dear Dr. Miller, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. All three reviewers have some good suggestions that you should take into account. For Reviewer 1 it was a stumbling block that the paper seems to try to model a native speaker's ability to identify loanwords. Given that it doesn't do a good job at that the verdict was a rejection. But it seems that the paper is really about the extent to which it is possible for a computer to identify loanwords given information about the target language only. If you clearly spell out that focus and downplay the importance of discussion about what native speakers can and cannot then you might avoid some confusion. Please submit your revised manuscript by Nov 15 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. 
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols We look forward to receiving your revised manuscript. Kind regards, Søren Wichmann, PhD Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Please note that in order to use the direct billing option the corresponding author must be affiliated with the chosen institute. Please either amend your manuscript or remove this option (via Edit Submission). 3. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Table 5, 7, 8, 9 and 10 in your text; if accepted, production will need this reference to link the reader to the Table. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? 
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes Reviewer #3: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: No Reviewer #2: Yes Reviewer #3: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. 
(Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The article is a rather mechanical application of several machine-learning algorithms, two of them severely outdated and none of them state-of-the-art, to what is essentially a non-task. The motivation for the experiments---that speakers of different languages (cited publications refer to Russian and Korean in particular) are good at identifying borrowings---is way too slim. Firstly, this only applies to very recent borrowings (Russian, just like English, is saturated with older borrowings, which are undetectable by native speakers). Secondly, this does not provide a cross-lingual baseline against which to compare the performance of ML algorithms. Thirdly, as the authors point out themselves, this is not how borrowings are detected in the historical-comparative literature (where the principle of irregular sound correspondences is the only one of real standing; of course, this principle is rather hard to automate because one has to establish the correspondences manually to begin with). It is hard to grasp what exactly the study is trying to show. It could have been construed as an attempt to model discriminative abilities of native speakers, in which case the focus should have been on how the neural net discriminates between native and borrowed words (cf. the abundant recent literature on what BERT might know about syntax, etc.). Instead all the models are treated as black boxes, and the analysis boils down to identifying situations in which they perform better or worse. This may have been of interest had the proposed method been of practical use. 
Some possible applications are listed in the conclusion ("studies in which borrowed words or sentences need to be identified in large amounts of data"; "[work] on code switching, where multilingual language users switch between different varieties based on sociolinguistic contexts"); however, the fact that the proposed methodology can help there itself needs to be tested against appropriate baselines and competing approaches. Reviewer #2: The authors introduced a new approach to automatic loanword detection in the field of computational historical linguistics (CHL). While most of the existing methods aim at identifying loanwords in multilingual wordlists, the attempt of this paper is to identify borrowings using a monolingual approach. The authors use the WOLD database, which is the only database containing loanword information along with information about the donor language and loaned status. What I really liked about the work is that the three methods build on another: the Bag of Sound model using an SVM is the simplest model, integrating only the phonology of the words without considering the order and frequency of the sounds; the Markov Model is a tri-gram model relying on the two previous sounds in the word; the recurrent neural network also relies on the phonotactics of the word, taking all previous sounds into account. The authors perform two experiments, one on artificially seeded borrowings and one on the “real” WOLD data. Since the promising results on the simulated data could not be obtained by the experiments using the WOLD data, the authors made additional experiments in order to explain the performance of the methods, which was achieved. Although, the results and the performance are not satisfying, the three introduced methods along with the lexical language models open a new perspective for further research, especially the recurrent neural network, serves as basis for improvements and further explorations in the field of automatic loanword detection. 
Two things I really liked about the manuscript are the detailed explanation of the methods and the representation of the data, which help the reader gain clear insights into the evaluation of the different methods. The statistical evaluations are carried out in a rigorous way and explained in detail, with explanations offered for the performance and the results. The authors made the effort to perform additional analyses to explain the performance of the methods in the two experiments. The process of nativization plays a crucial role in the motivation of the approach introduced in the first chapters; however, it is not revisited in more detail in the conclusion. The detection of loanwords using the proposed methods depends highly on the data and on the degree to which a word has been adapted in the recipient language. The methods might not identify older loanwords or words loaned from related languages, which show no clear differences in phonology. This issue is not discussed in detail in the conclusion, but it could be one of the reasons for the poor performance of the methods. In addition, the automatically derived IPA transcriptions of the words from the WOLD database could introduce noise into the analyses, depending on the correctness of the transcriptions. However, since no other database is available that contains loanword annotations along with the donor language and loanword status, compromises need to be made. The data is completely available online. Within the data, the additional German wordlist used for the artificially seeded borrowings is not identifiable at first sight. I would encourage the authors to provide the list in a format like CSV and make it easy to find in the Python package. Additionally, as a small note, on p. 6 the authors write that the German wordlist and the software package are available on GitHub. However, everything is uploaded on osf.io. For consistency the authors could correct this.
Spelling comment: I encourage the authors to check the formula for the softmax activation on p. 9. To my understanding, a comma is missing in the brackets of the formula. Reviewer #3: Summary: This manuscript introduces three methods to identify borrowed words based on monolingual wordlists. The authors evaluate the performance of these methods by setting up various scenarios. Although the methods work well in the case of artificially seeded data, their performance is not entirely satisfying. Nevertheless, the authors point out that a high proportion of borrowings and the existence of a dominant donor language are beneficial to the task. Also, a more promising method for borrowing detection is recommended for future study. Strengths: This paper uses a dataset covering a large number of languages, so that we can learn which characteristics of languages are suitable for the examined methods. It is interesting to see that phonological and phonetic features are applied to build up the lexical models. In terms of the Markov model and the neural network model, it is also interesting to see the use of the entropy difference to classify borrowed words. The authors provide useful suggestions on future work and extract meaningful information despite the unsatisfying results of the models. Weaknesses: Generally, the structure of the paper is fine, but sometimes I was surprised to see content that does not belong in a section. I also saw some duplicated and redundant information. Besides, it would be better to clearly indicate the meaning of the notation in the formulas. Below are specific comments referring to specific sections. --Abstract You mention all the necessary information in the abstract. However, it is not always clear, and I had to spend some time looking for the information I needed. For instance, what you did and the results of this study are not clearly specified.
In my opinion, using phrases like "in this study, we did this…" and "the results show that…" could help the reader catch the necessary information more easily and quickly. --Introduction You mention a lot in the introduction about how well a native speaker can identify borrowings in his/her own language, and it seems that this is related to your motivation for the study. But you do not address this later in the rest of the paper. Hence, I am confused about the actual motivation or the problem you try to address in the paper. Also, it is a little weird that you include results and some discussion content at the end of the introduction, making this part read like an abstract. --Materials It would be easier to understand the WOLD dataset if you introduced more about the data format and showed some examples. I had to search for WOLD online to understand what the data looks like. Meanwhile, adding some phonetic transcription examples would be more interesting and make it easier to understand what you did. Line 134, citation [36]: there is a typo in the reference for this citation on page 32/39. --Markov Model Page 8/39: it might be better to explain the notation. I had to spend some time guessing what it meant. The same applies to page 9. Line 202: any reference for this statement? Lines 232 to 238: maybe you want to sum up the three methods, but it seems a little redundant, as they have been introduced previously. Try to re-organise this and avoid duplicated content. --Results In general, you write the results for each experiment separately, which is clear and good. However, at the beginning of each section you introduce the details of each experiment, and the introduction of the experiments should not be part of the results section. The introduction of the experiments should be somewhere else, as I only expect actual results such as numbers, tables, figures, and the relevant explanation of them in this section.
Another option is to write up each experiment completely separately, meaning that you present the introduction, results, and discussion of an experiment together. The structure of the whole paper would then be clearer and more logical. Lines 351 to 353: is there any reason you chose these three characteristics? Besides, have you considered whether they are independent of each other? Page 17/39, Table 4: are all these correlations significant? Maybe you can also include p-values. Page 18/39, Table 5: it is a nice table showing the regression models' R-squared values. But it would be better to bring in the information from the table when you discuss the influence of the three factors, instead of just putting it here. --Discussion You had to introduce each experiment again in this section, which is redundant. As I suggest above for the results section, write about the experiments one by one; then you do not have to describe the experiments again and again. Lines 502-507: maybe you can try the frequency of these unique sounds. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: Yes: Liqin Zhang [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
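The entropy-difference decision procedure that Reviewer #3 highlights (dual models trained on inherited and borrowed words, with a word labeled "borrowed" when the borrowed-word model assigns it the lower cross-entropy) can be sketched under simplifying assumptions: bigram counts with add-one smoothing stand in for the paper's trigram Markov models, and all segmented word forms below are invented for illustration.

```python
# Hedged sketch of an entropy-difference classifier: not the authors'
# code. Two bigram models are trained, one on inherited and one on
# borrowed words; a word is classified by which model "fits" it better.
import math
from collections import Counter

def train_bigrams(words):
    """Bigram and context counts over space-separated sound segments."""
    counts, context = Counter(), Counter()
    for w in words:
        segs = ["#"] + w.split() + ["#"]          # word-boundary symbol
        for a, b in zip(segs, segs[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context

def cross_entropy(word, model, vocab_size):
    """Average bits per transition under add-one-smoothed bigrams."""
    counts, context = model
    segs = ["#"] + word.split() + ["#"]
    bits = 0.0
    for a, b in zip(segs, segs[1:]):
        p = (counts[(a, b)] + 1) / (context[a] + vocab_size)
        bits += -math.log2(p)
    return bits / (len(segs) - 1)

inherited = ["h a u s", "b a u m", "m a u s"]
borrowed = ["t͡ʃ a t", "d͡ʒ o b", "t͡ʃ i p"]
vocab = {s for w in inherited + borrowed for s in w.split()} | {"#"}

m_inh = train_bigrams(inherited)
m_bor = train_bigrams(borrowed)

def is_borrowed(word):
    """Label 'borrowed' when the borrowed model gives lower entropy."""
    return cross_entropy(word, m_bor, len(vocab)) < cross_entropy(word, m_inh, len(vocab))
```

With such toy data, a word built from segments typical of the borrowed sample (e.g. the affricates) falls closer to the borrowed model, while a word with only native segments does not.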
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 7 Nov 2020 We have responded to editor and reviewer comments in the response-to-reviewers document included in the attached files step, with file name RebuttalLetterPONE-D-20-27304-final.pdf. We are copying the letter contents here: Response to Reviewers (PONE-D-20-27304) Comments by the Editor All three reviewers have some good suggestions that you should take into account. For Reviewer 1 it was a stumbling block that the paper seems to try to model a native speaker's ability to identify loanwords. Given that it doesn't do a good job at that, the verdict was a rejection. But it seems that the paper is really about the extent to which it is possible for a computer to identify loanwords given information about the target language only. If you clearly spell out that focus and downplay the importance of the discussion about what native speakers can and cannot do, then you might avoid some confusion. We thank the editor for the very helpful review process and the thorough selection of reviewers. We have now tried to modify the draft consistently, especially taking the global points raised by the reviewers into account. Our modifications are specifically reflected in: 1. No longer talking about native speakers as our inspiration and desire for modeling, as we find that the clues we use are also used by traditional linguists who try to detect borrowings. 2.
Re-structuring the presentation of the results, following mostly reviewer III’s very useful recommendations. 3. Publishing data and code openly and archiving them with Zenodo. In addition, we carried out many minor modifications, as reflected in our detailed response below, as well as in the modified text, in which we have marked all modified text blocks in blue font. Please include the following items when submitting your revised manuscript: ● A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. ● A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. ● An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. Comments by Reviewer #1 The article is a rather mechanical application of several machine-learning algorithms, two of them severely outdated and none of them state-of-the-art, to what is essentially a non-task. The motivation for the experiments---that speakers of different languages (cited publications refer to Russian and Korean in particular) are good at identifying borrowings---is way too slim. We understand from the reaction of the reviewer that our first draft has severely failed to make clear what the motivation behind this study was. In our updated version, we have now tried to clarify it. To summarize the changes (we will also provide detailed points below), our revision should make clear that: 1. Borrowing detection is a task that takes multiple pieces of evidence into account, but computational methods have so far not fully used all of the evidence considered by experts. 2. Our study proposes one way to operationalize language-internal evidence considered by experts, drawn from phonology and phonotactics. 3. 
To explore the usefulness of this evidence and our operationalization, we conduct experiments using three different approaches, which provide varying levels of complexity, as is standard in machine learning practice (one should never test only the most complex model, so as to avoid overfitting). 4. Our findings suggest that phonological and phonotactic clues for lexical borrowings are not able to provide fully satisfying results in general, especially since they are mostly useful for recent borrowings, but that they are consistent with our expectations (increasing accuracy with more borrowings and with more borrowings from a unique source). We hope that this is enough to show that we are not necessarily talking about null results here. To further emphasize this, we explicitly point out in the conclusion that we do not believe that research should be based on "good results" alone, but that all results are worth sharing in order to advance science (which is also in line with the mission of PLOS). Firstly, this only applies to very recent borrowings (Russian, just like English, is saturated with older borrowings, which are undetectable by native speakers). It should be much clearer in our updated version now that we are aware that phonotactic and phonological clues are strongest for recent borrowings. Secondly, this does not provide a cross-lingual baseline against which to compare the performance of ML algorithms. We have tried to make our point clearer by explicitly mentioning in the introduction of the new draft that borrowing is a unique process and that it is therefore difficult to generate a baseline (gold standard) for it.
Thirdly, as the authors point out themselves, this is not how borrowings are detected in the historical-comparative literature (where the principle of irregular sound correspondences is the only one of real standing; of course, this principle is rather hard to automate because one has to establish the correspondences manually to begin with). We now emphasize in the draft that classical historical linguists use phonology and phonotactics as one clue among many when it comes to the detection of borrowings, and that the majority of clues are comparative in nature, that is, they require the comparison of the language that received borrowings with potential donors. It is hard to grasp what exactly the study is trying to show. It could have been construed as an attempt to model the discriminative abilities of native speakers, in which case the focus should have been on how the neural net discriminates between native and borrowed words (cf. the abundant recent literature on what BERT might know about syntax, etc.). Instead, all the models are treated as black boxes, and the analysis boils down to identifying situations in which they perform better or worse. This may have been of interest had the proposed method been of practical use. As mentioned before, this should be clarified in the updated draft. We hope specifically that we have done a better job now at explaining why we think that our method is of practical use. Some possible applications are listed in the conclusion ("studies in which borrowed words or sentences need to be identified in large amounts of data"; "[work] on code switching, where multilingual language users switch between different varieties based on sociolinguistic contexts"); however, whether the proposed methodology can actually help there needs itself to be tested against appropriate baselines and competing approaches. We are confident that the usefulness of our method has now been more properly explained in the updated version.
We also explicitly mention now that — even if it turns out that our approach does not prove useful for any follow-up studies or that it cannot be further improved — we think it is important to share our results, since this kind of "failed research" may help those who want to try similar approaches in the future to save some precious time by building on our efforts. Comments by Reviewer #2 The authors introduce a new approach to automatic loanword detection in the field of computational historical linguistics (CHL). While most existing methods aim at identifying loanwords in multilingual wordlists, this paper attempts to identify borrowings using a monolingual approach. The authors use the WOLD database, which is the only database containing loanword information along with information about the donor language and loaned status. What I really liked about the work is that the three methods build on one another: the Bag of Sounds model using an SVM is the simplest, integrating only the phonology of the words without considering the order and frequency of the sounds; the Markov model is a tri-gram model relying on the two previous sounds in the word; the recurrent neural network also relies on the phonotactics of the word, taking all previous sounds into account. The authors perform two experiments, one on artificially seeded borrowings and one on the "real" WOLD data. Since the promising results on the simulated data could not be replicated in the experiments using the WOLD data, the authors carried out additional experiments to explain the performance of the methods, which was achieved. Although the results and the performance are not satisfying, the three introduced methods, along with the lexical language models, open a new perspective for further research; the recurrent neural network in particular serves as a basis for improvements and further explorations in the field of automatic loanword detection.
Two things I really liked about the manuscript are the detailed explanation of the methods and the representation of the data, which help the reader gain clear insights into the evaluation of the different methods. The statistical evaluations are carried out in a rigorous way and explained in detail, with explanations offered for the performance and the results. The authors made the effort to perform additional analyses to explain the performance of the methods in the two experiments. The process of nativization plays a crucial role in the motivation of the approach introduced in the first chapters; however, it is not revisited in more detail in the conclusion. The detection of loanwords using the proposed methods depends highly on the data and on the degree to which a word has been adapted in the recipient language. The methods might not identify older loanwords or words loaned from related languages, which show no clear differences in phonology. This issue is not discussed in detail in the conclusion, but it could be one of the reasons for the poor performance of the methods. We have now largely modified the description of the approach and decided to give up the parallel to borrowing detection by native speakers. First, we found that native speakers' knowledge may often not be as perfect as it seems (judging from our discussion among the multi-lingual team of authors and linguists and what we know about the knowledge of native speakers in our respective native tongues), and second, we found that phonological and phonotactic clues are also justified and discussed as such in the traditional linguistic literature. As a result, we no longer talk about native speakers' intuition, but rather emphasize the importance of testing the power of language-internal clues for borrowing detection on a larger scale.
In addition, the automatically derived IPA transcriptions of the words from the WOLD database could introduce noise into the analyses, depending on the correctness of the transcriptions. However, since no other database is available that contains loanword annotations along with the donor language and loanword status, compromises need to be made. We have added a table demonstrating how the data is organized in memory and expanded the discussion of how it is organized on disk via the CLDF standard. We also expanded the discussion and the references on how the transcriptions were obtained (i.e., orthographic profiles) and added a note, as mentioned by the reviewer, that the transcription might add noise to the data. The data is completely available online. Within the data, the additional German wordlist used for the artificially seeded borrowings is not identifiable at first sight. I would encourage the authors to provide the list in a format like CSV and make it easy to find in the Python package. Additionally, as a small note, on p. 6 the authors write that the German wordlist and the software package are available on GitHub. However, everything is uploaded on osf.io. For consistency the authors could correct this. We have provided an updated GitHub/Zenodo link for all the data and code. Additionally, we have provided a reference, including the URL, to the published German wordlist. Spelling comment: I encourage the authors to check the formula for the softmax activation on p. 9. To my understanding, a comma is missing in the brackets of the formula. We have added the missing comma. Comments by Reviewer #3 Summary: This manuscript introduces three methods to identify borrowed words based on monolingual wordlists. The authors evaluate the performance of these methods by setting up various scenarios. Although the methods work well in the case of artificially seeded data, their performance is not entirely satisfying.
Nevertheless, the authors point out that a high proportion of borrowings and the existence of a dominant donor language are beneficial to the task. Also, a more promising method for borrowing detection is recommended for future study. Strengths: This paper uses a dataset covering a large number of languages, so that we can learn which characteristics of languages are suitable for the examined methods. It is interesting to see that phonological and phonetic features are applied to build up the lexical models. In terms of the Markov model and the neural network model, it is also interesting to see the use of the entropy difference to classify borrowed words. The authors provide useful suggestions on future work and extract meaningful information despite the unsatisfying results of the models. Weaknesses: Generally, the structure of the paper is fine, but sometimes I was surprised to see content that does not belong in a section. I also saw some duplicated and redundant information. Besides, it would be better to clearly indicate the meaning of the notation in the formulas. Below are specific comments referring to specific sections. --Abstract You mention all the necessary information in the abstract. However, it is not always clear, and I had to spend some time looking for the information I needed. For instance, what you did and the results of this study are not clearly specified. In my opinion, using phrases like "in this study, we did this..." and "the results show that..." could help the reader catch the necessary information more easily and quickly. We have added appropriate hints to the abstract to help with the reader's processing of the information. --Introduction You mention a lot in the introduction about how well a native speaker can identify borrowings in his/her own language, and it seems that this is related to your motivation for the study. But you do not address this later in the rest of the paper.
Hence, I am confused about the actual motivation or the problem you try to address in the paper. It is the case that perceptions of native-user performance, including our own anecdotal experiences of loanword awareness, helped to inspire this study. However, consulting the linguistic literature again, and discussing the degree to which native speakers really are able to identify borrowings, showed that this point of inspiration is more a potential myth than a tested fact. As a result, we have now removed all hints to native speakers from the new version and emphasize instead that classical linguists routinely consider phonological and phonotactic clues when it comes to the detection of borrowings. So our intention is not to model native speakers, but to provide and test a method for lexical borrowing detection that is exclusively based on mono-lingual data in a supervised framework. This, we hope, is much clearer now in the draft. Also, it is a little weird that you include results and some discussion content at the end of the introduction, making this part read like an abstract. We reduced the text substantially while still anticipating the results and discussion sections. --Materials It would be easier to understand the WOLD dataset if you introduced more about the data format and showed some examples. I had to search for WOLD online to understand what the data looks like. Meanwhile, adding some phonetic transcription examples would be more interesting and make it easier to understand what you did. We have added a small table of transcription examples from represented languages. Line 134, citation [36]: there is a typo in the reference for this citation on page 32/39. The reference has been corrected, providing the paper's title. --Markov Model Page 8/39: it might be better to explain the notation. I had to spend some time guessing what it meant. The same applies to page 9. The notation has now been better explained in the text for both the Markov and neural models. Line 202: any reference for this statement?
We added references to Bengio's original language-modeling work using recurrent layers and to the more recent transformer language models. Both show improvements over Markov models in language modeling, and we expected similar improvements for sound-segment modeling. Lines 232 to 238: maybe you want to sum up the three methods, but it seems a little redundant, as they have been introduced previously. Try to re-organise this and avoid duplicated content. We have reduced the redundancy and better introduced the idea that we are using dual inherited and borrowed models as part of the decision procedure. --Results In general, you write the results for each experiment separately, which is clear and good. However, at the beginning of each section you introduce the details of each experiment, and the introduction of the experiments should not be part of the results section. The introduction of the experiments should be somewhere else, as I only expect actual results such as numbers, tables, figures, and the relevant explanation of them in this section. We have reorganized the presentation of experiments and results so that each experiment and its results are treated separately from the others. This avoids the redundancy of multiple presentations of the same experiment. We hope this comes close to what the reviewer has suggested as the alternative option (in the next paragraph). Another option is to write up each experiment completely separately, meaning that you present the introduction, results, and discussion of an experiment together. The structure of the whole paper would then be clearer and more logical. Thank you for your suggestion. We now follow your advice by presenting each experiment along with its results in turn, with each occurring as a subsection of the "Experiments with results" section. Lines 351 to 353: is there any reason you chose these three characteristics? Besides, have you considered whether they are independent of each other?
We now explain the reason for choosing these characteristics in the text. The proportion of borrowed words is moderately negatively correlated with the proportion of unique native sound segments; otherwise, the characteristics are independent. Page 17/39, Table 4: are all these correlations significant? Maybe you can also include p-values. We wanted to emphasize the strength of the relationships between phonological characteristics and borrowing-prediction performance, so we noted in the table of correlations that all magnitudes >= 0.33 are significant at p < 0.05. We hope this is satisfying enough for the reviewer. Page 18/39, Table 5: it is a nice table showing the regression models' R-squared values. But it would be better to bring in the information from the table when you discuss the influence of the three factors, instead of just putting it here. The table information has now been explicitly referenced in the text of the results section. --Discussion You had to introduce each experiment again in this section, which is redundant. As I suggest above for the results section, write about the experiments one by one; then you do not have to describe the experiments again and again. We decided to keep the discussion section separate from the experiments. With the changes made, following the reviewer's suggestion above regarding the organization of the text by experiment (and more careful editing), we hope we succeeded in reducing the redundancy of the presentation. Lines 502-507: maybe you can try the frequency of these unique sounds. Thanks for this hint. The text now (hopefully) interprets the results better (especially those of the univariate and multiple regression analyses) by including the impact of exclusively borrowed word sounds. Submitted filename: RebuttalLetterPONE-D-20-27304-final.pdf 9 Nov 2020 Using lexical language models to detect borrowings in monolingual wordlists PONE-D-20-27304R1 Dear Dr.
Tresoldi, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Søren Wichmann, PhD Academic Editor PLOS ONE Additional Editor Comments (optional): The revisions look fine, and should be satisfactory for the reviewers, so a second round of reviewing is not necessary. I noted a couple of typos/stylistic issues that you could fix: depends from the initial contact situation -> depends on the initial contact situation and (b) the more borrowings go back -> and (b) when more borrowings go back [or something like that] Reviewers' comments: 16 Nov 2020 PONE-D-20-27304R1 Using lexical language models to detect borrowings in monolingual wordlists Dear Dr. 
Tresoldi: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Søren Wichmann Academic Editor PLOS ONE