A multiplex analysis of phonological and orthographic networks.

Pablo Lara-Martínez¹, Bibiana Obregón-Quintana¹, C F Reyes-Manzano², Irene López-Rodríguez³, Lev Guzmán-Vargas³.

Abstract

The study of natural language using a network approach has made it possible to characterize novel properties ranging from the level of individual words to phrases or sentences. A natural way to quantitatively evaluate similarities and differences between spoken and written language is by means of a multiplex network defined in terms of a similarity distance between words. Here, we use a multiplex representation of words based on orthographic or phonological similarity to evaluate their structure. We report that from the analysis of topological properties of networks, there are different levels of local and global similarity when comparing written vs. spoken structure across 12 natural languages from 4 language families. In particular, it is found that the differences between the phonetic and written layers are markedly higher for French and English, while for the other languages analyzed, this separation is relatively smaller. We conclude that the multiplex approach allows us to explore additional properties of the interaction between spoken and written language.

MeSH:
Humans
Language
Phonetics

Year: 2022 PMID: 36107963 PMCID: PMC9477335 DOI: 10.1371/journal.pone.0274617

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Introduction

The complexity of natural language has been studied from different perspectives of scientific research [1-5], among which characterizations based on phonological [6-8], morphological [9, 10], syntactic [11, 12], and semantic aspects [13, 14] stand out. Some of these approaches have shown that the complexity expressed in these aspects (phonetic, lexical, syntactic, semantic) follows general properties, such as Zipf’s law [15] and other linguistic laws, observed in all languages [1, 8, 16–18], while some particularities, such as the divergence between written and spoken language, may differ across languages. In some previous studies, the levels of complexity have been evaluated by modeling based on complex single-layer networks or their extension to multilayer networks [19-21]. Of particular interest are findings about emergent organizational properties that encompass facets of language ranging from semantics to phonetics, including the written structure of language [22-25]. In many of these network-based approaches, it has been found that the behavior of the connectivities (the number of neighbors of a given node) is often described by distributions that lie between power-law and narrow exponential behavior, depending on the language and the association criterion. For instance, Arbesman et al. [6] reported that in the case of phonological networks, the degree distribution follows a truncated power law with different parameters for different languages. From the perspective of orthographic networks, it has been reported [22] that the distribution of connectivities for the mental lexicon of elementary-level learners is well described by a power law, and that the network has small-world properties. Although different natural language properties based on transformations to complex networks have been analyzed, few studies have focused on incorporating multilayer aspects of language [20, 26]. 
In this study, we address orthographic and phonetic features of language using a multiplex approach. In particular, by estimating the similarity between word pairs, a two-layer network is constructed in which the nodes are the words, and a link exists if a threshold value of the similarity is satisfied. For the purpose of our study, the distance between two words, A and B, is defined as the minimum number of edit operations needed to transform A into B, which is the well-known Damerau-Levenshtein (DL) distance [27-29] (see Methods). Among the widely recognized characteristics of many natural languages is the absence of a one-to-one (biunivocal) correspondence between the writing of a word and its pronunciation. Thus, the correspondence between graphemes and phonemes is not one-to-one, giving rise to situations such as homography (when one letter corresponds to two phonemes), digraphy (two letters correspond to one phoneme or vice versa), heterography (one phoneme corresponds to two or more letters), etc. [30-35]. In fact, at the word level, the appearance of phenomena such as homophony and transparency in natural languages has been the subject of extensive study from the linguistic perspective [36, 37]. On the other hand, complex networks have been incorporated into the analysis of many systems, and language is one where multiplex modeling is particularly appropriate. In a multiplex network, the nodes are placed in layers with connections between them, and the nodes are common to all layers. Several real and simulated multilayer networks have been studied in contexts such as finance and economics [38-40], social systems [41, 42], synchronization [43] and linguistics [21]. A direct comparison between orthographic and phonological networks would be important to quantify the local and global connectivity patterns and their changes across different languages. 
Related to the latter, and in the context of psycholinguistic studies, the identification of these differences and similarities could contribute to the understanding of the mechanisms that act on cognitive processes, such as word recognition and retrieval, whose manifestations differ markedly when looking at orthographic or phonetic organization. In a more applied context, the relationship between orthographic and phonological networks could be of great interest for the robustness of automatic speech recognition systems, as they are often prone to failures in transcription to written text. It could also impact issues such as cross-language transfer learning, where a neural network that can recognize one language might perform well in another language depending on the similarity of the multiplex networks. The contributions of this study focus on answering the questions that were posed in the registered protocol and are summarized as follows. (i) Unlike previous studies based on single-layer networks, multiplex orthographic-phonetic networks were constructed for 12 natural languages based on similarities between 5 × 10⁴ words. The observed properties reveal that it is possible to differentiate levels of organization between orthographic and phonetic structure in natural language. (ii) Our results indicate that while certain languages exhibit a high correlation, for node-based measures, between phonetic and orthographic similarity, for others this correlation is rather low, reinforcing the identification of differences at the local level. (iii) Our approach based on a multiplex analysis presents an alternative view for understanding the organization of language by combining the written and spoken form.

Orthographic and phonological network

In this study, the multiplex language network consists of an orthographic network and a phonological network (see Fig 1 for a schematic representation). For the orthographic network, we generate a similarity network at word level, G^O = (V^O, E^O), where nodes are words and a link between two nodes is defined if the DL distance is smaller than or equal to a threshold value ℓ. In a similar way, the phonological network G^P = (V^P, E^P) is constructed in terms of nodes which represent words translated to the International Phonetic Alphabet (IPA), and links are defined if the DL distance is smaller than or equal to a given threshold ℓ. Next, the orthographic and phonological networks are combined to generate a two-layer network, denoted by G^α, with α = O, P. Here, the adjacency matrix for the multiplex network is given by A^α = {a_ij^α}, where a_ij^α = 1 indicates that there is a link between node (word) i and node (word) j at layer α. More formally, the adjacency matrix associated with each layer is defined through the Heaviside function Θ applied to ℓ − d_ij^α, where d_ij^α is the DL distance between word i and word j at layer α, multiplied by a factor built from the Kronecker delta δ. This factor is included to exclude the cases for which the link does not reflect similarity, and it involves the lengths (in characters) of words i and j.
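The thresholding rule can be illustrated with a minimal Python sketch. Several things here are assumptions made only for illustration: the helper names (`edit_distance`, `build_layer`), the four toy words, and their hand-written IPA-like strings. The distance used is a plain Levenshtein stand-in without transpositions, and the additional exclusion factor described above is omitted, so this is not the construction used in the study.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (stand-in: no transpositions)."""
    if not a or not b:
        return len(a) + len(b)
    if a[0] == b[0]:
        return edit_distance(a[1:], b[1:])
    return 1 + min(
        edit_distance(a[1:], b),      # delete the first character of a
        edit_distance(a, b[1:]),      # insert the first character of b
        edit_distance(a[1:], b[1:]),  # substitute one character
    )


def build_layer(forms, ell, distance=edit_distance):
    """Edge set {(i, j)}, i < j, linking forms whose distance is <= ell."""
    return {
        (i, j)
        for i, j in combinations(range(len(forms)), 2)
        if distance(forms[i], forms[j]) <= ell
    }


# Toy data (hypothetical): spellings and illustrative IPA-like strings.
words = ["night", "knight", "nit", "net"]
ipa = ["naɪt", "naɪt", "nɪt", "nɛt"]

# Both layers share the same node indices, as in a multiplex network.
multiplex = {
    "O": build_layer(words, ell=1),
    "P": build_layer(ipa, ell=1),
}
print(multiplex)
```

Comparing all pairs costs O(N²) distance evaluations, so a corpus of the size considered here would need a more careful implementation than this sketch.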

Fig 1

Construction of the multiplex language network.

Representative multiplex network for the English language and several distance thresholds. Each layer represents the orthographic (top) and phonological (bottom) networks. Here, nodes are words and there is a link if the Damerau-Levenshtein (DL) distance is smaller than or equal to a given threshold: (a) ℓ = 1, (b) ℓ = 2 and (c) ℓ = 3. Notice that words in the phonological layer were translated into the International Phonetic Alphabet and then the DL distance was calculated. The figures were generated by using the free Python library pymnet [45, 46].

Databases

Our study focuses on analyzing orthographic-phonological networks of 12 natural languages belonging to four language families: Germanic (English, German, Dutch and Swedish), Romance (French, Spanish, Portuguese and Italian), Slavic (Russian, Ukrainian and Polish) and Uralic (Hungarian). A corpus of words for each language was constructed using a set of books available from Project Gutenberg (www.gutenberg.org). The written texts were pre-processed to remove function words, stop words and punctuation and other mark symbols. The titles of the written texts and the resulting corpus are described in https://doi.org/10.6084/m9.figshare.14668593 [44]. The final corpus for each language contains 50 × 10³ words with their corresponding transcription into the International Phonetic Alphabet (transliterated with the epitran library under Python version 3.6.8).

Results

First, prior to the description of the multiplex properties of the OP network, we present updated results for the calculations of the basic structural properties of the orthographic and phonological networks (see Methods for details). These results correspond to topological features calculated for a corpus with 50 × 10³ words for each of the 12 languages in our study. Table 1 shows the results for 4 representative languages, one from each linguistic family (see supporting information online at FigShare [44] for results from all the languages in our study). For a comparison between the interlayer values of these network metrics, the phonological/orthographic ratios for the different languages are shown in Fig 2.

Table 1

Results for the basic topological network quantities obtained from the orthographic (G^O) and phonological (G^P) networks.

Language	Metric	G^O (ℓ = 1)	G^O (ℓ = 2)	G^O (ℓ = 3)	G^P (ℓ = 1)	G^P (ℓ = 2)	G^P (ℓ = 3)
English	Density	1.38(10⁻⁴)	10.32(10⁻⁴)	74.58(10⁻⁴)	2.62(10⁻⁴)	18.05(10⁻⁴)	97.81(10⁻⁴)
	Average degree k¯	4.76	47.57	366.41	8.79	81.63	476.86
	Nearest neighbor knn¯	5.90	66.94	552.60	10.92	113.03	689.39
	Clustering c¯	0.20	0.28	0.32	0.21	0.31	0.35
	Average component size	3.63	18.58	143.69	4.49	19.29	125.42
	Maximum modularity Q*	0.80	0.55	0.39	0.80	0.55	0.39
Spanish	Density	0.82(10⁻⁴)	4.66(10⁻⁴)	33.46(10⁻⁴)	1.02(10⁻⁴)	6.69(10⁻⁴)	47.95(10⁻⁴)
	Average degree k¯	2.79	21.19	163.83	3.67	31.17	235.96
	Nearest neighbor knn¯	3.43	29.65	249.16	4.58	43.77	354.44
	Clustering c¯	0.14	0.31	0.31	0.14	0.30	0.31
	Average component size	2.75	16.00	118.23	3.36	23.26	161.59
	Maximum modularity Q*	0.85	0.54	0.38	0.85	0.54	0.38
Russian	Density	0.82(10⁻⁴)	2.90(10⁻⁴)	19.96(10⁻⁴)	0.86(10⁻⁴)	2.57(10⁻⁴)	14.48(10⁻⁴)
	Average degree k¯	2.21	12.61	95.69	2.16	10.77	68.16
	Nearest neighbor knn¯	2.66	17.26	145.03	2.60	14.60	101.85
	Clustering c¯	0.22	0.34	0.32	0.21	0.35	0.34
	Average component size	2.28	9.25	49.25	2.19	7.14	28.47
	Maximum modularity Q*	0.95	0.71	0.49	0.95	0.71	0.49
Hungarian	Density	1.08(10⁻⁴)	3.26(10⁻⁴)	18.19(10⁻⁴)	1.13(10⁻⁴)	3.28(10⁻⁴)	16.40(10⁻⁴)
	Average degree k¯	2.36	12.77	83.07	2.36	12.39	73.43
	Nearest neighbor knn¯	2.92	17.74	129.16	2.90	16.81	109.21
	Clustering c¯	0.17	0.31	0.35	0.18	0.30	0.33
	Average component size	2.31	9.18	34.88	2.30	8.18	30.56
	Maximum modularity Q*	0.94	0.72	0.54	0.94	0.72	0.54

Table notes. Topological metrics of the orthographic network and the phonological network. These results were obtained from networks with 50 × 10³ words at each layer. The average values of the degree (k¯), clustering (c¯) and nearest-neighbor degree (knn¯) are presented. We observe that the density, k¯ and knn¯ increase with ℓ for the four languages and the two layers, with some similarities between layers, as occurs for c¯ in both layers at distances ℓ = 2, 3. The modularity and the average component size exhibit opposite trends: while the modularity decreases as ℓ increases, the average component size increases because a larger number of nodes tends to be connected to a giant component. See extended data online at FigShare [44], https://doi.org/10.6084/m9.figshare.14668593.

Fig 2

Ratio (phonological/orthographic) values for some metrics shown in Table 1.

The cases of (a) density, (b) average clustering, and (c) average component size are depicted. Here a value close to 1 indicates that similar metric values are obtained for the orthographic and phonological layers, while a value greater (smaller) than 1 is obtained when the phonological (orthographic) layer exceeds the opposite layer. It is observed that French exhibits the highest asymmetry for density and component size, while for clustering, most languages display values close to 1, except German, with higher values in the orthographic layer.

The density ratio (Fig 2a) indicates that the phonological network has more connections than the orthographic network, confirming that the sound affinity between words is greatest in languages such as French, and to a lesser extent in English and German. This result also aligns with previous findings about properties like homophony (when two or more words sound the same but carry distinct meanings) in several human languages [33], which favors higher degrees in the phonetic layer, while word spelling is circumscribed by the repetition of the characters. 
The average clustering coefficient exhibits relatively similar ratio values for almost all languages (Fig 2b), indicating that the local structure (presence of triangles) is similar whether looking at orthographic or phonological properties. Moreover, as shown in Fig 2c, the average component size takes larger values in the phonological than in the orthographic layer for French and, to a lesser extent, for Spanish, while for the rest of the languages the difference is smaller and the behavior even reverses for the Slavic family and Hungarian. This reveals that the fragmentation of the layers occurs differently across languages, with the phonological layer showing the greatest cohesion in the languages noted above. In addition, determining the functional form of the degree distribution has become a standard way of characterizing the connectivities and the structure of networks [47]. The degree distributions of the orthographic and phonological layers are broad, also known as fat-tailed distributions [48] (see Fig 3). For each degree distribution in our study, we performed fits to the data by considering the following distributions: Gumbel, Exponential, Loglogistic, Lognormal, Weibull and Power-law (see Supplementary Material [44]). To establish the distribution that best fits the data of the phonological and orthographic networks, we used the Akaike and Bayesian information criteria [49, 50] (see Methods and Supplementary Material [44] for details). According to statistical tests, most of the degree distributions of each layer (either orthographic or phonological) are well described by the Weibull distribution, while for the remaining ones the best fits correspond to the Loglogistic and Lognormal, although the Weibull remains the second-best fit in most of these cases (see Supplementary Material [44]). 
The corresponding survival (complementary cumulative) distribution of the Weibull is a stretched exponential function, which has been used to describe a variety of phenomena [51-56]. We note that this distribution is more skewed than a single exponential distribution but less skewed than a power-law distribution. As shown in Table 4 of the Supplementary Material [44], the Weibull fitting exponents for the orthographic layer are slightly greater than the corresponding phonological exponents, except for Spanish and French, indicating that larger connectivities are present in phonological networks, i.e., a relatively small number of words concentrate similarity with many other words in terms of phonetic structure. These new values for the scaling exponents improve the description of the connectivities previously reported in the preliminary analysis [57], and are consistent with the fact that the distributions are of the heavy-tailed type. To reinforce the choice of the fit to a Weibull function in most cases, we performed the likelihood ratio test as described by Clauset et al. [58] (see Methods for details). The results show that almost all the fits lead to p-values lower than 10⁻¹⁰, in terms of the probability that they fit better to a Weibull-type function than to any other of the four distributions considered in our analysis. Similar conditions were found when considering cases where the selection corresponds to Loglogistic or Lognormal. Our findings on the behavior of degree distributions are also in general agreement with previous estimates made for some languages and based on phonological and written similarities [6].
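For reference, the stretched exponential form and the two information criteria mentioned above can be written in their standard textbook form. The symbols λ (scale), β (shape), p (number of fitted parameters), n (number of observations) and L̂ (maximized likelihood) are notation assumed here and may differ from the one used in the Supplementary Material.

```latex
% Weibull survival (complementary cumulative) function: a stretched exponential
S(k) = P(K > k) = \exp\!\left[-\left(\frac{k}{\lambda}\right)^{\beta}\right],
\qquad \lambda > 0,\ \beta > 0

% Akaike and Bayesian information criteria (lower values indicate a preferred model)
\mathrm{AIC} = 2p - 2\ln\hat{L}, \qquad
\mathrm{BIC} = p\,\ln n - 2\ln\hat{L}
```

For β = 1 the survival function reduces to a simple exponential, and for β < 1 the tail decays more slowly than an exponential, which is the regime usually referred to as stretched exponential.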

Fig 3

Degree distributions for phonological and orthographic networks.

The cases of English, Spanish, Russian and Hungarian are depicted for three distance thresholds. The top row shows the distributions of the orthographic layer, while the bottom row shows the phonological layer. Dashed lines represent a Weibull-type function fit (see Table 4 in Supplementary Material at FigShare [44]). We find that for the majority of languages and both layers, the distributions display a heavy-tailed behavior. For a better comparison of the data, the insets of each plot show the corresponding degree distribution for normalized degrees k/k* (horizontal axis), where k* = max(log(k)).

In order to assess the similarities between distributions from different languages, we resort to a robust measure of the distance between them: the Jensen-Shannon distance (JSD) (see Methods). Fig 4 shows the matrix of JSD values between all pairs of orthographic vs. phonological degree distributions for ℓ = 1, 2, 3. The order of columns and rows has been determined by their similarities by using an agglomerative hierarchical clustering methodology (see Methods for details). The resulting dendrograms are shown on the top and left sides of the distance matrices. From the orthographic perspective, it is observed that English is the most divergent from the rest, while Russian, Hungarian and Ukrainian are the least distant from one another, especially for ℓ = 1 and ℓ = 2. With respect to the phonetic component, English and French are clearly the languages with the greatest separation from the rest, while Russian, Hungarian and Ukrainian are again the closest languages.
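As an illustration of the distance between degree distributions, the following Python sketch computes a Jensen-Shannon distance as the square root of the Jensen-Shannon divergence. The use of base-2 logarithms (which bounds the distance by 1), the function names and the toy degree sequences are assumptions, since the exact settings of the study are given in its Methods and not reproduced here.

```python
from collections import Counter
from math import log2, sqrt


def degree_distribution(degrees):
    """Empirical P(k): fraction of nodes having each degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in counts.items()}


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint support.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(v * log2(v / m[k]) for k, v in a.items() if v > 0)

    divergence = 0.5 * kl(p) + 0.5 * kl(q)
    return sqrt(max(0.0, divergence))  # guard against tiny negative round-off


# Toy degree sequences (hypothetical), standing in for two languages.
p = degree_distribution([1, 1, 2, 3])
q = degree_distribution([1, 2, 2, 5])
print(js_distance(p, q))
```

A matrix of such pairwise values is the kind of input that an agglomerative hierarchical clustering routine would take to order the languages.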

Fig 4

Language similarity evaluated by the Jensen-Shannon distance between layers orthographic (horizontal) and phonological (vertical).

The cases of (a) ℓ = 1, (b) ℓ = 2 and (c) ℓ = 3 are depicted. The dendrograms have been determined in terms of the similarities between languages by using the agglomerative hierarchical clustering method. We observe that for ℓ = 1 and ℓ = 2, cutting the dendrogram for the orthographic dimension at intermediate height (dashed line) identifies four groups: G1 (Russian, Hungarian and Ukrainian), which exhibits the highest internal similarity (low JSD), G2 (English), G3 (Dutch, Italian and Swedish) and G4 (Spanish, Polish, German and Portuguese). It is important to notice that in groups G3 and G4, Romance, Germanic and Slavic families are mixed and English is an isolated language. In contrast, for the phonological dimension, four groups are again observed at an intermediate cut-off (dashed line), with English and French standing out for their large distance from every other language, while Ukrainian, Russian and Hungarian are described by relatively low distances. For ℓ = 3, we observe that English is the most divergent from the rest in terms of writing, while English and French are the most divergent in terms of phonological structure.

Next, we further explore the relationship between some topological features of the orthographic and phonetic networks. First, the Spearman rank correlation coefficient is calculated to evaluate the presence of correlations. Fig 5 shows the results of the calculations of the correlations for degree, clustering and average nearest-neighbor degree. Positive correlations are observed in all properties, but some differences are remarkable when comparing individual languages and linguistic families. For degrees (Fig 5a), we observe that English, German and French exhibit relatively low correlation values for the threshold value ℓ = 3, while for the rest of the languages and the three threshold values, a higher correlation is present (≥0.7). 
These results indicate that, for most of the languages in our study, words with high similarity in their orthographic structure also tend to have more phonological similarity and vice versa, except for the three languages listed above. For knn (Fig 5b), higher and similar correlation coefficients are observed for all languages and threshold values ℓ = 1 and ℓ = 2, confirming that as the mean spelling similarity (with other words) of the neighbors of a given word increases, the phonetic similarity of the neighbors also increases. This fact is particularly remarkable for Romance languages, except French. For clustering (Fig 5c), the languages of the Germanic family and French have lower correlation coefficients, revealing that words with a high fraction of connected neighbors in terms of orthographic structure tend to have a rather smaller fraction of connected neighbors in terms of phonological similarity and vice versa.
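The inter-layer correlation of a node-based measure can be sketched as follows. This is a minimal, assumed implementation of the Spearman rank coefficient (Pearson correlation of average ranks, so ties are handled), applied to hypothetical per-word degrees in the two layers; the function names and the numbers are illustrative only.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation between two equally long sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical degrees of the same five words in each layer.
k_orth = [1, 2, 3, 4, 5]
k_phon = [2, 4, 5, 9, 20]
print(spearman(k_orth, k_phon))
```

Because only ranks enter the coefficient, any monotonically increasing relation between the two layers gives the maximum value, regardless of how much larger the phonological degrees are.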

Fig 5

Correlations between some structural properties of the orthographic and phonological networks.

We show the Spearman rank correlation coefficient for (a) degree, (b) average nearest-neighbor degree and (c) average clustering. For most of the languages, similar levels of positive correlation are observed for the three properties, except for the Germanic family and French, for which the clustering correlation is noticeably lower than for the rest of the languages.

In order to evaluate the link overlap across the orthographic and phonological layers, the normalized local Jaccard index was calculated (see Methods). Here, a value close to one indicates that words tend to have the same neighbors in the orthographic and phonological layers, while a value close to zero indicates that words do not necessarily share the same neighbors. Fig 6a shows the results of the calculations of the Jaccard index. The Germanic family (Dutch, English, German and Swedish) together with French display a relatively low index value, indicating low overlap across both layers, while for the Romance, Slavic and Uralic families, similar values are observed, which reflect the fact that words tend to have the same neighbors across layers.
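The idea of link overlap can be sketched with the plain local Jaccard index, i.e. the size of the intersection of a word's neighbor sets in the two layers divided by the size of their union, averaged over words. This is an assumption made for illustration: the normalized version used in the study is defined in its Methods and may differ, and the convention of assigning 0 to words isolated in both layers is also a choice made here.

```python
def local_jaccard(adj_o, adj_p, node):
    """|N_O ∩ N_P| / |N_O ∪ N_P| for the neighbor sets of one node."""
    a = adj_o.get(node, set())
    b = adj_p.get(node, set())
    union = a | b
    # Convention (assumed): nodes isolated in both layers contribute 0.
    return len(a & b) / len(union) if union else 0.0


def mean_jaccard(adj_o, adj_p):
    """Average of the local index over all nodes present in either layer."""
    nodes = set(adj_o) | set(adj_p)
    return sum(local_jaccard(adj_o, adj_p, v) for v in nodes) / len(nodes)


# Toy two-layer network (hypothetical): node -> set of neighbors.
adj_o = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
adj_p = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(mean_jaccard(adj_o, adj_p))
```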

Fig 6

Average Jaccard index (J), modularity ratio and similarity F1*.

(a) Jaccard index, which indicates the extent of link overlap across layers. (b) Modularity ratio between layers. The value of the phonological layer divided by the value of the orthographic layer is shown. (c) Score F1* to evaluate the similarity between communities associated to a given modularity.

Moreover, we also explored the similarities between both layers from the perspective of modularity, which measures the property of a given network to be divided into groups [59]. First, we evaluated the ratio between single-layer modularities. Fig 6b shows the phonological/orthographic ratio for all languages. It is observed that for Dutch and Swedish the dissociation between modules tends to be markedly high in the phonetic layer compared to the orthographic one, while the opposite is observed for French, i.e., for French the modules tend to be not very well defined due to a dense connectivity with different phonetically similar groups. Next, to gain further insight into the differences and common properties between layers, we explore the similarity between the communities associated to a given modularity Q, i.e., to what extent words which belong to a certain community in the orthographic layer are also contained in the corresponding group in the phonological network. We resort to the F1*-score (see Methods) to estimate the similarity (in terms of overlapping of communities) between the two layers. Fig 6c shows that for English and French F1* exhibits lower values, especially for ℓ = 3, indicating that for these languages words tend to fall into different communities across layers. In contrast, higher values of F1* are observed for the rest of the languages, which is consistent with the fact that most words are identified with the same group regardless of whether they are written or spoken. 
Finally, we performed multiple robustness tests to explore the similarities and differences between layers in terms of two general strategies: directed attacks and random failures [60]. To test how both layers are affected by the removal of a fraction f of nodes, either the most connected ones or nodes selected at random, the mean size of the components and the diameter of the networks were monitored. Fig 7 presents the results of the calculations for four representative languages (see extended data online at FigShare [44]). For attacks, we find that for ℓ = 1 and for both layers (Fig 7a), as we increase the fraction of removed nodes, the normalized diameter tends to increase up to a maximum value and then decreases. The threshold value (f*) at which the diameter exhibits a peak changes for each language: while for Slavic languages and Hungarian it is located below 0.1, for Germanic and Romance languages it is located slightly above 0.1 (see accompanying information online at FigShare [44]). Interestingly, this transition occurs in the phonological layer systematically at slightly larger fractions of removed nodes, except for the Slavic family and Hungarian, which exhibit transitions at similar values of f. For larger values of the DL distance, the transition point appears to shift to the right, i.e., a larger value of f is needed to detect the transition (Fig 7a). For the case of random failures, increasing the fraction of removed nodes has a limited effect on the normalized diameter in all languages. In contrast, when the average component size is monitored in terms of f, important differences emerge between the layers and languages (Fig 7b). We observe that both layers exhibit a decaying behavior, with different rates for attacks compared with random failures. 
For a more direct comparison between both layers, we computed the average differences between the values of the normalized component size of the phonetic layer and the corresponding values of the orthographic layer for either attacks or random failures (Fig 8). French is the language with the largest deviations, either positive or negative, i.e., the orthographic layer is more robust under attacks, while for random failures the phonological layer is more robust. A similar behavior, but to a lesser extent, is present in Spanish, while for Germanic languages a greater robustness is observed from the orthographic perspective. The rest of the languages exhibit small deviations, with a slightly higher orthographic robustness for both failures and attacks.
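The two removal strategies can be sketched on a single layer as follows. The sketch is only illustrative and rests on assumptions: the attack ranks nodes by their initial degree (without recomputing degrees after each removal), the component size is not normalized, and the function names and the star-shaped toy network are hypothetical.

```python
import random


def component_sizes(adj, removed):
    """Sizes of the connected components left after deleting `removed` nodes."""
    seen, sizes = set(removed), []
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        sizes.append(size)
    return sizes


def mean_component_size(adj, removed=()):
    sizes = component_sizes(adj, removed)
    return sum(sizes) / len(sizes) if sizes else 0.0


def attack(adj, f):
    """Directed attack: the fraction f of nodes with highest initial degree."""
    n_remove = int(f * len(adj))
    ranked = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    return set(ranked[:n_remove])


def failure(adj, f, rng):
    """Random failure: a fraction f of nodes chosen uniformly at random."""
    n_remove = int(f * len(adj))
    return set(rng.sample(sorted(adj), n_remove))


# Toy layer (hypothetical): a hub word linked to five similar words.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
print(mean_component_size(star))                       # intact network
print(mean_component_size(star, attack(star, 0.2)))    # hub removed
print(mean_component_size(star, failure(star, 0.2, random.Random(0))))
```

In this toy case the attack removes the hub and shatters the layer into isolated words, whereas a random failure most often removes a peripheral word and leaves one large component, which is the qualitative contrast between the two strategies.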

Fig 7

Robustness of the networks.

Directed attacks consist of removing the most connected nodes, whereas for failures nodes are removed at random. a) Behavior of the normalized diameter in terms of the fraction of removed nodes. b) As above but for the normalized component size. Different profiles for the decay are observed when comparing orthographic and phonological networks from the four languages. The results for failures correspond to the average from 10 independent realizations.

Fig 8

Average component size differences between phonological and orthographic networks.

We show the cases of attacks (filled circles) and failures (open circles) for several thresholds ℓ of the DL distances, a) ℓ = 1, b) ℓ = 2, and c) ℓ = 3. The results for failures correspond to the average from 10 independent realizations.

Discussion and conclusion

We have shown that orthographic and phonological layers exhibit similarities and differences across several languages from four linguistic families. Interestingly, our network analysis based on a wide variety of measures showed that natural languages reveal different levels of proximity when viewed from the written or spoken perspective. Our findings significantly extend previous studies based on small word corpora and limited to a few languages [32]. In previous works, models of inter-word similarity networks have been approached based only on purely orthographic or phonological properties, showing that there are changes in network characteristics when different languages are compared. In our approach, the different metrics are compared in parallel and the differences or asymmetries are highlighted across languages. More importantly, the results about a higher density in the phonological layer for languages like French and English are consistent with linguistic reports which point out the presence of homophony and more opacity in these languages [31, 33, 36, 37]. A remarkable fact, in the context of our study, is that when languages were grouped based on the distance between connectivity distributions, the categorizations did not necessarily correspond to the classification by language family, with cases such as English and French having a greater divergence (specially phonetic) with respect to the rest of the languages. Although our approach is based on a simple string-distance metric without incorporating other elements such as syllables, morphemes, etc., the similarities and differences suggest that additional quantitative evaluations, which can incorporate these additional information, can be performed across several natural languages. The results we report here are in general agreement with studies focused either on phonological or orthographic networks, and reinforce the idea of common general organization in natural languages. 
In addition, the accuracy with which the network metrics and their changes across layers determine similarities and differences makes the approach appropriate for benchmarking with other languages and, eventually, for application on a scale beyond the word level. The present study can be naturally extended by incorporating additional layers containing semantic information, polarity information, etc., to explore further properties with potential use in text classification, automatic speech recognition systems and pattern identification in natural languages. Our study has some limitations, the most notable of which is that the similarity of phonetic structure based on the Damerau-Levenshtein distance tends to be overestimated, because the discretization of sounds leads to a loss of structure [8]. Also, the sample size may affect the estimation of some parameters of the multiplex network. We conclude that the multiplex analysis reveals additional features that have not been evaluated by other methods, and provides a way to obtain important information about the interaction between spoken and written language. In addition, this study offers an alternative multiplex network-based methodology for language analysis that can be easily extended to other languages to contribute to the understanding of language complexity.

Methods

Damerau-Levenshtein distance

The distance between two strings A and B can be defined as the minimum number of edit operations needed to transform A into B. These operations are: (1) substitute a character in A with a different character, (2) insert a character into A, (3) delete a character from A, and (4) transpose two adjacent characters of A. The Damerau-Levenshtein (DL) distance is then defined as the length of the optimal edit sequence [27-29]. In our analysis, we adopt the DL distance ℓ as a threshold value to define a link between two words.
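The four edit operations above can be illustrated with a short dynamic-programming sketch. This is not the authors' code: it implements the restricted variant of the DL distance (often called optimal string alignment), and the function names `osa_distance` and `build_layer`, the default threshold `max_dist=1`, and the use of "less than or equal to the threshold" as the linking rule are assumptions made only for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def build_layer(words, max_dist=1):
    """Link every pair of words whose distance is at most max_dist (assumed rule)."""
    edges = set()
    for idx, u in enumerate(words):
        for v in words[idx + 1:]:
            if osa_distance(u, v) <= max_dist:
                edges.add((u, v))
    return edges


# Hypothetical toy word list; a phonological layer would use transcriptions instead.
layer = build_layer(["cat", "act", "cart", "dog"])
```

Applying the same function to orthographic strings and to phonetic transcriptions of the same words would give the two layers of the multiplex. The restricted variant can differ from the unrestricted DL distance for some string pairs, so the choice of variant should be checked against the cited definitions [27-29].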

Network metrics

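Our analysis focuses first on the basic topological characteristics of the individual networks, and then on the similarities and differences between the two layers. The single-layer measures evaluated for each layer of the multiplex network (with N nodes) are listed below [61, 62]. Here $a_{ij}^{[\alpha]}$ denotes the adjacency-matrix element of layer α, equal to 1 if nodes i and j are linked in that layer and 0 otherwise.

Density. The density of a layer α, $\rho^{[\alpha]}$, is given as $\rho^{[\alpha]} = \frac{2 m^{[\alpha]}}{N(N-1)}$, where $m^{[\alpha]}$ is the number of actual connections within the layer α.

Degree distribution. The degree of a node i is the number of links outgoing from (or incoming to) that node, $k_i^{[\alpha]} = \sum_{j} a_{ij}^{[\alpha]}$. The degree distribution for layer α is then defined as the fraction of nodes in the network with degree k, $P(k^{[\alpha]}) = N_k^{[\alpha]} / N$, where $N_k^{[\alpha]}$ is the number of nodes with degree k.

Clustering coefficient. Measures the degree of transitivity in connectivity among the nearest neighbors of a node i within the layer α. It is calculated as [61] $C_i^{[\alpha]} = \frac{2 e_i^{[\alpha]}}{k_i^{[\alpha]} (k_i^{[\alpha]} - 1)}$, where $e_i^{[\alpha]}$ is the number of links between the neighbors of the node i within the layer α.

Average nearest-neighbor degree. Measures the average degree of the neighbors of a node [61]. It is calculated as $k_{nn,i}^{[\alpha]} = \frac{1}{k_i^{[\alpha]}} \sum_{j} a_{ij}^{[\alpha]} k_j^{[\alpha]}$.

Modularity. Let $c_i^{[\alpha]}$ be the community associated with the node i within the layer α, where $c_i^{[\alpha]} \in \{1, \dots, P\}$, with P a natural number. The modularity $Q^{[\alpha]}$ of a given layer α is given by [62] $Q^{[\alpha]} = \frac{1}{2 m^{[\alpha]}} \sum_{i,j} \left( a_{ij}^{[\alpha]} - \frac{k_i^{[\alpha]} k_j^{[\alpha]}}{2 m^{[\alpha]}} \right) \delta(c_i^{[\alpha]}, c_j^{[\alpha]})$, where δ is the Kronecker delta. We use the Louvain algorithm [63] to perform a greedy optimization of the modularity.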
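As a rough illustration of the single-layer measures, the sketch below evaluates density, degree distribution, clustering coefficient and average nearest-neighbor degree on one layer stored as a dictionary of neighbor sets. It uses the standard textbook forms of these measures; the data structure, the function names and the toy graph are assumptions for illustration, not the authors' implementation (modularity and the Louvain step are omitted).

```python
from collections import Counter


def density(adj):
    """Fraction of possible links that are present: 2m / (N (N - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each link is stored twice
    return 2 * m / (n * (n - 1))


def degree_distribution(adj):
    """Fraction of nodes having each degree k."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in counts.items()}


def clustering(adj, i):
    """Local clustering coefficient: 2 e_i / (k_i (k_i - 1))."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention for nodes with fewer than two neighbors
    links = sum(
        1
        for x in range(k)
        for y in range(x + 1, k)
        if nbrs[y] in adj[nbrs[x]]
    )
    return 2 * links / (k * (k - 1))


def avg_neighbor_degree(adj, i):
    """Mean degree of the neighbors of node i."""
    k = len(adj[i])
    if k == 0:
        return 0.0
    return sum(len(adj[j]) for j in adj[i]) / k


# Hypothetical toy layer: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

Running the same functions on the orthographic and on the phonological adjacency structure gives per-layer values that can then be compared node by node.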

Fitting of degree distributions

To determine the functional forms of the degree distributions, we relied on two indicators: the Akaike information criterion (AIC) [50, 64] and the Bayesian information criterion (BIC) [49, 50]. These criteria represent two of the most widely used families of model selection indicators for identifying the “best model”. Details of our procedure for assessing the significance of the fits can be found in the Supplementary Material [44]. In addition, we evaluated the goodness of fit by calculating the p-values of the likelihood ratio test introduced by Clauset et al. [58, 65] to compare the fits, thereby confirming that in most cases the observed data are better described by a Weibull distribution than by any of the other four distributions considered in our analysis. Similarly, when the best fit corresponded to Loglogistic or Lognormal, the p-values were very small ($p < 10^{-10}$).
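For reference, both criteria are simple functions of the maximized log-likelihood: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of fitted parameters and n the number of observations. The sketch below shows these standard formulas; the exponential model is used only because its maximum-likelihood estimate has a closed form, and the helper names and the sample data are hypothetical rather than taken from the study.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * loglik


def exponential_loglik(data):
    """Maximized log-likelihood of an exponential model (MLE rate = 1 / mean)."""
    n = len(data)
    rate = n / sum(data)
    return n * math.log(rate) - rate * sum(data)


def best_model(candidates, n_obs, criterion="aic"):
    """Pick the candidate with the lowest criterion value.

    candidates maps a model name to (maximized log-likelihood, number of parameters).
    """
    def score(item):
        loglik, k = item[1]
        return aic(loglik, k) if criterion == "aic" else bic(loglik, k, n_obs)

    return min(candidates.items(), key=score)[0]


# Hypothetical degree sample, treated as continuous for the sake of the example.
sample = [1.0, 2.0, 3.0]
ll_exp = exponential_loglik(sample)
```

Because BIC penalizes each extra parameter by ln n rather than by 2, the two criteria can disagree when candidate distributions have different numbers of parameters, which is why reporting both is informative.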

Jensen-Shannon distance and agglomerative clustering

Given two distributions P and Q, the JSD is defined as $\mathrm{JSD}(P \| Q) = \sqrt{\frac{D(P \| R) + D(Q \| R)}{2}}$, where R = (P + Q)/2 and D is the Kullback-Leibler divergence. For a better description of the distances between distributions, an agglomerative hierarchical clustering algorithm was used [66]. Briefly, the clustering method consists of recursively merging two items at a time. At the beginning, each item defines its own cluster, and the two most similar items are then clustered. Next, the process is repeated for the most similar items or clusters until a single cluster is formed. In our case, the JSD was used as the measure of similarity between two languages, from either an orthographic or a phonetic perspective.
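A minimal sketch of the Jensen-Shannon distance between two discrete distributions follows. It assumes that both distributions are given as equal-length lists of probabilities over the same support, takes the distance as the square root of the averaged Kullback-Leibler divergences to the mixture R, and uses natural logarithms; the logarithm base and the function names are assumptions, not details confirmed by the text.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: sqrt((D(p || r) + D(q || r)) / 2), r = (p + q) / 2."""
    r = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    value = (kl_divergence(p, r) + kl_divergence(q, r)) / 2
    return math.sqrt(max(0.0, value))  # guard against tiny negative rounding errors


# Two hypothetical degree distributions over the same four degree bins.
p_example = [0.1, 0.4, 0.4, 0.1]
q_example = [0.25, 0.25, 0.25, 0.25]
d_example = js_distance(p_example, q_example)
```

The pairwise distances computed this way for all language pairs form the matrix that the agglomerative clustering step takes as input.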

Communities and F1-score

For multiplex networks, we adapt the F1-score [67] defined in Ref. [68] as follows. Given two collections of communities $C^{[\alpha]}$ and $C^{[\beta]}$ of layers α and β, respectively, we define the F1*-score as $F1^{*} = \frac{1}{2} \left( F1(C^{[\alpha]}, C^{[\beta]}) + F1(C^{[\beta]}, C^{[\alpha]}) \right)$, where $F1(C^{[\alpha]}, C^{[\beta]})$ represents the average F1-score of a reconstructed community with respect to the best match in the opposite layer [68]. It is important to notice that $F1(C^{[\alpha]}, C^{[\beta]})$ and $F1(C^{[\beta]}, C^{[\alpha]})$ are well defined, since the node sets are the same in both layers ($V^{[\alpha]} = V^{[\beta]}$). The F1*-score between two collections of communities can be interpreted as the degree of similarity between them. For the Louvain method [63] and the F1-score implementation, we use NetworKit [69].
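The best-match comparison of communities across layers can be sketched in plain Python as follows. The sketch assumes that each collection is a list of node sets, that the F1-score of two sets is the harmonic mean of precision and recall, and that the two directional averages are combined with equal weight; the equal-weight symmetrization and the function names are assumptions, and this is not the NetworKit implementation used in the study.

```python
def f1(a, b):
    """F1-score between two node sets: 2 |a & b| / (|a| + |b|)."""
    inter = len(a & b)
    return 2 * inter / (len(a) + len(b)) if inter else 0.0


def avg_best_f1(source, target):
    """Average, over communities in source, of the best F1 match found in target."""
    return sum(max(f1(c, t) for t in target) for c in source) / len(source)


def f1_star(comms_alpha, comms_beta):
    """Symmetric score: mean of the two directional best-match averages."""
    return 0.5 * (avg_best_f1(comms_alpha, comms_beta)
                  + avg_best_f1(comms_beta, comms_alpha))


# Hypothetical communities detected on the same five nodes in two layers.
alpha = [{1, 2, 3}, {4, 5}]
beta = [{1, 2}, {3, 4, 5}]
score = f1_star(alpha, beta)
```

A score of 1 means that the two layers have identical community structure, and lower values indicate that communities in one layer are split or merged differently in the other.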

Jaccard index

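For each node i, the local overlap [62] between two layers α and β is defined as the total number of nodes that are neighbors of node i in both layer α and layer β: $o_i^{[\alpha,\beta]} = \sum_{j} a_{ij}^{[\alpha]} a_{ij}^{[\beta]}$, where $a_{ij}^{[\alpha]}$ equals 1 if nodes i and j are linked in layer α and 0 otherwise. The local overlap can be normalized to obtain a bounded measure that indicates the similarity of the neighborhoods of a node across the layers, namely the Jaccard index: $J_i^{[\alpha,\beta]} = \frac{o_i^{[\alpha,\beta]}}{k_i^{[\alpha]} + k_i^{[\beta]} - o_i^{[\alpha,\beta]}}$, where $k_i^{[\alpha]}$ is the degree of node i in layer α. The average Jaccard index is obtained by averaging over the total number of nodes in the network, $\langle J \rangle = \frac{1}{N} \sum_{i} J_i^{[\alpha,\beta]}$.

4 Apr 2022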

PONE-D-21-38860

A multiplex analysis of phonological and orthographic networks

PLOS ONE Dear Dr. Guzmán-Vargas, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. In particular, please consider the suggestions and comments made by the Reviewer #2 as they could strengthen and make more clear your manuscript. Please submit your revised manuscript by May 19 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Irene Sendiña-Nadal Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. 
We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. 3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section. 4. Thank you for stating the following in the Acknowledgments Section of your manuscript: “This work was partially supported by COFAA-IPN, EDI-IPN, and Conacyt-México” We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: “This work was partially supported by programs EDI and COFAA from Instituto Politécnico Nacional and Consejo Nacional de Ciencia y Tenología, México. No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.” Please include your amended statements within your cover letter; we will change the online submission form on your behalf. 5. 
Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Does the manuscript adhere to the experimental procedures and analyses described in the Registered Report Protocol? If the manuscript reports any deviations from the planned experimental procedures and analyses, those must be reasonable and adequately justified. Reviewer #1: Yes Reviewer #2: Yes ********** 2. If the manuscript reports exploratory analyses or experimental procedures not outlined in the original Registered Report Protocol, are these reasonable, justified and methodologically sound? A Registered Report may include valid exploratory analyses not previously outlined in the Registered Report Protocol, as long as they are described as such. Reviewer #1: Yes Reviewer #2: Yes ********** 3. Are the conclusions supported by the data and do they address the research question presented in the Registered Report Protocol? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. The conclusions must be drawn appropriately based on the research question(s) outlined in the Registered Report Protocol and on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? 
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors analyze the topological properties of phonological and orthographic networks of 12 languages by adopting a multiplex formalism. They perform a comprehensive analysis of these graphs and uncover interesting patterns of language use. Strength * The report is well-prepared, and the analysis is comprehensive. Weakness * The figures and captions are out of place. The authors should consider redesigning all the figures to fit within the text, even if it involves splitting the figures into separate subfigures. I recommend the authors fix the figures (see above) as a minor revision to make this fit for publication. 
Reviewer #2: The article entitled “A multiplex analysis of phonological and orthographic networks” explores a multiplex representation of words based on orthographic or phonological similarity to evaluate their structure. The authors perform this study in 12 languages, comparing their network metrics. I do consider this article of great interest as it explores the frontier between written and oral communication by using the novel approach of multiplex networks analysis. Furthermore, this approach may be of great interest in some applications related to Natural Language Processing or Automatic Speech Recognition. My main criticism of the article is that it seems that a series of very interesting analyzes have been carried out but without a clear objective. For example, in the abstract, it is said: “We report that from the analysis of topological properties of networks, there are different levels of local and global similarity when comparing…” This result somehow would be expected. The interesting question would be univocally framing those results with other linguistic research, phylogenetic approaches, information theory or sociology hypothesis of language. I believe that the article has great potential, but it should make it more straightforward about the hypotheses, objectives, and substantial contributions beyond an exciting series of analyses. Additionally, I would like to propose the following comments: Introduction. Some sentences are vague, e.g. “The complexity of natural language has been studied from different perspectives of scientific research”. This kind of sentence can be more concrete and explicative, including references to previous works. 1. Line 9-13: This should be referenced. 2. Line 13.15: However, it was later fitted to a Weibull distribution. 3. Related to ref 5 and the relationship between ortographic and phonology I would suggest to consider the reference: Torre, I. G., Luque, B., Lacasa, L., Kello, C. T., & Hernández-Fernández, A. (2019). 
On the physical origin of linguistic laws and lognormality in speech. Royal Society open science, 6(8), 191023; where the authors have addressed this kind of statistical regularities of language. 4. The relationship between orthographic and phonology networks could be of great interest to computer science, where automatic speech recognition systems may fail to transcribe into written text. This somehow could be related to the transfer learning between languages: e.g. a neural network that can recognize a language could work well in another language depending on the similarity of the multiplex network. 5. Limitation of ortographic translation of Damerau-Levenshtein. The orthographic transcription is, in some way, a discrete representation of the spoken word, but the orality is a continuum spectrum. In linguistics, the vowels are represented in the IPA vowel trapezium, some of them physically closer to others. The Damerau-Levenshtein distance does not recover this fact. Material and methods, and Results 6. Degree distributions are fitted with a Weibull-type function. This hypothesis should be supported by literature and compared to another type of distribution function, e.g. power-law or exponential. 7. The fitting to a probability distribution function, particularly a power law, has been a hot topic in recent years, so it should be rigorously addressed. The article should include the methodology used, the estimators and the goodness of fit. Some useful references: • Clauset, A., Shalizi, C. R., & Newman, M. E. (2009). Power-law distributions in empirical data. SIAM review, 51(4), 661-703. • Navas-Portella, V., González, Á., Serra, I., Vives, E., & Corral, Á. (2019). Universality of power-law exponents by means of maximum-likelihood estimation. Physical Review E, 100(6), 062106. 8. Line 144. The Pearson correlation test hypothesis should be revisited. It usually requires a normal distribution of the values, which seems to be not happening, at least for the degree. 9. 
Material and methods should be extended to detail the mathematical approaches to fit the distribution and goodness of fit analysis. Discussion and Conclusions. There is no apparent difference between the message written in the discussion and the one written in the conclusion section. I would suggest being more specific or merging both sections. Reproducibility of the article. While the data has been shared, I would recommend to also sharing the scripts used during calculations. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

31 May 2022

Response to Reviewers

Reviewer 1

1. The authors analyze the topological properties of phonological and orthographic networks of 12 languages by adopting a multiplex formalism. They perform a comprehensive analysis of these graphs and uncover interesting patterns of language use. Strength * The report is well-prepared, and the analysis is comprehensive. Weakness * The figures and captions are out of place. The authors should consider redesigning all the figures to fit within the text, even if it involves splitting the figures into separate subfigures. I recommend the authors fix the figures (see above) as a minor revision to make this fit for publication.

Response: Thank you for your comments and suggestions. We have redesigned all the figures as suggested.

Reviewer 2

1. My main criticism of the article is that it seems that a series of very interesting analyzes have been carried out but without a clear objective. For example, in the abstract, it is said: “We report that from the analysis of topological properties of networks, there are different levels of local and global similarity when comparing...” This result somehow would be expected. The interesting question would be univocally framing those results with other linguistic research, phylogenetic approaches, information theory or sociology hypothesis of language. I believe that the article has great potential, but it should make it more straightforward about the hypotheses, objectives, and substantial contributions beyond an exciting series of analyses.

Response: We welcome comments on the submitted version. These observations have been very constructive and are greatly appreciated. We have revised and added corrections according to the opinions expressed. In particular, we have clarified the scope of the study in terms of objectives, hypotheses and contributions.

2. Additionally, I would like to propose the following comments: Introduction. Some sentences are vague, e.g. “The complexity of natural language has been studied from different perspectives of scientific research”. This kind of sentence can be more concrete and explicative, including references to previous works.

Response: We have rewritten these sentences in a more detailed way and added citations.

3. Line 9-13: This should be referenced.

Response: We have added the corresponding citations.

4. Line 13.15: However, it was later fitted to a Weibull distribution.

Response: Thank you for the annotation. In previous reports, phonological networks constructed from similarities in sounds led to degree distributions of the truncated power-law type (we have added ref. [20]). In our case, the similarity is based entirely on the Damerau-Levenshtein distance, which, although relatively easier to calculate, tends to identify greater similarity between words; hence there is a difference between the functions that best approximate the distributions.

5. Related to ref 5 and the relationship between ortographic and phonology I would suggest to consider the reference: Torre, I. G., Luque, B., Lacasa, L., Kello, C. T., Hernández-Fernández, A. (2019). On the physical origin of linguistic laws and lognormality in speech. Royal Society open science, 6(8), 191023; where the authors have addressed this kind of statistical regularities of language.

Response: Thank you very much for bringing to our attention the work of Torre et al., which is very interesting in the context of the universality and origin of linguistic laws. We have added the corresponding citation.

6. The relationship between orthographic and phonology networks could be of great interest to computer science, where automatic speech recognition systems may fail to transcribe into written text. This somehow could be related to the transfer learning between languages: e.g. a neural network that can recognize a language could work well in another language depending on the similarity of the multiplex network.

Response: We appreciate this interesting suggestion on the usefulness of phonetic-orthographic networks. We have added a few sentences related to this as motivation in the Introduction (see lines 50-55).

7. Limitation of ortographic translation of Damerau-Levenshtein. The orthographic transcription is, in some way, a discrete representation of the spoken word, but the orality is a continuum spectrum. In linguistics, the vowels are represented in the IPA vowel trapezium, some of them physically closer to others. The Damerau-Levenshtein distance does not recover this fact.

Response: We agree that the similarity based on the Damerau-Levenshtein distance has limitations that especially affect the phonetic network, where this affinity is overestimated because the discretization of sounds leads to a loss of structure. We have added a few sentences pointing out these limitations.

8. Material and methods, and Results. Degree distributions are fitted with a Weibull-type function. This hypothesis should be supported by literature and compared to another type of distribution function, e.g. power-law or exponential.

Response: We have extended our description and justification of fitting with a Weibull function (see lines xx-xx).

9. The fitting to a probability distribution function, particularly a power law, has been a hot topic in recent years, so it should be rigorously addressed. The article should include the methodology used, the estimators and the goodness of fit. Some useful references: • Clauset, A., Shalizi, C. R., Newman, M. E. (2009). Power-law distributions in empirical data. SIAM review, 51(4), 661-703. • Navas-Portella, V., González, Á., Serra, I., Vives, E., Corral, Á. (2019). Universality of power-law exponents by means of maximum-likelihood estimation. Physical Review E, 100(6), 062106.

Response: We appreciate the suggestions for a more rigorous justification of the selection of the best fit to the distributions. We have added the corresponding citations and included a description of the tests used.

10. Line 144. The Pearson correlation test hypothesis should be revisited. It usually requires a normal distribution of the values, which seems to be not happening, at least for the degree.

Response: Thank you for the observation regarding the distribution requirement. We have reviewed the presence of correlations in terms of the Spearman rank correlation coefficient. Unlike Pearson’s correlation, it does not require the distributions to be normal. The results are relatively similar.

11. Material and methods should be extended to detail the mathematical approaches to fit the distribution and goodness of fit analysis.

Response: We have added details of our procedure to fit the distributions and the goodness of fit analysis (see changes in lines 309-318).

12. Discussion and Conclusions. There is no apparent difference between the message written in the discussion and the one written in the conclusion section. I would suggest being more specific or merging both sections.

Response: We have reorganized these parts and merged both sections.

13. Reproducibility of the article. While the data has been shared, I would recommend to also sharing the scripts used during calculations.

Response: The scripts of the most representative calculations are now shared in the FigShare link.

Submitted filename: ResponsesPlosOne2022.pdf

23 Jun 2022

PONE-D-21-38860R1

A multiplex analysis of phonological and orthographicnetworks PLOS ONE Dear Dr. Guzmán-Vargas, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Although the Authors made a thorough effort to comply with the Reviewer 2's criticisms, there are still some pointed issues which need to be addressed regarding the methodologies and analysis of the empirical data used, in particular the characterisation of the network degree distribution using the least-squares method, which is not recommended at all when the distributions present large fluctuations in their tail. The results of the fittings with most convenient methods to the two already used distributions (or other potential hypothesis) should be provided. Please submit your revised manuscript by Aug 07 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. 
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Irene Sendiña-Nadal Academic Editor PLOS ONE [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Does the manuscript adhere to the experimental procedures and analyses described in the Registered Report Protocol? If the manuscript reports any deviations from the planned experimental procedures and analyses, those must be reasonable and adequately justified. Reviewer #1: Yes Reviewer #2: Partly ********** 2. If the manuscript reports exploratory analyses or experimental procedures not outlined in the original Registered Report Protocol, are these reasonable, justified and methodologically sound? A Registered Report may include valid exploratory analyses not previously outlined in the Registered Report Protocol, as long as they are described as such. Reviewer #1: Yes Reviewer #2: Yes ********** 3. Are the conclusions supported by the data and do they address the research question presented in the Registered Report Protocol? 
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. The conclusions must be drawn appropriately based on the research question(s) outlined in the Registered Report Protocol and on the data presented.

Reviewer #1: Yes
Reviewer #2: Partly

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The revised manuscript addresses the concerns raised by R2. However, the issues raised by R1 regarding the misplaced figures still exist. This could be an artifact of TeX-diff, in which case the authors should ignore the above concerns.
Reviewer #2: First of all, I would like to thank the authors for the time and consideration taken in responding to these suggestions.

Although, in general, the authors have reviewed the comments and worked on the article, I have doubts about one of the points: the adjustments of the network degree distribution. As was stated in the previous version, there has been a lot of research and discussion on this topic during the last decade. Therefore, the fit of a degree distribution by the least-squares method is not recommended at all (A. Clauset, 2007, 2009; Anna Deluca & Álvaro Corral, 2015; among others).

One option is applying the maximum likelihood method to adjust the parameters and detect the lower cut-off points, but other statistical methods are available. Furthermore, different candidate distributions should be considered when the slopes are not very steep (as seems to be the case), typically Weibull, lognormal, power law, gamma or exponential. MLE numerical results should be provided with the values of loglikelihood, BIC or AIC. It should be considered that some of these distributions have a different number of variables. The best candidate distribution should be selected with loglikelihood, AIC or BIC. Finally, the goodness of fit should also be provided.

The results of those fits should be added to the main document or supplementary material, and ideally, the programming scripts should also be provided.

Chattopadhyay, S., Murthy, C. A., & Pal, S. K. (2014). Fitting truncated geometric distributions in large-scale real-world networks. Theoretical Computer Science, 551, 22-38.

Clauset, A., Shalizi, C. R., & Newman, M. E. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661-703.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

12 Aug 2022

Response to Reviewers

Reviewer 1

1. The revised manuscript addresses the concerns raised by R2. However, the issues raised by R1 regarding the misplaced figures still exist. This could be an artifact of TeX-diff, in which case the authors should ignore the above concerns.

Response: We have corrected the Figures.

Reviewer 2

1. Although, in general, the authors have reviewed the comments and worked on the article, I have doubts about one of the points: the adjustments of the network degree distribution. As was stated in the previous version, there has been a lot of research and discussion on this topic during the last decade.
Therefore, the fit of a degree distribution by the least-squares method is not recommended at all (A. Clauset, 2007, 2009; Anna Deluca & Álvaro Corral, 2015; among others). One option is applying the maximum likelihood method to adjust the parameters and detect the lower cut-off points, but other statistical methods are available. Furthermore, different candidate distributions should be considered when the slopes are not very steep (as seems to be the case), typically Weibull, lognormal, power law, gamma or exponential. MLE numerical results should be provided with the values of loglikelihood, BIC or AIC. It should be considered that some of these distributions have a different number of variables. The best candidate distribution should be selected with loglikelihood, AIC or BIC. Finally, the goodness of fit should also be provided. The results of those fits should be added to the main document or supplementary material, and ideally, the programming scripts should also be provided.

Response: We appreciate the suggestion to explore in more detail the problem of fitting empirical degree distributions. We have adopted this strategy and considered several candidate functions. As suggested, our approach is based on the AIC and BIC indices to identify the best fit. In addition, we have used the likelihood-ratio methodology proposed by A. Clauset et al. to confirm that the selected candidate distribution is better than the other distributions (see lines 134-157 of the main article and the Supplementary Material on FigShare, https://doi.org/10.6084/m9.figshare.14668593). Additional details, such as the fitting results, test values, and the programming scripts used, are also provided in the material available online.

Submitted filename: ResponsesRound3PlosOne2022.pdf

1 Sep 2022

A multiplex analysis of phonological and orthographic networks

PONE-D-21-38860R2

Dear Dr. Guzmán-Vargas,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Irene Sendiña-Nadal
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does the manuscript adhere to the experimental procedures and analyses described in the Registered Report Protocol? If the manuscript reports any deviations from the planned experimental procedures and analyses, those must be reasonable and adequately justified.

Reviewer #2: Yes

**********

2. If the manuscript reports exploratory analyses or experimental procedures not outlined in the original Registered Report Protocol, are these reasonable, justified and methodologically sound?
A Registered Report may include valid exploratory analyses not previously outlined in the Registered Report Protocol, as long as they are described as such.

Reviewer #2: Yes

**********

3. Are the conclusions supported by the data and do they address the research question presented in the Registered Report Protocol? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. The conclusions must be drawn appropriately based on the research question(s) outlined in the Registered Report Protocol and on the data presented.

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above.
(Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The authors have addressed all recommendations and included all requested information about the statistical analysis and fittings. Besides, they have included additional details in Supplementary Information. Finally, I just want to congratulate the authors on this interesting scientific paper and acknowledge their efforts.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

6 Sep 2022

PONE-D-21-38860R2

A multiplex analysis of phonological and orthographic networks

Dear Dr. Guzmán-Vargas:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Irene Sendiña-Nadal
Academic Editor
PLOS ONE
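
The model comparison that Reviewer #2 requested in the correspondence above (maximum-likelihood fits of several candidate distributions, ranked by log-likelihood, AIC and BIC) can be illustrated with a minimal sketch. This is not the authors' script, which is available in their FigShare material. The function names `sample_power_law` and `compare_fits` are hypothetical, the data are synthetic, the lower cut-off `xmin` is assumed to be known rather than estimated, continuous densities stand in for discrete degree distributions, and only three of the candidate families are included because their maximum-likelihood estimates have closed forms.

```python
import math
import random


def sample_power_law(n, alpha, xmin, seed=0):
    """Draw n synthetic values from a continuous power law (illustrative data only)."""
    rng = random.Random(seed)
    # Inverse-transform sampling: x = xmin * (1 - u)^(-1 / (alpha - 1)).
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def _criteria(loglik, k, n):
    """Return log-likelihood, parameter count k, AIC and BIC for n observations."""
    return {
        "loglik": loglik,
        "k": k,
        "aic": 2 * k - 2 * loglik,
        "bic": k * math.log(n) - 2 * loglik,
    }


def compare_fits(data, xmin):
    """Fit three candidate models by closed-form MLE to the values >= xmin."""
    x = [v for v in data if v >= xmin]
    n = len(x)
    logs = [math.log(v) for v in x]
    results = {}

    # Power law on [xmin, inf): alpha_hat = 1 + n / sum(ln(x / xmin)).
    s = sum(lv - math.log(xmin) for lv in logs)
    alpha = 1.0 + n / s
    ll = n * math.log(alpha - 1.0) - n * math.log(xmin) - alpha * s
    results["power_law"] = dict(_criteria(ll, 1, n), params={"alpha": alpha})

    # Exponential shifted to start at xmin: lambda_hat = 1 / (mean(x) - xmin).
    excess = sum(v - xmin for v in x)
    lam = n / excess
    ll = n * math.log(lam) - lam * excess
    results["exponential"] = dict(_criteria(ll, 1, n), params={"lambda": lam})

    # Lognormal, left untruncated as a simplification: mu and sigma from ln(x).
    mu = sum(logs) / n
    sigma = math.sqrt(sum((lv - mu) ** 2 for lv in logs) / n)
    ll = -sum(logs) - n * math.log(sigma * math.sqrt(2 * math.pi)) - n / 2
    results["lognormal"] = dict(_criteria(ll, 2, n), params={"mu": mu, "sigma": sigma})
    return results


if __name__ == "__main__":
    data = sample_power_law(5000, alpha=2.5, xmin=1.0, seed=1)
    fits = compare_fits(data, xmin=1.0)
    # Lower AIC/BIC indicates the preferred candidate.
    for name, fit in sorted(fits.items(), key=lambda kv: kv[1]["aic"]):
        print(f"{name:12s} logL={fit['loglik']:.1f} AIC={fit['aic']:.1f} BIC={fit['bic']:.1f}")
```

The two-parameter lognormal pays a larger penalty than the one-parameter models, which is the point the reviewer raises about candidates with different numbers of parameters. A fuller analysis would also estimate the lower cut-off, add the Weibull and gamma candidates, and report a goodness-of-fit test, none of which this sketch attempts.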


