Edgar Baeza-Blancas, Bibiana Obregón-Quintana, Candelario Hernández-Gómez, Domingo Gómez-Meléndez, Daniel Aguilar-Velázquez, Larry S Liebovitch, Lev Guzmán-Vargas.
Abstract
We present a study of natural language using the recurrence network method. In our approach, the repetition of patterns of characters is evaluated without considering the word structure in written texts from different natural languages. Our dataset comprises 85 eBooks written in 17 different European languages. The similarity between patterns of length m is determined by the Hamming distance, and a value r is used to define a match between two patterns; i.e., a repetition occurs if the Hamming distance is less than or equal to the given threshold value r. In this way, we calculate the adjacency matrix, where a connection between two nodes exists when a match occurs. Next, the recurrence network is constructed for the texts and some representative network metrics are calculated. Our results show that the average values of network density, clustering, and assortativity are larger than those of their corresponding shuffled versions, while for metrics such as closeness, both original and random sequences exhibit similar values. Moreover, our calculations show similar average values for density among languages that belong to the same linguistic family. In addition, the application of a linear discriminant analysis leads to well-separated clusters of language families based on the network-density properties. Finally, we discuss our results in the context of the general characteristics of written texts.
Keywords: natural languages; patterns repetition; recurrence networks
Year: 2019 PMID: 33267231 PMCID: PMC7515007 DOI: 10.3390/e21050517
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Recurrence symmetric matrix for the beginning of Hamlet’s famous soliloquy: To-be-or-not-to-be. Here and we set . The resulting matrix has 16 rows and columns.
|
| To_ | o_b | _be | be_ | e_o | _or | or_ | r_n | _no | not | ot_ | t_t | _to | to_ | ob_ | _be |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| To_ |
| 0 | 0 |
| 0 |
|
| 0 | 0 |
|
|
| 0 |
| 0 | 0 |
| o_b | 0 |
| 0 | 0 |
| 0 |
|
| 0 | 0 |
|
| 0 | 0 |
| 0 |
| _be | 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 |
|
| be_ |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| e_o | 0 |
| 0 | 0 |
| 0 | 0 |
|
| 0 | 0 |
|
| 0 |
| 0 |
| _or | 0 | 0 |
| 0 | 0 |
| 0 | 0 |
|
| 0 | 0 |
|
| 0 |
|
| or |
|
| 0 |
| 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 |
|
| 0 |
| r_n | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 |
| 0 |
| _no | 0 | 0 |
| 0 |
|
| 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 |
|
| not |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 |
| 0 |
| 0 | 0 |
| ot_ |
|
| 0 |
| 0 | 0 |
| 0 | 0 | 0 |
| 0 |
|
|
| 0 |
| t_t |
|
| 0 | 0 |
| 0 | 0 |
| 0 | 0 | 0 |
| 0 |
|
| 0 |
| _to | 0 | 0 |
| 0 |
|
| 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 |
|
| to_ |
| 0 | 0 |
| 0 |
|
| 0 | 0 |
|
|
| 0 |
| 0 | 0 |
| o_b | 0 |
| 0 | 0 |
| 0 |
|
| 0 | 0 |
|
| 0 | 0 |
| 0 |
| _be | 0 | 0 |
| 0 | 0 |
| 0 | 0 |
| 0 | 0 | 0 |
| 0 | 0 |
|
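The construction described in the abstract (overlapping character patterns, Hamming distance, threshold r) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the pattern length m = 3 is read off the three-character patterns in the table, while the threshold r = 2, the lowercasing of the text, and the function names `hamming` and `recurrence_matrix` are assumptions made here for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def recurrence_matrix(text, m, r):
    """Return (patterns, matrix) for all overlapping patterns of length m.

    matrix[i][j] is 1 when patterns i and j differ in at most r positions
    (a "match"), and 0 otherwise.
    """
    patterns = [text[i:i + m] for i in range(len(text) - m + 1)]
    matrix = [[1 if hamming(p, q) <= r else 0 for q in patterns]
              for p in patterns]
    return patterns, matrix


# Assumed settings: spaces written as "_", case ignored, m = 3, r = 2.
text = "To_be_or_not_to_be".lower()
patterns, R = recurrence_matrix(text, m=3, r=2)

for p, row in zip(patterns, R):
    print(p, " ".join(map(str, row)))
```

Because the Hamming distance is symmetric, the matrix built this way is symmetric by construction, and the 18-character string yields 16 patterns of length 3.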
Figure 1. Log-linear plot of density vs. the distance r for several values of the pattern length m. Here we show the cases and r runs from 1 to , where . The fit corresponds to the case , which yields .
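To illustrate how density depends on the threshold r, the following sketch computes the density of the recurrence network for every r from 0 to m. It assumes the usual definition of network density (the fraction of possible node pairs that are linked, self-matches excluded); the sample string, the value m = 4, and the function names are hypothetical and are not taken from the study's dataset.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def density(text, m, r):
    """Fraction of pattern pairs whose Hamming distance is at most r."""
    patterns = [text[i:i + m] for i in range(len(text) - m + 1)]
    n = len(patterns)
    # combinations() pairs patterns by position, so identical patterns
    # occurring at different places in the text still count as a link.
    links = sum(1 for p, q in combinations(patterns, 2) if hamming(p, q) <= r)
    return 2 * links / (n * (n - 1))


# Hypothetical sample text and pattern length, for illustration only.
sample = "to_be_or_not_to_be_that_is_the_question"
m = 4
densities = {r: density(sample, m, r) for r in range(m + 1)}

for r, d in densities.items():
    print(r, round(d, 4))
```

Density can only grow with r, and it reaches 1 at r = m because any two patterns of length m differ in at most m positions.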
Figure 2. Representative metrics of recurrence-pattern networks for different languages. (a) Density for languages grouped by linguistic families. (b) Average clustering coefficient. (c) Closeness centrality. (d) Assortativity coefficient. Vertical bars indicate the standard deviation of the data.
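As a rough illustration of one of these metrics, the sketch below computes the average local clustering coefficient of a recurrence network and of the network obtained from a shuffled copy of the same text. The sample string, the settings m = 3 and r = 1, the fixed random seed, and the helper names are all assumptions made for this example; nodes with fewer than two neighbors are taken to contribute zero, which is one common convention.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def recurrence_graph(text, m, r):
    """Neighbor sets of the recurrence network (no self-loops)."""
    patterns = [text[i:i + m] for i in range(len(text) - m + 1)]
    n = len(patterns)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(patterns[i], patterns[j]) <= r:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def average_clustering(neighbors):
    """Mean over all nodes of the fraction of linked neighbor pairs."""
    total = 0.0
    for nbrs in neighbors:
        k = len(nbrs)
        if k < 2:
            continue  # such nodes contribute 0 to the average
        nb = list(nbrs)
        links = sum(1 for a in range(k) for b in range(a + 1, k)
                    if nb[b] in neighbors[nb[a]])
        total += 2 * links / (k * (k - 1))
    return total / len(neighbors)


# Hypothetical sample text; the shuffled copy keeps the same characters.
sample = "to_be_or_not_to_be_that_is_the_question"
chars = list(sample)
random.Random(0).shuffle(chars)
shuffled = "".join(chars)

c_original = average_clustering(recurrence_graph(sample, m=3, r=1))
c_shuffled = average_clustering(recurrence_graph(shuffled, m=3, r=1))
print(round(c_original, 4), round(c_shuffled, 4))
```

A string this short says nothing about the trends reported for full texts; the point is only to show how the original and shuffled sequences can be compared with the same code path.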
Figure 3. Mean nearest-neighbor connectivity as a function of the degree for (a) Germanic, (b) Romance, (c) Slavic, and (d) Uralic linguistic families. For each language, we also show the values of corresponding to shuffled texts. A scaling behavior is observed for all cases of the form . We estimate the scaling exponent for degree values , yielding the average values and for the original and random data, respectively. As a guide for the eye, the dashed line corresponds to the slope = .
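The quantity plotted here can be sketched as follows, assuming the standard definition of mean nearest-neighbor connectivity: for each degree k, average the mean neighbor degree over all nodes that have degree k. The sample string, the settings m = 3 and r = 1, and the function names are hypothetical choices for illustration, and no scaling fit is attempted.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def recurrence_neighbors(text, m, r):
    """Neighbor lists of the recurrence network (no self-loops)."""
    patterns = [text[i:i + m] for i in range(len(text) - m + 1)]
    n = len(patterns)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(patterns[i], patterns[j]) <= r:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def mean_nn_connectivity(neighbors):
    """Map each degree k to the mean neighbor degree of degree-k nodes."""
    degree = [len(nb) for nb in neighbors]
    by_degree = defaultdict(list)
    for node, nb in enumerate(neighbors):
        if nb:  # isolated nodes have no neighbors to average over
            by_degree[degree[node]].append(
                sum(degree[j] for j in nb) / len(nb))
    return {k: sum(v) / len(v) for k, v in sorted(by_degree.items())}


# Hypothetical sample text and settings, for illustration only.
sample = "to_be_or_not_to_be_that_is_the_question"
knn = mean_nn_connectivity(recurrence_neighbors(sample, m=3, r=1))

for k, value in knn.items():
    print(k, round(value, 3))
```

On real texts, a curve that rises with k would indicate assortative mixing, which is the behavior the assortativity coefficient summarizes in a single number.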
Figure 4. Results of classification analysis applied to European languages. (a) Results of the linear discriminant method. Here we show the projection of density values from pattern lengths . For each m-value and for each language, we considered ten segments with length to obtain ten values. Next, languages were labeled in classes according to the linguistic family to which they belong (Romance, Germanic, Slavic, Uralic). (b) Results of the application of the k-nearest-neighbor classification method to the data in panel (a), with the same label assigned to languages of the same family. We used neighbors in the classifier. We observe that the families are segregated, except in the case of the Uralic family, which led to two disjoint regions. (c) Results of the confusion matrix. The system makes a clear distinction between almost all language families, except for Uralic, where we observe a problem distinguishing this family from Slavic and Romance.