| Literature DB >> 35756089 |
Abstract
The lexical complexity of homogeneous texts, especially those produced by an institutional author over time, exhibits a generally observed increasing trend with local random fluctuations. Such an irreversible entropic process fits cogently into dynamical complexity system theory, where the social, economic, and cultural missions such texts are set to serve constitute the underlying driving momentum for the texts to adapt from low to high complexity. Structural equations have been shown to be effective in modeling the macroscopic behavior of this entropic process in homogeneous texts. The current work formulates the problem as a time series modeling task applied to a large sociolinguistic corpus in written Chinese. The findings show that this alternative approach not only produces models as valid, and with goodness of fit as strong, as those of the structural equation approach, but also offers, by design, additional benefits in explaining the entropic process of homogeneous texts within the dynamical complexity system framework. Some technical challenges, such as phase change in model calibration, are also solved at lower cost with the newly proposed approach. Further directions are pointed out for a fuller comparison of these approaches in the setup of the current study and in corpus linguistics in general.
Keywords: ARIMA model; Dynamical complexity; Entropy; Homogeneous texts; Time series
Year: 2022 PMID: 35756089 PMCID: PMC9206426 DOI: 10.1007/s11135-022-01451-4
Source DB: PubMed Journal: Qual Quant ISSN: 0033-5177
Descriptive statistics of the CGWR texts in the current study
| Category | Parameter | Value |
|---|---|---|
| Time series tags | Total number of texts | 52 |
| | Frequency of delivery | Annual |
| | Serial correlation | Yes |
| Average lexical statistics (in Chinese words) | Type | 2212 |
| | Token | 10,500 |
| | Entropy | 6.621 |
| Discourse parameters | Genre | Institutional writing |
| | Domain | Socio-economic, public, world affairs |
| | Medium | Public archive, internet |
| | Register | High formal |
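The entropy figure in the table above (6.621) is the Shannon entropy, in bits, of the word-frequency distribution of a text. A minimal sketch of that computation follows; the function name `lexical_entropy` is illustrative only, not taken from the paper.

```python
from collections import Counter
from math import log2

def lexical_entropy(tokens):
    """Shannon entropy (in bits) of the word-frequency distribution of a token list."""
    counts = Counter(tokens)
    n = len(tokens)
    # H = -sum over word types of p(w) * log2 p(w)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy check: a text of 4 distinct equiprobable words has entropy log2(4) = 2 bits.
print(lexical_entropy(["a", "b", "c", "d"]))  # → 2.0
```

A richer vocabulary spread over more token positions pushes this value up, which is why entropy can track the rising lexical complexity of the CGWR series.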
Fig. 1 Types and tokens in words for the corpus under study
Fig. 2 Word cloud example of the CGWR for the years 2007–2008
Fig. 3 Plot of the CGWR entropic process and the linear trend fitting
Fig. 4 ACF plot of the CGWR entropic process
Unit root test statistics for the entropic process in the original series (maximum number of lags included: 12; test mode: nc)
| Unit root criterion | ADF statistic | p-value | CV (significance level 0.01) | CV (significance level 0.05) | CV (significance level 0.10) |
|---|---|---|---|---|---|
| AIC | 1.9491 | 0.9887 | – 2.6150 | – 1.9479 | – 1.6122 |
| BIC | 0.7414 | 0.8746 | – 2.6119 | – 1.9475 | – 1.6124 |
Unit root test statistics for the first-order differenced entropy series (maximum number of lags included: 12; test mode: c)
| Unit root criterion | ADF statistic | p-value | CV (significance level 0.01) | CV (significance level 0.05) | CV (significance level 0.10) |
|---|---|---|---|---|---|
| AIC | – 5.7360 | 6.4348e-07 | – 3.5778 | – 2.9253 | – 2.6008 |
| BIC | – 12.4612 | 3.4024e-23 | – 3.5685 | – 2.9214 | – 2.5987 |
Fig. 5 ACF and PACF plots of the first-order difference of the entropic process
Fig. 6 First-order difference process and the fitted curve
ARMA model estimation and significance (maximum iterations: 100,000)
| Model and parameters | Estimate | p-value |
|---|---|---|
| Model (in F-test statistic) | – 7.0117 | 6.8959e-10 |
| | 0.0128 | 1.1696e-26 |
| | – 1.5310 | 2.9077e-29 |
| | – 0.8100 | 1.4995e-14 |
| | 0.6561 | 1.6017e-09 |
| | – 0.6561 | 4.6939e-07 |
| | – 0.9999 | 4.7785e-21 |
Out-of-sample forecasting statistics
| Forecast horizon | MSE | MAE | MAPE | Theil's U |
|---|---|---|---|---|
| 1-step-ahead forecast | 0.0243 | 0.1195 | 0.0172 | 0.5283 |
| 2-step-ahead forecast | 0.009 | 0.0629 | 0.0091 | 0.2955 |
| 3-step-ahead forecast | 0.0107 | 0.0816 | 0.0118 | 0.3148 |
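The four accuracy measures in the table can be computed directly from the hold-out actuals and the forecasts. A minimal sketch follows; note that Theil's statistic has several variants, and the U1 form used here is an assumption, since the paper does not specify which variant it reports:

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """MSE, MAE, MAPE, and Theil's U1 inequality coefficient for a forecast."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / actual))          # assumes no zero actuals
    # Theil's U1: RMSE scaled by the root-mean-square levels of both series.
    u1 = np.sqrt(mse) / (np.sqrt(np.mean(actual ** 2)) + np.sqrt(np.mean(predicted ** 2)))
    return mse, mae, mape, u1

print(forecast_metrics([2.0, 2.0], [1.0, 3.0]))
```

U1 lies in [0, 1], with 0 indicating a perfect forecast, so the decreasing values across horizons in the table indicate improving relative accuracy.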
Fig. 7 QQ normality plot