Rudolf Hanel, Bernat Corominas-Murtra, Bo Liu, Stefan Thurner.
Abstract
Most standard methods based on maximum likelihood (ML) estimates of power-law exponents can only be reliably used to identify exponents smaller than minus one. The argument that power laws are otherwise not normalizable depends on the underlying sample space the data is drawn from, and is true only for sample spaces that are unbounded from above. Power laws obtained from bounded sample spaces (as is the case for practically all data-related problems) are always free of such limitations, and maximum likelihood estimates can be obtained for arbitrary powers without restrictions. Here we first derive the appropriate ML estimator for arbitrary exponents of power-law distributions on bounded discrete sample spaces. We then show that an almost identical estimator also works perfectly for continuous data. We implemented this ML estimator and compare its performance with previous attempts. We present a general recipe of how to use these estimators and present the associated computer codes.
Year: 2017 PMID: 28245249 PMCID: PMC5330461 DOI: 10.1371/journal.pone.0170920
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Decision tree of questions that should be clarified before estimating power-law exponents from data.
The tree shows under which conditions the fitting algorithms developed in this paper, r_plfit and r_plhistfit, can be used.
Fig 3. Comparison of the three power-law exponent estimators, LS, MLCSN, and ML*.
For 400 values of λ in the range between 0 and 4, we sample N = 10,000 events from Ω = {1, ⋯, 1,000} according to a power-law probability distribution p(x|λ, Ω) ∝ x^(−λ). The estimated exponents λ_est for the estimators LS (red), MLCSN (green), and the new ML* (black, λ_est = λ*) are plotted against the true exponent λ of the distribution the samples are drawn from. Below λ ∼ 1.5 the MLCSN estimator no longer works reliably. MLCSN and ML* work equally well in the range 1.5 < λ < 3.5. Outside this range ML* performs consistently better than the other methods. The inset shows the mean-square error σ^2 of the estimated exponents. The LS estimator has a much higher σ^2 than the ML* estimator over the entire region. The blue dot represents the ML* estimate for the Zipf exponent of C. Dickens’ “A Tale of Two Cities”. This exponent could never reliably be obtained from the rank-ordered distribution using MLCSN, whereas ML* works fine even for values of λ ∼ 0.
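The sampling experiment described in this caption can be illustrated with a minimal Python sketch. It is not the authors' r_plfit code: the function names `sample_power_law`, `mean_log_under_model`, and `ml_exponent`, the bisection bracket, and the tolerance are assumptions made for illustration. The sketch relies on the fact that, for p(x|λ, Ω) ∝ x^(−λ) on a finite Ω, the log-likelihood is maximal where the model mean of log x equals the sample mean of log x. That model mean decreases monotonically in λ, so a simple bisection is enough.

```python
import math
import random


def sample_power_law(lam, omega, n, rng):
    """Draw n values from omega with probability proportional to x**(-lam)."""
    weights = [x ** (-lam) for x in omega]
    return rng.choices(omega, weights=weights, k=n)


def mean_log_under_model(lam, omega):
    """Expected value of log(x) under p(x) ∝ x**(-lam) on the finite set omega."""
    weights = [x ** (-lam) for x in omega]
    z = sum(weights)  # normalization; finite for any lam because omega is bounded
    return sum(w * math.log(x) for w, x in zip(weights, omega)) / z


def ml_exponent(sample, omega, lo=-10.0, hi=10.0, tol=1e-8):
    """ML estimate of lam by bisection on the likelihood equation.

    The bracket [lo, hi] and tol are illustrative assumptions.
    """
    target = sum(math.log(x) for x in sample) / len(sample)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The model mean of log(x) falls as lam grows, so a model mean
        # above the sample mean means lam must be larger.
        if mean_log_under_model(mid, omega) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    omega = list(range(1, 1001))  # Ω = {1, ..., 1000} as in the caption
    rng = random.Random(0)
    data = sample_power_law(0.7, omega, 10_000, rng)
    print("estimated exponent:", ml_exponent(data, omega))
```

Because Ω is bounded, the normalization sum is finite for every λ, which is why this kind of estimator is not restricted to large exponents.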
Fig 2. The four types of distribution functions.
Data is sampled from a power-law distribution p(x) ∝ x^(−λ) with exponent λ = 0.7 (red line). The relative frequencies f are shown for N = 10,000 sampled data points according to their natural (prior) ordering that is associated with p (blue). The rank-ordered distribution (posterior) is shown in yellow, where states i are ordered according to their observed relative frequencies f. The rank-ordered distribution follows a power law, except for the exponential decay that starts at rank ∼ 500. A low-frequency cut-off should be used to remove this part when estimating exponents. The inset shows the frequency distribution ϕ(n) that describes how many states x appear n times (green). The frequency distribution has a maximum and a power-law tail with exponent α = 1 + 1/λ ∼ 2.43. To estimate α, one should only consider the tail of the frequency distribution function.
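The distribution types named in this caption can be built from a sample in a few lines. The following is a hedged Python sketch; the helper names `relative_frequencies`, `rank_ordered`, `frequency_distribution`, and `tail_exponent` are hypothetical and not taken from the paper's code. It also evaluates the relation α = 1 + 1/λ quoted above.

```python
from collections import Counter


def relative_frequencies(sample, omega):
    """Relative frequency f of each state, in the natural ordering of omega."""
    counts = Counter(sample)
    n = len(sample)
    return [counts.get(x, 0) / n for x in omega]


def rank_ordered(freqs):
    """Rank-ordered distribution: the same frequencies, sorted from largest to smallest."""
    return sorted(freqs, reverse=True)


def frequency_distribution(sample):
    """Map n -> number of observed states that appear exactly n times.

    States that never occur in the sample (n = 0) are not counted here.
    """
    counts = Counter(sample)
    return Counter(counts.values())


def tail_exponent(lam):
    """Tail exponent alpha of the frequency distribution, alpha = 1 + 1/lam."""
    return 1.0 + 1.0 / lam


if __name__ == "__main__":
    toy_sample = [1, 1, 1, 2, 2, 3, 5]
    print(relative_frequencies(toy_sample, range(1, 6)))
    print(frequency_distribution(toy_sample))
    print(round(tail_exponent(0.7), 2))
```

For λ = 0.7 the relation gives α = 1 + 1/0.7 ≈ 2.43, matching the value stated in the caption.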
Comparison of the estimators ML* and MLCSN on empirical data sets that were used in [23].
These include the frequency of surnames, intensity of wars, populations of cities, earthquake intensity, numbers of religious followers, citations of scientific papers, counts of words, wealth of the Forbes 500 firms, numbers of papers authored, solar flare intensity, terrorist attack severity, numbers of links to websites, and forest fire sizes. We added the word frequencies in the novel “A Tale of Two Cities” (C. Dickens). The second column states whether α or λ was estimated. The exponents reported in [23] are found in column CSN1; those reproduced by us by applying their algorithm to the data [23, 34–37] are shown in column CSN2. The latter correspond well with the results of the new ML* algorithm. For values λ < 1.5, CSN cannot be used. We list the corresponding Kolmogorov-Smirnov test values for the two estimators, KSCSN and KS*.
| data set | exp. | CSN1 | CSN2 | ML* | KSCSN | KS* |
|---|---|---|---|---|---|---|
| blackouts | λ | 2.3 | 2.27 | 2.25 | 0.061 | 0.031 |
| surnames | α | 2.5 | 2.49 | 2.66 | 0.041 | 0.019 |
| int. wars | λ | 1.7 | 1.73 | 1.83 | 0.078 | 0.076 |
| city pop. | λ | 2.37 | 2.36 | 2.31 | 0.019 | 0.016 |
| quake int. | λ | 1.64 | 1.64 | 1.88 | 0.092 | 0.085 |
| relig. fol. | λ | 1.8 | 1.79 | 1.61 | 0.091 | 0.095 |
| citations | λ | 3.16 | 3.16 | 3.10 | 0.010 | 0.018 |
| words | α | 1.95 | 1.95 | 1.99 | 0.009 | 0.015 |
| wealth | λ | 2.3 | 2.34 | 2.30 | 0.063 | 0.066 |
| papers | λ | 4.3 | 4.32 | 3.89 | 0.079 | 0.082 |
| sol. flares | λ | 1.79 | 1.79 | 1.81 | 0.009 | 0.021 |
| terr. attacks | λ | 2.4 | 2.37 | 2.36 | 0.018 | 0.017 |
| websites | λ | 2.336 | 2.12 | 1.72 | 0.025 | 0.056 |
| forest fires | λ | 2.2 | 2.16 | 2.46 | 0.036 | 0.034 |
| Dickens novel | λ | - | - | 1.04 | - | 0.017 |
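
The KS columns compare how closely each fitted power law follows the data. As a rough illustration, the sketch below computes a Kolmogorov-Smirnov distance between a sample and a power law on a bounded discrete sample space. The exact KS definition used for the table is not given in this excerpt, so the choice here (maximum absolute difference between the empirical and model cumulative distributions over Ω) is an assumption, and `ks_distance` is a hypothetical helper name.

```python
from collections import Counter
from itertools import accumulate


def ks_distance(sample, lam, omega):
    """Largest absolute gap between empirical and model CDFs on a finite omega.

    Model: p(x) ∝ x**(-lam) for x in omega (assumed definition, see lead-in).
    """
    omega = sorted(omega)
    weights = [x ** (-lam) for x in omega]
    z = sum(weights)
    model_cdf = list(accumulate(w / z for w in weights))

    counts = Counter(sample)
    n = len(sample)
    empirical_cdf = list(accumulate(counts.get(x, 0) / n for x in omega))

    return max(abs(e - m) for e, m in zip(empirical_cdf, model_cdf))


if __name__ == "__main__":
    # Toy data: compare a small sample with a uniform model (lam = 0) on {1, 2, 3, 4}.
    print(ks_distance([1, 1, 2, 4], 0.0, [1, 2, 3, 4]))
```

Smaller values indicate a closer match between the fitted distribution and the data, which is how the KSCSN and KS* columns can be read side by side.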