
The Lorenz Curve: A Proper Framework to Define Satisfactory Measures of Symbol Dominance, Symbol Diversity, and Information Entropy.

Julio A. Camargo

Abstract

Novel measures of symbol dominance (dC1 and dC2), symbol diversity (DC1 = N (1 - dC1) and DC2 = N (1 - dC2)), and information entropy (HC1 = log2 DC1 and HC2 = log2 DC2) are derived from Lorenz-consistent statistics that I had previously proposed to quantify dominance and diversity in ecology. Here, dC1 refers to the average absolute difference between the relative abundances of dominant and subordinate symbols, with its value being equivalent to the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability; dC2 refers to the average absolute difference between all pairs of relative symbol abundances, with its value being equivalent to twice the area between the Lorenz curve and the 45-degree line of equiprobability; N is the number of different symbols or maximum expected diversity. These Lorenz-consistent statistics are compared with statistics based on Shannon's entropy and Rényi's second-order entropy to show that the former have better mathematical behavior than the latter. The use of dC1, DC1, and HC1 is particularly recommended, as only changes in the allocation of relative abundance between dominant (pd > 1/N) and subordinate (ps < 1/N) symbols are of real relevance for probability distributions to achieve the reference distribution (pi = 1/N) or to deviate from it.

Keywords:  Camargo statistics; Lorenz curve; Rényi’s entropy; Shannon’s entropy; information entropy; symbol diversity; symbol dominance

Year:  2020        PMID: 33286315      PMCID: PMC7517034          DOI: 10.3390/e22050542

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

Following the early use of Shannon’s [1] entropy (H) by some theoretical ecologists during the 1950s [2,3,4], H has been extensively used in community ecology to quantify species diversity. Ecologists have considered the relative abundance or probability of the ith symbol in a message or sequence of N different symbols whose meaning is irrelevant [1,5,6] as the relative abundance or probability of the ith species in a community or assemblage of S different species whose phylogeny is irrelevant (i.e., all species are considered taxonomically equally distinct) [4,7,8]. This use of H implies that the concept of species diversity is directly related to the concept of information entropy, basically representing the amount of information or uncertainty in a probability distribution defined for a set of N possible symbols [1] or a set of S possible species [4]. H takes values from 0 to log2 N or log2 S and is properly expressed in bits, but it can also be expressed in nats or dits (also called bans, decits, or Hartleys) if the natural logarithm or the decimal logarithm is calculated [1,4,5,6,7,8]. In recent decades, several ecologists have, however, claimed that H is an unsatisfactory diversity index because species diversity actually takes values from 1 to S and is ideally expressed in units of species (i.e., in the same units as S). 
Keeping this perspective in mind, and only considering the number of different symbols as the number of different species and the relative abundances of symbols as the relative abundances of species, Hill [9] proposed the exponential form of Shannon’s [1] entropy (HS) and the exponential form of Rényi’s [10] second-order entropy (HR) = the reciprocal of Simpson’s [11] concentration statistic (λ) as better alternatives to quantify species diversity, thereby assuming that the amount of information or uncertainty in a probability distribution defined for a set of S possible species was mathematically equivalent to the logarithm of its related species diversity. Similarly, we can assume that the amount of information or uncertainty (expressed in bits) in a probability distribution defined for a set of N possible symbols is mathematically equivalent to the binary logarithm of its related symbol diversity. Additionally, we can assume that symbol dominance characterizes the extent of relative abundance inequality among different symbols, particularly between dominant and subordinate symbols, and that symbol diversity equals the number of different symbols (N) or maximum expected diversity in any given message with equiprobability. On the basis of these working assumptions, I first use the Lorenz curve [12] as the key framework to assess symbol dominance, symbol diversity, and information entropy. The contrast between symbol dominance and symbol redundancy is also highlighted. Subsequently, novel measures of symbol dominance (dC1 and dC2), symbol diversity (DC1 and DC2), and information entropy (HC1 and HC2) are derived from Lorenz-consistent statistics that I had previously proposed to quantify dominance and diversity in community ecology [13,14,15,16,17] and landscape ecology [18]. 
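As an illustration of these working assumptions, Hill's two diversity numbers can be computed directly from a probability vector (a sketch of my own; the function names are not from the paper):

```python
import math

def shannon_entropy(p):
    """Shannon's entropy HS of a probability distribution, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def hill_diversities(p):
    """Hill's diversity numbers: the exponential form of Shannon's
    entropy (DS) and the reciprocal of Simpson's concentration (DR)."""
    ds = 2 ** shannon_entropy(p)        # effective number of symbols, order 1
    dr = 1 / sum(pi ** 2 for pi in p)   # effective number of symbols, order 2
    return ds, dr
```

For the two-symbol distribution p = (0.6, 0.4), HS ≈ 0.971 bits, DS ≈ 1.96 and DR ≈ 1.92 effective symbols; both diversities equal N only under equiprobability.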
Finally, Lorenz-consistent statistics (dC1, dC2, DC1, DC2, HC1, and HC2) are compared with HS-based and HR-based statistics (dS, dR, DS, DR, HS, and HR) to show that the former have better mathematical behavior than the latter when measuring symbol dominance, symbol diversity, and information entropy in hypothetical messages. In this regard, I recently found that the corresponding versions of dC1, dC2, DC1, and DC2 exhibited better mathematical behavior than the corresponding versions of dS, dR, DS, and DR when measuring land cover dominance and diversity in hypothetical landscapes [18]. This better mathematical behavior was inherent to the compatibility of dC1 and dC2 with the Lorenz-curve-based graphical representation of land cover dominance [18]. The Lorenz curve [12] was introduced in the early twentieth century as a graphical method to assess the inequality in the distribution of income among the individuals of a population. Subsequently, this graphical method and Lorenz-consistent indices of income inequality, such as Gini’s [19,20] index and Schutz’s [21] index, have become popular in the field of econometrics (see reviews in [22,23]). More recently, owing to the increasing economic inequality during the present market globalization [24], some authors have supported the use of Bonferroni’s [25] curve and Zenga’s [26] curve and related indices to better assess poverty, as these inequality measures are more sensitive to the lower levels of the income distribution [27,28,29]. To me, however, the Lorenz curve represents the best and most logical framework to define satisfactory indices of inequality (dominance) and associated measures of diversity or entropy.

2. Materials and Methods

2.1. Assessing Symbol Dominance, Symbol Diversity, and Information Entropy within the Framework of the Lorenz Curve

In econometrics, the Lorenz curve [12] is ideally depicted within a unit (1 × 1) square, in which the cumulative proportion of income (the vertical y-axis) is related to the cumulative proportion of individuals (the horizontal x-axis), ranked from the person with the lowest income to the person with the highest income. The 45-degree (diagonal) line represents equidistribution or perfect income equality. Income inequality may be quantified as the maximum vertical distance from the Lorenz curve to the 45-degree line of equidistribution if only differences in income between the rich and the poor are of interest (this measure being equivalent to the value of Schutz’s inequality index), or as twice the area between the Lorenz curve and the 45-degree line of equidistribution if differences in income among all of the individuals are of interest (this measure being equivalent to the value of Gini’s inequality index), with both measures exhibiting the same value whenever income inequality occurs only between the rich and the poor (see reviews in [22,23]; also see [18]). Therefore, in any given population with M individuals, income inequality takes a minimum possible value of 0 when every person has the same income (= total income/M, including M = 1) and a maximum possible value of 1 − 1/M when a single person has all the income and the remaining M − 1 people have none, as persons with no income can exist in a population. If we assume that symbol dominance characterizes the extent of relative abundance inequality among different symbols, particularly between dominant and subordinate symbols, then the Lorenz-curve-based graphical representation of symbol dominance is given by the separation of the Lorenz curve from the 45-degree line of equiprobability, in which every symbol i has the same relative abundance (pi = 1/N, with N = the number of different symbols). 
This separation may be quantified as the maximum vertical distance from the Lorenz curve to the 45-degree line if only differences in relative abundance between dominant and subordinate symbols are of interest, or as twice the area between the Lorenz curve and the 45-degree line if differences in relative abundance among all symbols are of interest, with both measures giving the same value whenever relative abundance inequality occurs only between dominant and subordinate symbols. In any given message with equiprobability, the relative abundance of each different symbol equals 1/N, meaning a symbol may be objectively regarded as dominant if its probability is greater than 1/N and as subordinate if its probability is less than 1/N. I had already used an equivalent method to discriminate between dominant and subordinate species [13,14,15,16,17] and between dominant and subordinate land cover types [18]. Thus, symbol dominance takes a minimum possible value of 0 when every different symbol has the same relative abundance (= 1/N, including N = 1), and approaches a maximum possible value of 1 − 1/N when a single symbol has a relative abundance very close to 1 and the remaining N − 1 symbols have minimum relative abundances (> 0), as symbols with no abundance or zero probability do not exist in a message. In addition, if we assume that symbol diversity equals the number of different symbols or maximum expected diversity (N) in any given message with equiprobability (symbol dominance = 0 because pi = 1/N), then symbol diversity in any given message with symbol dominance > 0 must equal the maximum expected diversity minus the impact of symbol dominance on it; that is, symbol diversity = N − (N × symbol dominance) = N (1 − symbol dominance). 
This Lorenz-consistent measure of symbol diversity is a function of both the number of different symbols and the equal distribution of their relative abundances (i.e., symbol diversity is a probabilistic concept free of semantic attributes), taking values from 1 to N (maximum diversity if pi = 1/N) and being properly expressed in units of symbols. Therefore, symbol diversity/N = 1 − symbol dominance (i.e., symbol dominance triggers the inequality between symbol diversity and its maximum expected value). It should also be evident that the reciprocal of symbol diversity refers to the concentration of relative abundance in the same symbol, and consequently may be regarded as a Lorenz-consistent measure of symbol redundancy = 1/(N (1 − symbol dominance)). This redundancy measure is a function of both the fewness of different symbols and the unequal distribution of their relative abundances, taking values from 1/N to 1 (maximum redundancy if N = 1). Thus, symbol dominance (relative abundance inequality among different symbols) and symbol redundancy are distinct concepts, although the value of the former affects the value of the latter. Lastly, if we assume that information entropy is mathematically equivalent to the binary logarithm of its related symbol diversity, then the resulting Lorenz-consistent measure of information entropy = log2 (N (1 − symbol dominance)). This entropy measure takes values from 0 to log2 N (maximum entropy if pi = 1/N) and is properly expressed in bits, quantifying the amount of information or uncertainty in a probability distribution defined for a set of N possible symbols. Obviously, the degree of uncertainty attains a minimum value of 0 as symbol redundancy reaches a maximum value of 1.
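The two Lorenz-curve readings of symbol dominance described above can be sketched as follows (an illustration of my own, not code from the paper; symbols are ranked from lowest to highest relative abundance, as on the curve):

```python
def lorenz_dominance(p):
    """Symbol dominance read off the Lorenz curve: the maximum vertical
    distance to the 45-degree line of equiprobability, and twice the area
    between the curve and that line (computed by the trapezoidal rule)."""
    q = sorted(p)                  # rank symbols from lowest to highest abundance
    n = len(q)
    cum = [0.0]
    for pi in q:
        cum.append(cum[-1] + pi)   # cumulative proportion of abundance
    max_dist = max((i + 1) / n - cum[i + 1] for i in range(n))
    area_under = sum((cum[i] + cum[i + 1]) / 2 for i in range(n)) / n
    return max_dist, 2 * (0.5 - area_under)
```

For p = (0.6, 0.4) both readings give 0.1, and for p = (0.6, 0.1, 0.1, 0.1, 0.1) both give 0.4, since in these cases inequality occurs only between dominant and subordinate symbols.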

2.2. Deriving Measures of Symbol Dominance, Symbol Diversity, and Information Entropy from Lorenz-Consistent Statistics

Following the theoretical approach of assessing symbol dominance, symbol diversity, and information entropy within the framework of the Lorenz curve, novel measures of symbol dominance (dC1 and dC2), symbol diversity (DC1 and DC2), and information entropy (HC1 and HC2) are derived from Lorenz-consistent statistics, which I had previously proposed to quantify species dominance and diversity [13,14,15,16,17] and land cover dominance and diversity [18]. In this derivation, the number of different species or land cover types is considered as the number of different symbols, and the probabilities of species or land cover types are considered as the probabilities of symbols:

dC1 = [Σ (pd - ps)] / N, with the sum taken over the G dominant-subordinate subtractions    (1)

DC1 = N (1 - dC1)    (2)

HC1 = log2 DC1    (3)

dC2 = [Σ |pi - pj|] / N, with the sum taken over the K pairs of different symbols    (4)

DC2 = N (1 - dC2)    (5)

HC2 = log2 DC2    (6)

where N is the number of different symbols or maximum expected diversity, pd > 1/N is the relative abundance of each dominant symbol, ps < 1/N is the relative abundance of each subordinate symbol, pi and pj are the relative abundances of two different symbols in the same message, L is the number of dominant symbols, G is the number of subtractions between the relative abundances of dominant and subordinate symbols, and K = N (N - 1)/2 is the number of subtractions between all pairs of relative symbol abundances. The dominance statistic dC1 refers to the average absolute difference between the relative abundances of dominant and subordinate symbols (Equation (1)), with its value being equivalent to the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability (see also [18]). Accordingly, the value of DC1 equals the number of different symbols minus the impact of symbol dominance (dC1) on the maximum expected diversity (Equation (2)). The binary logarithm of this subtraction is the associated measure of information entropy (Equation (3)). 
Likewise, the dominance statistic dC2 refers to the average absolute difference between all pairs of relative symbol abundances (Equation (4)), with its value being equivalent to twice the area between the Lorenz curve and the 45-degree line of equiprobability (see also [18]). Accordingly, the value of DC2 equals the number of different symbols minus the impact of symbol dominance (dC2) on the maximum expected diversity (Equation (5)). The binary logarithm of this subtraction is the associated measure of information entropy (Equation (6)). Despite the above dissimilarities between Lorenz-consistent statistics of symbol dominance, symbol diversity, and information entropy, dC1 = dC2 = 0, DC1 = DC2 = N, and HC1 = HC2 = log2 N if there is equiprobability (pi = 1/N, including N = 1); and dC1 = dC2 > 0, DC1 = DC2 < N, and HC1 = HC2 < log2 N whenever relative abundance inequality occurs only between dominant and subordinate symbols. In this regard, it is worth noting that dC1 is comparable to Schutz’s [21] index of income inequality (also known as the Pietra ratio or Robin Hood index) and dC2 is comparable to Gini’s [19,20] index of income inequality. In fact, Gini’s index and Schutz’s index take the same value whenever income inequality occurs only between the rich and the poor (see reviews in [22,23]; also see [18]). However, there is a particular difference between the measurement of symbol dominance (dC1 and dC2) and the measurement of income inequality (Schutz’s index and Gini’s index): income inequality can reach a maximum value of 1 − 1/M when a single person has all the income and the remaining M − 1 people have none (as individuals with no income are considered to measure income inequality), but symbol dominance can only approach a maximum value of 1 − 1/N when a single symbol has a relative abundance very close to 1 and the remaining N − 1 symbols have minimum relative abundances (as symbols with no abundance or zero probability cannot be considered to measure symbol dominance). 
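Equations (1)–(6) can be sketched in a few lines (my reading of the definitions, not code from the paper; the division by N reproduces the tabulated values):

```python
import math

def camargo_statistics(p):
    """Lorenz-consistent statistics of Equations (1)-(6): dC1 sums the
    dominant-subordinate differences and dC2 the absolute differences of
    all K = N(N-1)/2 pairs, each divided by N; D = N(1 - d), H = log2 D."""
    n = len(p)
    dom = [pi for pi in p if pi > 1 / n]   # dominant symbols (pd > 1/N)
    sub = [pi for pi in p if pi < 1 / n]   # subordinate symbols (ps < 1/N)
    d_c1 = sum(pd - ps for pd in dom for ps in sub) / n
    d_c2 = sum(abs(p[i] - p[j])
               for i in range(n) for j in range(i + 1, n)) / n
    D_c1, D_c2 = n * (1 - d_c1), n * (1 - d_c2)
    return d_c1, d_c2, D_c1, D_c2, math.log2(D_c1), math.log2(D_c2)
```

For distribution V, p = (0.6, 0.4): dC1 = dC2 = 0.1, DC1 = DC2 = 1.8 symbols, and HC1 = HC2 ≈ 0.848 bits, as in Table 3.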
Additionally, because the reciprocal of symbol diversity refers to the concentration of relative abundance in the same symbol (as already explained in Section 2.1), two Lorenz-consistent statistics of symbol redundancy are RC1 = 1/DC1 and RC2 = 1/DC2. RC1 and RC2 take values from 1/N to 1 (maximum redundancy if N = 1), and therefore their mathematical behavior can differ considerably from that of Gatlin’s [30] classical redundancy index (R = 1 − HS/log2 N). Indeed, since R takes a maximum value of 1 if N = 1 and a minimum value of 0 if pi = 1/N [30], R should be regarded as a combination of redundancy and dominance (see also [15]).
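The contrast with Gatlin's index can be checked numerically (a sketch of my own; RC1 is obtained here through dC1 as defined above):

```python
import math

def redundancies(p):
    """RC1 = 1/DC1 (Lorenz-consistent) versus Gatlin's classical
    R = 1 - HS/log2 N, which mixes redundancy with dominance."""
    n = len(p)
    dom = [pi for pi in p if pi > 1 / n]
    sub = [pi for pi in p if pi < 1 / n]
    d_c1 = sum(pd - ps for pd in dom for ps in sub) / n
    r_c1 = 1 / (n * (1 - d_c1))
    hs = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    gatlin = 1.0 if n == 1 else 1 - hs / math.log2(n)
    return r_c1, gatlin
```

Under equiprobability with N = 4, RC1 = 1/N = 0.25 while Gatlin's R = 0; both reach 1 only for N = 1.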

2.3. Comparing Lorenz-Consistent Statistics with HS-Based and HR-Based Statistics

Lorenz-consistent statistics of symbol dominance (dC1 and dC2), symbol diversity (DC1 and DC2), and information entropy (HC1 and HC2) are compared with statistics based on Shannon’s [1] entropy (HS) and Rényi’s [10] second-order entropy (HR). More specifically, on the basis of Hill’s [9] proposals for measuring diversity and Camargo’s [17] proposals for measuring dominance, the HS-based and HR-based statistics are:

HS = -Σ pi log2 pi    DS = 2^HS    dS = 1 - DS/N

HR = -log2 (Σ pi²)    DR = 2^HR = 1/(Σ pi²)    dR = 1 - DR/N

where pi is the relative abundance or probability of the ith symbol in a message or sequence of N different symbols, and each sum is taken over the N symbols. Although dC1 = dC2 = dS = dR = 0, DC1 = DC2 = DS = DR = N, and HC1 = HC2 = HS = HR = log2 N whenever there is equiprobability, differences in mathematical behavior between Lorenz-consistent statistics and HS-based and HR-based statistics were examined by computing all these statistics for the ten probability distributions (I–X) described as hypothetical messages in Table 1. As we can see, the hypothetical message V is the primary or starting distribution, having two different symbols with probabilities of 0.6 and 0.4. From distribution V to I, the probabilities of all different symbols are successively halved by doubling their number, with the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability remaining steady (= 0.1). From distribution V to X, only the probabilities of subordinate symbols are successively halved by doubling their number, with the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability approaching the probability of the single dominant symbol (= 0.6). Accordingly, the degree of dominance in each dominant symbol is given by the positive deviation of its probability (pd) from the expected equiprobable value of 1/N, while the degree of subordination in each subordinate symbol is given by the negative deviation of its probability (ps) from 1/N. 
Thus, in each probability distribution or hypothetical message, symbol dominance = symbol subordination = the average absolute difference between the relative abundances of dominant and subordinate symbols (Equation (1)) = the whole relative abundance of dominant symbols that must be transferred to subordinate symbols to achieve equiprobability (Ptransfer values in Table 1).
Table 1

Ten probability distributions (I–X) are described as hypothetical messages: N = the number of different symbols; p1–p33 = the relative abundances of symbols (symbol probabilities); Ptransfer = the whole relative abundance of dominant symbols (pd > 1/N) that must be transferred to subordinate symbols (ps < 1/N) to achieve equiprobability (pi = 1/N, including N = 1).

             I        II       III      IV       V        VI       VII      VIII     IX       X
N            32       16       8        4        2        3        5        9        17       33
p1           0.0375   0.075    0.15     0.3      0.6      0.6      0.6      0.6      0.6      0.6
p2           0.0375   0.075    0.15     0.3      0.4      0.2      0.1      0.05     0.025    0.0125
p3           0.0375   0.075    0.15     0.2      -        0.2      0.1      0.05     0.025    0.0125
p4           0.0375   0.075    0.15     0.2      -        -        0.1      0.05     0.025    0.0125
p5           0.0375   0.075    0.1      -        -        -        0.1      0.05     0.025    0.0125
p6           0.0375   0.075    0.1      -        -        -        -        0.05     0.025    0.0125
p7           0.0375   0.075    0.1      -        -        -        -        0.05     0.025    0.0125
p8           0.0375   0.075    0.1      -        -        -        -        0.05     0.025    0.0125
p9           0.0375   0.05     -        -        -        -        -        0.05     0.025    0.0125
p10–p16      0.0375   0.05     -        -        -        -        -        -        0.025    0.0125
p17          0.025    -        -        -        -        -        -        -        0.025    0.0125
p18–p32      0.025    -        -        -        -        -        -        -        -        0.0125
p33          -        -        -        -        -        -        -        -        -        0.0125
Ptransfer    0.1      0.1      0.1      0.1      0.1      0.267    0.4      0.489    0.541    0.57
(- indicates that the symbol does not occur in that distribution; the identical rows p10–p16 and p18–p32 are each shown once.)
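The construction of Table 1 from the starting distribution V can be reproduced with a short sketch (function names are mine, not the paper's):

```python
def halve_all(p):
    """V -> I direction: halve every probability by doubling the symbol count."""
    return [pi / 2 for pi in p for _ in range(2)]

def halve_subordinate(p):
    """V -> X direction: halve only the subordinate probabilities (ps < 1/N)."""
    n = len(p)
    out = []
    for pi in p:
        out.extend([pi] if pi > 1 / n else [pi / 2, pi / 2])
    return out

def p_transfer(p):
    """Whole relative abundance of dominant symbols that must be
    transferred to subordinate symbols to achieve equiprobability."""
    n = len(p)
    return sum(pi - 1 / n for pi in p if pi > 1 / n)
```

Starting from V = [0.6, 0.4], halve_all yields IV = [0.3, 0.3, 0.2, 0.2] with Ptransfer still 0.1, while halve_subordinate yields VI = [0.6, 0.2, 0.2] with Ptransfer ≈ 0.267.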
In addition, disparities in mathematical behavior between Lorenz-consistent statistics and HS-based and HR-based statistics were examined by computing all these statistics for the ten probability distributions (XI–XX) described as hypothetical messages in Table 2, where differences in relative abundance or probability occur not only between dominant and subordinate symbols (as in Table 1), but also among dominant symbols and among subordinate symbols. However, because the Ptransfer value equals 0.25 in all of these hypothetical messages, only changes in the allocation of relative abundance between dominant and subordinate symbols (but not among dominant symbols or among subordinate symbols) seem to be of real significance for probability distributions to achieve the reference distribution (involving equiprobability) or to deviate from it. The reasons for this are evident. In the case of a dominant symbol increasing its relative abundance at the expense of other dominant symbols (Table 2, relative abundances p1–p5 in probability distributions XVI–XIX), the resulting proportional abundance of all the dominant symbols is the same as before the transfer, since the increase in the probability of one dominant symbol (becoming more dominant) is compensated by an equivalent decrease in the probability of other dominant symbols (becoming less dominant). Similarly, in the case of a subordinate symbol increasing its relative abundance at the expense of other subordinate symbols (Table 2, relative abundances p6–p10 in probability distributions XII–XV), the resulting proportional abundance of all the subordinate symbols is the same as before the transfer, since the increase in the probability of one subordinate symbol (becoming less subordinate) is compensated by an equivalent decrease in the probability of other subordinate symbols (becoming more subordinate or rare).
Table 2

Ten probability distributions (XI–XX) are described as hypothetical messages: N = the number of different symbols; p1–p10 = the relative abundances of symbols (symbol probabilities); Ptransfer = the whole relative abundance of dominant symbols (pd > 1/N) that must be transferred to subordinate symbols (ps < 1/N) to achieve equiprobability (pi = 1/N, including N = 1).

             XI      XII     XIII    XIV     XV      XVI     XVII    XVIII   XIX     XX
N            10      10      10      10      10      10      10      10      10      10
p1           0.15    0.15    0.15    0.15    0.15    0.19    0.19    0.19    0.19    0.19
p2           0.15    0.15    0.15    0.15    0.15    0.14    0.17    0.17    0.17    0.17
p3           0.15    0.15    0.15    0.15    0.15    0.14    0.13    0.15    0.15    0.15
p4           0.15    0.15    0.15    0.15    0.15    0.14    0.13    0.12    0.13    0.13
p5           0.15    0.15    0.15    0.15    0.15    0.14    0.13    0.12    0.11    0.11
p6           0.05    0.09    0.09    0.09    0.09    0.05    0.05    0.05    0.05    0.09
p7           0.05    0.04    0.07    0.07    0.07    0.05    0.05    0.05    0.05    0.07
p8           0.05    0.04    0.03    0.05    0.05    0.05    0.05    0.05    0.05    0.05
p9           0.05    0.04    0.03    0.02    0.03    0.05    0.05    0.05    0.05    0.03
p10          0.05    0.04    0.03    0.02    0.01    0.05    0.05    0.05    0.05    0.01
Ptransfer    0.25    0.25    0.25    0.25    0.25    0.25    0.25    0.25    0.25    0.25
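This invariance can be checked directly for distributions XI and XII of Table 2: reallocating abundance within the subordinate group changes neither Ptransfer nor dC1 (a sketch of my own):

```python
def d_c1(p):
    """Average dominant-subordinate abundance difference (Equation (1))."""
    n = len(p)
    return sum(pd - ps for pd in p if pd > 1 / n
               for ps in p if ps < 1 / n) / n

xi = [0.15] * 5 + [0.05] * 5                       # distribution XI
xii = [0.15] * 5 + [0.09, 0.04, 0.04, 0.04, 0.04]  # distribution XII
```

Both distributions return dC1 = 0.25, although XII spreads its subordinate abundances unequally.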
Probability distributions in Table 1 and Table 2 were selected to better assess differences in mathematical behavior between Lorenz-consistent statistics (Camargo’s indices) and HS-based and HR-based statistics. Otherwise, when using probability distributions chosen at random, we could obtain results that do not allow us to appreciate significant differences between the respective mathematical behaviors.

3. Results and Discussion

The Lorenz-curve-based graphical representation of symbol dominance (relative abundance inequality among different symbols) is shown in Figure 1. Estimated values of symbol dominance are 0.1 (I–V, with five Lorenz curves perfectly superimposed), 0.267 (VI), 0.4 (VII), 0.489 (VIII), 0.541 (IX), and 0.57 (X), with all these dominance values being equivalent to the respective Ptransfer values in Table 1. Additionally, estimated values of symbol diversity are 28.8 (I), 14.4 (II), 7.2 (III), 3.6 (IV), 1.8 (V), 2.199 (VI), 3.0 (VII), 4.599 (VIII), 7.803 (IX), and 14.19 (X) symbols, and estimated values of information entropy are 4.848 (I), 3.848 (II), 2.848 (III), 1.848 (IV), 0.848 (V), 1.137 (VI), 1.585 (VII), 2.202 (VIII), 2.964 (IX), and 3.827 (X) bits.
Figure 1

The cumulative proportion of abundance is related to the cumulative proportion of symbols, ranked from the symbol with the lowest relative abundance to the symbol with the highest relative abundance, for the ten probability distributions (I–X) described as hypothetical messages in Table 1. The reference distribution is depicted by the 45-degree line of equiprobability, where every symbol has the same relative abundance = 1/N, symbol dominance = 0, and symbol diversity = the number of different symbols (N). Symbol dominance may be estimated as the maximum vertical distance from the Lorenz curve to the 45-degree line, or as twice the area between the Lorenz curve and the 45-degree line, with both measures giving the same value whenever relative abundance inequality occurs only between dominant and subordinate symbols (as shown in this figure). In addition, symbol diversity = N (1 – symbol dominance), symbol redundancy = 1/symbol diversity, and information entropy = log2 symbol diversity.

Differences in mathematical behavior between Lorenz-consistent statistics (dC1, dC2, DC1, DC2, HC1, and HC2) and HS-based and HR-based statistics (dS, dR, DS, DR, HS, and HR) are shown in Table 3. Because dC1, dC2, DC1, DC2, HC1, and HC2 are Lorenz-consistent, their estimated values match the estimated values of symbol dominance, symbol diversity, and information entropy concerning Figure 1. In fact, estimated values of dC1 and dC2 are equivalent to the respective Ptransfer values in Table 1. By contrast, estimated values of dS, dR, DS, DR, HS, and HR do not match the estimated values of symbol dominance, symbol diversity, and information entropy concerning Figure 1, with dS and dR even exhibiting values greater than the upper limit for symbol dominance (= 0.6). Consequently, DS and DR can underestimate symbol diversity when differences in relative abundance between dominant and subordinate symbols are large, or can overestimate it when such differences are relatively small.
Table 3

Measures of symbol dominance (dC1, dC2, dS, and dR), symbol diversity (DC1, DC2, DS, and DR), and information entropy (HC1, HC2, HS, and HR) are computed for the ten probability distributions (I–X) described as hypothetical messages in Table 1. Hmax = log2 N = maximum expected entropy; HC1/Hmax, HC2/Hmax, HS/Hmax, and HR/Hmax = normalized entropies. All statistics are explained in the text.

             I        II       III      IV       V        VI       VII      VIII     IX       X
dC1          0.100    0.100    0.100    0.100    0.100    0.267    0.400    0.489    0.541    0.570
DC1          28.800   14.400   7.200    3.600    1.800    2.199    3.000    4.599    7.803    14.190
HC1          4.848    3.848    2.848    1.848    0.848    1.137    1.585    2.202    2.964    3.827
HC1/Hmax     0.970    0.962    0.949    0.924    0.848    0.717    0.683    0.695    0.725    0.759
dC2          0.100    0.100    0.100    0.100    0.100    0.267    0.400    0.489    0.541    0.570
DC2          28.800   14.400   7.200    3.600    1.800    2.199    3.000    4.599    7.803    14.190
HC2          4.848    3.848    2.848    1.848    0.848    1.137    1.585    2.202    2.964    3.827
HC2/Hmax     0.970    0.962    0.949    0.924    0.848    0.717    0.683    0.695    0.725    0.759
dR           0.038    0.038    0.038    0.038    0.038    0.242    0.500    0.708    0.841    0.917
DR           30.768   15.384   7.692    3.846    1.923    2.273    2.500    2.632    2.703    2.740
HR           4.943    3.943    2.943    1.943    0.943    1.184    1.322    1.396    1.434    1.454
HR/Hmax      0.989    0.986    0.981    0.972    0.943    0.747    0.569    0.440    0.351    0.288
dS           0.020    0.020    0.020    0.020    0.020    0.138    0.317    0.500    0.650    0.762
DS           31.360   15.680   7.840    3.920    1.960    2.586    3.413    4.503    5.942    7.841
HS           4.971    3.971    2.971    1.971    0.971    1.371    1.771    2.171    2.571    2.971
HS/Hmax      0.994    0.993    0.990    0.985    0.971    0.865    0.763    0.685    0.629    0.589
Hmax         5.000    4.000    3.000    2.000    1.000    1.585    2.322    3.170    4.087    5.044
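The dS and dR rows above follow from the definitions in Section 2.3; for distribution X they indeed exceed the 0.6 upper limit for symbol dominance (a sketch, not code from the paper):

```python
import math

def entropy_based_dominances(p):
    """dS = 1 - 2**HS / N and dR = 1 - (1 / sum pi^2) / N,
    following Hill's diversities and Camargo's dominance proposals."""
    n = len(p)
    hs = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    return 1 - 2 ** hs / n, 1 - (1 / sum(pi ** 2 for pi in p)) / n
```

For distribution X, p = (0.6) plus 32 symbols at 0.0125: dS ≈ 0.762 and dR ≈ 0.917, both above the value of 0.6 that symbol dominance can at most approach here.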
The observed shortcomings in the measurement of symbol dominance (using dS and dR) and symbol diversity (using DS and DR) seem to be a consequence of the mathematical behavior of the associated entropy measures (HS and HR). As we can see in Table 3, from distribution V to I, where the Ptransfer value remains relatively small (= 0.1; Table 1), inequalities between entropy measures result in HS values > HR values > HC1 and HC2 values. On the contrary, from distribution VII to X, where Ptransfer approaches a higher value of 0.6 (Table 1), inequalities between entropy measures result in HC1 and HC2 values > HS values > HR values. In fact, whereas the normalized entropies of HC1 and HC2 increase from distribution VII to X, the normalized entropies of HS and HR decrease markedly. This remarkable finding would indicate that HC1 and HC2 can quantify the amount of information or uncertainty in a probability distribution more efficiently than HS and HR, particularly when differences between higher and lower probabilities are maximized by increasing the number of small probabilities (as shown in Table 3 regarding data in Table 1). After all, within the context of classical information theory, the information content of a symbol is an increasing function of the reciprocal of its probability [1,5,6,10] (also see [31,32]). Other relevant disparities in mathematical behavior regarding measures of symbol dominance, symbol diversity, and information entropy are shown in Table 4. The respective values of dC1, DC1, and HC1 remain identical from distribution XI to XX, since dC1 is sensitive only to differences in relative abundance between dominant and subordinate symbols. 
On the other hand, because dC2 is sensitive to differences in relative abundance among all different symbols, the respective values of dC2, DC2, and HC2 do not remain identical from distribution XI to XX, even though they are equal in XII and XVI, in XIII and XVII, in XIV and XVIII, and in XV and XIX, as in each of these distribution pairs the changes in the allocation of relative abundance among dominant symbols and among subordinate symbols are equivalent. A similar pattern of values is observed concerning dR, DR, and HR, but not regarding dS, DS, and HS, whose respective values remain distinct from distribution XI to XX.
Table 4

Measures of symbol dominance (dC1, dC2, dS, and dR), symbol diversity (DC1, DC2, DS, and DR), and information entropy (HC1, HC2, HS, and HR) are computed for the ten probability distributions (XI–XX) described as hypothetical messages in Table 2. Hmax = log2 N = maximum expected entropy; HC1/Hmax, HC2/Hmax, HS/Hmax, and HR/Hmax = normalized entropies. All statistics are explained in the text.

             XI      XII     XIII    XIV     XV      XVI     XVII    XVIII   XIX     XX
dC1          0.250   0.250   0.250   0.250   0.250   0.250   0.250   0.250   0.250   0.250
DC1          7.500   7.500   7.500   7.500   7.500   7.500   7.500   7.500   7.500   7.500
HC1          2.907   2.907   2.907   2.907   2.907   2.907   2.907   2.907   2.907   2.907
HC1/Hmax     0.875   0.875   0.875   0.875   0.875   0.875   0.875   0.875   0.875   0.875
dC2          0.250   0.270   0.282   0.288   0.290   0.270   0.282   0.288   0.290   0.330
DC2          7.500   7.300   7.180   7.120   7.100   7.300   7.180   7.120   7.100   6.700
HC2          2.907   2.868   2.844   2.832   2.828   2.868   2.844   2.832   2.828   2.744
HC2/Hmax     0.875   0.863   0.856   0.852   0.851   0.863   0.856   0.852   0.851   0.826
dR           0.200   0.213   0.220   0.224   0.225   0.213   0.220   0.224   0.225   0.248
DR           8.000   7.870   7.800   7.760   7.750   7.870   7.800   7.760   7.750   7.520
HR           3.000   2.976   2.963   2.956   2.954   2.976   2.963   2.956   2.954   2.911
HR/Hmax      0.903   0.896   0.892   0.890   0.889   0.896   0.892   0.890   0.889   0.876
dS           0.122   0.137   0.149   0.157   0.161   0.128   0.131   0.133   0.134   0.173
DS           8.779   8.628   8.512   8.431   8.387   8.724   8.691   8.675   8.662   8.269
HS           3.134   3.109   3.089   3.076   3.068   3.125   3.120   3.117   3.115   3.048
HS/Hmax      0.943   0.936   0.930   0.926   0.924   0.941   0.939   0.938   0.937   0.918
Hmax         3.322   3.322   3.322   3.322   3.322   3.322   3.322   3.322   3.322   3.322
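The divergent trends of the normalized entropies in Table 3 can be traced numerically (a sketch of my own; HC1 is computed via Equation (1), HS and HR as in Section 2.3):

```python
import math

def normalized_entropies(p):
    """HC1/Hmax, HS/Hmax, and HR/Hmax for a probability distribution."""
    n = len(p)
    h_max = math.log2(n)
    d_c1 = sum(pd - ps for pd in p if pd > 1 / n
               for ps in p if ps < 1 / n) / n
    h_c1 = math.log2(n * (1 - d_c1))
    h_s = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    h_r = math.log2(1 / sum(pi ** 2 for pi in p))
    return h_c1 / h_max, h_s / h_max, h_r / h_max
```

From distribution VII (0.6 plus four symbols at 0.1) to X (0.6 plus 32 symbols at 0.0125), HC1/Hmax rises from about 0.683 to 0.759, while HR/Hmax falls from about 0.569 to 0.288.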

4. Concluding Remarks

This theoretical analysis has shown that the Lorenz curve is a proper framework for defining satisfactory measures of symbol dominance, symbol diversity, and information entropy (Figure 1 and Table 3 and Table 4). The value of symbol dominance equals the maximum vertical distance from the Lorenz curve to the 45-degree line of equiprobability when only differences in relative abundance between dominant and subordinate symbols are quantified, which is equivalent to the average absolute difference between the relative abundances of dominant and subordinate symbols = dC1 (Equation (1)); or it equals twice the area between the Lorenz curve and the 45-degree line of equiprobability when differences in relative abundance among all symbols are quantified, which is equivalent to the average absolute difference between all pairs of relative symbol abundances = dC2 (Equation (4)). Symbol diversity = N (1 − symbol dominance) (i.e., DC1 = N (1 − dC1) and DC2 = N (1 − dC2)), and information entropy = log2 symbol diversity (i.e., HC1 = log2 DC1 and HC2 = log2 DC2). Additionally, the reciprocal of symbol diversity may be regarded as a satisfactory measure of symbol redundancy (i.e., RC1 = 1/DC1 and RC2 = 1/DC2). This study has also shown that Lorenz-consistent statistics (dC1, dC2, DC1, DC2, HC1, and HC2) have better mathematical behavior than HS-based and HR-based statistics (dS, dR, DS, DR, HS, and HR), exhibiting greater coherence and objectivity when measuring symbol dominance, symbol diversity, and information entropy (Table 3 and Table 4). 
However, considering that the 45-degree line of equiprobability (Figure 1) represents the reference distribution (pi = 1/N), and that only changes in the allocation of relative abundance between dominant and subordinate symbols (but not among dominant symbols or among subordinate symbols) seem to have true relevance for probability distributions to achieve the reference distribution or to deviate from it (Table 2), the use of dC1, DC1, and HC1 may be more practical and preferable than the use of dC2, DC2, and HC2 when measuring symbol dominance, symbol diversity, and information entropy. In this regard, it should be evident that if the number of different symbols (N) is fixed in any given message, increasing differences in relative abundance between dominant and subordinate symbols necessarily imply decreases in symbol diversity and information entropy, whereas decreasing differences necessarily imply increases in symbol diversity and information entropy, with both variables reaching their maximum values if pi = 1/N. By contrast, increasing or decreasing differences in relative abundance among dominant symbols or among subordinate symbols will not affect symbol diversity and information entropy, since the decrease or increase in the information content of one dominant or subordinate symbol is compensated by an equivalent increase or decrease in the information content of other dominant or subordinate symbols.
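This concluding point can be verified with a final sketch (the cross-group transfer distribution is a hypothetical example of mine, not one of the distributions in Table 2):

```python
import math

def diversity_entropy(p):
    """DC1 = N(1 - dC1) and HC1 = log2 DC1 for a probability vector."""
    n = len(p)
    d_c1 = sum(pd - ps for pd in p if pd > 1 / n
               for ps in p if ps < 1 / n) / n
    d = n * (1 - d_c1)
    return d, math.log2(d)

base = [0.15] * 5 + [0.05] * 5                              # distribution XI
within = [0.19, 0.14, 0.14, 0.14, 0.14] + [0.05] * 5        # XVI: reshuffle among dominants
across = [0.10, 0.15, 0.15, 0.15, 0.15, 0.10] + [0.05] * 4  # move 0.05 dominant -> subordinate
```

The within-group reshuffle (XVI relative to XI) leaves DC1 = 7.5 symbols and HC1 ≈ 2.907 bits unchanged, whereas moving 0.05 of abundance from a dominant to a subordinate symbol raises them to 8.4 symbols and ≈ 3.070 bits.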