The assignment of scores procedure for ordinal categorical data.

Abstract

Ordinal data are the most frequently encountered type of data in the social sciences, and many statistical methods can be used to process them. One common method is to assign scores to the categories, thereby converting the data into interval data, and then to perform further statistical analysis. Several authors have recently developed methods for assigning scores to ordered categorical data. This paper proposes a scoring system for an ordinal categorical variable based on an underlying continuous latent distribution and interprets it using three case studies. The results show that the proposed scoring system works well for skewed ordinal categorical data.

Year: 2014 PMID: 25295296 PMCID: PMC4176904 DOI: 10.1155/2014/304213

Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X

1. Introduction

Ordinal data arise often in sample surveys and experimental designs, where interval data are difficult to obtain. The data obtained are usually “categorical data” or “ordinal categorical data,” collected on a scale such as “strongly agree,” “agree,” “have no opinion,” “disagree,” and “strongly disagree.” Because most traditional statistical methods assume interval data, researchers often first assign scores to the ordinal categories, converting them into interval data, and then conduct further statistical analyses, such as factor analysis, principal component analysis, and discriminant analysis. One approach is to assign the scores subjectively (e.g., 5 for strongly agree, 4 for agree, 3 for no opinion, 2 for disagree, and 1 for strongly disagree). However, the original scale is ordinal and carries no concept of distance. Once scores from 5 to 1 are assigned, the scale becomes an interval scale and acquires one: the distance between strongly agree (5) and no opinion (3) is then the same as that between agree (4) and disagree (2), which exaggerates the information provided by the data. Other score-assignment methods generate the scores objectively from the data. These include the Ridit score relative to an identified distribution [1], the Conditional Median under a given cumulative distribution function [2], Conditional Mean scoring functions based on the underlying distribution [3], and normal scores [4]. In many applications, latent variable models for ordinal categorical data require a Bayesian model to estimate the parameters [5]. Two further score-assignment methods arise in testing for 2 × k ordered tables. To address the sensitivity of the linear rank test to the choice of scores, Kimeldorf et al. suggested min-max scoring [6], and Gautam et al. suggested the iso-chi-square approach for 2 × k ordered tables [7]. However, this approach can be intricate and involves complex computations related to its primary assumption.

This paper aims to provide an alternative scoring system, based on an underlying continuous latent variable, to determine the scores of ordinal categorical data, and explains the results using three examples. The remainder of this paper is organized as follows: Section 2 introduces the scoring system and relevant theories; Section 3 describes how scores are assigned to ordinal categorical data, the main theorem, and the relevant corollaries; Section 4 gives three examples to illustrate the scores obtained with the formula of Theorem 1; and lastly, Section 5 offers a conclusion and provides suggestions on score assignment for ordinal categorical data. Some property details are provided in the Appendix.

2. The Scoring System

For an ordinal categorical random variable $Y$ with probabilities $(p_1, \ldots, p_k)$, $k$ denotes the number of categories. A scoring system is a systematic method for assigning numerical values to ordinal categories [8]. The scores are computed from $(p_1, \ldots, p_k)$. Let $s_j = h(j, p_1, \ldots, p_k)$ be the score assigned to the $j$th category, and let $S = \{h(j, p_1, \ldots, p_k)\}$ denote the scoring system determined by the scoring functions $h(j, p_1, \ldots, p_k)$. For ordinal categorical data, Bross introduced a scoring system, which he called Ridit scores [1]. Let $\pi_j = \sum_{i \le j} p_i$, with $\pi_0 = 0$. Bross defined the Ridit score for category $j$ by $r_j = \frac{1}{2}(\pi_{j-1} + \pi_j)$. Brockett defined a Conditional Median Score under $G$ [2], where $G$ denotes a given cumulative distribution function selected either in accordance with some theoretical latent distribution of the categorical variable under study or in accordance with the desirable properties for the planned method of analysis. For example, if the categorical variable represents income levels, $G$ may represent a Pareto family distribution function. Let $s_j = h(j, p_1, \ldots, p_k)$ represent the scores assigned and let $F$ be the cumulative distribution function corresponding to this scoring system (i.e., $F(s_j) = \pi_j$). Brockett found a scoring system $\{s_j\}$, $s_j = G^{-1}(r_j)$, $j = 1, \ldots, k$, that minimizes the distance $d(F, G) = \max_x |F(x) - G(x)|$, where $r_j$ is the Ridit score for category $j$ (Figure 1).
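The two definitions above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the example counts are hypothetical, and $G$ is taken to be the standard normal distribution purely for demonstration.

```python
from itertools import accumulate
from statistics import NormalDist

def ridit_scores(counts):
    """Return r_j = (pi_{j-1} + pi_j) / 2 for ordered category counts."""
    n = sum(counts)
    cum = [c / n for c in accumulate(counts)]   # pi_1, ..., pi_k
    prev = [0.0] + cum[:-1]                     # pi_0, ..., pi_{k-1}
    return [(lo + hi) / 2 for lo, hi in zip(prev, cum)]

def conditional_median_scores(counts, inv_cdf):
    """Return s_j = G^{-1}(r_j), where inv_cdf is the quantile function of G."""
    return [inv_cdf(r) for r in ridit_scores(counts)]

# Hypothetical counts for four ordered categories (not taken from the paper).
counts = [10, 20, 30, 40]
r = ridit_scores(counts)
# Assumption for illustration only: G is the standard normal distribution.
s = conditional_median_scores(counts, NormalDist().inv_cdf)
print([round(x, 3) for x in r])
print([round(x, 3) for x in s])
```

Any other quantile function can be passed as `inv_cdf`; because $G^{-1}$ is increasing, the scores always keep the order of the categories.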

Figure 1

The correlation plot of $s_j$ and $r_j$.

Fielding suggested a scoring function $f$ based on the conditional mean of a category, assuming that the data are generated by an assumed distributional form $G$ [3]. Consider the following:

The next section introduces a scoring system based on a given cumulative distribution function satisfying certain conditions.

3. Scoring Procedure for Ordinal Categorical Data

For an ordinal categorical random variable $Y$ with probabilities $(p_1, \ldots, p_k)$, $k$ denotes the number of categories. Let an unobserved continuous variable underlie $Y$ [9], and let $Z$ denote this underlying latent variable. Suppose that $-\infty = c_0 < c_1 < \cdots < c_k = \infty$ are cut points of the continuous scale such that $Y = a_j$ if $c_{j-1} < Z \le c_j$. In other words, $Y$ takes the assigned score $a_j$ when the latent variable falls in the $j$th interval of values (Figure 2). This section introduces a scoring system for $Y$ based on the underlying latent variable $Z$ satisfying $EY = EZ$.

Figure 2

The plot of assigned score $a_j$ and underlying latent variable.

Theorem 1 .

Let $Y$ be an ordinal categorical response variable with the probabilities $(p_1, \ldots, p_k)$, where $k$ denotes the number of categories. Assume that $Z$ is a continuous underlying distribution of $Y$ with distribution function $G$ and probability density function $g$, and assume that $EZ$ exists. Suppose that $-\infty = c_0 < c_1 < \cdots$

Corollary 2 .

If the underlying distribution $Z$ is $U(0,1)$, then $a_j = r_j$, where $r_j$ is the Ridit score.
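Why this holds can be sketched as follows, under the assumption (not stated explicitly above) that $a_j$ is the conditional mean of $Z$ on the $j$th interval, which is one choice that satisfies $EY = EZ$. For $Z \sim U(0,1)$ we have $G(z) = z$, so the cut points are $c_j = \pi_j$ and the $j$th interval has probability $p_j = \pi_j - \pi_{j-1}$:

```latex
\begin{aligned}
a_j &= E\left[ Z \mid c_{j-1} < Z \le c_j \right]
     = \frac{1}{p_j} \int_{\pi_{j-1}}^{\pi_j} z \, dz \\
    &= \frac{\pi_j^2 - \pi_{j-1}^2}{2\,(\pi_j - \pi_{j-1})}
     = \frac{1}{2}\left( \pi_{j-1} + \pi_j \right) = r_j .
\end{aligned}
```

Under the same assumption, $\sum_j p_j a_j = \sum_j \int_{c_{j-1}}^{c_j} z \, g(z) \, dz = EZ$, which is the condition $EY = EZ$ used in this section.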

Corollary 3 .

Let $r_j = \frac{1}{2}(\pi_{j-1} + \pi_j)$ be the Ridit score; then one has $a_j \approx G^{-1}(r_j)$. The Appendix shows the proofs of all the properties.

Remark 4 .

Assume that $Z$ is the continuous latent variable underlying $Y$ and that its distribution function $G$ is known. Then the cut points $c_j$ do not need to be given in advance, because they are determined by $G(c_j) = \pi_j$.

Remark 5 .

The score $a_j$ defined in this study fulfills Brockett's Postulate 2 (Branching Property) [2]: suppose there are more than two categories, and for statistical or computational reasons we wish to combine two adjacent categories. In this case, the scores of the unaffected categories remain unchanged. Symbolically, if the $i$th and $(i+1)$st categories are combined, then $h(j, p_1, \ldots, p_i + p_{i+1}, \ldots, p_k) = h(j, p_1, \ldots, p_k)$ for $j < i$ and $h(j, p_1, \ldots, p_i + p_{i+1}, \ldots, p_k) = h(j+1, p_1, \ldots, p_k)$ for $j > i$. This postulate states that there is consistency within the scoring system as $k$ changes.
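The branching property can be checked numerically for scores built on Ridits, since $r_j$ depends only on $\pi_{j-1}$ and $\pi_j$. The sketch below is an illustration only: the probabilities and the helper names are hypothetical, and it checks the Ridit scores (and hence any $G^{-1}(r_j)$) rather than the exact $a_j$ of Theorem 1. Exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import accumulate

def ridit_scores(probs):
    """Return r_j = (pi_{j-1} + pi_j) / 2 for ordered category probabilities."""
    cum = list(accumulate(probs))      # pi_1, ..., pi_k
    prev = [0] + cum[:-1]              # pi_0, ..., pi_{k-1}
    return [(lo + hi) / 2 for lo, hi in zip(prev, cum)]

def merge_adjacent(probs, i):
    """Combine categories i and i + 1 (0-based) into a single category."""
    return probs[:i] + [probs[i] + probs[i + 1]] + probs[i + 2:]

# Hypothetical probabilities for four ordered categories.
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
before = ridit_scores(probs)
after = ridit_scores(merge_adjacent(probs, 1))  # merge the two middle categories

print(before)  # scores for the original four categories
print(after)   # first and last scores are unchanged by the merge
```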

Remark 6 .

Agresti introduced the score $v_j = \Phi^{-1}(r_j)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution and $r_j$ is the Ridit score of category $j$ [4]. Then, by Corollary 3, when $G = \Phi$, we have $a_j \approx v_j$.

4. Examples

Example 1 .

This example is a prospective study of maternal drinking and congenital malformations [10]. Table 1 presents a summary of the questionnaire results for alcohol consumption as completed by women who have passed their first trimester. Results show whether the newborns suffered from congenital malformations after birth. The average number of drinks per day was used to measure alcohol consumption, which was an explanatory variable of an ordinal categorical nature.

Table 1

Presence or absence of congenital sex organ malformation categorized by alcohol consumption of the mother [10].

	Alcohol consumption (average # drinks/day)
Malformation	0	<1	1–2	3–5	≥6
Absent	17066	14464	788	126	37
Present	48	38	5	1	1

Total	17114	14502	793	127	38

This study examines the correlation between the mothers' level of alcohol consumption and congenital malformation in newborns. The traditional approach is to use a contingency table. However, this study assigns scores to the levels of alcohol consumption and uses the statistic $M^2 = (n-1)r^2$ to test the correlation, where $n$ is the sample size and $r$ is the correlation coefficient. The square root of $M^2$ has an approximately standard normal distribution under the null hypothesis. The P value is the right-tail probability above the observed value [11]. Different assigned scores are used to calculate $M^2$ and the P value. As Table 2 shows, the midpoint scores (P value of 0.0104) and the proposed exponential scores (P value of 0.018572) both give significant results that are close to each other, whereas the midpoint and midrank (Ridit) scores give very different results. The proposed lognormal scores give the smallest P value (0.002318), which suggests that they suit these skewed data well.

Table 2

Alternative scoring systems for ordinal categories with exact one-sided P values.

	Alcohol consumption (average # drinks/day)
	0	<1	1–2	3–5	≥6
Midpoints	0	0.5	1.5	4.0	7.0
Standardized	−0.9	−0.72	−0.38	0.48	1.52
	M ² = 6.570134 P value = 0.0104(∗)

Equally spaced	1.0	2.0	3.0	4.0	5.0
Standardized	−1.26	−0.63	0.00	0.63	1.26
	M ² = 1.827816 P value = 0.1764

Midranks	8557.5	24365.5	32013.0	32473.0	32555.5
Standardized	−1.69	−0.16	0.58	0.63	0.63
	M ² = 0.351438 P value = 0.2860

Ridit score	0.262694	0.747989	0.982762	0.996884	0.999417
Standardized	−1.68566	−0.15734	0.582024	0.626497	0.634473
	M ² = 0.351438 P value = 0.5533

Normal score	−0.63502	0.668423	2.116563	2.739277	3.253699
Standardized	−0.64932	0.58412	1.660698	2.141198	2.550635
	M ² = 1.455888 P value = 0.113793

Exponential score	0.304753	1.378283	4.060658	5.771211	7.446831
Standardized	−1.17343	−0.81223	0.090276	0.665807	1.229585
	M ² = 4.343807 P value = 0.018572(∗)

Logistic score	−1.03201	1.087917	4.04327	5.76809	7.446247
Standardized	−1.30621	−0.69014	0.168719	0.669972	1.157663
	M ² = 2.220069 P value = 0.068113

Lognormal score	0.529903	1.950675	8.285174	15.41468	25.71126
Standardized	−0.94672	−0.81014	−0.20121	0.484139	1.473942
	M ² = 8.01653 P value = 0.002318(∗)

*Significant at 5%.
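The statistic $M^2 = (n-1)r^2$ behind Table 2 can be recomputed from the counts in Table 1. The sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the row scores are taken as 0 (absent) and 1 (present), and two tail conventions are shown side by side, the chi-square tail with 1 degree of freedom and the one-sided normal tail of $\sqrt{M^2}$. On recomputation, the P values reported for the midpoint, equally spaced, and Ridit scores appear to agree with the first convention and those for the normal, exponential, logistic, and lognormal scores with the second; this is an observation from the numbers, not a statement made in the source.

```python
import math

# Table 1 counts: five alcohol-consumption categories per row.
absent = [17066, 14464, 788, 126, 37]
present = [48, 38, 5, 1, 1]

def m_squared(col_scores, absent, present):
    """M^2 = (n - 1) r^2 with row scores 0 (absent) and 1 (present)."""
    n = sum(absent) + sum(present)
    totals = [a + p for a, p in zip(absent, present)]
    su = sum(u * t for u, t in zip(col_scores, totals))
    suu = sum(u * u * t for u, t in zip(col_scores, totals))
    sv = sum(present)                  # sum of row scores (and of their squares)
    suv = sum(u * p for u, p in zip(col_scores, present))
    cov = suv - su * sv / n
    var_u = suu - su * su / n
    var_v = sv - sv * sv / n
    return (n - 1) * cov * cov / (var_u * var_v)

def chi2_df1_tail(m2):
    """Right-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(m2 / 2))

def normal_upper_tail(z):
    """One-sided right-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

midpoints = [0, 0.5, 1.5, 4.0, 7.0]
equally_spaced = [1, 2, 3, 4, 5]
for name, scores in [("midpoints", midpoints), ("equally spaced", equally_spaced)]:
    m2 = m_squared(scores, absent, present)
    # sqrt(m2) is used as the z value, which assumes a positive correlation.
    print(name, round(m2, 4), round(chi2_df1_tail(m2), 4),
          round(normal_upper_tail(math.sqrt(m2)), 4))
```

The correlation is invariant to linear rescaling of the scores, so raw and standardized scores give the same $M^2$.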

In this case, Graubard and Korn noted that the results of the trend test applied to this data set are sensitive to the choice of scores [10]: the P value for equally spaced scores is 0.1764, and the Ridit score gives a P value of 0.5533. The P value obtained with the midpoint scores is close to that obtained with the exponential scores. Therefore, we suggest that the proposed method with exponential or lognormal scores works well in this example.

Example 2 .

This example is from Agresti, who used several data sets from the General Social Survey (GSS) [4]. Table 3 shows the results of 2,387 responses from the GSS to a question on whether heaven exists; the responses are markedly skewed.

Table 3

Responses about belief in heaven [11].

	Definitely	Probably	Probably not	Definitely not	Total
Count	1546	498	205	138	2387
Proportion	0.648	0.208	0.086	0.058	1.0
Ridit score	0.324	0.752	0.899	0.971

Table 4 compares Agresti's normal scores based on Ridits (Remark 6) with the scores obtained from the formula of Theorem 1. As Table 4 shows, the Agresti normal score $v_j$ and the proposed normal score $a_j$ are close. The table also shows the proposed score $a_j$ under the exponential, logistic, and lognormal distributions. The scores are computed as follows. Let $\pi_j$ be the cumulative relative frequency; that is, $\pi_1 = 0.648$, $\pi_2 = 0.856$, $\pi_3 = 0.942$, $\pi_4 = 1.0$, and $\pi_0 = 0$. Then, we apply function (b) in (3) of Theorem 1 to compute the score $a_j$ with the distribution $G$ taken to be standard normal, exponential, logistic, and lognormal, respectively. The results also show relatively larger gaps between the lognormal scores, which suits these skewed data.

Table 4

The results of responses about belief in heaven with different formulas.

	Definitely	Probably	Probably not	Definitely not
Count	1546	498	205	138
Proportion	0.648	0.209	0.086	0.058
Agresti normal score v_j	−0.457	0.681	1.277	1.897
Normal score a_j	−0.45699	0.680765	1.277267	1.897112
Exponential score a_j	0.391322	1.394286	2.295073	3.543686
Logistic score a_j	−0.73619	1.109254	2.188874	3.514354
Lognormal score a_j	0.633184	1.975389	3.586822	6.666614
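The scores in Table 4 can be recomputed from the counts in Table 3 with the approximation $a_j \approx G^{-1}(r_j)$ of Corollary 3. The sketch below is an illustration, not the authors' code. The parameter choices are assumptions: a unit-rate exponential, a standard logistic, and a lognormal whose logarithm is standard normal. With these choices the recomputed values agree with Table 4 to about three decimals.

```python
import math
from itertools import accumulate
from statistics import NormalDist

counts = [1546, 498, 205, 138]   # Table 3: Definitely ... Definitely not
n = sum(counts)
cum = [c / n for c in accumulate(counts)]                          # pi_1..pi_k
ridit = [(lo + hi) / 2 for lo, hi in zip([0.0] + cum[:-1], cum)]   # r_1..r_k

phi_inv = NormalDist().inv_cdf   # quantile function of the standard normal

# Quantile functions G^{-1}; the parameter choices are assumptions.
quantile = {
    "normal": phi_inv,
    "exponential": lambda r: -math.log(1 - r),      # Exp(1)
    "logistic": lambda r: math.log(r / (1 - r)),    # standard logistic
    "lognormal": lambda r: math.exp(phi_inv(r)),    # log Z ~ N(0, 1)
}

scores = {name: [q(r) for r in ridit] for name, q in quantile.items()}
for name, values in scores.items():
    print(name, [round(v, 3) for v in values])
```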

Example 3 .

This example is from Snedecor and Cochran [12]. In this example, patients with leprosy were divided into those with little infiltration and those with much infiltration, based on a measure of a certain type of skin damage. Their health status was also classified into five levels after the 48-week treatment (Table 5). This study uses the formula of Theorem 1 and that proposed by Fielding to assign scores and compares the results [3]. As Table 6 shows, the values are close to each other. In addition, Figures 3(a)–3(d) show the scores obtained under different distributions with the formula of Theorem 1. The patterns in these figures show that scores computed from different underlying distributions take different shapes.

Table 5

196 patients classified according to change in health and degree of infiltration [12].

Degree of infiltration	Marked improvement	Moderate improvement	Slight improvement	Stationary	Worse	Total
Little	11	27	42	53	11	144
Much	7	15	16	13	1	52

Total	18	42	58	66	12	196

Table 6

Results of using different formulae under the same distribution scores.

	Worse	Stationary	Slight	Moderate	Marked
Total frequencies	12	66	58	42	18

Proportions	0.061224	0.336735	0.295918	0.214286	0.091837

Ridit scores	0.030612 (0.03043)	0.229592 (0.228588)	0.545918 (0.545036)	0.80102 (0.800381)	0.954082 (0.953808)

Normal scores	−1.87187 (−1.83849)	−0.74019 (−0.73567)	0.115356 (0.114767)	0.845272 (0.839809)	1.685788 (1.66233)

Logistic scores	−3.45526 (−3.78093)	−1.21062 (−1.3212)	0.184192 (0.18643)	1.392684 (1.439734)	3.033884 (3.338361)

Lognormal scores	0.153836 (0.146667)	0.477022 (0.481287)	1.122272 (1.15068)	2.32861 (2.444555)	5.396699 (6.654603)

Figure 3

Probability plots comparing the results of scores under the different distributions with the formula of Theorem 1. (a) Equal Space, (b) normal distribution, (c) logistic distribution, and (d) lognormal distribution.

5. Conclusion

In this paper, we provide alternative methods of assigning scores to ordinal categorical variables based on an underlying continuous distribution. These procedures are simple and easy to apply. We use three real case studies to explain the process and the results of the calculations, and we propose that these scoring systems for ordinal variables are easy to perform, effective, and operationally useful, similar to the Ridit or Agresti scores. Equal-spacing or rank-based methods (i.e., midranks or Ridit scores in Example 1) are generally used as scores for processing ordinal categorical data. However, if the data are right-skewed or left-skewed, or if some categories have many more observations than others, the results are poor. This paper uses several underlying distributions as alternative sources of scores (e.g., in Example 1 the $M^2$ obtained from the exponential score is close to that obtained from the midpoint scores). We propose that, if an underlying distribution exists, these methods are also helpful for improving the development of traditional statistical techniques and software applications. Based on the three illustrations, this study suggests that the lognormal score can be applied well when the ordinal categorical data are skewed, and that the normal score may be used when the data are relatively balanced among categories.

There are many methods for processing ordinal categorical data, and not all of them require score assignment to convert ordinal categorical data into interval data for analysis (e.g., Cumulative Logit Models and Proportional Odds Models do not). However, if the independent variables are ordinal categorical variables, traditional (general) statistical methods treat them as categorical data and process them as dummy variables. In addition, if many of the response variables are ordinal categorical variables, it is advisable to assign scores to them to convert them into an interval scale for further statistical analysis. The benefits of score assignment for independent variables are as follows: (1) each variable uses 1 degree of freedom (rather than the original $k - 1$) and (2) the ordinal nature of the variable is retained, so that the related computations may be less complicated.

1. Choice of column scores for testing independence in ordered 2 X K contingency tables.

Authors: B I Graubard; E L Korn
Journal: Biometrics Date: 1987-06 Impact factor: 2.571
