Literature DB >> 30543662

A Bayesian approach for analysis of ordered categorical responses subject to misclassification.

Ashley Ling¹, El Hamidi Hay², Samuel E Aggrey^3,4, Romdhane Rekaya^1,3,5.

Abstract

Ordinal categorical responses are frequently collected in survey studies, human medicine, and animal and plant improvement programs, just to mention a few. Errors in this type of data are neither rare nor easy to detect. These errors tend to bias the inference, reduce the statistical power and ultimately the efficiency of the decision-making process. Contrarily to the binary situation where misclassification occurs between two response classes, noise in ordinal categorical data is more complex due to the increased number of categories, diversity and asymmetry of errors. Although several approaches have been presented for dealing with misclassification in binary data, only limited practical methods have been proposed to analyze noisy categorical responses. A latent variable model implemented within a Bayesian framework was proposed to analyze ordinal categorical data subject to misclassification using simulated and real datasets. The simulated scenario consisted of a discrete response with three categories and a symmetric error rate of 5% between any two classes. The real data consisted of calving ease records of beef cows. Using real and simulated data, ignoring misclassification resulted in substantial bias in the estimation of genetic parameters and reduction of the accuracy of predicted breeding values. Using our proposed approach, a significant reduction in bias and increase in accuracy ranging from 11% to 17% was observed. Furthermore, most of the misclassified observations (in the simulated data) were identified with a substantially higher probability. Similar results were observed for a scenario with asymmetric misclassification. While the extension to traits with more categories between adjacent classes is straightforward, it could be computationally costly. For traits with high heritability, the performance of the methodology would be expected to improve.

Entities: Disease Species

Mesh：

Year: 2018 PMID： 30543662 PMCID： PMC6292639 DOI： 10.1371/journal.pone.0208433

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Error in the measurement and recording of data, especially discrete data, is a prevalent issue that has the potential to affect the results of any analysis through the reduction of the statistical power and the increase in bias [1-6]. While measurement errors can occur in continuous and categorical variables, it has been shown that their impact on inference is more severe in discrete data [7]. When error occurs in categorical variables it is commonly called misclassification and is a topic that has drawn considerable research due to its negative effect on the inference. More specifically, misclassification occurs with the assignment of a discrete response to a class that does not accurately reflect its true state [8]. For complex and heterogeneous diseases, several subclasses with varying phenotypes often exist. Using typical diagnostic technologies, the different subclasses can be difficult to differentiate for various reasons; samples may look alike under microscopic analysis, individuals may share the same symptoms or markers used in the diagnostic, or one or more subclasses of the disease may simply be unknown [5]. This similarity can consequently lead to misdiagnosis [9]. It is well documented in the literature that misclassification in a response variable will have two primary inferential implications: 1) biased parameter estimates and 2) a reduction of statistical power [1,6,10-11]. Practically, this will manifest as inaccurate estimates of the model parameters and a reduced ability to identify truly influential explanatory factors. In recent years, there has been increased discussion of a lack of reproducible research across a wide range of fields. While many factors are likely to contribute to irreproducibility, inaccurate measurements have been implicated as a potential culprit [12-14]. Error can be introduced at various stages of data collection and analysis, and it has been estimated that error rates in many datasets can be expected to fall between 1–5% if measures to prevent them are not taken [15]. For human disease data, much higher error rates have been reported. In a 2010 report by the Pennsylvania Patient Safety Authority, diagnostic errors in acute care ranged between 3.1 and 49.8% [16]. The challenges presented by misclassification are not overcome by a large sample size as might be expected [17-18]. Therefore, as “Big Data” becomes more accessible and utilized, it will be essential to consider the potential for misclassification in these datasets [19-20]. It has even been suggested that some large datasets may be more susceptible to misclassification depending on the method and/or purpose of collection, such as electronic medical records [20]. Given the negative effects of misclassification on inference, ignoring its presence during analysis would not be a reasonable approach. A variety of methods for dealing with misclassification in a response have been reported [21-23]. The majority of these methods rely on the estimation of misclassification rates (e.g. false positives and false negatives). A double sampling approach has been suggested [10,24] that estimates rates of misclassification through the use of two tests: 1) an inexpensive fallible classifier for the full sample and 2) a more expensive infallible classifier (or “gold standard”) for a partial sample. In many situations, though, a perfect or even high accuracy classifier will either be inaccessible or nonexistent. Thus, misclassification has become largely a statistical problem and several methods for addressing it have been presented over the years. Several likelihood-based methods have been suggested for dealing with misclassification [3–4, 25–29]. Furthermore, [30] presented an elegant parametric and semi-parametric approach for analyzing speaking fluency of immigrants that contemplates errors due to subjective evaluations and scaling across individuals. Although these likelihood-based approaches are able to handle the misclassification problem, they seem not to perform well when the misclassification probabilities are small. Advances in Markov Chain Monte Carlo (MCMC) simulation techniques have greatly facilitated the use of Bayesian methods in misclassification problems. Hierarchical Bayesian modeling with conjugate informative priors has become a method of choice for analysis of misclassified binary data. Several studies by our group [5-6,11] and others [31-32] have been successful in dealing with misclassification errors in binary data under various sampling models. However, limited research has been carried out to scale those methods to the multinomial response situation. In this study, the methods presented by our group to deal with misclassified binary traits will be extended to multinomial response situations using a hierarchical Bayesian framework. Although our proposed approach was developed and implemented before [33] published their paper on dealing with unidirectional misclassification of ordinal covariates, both methods overlap on their approach to the problem. However, our method is unique as it addresses misclassification on response variables and it allows for asymmetric misclassification probabilities within a full Bayesian implementation.

Materials and methods

Let = (y, y,…, y)’ be a vector of ordered categorical responses collected on n individuals with y taking the value of one of C mutually exclusive discrete responses. This observed data is considered to be a noisy representation of true unobserved data, = (r, r,…, r)’. The distribution of y conditional on the probability of observing each response, is given by where = (θ, θ,…, θ)’ is the vector of probabilities that observation y for individual i is observed as class k, p(y = k), taking a value for each of the C discrete classes of the noisy data; and I(y = k) is an indicator function that takes the value of one if (y = k), otherwise equaling zero. Obviously, the observed probabilities of the multinomial distribution in [1] are linear combinations of the true probabilities, θ, and the misclassification probabilities, π, which is the probability that an observation with true class j is classified as observed class k. Misclassification will occur when j does not equal k. Thus, for class k, θ could be presented as where θ is the probability of observing class k for individual i, θ is the true probability of observing the discrete response class j for individual i, and π is the conditional probability of observing class k when the true class is j. When j is different from k, π is the misclassification probability from true class j to observed class k. Thus, In matrix notation and generalizing to all elements of the vector, , Eqs [2 and 3] can be re-written as where the matrix assembles the conditional probabilities, π, between all classes. The vectors and are specific to each individual and they could be modeled as a function of systematic and random effects. However, the matrix is common to all individuals. The matrix does not have to be symmetric and the only requirement is that columns sum to 1. In the absence of misclassification, the matrix will be equal to the identity matrix, as expected, and = . Assuming conditional independence, the joint distribution of the true data, , conditional on the vector of true probability, = (θ, θ,…, θ)’ is given by where I(r = j) is an indicator function that takes the value of one if (r = j) and otherwise is equal to zero. Let = (α, α, …, α)’, where α is an unknown random variable indicating if the observed discrete response for individual i arose from a switching (misclassification) event. Assuming that the observed (and potentially misclassified) discrete response of individual i is k (k ϵ {0,1,…,C-1}), then αi can be modeled following a multinomial process with probability vector = (π π … π … π), where h is an index taking C discrete values and I(α = h) is an indicator function that takes the value of one if (α = h) and otherwise is equal to zero. The joint distribution of and given and can be written as The joint distribution in Eq [7] depends on the true discrete data, , which is not available. However, such data could be generated as a function of the observed contaminated discrete responses, and the vector . Let f() be a link function that relates the true discrete data to the noisy responses and the vector of indicator variables For example, if the observed and potentially noisy discrete response for individual i is y = 0 then the true unobserved response, r, can be generated using the relationship with α taking the value of 0 (no misclassification), 1 (switching from r = 1 to y = 0) and 2 (switching from r = 2 to y = 0). A similar relationship could be built when the observed discrete response y is different from zero. Using the relationship in Eq [8], the joint distribution in Eq [7] could be re-written as a function of the observed data, where I(f(y, α) = j) is an indicator function that takes the value of one if (f(y, α) = j) and otherwise is equal to zero. It is often the case that the probability of observing a specific outcome of the multinomial process, θ, is a function of a set of systematic () and random () effects. Thus, from hereafter θ will be denoted by θ(, ) to indicate this relationship. To finalize the Bayesian formulation, prior distributions are needed for the unknown parameters in the model. For the systematic and random effects the following priors were specified: where and are known hyperparameters, is a known symmetric covariance matrix between the elements of the vector , and σ is an unknown dispersion component for which a scaled inverse chi-squared with v degrees of belief and scaling factor s was used as the prior. and are hyperparameters to be specified by the researcher based on published estimates in the literature, based on the researcher’s own experience of the range of values the parameter might take, or by choosing an interval large enough to contain all reasonable values of the parameter. For the misclassification probabilities vector, , a Dirichlet distribution was assumed, where = (τ0k, τ1k, …, τ(c-1)k) > 0 is the vector of concentration parameters of the Dirichlet distribution, B() is the multivariate Beta function that plays the role of a normalizing constant, and π is the probability of misclassification (switching) from discrete response class j to class k as defined previously in Eqs [2] and [3]. The joint posterior distribution of all unknowns in the models is proportional to the product of the expressions in Eqs [9-13], where q is the number of elements in the vector . In genetics studies, q will be equal to the number of individuals in the pedigree file, a file listing the ID of each animal in the population along with its sire and dam ID (where a zero is used for unknown sires and dams). Inclusion of pedigree information in the model is essential as it relates information concerning genetic merit of an individual through its relatives and information on the relatives. A data augmentation algorithm as described by [34] and [5] will facilitate the implementation of the model in Eq [14]. It consists of assuming the existence of an unknown continuous random variable, l, that relates to the discrete response via the following relationship: where t is an arbitrary threshold value. If y can take a value in one of C mutually exclusive ordered categorical responses, then there will be C+1 thresholds (t, t, t, …, t) where t and t are often set equal to negative and positive infinity. The underlying liability for individual i could be modeled as a function of the systematic and random effects, Where and are known incidence matrices relating the liability of individual i to the set of systematic and random effects, respectively, and e is the error term. Augmenting the model in Eq [7] with the vector of liabilities results in a lack of identifiability. To make the augmented model identifiable [34], two restrictions are needed. It was assumed that t = 0 and var(e) = 1. At the liability scale, the full conditional distributions of and σ are identical to those obtained in a standard analysis of discrete data in the absence of misclassification and can be found in [34], [5] and [35]. These conditional distributions, needed for a Bayesian implementation of the model via the Gibbs sampler, were in closed form, and were truncated normal for the liabilities, normal for the position parameters, and scaled inverse chi-squared for σ. The conditional of α is a multinomial distribution, where is the vector without the i position. For the misclassification probabilities, , its conditional distribution is proportional to where is the matrix without the kth row and n is the number of individuals with an observed discrete response equal to the kth category (class). where γ is the number of discrete responses switched (misclassified) from true class j to the observed class k. Thus, is distributed as Di(; + ) with = (γ γ γ). The vector is easily obtained as a by-product of the sampling process.

Data

Calving ease, a measure of how easily a cow delivers her calf, is an important trait in beef cattle production. Selection on animals with a high growth rate is of importance to the industry for the sake of yielding a higher carcass weight per animal. However, growth rate and birth weight are positively correlated traits, and so selection on a higher growth rate indirectly selects for higher birth weights. It has been found that high birth weight is the largest contributing factor in rates of dystocia, or calving difficulty, an issue that leads to several health and economic concerns, such as increased mortality for calf and cow and additional veterinary costs. It is therefore of interest for producers to consider calving ease directly as a trait in their selection models. However, without strict protocols on the farm, measuring calving ease may involve a significant degree of subjectivity. This study was largely motivated by a real beef cattle dataset. The data consisted of calving ease and related measurements in a composite breed of beef cattle obtained from the USDA Fort Keogh Range and Research Laboratory in Miles City, Montana [36-37]. After quality control, the USDA data included 955 calving ease phenotypes (or observations). Calving ease was scored on a four-category scale in the raw data, with a score of 1 for the easiest calving and 4 for the most problematic (Caesarian section required); however, category 4 contained few observations and so categories 3 and 4 were combined into a single bin to avoid an extreme-case problem (ECP), when the discrete response is the same for all individuals within a given systematic effect class. The three resulting categories were scored as 0 for no difficulty, 1 for moderate difficulty and 2 for severe difficulty. The distribution of calving ease scores showed an unexpected pattern of few calvings for category 1 (moderate difficulty). Given the subjectivity of the trait (scoring carried out by farm employees) and the fact that some cows calve overnight under standard ranch conditions with limited presence or absence of ranch employees, it was reasonable to assume that the collected data could be subject to potential misclassification. In order to test the adequacy of the proposed method to deal with noisy ordered categorical data and to evaluate its robustness in the presence of a small sample size, two datasets of large and small (D1 and D2, respectively) sample size were simulated.

Simulated data

Both datasets (D1 and D2) were simulated following the structure of a real calving ease dataset. For both datasets (D1 and D2), a discrete trait with three categories (C = 3), similar to calving ease, was simulated. The number of animals with phenotype (n) ranged between 8,939 and 9,042 for D1 and 1,393 and 1,425 for D2. The pedigree files included 10,000 and 1,563 animals for D1 and D2, respectively. A mixed linear model that includes three systematic effects and one random effect in addition to the error terms was used to model the liabilities, where is a vector of liabilities of size n, and are vectors of fixed and random (animal) effects, respectively, is a vector of residuals, and and are known incidence matrices with the appropriate dimensions. In this model, the random effect accounts for the animal’s additive genetic contribution to the trait and is commonly called the animal additive effect. The genetic covariance between animals is accounted for through the average relationship matrix, commonly denoted as , which is computed based on the known pedigree information. The element a of the matrix is simply the expected additive relationship between animals i and j. A detailed algorithm for constructing the relationship matrix can be found in [38]. Solving the mixed model equations requires inversion of the relationship matrix. While the relationship matrix is relatively sparse, it’s much more computationally efficient to directly build the inverse of the relationship matrix, using simple rules derived in [39]. Each systematic effect was drawn from a normal distribution. The number of levels and distributional parameters of each systematic effect are summarized in Table 1. The random effects and residual terms were generated from the following distributions: where is a known matrix of expected additive relationships between animals computed based on the information in the pedigree files [39] and σ and σ were set equal to 0.1 and 1, respectively, resulting in a heritability of 0.091.

Table 1

Number of classes and distributions used to simulate the systematic effects for the large (D1) and small (D2) datasets.

Fixed Effect	Number of Levels		Distribution
Fixed Effect	D1	D2	Distribution
1	20	5	N(-0.7,0.05)
2	10	5	N(-0.1,0.5)
3	5	5	N(0.1,0.2)

The simulated liability was discretized to generate an ordered categorical trait with three categories (C = 3) using the following relationship: where y is the discrete response for animal i, taking the value of 0, 1, or 2; and = (t = -∞, t, t, t = ∞) is a vector of threshold values with t and t arbitrarily set to 0 and 1, respectively. Incidence rates of the three categories (0, 1, and 2) were on average (0.39, 0.38, and 0.23) and (0.36, 0.46, and 0.18) for D1 and D2, respectively. All scenarios were replicated 10 times and ECPs were avoided. Misclassification was introduced to the simulated data by switching a subset of observations from true class y = j to observed class y = k for (j, k) ϵ (0, 1, 2) and j ≠ k. Two scenarios of misclassification were investigated, symmetrical misclassification, where the rate of misclassification (π) between two classes is equal in both directions (π = π), and non-symmetrical misclassification, where the rate of misclassification from class j to class k is allowed to differ from class k to class j (π ≠ π). For the symmetrical misclassification, a misclassification rate of 2.5% for each direction was introduced to the data (π = 0.025 for all j ≠ k). This approach yielded a non-differential net rate of misclassification in the simulated data of 5%. For the nonsymmetrical scenario, a misclassification rate was set equal to 1, 3, 1.5, 1, 0.1, and 0.1% for π, π, π, π, π, and π, respectively.

Real data

The real data consisted of calving ease records from 955 first parity cows from a Composite Gene Combination breed (CGC; 50% Red Angus, 25% Charolais, 25% Tarentaise; [36-37]) born between 2002 and 2011 at USDA-ARS, Fort Keogh Livestock and Range Research Laboratory, Miles City, MT. The pedigree file was comprised of 1,357 animals, including 82 sires and 651 dams. Calving ease was scored on a discrete scale with three categories (0 = no difficulties; 1 = moderate difficulties; and 2 = severe difficulties), where the most severe category was composed of two separate classes that were collapsed together. The systematic effects consisted of sex (2 classes), feed treatment (2 classes) and year of birth (10 classes).

Data analysis

Each of the simulated data sets for the symmetric scenario was analyzed using the following models: true simulated discrete data analyzed with a classical threshold model (M1), noisy simulated data analyzed with a classical threshold model that does not contemplate misclassification (M2), noisy simulated data analyzed with a threshold model that contemplates misclassification following our proposed method and misclassification probabilities assumed known (M3), noisy simulated data analyzed with a threshold model that contemplates misclassification following our proposed method and misclassification probabilities assumed unknown (M4). An additional analysis was carried out to test the validity of our approach under a null model (when no misclassification is present in the data). For that purpose, the true simulated data was analyzed with a threshold model that contemplates misclassification following our proposed method and misclassification probabilities assumed unknown (M5). The simulated data generated under the non-symmetric scenario was analyzed only with M4. Similarly, the real data was analyzed only with M2. Implementation was carried out using the Gibbs sampler. A unique chain of 200,000 samples was implemented with the first 100,000 iterations discarded as burn-in period. The probability of each observation being misclassified was calculated as the ratio between the number of times the observation was found to be misclassified and the total number of iterations.

Results

Under the symmetric scenario, the estimate of the genetic variance, for the large simulated data and in the absence of misclassification in the discrete response (M1), was virtually equal to the true value used in the simulation (0.10). Furthermore, the 95% highest posterior density (HPD) interval is well-centered around the true value, indicating a reduction of bias (Table 2). However, when 5% of the discrete responses were artificially misclassified but no measures to address misclassification taken during the analysis (M2), the genetic variance was grossly underestimated with a posterior mean of roughly half the true value used in the simulation (0.052). Additionally, the 95% HPD interval did not include the true genetic variance (0.10), indicating significant bias in the estimate of the parameter (Table 2). This result demonstrates that when appropriate measures are not taken to correct or account for misclassification, there will be substantial bias in parameter estimation even with low to moderate noise in the input data.

Table 2

Posterior means, posterior standard deviations and the 95% highest posterior density interval (HPD) of the genetic variance (true value = 0.1) under different models and datasets.

	Mean		Standard Deviation		95% HPD Interval
	D1	D2	D1	D2	D1	D2
M1	0.106	0.135	0.0255	0.0670	0.0586–0.170	0.0469–0.319
M2	0.0521	0.0919	0.0143	0.0422	0.0288–0.0846	0.0364–0.198
M3	0.112	0.151	0.0355	0.0881	0.0540–0.191	0.0460–0.380
M4	0.0998	0.106	0.0188	0.0239	0.0688–0.142	0.0681–0.161
M5	0.113	0.108	0.0226	0.0251	0.0755–0.163	0.0692–0.167

2 D1: large dataset, D2: small dataset

1 M1: True data analysis with a classical threshold model, M2: Noisy data analyzed with a classical threshold model, M3: noisy data analyzed with the proposed method assuming the misclassification probability is known, M4: same as M3 except the misclassification probability is assumed unknown, M5: Noise free data analyzed using our proposed method (null model); 2 D1: large dataset, D2: small dataset When the same noisy data used in M2 was analyzed using the proposed methodology that accounts for potential misclassification, bias in estimates of genetic variance was significantly reduced. This was true both when the probability of misclassification, π, between each pair of classes j (j = 1,2,3) and k (k = 1,2,3), was assumed known and equal to the true value used in the simulation (0.025) (M3) and when it was assumed unknown and estimated during the analysis (M4). In fact, the posterior mean of genetic variance was 0.112, and 0.0998, for M3, M4, respectively. In each of these cases, the true value of the parameter (0.10) was well within the 95% HPD interval, indicating a reduction of bias. For the non-symmetric simulation scenario, the posterior mean (SD) of the genetic variance was equal to 0.118 (0.0196) and 0.103 (0.0243) for D1 and D2, respectively. These estimates are very similar to the true value used in the simulation (0.10) and based on the 95% HPD interval (information not presented), these estimates have minimal bias. Estimates of the probabilities of misclassification under model M4 for the symmetric and non-symmetric simulation scenarios are presented in Table 3. For the symmetric scenario, the misclassification probabilities between categories 1 and 2 (π12) and 2 and 3 (π23) of the discrete responses were underestimated with posterior means of 0.0146 and 0.0135, respectively, compared to the true value of 0.025. However, the misclassification probability between categories 2 and 3 (π23) was estimated with minimal bias and the posterior mean (0.0239) is very similar to the true value (0.025). In spite of the underestimation of some of the misclassification probabilities, the results clearly indicate the ability of the proposed methodology to adjust for the misclassification in the data whether or not the true rate of misclassification is known. Similar performance was observed using data from the non-symmetric simulation scenario. In fact, five out of the six misclassification probabilities were accurately estimated and only the misclassification probability between categories 2 and 1 (π) was under estimated (Table 3).

Table 3

Posterior mean of the misclassification probability between the different categories of the discrete responses and datasets.

	Symmetrical Misclassification
	True Value	M4		M5
Parameter		D1	D2	D1	D2
π₁₂	0.025	0.0146	0.0128	0.0112	0.0124
π₂₃	0.025	0.0135	0.0062	0.0065	0.0057
π₁₃	0.025	0.0239	0.0067	0.0019	0.0029
	Nonsymmetrical Misclassification
	True Value	M4
		D1		D2
π₁₂	0.01	0.010		0.0095
π₂₁	0.03	0.013		0.0100
π₂₃	0.015	0.011		0.0098
π₃₂	0.01	0.011		0.0099
π₁₃	0.001	0.0013		0.0013
π₃₁	0.001	0.0014		0.0011

1 D1: large dataset, D2: small dataset; π is the misclassification probability between categories i and j

1 D1: large dataset, D2: small dataset; π is the misclassification probability between categories i and j To further evaluate the adequacy of the proposed methodology, we tested its performance under a null model (no misclassification in the data) under the symmetric simulation scenario. Thus, when the noise free data used in M1 was reanalyzed using our proposed model that contemplates potential misclassification (M5), the estimate of the genetic variance (0.113) was similar to the true value and minimal bias was observed (Table 2). Most importantly, the estimated misclassification probabilities were negligible, indicating that very few observations were detected as potentially misclassified (Table 3). This was expected given that the data was free of noise. In order to evaluate the ability of the proposed method to correctly identify misclassified observations, we calculated the posterior probability of misclassification of each observation in the data set. The average misclassification probability of miscoded observations was over twenty times higher than that of the non-miscoded observations. For the 95% truly non-misclassified and 5% misclassified observations, their average misclassification probability was 0.026 and 0.554, respectively. Fig 1 presents the distribution of misclassification probability for the miscoded observations (Fig 1A) and the correctly coded observations (Fig 1B) for one replicate. The 85th percentile of correctly classified observations was 0.02613, while the 15th percentile of misclassified observations was 0.0297. This is important as it shows that the algorithm was able to distinguish between the two groups and the miscoded records were detected with a high probability.

Fig 1

Misclassification probability densities for the a. miscoded observations and b. correctly coded observations.

Misclassification probability densities for the a. miscoded observations and b. correctly coded observations. The correlation between the true and estimated breeding values under the different models for the symmetric scenario is presented in Table 4. As expected, the highest correlation (0.378) was obtained when the data was free of misclassification (M1), and the minimum (0.300) when the data was noisy and misclassification was ignored during the analysis (M2). Using the noisy data and the proposed methods, the correlation was 0.350 and 0.348 for M3 and M4, respectively. Under the null model (M5), the correlation was the same as in M1. Under the non-symmetric simulation scenario, the correlation between true and estimated breeding values was equal to 0.36 and 0.356 for D1 and D2, respectively.

Table 4

Pearson correlation between true and estimated breeding values under different models and datasets.

	D1	D2
M1	0.378	0.370
M2	0.300	0.314
M3	0.350	0.346
M4	0.348	0.347
M5	0.377	0.370

2 D1: large dataset, D2: small dataset

1 M1: True data analysis with a classical threshold model, M2: Noisy data analyzed with a classical threshold model, M3: noisy data analyzed with the proposed method assuming the misclassification probability is known, M4: same as M3 except the misclassification probability is assumed unknown, M5: Noise free data analyzed using our proposed method (null model); 2 D1: large dataset, D2: small dataset For the small simulated data, due to the fact that the true parameters of the model were estimated with large uncertainty, the performance of the different methods was compared to M1 as reference. This choice is motivated by the fact that results obtained using M1 (noise-free data and classical threshold model) are the best we can expect from the analysis of the small simulated data. Under the symmetric simulation scenario, methods M2, M4, and M5 underestimated the genetic variance by 32, 21, and 20% compared to M1, respectively, and M3 resulted in an overestimation of 12%. Although the proposed methods (M3, M4) for dealing with misclassification were unable to reproduce the results obtained using the noise-free data (M1), they reduced the bias by more than half compared to M2. For M4, misclassification probabilities were also underestimated with posterior means equal to 0.013, 0.007, and 0.007 for π12, π23, and π13, respectively. Correlations between true and predicted breeding values (Table 4) followed the same trend as those for the large data set, although they were a bit smaller in magnitude. The real data was analyzed assuming nonsymmetric misclassification probabilities between categories. The estimated genetic variance was 0.326 and 0.118 using a classical threshold model (M1) and our proposed method, respectively. Although there is no true value to compare with, estimates obtained using our method are more realistic for a trait characterized by low heritability [40-41]. Misclassification probabilities were around zero (0.001) for π31 and π13, and approximately ten-fold higher (0.0098–0.0103) for π12, π21, π23, and π32. These results are expected given the unlikely misclassification between the third category (major calving difficulty) and the other two categories and any potential misclassification is likely the result of a posterior assessment of the calving event.

Discussions

It has been well-established in the literature that misclassification in the response variable, especially discrete responses, will lead to biased estimation of parameters in the model [1]. The results of this study showed clearly that a small noise level (5%) in an ordered categorical response (3 categories) resulted in a substantial bias in the estimation of genetic variance and breeding values. In animal and plant improvement programs such inaccurate estimates will negatively impact the efficiency of selection. The methodology discussed here is an extension to the work carried out by our group on the analysis of binary data subject to misclassification [5-6,11]. It proposes to address misclassification by identifying individual observations that are miscoded followed by switching (reassignment) to their true class. In doing so, a ‘true’ dataset is produced from which estimates of parameters are inferred. While it is unlikely that all misclassified observations will be successfully detected in any practical application, the results presented here show that bias can be significantly reduced by the proposed method. When the rate of misclassification was assumed known, as may be the case in some diagnostic applications, bias in estimates of genetic variance was nearly eliminated and the resulting posterior mean (0.112) was very similar to the estimate obtained for the true large sample data (Table 2). In most applications there may be reason to suspect misclassification in the data, although its extent may not be known. Even in such a scenario, the proposed method was able to substantially reduce the bias in the estimation of the genetic variance (M4; Table 2). Both when the misclassification probabilities were known or unknown, estimates of the breeding values were virtually identical to those obtained using the true noise-free data set. This is crucial for phenotype prediction and efficiency of selection. When the data was small, the proposed method was not able to completely eliminate the bias in the estimation of the genetic variance, but it helped reduce it by almost a half. This could be due to the complexity of estimating the genetic variance with small data even in the absence of misclassification. Additionally, with a small data set and a low noise level (5%) the number of misclassified observations is limited (in our case an average of 7 misclassified observations between each pair of categories), which complicates the learning of the true data generating process needed to identify potentially misclassified responses. However, the breeding values were estimated with an accuracy similar to the large data set (Table 4). For the large data set, the probability of misclassification between classes 1 and 3 (π13) was accurately estimated. However, the probability of misclassification between classes 1 and 2 (π12) and 2 and 3 (π23) were underestimated. This is not surprising due to the marked differences in liabilities between observations in more distant categories. Some observations in adjacent categories tend to have very similar liabilities (close to the separation threshold), which enormously complicates the detection of misclassified observations and facilitates the switching of truly classified responses between categories. This behavior is illustrated in Fig 2, where the probability of misclassification for correctly classified observations is highest when the true liability is located near a threshold value separating two classes. In the case of misclassified observations, the probability of misclassification is highest (almost 1) when non-adjacent categories are involved (i.e., classes 1 and 3) as indicated in Fig 3. However, when the misclassification occurs between adjacent classes, the misclassification probability is lower, especially when the liability of the switched observation is closer to the threshold separating its true and observed classes (Fig 3).

Fig 2

Probability of misclassification of correctly classified observations.

Dashed red lines indicate the threshold values separating the three categories.

Fig 3

Probability of misclassification of miscoded observations.

Probability of misclassification of correctly classified observations.

Dashed red lines indicate the threshold values separating the three categories.

Probability of misclassification of miscoded observations.

Dashed red lines indicate the threshold values separating the three categories. Individual plots are separated by the observed (miscoded) class of observations. Observations are plotted along the x-axis by their true liability, indicating their true class. Results of the analysis of the true dataset indicate the possibility of a low level of misclassification. This is supported by the significant change in the estimation of the genetic variance when misclassification was contemplated in the model. Furthermore, estimates of the misclassification probabilities are well supported by husbandry practices. In practice, it is unlikely that a misclassification could occur between category 3 (difficult calving) and the other two categories (1 and 2), but owevemisclassification is likely between classes 1 and 2, which are more subjective. In real applications, there is no absolute certainty about the presence of misclassification in a given data set. Thus, we tested our proposed methodology under a null model (no misclassification in the data). Under small and large data set scenarios, the estimated genetic variance and breeding values were virtually the same as when the true data was analyzed using a classical threshold model (Tables 2 and 4). Additionally, the probabilities of misclassification were small to negligible, indicating the absence of misclassification as expected. In this study, the proposed methodology was tested using simulated datasets for an ordered categorical response with three categories and a heritability of 0.10. We assumed that the misclassification probabilities were symmetric between adjacent categories in order to facilitate the implementation. While the extension to traits with more categories and asymmetric misclassification between adjacent classes is straightforward, it could be computationally costly. For traits with higher heritability, and therefore higher predictability, performance of the methodology would be expected to improve.

Conclusions

Large data sets are being collected in several fields of research from human heath to precision agriculture. Although this data will undoubtedly contribute towards answering complex scientific questions, the unavoidable noise, including the misclassification of discrete response variables, may result in biased inference, loss of statistical power and ultimately a less efficient decision-making process. The methodology proposed in this study provides an effective tool to at least reduce the negative impact of misclassification through the identification of potentially miscoded observations and the prediction of their true response class. Furthermore, it has been shown that the method is able to yield a similar reduction in bias whether the rate of misclassification in the data is known or to be estimated. As misclassification is seldom known before the analysis, any proposed method to deal with misclassification has to perform well in the presence of noise-free data. As indicated by the results of the null model, our method was adequate. While the application demonstrated in the current study is on simulated beef cattle production data, the methodology is flexible and would be adaptable to any ordinal response.

25 in total

1. Patterns of bias due to differential misclassification by case-control status in a case-control study.

Authors: Po-Huang Chyou
Journal: Eur J Epidemiol Date: 2007-01-17 Impact factor: 8.082

2. Effects of misdiagnosis in input data on the identification of differential expression genes in incipient Alzheimer patients.

Authors: Sandeep Joseph; Kelly Robbins; Wensheng Zhang; Romdhane Rekaya
Journal: In Silico Biol Date: 2008

3. Bias attenuation results for nondifferentially mismeasured ordinal and coarsened confounders.

Authors: Elizabeth L Ogburn; Tyler J Vanderweele
Journal: Biometrika Date: 2013 Impact factor: 2.445

4. The effects of misclassification on the estimation of relative risk.

Authors: B A Barron
Journal: Biometrics Date: 1977-06 Impact factor: 2.571

5. Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data.

Authors: Jessie K Edwards; Stephen R Cole; Melissa A Troester; David B Richardson
Journal: Am J Epidemiol Date: 2013-04-04 Impact factor: 4.897

6. Binary regression with differentially misclassified response and exposure variables.

Authors: Li Tang; Robert H Lyles; Caroline C King; David D Celentano; Yungtai Lo
Journal: Stat Med Date: 2015-02-04 Impact factor: 2.373

7. A simulation study for the analysis of uncertain binary responses: application to first insemination success in beef cattle.

Authors: Robyn L Sapp; Matthew L Spangler; Romdhane Rekaya; J Keith Bertrand
Journal: Genet Sel Evol Date: 2005 Nov-Dec Impact factor: 4.297

8. Fixed effects in the formation of a composite line of beef cattle: I. Experimental design and reproductive performance.

Authors: S Newman; M D MacNeil; W L Reynolds; B W Knapp; J J Urick
Journal: J Anim Sci Date: 1993-08 Impact factor: 3.159

9. Genome wide association studies in presence of misclassified binary responses.

Authors: Shannon Smith; El Hamidi Hay; Nourhene Farhat; Romdhane Rekaya
Journal: BMC Genet Date: 2013-12-26 Impact factor: 2.797

10. Codifying healthcare--big data and the issue of misclassification.

Authors: Karim S Ladha; Matthias Eikermann
Journal: BMC Anesthesiol Date: 2015-12-15 Impact factor: 2.217