Ashley Ling, El Hamidi Hay, Samuel E Aggrey, Romdhane Rekaya.
Abstract
Ordinal categorical responses are frequently collected in survey studies, human medicine, and animal and plant improvement programs, to mention just a few. Errors in this type of data are neither rare nor easy to detect. These errors tend to bias inference and to reduce statistical power and, ultimately, the efficiency of the decision-making process. In contrast to the binary situation, where misclassification occurs between two response classes, noise in ordinal categorical data is more complex because of the larger number of categories and the diversity and asymmetry of errors. Although several approaches have been presented for dealing with misclassification in binary data, only a limited number of practical methods have been proposed to analyze noisy categorical responses. A latent variable model implemented within a Bayesian framework was proposed to analyze ordinal categorical data subject to misclassification, and it was evaluated using simulated and real datasets. The simulated scenario consisted of a discrete response with three categories and a symmetric error rate of 5% between any two classes. The real data consisted of calving ease records of beef cows. In both real and simulated data, ignoring misclassification resulted in substantial bias in the estimates of genetic parameters and reduced the accuracy of predicted breeding values. With the proposed approach, a significant reduction in bias and an increase in accuracy ranging from 11% to 17% were observed. Furthermore, most of the misclassified observations (in the simulated data) were identified with a substantially higher probability. Similar results were observed for a scenario with asymmetric misclassification. While the extension to traits with more categories between adjacent classes is straightforward, it could be computationally costly. For traits with high heritability, the performance of the methodology would be expected to improve.
Year: 2018 PMID: 30543662 PMCID: PMC6292639 DOI: 10.1371/journal.pone.0208433
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
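The simulated scenario in the abstract (three ordered categories, symmetric misclassification) can be illustrated with a minimal sketch. It is not the authors' simulation code. It assumes a standard normal liability with no systematic or genetic effects, hypothetical thresholds of -0.5 and 0.5, and a probability `pi = 0.025` of recoding a record into each of the other two classes, which is one reading of the 5% symmetric error rate. The function name `simulate_noisy_ordinal` is made up for this example.

```python
import numpy as np

def simulate_noisy_ordinal(n, thresholds, pi, seed=0):
    """Draw ordinal classes from a latent liability, then miscode some of them.

    thresholds: increasing cut points on the liability scale (hypothetical values).
    pi: probability of recoding a record into each one of the other classes
        (assumed equal for every pair of classes, i.e. symmetric noise).
    """
    rng = np.random.default_rng(seed)
    liability = rng.standard_normal(n)               # assumption: N(0, 1) liability
    true_class = np.digitize(liability, thresholds)  # classes 0 .. K-1
    k = len(thresholds) + 1

    # Transition matrix: row = true class, column = observed class.
    trans = np.full((k, k), pi)
    np.fill_diagonal(trans, 1.0 - (k - 1) * pi)

    # Sample the observed class from the row that matches each true class.
    cum = np.cumsum(trans[true_class], axis=1)
    cum[:, -1] = 1.0                                 # guard against rounding error
    u = rng.random(n)
    observed = (u[:, None] > cum).sum(axis=1)
    return liability, true_class, observed

liability, true_class, observed = simulate_noisy_ordinal(100_000, [-0.5, 0.5], 0.025)
print("share of miscoded records:", np.mean(true_class != observed))
```

With these assumed settings each record is miscoded with total probability 2 × 0.025, so the printed share should sit close to 0.05.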
Number of classes and distributions used to simulate the systematic effects for the large (D1) and small (D2) datasets.
| Systematic effect | Number of levels (D1) | Number of levels (D2) |
|---|---|---|
| 1 | 20 | 5 |
| 2 | 10 | 5 |
| 3 | 5 | 5 |
Posterior means, posterior standard deviations and the 95% highest posterior density interval (HPD) of the genetic variance (true value = 0.1) under different models and datasets.
| Model | Mean (D1) | Mean (D2) | Standard deviation (D1) | Standard deviation (D2) | 95% HPD interval (D1) | 95% HPD interval (D2) |
|---|---|---|---|---|---|---|
| M1 | 0.106 | 0.135 | 0.0255 | 0.0670 | 0.0586–0.170 | 0.0469–0.319 |
| M2 | 0.0521 | 0.0919 | 0.0143 | 0.0422 | 0.0288–0.0846 | 0.0364–0.198 |
| M3 | 0.112 | 0.151 | 0.0355 | 0.0881 | 0.0540–0.191 | 0.0460–0.380 |
| M4 | 0.0998 | 0.106 | 0.0188 | 0.0239 | 0.0688–0.142 | 0.0681–0.161 |
| M5 | 0.113 | 0.108 | 0.0226 | 0.0251 | 0.0755–0.163 | 0.0692–0.167 |
M1: true data analyzed with a classical threshold model; M2: noisy data analyzed with a classical threshold model; M3: noisy data analyzed with the proposed method assuming the misclassification probability is known; M4: same as M3 except that the misclassification probability is assumed unknown; M5: noise-free data analyzed using the proposed method (null model).
D1: large dataset; D2: small dataset.
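To make the size of the bias easier to read, the posterior means in the table above can be expressed relative to the true genetic variance of 0.1. The sketch below does only that arithmetic. "Relative bias" is an illustrative summary chosen here, not a statistic reported in the source, and the names `posterior_means` and `relative_bias_pct` are made up for the example.

```python
# Posterior means of the genetic variance, copied from the table above.
TRUE_GENETIC_VARIANCE = 0.1
posterior_means = {
    "M1": {"D1": 0.106, "D2": 0.135},
    "M2": {"D1": 0.0521, "D2": 0.0919},
    "M3": {"D1": 0.112, "D2": 0.151},
    "M4": {"D1": 0.0998, "D2": 0.106},
    "M5": {"D1": 0.113, "D2": 0.108},
}

def relative_bias_pct(estimate, truth):
    """Signed bias as a percentage of the true value (illustrative metric)."""
    return 100.0 * (estimate - truth) / truth

bias_table = {
    model: {ds: relative_bias_pct(est, TRUE_GENETIC_VARIANCE) for ds, est in by_ds.items()}
    for model, by_ds in posterior_means.items()
}

for model, by_ds in bias_table.items():
    print(model, {ds: f"{bias:+.1f}%" for ds, bias in by_ds.items()})
```

On this scale, ignoring the noise (M2) on the large dataset underestimates the genetic variance by roughly 48%, while M4 on the same dataset is within 1% of the true value.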
Posterior mean of the misclassification probability between the different categories of the discrete responses and datasets.
Symmetrical misclassification
| True value | M4 (D1) | M4 (D2) | M5 (D1) | M5 (D2) |
|---|---|---|---|---|
| 0.025 | 0.0146 | 0.0128 | 0.0112 | 0.0124 |
| 0.025 | 0.0135 | 0.0062 | 0.0065 | 0.0057 |
| 0.025 | 0.0239 | 0.0067 | 0.0019 | 0.0029 |
Asymmetrical misclassification
| True value | M4 (D1) | M4 (D2) |
|---|---|---|
| 0.01 | 0.010 | 0.0095 |
| 0.03 | 0.013 | 0.0100 |
| 0.015 | 0.011 | 0.0098 |
| 0.01 | 0.011 | 0.0099 |
| 0.001 | 0.0013 | 0.0013 |
| 0.001 | 0.0014 | 0.0011 |
D1: large dataset; D2: small dataset; π_ij is the misclassification probability between categories i and j.
Fig 1. Misclassification probability densities for (a) miscoded observations and (b) correctly coded observations.
Pearson correlation between true and estimated breeding values under different models and datasets.
| Model | D1 | D2 |
|---|---|---|
| M1 | 0.378 | 0.370 |
| M2 | 0.300 | 0.314 |
| M3 | 0.350 | 0.346 |
| M4 | 0.348 | 0.347 |
| M5 | 0.377 | 0.370 |
M1: true data analyzed with a classical threshold model; M2: noisy data analyzed with a classical threshold model; M3: noisy data analyzed with the proposed method assuming the misclassification probability is known; M4: same as M3 except that the misclassification probability is assumed unknown; M5: noise-free data analyzed using the proposed method (null model).
D1: large dataset; D2: small dataset.
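The abstract reports an increase in accuracy of 11% to 17%. One plausible reading, assumed here and not confirmed by the source, is that this is the gain in correlation of M3 and M4 relative to M2, the model that ignores the noise. The sketch below computes that ratio from the table above. The names `correlations` and `gain_pct` are made up for the example.

```python
# Pearson correlations between true and estimated breeding values,
# copied from the table above.
correlations = {
    "M1": {"D1": 0.378, "D2": 0.370},
    "M2": {"D1": 0.300, "D2": 0.314},
    "M3": {"D1": 0.350, "D2": 0.346},
    "M4": {"D1": 0.348, "D2": 0.347},
    "M5": {"D1": 0.377, "D2": 0.370},
}

# Percentage gain over M2 (noisy data, classical threshold model).
# Using M2 as the baseline is an assumption made for this illustration.
gain_pct = {
    model: {
        ds: 100.0 * (correlations[model][ds] - correlations["M2"][ds]) / correlations["M2"][ds]
        for ds in ("D1", "D2")
    }
    for model in ("M3", "M4")
}

for model, by_ds in gain_pct.items():
    print(model, {ds: f"{gain:.1f}%" for ds, gain in by_ds.items()})
```

Computed this way, the gains fall between about 10% and 17%, broadly in line with the reported range. The small gap at the lower end may come from rounding in the tabulated correlations.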
Fig 2. Probability of misclassification of correctly classified observations.
Dashed red lines indicate the threshold values separating the three categories.
Fig 3. Probability of misclassification of miscoded observations.
Dashed red lines indicate the threshold values separating the three categories. Individual plots are separated by the observed (miscoded) class of observations. Observations are plotted along the x-axis by their true liability, indicating their true class.
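Figs 2 and 3 relate the probability of misclassification to the liability and the thresholds. The sketch below shows one generic way such a probability can arise from Bayes' rule. It is not the authors' sampler or their exact formula. It assumes a unit-variance normal liability centered at a known mean `mu`, hypothetical thresholds, and a symmetric recoding probability `pi`. The function names are made up for the example.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def class_probs(mu, thresholds):
    """Probability of each true class when the liability is N(mu, 1)."""
    cuts = [-math.inf, *thresholds, math.inf]
    return [norm_cdf(cuts[j + 1] - mu) - norm_cdf(cuts[j] - mu)
            for j in range(len(cuts) - 1)]

def prob_miscoded(observed, mu, thresholds, pi):
    """P(true class differs from the observed class | observed class, mu).

    Assumes each true class is recoded into each other class with probability pi.
    """
    k = len(thresholds) + 1
    prior = class_probs(mu, thresholds)
    # Likelihood of the observed class given each candidate true class.
    like = [1.0 - (k - 1) * pi if j == observed else pi for j in range(k)]
    joint = [p * l for p, l in zip(prior, like)]
    return 1.0 - joint[observed] / sum(joint)

# A record observed in the lowest class (0), evaluated at three liability means.
for mu in (-2.0, 0.0, 2.0):
    print(mu, prob_miscoded(0, mu, [-0.5, 0.5], 0.025))
```

Under these assumptions the probability rises as the liability mean moves away from the region of the observed class. This gives a qualitative sense of how a record coded in one class but sitting far from it on the liability scale can be flagged as likely miscoded.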