| Literature DB >> 19461884 |
Andrey Rzhetsky, Hagit Shatkay, W. John Wilbur.
Abstract
Large-scale annotation efforts typically involve several experts who may disagree with each other. We propose an approach for modeling disagreements among experts that allows providing each annotation with a confidence value (i.e., the posterior probability that it is correct). Our approach allows computing a certainty level for individual annotations, given annotator-specific parameters estimated from data. We developed two probabilistic models for performing this analysis, compared these models using computer simulation, and tested each model's actual performance on a large data set generated by human annotators specifically for this study. We show that even in the worst-case scenario, when all annotators disagree, our approach allows us to significantly increase the probability of choosing the correct annotation. Along with this publication we make publicly available a corpus of 10,000 sentences annotated according to several cardinal dimensions that we have introduced in earlier work. The 10,000 sentences were all 3-fold annotated by a group of eight experts, while a 1,000-sentence subset was further 5-fold annotated by five new experts. While the presented data represent a specialized curation task, our modeling approach is general; most data annotation studies could benefit from our methodology.
Year: 2009 PMID: 19461884 PMCID: PMC2678295 DOI: 10.1371/journal.pcbi.1000391
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
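The confidence values described in the abstract can be sketched with a deliberately simplified model (not the paper's exact Model A or B): each annotator has a single correctness probability, and an incorrect annotator errs uniformly over the remaining codes. All function and parameter names here are illustrative.

```python
def annotation_posterior(labels, correctness, label_space, prior=None):
    """Posterior over the true label, one label per annotator.

    Simplified one-parameter-per-annotator model: annotator i picks the
    true label with probability correctness[i], otherwise errs uniformly
    over the remaining len(label_space) - 1 labels.
    """
    k = len(label_space)
    if prior is None:
        prior = {v: 1.0 / k for v in label_space}  # uniform prior over codes
    post = {}
    for v in label_space:
        p = prior[v]
        for a, theta in zip(labels, correctness):
            p *= theta if a == v else (1.0 - theta) / (k - 1)
        post[v] = p
    z = sum(post.values())  # normalize to a proper distribution
    return {v: p / z for v, p in post.items()}

# Worst case from the abstract: three annotators disagree three ways.
# The more reliable annotator's label still gets a posterior well
# above the 1/3 chance level.
post = annotation_posterior(["x", "y", "z"], [0.9, 0.7, 0.6], ["x", "y", "z"])
```

Even with a full three-way disagreement, the posterior concentrates on the label of the annotator with the highest estimated correctness, which is the sense in which the approach "increases the probability of choosing the correct annotation".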
Figure 1. Two stages of our analysis: annotation (I) and inference (II).
First, we used a loop design of experiments to generate annotation data and to estimate the annotator-specific correctness parameters (I). Second, we used the correctness parameter estimates obtained to resolve annotation conflicts and estimate the posterior probability associated with each alternative annotation (II). The probabilistic model is depicted as a dark prism. We had eight annotators grouped into three-annotator groups in such a way that each annotator participated in exactly three groups and all groups were different. This ensured that we could recover correctness estimates for all eight annotators even though some of them (for example, annotators 2 and 7) never annotated the same fragment of text. (Size of symbols representing hypothetical correctness parameter estimates is intended to indicate the magnitude of the corresponding value.)
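One design with the properties stated in the caption (eight annotators, eight distinct three-annotator groups, each annotator serving in exactly three groups) takes consecutive triples around a ring; this is a sketch of one such "loop" design, not necessarily the exact grouping used in the study.

```python
def loop_design(n_annotators=8, group_size=3):
    """Consecutive-triple 'loop' design: annotators sit on a ring and each
    group takes `group_size` neighbours, so every annotator serves in
    exactly `group_size` groups and all groups differ."""
    return [tuple((i + j) % n_annotators for j in range(group_size))
            for i in range(n_annotators)]

groups = loop_design()
```

With 1-based labels, annotators 2 and 7 (0-based indices 1 and 6) never share a group under this design, yet their correctness parameters remain estimable because the groups overlap pairwise around the ring, consistent with the caption's point.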
Example of a sentence from the dataset, annotated by 5 independent annotators (sentence 10835394_70).

| Fragment | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| A | | 1SP3E3 | | | |
| B | | | 1SP3E3 | 1SP3E3 | 1GP3E3 |
| C | 1SP3E3 | 2SP3E1 | 1SP2E0 | 2SP2E0 | 2GP2E3 |

(Blank cells: the annotator did not end an annotated unit at that fragment boundary; e.g., A1 covered the whole sentence with a single annotation.)
Annotations in the context of the real sentence are as follows:
The phenotypes of mxp19 (Fig 1B) |A2:**1SP3E3| and mxp170 (data not shown) homozygotes and hemizygotes (data not shown) are identical, |A3:**1SP3E3| |A4:**1SP3E3| |A5:**1GP3E3| suggesting that mxp19 and mxp170 are null alleles. |A1:**1SP3E3| |A2:**2SP3E1| |A3:**1SP2E0| |A4:**2SP2E0| |A5:**2GP2E3|
The minimum number of sentence fragments required to represent these annotations is three:
A = “The phenotypes of mxp19 (Fig 1B)”
B = “and mxp170 (data not shown) homozygotes and hemizygotes (data not shown) are identical,”
C = “suggesting that mxp19 and mxp170 are null alleles.”
Annotators' identities are concealed with codes A1, A2, A3, A4, and A5.
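A plain majority vote over the example's codes shows why a probabilistic model is needed: fragment B has a clear 2-of-3 winner, while fragment C's five codes are all distinct, the worst-case tie mentioned in the abstract. A small illustrative helper (names are ours):

```python
from collections import Counter

def majority_vote(labels):
    """Plain majority vote: returns (winner, count, tied_flag)."""
    counts = Counter(labels).most_common()
    top, n = counts[0]
    tied = len(counts) > 1 and counts[1][1] == n
    return top, n, tied

# Fragment B of the example sentence (annotators A3, A4, A5):
frag_b = majority_vote(["1SP3E3", "1SP3E3", "1GP3E3"])
# Fragment C: all five codes differ, so simple voting ties five ways,
# which is exactly where posterior-based resolution is needed.
frag_c = majority_vote(["1SP3E3", "2SP3E1", "1SP2E0", "2SP2E0", "2GP2E3"])
```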
Figure 2. Graphic outline of the two generative models of text annotations introduced in this study (A and B).
Figure 3. Simulation-estimation experiments assuming Model A (A) and Model B (B).
We performed 1,000×2 computational experiments to generate annotation data under Model A (plot A) and under Model B (plot B), estimating parameters under both models in each case. Each of the 1,000 iterations per plot involved sampling a new set of expected parameter values, generating artificial annotations using these expected values, and then estimating parameters from these artificial data. In each simulation iteration we generated 10,000 sets of artificial annotations imitating the work of three annotators. Note that although Model B is not defined in terms of annotator-specific correctness parameters, these parameters can easily be expressed as a function of the native parameters of Model B.

(A) Simulations under Model A: For each simulated data set we produced two different estimates, one with Model A and one with Model B. Model A estimation, starting from a random set of initial values with each correctness parameter > 0.5, reliably recovered the correctness parameter values (yellow dots). Estimation under Model B (blue dots) yielded a significantly wider scatter of estimates, most likely because the hill-climbing algorithm used in this estimation got stuck in one of the numerous local optima on the posterior-probability surface of Model B. Each round of parameter estimation produced two sets of three annotator-specific estimates, yielding 6 plot data points.

(B) Simulations under Model B: For each of the 1,000 simulated data sets we produced a triplet of estimates (random starts under Models A and B, plus a start under Model B at the expected parameter values). When started in the global-optimum mode (black dots), estimation under Model B reliably produced near-perfect estimates of the correctness parameters, outperforming the estimates obtained under Model A (yellow dots). However, when Model B estimation started from random parameter values, the estimates were widely scattered (blue dots), corresponding to the numerous local optima associated with Model B. Each estimation round resulted in three sets of three annotator-specific estimates, represented as 9 separate data points in the plot.
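The simulation-estimation loop of the caption can be sketched end-to-end under a simplified symmetric-error model (one correctness parameter per annotator, uniform errors; not the paper's exact Model A or B likelihood, and using EM rather than hill-climbing): simulate annotations from known parameters, then recover them, starting, as in the caption's Model A runs, from correctness values above 0.5.

```python
import random

def simulate(thetas, n_items, labels):
    """Simulate annotations: each annotator is correct with probability
    theta, otherwise errs uniformly over the other labels."""
    k = len(labels)
    truth, data = [], []
    for _ in range(n_items):
        v = random.choice(labels)
        truth.append(v)
        data.append([v if random.random() < th
                     else random.choice([u for u in labels if u != v])
                     for th in thetas])
    return truth, data

def em_correctness(data, labels, n_iter=30):
    """EM estimates of annotator correctness from the annotations alone."""
    k, m = len(labels), len(data[0])
    thetas = [0.8] * m  # start above 0.5, as in the Model A runs
    for _ in range(n_iter):
        post_correct = [0.0] * m
        for row in data:
            # E-step: posterior over the item's true label
            post = []
            for v in labels:
                p = 1.0
                for a, th in zip(row, thetas):
                    p *= th if a == v else (1.0 - th) / (k - 1)
                post.append(p)
            z = sum(post)
            for i, a in enumerate(row):
                post_correct[i] += post[labels.index(a)] / z
        # M-step: expected fraction of correct calls per annotator
        thetas = [c / len(data) for c in post_correct]
    return thetas

random.seed(7)
labels = ["00", "01", "10", "11"]
truth, data = simulate([0.9, 0.75, 0.6], 4000, labels)
est = em_correctness(data, labels)
```

With a few thousand simulated items the estimates land close to the generating parameters, mirroring the "reliably recovered" behaviour reported for well-initialized estimation in the caption.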
Figure 4. Estimates of parameters defined under Model A from real data.
(A) Estimates of correctness parameter values for eight annotators across multiple annotation types. (B–E) Estimates of α-parameters (conditional probabilities of agreement patterns given the correctness pattern). (F–I) Estimates of ω-values (frequencies of the annotation codes).
Figure 5. Estimates of parameters defined under Model B-with-thetas from real data.
(A) Estimates of correctness parameter values for eight annotators across multiple annotation types. While these values differ from those estimated under Model A (Figure 4A), the estimates are clearly consistent across the two models. (B–E) Estimates of γ-distributions, where γi is the probability that the ith annotation code is correct. Note that the γ-distributions are similar but not identical to the distributions of ω-values shown in Figure 4 (F–I).
Comparison of the two models, A and B-with-thetas, in terms of their efficiency in resolving three-way ties among three annotators.
| Annotation type | MAP coincides with the 3+5 majority vote (expected by chance) [p-values, Models A and B-with-thetas] | Two highest (expected by chance) [p-values, Models A and B-with-thetas] | Total |
|---|---|---|---|
| Number of fragments | 19*** (31/3) [0.00096], [8.8×10⁻⁶] | 29** (62/3) [0.00038], [0.0015] | 31 |
| Evidence | 62 (157/3) [0.1], [0.0028] | 114 (314/3) [0.1], [2.8×10⁻⁷] | 157 |
| Focus | 56**** (108/3) [4.5×10⁻⁵], [0] | 76 (216/3) [0.4], [3.6×10⁻⁸] | 108 |
| Polarity+Certainty | 52****, 52 (29) [1.3×10⁻⁸], [1.7×10⁻⁷] | (58) [0.6], [0.2] | 87 |
To test the models, we compare the posterior distributions of correct annotations computed under each model with the majority vote obtained by combining the original 3 annotations with 5 additional annotations. The MAP entries report matches of the maximum a posteriori (MAP) estimate of the correct annotation with the 8-evaluator majority vote; the two-highest entries report matches between the two best MAP predictions and the majority vote. Numbers in parentheses indicate the number of matches expected if MAP predictions perform no better than random.
Note: * p<0.05; ** p<0.01; *** p<0.001; **** p<0.0001.
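The chance baselines in the table follow from a single MAP guess matching a three-way tie's resolution with probability 1/3 (and the two highest-posterior choices covering it with probability 2/3); a one-sided binomial tail then yields significance levels in the neighbourhood of the bracketed values. A sketch:

```python
from math import comb

def binom_tail(k, n, p):
    """One-sided binomial tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 19 of 31 three-way ties resolved to the majority vote, vs. 31/3
# (about 10.3) expected by chance for a single MAP guess:
p_map = binom_tail(19, 31, 1 / 3)

# 29 of 31 covered by the two highest-posterior annotations, vs. 62/3
# (about 20.7) expected by chance:
p_top2 = binom_tail(29, 31, 2 / 3)
```

Both tails come out well below 0.01, matching the table's conclusion that MAP resolution of three-way ties beats the chance baseline.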