Citizen crowds and experts: observer variability in image-based plant phenotyping
M. Valerio Giuffrida, Feng Chen, Hanno Scharr, Sotirios A. Tsaftaris.
Abstract
BACKGROUND: Image-based plant phenotyping has become a powerful tool in unravelling genotype-environment interactions. The use of image analysis and machine learning has become paramount in extracting data from phenotyping experiments. Yet we still rely on the input of an observer (a human expert) to perform the phenotyping process. We assume such input to be a 'gold standard' and use it to evaluate software and algorithms and to train learning-based algorithms. However, we should consider whether any variability exists among experienced and non-experienced observers (including plain citizens). Here we design a study that measures such variability in an annotation task of an integer-quantifiable phenotype: the leaf count.
Keywords: Agreement; Citizen-science; Crowdsourcing; Image-based; Observer; Phenotyping; Variability
Year: 2018 PMID: 29449872 PMCID: PMC5806457 DOI: 10.1186/s13007-018-0278-7
Source DB: PubMed Journal: Plant Methods ISSN: 1746-4811 Impact factor: 4.993
Fig. 1 Annotation tool. Screenshots of the annotation tool and the web page seen by users. A Screenshot of the customized, yet simplified, version of the leaf annotation tool in [21]. B An excerpt of the Zooniverse site used here, showing annotations and the (single-choice) confidence question
Fig. 4 Average longitudinal counts. Average longitudinal count curves (solid) of the two cultivars [red: col-0; blue: pgm] with 1 standard deviation (shaded area), shown in A relying on a single experienced (left: A1) or non-experienced (right: A2) observer; B relying on all experienced (left: B1) or non-experienced (right: B2) observers; C relying on all observers together; and D relying on the consensus citizen
Fig. 2 Intra-observer variability. A Intra-observer variability of experienced (left: A1) or non-experienced (right: A2) observers in RPi data. B Influence of the tool on intra-observer measurements for experienced (left: B1) or non-experienced (right: B2) observers in RPi data
Table 1 Measurement of agreement between experienced and non-experienced observers
|   | DiC ↓ | \|DiC\| ↓ | MSE ↓ | Alpha ↑ |   |
|---|---|---|---|---|---|
| Experienced [the reference observer]ᵃ | 0.10 (0.54) | 0.29 (0.47) | 0.307 | 0.980 | 0.987 |
| Non-experienced | 0.13 (0.77) | 0.42 (0.65) | 0.600 | 0.960 | 0.981 |
|   |   |   |   |   |   |
| Experienced | 0.00 (0.64) | 0.33 (0.55) | 0.415 | 0.970 | 0.986 |
| Non-experienced | 0.23 (0.82) | 0.46 (0.71) | 0.730 | 0.950 | 0.977 |
|   |   |   |   |   |   |
| Experienced | 0.07 (0.65) | 0.37 (0.53) | 0.423 | 0.974 | 0.980 |
| Non-experienced | 0.49 (0.76) | 0.60 (0.67) | 0.815 | 0.962 | 0.962 |
|   |   |   |   |   |   |
| Experienced | 0.55 (0.74) | 0.63 (0.68) | 0.861 | 0.969 | 0.959 |
| Non-experienced | 0.23 (0.63) | 0.37 (0.56) | 0.450 | 0.977 | 0.976 |
|   |   |   |   |   |   |
| Experienced | 0.57 (0.87) | 0.68 (0.79) | 1.100 | 0.950 | 0.965 |
| Non-experienced | 0.40 (0.70) | 0.51 (0.62) | 0.650 | 0.973 | 0.977 |
|   |   |   |   |   |   |
| Experienced versus consensus (average) | 0.53 (0.77) | 0.62 (0.69) | 0.869 | 0.962 | 0.960 |
| Experienced versus consensus (max) | 0.08 (0.82) | 0.45 (0.69) | 0.684 | 0.957 | 0.971 |
| Consensus (average) versus single random | 0.00 (0.78) | 0.42 (0.65) | 0.607 | 0.960 | 0.970 |
For shorthand definitions see text. For DiC and |DiC| the average and standard deviation are reported; note that these also correspond to the bias and limits of agreement (when the standard deviation is multiplied by 1.96) of the Bland–Altman plots reported. ↓ means lower is better, whereas ↑ means the opposite. Row groups correspond to the intra- and inter-observer comparisons shown in Figs. 2, 3 and 5 (see text)
ᵃ This experienced observer is used as the reference observer for the remaining analysis throughout the paper
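For readers who want to reproduce such numbers on their own annotations, below is a minimal Python sketch (NumPy only) of the per-pair metrics: DiC, |DiC|, MSE, R² (read here as the squared Pearson correlation), and the Bland–Altman bias and limits of agreement mentioned in the footnote. The function and variable names are illustrative, not from the paper; Krippendorff's alpha is noted in a comment because it needs a third-party package.

```python
import numpy as np

def count_agreement(reference, observer):
    """Agreement metrics between two per-plant leaf count vectors.

    Returns DiC (mean and std of signed differences), |DiC| (mean and
    std of absolute differences), MSE, R^2 (squared Pearson correlation),
    and the Bland-Altman bias and 95% limits of agreement.
    """
    ref = np.asarray(reference, dtype=float)
    obs = np.asarray(observer, dtype=float)
    diff = obs - ref                      # signed count differences
    sd = diff.std(ddof=1)
    bias = diff.mean()                    # Bland-Altman bias = mean(DiC)
    return {
        "DiC": (bias, sd),
        "|DiC|": (np.abs(diff).mean(), np.abs(diff).std(ddof=1)),
        "MSE": float((diff ** 2).mean()),
        "R2": float(np.corrcoef(ref, obs)[0, 1] ** 2),
        "limits_of_agreement": (bias - 1.96 * sd, bias + 1.96 * sd),
    }

# Example with made-up counts for six plants:
print(count_agreement([5, 6, 7, 8, 9, 10], [5, 7, 7, 8, 10, 10]))

# Krippendorff's alpha (the 'Alpha' column) can be computed with the
# third-party `krippendorff` package, e.g.:
#   import krippendorff
#   krippendorff.alpha(reliability_data=[ref_counts, obs_counts],
#                      level_of_measurement="interval")
```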
Fig. 3 Inter-observer variability and influence of resolution. A Inter-observer variability among experienced (left: A1) or non-experienced (right: A2) observers in RPi data; B same as in A but in Canon data; C variability of experienced (left: C1) or non-experienced (right: C2) observers when comparing counts of the same observer in RPi and Canon data
Table 2 F and p values for the ANOVA tests corresponding to the plots in Fig. 4
|   | Sum sq. | F | p |
|---|---|---|---|
| A single ExP | 47.816 | 43.775 | 0.000167 |
| A single NExP | 47.170 | 30.017 | 0.000588 |
| All ExP | 56.264 | 34.661 | 0.000367 |
| All NExP | 49.533 | 29.116 | 0.000649 |
| All observers | 53.219 | 32.280 | 0.000464 |
| Consensus citizen (average) | 66.923 | 19.044 | 0.0024 |
| Consensus citizen (max) | 76.855 | 23.713 | 0.0012 |
Only the time × cultivar interaction is shown, corresponding to the factor of interest (the longitudinal trend). Results with 'All' and with the consensus citizen use the average (or max) across per-plant observations
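As a rough illustration of where such F and p values come from, the sketch below fits a leaf-count model with a time × cultivar interaction and reads off the interaction row of the ANOVA table using `statsmodels`. The long-format column names (`count`, `day`, `cultivar`) and the toy data are assumptions for the example, not the paper's variables.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Long-format observations: one leaf count per plant per imaging day
# (toy numbers; two plants per cultivar, two days each).
df = pd.DataFrame({
    "count":    [5, 7, 6, 8, 5, 6, 5, 6],
    "day":      [1, 4, 1, 4, 1, 4, 1, 4],
    "cultivar": ["col-0"] * 4 + ["pgm"] * 4,
})

# The day:cultivar interaction captures whether the longitudinal
# growth trend differs between the two cultivars.
model = ols("count ~ day * C(cultivar)", data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)
print(anova.loc["day:C(cultivar)", ["sum_sq", "F", "PR(>F)"]])
```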
Table 3 A simulated citizen-powered experiment. p values corresponding to an ANOVA test when randomizing the number of observations available for each plant at a specific time point
| Observer pool | Observations per plant | Min | Max | Mean | Std | Kurtosis |
|---|---|---|---|---|---|---|
| Any | 1 | 0.00003 | 0.00819 | 0.00124 | 0.00113 | 10.34 |
| Any | 2 | 0.00002 | 0.00729 | 0.00120 | 0.00112 | 8.98 |
| Any | 3 | 0.00010 | 0.00235 | 0.00061 | 0.00032 | 6.49 |
| ExP only | 1 | 0.00000 | 0.00726 | 0.00102 | 0.00103 | 9.58 |
| ExP only | 2 | 0.00004 | 0.00306 | 0.00057 | 0.00040 | 9.29 |
| ExP only | 3 | 0.00008 | 0.00150 | 0.00047 | 0.00021 | 5.35 |
| NExP only | 1 | 0.00008 | 0.00378 | 0.00100 | 0.00065 | 5.71 |
| NExP only | 2 | 0.00023 | 0.00174 | 0.00078 | 0.00028 | 3.49 |
| NExP only | 3 | 0.00033 | 0.00124 | 0.00069 | 0.00015 | 3.19 |
The process is repeated, sampling either from any of the observers (i.e. the sample may contain a mix of experienced and non-experienced observers) or only from experienced (ExP) or non-experienced (NExP) ones
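A minimal sketch of one way to run such a simulation, assuming the available annotations per plant and time point are stored in a NumPy array and that a `run_anova` callable (for instance built on the `statsmodels` sketch above) returns the interaction p value. The names, and the choice to average the sampled annotations, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(seed=42)

def simulate_p_values(counts, run_anova, k, n_rep=1000):
    """counts: (n_plants, n_timepoints, n_annotations) citizen counts.

    For each repetition, keep k randomly chosen annotations per plant
    and time point, average them, and rerun the ANOVA; then summarize
    the distribution of the resulting p values.
    """
    n_plants, n_time, n_ann = counts.shape
    p_values = np.empty(n_rep)
    for r in range(n_rep):
        sampled = np.empty((n_plants, n_time))
        for i in range(n_plants):
            for t in range(n_time):
                idx = rng.choice(n_ann, size=k, replace=False)
                sampled[i, t] = counts[i, t, idx].mean()
        p_values[r] = run_anova(sampled)
    return {
        "min": p_values.min(),
        "max": p_values.max(),
        "mean": p_values.mean(),
        "std": p_values.std(ddof=1),
        # Pearson (non-excess) kurtosis; the paper's convention is not stated.
        "kurtosis": kurtosis(p_values, fisher=False),
    }
```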
Fig. 5 Citizen distribution and variability. A Number of images annotated per user (citizen); B relationship between leaf count variation and average user confidence per plant; C variability between the consensus citizen and the reference observer; D variability between the consensus citizen and a random selection of counts (from the 3 available per plant)
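The 'consensus citizen' compared in panels C and D can be formed by collapsing the (here, three) citizen counts available for each plant with either the average or the max, matching the consensus (average) and consensus (max) rows of Table 1. A small sketch, with illustrative names:

```python
import numpy as np

def consensus_counts(citizen_counts, mode="average"):
    """Collapse per-plant citizen annotations (n_plants, n_annotations)
    into one 'consensus citizen' count per plant."""
    counts = np.asarray(citizen_counts, dtype=float)
    if mode == "average":
        return counts.mean(axis=1)   # consensus (average)
    if mode == "max":
        return counts.max(axis=1)    # consensus (max)
    raise ValueError("mode must be 'average' or 'max'")

# Three citizen counts for each of four plants (made-up numbers):
ann = [[5, 6, 5], [7, 7, 8], [9, 8, 9], [4, 5, 5]]
print(consensus_counts(ann, "average"))  # ~[5.33, 7.33, 8.67, 4.67]
print(consensus_counts(ann, "max"))      # [6., 8., 9., 5.]
```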
Table 4 Algorithmic leaf counting results obtained using the method in [15]
|   | Algorithm versus annotator (training error) | Algorithm versus annotator (testing error) | Annotator versus reference (inter-observer error) |
|---|---|---|---|
| DiC ↓ | 0.00 (1.07) | −0.04 (1.31) | 0.21 (0.75) |
| \|DiC\| ↓ | 0.61 (0.88) | 0.88 (0.96) | 0.46 (0.62) |
| MSE ↓ | 1.163 | 1.700 | 0.600 |
| R² ↑ | 0.933 | 0.895 | 0.964 |
Four metrics are reported. We first compare the algorithm against the 728 images in the training set (i.e. how well the algorithm learns). We then compare how well the algorithm predicts counts on a testing set of 130 images (also used in this study), comparing the algorithm with the counts of the annotator (who was also involved in deriving annotations for the training set). Lastly, we compare the annotator (whose data we used to train the algorithm and who was not involved in this study) with the reference observer used throughout this study