Andrzej Grzybowski, Piotr Brona, Tomasz Krzywicki, Magdalena Gaca-Wysocka, Arleta Berlińska, Anna Święch.
Abstract
Poland has never had a widespread diabetic retinopathy (DR) screening program and consequently has no purpose-trained graders and no established grader training scheme. Herein, we compare the performance and variability of three retinal specialists with no additional DR grading training in assessing images from 335 real-life screening encounters, and contrast their performance against IDx-DR, a US Food and Drug Administration (FDA)-approved DR screening suite. A total of 1501 fundus images from 670 eyes were assessed by each grader, with a final grade assigned on a per-eye level. Unanimous agreement between all graders was achieved for 385 eyes and 110 patients, of which 98% had a final grade of no DR. Thirty-six patients had final grades higher than mild DR, of which only two had no grader disagreements regarding severity. A total of 28 eyes underwent adjudication due to complete grader disagreement. Four patients had discordant grades ranging from no DR to severe DR between the human graders and IDx-DR. The retina specialists achieved kappa scores of 0.52, 0.78, and 0.61. Retina specialists had relatively high grader variability and only modest concordance with the IDx-DR results. Focused training and verification are recommended for any potential DR graders before they assess DR screening images.
Keywords: deep learning; diabetic retinopathy grading; diabetic retinopathy screening; grader comparison; inter-grader variability
Year: 2022 PMID: 35683522 PMCID: PMC9180965 DOI: 10.3390/jcm11113125
Source DB: PubMed Journal: J Clin Med ISSN: 2077-0383 Impact factor: 4.964
Figure 1. Patient selection flowchart.
Distribution of diabetic retinopathy (DR) grades per patient after final adjudication, based on human grading.
| DR Grade (4-Point Scale) | Number of Patients | Percentage of Patients |
|---|---|---|
| 0 | 245 | 73.13% |
| 1 | 54 | 16.12% |
| 2 | 29 | 8.66% |
| 3 | 7 | 2.09% |
Number of fundus-image assessments by proportion of grader agreement (5-point scale).
| Proportion | Number of Images | Percentage | Eyes | Percentage of All Eyes Studied |
|---|---|---|---|---|
| 1:1:1 (both eyes) | 72 | 5% | 28 | 4% |
| 1:1:1 (left eye) | 41 | 1% | 15 | 2% |
| 1:1:1 (right eye) | 31 | 1% | 13 | 2% |
| 2:1 (both eyes) | 588 | 39% | 256 | 38% |
| 2:1 (left eye) | 294 | 9% | 129 | 19% |
| 2:1 (right eye) | 294 | 8% | 127 | 19% |
| 3:0 (both eyes) | 841 | 56% | 385 | 58% |
| 3:0 (left eye) | 406 | 13% | 190 | 28% |
| 3:0 (right eye) | 435 | 13% | 195 | 29% |
Counts of images in each DR severity level with full grader agreement.
| DR Grade (5-Point Scale) | Number of Images | Percentage of Images | Number of Eyes | Percentage of All Eyes |
|---|---|---|---|---|
| 0 | 789 | 93.82 | 361 | 93.77 |
| 1 | 40 | 4.76 | 20 | 5.20 |
| 2 | 6 | 0.71 | 3 | 0.78 |
| 3 | 6 | 0.71 | 1 | 0.25 |
| 4 | 0 | 0.00 | 0 | 0.00 |
Count of patients in each DR stage with full grader agreement.
| DR Grade (4-Point Scale) | Number of Patients | Percentage |
|---|---|---|
| 0 | 108 | 96.42 |
| 1 | 2 | 1.79 |
| 2 | 0 | 0.00 |
| 3 | 2 | 1.79 |
Counts of images and eyes in each DR severity level with majority grader agreement (2 out of 3 graders agree on severity level).
| DR Grade (5-Point Scale) | Number of Images | Percentage of Images | Number of Eyes | Percentage of All Eyes |
|---|---|---|---|---|
| 0 | 331 | 49 | 149 | 58.20 |
| 1 | 142 | 21 | 62 | 24.22 |
| 2 | 94 | 14 | 37 | 14.45 |
| 3 | 15 | 2 | 5 | 1.95 |
| 4 | 6 | 1 | 3 | 1.17 |
Agreement proportions among the four patient-level assessments (three human graders and IDx-DR).
| Proportion | Number of Patients | Percentage |
|---|---|---|
| 1:1:1:1 | 4 | 1.09 |
| 2:1:1 | 80 | 21.80 |
| 2:1:1 (the assessment assigned by the IDx-DR is in majority graders’ assigned scores) | 7 | 1.91 |
| 2:2 | 40 | 10.90 |
| 3:1 | 99 | 26.97 |
| 3:1 (the assessment assigned by the IDx-DR is in majority human graders’ assigned scores) | 25 | 6.81 |
| 4:0 | 112 | 30.52 |
Breakdown of results for 4 patients with total disagreement between graders and IDx-DR. Letters A to D represent individual patients; per-grader grades are given for the right and left eye (OD/OS).
| Patient | Grader 1 (OD/OS) | Grader 2 (OD/OS) | Grader 3 (OD/OS) | Adjudicated Grades (OD/OS) | IDx-DR Patient-Level Grade |
|---|---|---|---|---|---|
| A | 0/0 | 1/1 | 2/2 | 2/2 | 3 |
| B | 0/0 | 1/1 | 2/2 | 1/1 | 3 |
| C | 2/2 | 0/0 | 1/0 | 1/1 | 3 |
| D | 0/0 | 1/1 | 2/2 | 1/1 | 3 |
Figure 2. Fundus images of patients A and B, showing multiple retinal photocoagulation scars.
Figure 3. Fundus images of patient C, showing peripapillary pigment changes, hard exudates in the peripheral macula, and more subtle exudates near the fovea in the right eye, and low-quality images with a large shadow in the left eye.
Figure 4. Fundus images of patient D, showing multiple small hemorrhages, microaneurysms, and multiple image artifacts.
Summary of studies reporting grader reliability statistics in grading for diabetic retinopathy.
| Study | Grading Level | Sample Size | Grading Details | Comparison | Kappa Scores |
|---|---|---|---|---|---|
|  | Per eye | 118 eyes | Four levels of grading based on ophthalmoscopy | Reading centre grading based on 7-field fundus photography | Unweighted kappa |
|  | Per image | 400 images | Single-field digital fundus images read by ophthalmologists, retinal specialists, and non-physician staff | Interobserver variability | Overall, 0.34 for retinopathy severity and 0.28 for referral cases; for retinal specialists, 0.58 for retinopathy severity and 0.63 for referrals |
|  | Per eye | 6902 and 3638 eyes | Eyes taken from two large studies—ACCORD and FIND, 5-level DR severity scale | Reading center | 0.42 and 0.65 for FIND and ACCORD, respectively |
|  | Per eye | 7402 eyes | Detection of specific features: group 1—retinal haemorrhages, microaneurysms, hard exudates, new vessels, fibrous proliferations, and macular oedema; group 2—soft exudates, intraretinal microvascular abnormalities, venous beading; 7-field fundus images | Interobserver variability | Weighted kappa; 0.61–0.80 for group 1 features, 0.41–0.60 for group 2 |
|  | Per image |  | Retinal specialists, ophthalmologists, and a deep-learning-based algorithm, initially graded at a 5-point scale, calculated for various DR severity cutoffs | Adjudicated consensus of retinal specialists | Quadratic-weighted kappa; retinal specialists—0.82–0.91; ophthalmologists—0.80–0.84; 0.84 for the algorithm |
|  | Per eye | 1589 images | Detection of specific features by individual graders, later computed into severity levels; comparison of different annotation protocols and methods | Pair-wise intergrader variability calculated for each grader pair for feature and severity detection | Quadratic-weighted kappa; 0.217–0.863 for detection of specific DR features, 0.430–0.914 for DR severity |
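The studies above report a mix of unweighted and quadratic-weighted kappa statistics. As a minimal sketch of how these agreement measures are computed for a pair of graders, the following uses scikit-learn's `cohen_kappa_score`; the grade vectors are hypothetical examples on a 0–4 DR severity scale, not data from this study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-eye DR grades (0-4 scale) from two graders
grader_a = [0, 0, 1, 2, 0, 1, 3, 0, 2, 0]
grader_b = [0, 1, 1, 2, 0, 0, 4, 0, 2, 0]

# Unweighted (Cohen's) kappa: every disagreement counts equally
unweighted = cohen_kappa_score(grader_a, grader_b)

# Quadratic-weighted kappa: the penalty grows with the squared
# distance between the two assigned severity levels, so near-misses
# (e.g., grade 0 vs. 1) are penalized far less than grade 0 vs. 4
quadratic = cohen_kappa_score(grader_a, grader_b, weights="quadratic")

print(f"unweighted kappa: {unweighted:.3f}")
print(f"quadratic-weighted kappa: {quadratic:.3f}")
```

Because all disagreements in this example differ by only one severity level, the quadratic-weighted kappa comes out higher than the unweighted one, which illustrates why the two statistics in the table above are not directly comparable.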