
The assessment of fundus image quality labeling reliability among graders with different backgrounds.

Kornélia Lenke Laurik-Feuerstein1, Rishav Sapahia2, Delia Cabrera DeBuc2, Gábor Márk Somfai3,4,5.   

Abstract

PURPOSE: For the training of machine learning (ML) algorithms, correctly labeled ground truth data are indispensable. In this pilot study, we assessed the performance of graders with different backgrounds in the labeling of retinal fundus image quality.
METHODS: Color fundus photographs were labeled with a Python-based tool using four image categories: excellent (E), good (G), adequate (A) and insufficient for grading (I). We enrolled 8 subjects (4 with and 4 without a medical background, groups M and NM, respectively), to whom a tutorial on the image quality requirements was presented. We randomly selected 200 images from a pool of 18,145 expert-labeled images (50/E, 50/G, 50/A, 50/I). The grading was timed and the agreement was assessed. An additional grading round was performed with 14 labels for a more objective analysis.
RESULTS: The median time (interquartile range) for the labeling task with 4 categories was 987.8 sec (418.6) for all graders, and 872.9 sec (621.0) vs. 1019.8 sec (479.5) in the M and NM groups, respectively. Cohen's weighted kappa showed moderate agreement (0.564) when using four categories, which increased to substantial (0.637) when only three categories were used by merging E and G. With the 14 labels, the weighted kappa values were 0.594 and 0.667 when compiling four or three categories, respectively.
CONCLUSION: Image grading with a Python-based tool seems to be a simple yet possibly efficient solution for labeling fundus images according to image quality and does not necessarily require a medical background. Such grading can be subject to variability but could still effectively serve the robust identification of images of insufficient quality. This emphasizes the opportunity for the democratization of ML applications among persons with both medical and non-medical backgrounds. However, simplicity of the grading system is key to successful categorization.

Year:  2022        PMID: 35881576      PMCID: PMC9321443          DOI: 10.1371/journal.pone.0271156

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752


Introduction

Medical imaging plays an ever-increasing role in modern medicine, driving the need for human expertise in image interpretation and data evaluation [1-3]. In many clinical specialties there is a relative shortage of such expertise to provide timely diagnosis and referral. Machine learning (ML) has emerged as an important tool for healthcare due to the rapid evolution of artificial intelligence (AI) in various medical fields, such as ophthalmology and radiology. High-quality data are pivotal when training new AI/ML models, using datasets that are mindfully designed and annotated. Moreover, such datasets should report on the sociodemographic features of the patients, the inclusion and exclusion criteria, the labeling process and the labelers themselves [4, 5].

Globally, sight loss costs $3B annually, its incidence is projected to triple by 2050, and 80% of it is preventable [6]. That is, severe vision loss is needlessly experienced by too many patients; regarding diabetic retinopathy (DR) alone, appropriate treatment can reduce the risk of blindness or moderate vision loss by more than 90% [7]. The limited availability of retina specialists and trained human graders compounds the problem worldwide. Consequently, given current population growth trends, it is inevitable that automated applications with limited human interaction will expand over time [3]. Machine learning has already provided multiple clinically relevant applications in ophthalmology, including image segmentation, automated diagnosis, disease prediction and disease prognosis [8-10]. Ophthalmology is particularly suitable for ML because of the crucial role of imaging, where fundus photographs, optical coherence tomography, anterior segment photographs and corneal topography can be applied to conditions such as diabetic retinopathy, age-related macular degeneration, glaucoma, papilledema and cataracts [3, 8-10].

To provide sufficient data for the training and validation of ML algorithms, correctly labeled ground truth data are indispensable. In an effort to model the democratization of retinal image labeling, in this pilot study we assessed the performance of graders with different backgrounds in labeling retinal fundus image quality, using a custom grading tool built in Python. We also assessed the ability to identify images of poor quality and collected feedback from the graders to assess their impressions.

Materials and methods

Color fundus photographs were chosen from a total of 18,145 color fundus images intended for the future training of an AI-based algorithm for image quality grading. Three datasets served as sources: the dataset by Tao et al. [11], the EyePACS dataset [12] and a dataset of 984 color fundus images obtained from the Bascom Palmer Eye Institute. The images were taken with a variety of fundus cameras. The study was performed in adherence to the guidelines of the Declaration of Helsinki; ethics approval was obtained from the local ethics committee of the University of Miami. The images were collected with the participants' consent and were de-identified as required by local regulatory requirements (e.g., the Health Insurance Portability and Accountability Act).

An image classification system was developed based on the previous work of Zapata et al., Fleming et al. and Gulshan et al., and on the EyePACS image quality grading system [12-15]. The image grading criteria and their definitions are detailed in Table 1. In short, four criteria were used to define image quality: focus, image field definition, brightness and artefacts. Based on the fulfillment of these criteria, four groups were created: excellent (E, all criteria sufficiently fulfilled), good (G, at most 2 criteria not fulfilled), adequate (A, 3-4 criteria not fulfilled but the retina is recognizable) and insufficient for grading (I, no retinal tissue visible on more than 50% of the image, or no third-generation branches of the retinal vasculature identifiable within one disc diameter of the fovea and optic nerve head) (S1 File).

We randomly selected 200 images from the pool of 18,145 expert-labeled images, with 50 images from each group. The images had previously been labeled for image quality by two experts according to the standards described above and served as the reference for the analysis. These experts were a board-certified ophthalmologist (KLLF) with experience in retinal imaging and a board-certified senior retina specialist (GMS) with retinal image grading experience in the telemedical screening of diabetic retinopathy. In case of disagreement, GMS's decision served as the reference [16]. A representative image for each image quality group, along with its labels, is shown in Fig 1.
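As an illustration of the stratified selection described above, the following minimal Python sketch draws 50 images per quality category from an expert-labeled list. The file names and column names (expert_labels.csv, filename, label) are assumptions for illustration and are not part of the study's materials.

```python
import pandas as pd

# Assumed input: one row per expert-labeled image with columns
# "filename" and "label" (E, G, A or I). File and column names are
# illustrative, not the study's actual file layout.
labels = pd.read_csv("expert_labels.csv")

# Draw 50 images per quality category (200 in total); a fixed seed
# makes the random selection reproducible.
subset = (
    labels.groupby("label", group_keys=False)
          .apply(lambda g: g.sample(n=50, random_state=42))
)

subset.to_csv("grading_subset.csv", index=False)
print(subset["label"].value_counts())
```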
Table 1

Image quality grading criteria.

Image quality was determined based on four relevant categories representing the standard for good-quality color fundus images. All of the following categories must be fulfilled in order to grade retinopathy lesions in full.

Grading categories and their definitions:
Focus: Details are present up to a level that allows grading of the smallest retinal alterations, e.g. microaneurysms or intraretinal microvascular abnormalities. The small retinal vessels within one disc diameter around the fovea are depicted sharply.
Illumination: The amount of source light incident on the retina is correct for the visualization of the smallest retinopathy lesions. There are no washed-out or dark areas that interfere with detailed grading.
Image field definition: The primary image field includes the entire optic nerve head (ONH) and macula. There is at least one optic disc diameter of retina nasally from the ONH and temporally from the macula.
Artefacts: Artefacts of image acquisition, such as dust spots, arc defects, fingerprints, camera reflexes or eyelash images, regardless of whether they hinder image grading.
Fig 1

Representative color fundus images from the Excellent (A), Good (B), Adequate (C) and Insufficient for grading (D) categories. Image 1B fulfills the criteria for "Good" because of its image field definition (decentered image) and peripheral artefacts; 1C qualifies as "Adequate" due to its poor illumination, poor focus and insufficient image field definition (the field does not contain enough of the retina temporal to the fovea). Image 1D was labeled "Insufficient" as it neither depicts the optic nerve head nor allows visualization of the third-generation vessel branches around the macula, which in turn would prevent the detection of retinal changes characteristic of diabetic retinopathy.


A Tkinter-based graphical user interface (GUI) tool was developed in Python (https://docs.python.org/3/library/tkinter.html, last accessed 11 July 2021), taking inspiration from the multiple open-source tools available. The main intent was to develop a simple, lightweight, on-premises image annotation tool with a smooth learning curve for medical professionals. A screenshot of the GUI is shown in Fig 2. The tool also timed the grading of each image, which was shown at a resolution of 900x1000 pixels without the option of zooming in.
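A much-reduced sketch of such a Tkinter labeling loop is shown below: it displays images from a folder, records a label and the decision time per image, and writes a CSV on completion. The widget layout, label set, file locations and the use of Pillow for image loading are simplifying assumptions and do not reproduce the authors' tool.

```python
import csv
import time
import tkinter as tk
from pathlib import Path

from PIL import Image, ImageTk  # Pillow is assumed for image loading

LABELS = ["Excellent", "Good", "Adequate", "Insufficient"]
IMAGE_DIR = Path("images")      # assumed folder of fundus photographs
OUTPUT_CSV = Path("labels.csv")

class LabelTool:
    def __init__(self, root):
        self.root = root
        self.files = sorted(IMAGE_DIR.glob("*.jpg"))
        assert self.files, "no images found in the input folder"
        self.index = 0
        self.rows = []
        self.view = tk.Label(root)
        self.view.pack()
        for name in LABELS:
            tk.Button(root, text=name,
                      command=lambda n=name: self.record(n)).pack(side="left")
        self.show()

    def show(self):
        # Display the current image at a fixed size and restart the timer.
        img = Image.open(self.files[self.index]).resize((900, 1000))
        self.photo = ImageTk.PhotoImage(img)  # keep a reference alive
        self.view.configure(image=self.photo)
        self.start = time.monotonic()

    def record(self, label):
        # Store the label and the decision time, then advance or finish.
        elapsed = round(time.monotonic() - self.start, 2)
        self.rows.append((self.files[self.index].name, label, elapsed))
        self.index += 1
        if self.index < len(self.files):
            self.show()
        else:
            with OUTPUT_CSV.open("w", newline="") as f:
                csv.writer(f).writerows([("image", "label", "seconds"), *self.rows])
            self.root.destroy()

if __name__ == "__main__":
    root = tk.Tk()
    root.title("Fundus image quality labeling")
    LabelTool(root)
    root.mainloop()
```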
Fig 2

Screenshots of the image labeling tool developed in Python demonstrating its function.

A) First, the folder containing the images is selected (1). In the next step, the output of the image labeling can be chosen (2) and finally the labels are specified (3). B) After setting the above parameters there are two options: either navigate through the images with the "Prev" and "Next" buttons, or run a timed labeling round by selecting the option to "automatically show next image when labeled" and then pressing the "Start" button. Upon closing the tool, a .csv and optionally an .xls file are generated with the results.

For the purposes of this study, the images were evaluated by 8 volunteers: 4 with a medical background (3 ophthalmologists and 1 optometrist, aged 26-45 years; group M) and 4 without a medical background (2 computer scientists, 1 lawyer and 1 teacher, aged 26-60 years; group NM). All volunteers had at least intermediate computer skills and worked with a PC on a daily basis. Prior to grading, the participants received a tutorial consisting of a detailed oral explanation of the task, supported by a PDF document describing retinal anatomy, the grading system with sample images (including examples of different artefacts) and the GUI (S1 File). All graders were encouraged to keep the tutorial open or printed out in order to facilitate the grading task (S1 File). The images were presented in the same, randomly selected order to all volunteers. In order to assess image quality in a more objective way, the graders were also asked to perform a second round of grading with the tool using 14 labels (Table 2). Using these labels, the same four groups were generated following the assessment (S1 File). After completion of the grading task, the participants were asked to give written feedback on their experiences.
Table 2

Grading labels used for the objective grading round.

In order to decrease the inherent subjectivity of our study, our participants were asked to perform a second round of grading using predefined labels assigned to each category. In this round, the four categories were then compiled similarly to the first round of grading.

Grading categories and the labels used in the objective grading round:
Focus: Optimal; Unsharp
Illumination: Optimal; Too dark; Too light
Image field definition: Optimal; Missing Macula; Missing Optic disc
Artefacts: No artefacts; Small pupil; Dust spots; Lash artefacts; Camera artefacts; Arc defects
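As a sketch of how the 14 predefined labels might be compiled into the four quality categories (the exact rule set used in the study is given in S1 File), the function below counts failed criteria using the first-round thresholds; the criterion keys and the explicit "insufficient" flag are illustrative assumptions.

```python
# Labels counted as fulfilling their criterion; any other label from
# Table 2 counts as a failed criterion. This is a simplification.
OPTIMAL_LABELS = {"Optimal", "No artefacts"}

def compile_category(annotation: dict) -> str:
    """Compile one second-round annotation {criterion: label} into E/G/A/I
    using the first-round thresholds (0 failed criteria = E, 1-2 = G,
    3-4 = A). How an image is demoted to Insufficient is study-specific
    and is represented here only by an explicit flag."""
    if annotation.get("insufficient", False):
        return "I"
    failed = sum(
        annotation[c] not in OPTIMAL_LABELS
        for c in ("focus", "illumination", "image_field", "artefacts")
    )
    if failed == 0:
        return "E"
    if failed <= 2:
        return "G"
    return "A"

# Example: unsharp focus, missing macula and dust spots -> 3 failed criteria -> "A"
print(compile_category({
    "focus": "Unsharp", "illumination": "Optimal",
    "image_field": "Missing Macula", "artefacts": "Dust spots",
}))
```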

The time needed for the grading was recorded for each volunteer. To compare the times needed for labeling according to the four categories, we used medians and interquartile ranges (IQR) due to the low number of graders in both groups. For the analysis of inter-rater agreement, Cohen's weighted kappa was calculated in the first grading round for the four categories (E/G/A/I) and in the second, objective round of grading using 14 labels, in which the four categories were compiled according to the categories of the previous grading system (E/G/A/I). Cohen's weighted kappa was also determined with the excellent and good categories merged [(E+G)/A/I] for both grading rounds. To assess the agreement in identifying poor-quality images in both rounds, Cohen's weighted kappa was calculated after merging E with G and A with I. The kappa values are presented as medians (IQR) for all graders and for both groups (M, NM) in Table 3.
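The agreement statistics described above could be computed roughly as in the following sketch, which uses scikit-learn's cohen_kappa_score for each grader against the expert reference and then summarizes the per-grader kappas as median (IQR). The CSV layout, file names and the linear weighting scheme are assumptions for illustration; the paper does not specify these details here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical inputs: "reference.csv" holds the expert labels and
# "grader_<i>.csv" each volunteer's labels, both with columns "image"
# and "label" (E/G/A/I). File names, columns and the linear weighting
# are assumptions for illustration.
reference = pd.read_csv("reference.csv").set_index("image")["label"]

SETTINGS = [
    ("4 groups", None, ["E", "G", "A", "I"]),
    ("3 groups", {"E": "E+G", "G": "E+G", "A": "A", "I": "I"}, ["E+G", "A", "I"]),
    ("2 groups", {"E": "good", "G": "good", "A": "poor", "I": "poor"}, ["good", "poor"]),
]

def grader_kappa(path, merge, order):
    grades = pd.read_csv(path).set_index("image")["label"]
    y_ref, y_grd = reference.loc[grades.index], grades
    if merge is not None:
        y_ref, y_grd = y_ref.map(merge), y_grd.map(merge)
    # Pass the label order explicitly so the linear weights respect the
    # ordinal E > G > A > I quality scale.
    return cohen_kappa_score(y_ref, y_grd, labels=order, weights="linear")

grader_files = [f"grader_{i}.csv" for i in range(1, 9)]
for name, merge, order in SETTINGS:
    kappas = [grader_kappa(p, merge, order) for p in grader_files]
    q1, med, q3 = np.percentile(kappas, [25, 50, 75])
    print(f"{name}: median weighted kappa {med:.3f} (IQR {q3 - q1:.3f})")
```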
Table 3

Inter-rater agreement in the two setups with different image quality category groups.

Cohen's weighted kappa was calculated for the grading using 4 image quality categories (Excellent (E)/Good (G)/Adequate (A)/Insufficient (I)) and for the second round of grading using 14 labels. In the latter, the same 4 categories were compiled as in the first round of grading (E/G/A/I). Cohen's weighted kappa was also determined with E and G merged for both grading rounds [(E+G)/A/I]. To assess the agreement in distinguishing poor-quality images in both rounds, Cohen's weighted kappa was calculated with two merged groups (E and G vs. A and I). The kappa values are presented as medians (interquartile range) for all graders and for both groups (medical, non-medical).

First round (4 image quality grading criteria):
Medical: 4 groups 0.590 (0.167); 3 groups 0.657 (0.116); 2 groups 0.715 (0.190)
Non-medical: 4 groups 0.554 (0.176); 3 groups 0.627 (0.147); 2 groups 0.625 (0.175)
Altogether: 4 groups 0.564 (0.163); 3 groups 0.637 (0.096); 2 groups 0.665 (0.178)

Second round (14 predefined labels):
Medical: 4 groups 0.598 (0.053); 3 groups 0.669 (0.052); 2 groups 0.708 (0.126)
Non-medical: 4 groups 0.568 (0.085); 3 groups 0.612 (0.107); 2 groups 0.581 (0.127)
Altogether: 4 groups 0.594 (0.60); 3 groups 0.667 (0.80); 2 groups 0.670 (0.151)

The statistical analysis was carried out with SPSS 28 (SPSS Inc., Chicago, IL).

Results

The median time (IQR) for the labeling task was 987.8 sec (418.6) for all graders, and 872.9 sec (621.0) vs. 1019.8 sec (479.5) in the M and NM groups, respectively. The median time needed for the first 50 images (262.8 sec [125.7]) was somewhat longer than that for the last 50 images (195.8 sec [108.3]). Graders with medical training finished the last 50 images a median of 28.6 seconds faster than the first 50 images, and graders without former medical training a median of 38.25 seconds faster. The median time (IQR) needed for the decision on a single image was 3.35 sec (3.4), 3.8 sec (4.0), 4.0 sec (4.8) and 1.8 sec (2.0) for the E, G, A and I labels, respectively. The median time per decision was longer in the NM group in all categories (3.8 vs. 2.4 sec, 4.3 vs. 3.2 sec, 4.6 vs. 3.1 sec and 2.0 vs. 1.6 sec for E, G, A and I in the NM vs. M groups, respectively).

Cohen's weighted kappa showed moderate agreement among our graders when using four categories (0.564 for all graders; 0.590 vs. 0.554 for groups M and NM, respectively). This increased to substantial agreement when merging the E and G groups (0.637 for all graders; 0.657 vs. 0.627 for groups M and NM, respectively). Cohen's weighted kappa increased further when merging E with G and A with I (0.665 for all graders; 0.715 vs. 0.625 for groups M and NM, respectively) (Table 3). In the second round of grading using 14 labels, the four categories were compiled according to the categories of the previous grading system (E/G/A/I). Here, agreement by Cohen's weighted kappa was moderate when using four categories (0.594 for all graders; 0.598 vs. 0.568 for groups M and NM, respectively) and substantial when merging the E and G groups (0.667 for all graders; 0.669 vs. 0.612 for groups M and NM, respectively), whereas additionally merging the A and I groups increased agreement only minimally overall and among medical graders (0.670 for all graders; 0.708 vs. 0.581 for groups M and NM, respectively) (Table 3).

Assessment of the post-task reports showed that all graders found the Python image labeling tool easy to use and the grading task uncomplicated. However, 2 of our 8 graders, both without prior programming experience with Python, needed additional support to launch the grading tool for the first time: they had difficulties installing the package manager used for our application and using the command line to open it (as opposed to the more common practice of launching applications by clicking an icon). Therefore, all participants were provided with a detailed "program launching supplement" and an oral explanation in the second round of grading, which eliminated these difficulties. Five graders found the size and resolution of the images somewhat small for grading retinal lesions. Our participants found the grading task and the written (S1 File) and oral tutorials comprehensible. None of the graders reported decision fatigue during the grading task. Five graders reported a noticeable improvement in grading even within 200 images and found the experience motivating. Our graders with a medical background (three ophthalmologists and one optometrist) found a relatively high percentage of fundus images with an at least partially missing optic nerve head. In contrast, graders questioned the relevance of small artefacts that had no significant influence on the grading of potential retinopathy lesions. As for our image quality classification system, two graders (one from the M and one from the NM group) wished for more clarity in the definitions of focus and illumination concerning the visibility of the smaller retinal vasculature. Altogether, the feedback of our participants was positive regarding both the labeling task and the tool itself.
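For context, per-category decision times and per-grader totals like those reported above could be derived from the tool's timing output roughly as follows; the long-format CSV and its column names (grader, group, reference, seconds) are assumptions for illustration, not the study's actual export.

```python
import pandas as pd

# Assumed long-format export of the grading tool: one row per grader and
# image, with columns "grader", "group" (M/NM), "reference" (expert label
# E/G/A/I) and "seconds" (decision time). Column names are illustrative.
times = pd.read_csv("grading_times.csv")

# Median decision time per expert-assigned category, overall and by group.
print(times.groupby("reference")["seconds"].median())
print(times.groupby(["group", "reference"])["seconds"].median().unstack())

# Total labeling time per grader, summarized as median (IQR) per group.
totals = times.groupby(["group", "grader"])["seconds"].sum()
print(totals.groupby(level="group").agg(
    median="median",
    iqr=lambda s: s.quantile(0.75) - s.quantile(0.25),
))
```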

Discussion

Our paper presents the results of a pilot study implemented with a Python-based image quality grading tool. In this study, we used randomly chosen fundus images from our study dataset that had already been labeled by two experts for image quality. We trained graders with and without an ophthalmological background in a standardized fashion on the grading criteria and on the use of our image quality grading tool. In contrast to the current literature in the field, we not only evaluated inter-rater agreement but also analyzed the time spent on single decisions and on the complete grading task, distinguishing individual image quality features, with participants who lacked a medical background and underwent only minimal training.

The grading task of 200 images took approximately 17 minutes; graders with an ophthalmological background completed the task faster than those without. The time necessary for the first 50 images was somewhat longer than for the last 50 images. Besides a learning effect, we attribute this to better handling of the application, which some of the participants also mentioned in their feedback, while the variable proportion of images from the different categories (requiring different decision times) could also have contributed. Decision fatigue or apathy, leading to faster responses with less consideration, might also be a factor; however, this is rather unlikely given the short time (less than 30 minutes in the first and around 45 minutes in the second round) needed for the entire grading. The participants required the least time for the grading of insufficient images, almost half of that needed for the other groups. In each of the four groups (E/G/A/I), the grading took somewhat longer for the participants without a medical background.

Our results show that a high level of agreement [17] can be achieved after a very short training period and with a smaller set of retinal photographs, even in the case of medically untrained people. The use of only three groups instead of four substantially increased the agreement of the grading, pointing towards the importance of simple grading categories in such a setting: it may be difficult to discern fine features of image quality that could, in turn, be of less importance when implementing robust deep learning techniques. In general, the participants were satisfied with the Python-based tool and expressed concerns that we plan to address in future developments of our grading software.

Numerous semi-automated, computerized analysis techniques have been developed to reduce the costs and the workload of expert image graders [18], and deep learning algorithms have recently been and are currently being adopted for the detection of different retinal diseases [19]. The Bhaktapur Retina Study showed moderate to almost perfect agreement between mid-level ophthalmic personnel (n = 2) and an ophthalmologist in grading color fundus photographs for hemorrhages and maculopathy [20]. Similarly, a study of the Centre for Eye Research Australia showed that non-expert graders (2 months of intensive training, n = 2) are able to grade retinal photographs with at least moderate DR with high accuracy [21]. McKenna et al. found good accuracy among non-medical graders (1 month of intensive training for grading purposes, n = 4) and relatively poor performance by rural ophthalmologists in a DR grading study in China [22].
A novel way of generating ground truth data is crowdsourcing, where volunteers perform the analysis online, either for free or for a fee. Brady et al. have published several reports on using Amazon Mechanical Turk as an effective and inexpensive means to rapidly identify ocular diseases such as DR or follicular trachoma on photographs [23, 24]. In agreement with the results of Brady et al. [23], graders with minimal training can contribute to cost-effective and rapid grading of retinal photographs. In our study, it took the non-medical graders a median of 17 minutes to grade the 200 images. We can thus estimate that the labeling of a dataset containing 20,000 retinal images by 20 graders without previous education in the field of ophthalmology would take approximately 90 minutes (1,000 images per grader at roughly 5 seconds per image). However, one must not underestimate grading fatigue as a possible influencing factor, as mentioned above. We believe, however, that this was not the case in our current study due to the relatively short processing time (less than 30 minutes for the original task) and the total of 200 images, which would not reach the typical level of exhaustion associated with grader decision fatigue, as shown by Waite et al. [25]. Even better results can be achieved with trained personnel; in this case, however, repeatability and accuracy (specificity and sensitivity) should also be assessed, along with images from different datasets.

It is important to underline the potential weaknesses of our study. We included only 8 volunteers for the labeling task; a higher number would be needed to simulate a real-world crowdsourcing setting. However, we did not intend to move towards crowdsourcing; rather, we aimed to analyze the usability of our labeling tool and grading system. Also, the size of the cohort is comparable to, or even larger than, that reported earlier [23]. The dataset of 200 fundus images may be considered small; however, other similar studies used datasets of the same order of magnitude [21]. Another weakness could be the assessment of the optic nerve head and the fovea, which may play a decisive role in various pathologies. However, our primary goal with the development of our grading tool was the assessment of general image quality, and thus the optic nerve head and fovea were graded within the image field definition, as an image quality factor equally important as focus, illumination and artefacts. Another important motivation of our study was to investigate whether there is a considerable disadvantage in outsourcing the grading task to people without a medical background. In concordance with this purpose, we considered the presence of any artefact sufficient to meet the artefact criterion, regardless of its clinical relevance or effect on gradability. A particular strength of our study is that we used a pool of fundus images derived from different fundus cameras, representing a setting close to real life, where the labeling of image quality should ideally be independent of the device used for imaging.

In the future, we are planning to expand our work and, based on the post-grading comments of our participants, we will adjust our Python tool to optimize image resolution. The image grading protocols and checklists will also need to be reevaluated, with special attention to the role of a missing optic nerve head, the visibility of the small macular blood vessels and the handling of small artefacts irrelevant for grading.
To minimize susceptibility to handling difficulties and to standardize the grading process, we are currently developing a web-based annotation tool that will allow additional zooming and illumination options. This will enable us to apply the lessons of our study on a broader basis, in a simpler setting. Our further purpose is to use our data for the training of a convolutional neural network for image quality assessment and for the grading of DR and other retinal diseases.

With the democratization of AI, all its advantages and disadvantages will become available to a broad audience. The role of crowdsourcing is increasing in numerous projects that require harnessing human intelligence. This role, however, raises a new question in the context of healthcare data: how safe is it to leave medically relevant decisions to "untrained" personnel? The significance of our work lies in the possible guidance on how to start producing meaningful ground truth data, even with partially untrained participants, for training and testing the new generations of deep learning algorithms. Our results show that it is sufficient to include fewer grading categories, that is, to work with a less detailed but more robust grading system, without the need to label single image quality criteria. Second, such a relatively simple tool can help to address the issue of poor-quality images, as all datasets, including public repositories (such as EyePACS and MRL Eye) and the numerous smaller, institutional datasets, contain a certain number of poor-quality images that are not gradable. Our approach could be an important asset not only in the grading of image quality but could also serve as a first screening step to filter out images of poor quality. According to our results, this step (quality grading and filtering of ungradable images) could be realized without expert-level knowledge or lengthy training by simply employing "citizen scientists". We hope that our findings and our application will help the training and development of robust artificial neural networks for the labeling of fundus image quality.

Detailed grading tutorial used for the training of our graders (PDF; S1 File).

Detailed annotation and time data used for the evaluation presented (XLSX).

26 Dec 2021
PONE-D-21-35940
The assessment of fundus image quality labeling reliability among graders with different backgrounds
PLOS ONE

Dear Dr. Somfai,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Feb 09 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Andrzej Grzybowski Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information. If you are reporting a retrospective study of medical records or archived samples, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information. 3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. 4. Please note that in order to use the direct billing option the corresponding author must be affiliated with the chosen institute. 
Please either amend your manuscript to change the affiliation or corresponding author, or email us at plosone@plos.org with a request to remove this option. 5. We note that Figures 1A, 1B, 1C 1D, 2A and 2B in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission: a. You may seek permission from the original copyright holder of Figures 1A, 1B, 1C, 1D, 2A and 2B to publish the content specifically under the CC BY 4.0 license. We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text: “I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.” Please upload the completed Content Permission Form or other proof of granted permissions as an ""Other"" file with your submission. In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].” b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: I Don't Know Reviewer #2: No ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? 
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Overview Thank you for giving me the opportunity to review the article. I hope the following suggestions will be helpful and will strengthen the work. The study evaluates the image quality assessment of color fundus photographs among graders with medical and non-medical backgrounds. The topic is interesting and relevant considering the scarcity of labeled data for the development of machine learning models and the importance of imaging quality for the accuracy and applicability of the algorithms. The manuscript is generally well presented, but there are several issues which needs to be addressed: Major comments Page 4, line 51-60. Quotation marks should be used for direct quotes. Consider rewriting the first paragraph of the Introduction as the following sentences were written using the same words as the references: - Page 4, line 51. “Medical imaging is expanding globally at an unprecedented rate leading to an ever-expanding quantity of data that requires human expertise and judgement to interpret and triage”. De Fauw et al (2018). - Page 4, line 55. “Benchmark datasets are essential for computational research in healthcare. These datasets should be created by intentional design that is mindful of social and health system priorities. If a deliberate and systematic approach is not followed, not only will the considerable benefits of clinical algorithms fail to be realized, but the potential harms may be regressively incurred across existing gradients of social inequity.” (T. Panch et al. 2020) The study was conducted “in an effort to model the democratization of retinal image labeling” (page 5, line 78). To do that, the authors use a phyton grading tool that requires some degree of programming knowledge (page 8, line 162), which could make it difficult for people without specific computing skills to participate in the image labeling process. 
The imaging grading criteria used in the study take into consideration anatomical structures (e.g., optic nerve head, fovea and macula), and imaging artefacts (e.g., eyelashes and fingerprints) whose identification require a clinical background or specific training and a certain degree of experience. According to the authors, “prior to grading the participants received a tutorial consisting of oral explanation of the task, supported by a pdf document describing the grading system with sample images and the description of the GUI.” (Page 6, line 120). In addition to the tutorial on image quality requirements, has any training been carried out to enable participants to recognize retinal structures and artefacts? If yes, detailed information on how the training was carried out should be provided (e.g., hours of training and grading guidelines). If not, the interpretation of the grading criteria by participants without medical background would be affected, as well as the results of the study. Page 7, line 130. Was the agreement calculated between study participants or between each participant and the ground truth (i.e., expert’ labels)? Was the weighted kappa applied to assess agreement when grading image quality (an ordinal outcome)? Page 9, line 186. “The time necessary for the first 50 images was higher than for the last 50 images which shows the learning curve of the grading task.” Based on the data presented, it is not possible to conclude that faster grading is a consequence of the learning curve as there are other possible explanations for this result (e.g., grading fatigue). Minor comments Page 2, line 28. “The performance of the grading was timed and the accuracy was assessed. Accuracy was also assessed with the excellent and good categories merged.” Since kappa is not a measure of accuracy, it is not possible to say that the accuracy was assessed. Page 2, line 32 and page 7, line 140. Please define the abbreviations “GM” and “GN” on first use in the abstract and in the main text. To follow a similar logic of the abbreviation of the terms “medical” and “non-medical” (i.e., N and NM, respectively), consider changing the abbreviation from GN to GNM. Page 2, line 35 and page 8, line 149. “The median time for single decision was in all categories longer in the GN group (5,28 [4.91-5.66] vs. 4.75 [4.31-5.18] sec, 6.33 [5.96-6.71] vs. 4.10 [3.95-4.26] sec, 5.74 [5.31-6.17] vs. 4.60 [4.27-4.93] sec, 3.01 [2.83-3.20] vs. 3.87 [3.67-4.06] for the E, G, A and I, in the GN and GM groups, respectively).”. Is the median time of 3.01 seconds related to the GM group? If yes, consider placing this number after the median time of 3.87 seconds, as in the previous categories (i.e., E, G and A) the median time of the GN group was mentioned first. Page 2, line 39. The Cohen’s kappa coefficient for NM mentioned in the text (i.e., 0.376) does not match the coefficient mentioned the Table 2 for the same group (i.e., 0.348). Please clarify. Page 4, line 54. “Machine Learning (ML) has emerged as an important tool for healthcare, particularly when it comes to training”. Please explain better this sentence. Page 4, line 73. There is no need to abbreviate “optical coherence tomography” as the authors don’t use the abbreviation elsewhere in the text. Page 6, line 99 and Table 1. Please explain better when the criteria “Artefacts” is fulfilled. Is the presence of any artefact enough to meet the criteria? 
Or, as in the EyePACS grading system, the criteria is fulfilled when the image is “sufficiently free of artifacts (such as dust spots, arc defects, and eyelash images) to allow adequate grading”? In the latter case, clinical knowledge would likely be needed to allow judgment of which artefacts are clinically significant to interfere with grading. Page 6, line 105. “The images were previously labeled for image quality by two experts (KLLF and GMS) using the standards described above.” It would be interesting to know more about the expertise level of the graders KLLF and GMS (e.g., are they ophthalmologists or retina specialists?) and how the disagreements between them were resolved (e.g., arbitration, adjudication). According to Krause et al (2018), when establishing a reference standard, these factors can affect grader variability. Page 6, line 103. Please explain better when the image is graded as “Insufficient for grading”. If, for a given image, all 4 quality criteria are not met, when will this image be classified as “insufficient” instead of “adequate”? Page 8, line 166. “There was no consensus among our graders on the importance of the missing optic nerve head”. Please clarify this sentence and why there was a lack of consensus as the optic disc head is one of the key anatomical features in the color fundus photograph and is part of the image field criteria. Page 9, line 175. Consider rewriting this sentence and using short statements to better express the information. Page 9, line 182. Consider removing “relatively high number of participants with non-medical background” as only 4 subjects were included in this group. Page 9, line 189. “In the good, adequate and insufficient groups the grading took longer for the participants without medical background.” Please clarify this sentence as in the results section (page 7, line 150) the authors mention that “the median time for single decision was in all categories longer in the GN group”, including the category “excellent”. Page 9, line 192. “Our results show that a high level of repeatability can be achieved already by a very short training period and smaller set of retinal photographs, even in the case of medically untrained people.” Repeatability may not be the most appropriate term because it refers to the variation in repeat measurements made by the same instrument or method over a short period of time (Barlett and Frost 2008). This does not seem to be the case of this study, in which the experiment (i.e., grading of multiple images) was performed only once. Page 10, line 202-215. Consider removing these two paragraphs as they contain information already mentioned in the introduction rather than the context and relevance of the results. Page 11, line 235. “In our study, it took non-medical graders in median 17 minutes to grade the first 200 images”. Please remove the word “first” as there were only 200 images in the dataset. Page 11, line 235 and 247. Based on the data presented, it is not possible to conclude that faster grading is a consequence of the learning curve. Page 11, line 236-238. Remove commas after “estimate” and “ophthalmology”. Page 11, line 242. “We included only 9 volunteers for the assessment of the labeling task”. Please clarify the number of study participants as in the method section the authors mention (Page 6, line 117) that the images were evaluated by 8 volunteers. Page 12, line 264-267. 
“With the increasing role of crowdsourcing in numerous projects harnessing human intelligence, in the context of health-care data the question arises how safe is it to leave diagnosing a medical condition to “untrained” personnel or rather use a machine to perform this task.” Consider rewriting this sentence and using short statements to better express the information. Page 12, line 270. Please add a conclusion supported by the study results as the last paragraph contains mainly general statements rather than a conclusion about the work itself. References Krause, Jonathan, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S. Corrado, Lily Peng, and Dale R. Webster. 2018. “Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy.” Ophthalmology 125 (8): 1264–72. Bartlett, J. W., and C. Frost. 2008. “Reliability, Repeatability and Reproducibility: Analysis of Measurement Errors in Continuous Variables.” Ultrasound in Obstetrics & Gynecology: The Official Journal of the International Society of Ultrasound in Obstetrics and Gynecology 31 (4): 466–75. Reviewer #2: The largest weakness of this study is in the task chosen to assess grader agreement. Image quality is inherently subjective. Even when provided with a grading guideline. Why not instead pick a more objective task to evaluate grader agreement? Furthermore you do not describe how ground truth was arbitrated or adjudicated between the two specialists, this crucial. Due to the inherent subjectivity of this task, the author's should have graders select each criteria as y/n (adequate focus, artefacts, etc), then compile and codify to your quality grades. Otherwise, I see this as a poor study design (as the graders likely are not following the guidelines well, letting their subjective "quality" biases dominate, and leading to a poor kappa). Much easier to pick a different task to evaluate agreement. We're there contrast and brightness adjustments in the grading tool? How was the ground truth set it arbitration adjudication etc Line 164 . Please explain what is meant exactly by: Each participant reported positively on the learning experience. Line 176 please clarify why this is semi automated here and throughout, is there active learning? I'm not sure how this is automated and would avoid this buzzword. Line 192, please remove repeatability, you did not test this here, you tested agreement Results: Please emphasize Kappas, they are not reported here.. that's the main outcome, not the times in my opinion. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: Yes: Edward Korot [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] 
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 21 Apr 2022 Rebuttal letter PONE-D-21-35940 The assessment of fundus image quality labeling reliability among graders with different backgrounds Journal Requirements 5. We note that Figures 1A, 1B, 1C 1D, 2A and 2B in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission: Authors’ response: We would like to state here that all images presented in the manuscript were produced by the authors and therefore no permission of the copyright holder is required. We would like to thank both reviewers for their insightful comments. Please find below our detailed answers for all points raised in the review. Reviewer #1 Thank you for giving me the opportunity to review the article. I hope the following suggestions will be helpful and will strengthen the work. The study evaluates the image quality assessment of color fundus photographs among graders with medical and non-medical backgrounds. The topic is interesting and relevant considering the scarcity of labeled data for the development of machine learning models and the importance of imaging quality for the accuracy and applicability of the algorithms. The manuscript is generally well presented, but there are several issues which needs to be addressed: Authors’ response: We would like to thank Reviewer #1 for his/her comprehensive revision and precise suggestions. We do feel they contributed a substantial improvement of our work. Major comments Page 4, line 51-60. Quotation marks should be used for direct quotes. Consider rewriting the first paragraph of the Introduction as the following sentences were written using the same words as the references: - Page 4, line 51. “Medical imaging is expanding globally at an unprecedented rate leading to an ever-expanding quantity of data that requires human expertise and judgement to interpret and triage”. De Fauw et al (2018). - Page 4, line 55. “Benchmark datasets are essential for computational research in healthcare. These datasets should be created by intentional design that is mindful of social and health system priorities. If a deliberate and systematic approach is not followed, not only will the considerable benefits of clinical algorithms fail to be realized, but the potential harms may be regressively incurred across existing gradients of social inequity.” (T. Panch et al. 
2020) Authors’ response: We completely agree with Reviewer #1 and are thankful for raising these points. We found the above quotes from the given references of pivotal importance. Together with other points addressed in both reviews, we restructured the entire first paragraph of the introduction. (See page 3, lines 47-55) The study was conducted “in an effort to model the democratization of retinal image labeling” (page 5, line 78). To do that, the authors use a phyton grading tool that requires some degree of programming knowledge (page 8, line 162), which could make it difficult for people without specific computing skills to participate in the image labeling process. Authors’ response: We agree with Reviewer #1 on the controversy of the above statements; therefore, we provided a more detailed explanation of the mentioned problem. As for the need for specific computing skills, we stated that all our volunteers had at least intermediate level computer skills. (page 5, line 121) Most of them (6 of 8) had no previous experience with programming in python and the majority found its use easy and intuitive. In our opinion, in case of a real-world scenario most of the participants applying for such a grading task would most probably have at least as good computer (or even minimal programing skills) as our 8 participants had. In an effort to provide an even simpler way for grading and to be able to reach a potentially broader circle of volunteers we are currently working on a web-based annotation application, as well. According to the above, we applied certain changes to the text for clarification. (See page 8, line 181; page 12, line 287) The imaging grading criteria used in the study take into consideration anatomical structures (e.g., optic nerve head, fovea and macula), and imaging artefacts (e.g., eyelashes and fingerprints) whose identification require a clinical background or specific training and a certain degree of experience. According to the authors, “prior to grading the participants received a tutorial consisting of oral explanation of the task, supported by a pdf document describing the grading system with sample images and the description of the GUI.” (Page 6, line 120). In addition to the tutorial on image quality requirements, has any training been carried out to enable participants to recognize retinal structures and artefacts? If yes, detailed information on how the training was carried out should be provided (e.g., hours of training and grading guidelines). If not, the interpretation of the grading criteria by participants without medical background would be affected, as well as the results of the study. Authors’ response: We agree with the need for more transparency about grading instructions and training. We provided the participants a written tutorial that included detailed explanation on retinal anatomy as well as the definition of various imaging artefacts. One of our main purposes and the motivation of the study were to investigate whether there was a considerable disadvantage in outsourcing the grading task to people without medical background. We are confident that our participants, following the aforementioned tutorial, were able to distinguish between retinal structures such as macula, optic nerve head and retinal blood vessels and to identify different types of artefacts. As requested by Reviewer #2 we performed an additional analysis of each single grading criteria to provide a more objective evaluation. 
We applied the above changes to the manuscript (see page 5, line 122) and submitted our tutorial pdf as supplementary material.

Page 7, line 130. Was the agreement calculated between study participants or between each participant and the ground truth (i.e., experts' labels)? Was the weighted kappa applied to assess agreement when grading image quality (an ordinal outcome)?

Authors' response: Originally, we calculated Cohen's kappa between each participant and the expert-labeled data. We agree with the Reviewer that Cohen's weighted kappa is more suitable for our study design and, therefore, recalculated our results using weighted kappa values, which are now presented in the revised version of the manuscript. (See page 2, line 34; page 6, line 136; page 7, line 165)

Page 9, line 186. "The time necessary for the first 50 images was higher than for the last 50 images which shows the learning curve of the grading task." Based on the data presented, it is not possible to conclude that faster grading is a consequence of the learning curve, as there are other possible explanations for this result (e.g., grading fatigue).

Authors' response: We agree with Reviewer #1 and changed the statements (see page 9, line 215; page 11, line 252) regarding the time difference between grading the first and last 50 images. We would, however, like to add that the relatively short processing time (approximately 20 minutes for the original task) and a total number of 200 images would not reach the typical level of exhaustion associated with grader decision fatigue, as shown by Waite et al. (Waite S, Kolla S, Jeudy J, Legasto A, Macknik SL, Martinez-Conde S, Krupinski EA, Reede DL. "Tired in the Reading Room: The Influence of Fatigue in Radiology." J Am Coll Radiol. 2016. https://doi.org/10.1016/j.jacr.2016.10.009) As we later mention, we also agree with Reviewer #1 that a learning curve cannot be observed; therefore, we omitted the aforementioned phrase. (See page 9, line 215; page 11, lines 252 and 269)

Minor comments

Page 2, line 28. "The performance of the grading was timed and the accuracy was assessed. Accuracy was also assessed with the excellent and good categories merged." Since kappa is not a measure of accuracy, it is not possible to say that the accuracy was assessed.

Authors' response: We agree with Reviewer #1 on this comment and applied a correction to the text. (See page 2, line 30)

Page 2, line 32 and page 7, line 140. Please define the abbreviations "GM" and "GN" on first use in the abstract and in the main text. To follow a similar logic to the abbreviation of the terms "medical" and "non-medical" (i.e., M and NM, respectively), consider changing the abbreviation from GN to GNM.

Authors' response: We agree with the Reviewer and applied the recommended changes. (See page 2, lines 27, 33; page 5, line 119; page 6, line 144; page 7, lines 152, 161-170; page 8, lines 175-178 and 197)

Page 2, line 35 and page 8, line 149. "The median time for single decision was in all categories longer in the GN group (5.28 [4.91-5.66] vs. 4.75 [4.31-5.18] sec, 6.33 [5.96-6.71] vs. 4.10 [3.95-4.26] sec, 5.74 [5.31-6.17] vs. 4.60 [4.27-4.93] sec, 3.01 [2.83-3.20] vs. 3.87 [3.67-4.06] for the E, G, A and I, in the GN and GM groups, respectively)." Is the median time of 3.01 seconds related to the GM group? If yes, consider placing this number after the median time of 3.87 seconds, as in the previous categories (i.e., E, G and A) the median time of the GN group was mentioned first.
Authors' response: We thank Reviewer #1 for having pointed this out. We applied the recommended change and rewrote the abstract. (See page 2 and page 7, line 159)

Page 2, line 39. The Cohen's kappa coefficient for NM mentioned in the text (i.e., 0.376) does not match the coefficient mentioned in Table 2 for the same group (i.e., 0.348). Please clarify.

Authors' response: We thank Reviewer #1 for this comment. We omitted this statement on page 2 and created a new table (Table 3 on page 17) with the calculated weighted kappa values for the better transparency of our results.

Page 4, line 54. "Machine Learning (ML) has emerged as an important tool for healthcare, particularly when it comes to training". Please explain this sentence better.

Authors' response: We would like to refer to the entirely rewritten first paragraph of the Introduction, which also contains an explanation of this point. (See page 3, lines 47-55)

Page 4, line 73. There is no need to abbreviate "optical coherence tomography" as the authors don't use the abbreviation elsewhere in the text.

Authors' response: We thank Reviewer #1 for pointing this out; we have now deleted the abbreviation. (See page 3, line 68)

Page 6, line 99 and Table 1. Please explain better when the criterion "Artefacts" is fulfilled. Is the presence of any artefact enough to meet the criterion? Or, as in the EyePACS grading system, is the criterion fulfilled when the image is "sufficiently free of artifacts (such as dust spots, arc defects, and eyelash images) to allow adequate grading"? In the latter case, clinical knowledge would likely be needed to allow judgment of which artefacts are clinically significant enough to interfere with grading.

Authors' response: We extended the explanation in the manuscript (see page 12, line 274; page 14 – Table 1) and would like to refer to our detailed image grading tutorial containing exemplary images (Suppl. 1). In concordance with our aims for this pilot, we considered the presence of the artefacts enough to meet the criterion, regardless of their effect on the image grading possibilities.

Page 6, line 105. "The images were previously labeled for image quality by two experts (KLLF and GMS) using the standards described above." It would be interesting to know more about the expertise level of the graders KLLF and GMS (e.g., are they ophthalmologists or retina specialists?) and how the disagreements between them were resolved (e.g., arbitration, adjudication). According to Krause et al. (2018), when establishing a reference standard, these factors can affect grader variability.

Authors' response: We thank the Reviewer for this important comment. We added a detailed explanation to our paper accordingly and inserted the suggested reference. (See page 5, line 102)

Page 6, line 103. Please explain better when the image is graded as "Insufficient for grading". If, for a given image, all 4 quality criteria are not met, when will this image be classified as "insufficient" instead of "adequate"?

Authors' response: We would like to thank Reviewer #1 for this important question. We graded images as "insufficient" when a part of the image or the whole image was not adequate for grading retinal lesions at all (and thus all 4 quality criteria were not met). To make this clear, we also included a detailed explanation in our manuscript (see page 5, line 98). This was also explained in our grading tutorial (Suppl. 1):

Insufficient: One or more retinopathy lesions cannot be graded.
- In case the third-generation branches within one optic disc diameter of the fovea and the optic nerve head cannot be identified, the image should be considered inadequate for grading.
- Images lacking visibility of more than 50% of the depicted retinal field should be considered inadequate for grading.

Driven by the above question, we also investigated merging the "adequate" and "insufficient" groups and found that it resulted in a further improvement of the inter-rater agreement (0.665 and 0.670 for the first and second grading rounds, respectively). (See pages 7 and 8, lines 165-178 and page 17 – Table 3)

Page 8, line 166. "There was no consensus among our graders on the importance of the missing optic nerve head". Please clarify this sentence and why there was a lack of consensus, as the optic nerve head is one of the key anatomical features in the color fundus photograph and is part of the image field criteria.

Authors' response: We thank Reviewer #1 for this important remark. The sentence can indeed be misunderstood; it refers to our internal debate regarding the importance of the inclusion of the ONH on the images, which could, in turn, enable screening for glaucoma and increase the precision of DR screening. However, our primary goal was the assessment of general image quality, and thus the ONH has been graded within the image field definition category as an image quality factor equally important as focus, illumination and artefacts. Accordingly, we made our point clearer in the text. (See page 8, line 193; page 11, line 270)

Page 9, line 175. Consider rewriting this sentence and using short statements to better express the information.

Authors' response: We agree with Reviewer #1 and applied the suggested change. (See page 9, line 204)

Page 9, line 182. Consider removing "relatively high number of participants with non-medical background" as only 4 subjects were included in this group.

Authors' response: We agree with Reviewer #1 and applied the suggested change. (See page 9, line 211)

Page 9, line 189. "In the good, adequate and insufficient groups the grading took longer for the participants without medical background." Please clarify this sentence, as in the Results section (page 7, line 150) the authors mention that "the median time for single decision was in all categories longer in the GN group", including the category "excellent".

Authors' response: We thank Reviewer #1 for pointing this out. We applied the suggested clarification. (See page 10, line 223)

Page 9, line 192. "Our results show that a high level of repeatability can be achieved already by a very short training period and smaller set of retinal photographs, even in the case of medically untrained people." Repeatability may not be the most appropriate term because it refers to the variation in repeat measurements made by the same instrument or method over a short period of time (Bartlett and Frost 2008). This does not seem to be the case in this study, in which the experiment (i.e., grading of multiple images) was performed only once.

Authors' response: We agree with Reviewer #1 on this comment and changed the term repeatability to agreement. We also extended our list of references with the suggested paper. (See page 10, line 226)

Page 10, lines 202-215. Consider removing these two paragraphs as they contain information already mentioned in the Introduction rather than the context and relevance of the results.

Authors' response: We agree with Reviewer #1 on this comment and deleted the suggested paragraphs.
We also rewrote the entire first paragraph of the Introduction in order to avoid repeating the same information. (See page 3, lines 47-55)

Page 11, line 235. "In our study, it took non-medical graders in median 17 minutes to grade the first 200 images". Please remove the word "first" as there were only 200 images in the dataset.

Authors' response: We agree with Reviewer #1 and applied the suggested change. (See page 11, line 252)

Page 11, lines 235 and 247. Based on the data presented, it is not possible to conclude that faster grading is a consequence of the learning curve.

Authors' response: We agree with Reviewer #1 that a learning curve cannot be observed; therefore, we omitted the mentioned sentences. (See page 11, lines 252 and 269)

Page 11, lines 236-238. Remove commas after "estimate" and "ophthalmology".

Authors' response: We thank Reviewer #1 for pointing this out. We applied the suggested correction. (See page 11, lines 252 and 254)

Page 11, line 242. "We included only 9 volunteers for the assessment of the labeling task". Please clarify the number of study participants, as in the Methods section (page 6, line 117) the authors mention that the images were evaluated by 8 volunteers.

Authors' response: We thank Reviewer #1 for pointing out this typo, which we have corrected in the text. (See page 11, line 263)

Page 12, lines 264-267. "With the increasing role of crowdsourcing in numerous projects harnessing human intelligence, in the context of health-care data the question arises how safe is it to leave diagnosing a medical condition to "untrained" personnel or rather use a machine to perform this task." Consider rewriting this sentence and using short statements to better express the information.

Authors' response: We agree with Reviewer #1 and simplified the sentence by using short statements. (See page 12, line 294)

Page 12, line 270. Please add a conclusion supported by the study results, as the last paragraph contains mainly general statements rather than a conclusion about the work itself.

Authors' response: We thank the Reviewer for this valuable remark and agree that a more effective paragraph with the conclusion of our work is necessary. In line with this, we rewrote the last paragraph. (See page 12, line 298)

References:
Krause, Jonathan, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S. Corrado, Lily Peng, and Dale R. Webster. 2018. "Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy." Ophthalmology 125 (8): 1264-72.
Bartlett, J. W., and C. Frost. 2008. "Reliability, Repeatability and Reproducibility: Analysis of Measurement Errors in Continuous Variables." Ultrasound in Obstetrics & Gynecology: The Official Journal of the International Society of Ultrasound in Obstetrics and Gynecology 31 (4): 466-75.

Authors' response: We included these references in the manuscript in the above-mentioned sections.

Reviewer #2

Authors' response: We would also like to thank Reviewer #2 for the important points raised and for his efforts to improve our work.

Reviewer #2: The largest weakness of this study is in the task chosen to assess grader agreement. Image quality is inherently subjective, even when provided with a grading guideline. Why not instead pick a more objective task to evaluate grader agreement? Due to the inherent subjectivity of this task, the authors should have graders select each criterion as y/n (adequate focus, artefacts, etc.), then compile and codify these to your quality grades. Otherwise, I see this as a poor study design (as the graders likely are not following the guidelines well, letting their subjective "quality" biases dominate, and leading to a poor kappa). Much easier to pick a different task to evaluate agreement.

Authors' response: We completely agree with Reviewer #2 regarding the concern about the inherent subjectivity of our study. For this reason, we conducted a second grading round with the same subjects based on the suggestion of the Reviewer. We adapted our grading guideline (Suppl. 1) and asked the participants to select each grading criterion for the four image quality groups. Thus, the graders could choose among 14 single criteria (e.g. "Focus optimal", "Focus unsharp", "Illumination optimal", "Illumination too dark", "Illumination too light/overexposed", etc.). The details of this extra round of assessment are described in the Methods and Results sections accordingly (see page 6, lines 126 and 134; Table 2 – page 16). According to our results (page 2, line 34; page 7, line 165; page 17 – Table 3), the above, more objective classification helped our graders reach higher agreement: median weighted kappa 0.564 vs. 0.594 (4 groups) and 0.637 vs. 0.667 (3 groups). However, it is important to point out two facts regarding our work. First, our study was merely a pilot in nature that should provide a basis for future assessments. Second, the ultimate benefit of our work should be the robust identification of images with poor quality in a screening setting. Based on our results, even with the inherent subjectivity taken into consideration, providing 3 categories for fundus image quality grading could be sufficient, simple and fast, without the need for multiple predefined labels. We believe that images that are of poor quality, and thus are certainly not sufficient for screening for retinal diseases, can be reliably identified and eliminated, which could, in turn, enable better training of future AI algorithms.

Were there contrast and brightness adjustments in the grading tool?

Authors' response: We thank the Reviewer for this important question. There were no zooming, brightness or contrast changing options available. Currently, we are working on a web-based application which would contain additional zoom and brightness adjustments. We also stated this in the revised manuscript. (See page 12, line 287)

Furthermore, you do not describe how ground truth was arbitrated or adjudicated between the two specialists; this is crucial. How was the ground truth set (arbitration, adjudication, etc.)?

Authors' response: We would like to thank Reviewer #2 for this comment. A detailed explanation was added to the manuscript (page 5, line 102): "The images were previously labeled for image quality by two experts and served as the reference for the analysis (KLLF – board certified ophthalmologist with experience in retinal imaging – and GMS – board certified senior retina specialist with retinal image grading experience in the telemedical screening of diabetic retinopathy) using the standards described above. In case of any disagreement, GMS's decision served as reference."

Line 164. Please explain what is meant exactly by: "Each participant reported positively on the learning experience."

Authors' response: We added a detailed description of the feedback received from the participants. (See page 8, line 189)

Line 176. Please clarify why this is called semi-automated here and throughout; is there active learning? I'm not sure how this is automated and would avoid this buzzword.
Authors' response: We agree with Reviewer #2 and deleted the term "semi-automated" in order to avoid such a misunderstanding. (See page 9, line 204)

Line 192. Please remove repeatability; you did not test this here, you tested agreement.

Authors' response: We agree with Reviewer #2 on this comment and applied a correction. (See page 10, line 226)

Results: Please emphasize the kappas; they are not reported here. That is the main outcome, not the times, in my opinion.

Authors' response: We included a more detailed elaboration of the kappa values, including the assessment of weighted Cohen's kappa. The changes can be found on page 2, line 34 and page 7, line 165.

Submitted filename: Laurik Response to reviewers R1_21Apr22.docx

27 Jun 2022: Decision letter
PONE-D-21-35940R1: The assessment of fundus image quality labeling reliability among graders with different backgrounds

Dear Dr. Somfai,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Prof. Andrzej Grzybowski
Academic Editor
PLOS ONE

Additional Editor Comments (optional): Thank you for submitting your valuable paper to our journal.

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.
Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: No
Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: Yes
Reviewer #2: Yes

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: I appreciate the authors' effort to address the suggestions. The adjustments (e.g., changes in statistical analysis and more details about the grading protocol) significantly improved the manuscript. Just one minor comment: Page 9, line 209. Please replace "repeatability" with "agreement" as "repeatability" was not tested in this study.
Reviewer #2: Thanks for addressing all comments, the manuscript is now much improved. I appreciate the thoroughness of additional analyses.

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: Yes: Edward Korot

13 Jul 2022: Acceptance letter
PONE-D-21-35940R1: The assessment of fundus image quality labeling reliability among graders with different backgrounds

Dear Dr. Somfai:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Andrzej Grzybowski
Academic Editor
PLOS ONE
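Editorial note: the criterion-based second grading round discussed in the correspondence above (14 single-criterion labels compiled into the four quality categories E/G/A/I) can be illustrated with a minimal Python sketch. The criterion names beyond those quoted in the authors' response and the "worst selected level wins" compilation rule are assumptions for illustration only, not the authors' exact scheme.

```python
# Minimal sketch (assumed scheme, not the authors' exact one): compile
# per-criterion labels into the four quality categories E/G/A/I.

CATEGORY_ORDER = ["E", "G", "A", "I"]  # excellent, good, adequate, insufficient

# Hypothetical mapping from single-criterion labels to the quality level they imply;
# only a few of the 14 labels are quoted in the correspondence, the rest are invented here.
CRITERION_TO_LEVEL = {
    "Focus optimal": "E",
    "Focus unsharp": "A",
    "Illumination optimal": "E",
    "Illumination too dark": "A",
    "Illumination too light/overexposed": "A",
    "Artefacts present": "A",
    "Image field incomplete": "A",
    "Not gradable": "I",
}

def compile_category(selected_criteria):
    """Return the worst (lowest-quality) category implied by the selected criteria."""
    levels = [CRITERION_TO_LEVEL[c] for c in selected_criteria]
    # If nothing was selected, treat the image as insufficient for grading.
    return max(levels, key=CATEGORY_ORDER.index) if levels else "I"

# Example: optimal focus but dark illumination compiles to "A" (adequate).
print(compile_category(["Focus optimal", "Illumination too dark"]))  # -> A
```

Compiling objective yes/no criteria into a category, rather than asking for the category directly, is exactly the change Reviewer #2 requested to reduce the influence of subjective quality impressions.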
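Likewise, the agreement analysis referred to throughout the rebuttal (Cohen's weighted kappa between each grader and the expert reference, with and without merging the excellent and good categories) can be sketched in a few lines of Python. Linear weights and scikit-learn's cohen_kappa_score are assumptions here; the correspondence does not state the exact weighting scheme or software the authors used.

```python
# Minimal sketch: weighted Cohen's kappa between one grader and the expert reference
# on the ordinal scale E > G > A > I, optionally merging E and G into one category.
from sklearn.metrics import cohen_kappa_score

ORDER = {"E": 0, "G": 1, "A": 2, "I": 3}

def weighted_kappa(expert_labels, grader_labels, merge_excellent_good=False):
    """Cohen's weighted kappa on ordinal category labels (linear weights assumed)."""
    def encode(label):
        if merge_excellent_good and label in ("E", "G"):
            return 0                                   # merged E+G category
        rank = ORDER[label]
        return rank - 1 if merge_excellent_good else rank
    y_ref = [encode(l) for l in expert_labels]
    y_grd = [encode(l) for l in grader_labels]
    return cohen_kappa_score(y_ref, y_grd, weights="linear")

# Toy labels for illustration only (not study data):
expert = ["E", "G", "A", "I", "G", "A"]
grader = ["G", "G", "A", "I", "E", "I"]
print(round(weighted_kappa(expert, grader), 3))                               # 4 categories
print(round(weighted_kappa(expert, grader, merge_excellent_good=True), 3))    # 3 categories
```

In the study, such per-grader kappa values were summarized as medians across the graders, which is where the reported figures such as 0.564 (four categories) and 0.637 (three categories) come from.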
References (21 in total)

1.  Reliability, repeatability and reproducibility: analysis of measurement errors in continuous variables.

Authors:  J W Bartlett; C Frost
Journal:  Ultrasound Obstet Gynecol       Date:  2008-04       Impact factor: 7.299

2.  Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs.

Authors:  Varun Gulshan; Lily Peng; Marc Coram; Martin C Stumpe; Derek Wu; Arunachalam Narayanaswamy; Subhashini Venugopalan; Kasumi Widner; Tom Madams; Jorge Cuadros; Ramasamy Kim; Rajiv Raman; Philip C Nelson; Jessica L Mega; Dale R Webster
Journal:  JAMA       Date:  2016-12-13       Impact factor: 56.272

3.  Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy.

Authors:  Jonathan Krause; Varun Gulshan; Ehsan Rahimy; Peter Karth; Kasumi Widner; Greg S Corrado; Lily Peng; Dale R Webster
Journal:  Ophthalmology       Date:  2018-03-13       Impact factor: 12.079

4.  Beyond Performance Metrics: Automatic Deep Learning Retinal OCT Analysis Reproduces Clinical Trial Outcome.

Authors:  Jessica Loo; Traci E Clemons; Emily Y Chew; Martin Friedlander; Glenn J Jaffe; Sina Farsiu
Journal:  Ophthalmology       Date:  2019-12-23       Impact factor: 12.079

5.  Tired in the Reading Room: The Influence of Fatigue in Radiology. (Review)

Authors:  Stephen Waite; Srinivas Kolla; Jean Jeudy; Alan Legasto; Stephen L Macknik; Susana Martinez-Conde; Elizabeth A Krupinski; Deborah L Reede
Journal:  J Am Coll Radiol       Date:  2016-12-09       Impact factor: 5.532

6.  Accuracy of trained rural ophthalmologists versus non-medical image graders in the diagnosis of diabetic retinopathy in rural China.

Authors:  Martha McKenna; Tingting Chen; Helen McAneney; Miguel Angel Vázquez Membrillo; Ling Jin; Wei Xiao; Tunde Peto; Mingguang He; Ruth Hogg; Nathan Congdon
Journal:  Br J Ophthalmol       Date:  2018-07-04       Impact factor: 4.638

7.  Rapid grading of fundus photographs for diabetic retinopathy using crowdsourcing.

Authors:  Christopher J Brady; Andrea C Villanti; Jennifer L Pearson; Thomas R Kirchner; Omesh P Gupta; Chirag P Shah
Journal:  J Med Internet Res       Date:  2014-10-30       Impact factor: 5.428

8.  Intra- and inter-rater agreement between an ophthalmologist and mid-level ophthalmic personnel to diagnose retinal diseases based on fundus photographs at a primary eye center in Nepal: the Bhaktapur Retina Study.

Authors:  Raba Thapa; Sanyam Bajimaya; Renske Bouman; Govinda Paudyal; Shankar Khanal; Stevie Tan; Suman S Thapa; Ger van Rens
Journal:  BMC Ophthalmol       Date:  2016-07-18       Impact factor: 2.209

9.  "Yes, but will it work for my patients?" Driving clinically relevant research with benchmark datasets. (Review)

Authors:  Trishan Panch; Tom J Pollard; Heather Mattie; Emily Lindemer; Pearse A Keane; Leo Anthony Celi
Journal:  NPJ Digit Med       Date:  2020-06-19

10.  Artificial Intelligence to Identify Retinal Fundus Images, Quality Validation, Laterality Evaluation, Macular Degeneration, and Suspected Glaucoma.

Authors:  Miguel Angel Zapata; Dídac Royo-Fibla; Octavi Font; José Ignacio Vela; Ivanna Marcantonio; Eduardo Ulises Moya-Sánchez; Abraham Sánchez-Pérez; Darío Garcia-Gasulla; Ulises Cortés; Eduard Ayguadé; Jesus Labarta
Journal:  Clin Ophthalmol       Date:  2020-02-13
