
Deep learning for classification of pediatric chest radiographs by WHO's standardized methodology.

Yiyun Chen1, Craig S Roberts1, Wanmei Ou1, Tanaz Petigara1, Gregory V Goldmacher1, Nicholas Fancourt2, Maria Deloria Knoll3.   

Abstract

BACKGROUND: The World Health Organization (WHO)-defined radiological pneumonia is a preferred endpoint in pneumococcal vaccine efficacy and effectiveness studies in children. Automating the WHO methodology may support more widespread application of this endpoint.
METHODS: We trained a deep learning model to classify pediatric chest radiographs (CXRs) for pneumonia using the WHO's standardized methodology. The model was pretrained on CheXpert, a dataset containing 224,316 adult CXRs, and fine-tuned on PERCH, a pediatric dataset containing 4,172 CXRs. The model was then tested on two pediatric CXR datasets released by the WHO. We also compared the model's performance to that of radiologists and pediatricians.
RESULTS: The average area under the receiver operating characteristic curve (AUC) for primary endpoint pneumonia (PEP) across 10-fold validation of PERCH images was 0.928; average AUC after testing on WHO images was 0.977. The model's classification performance was better on test images with high inter-observer agreement; however, the model still outperformed human assessments in AUC and precision-recall spaces on low agreement images.
CONCLUSION: A deep learning model can classify pneumonia CXR images in children at a performance comparable to human readers. Our method lays a strong foundation for the potential inclusion of computer-aided readings of pediatric CXRs in vaccine trials and epidemiology studies.

Year:  2021        PMID: 34153076      PMCID: PMC8216551          DOI: 10.1371/journal.pone.0253239

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Pneumonia is the leading infectious cause of death in children. Streptococcus pneumoniae, a gram-positive bacterium, is a common cause of bacterial pneumonia in children. In 2015, almost 300,000 deaths due to pneumococcal pneumonia were estimated to have occurred in children less than 5 years of age, primarily in Africa and Asia [1]. Pneumococcal conjugate vaccines (PCVs) are highly effective at preventing pneumococcal disease [2], but estimating their impact on pneumonia in children requires large studies and standardized case definitions. Currently available biological tests are insufficient to identify the etiology of pneumonia in children: antigen tests lack specificity, blood cultures lack sensitivity, and lung aspirates are impractical to obtain [3]. On chest x-ray (CXR), lobar consolidation is associated with bacterial pneumonia, while mild interstitial changes or infiltrates are associated with viral pneumonia. The World Health Organization (WHO) considers lobar consolidation a more specific outcome measure for pneumococcal pneumonia, which led the WHO to develop a methodology to standardize the radiologic definitions of childhood pneumonia for use in pneumococcal vaccine efficacy trials and epidemiology studies [3]. Primary endpoint pneumonia (PEP), defined as the presence of consolidation or pleural effusion, has since been used as an endpoint in a number of vaccine efficacy and impact studies [4-8]. In addition to PEP, the WHO methodology includes conclusions for other infiltrates, normal (i.e., no consolidation, other infiltrates, or effusion), and uninterpretable.
Although radiological endpoints are valuable for assessing vaccine efficacy and impact, they require radiologist or physician engagement that is time- and cost-intensive. Most evaluations of PCV impact on radiological pneumonia have not used the WHO methodology for CXR interpretation [9]. Among studies that have used the WHO methodology, the number of CXR images analyzed varies given different resource constraints. Smaller studies have evaluated approximately 4,000 CXRs in the 3 years before and after PCV introduction [10], while larger time-series analyses have evaluated over 72,000 hospitalizations over 14 years (~10,000 PEP cases) [11] and 2.7 million visits over 9 years (~13,000 PEP cases) [8]. Automating the CXR reading task could not only standardize CXR interpretations across time and settings, but also reduce the resources required to conduct these studies.
In addition to resource constraints, the subjective nature of the reading process and the varying level of expertise among radiologists and physicians can lead to considerable inter- and intra-observer variability [12]. The WHO methodology is designed to standardize interpretations of CXRs, which is important for accurately determining the impact of PCVs in clinical trials and observational studies. In a randomized controlled trial of the 7-valent PCV, per-protocol vaccine efficacy against radiological pneumonia, read by a radiologist at the point of care, was 20.5% in children less than 5 years of age [4]. After the CXRs were reevaluated using the WHO methodology, vaccine efficacy against radiological pneumonia increased to 30.3% due to the improved specificity of the endpoint [13]. This illustrates the potential for discordant interpretations by human readers and the impact they can have on evaluating interventions, as well as on prevalence estimates and epidemiological trends in disease.
Recent advancements in deep learning have enabled the automation of CXR reading at a performance comparable to experienced radiologists [14-17]. Automating the CXR reading task may improve sample efficiency by reducing discrepancies in interpretation and may facilitate more widespread application of radiological endpoints in epidemiological research. In one previous study, Mahomed et al. automated the recognition of PEP using lung segmentation and texture analysis [18], using images from the same dataset as our current study. Their analysis was run on data from the South African research site, only one of the 7 sites in the PERCH (Pneumonia Etiology Research for Child Health) study, and achieved an area under the receiver operating characteristic curve (AUC) of 0.85 (95% CI: 0.823–0.876) for PEP within the South African dataset. In this study, we were able to train the model on the entire dataset from all research sites. We opted for a deep learning approach, which requires less feature engineering than classical image analysis, where manual filter selection is typically involved. The final model was then tested on additional external datasets to further validate its performance and to better understand its limitations beyond the PERCH study.

Methods

The WHO methodology classifies pediatric CXRs into four endpoint conclusions: 'PEP' (primary endpoint pneumonia, i.e., consolidation or pleural effusion), 'other (non-endpoint) infiltrates', both 'PEP and other infiltrates', or 'normal' (no consolidation, infiltrate, or effusion). The two PEP-containing categories were merged in the analysis to represent "any PEP". To train a deep learning algorithm for this classification task, we used transfer learning: we pretrained a model on CheXpert, a large public CXR dataset [16], fine-tuned it on the smaller Pneumonia Etiology Research for Child Health (PERCH) dataset, whose CXRs are labeled according to the WHO methodology [12], and then tested it on two pediatric CXR datasets released by the WHO [3, 19].
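As an illustration of this label construction, the merging of the two PEP-containing conclusions can be sketched as follows (a hypothetical mapping; the names are ours, not the study's data dictionary):

```python
# Hypothetical sketch of collapsing the four WHO conclusions into the three
# binary targets used here; the two PEP-containing conclusions merge into "any PEP".
def who_conclusion_to_targets(conclusion: str) -> dict:
    pep = conclusion in ("PEP", "PEP and other infiltrates")
    other = conclusion in ("other infiltrates", "PEP and other infiltrates")
    normal = conclusion == "normal"
    return {"any_PEP": int(pep), "other_infiltrates": int(other), "normal": int(normal)}

print(who_conclusion_to_targets("PEP and other infiltrates"))
# -> {'any_PEP': 1, 'other_infiltrates': 1, 'normal': 0}
```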

CheXpert dataset

The CheXpert dataset consists of 224,316 CXRs from 65,240 patients seen at Stanford Hospital inpatient and outpatient centers between October 2002 and July 2017 [16]. CheXpert is primarily an adult CXR dataset, containing 224,313 images from adults and 3 images from newborns. Natural language processing was used to extract text from radiology reports and label images as positive, negative, or uncertain for the presence of 14 common chest radiographic observations (S1 Fig). Further details can be found in Irvin et al. [16].

PERCH dataset

The PERCH study was a seven-country case-control study of the causes and risk factors of childhood pneumonia in Africa and Asia [20, 21]. Cases were children 1–59 months of age hospitalized with WHO-defined (pre-2013 definition) severe or very severe pneumonia [22, 23]. A total of 4,172 CXR images were available from the 4,232 cases enrolled between August 2011 and January 2014. The PERCH study protocol was approved by the Institutional Review Boards or Ethical Review Committees of each of the seven institutions and of The Johns Hopkins School of Public Health. Parents or guardians of participants provided written informed consent, and all data were fully anonymized [24]. PERCH images were labeled by a 14-person reading panel of radiologists and pediatricians from the 7 study sites, along with a 4-person arbitration panel of radiologists experienced with the WHO methodology. Each image was reviewed by two randomly selected reviewers; images that received discordant interpretations were then reviewed by two randomly selected arbitrators who were blinded to the previous interpretations. Interpretations that remained discordant after arbitration were resolved through a final consensus discussion. Images were classified as PEP, other infiltrates, both PEP and other infiltrates, normal, or uninterpretable (Table 1). Further details can be found in Fancourt et al. [12, 25].
Table 1

Conclusions of CXR readings by radiologists and pediatricians in the training (PERCH) and test (WHO) datasets.

Image Class | Final Conclusion (N = 4,172) | Round-1 Concordant (n = 1,814) | Round-1 Discordant (n = 2,358) | Round-2 Concordant (n = 1,144) | Round-2 Discordant (n = 1,214) | WHO-Original (n = 222) | WHO-CRES (n = 209)
Primary Endpoint Pneumonia | 1,075 (25.8%) | 458 (11%) | 617 (14.8%) | 228 (9.7%) | 389 (32%) | 90 (40.5%) | 71 (34.0%)
Other Infiltrates | 993 (23.8%) | 361 (8.7%) | 632 (15.1%) | 276 (11.7%) | 356 (29.3%) | 44 (19.8%) | 26 (12.4%)
Normal | 1,692 (40.6%) | 854 (20.5%) | 838 (20.1%) | 521 (22.9%) | 317 (26.1%) | 75 (33.8%) | 106 (50.7%)
Uninterpretable | 412 (9.8%) | 141 (3.4%) | 271 (6.4%) | 119 (5%) | 152 (12.5%) | 13 (5.9%) | 6 (2.9%)

The first five columns describe the training dataset (PERCH, N = 4,172): Round-1 conclusions are those of the primary readers and Round-2 conclusions are those of the arbitrators. The last two columns describe the test datasets (WHO, N = 431).

WHO-original and WHO-CRES datasets

In 1997, WHO released a teaching dataset with 222 CXRs to support the standardized interpretation of radiological pneumonia in children [3]. Each image was read by 20 radiologists and clinicians and labeled as PEP, other infiltrates, normal or uninterpretable. In the released dataset, 124 images were labeled as high agreement images since more than two-thirds of readers agreed on a single conclusion for these images. We refer to the remaining images in the dataset as low agreement images. Two decades later, the WHO initiated the Chest Radiography in Epidemiological Studies (CRES) project to further clarify their classification methodology, with the objective of improving inter-observer agreement for each of the 3 endpoints [19]. The published WHO-CRES dataset contains only high agreement images (N = 209). Of these images, 176 were contributed by PERCH, including 14 uninterpretable images.

Training procedures

We first pretrained the model, initialized with ImageNet weights, on CheXpert to classify CXRs as positive or negative for 14 radiological findings. The model was then fine-tuned on PERCH, initialized with the CheXpert weights, to detect PEP, other infiltrates, and normal findings (S1 Fig). Initializing training with pretrained weights allows a model to achieve high performance on a smaller dataset by leveraging knowledge from models trained on a larger dataset [26].

In several previous studies using CXR images, DenseNet121 was selected as the convolutional neural network architecture [15-17]. Following the approach of these studies, we also tried multiple network architectures with top-ranked performance on image classification, including ResNet50 [27], InceptionV3 [28], VGG16 [29], NASNetMobile [30], NASNetLarge, Xception [31], DenseNet121 [32], and InceptionResNetV2 [33]. DenseNet121 and NASNetLarge produced the highest overall AUC scores across the 3 WHO-defined categories, but DenseNet121 performed slightly better on PEP, and its relatively smaller size posed less risk of overfitting on a small dataset. DenseNet121 was therefore the architecture chosen for this study.

For pretraining on CheXpert, we followed the same process used in Irvin et al. [16]: images were downscaled to 320 × 320 pixels, normalized using the ImageNet training set statistics, and augmented with random horizontal flipping (50% probability) and affine transformations (rotation and shear by up to 10 degrees, translation by up to 10%). The Adam optimizer with default β parameters (β1 = 0.9, β2 = 0.999) was used. We fixed the learning rate at 1 × 10−4 throughout training, with a batch size of 16 images. Class imbalance was handled by reweighting the binary cross-entropy loss of each class by its inverse class frequency. The models were trained for 3 epochs, with a model checkpoint saved and evaluated on the default validation set after every epoch. In the CheXpert study, the researchers tried multiple approaches to handle uncertain labels; we opted for a simple binary mapping and coded all "uncertain" labels as 0, noting that variations in the coding of uncertain labels during pretraining had minimal impact on the transfer learning results.

Prior to fine-tuning on PERCH, images were manually cropped to focus on the pulmonary area and exclude body parts such as the abdomen, head, legs, and arms (S2 Fig), to prevent the model from being adversely impacted by learning irrelevant features [34]. The manual cropping process also generates masks that can be used to train an image segmentation model to automate the cropping in the future (S3 Fig). The model was initialized with the weights of the best pretrained model on CheXpert. We used 10-fold cross-validation to reduce potential bias in model evaluation: the dataset was split into ten non-overlapping folds, and the model was trained ten times, each time holding out a different fold as the validation set. Areas under the receiver operating characteristic curve (AUCs) were calculated on the validation sets and averaged across the ten folds; 95% confidence intervals for the AUCs were calculated using the non-parametric DeLong method [35]. During fine-tuning, we reduced the image size to 224 × 224 pixels, which yielded slightly better performance than 320 × 320, and kept the same augmentation as in pretraining.
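To make the preprocessing, augmentation, and class reweighting concrete, the following is a minimal sketch assuming a PyTorch/torchvision implementation; the study does not publish its code, so the exact calls here are our assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Sketch of the pretraining input pipeline and class-reweighted loss described
# above (torchvision transforms assumed, not the authors' released code).
train_tfms = transforms.Compose([
    transforms.Resize((320, 320)),                       # downscale to 320 x 320
    transforms.RandomHorizontalFlip(p=0.5),              # 50% random flips
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), shear=10),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def inverse_frequency_weights(labels: torch.Tensor) -> torch.Tensor:
    """labels: (N, 14) binary matrix; returns one weight per class."""
    freq = labels.float().mean(dim=0).clamp(min=1e-6)    # per-class frequency
    return 1.0 / freq

def weighted_bce(probs: torch.Tensor, targets: torch.Tensor,
                 class_weights: torch.Tensor) -> torch.Tensor:
    # Per-element binary cross entropy, reweighted per class, then averaged.
    loss = F.binary_cross_entropy(probs, targets, reduction="none")  # (N, 14)
    return (loss * class_weights).mean()
```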
We trained the networks with a batch size of 32 and an initial learning rate of 1 × 10−4, which was reduced by a factor of 10 each time the loss plateaued on the validation set. Early stopping was performed by saving the model after every epoch and choosing the saved model with the lowest validation loss. Although freezing lower layers in transfer learning has previously yielded better results [36], we found that freezing any of the lower layers gave sub-optimal results compared to updating the entire model. For both pretraining and fine-tuning, the parameters of the network were initialized with the parameters of the pretrained network, except for the final fully connected layer, which was replaced with a new fully connected layer whose output dimension equals the number of outcome classes. The weights of the replacement layer were initialized with the Glorot/Xavier uniform initializer, with bias terms set to zero. The outputs were then passed through a sigmoid function to produce predicted probabilities of the presence of each outcome class.
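A minimal sketch of this fine-tuning setup, again assuming PyTorch (the paper does not name its framework); the training and validation helpers here are hypothetical:

```python
import torch
from torchvision import models

# Sketch of the fine-tuning head, optimizer, and schedule described above.
model = models.densenet121(weights=None)
# state = torch.load("chexpert_pretrained.pt")   # assumed CheXpert checkpoint
# model.load_state_dict(state, strict=False)
model.classifier = torch.nn.Linear(1024, 3)      # PEP, other infiltrates, normal
torch.nn.init.xavier_uniform_(model.classifier.weight)  # Glorot/Xavier uniform
torch.nn.init.zeros_(model.classifier.bias)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1)           # divide LR by 10 on plateau

best_val = float("inf")
for epoch in range(20):                          # loss plateaued after 3-5 epochs
    train_one_epoch(model, optimizer)            # hypothetical training helper
    val_loss = validation_loss(model)            # hypothetical evaluation helper
    scheduler.step(val_loss)
    if val_loss < best_val:                      # early stopping via checkpoints
        best_val = val_loss
        torch.save(model.state_dict(), "best_model.pt")

# At inference, predicted probabilities come from a sigmoid over the 3 outputs:
probs = torch.sigmoid(model(torch.randn(1, 3, 224, 224)))
```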

Model evaluation

Uninterpretable images were removed from all analyses. AUCs were calculated separately for high- and low-agreement WHO test set images. We also compared model confidence between high- and low-agreement images for each outcome, using predicted probabilities as a measure of model confidence.

In addition to the WHO test sets, we also evaluated the PERCH model on its own hold-out set. The hold-out set included concordant and discordant images, analogous to the high- and low-agreement images in the WHO datasets. To create the hold-out set, we selected 150 images (50 per class) with concordant interpretations at either the primary or arbitration readings and another 150 images with discordant interpretations during arbitration. The remaining 3,460 images were used for training.

Discordant images were the most difficult for pediatricians and radiologists to interpret and required a consensus discussion to assign the final conclusion. To investigate how the model classified these hard-to-interpret images, we used the same 150 discordant PERCH images described above as a hold-out test set and retrained the model on the remaining images (n = 3,610). Since these discordant images required arbitration, a total of 4 readings were available for each image (two reviewers and two arbitrators). For each of the 4 readings, we computed its sensitivity (recall), specificity, and positive predictive value (precision) against the final conclusion. We plotted the operating points of these readings along with the model's receiver operating characteristic (ROC) curve and Precision-Recall (PR) curve to compare the model's results to those of the readers.

We also trained a model on all PERCH images with concordant interpretations at either the primary or arbitration readings (n = 2,698) and used it to predict conclusions for all images that received a discordant interpretation during the arbitration reading (n = 1,062). Given the disagreement between highly trained arbitrators, the certainty of the conclusion for these images is assumed to be relatively low; it is of interest to see what conclusions a neural network would assign to these hard-to-interpret images after being trained on images for which the certainty of the conclusion is relatively high.

We assessed whether the model could correctly highlight the diseased area on a CXR image using guided Gradient-weighted Class Activation Mappings (Grad-CAMs). Grad-CAMs produce both a low-resolution highlight (i.e., heat map) of the regions important to a class prediction and a high-resolution class-discriminative visualization [37]. Finally, a small ablation experiment was conducted to evaluate the extent to which model performance was affected by image cropping: we trained the model on uncropped images from the PERCH dataset and tested it on the WHO datasets.
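As a rough illustration of the Grad-CAM computation referenced above, the following is a simplified sketch assuming a PyTorch DenseNet121 with a 3-class head (plain Grad-CAM only; the study used guided Grad-CAM):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Simplified Grad-CAM sketch; the hooks capture the last feature maps and
# their gradients with respect to a chosen class logit.
model = models.densenet121(weights=None)
model.classifier = torch.nn.Linear(1024, 3)       # PEP, other infiltrates, normal
model.eval()

store = {}
def save_activations(module, inputs, output):
    store["act"] = output
    output.register_hook(lambda grad: store.update(grad=grad))

model.features.register_forward_hook(save_activations)

def grad_cam(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """image: (1, 3, 224, 224); returns a (224, 224) heat map in [0, 1]."""
    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()
    act, grad = store["act"], store["grad"]       # (1, 1024, 7, 7)
    weights = grad.mean(dim=(2, 3), keepdim=True) # pooled gradient per channel
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8)).squeeze()

heatmap = grad_cam(torch.randn(1, 3, 224, 224), class_idx=0)  # class 0 = PEP
```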

Results

Table 2 shows the validation and test AUCs achieved by the PERCH model on the PERCH validation, PERCH test, and WHO test datasets. The validation AUCs were 0.928 for PEP, 0.780 for other infiltrates and 0.897 for normal. The model achieved better performance on the external test set (WHO-Original and WHO-CRES images); the test AUCs increased to 0.977 for PEP, 0.891 for other infiltrates and 0.951 for normal.
Table 2

AUROC scores (averaged across the 10 folds) on the validation set and the WHO test sets, by level of inter-observer agreement of the image labels.

Category | PERCH Validation* (n = 346) | PERCH Test: Concordant (n = 150) | PERCH Test: Discordant (n = 150) | WHO: All (N = 410) | WHO-Original: High (n = 120) | WHO-Original: Low (n = 88) | WHO-Original: High + Low (n = 208) | WHO-CRES: High (n = 203)
Primary Endpoint Pneumonia | 0.928 (0.919, 0.938) | 0.944 (0.930, 0.957) | 0.859 (0.837, 0.879) | 0.977 (0.974, 0.981) | 0.993 (0.990, 0.996) | 0.845 (0.817, 0.873) | 0.952 (0.943, 0.960) | 0.996 (0.995, 0.998)
Other Infiltrates | 0.780 (0.764, 0.797) | 0.810 (0.788, 0.832) | 0.741 (0.715, 0.766) | 0.891 (0.879, 0.903) | 0.969 (0.957, 0.980) | 0.726 (0.692, 0.759) | 0.856 (0.838, 0.875) | 0.935 (0.919, 0.950)
Normal | 0.897 (0.887, 0.907) | 0.896 (0.880, 0.911) | 0.788 (0.765, 0.812) | 0.951 (0.945, 0.957) | 0.995 (0.992, 0.997) | 0.749 (0.714, 0.784) | 0.921 (0.909, 0.932) | 0.974 (0.968, 0.980)

"High" and "Low" denote high- and low-agreement image labels; "All" pools the WHO test images.

* Average sample size of the 10-fold validation set.

Further analysis showed that the increase in model performance was related to the larger proportion (78.78%) of high-agreement images in the WHO dataset. The PERCH model predicted PEP almost perfectly on WHO high-agreement images (AUC = 0.993 and 0.996 for WHO-Original and WHO-CRES images, respectively), but performance dropped by 14% on WHO low-agreement images (AUC = 0.845 for WHO-Original images) (Table 2). A similar decline in performance was observed on discordant images in the PERCH dataset. The predicted probabilities of classification were higher for high- versus low-agreement images across all classes, with higher predicted probabilities for PEP compared to other classes (Fig 1).
Fig 1

Comparison of model predicted probabilities and 95% confidence intervals by endpoint and level of human reader agreement.

To characterize model performance in terms of discrete accuracy metrics, we selected the operating threshold that maximizes Youden's index [38]. Among high-agreement WHO images, the test AUCs correspond to sensitivities of 95.3%, 96.7%, and 91.1% and specificities of 95.5%, 96.8%, and 91.8% for PEP, normal, and other infiltrates, respectively. Among low-agreement WHO images, sensitivity dropped to 76.7%, 69.7%, and 71.9%, and specificity dropped to 76.9%, 70.0%, and 66.4% for the three outcomes, respectively. Specificity is higher than sensitivity for all outcomes, reflecting the intended specificity of the WHO definition, a property that is important for estimating vaccine efficacy and impact.

Fig 2 shows the comparison between the model's predictions and the 4 readings against the final conclusion determined during the consensus discussion for discordant images. ROC and precision-recall (PR) curves from 10-fold validation are displayed, with the average curves highlighted in blue. Four operating points are also displayed, representing the conclusions given by the 4 readings. Averaged across the ten folds, the model's performance on all three classes was better than the 4 readings provided by pediatricians and radiologists.
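A minimal sketch of the Youden's-index threshold selection described above, using scikit-learn on toy data:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Pick the ROC operating point maximizing Youden's J = sensitivity + specificity - 1.
def youden_threshold(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                          # equivalent to sens + spec - 1
    best = int(np.argmax(j))
    return thresholds[best], tpr[best], 1.0 - fpr[best]  # cut-off, sens, spec

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                 # toy labels
y_score = np.array([0.10, 0.40, 0.80, 0.65, 0.90, 0.55, 0.70, 0.30])
cutoff, sensitivity, specificity = youden_threshold(y_true, y_score)
print(cutoff, sensitivity, specificity)
```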
Fig 2

Comparison of model performance to radiologists and pediatricians on discordant images.

The four operating points represent the conclusions given by the 4 readings. The lines represent the model's performance, with the average across the 10 validation folds in blue. The top row shows the receiver operating characteristic (ROC) curves and the bottom row shows the Precision-Recall (PR) curves.

Fig 3a and 3b show a side-by-side comparison of the WHO annotation and the localized regions that the model predicted to be most indicative of an outcome class. The model identified the area of infection for PEP with high confidence (predicted probability p = 0.980). The identification of other infiltrates was relatively less accurate, given the dispersed nature of the infected area (p = 0.917).
Fig 3

(a). Activation map of PEP. Frontal radiograph of the chest in a child with WHO-defined primary endpoint pneumonia; the child is rotated to the right, with dense opacity in the right upper lobe; the model localizes the consolidation with a predicted probability p = 0.980; the discriminative visualization shows fine-grained features important to the predicted class. (b). Activation map of other infiltrates. Frontal radiograph of the chest presenting patchy opacity consistent with non-endpoint infiltrate. The model correctly classifies the image as infiltrate with a probability of p = 0.917 and localizes the areas of opacity. The class-discriminative visualization highlights important class features.


Additional results

The ablation experiment showed that image cropping improved model performance by 1–3% in AUC (S1 Table). Table 3 shows the agreement between the model's predictions and the final conclusions assigned to images that received a discordant interpretation during arbitration. The model's prediction agreed with that of the pediatricians and radiologists on 60% of the discordant images (Table 3). Agreement was highest for PEP and lowest for other infiltrates. Predicted probabilities were higher when the model's conclusion agreed with the final conclusion for PEP (85.6% vs 70.0%, p<0.001) and normal (76.9% vs 69.9%, p<0.001), but not for other infiltrates (60.6% vs 60.5%). No association was found between agreement status and the other variables.
Table 3

Final conclusion and model prediction on discordant CXR images (N = 1,062) by key features.

Feature | Model = Final conclusion (n = 637) | Model ≠ Final conclusion (n = 425)
Final conclusion (n, %)*
  Primary Endpoint Pneumonia (PEP) | 257 (40.4%) | 132 (31.1%)
  Other Infiltrates (OI) | 149 (23.4%) | 207 (48.7%)
  Normal | 231 (36.3%) | 86 (20.2%)
  CXR+ (PEP or OI)** | 558/789 (70.7%) | 187/273 (68.5%)
Predicted probability (mean, SD)
  PEP* | 85.6% (0.16) | 70.0% (0.20)
  OI | 60.6% (0.13) | 60.5% (0.12)
  Normal* | 76.9% (0.15) | 69.9% (0.16)
Gender (n, %)
  Male | 351 (55.1%) | 233 (54.8%)
  Female | 286 (44.9%) | 192 (45.2%)
Age in months (mean, SD) | 10.50 (10.69) | 11.19 (10.96)
Country (n, %)
  Bangladesh | 70 (11.0%) | 52 (12.2%)
  Gambia | 93 (14.6%) | 70 (16.5%)
  Kenya | 82 (12.9%) | 68 (16.0%)
  Mali | 81 (12.7%) | 48 (11.3%)
  South Africa | 186 (29.2%) | 105 (24.7%)
  Thailand | 33 (5.2%) | 27 (6.4%)
  Zambia | 92 (14.4%) | 55 (12.9%)

* p<0.0001 for Pearson’s chi-squared test or two-proportion z-test.

** Differences between PEP and OI are ignored so a greater number of images have concordant results.

In S2 Table, we present comparison matrices in which diagonal cells indicate agreement between the model and the final conclusion, and off-diagonal cells indicate disagreement. The model and the readers were both more likely to classify normal images as other infiltrates than as PEP. However, the model was more likely to classify other infiltrates as normal than as PEP, while readers were more likely to classify other infiltrates as PEP than as normal. When the model's prediction differed from the final conclusion, 22% of its predictions agreed with the conclusion that received a majority of votes, i.e., where one reviewer agrees with one arbitrator and the other reviewer disagrees with the other arbitrator prior to the final consensus discussion (S4 Fig).
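For readers who want to reproduce this kind of comparison matrix, a toy sketch follows (the label lists are illustrative stand-ins, not the study's data):

```python
from sklearn.metrics import confusion_matrix

# S2 Table-style comparison matrix: rows are final conclusions, columns are
# model predictions (argmax over the three class probabilities).
classes = ["PEP", "other infiltrates", "normal"]
final_conclusions = ["PEP", "normal", "other infiltrates", "PEP", "other infiltrates"]
model_predictions = ["PEP", "normal", "normal", "PEP", "PEP"]

cm = confusion_matrix(final_conclusions, model_predictions, labels=classes)
print(cm)  # diagonal = agreement; off-diagonals show substitution patterns
```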

Discussion

This is one of the first studies to automate the detection of WHO-defined radiological pneumonia using deep learning. The neural network's performance is better than that of a previous study using classical image texture analysis on a subset of PERCH images; the increase in performance could be due to either the modeling approach or the increase in sample size. The model's performance is also comparable to that of human readers on all 3 WHO-defined pneumonia categories.

We also noticed a boost in model performance on the external test sets and traced the cause to the level of inter-rater agreement in the human-assigned image labels of those test sets. Model performance was higher on the WHO test images than on the validation or test images from PERCH. This is atypical, as one would expect an independent test sample to show poorer classification performance than validation or test samples drawn from the training dataset. We concluded that the improvement in AUCs on the WHO datasets, ranging from 5–11% across the 3 conclusions, was due to a larger proportion of images having high inter-observer agreement. This also explains why model performance was higher on PERCH concordant than discordant images.

Previous studies have shown that inter-observer agreement is highest for PEP and lowest for other infiltrates, even when a rigorous standardization process is implemented [3, 12]. This also appears to be the case in our study, where the model achieved its best classification performance for PEP. In contrast, a finding of "other infiltrates" is known to have the lowest inter-observer agreement among human readers and was also the most difficult for the model to identify. Compared to PEP, the model generated lower predicted probabilities for classifying other infiltrates even when its conclusions agreed with the readers'. We also observed that, when it disagreed with a conclusion of other infiltrates, the model was more likely to classify images as normal than as PEP, while the readers were more likely to classify these images as PEP. It is not clear whether the observed difference reflects an underlying bias in the labeling procedure or in the model's predictions. However, this tendency may lower the potential of the model to overestimate vaccine impact or pneumonia disease burden when applied to new settings.

To better understand the predictive qualities of the model, two radiologists reviewed a selection of the WHO images that the model misclassified for the PEP endpoint. For PEP images that the model falsely predicted as negative, the CXR films were generally more penetrated, with more photons hitting the film. The areas of consolidation that were missed also tended to be adjacent to the cardiac silhouette or scapula. One of the false negatives may be a mislabeled image, as neither radiologist could identify any consolidation. On images with false-positive predictions, the CXR films appeared to be less penetrated. This is consistent with human intuition, as under-penetration may obscure the lung parenchyma, which happens more focally with consolidation. These findings underscore the importance of quality control, including technical processes and patient positioning, in research studies using pediatric CXRs.

Limitations

One potential limitation of training a large neural network on a small dataset is overfitting. In our case, the validation loss on average plateaued after 3–5 epochs of fine-tuning, and transfer learning allows the model to converge quickly without excessive training [26]. Although the model achieved high performance on the WHO datasets, additional validation may be needed when applying it to new settings, given that the WHO dataset is a teaching set consisting mostly of high-agreement images. The model was developed for epidemiologic purposes and is not intended to be used as a diagnostic tool in clinical practice. Because the model was trained on images from children hospitalized for WHO-defined severe and very severe pneumonia, its application should be restricted to images from children with a diagnosis of pneumonia, to avoid misclassifying non-pneumonia cases as PEP.

Conclusion

A deep learning model can identify primary endpoint pneumonia in children at a performance comparable to human readers. It can be implemented without manual feature engineering and achieves better performance than classical image analysis. This study lays a strong foundation for the potential inclusion of computer-aided pediatric CXR readings in vaccine trials and epidemiology studies.

Supporting information

S1 Fig. Transfer learning illustration. (PNG)

S2 Fig. Sample images and cropping. (PNG)

S3 Fig. U-Net for auto-cropping. (PNG)

S4 Fig. Flow of model predictions when the model disagrees with the final conclusion. (PNG)

S1 Table. AUROC scores of models trained on raw PERCH images and tested on WHO. (DOCX)

S2 Table. Matrix of model predictions by final and majority conclusions. (DOCX)
(DOCX) Click here for additional data file. 29 Mar 2021 PONE-D-21-05523 Deep Learning for Classification of Pediatric Chest Radiographs by WHO's Standardized Methodology PLOS ONE Dear Dr. Chen, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by May 13 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Khanh N.Q. Le Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ 3. 
Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information. If you are reporting a retrospective study of medical records or archived samples, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information. 4. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. 5. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. In your revised cover letter, please address the following prompts: a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. We will update your Data Availability statement on your behalf to reflect the information you provide. 6. Thank you for providing the following Funding Statement: [Yiyun Chen, Tanaz Petigara, Wanmei Ou, Craig S. Robert, Gregory V. Goldmacher were employees of  Merck & Co., Inc. during the conduct of the study. Maria Deloria Knoll received grant from Merck & Co. to cover expenses related to preparing the PERCH dataset shared for use in the study, and for consulting on the manuscripts.]. We note that one or more of the authors is affiliated with the funding organization, indicating the funder may have had some role in the design, data collection, analysis or preparation of your manuscript for publication; in other words, the funder played an indirect role through the participation of the co-authors. 
If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study in the Author Contributions section of the online submission form. Please make any necessary amendments directly within this section of the online submission form.  Please also update your Funding Statement to include the following statement: “The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.” If the funding organization did have an additional role, please state and explain that role within your Funding Statement. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc. Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and  there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared. Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 3. 
Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes Reviewer #4: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors have used a DenseNet-121 model pretrained on the CheXpert data and fine-tuned it to predict CXRs as showing PEP, other infiltrates, or normal lungs. They have experimented with in-house and WHO test sets. They conclude that the model delivers superior performance with concordant images from the WHO test set compared to discordant images in the PERCH test set. What is the objective of the study? Are the authors trying to propose a novel architecture that can generalize to real-world PEP detection? The authors mention “The model is trained on images from children hospitalized for WHO-defined severe and very severe pneumonia and its application should be restricted to images from children with a diagnosis of pneumonia to avoid misclassification of non-pneumonia cases as PEP.” What is the purpose of using a trained model to classify a CXR as showing PEP when the patient is already diagnosed to have PEP? The model indeed needs to function as triage in resource-constrained settings. What is the effect of increasing/decreasing spatial resolution in the proposed approach? This reviewer is not clear why the authors resized the images to 320 x 320 pixels during the pretraining stage and then 224 x 224 in the fine-tuning stage. A detailed discussion on using different spatial resolutions and their impact on model performance shall be provided. A statistical significance analysis is required in this regard. This reviewer is unclear why the authors worked with only eight ImageNet models. What is the performance obtained by a baseline, sequential CNN? Deeper models are not always better for medical image analysis tasks. There is no comprehensive analysis provided while pretraining and finetuning with different models. How did the authors optimize the architecture and hyperparameters of these models? Did they truncate these models at an optimal layer to suit the classification problem under study? These details are not discussed. 
Also, the authors didn’t discuss how they managed the class-imbalance problem during the pretraining stage. At present there are several studies available that discuss CXR modality-specific pretraining in the context of classifying lung diseases, a few are mention below: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8621525 https://pubmed.ncbi.nlm.nih.gov/33180877/ The authors need to perform a comprehensive literature survey in this regard and cite these studies. This reviewer does not think reference [27] suits here and should be replaced with some of the aforementioned references. This reviewer is not convinced about the AUC curves obtained with various test sets. The default classification threshold used by these models is 0.5. What are the sensitivity and specificity obtained at varying operating thresholds? How did the authors make a selection of the sensitivity and specificity at an optimum threshold that offers a good trade-off? This reviewer is not convinced with the manual cropping of the lung regions from the CXR images. It is recommended to automate the process using U-Nets or other image segmentation models as that discussed in the SOTA like https://pubmed.ncbi.nlm.nih.gov/33180877/ and discuss the model performance in terms of segmentation evaluation metrics and see if the differences are statistically significantly different. The authors mention in the discussion that “Results demonstrate that the deep learning approach can achieve better classification performance for PEP (by about 8%) than image texture analysis with no need for feature engineering. This is not agreeable. They have not performed any kind of handcrafted texture analysis in this study. The literature might not have used the same train and test sets that the authors propose in this study. It is recommended to perform a texture analysis with varying texture descriptors and then discuss how the performance compares to the deep learning methods. The authors mention in the conclusion that “It can be implemented without manual feature engineering and achieves better performance than classical image analysis”. This is not agreeable for the reason aforementioned. The authors shall mention the computational resources and deep learning frameworks used in this study. Reviewer #2: It is a common practice to use transfer learning in image analysis. The paper presents a detailed methodology in classifying radiographs using transfer learning. Please see the following comments: 1. It would be nice if the authors could provide more technical details in pre-training using ImageNet initialization. This is because the number of classes is 1000, however, the pre-training only requires 14. If this is true, does that mean the weights in the final linear projection layer are re-initialized, instead of using the ImageNet initializations? 2. Similar to the point above, the S1 figure should have more details to precisely describe the pretraining process. 3. It is understandable that the focus of the paper is to present a practical way in radiograph classification using transfer learning. It would be nice if the authors could provide some future work. Reviewer #3: Some Recommendations: Highlight the originality of the paper as model that have been used is already available. It is not clear whether cropping of CXRs is done manually or automated. Specify it.(line no 157) Discuss the deep-learning model that is used in the study. Also add the results obtained on various deep-learning models that you have compared. 
Reviewer #4: The manuscript is good overall. This study based on deep learning method lays a strong foundation for the potential inclusion of computer-aided pediatric CXR readings in vaccine trials and epidemiology studies. There are some spelling mistakes need to be corrected,such as "trainig" in line 140. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: Yes: Dr. Rahul Hooda Reviewer #4: Yes: Rongguo Zhang [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 12 May 2021 Reviewer #1: The authors have used a DenseNet-121 model pretrained on the CheXpert data and fine-tuned it to predict CXRs as showing PEP, other infiltrates, or normal lungs. They have experimented with in-house and WHO test sets. They conclude that the model delivers superior performance with concordant images from the WHO test set compared to discordant images in the PERCH test set. Response: We would like to thank the reviewer for taking the time to give the paper a careful read. Reviewer #1: What is the objective of the study? Are the authors trying to propose a novel architecture that can generalize to real-world PEP detection? The authors mention “The model is trained on images from children hospitalized for WHO-defined severe and very severe pneumonia and its application should be restricted to images from children with a diagnosis of pneumonia to avoid misclassification of non-pneumonia cases as PEP.” What is the purpose of using a trained model to classify a CXR as showing PEP when the patient is already diagnosed to have PEP? The model indeed needs to function as triage in resource-constrained settings. Response: WHO-defined clinical pneumonia is a general clinical definition of symptoms (i.e. cough, fever etc.) which does not indicate the etiology of the infection (i.e. bacterial vs viral infection). To support classification as an endpoint in vaccine efficacy and epidemiology studies the WHO convened an expert panel to define specific chest x-ray (CXR) findings specifically associated with bacterial pneumonia in children=. PEP, per this definition, is a specific type of pneumonia, more likely to be bacterial and with high relevance to pneumococcal infection. 
The major characteristic of PEP is the manifestation of consolidation on chest x-ray images. Consolidation refers to the alveolar airspaces being filled with fluid (exudate/transudate/blood), cells (inflammatory), tissue, or other material. In practice, implementation of PEP in bacterial pneumonia research studies has often involved a team of radiologists, with multiple readings per child and a consensus process to interpret the findings. By using the data and images of such a study, (e.g., PERCH), we are training the computer to provide this specific reading, which can enable greater scalability and consistency to these types of studies without the teams of radiologists to adjudicate. We did not describe this background information in too much detail given concerns of conciseness. We have cited all the relevant studies that can provide this background knowledge for readers who are interested. Meanwhile to address the reviewer’s comment, we have now added between line 51-53, a description of consolidation and mild interstitial changes to clarify where the concept of PEP came from. We have also added more details to help establish the connection between pneumococcal pneumonia, bacterial infection, and lobar consolidation (i.e. PEP). Lastly, we reorganized the introduction in order to better delineate the rationale behind the development of a deep learning model for PEP. The ultimate objective of the study is to automate CXR reading for PEP to incorporate in large vaccine efficacy and epidemiology studies. Reviewer #1: What is the effect of increasing/decreasing spatial resolution in the proposed approach? This reviewer is not clear why the authors resized the images to 320 x 320 pixels during the pretraining stage and then 224 x 224 in the fine-tuning stage. A detailed discussion on using different spatial resolutions and their impact on model performance shall be provided. A statistical significance analysis is required in this regard. Response: We have now added a description of how we select the hyperparameters under the training procedures section. Image size or “spatial resolution”, is one of the many hyperparameters that needs to be tuned during training. In machine learning/deep learning, hyperparameter tuning does not necessarily require a statistical significance test. Grid search or random search is typically used over a range of plausible values of different hyperparameters, and the combination that yields the lowest loss or highest accuracy will be selected for full model training. Example discussions and implementations can be found at: https://stats.stackexchange.com/questions/321805/statistical-significance-of-changing-a-hyperparameter https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/. This can also been seen in the IEEE paper that the reviewer mentioned later in one of the comments (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8621525). In that study, the researchers simply stated “The creators of ResNet recommend converting images to square sized images of either 224 by 224 pixels or 299 by 299 pixels before feeding them to ResNet. We opted for the higher resolution 299 by 299 for higher accuracy” as the rationale for choosing 299 pixels. In this study, the hyperparameters of interest include image size, learning rate, alpha and beta coefficient of adam optimizer, level of image augmentation, batch size, etc. 
Because the pretraining is a replication of the CheXpert study, we kept the same “special resolution” as the original study given that other researchers have already narrow down to the best resolution (320 by 320 pixels) for the proposed architecture. For finetuning on the PERCH dataset, we found 224 by 224 to yield a slightly higher AUC score than the 320 by 320 pixels. This detail has been added to the manuscript between line 180-181. Reviewer #1: This reviewer is unclear why the authors worked with only eight ImageNet models. What is the performance obtained by a baseline, sequential CNN? Response: These 8 models cover most of the available architectures that have shown top notch results on image classification. Each of the eight model architectures have shown by previous studies to have significantly outperformed the previous generation of models. The eight models we tried were the same eight models applied by other researchers (https://arxiv.org/pdf/1901.07031.pdf), where the researchers found that DenseNet121 works best for chest radiograph images. Other studies on CXR images have also done similar analysis and found the DenseNet architecture to work better on CXR images than other available ones. As with the previous work, we found the DenseNet model to have high accuracy with training. This detail is provided between line 148-150 with all the relevant references. Reviewer #1: Deeper models are not always better for medical image analysis tasks. There is no comprehensive analysis provided while pretraining and finetuning with different models. How did the authors optimize the architecture and hyperparameters of these models? Did they truncate these models at an optimal layer to suit the classification problem under study? These details are not discussed. Response: We agree with the reviewer that deeper model is indeed not necessarily better, especially on a relatively smaller dataset. After finding the optimal architecture (i.e. DenseNet 121) for CXR images, by following the same procedure in previous studies, we had attempted freezing the weights at the lower layers of the model, in order to reduce the number of parameters need to be trained during back propagation. This has similar effect as training on a shallower model with smaller number of parameters. But it turned out that training on the entire model still yields better performance. We tried shallower models, such as VGG16, but its accuracy was over 10% lower than DenseNet121. We have added these details in the paper, along with a comparison to the IEEE article finding that the reviewer mentioned in one of the later comments between line 185-187. On the other hand, the pretraining procedure is usually not extensively described because it is typically a replication of previous studies, as we mentioned in the paper. Meanwhile, we have reorganized the paragraph to clarify that that the pre-training approach was a replication of the previous study from which we borrow the dataset. We have cited all the relevant papers in the manuscript for readers who are interested in more details. Meanwhile we also added more details on how we handled uncertain labels in the pretraining dataset between line 164-168, as this detail is unique to the current study, and is not directly available in the reference article. Reviewer #1: Also, the authors didn’t discuss how they managed the class-imbalance problem during the pretraining stage. 
Reviewer #1: Also, the authors did not discuss how they managed the class-imbalance problem during the pretraining stage.

Response: Thank you for pointing this out; we have added a description of how the imbalance problem is handled between lines 161-162.

Reviewer #1: At present there are several studies available that discuss CXR modality-specific pretraining in the context of classifying lung diseases; a few are mentioned below: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8621525 and https://pubmed.ncbi.nlm.nih.gov/33180877/. The authors need to perform a comprehensive literature survey in this regard and cite these studies. This reviewer does not think reference [27] suits here and it should be replaced with some of the aforementioned references.

Response: Reference 27 is a textbook often cited as a source of the transfer learning technique, which, as the reviewer has appropriately pointed out, may not be relevant here, and hence we have removed it. We also agree with the reviewer that there are several, in fact many, papers available that used transfer learning on CXR images. A majority of these studies are about tuberculosis, along with a surge of COVID-19 deep learning papers over the past year. Transfer learning is almost a standard approach for deep learning in image classification, as reviewer 2 also mentions in the opening comment. We did not cite all those studies because they are in different disease areas, none of them is the original study that proposed transfer learning in image classification, and none of them was the first to use transfer learning on CXRs. Our goal is to focus the literature review on studies with direct relevance to the current study, including all the PCV vaccine disease burden studies using CXR findings as the outcome, along with deep learning method papers, including the CheXpert paper (https://arxiv.org/abs/1901.07031) and its prior versions, the CheXNet paper (https://arxiv.org/pdf/1711.05225.pdf) and the CheXNeXt paper (https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002686), from which we either borrowed the method or used the dataset. That said, we do see the value of citing the two papers the reviewer mentions from a methodological perspective, so we have cited them in the relevant parts of the revised manuscript on lines 186 and 171, along with a relevant description and comparison to the current study.

Reviewer #1: This reviewer is not convinced about the AUC curves obtained with various test sets. The default classification threshold used by these models is 0.5. What are the sensitivity and specificity obtained at varying operating thresholds? How did the authors make a selection of the sensitivity and specificity at an optimum threshold that offers a good trade-off?

Response: We understand the reviewer's desire to know the sensitivity and specificity, and we have now included them in the results section. We did not report this information originally because the goal of the study is not label generation but model validation. Sensitivity and specificity are threshold-dependent accuracy scores, whereas the AUC assesses the discrimination of a continuous prediction without forcing a choice of operating point. The AUC is also widely used in the literature for validating deep learning models and has been the default metric in the previous literature we cite. However, we do understand that sensitivity and specificity are more relevant in a medical setting and are practical information for the ultimate objective of this work, namely application to large-scale research studies of childhood pneumonia. These metrics have now been added, along with a relevant description, between lines 246-253 of the manuscript.
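As background for this trade-off, the sketch below shows how sensitivity and specificity move with the operating threshold on toy data, using Youden's J statistic as one common selection rule; it illustrates the mechanics rather than the exact procedure reported in the manuscript:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy stand-ins for model probabilities and reference labels (1 = PEP).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_prob = np.array([0.10, 0.40, 0.80, 0.90, 0.30, 0.60, 0.20, 0.70, 0.95, 0.15])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
sensitivity, specificity = tpr, 1 - fpr

# Youden's J (sensitivity + specificity - 1) is one common rule for picking
# an operating threshold that balances the two.
j = sensitivity + specificity - 1
best = int(np.argmax(j))

print(f"AUC = {roc_auc_score(y_true, y_prob):.3f}")
print(f"threshold = {thresholds[best]:.2f}: "
      f"sensitivity = {sensitivity[best]:.2f}, "
      f"specificity = {specificity[best]:.2f}")
```

Every point on the ROC curve corresponds to one such threshold, which is why the AUC summarizes performance across all operating points at once.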
Reviewer #1: This reviewer is not convinced with the manual cropping of the lung regions from the CXR images. It is recommended to automate the process using U-Nets or other image segmentation models, as discussed in the state of the art (e.g., https://pubmed.ncbi.nlm.nih.gov/33180877/), to discuss the model performance in terms of segmentation evaluation metrics, and to see if the differences are statistically significant.

Response: Automation is our goal as well, and indeed a U-Net can help to automate segmentation. However, training a U-Net is a supervised learning approach and requires ground-truth bounding boxes. We are not aware of any open-source U-Net available to crop lung regions on pediatric CXRs, which is why we manually cropped the images and created our own ground-truth masks and bounding boxes. We have since used these masks to train a U-Net to automatically identify the lung area, so in the future this process can be automated. In the current study, however, given that we already have the ground-truth bounding boxes, we do not see the need to apply the U-Net to regenerate them. We have added a description of the U-Net in the supplementary material (see S3 Fig), along with a description between lines 171-173 of the manuscript. We originally did not mention this detail because it is more relevant to future studies than to the current one, but we are glad the reviewer pointed this out, and we see the value of adding it to the paper for future reference.
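Once a segmentation model produces a binary lung mask, cropping reduces to taking the bounding box of the mask. A minimal sketch of that step (the helper name and margin value are illustrative, not from the manuscript):

```python
import numpy as np

def crop_to_mask(image, mask, margin=10):
    """Crop a radiograph to the bounding box of a binary lung mask
    (e.g., the output of a segmentation model), with a safety margin."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return image  # no lung detected; fall back to the full radiograph
    y0 = max(int(ys.min()) - margin, 0)
    y1 = min(int(ys.max()) + margin + 1, image.shape[0])
    x0 = max(int(xs.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin + 1, image.shape[1])
    return image[y0:y1, x0:x1]

# Example: a blank "radiograph" with a fake lung region in the middle.
img = np.zeros((512, 512), dtype=np.uint8)
lung = np.zeros_like(img)
lung[128:384, 96:416] = 1
print(crop_to_mask(img, lung).shape)  # (276, 340)
```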
Reviewer #1: The authors mention in the discussion that "Results demonstrate that the deep learning approach can achieve better classification performance for PEP (by about 8%) than image texture analysis with no need for feature engineering." This is not agreeable. They have not performed any kind of handcrafted texture analysis in this study. The literature might not have used the same train and test sets that the authors propose in this study. It is recommended to perform a texture analysis with varying texture descriptors and then discuss how the performance compares to the deep learning methods. The authors mention in the conclusion that "It can be implemented without manual feature engineering and achieves better performance than classical image analysis." This is not agreeable for the reason aforementioned.

Response: We agree with the reviewer that the original description is problematic, because there are two potential reasons for the different results: the first is the different modeling approach, and the second is the difference in datasets. The reviewer also raises a very good point about the populations being different, because in that study the researchers only used data from some of the study sites rather than all sites. We have revised the description between lines 285-288 and incorporated the reviewer's point between lines 87-96 of the introduction. Yet it is not clear what the reviewer means by texture analysis in the context of deep learning. The difference between deep learning and classical image analysis is that the "texture" no longer needs to be selected by a human: instead of hand-picked texture descriptors, the convolutional filters are part of the trained parameters, and the network decides what kind of "texture" needs to be learned at each layer. The type of analysis the reviewer may be implying seems to be beyond the scope of the current study.

Reviewer #1: The authors shall mention the computational resources and deep learning frameworks used in this study.

Response: We have added this information in the supplementary material. We will also share the code and trained model on GitHub later.

Reviewer #2: It is a common practice to use transfer learning in image analysis. The paper presents a detailed methodology in classifying radiographs using transfer learning. Please see the following comments:

Response: We would like to thank the reviewer for taking the time to read the paper.

1. It would be nice if the authors could provide more technical details on pre-training with the ImageNet initialization. This is because the number of ImageNet classes is 1000, whereas the pre-training only requires 14. If this is true, does that mean the weights in the final linear projection layer are re-initialized, instead of using the ImageNet initializations?

Response: We have added a paragraph describing the weight initialization in more detail between lines 188-194.

2. Similar to the point above, the S1 figure should have more details to precisely describe the pretraining process.

Response: We have added details to the figure to indicate that the weights of the final layer were randomly initialized.
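To illustrate the mechanics in question: the DenseNet121 backbone keeps its ImageNet weights while the 1000-way classifier head is replaced with a randomly initialized head sized to the task. A minimal torchvision sketch under those assumptions (the function name and freezing option are illustrative; the weights enum requires torchvision >= 0.13, and the freezing option corresponds to the layer-freezing experiment discussed above):

```python
import torch.nn as nn
from torchvision import models

def build_densenet121(num_classes=14, freeze_up_to=None):
    """Load an ImageNet-pretrained DenseNet121, keep the backbone weights,
    and replace the 1000-way classifier with a randomly initialized
    `num_classes`-way head (14 mirrors the CheXpert observations in the
    pretraining stage). `freeze_up_to` optionally freezes the feature
    blocks up to the named child."""
    model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    if freeze_up_to is not None:
        for name, child in model.features.named_children():
            for p in child.parameters():
                p.requires_grad = False
            if name == freeze_up_to:
                break
    return model

# Full fine-tuning (what worked best here) vs. frozen lower layers:
full_model = build_densenet121()
partial_model = build_densenet121(freeze_up_to="denseblock2")
```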
3. It is understandable that the focus of the paper is to present a practical way of classifying radiographs using transfer learning. It would be nice if the authors could provide some future work.

Response: We appreciate that the reviewer has a clear understanding of the ultimate goal of the study. Our future work is currently in progress, and we hope this paper can lay the foundation as a first step. We have also added a small detail (S3 Fig), an image segmentation model that will be used in future work. In this study we manually cropped all the images, but this is infeasible for future studies at a larger scale; we therefore used the manually cropped bounding boxes/masks to train a model that will allow us to automate image cropping in the future. The greater context of this work is to enable larger-scale epidemiology studies of bacterial pneumonia in children. Historically, studies that aspired to do this have used the WHO methodology for classification of childhood pneumonia radiographs and employed a team of radiologists, with multiple readers per image and an adjudication process to determine a reading. By using such a study (e.g., PERCH) to train the algorithm, we intend to release and deploy this tool to facilitate the conduct of similar studies over larger databases of images without the need for a panel of radiologists. Further work is ongoing.

Reviewer #3: Some recommendations: Highlight the originality of the paper, as the model that has been used is already available.

Response: We are not aware of an available deep learning model to classify CXR images in children with pneumonia into WHO-defined pneumonia categories. Also, we would like to clarify that this study is not about developing a new architecture, but about finding a practical use case for an available model architecture, as is also pointed out by reviewer 2. We are aware of only one previous study, by Mahomed et al., which focused on training a model to classify pediatric CXRs into WHO-defined pneumonia. We have revised the last paragraph of the introduction to highlight the originality of the paper as compared to the one by Mahomed et al.

Reviewer #3: It is not clear whether cropping of CXRs is done manually or automated. Specify it (line no. 157).

Response: We have clarified the cropping process between lines 169-173 and added an illustration in S3 Fig. We used manual cropping for this study, and a byproduct of the process is that we have ground-truth labels to train a model to automate the cropping process. We have added this detail to the paper.

Reviewer #3: Discuss the deep-learning model that is used in the study. Also add the results obtained on the various deep-learning models that you have compared.

Response: All the architectures we tried are widely used in image classification, and the original papers that introduced the models describe them in detail. We have now added reference papers for all the models, and readers who are interested in further detail can refer to those papers. For the reviewer's reference, below are the results of all models in a preliminary analysis. A brief summary of the results in the paper is between lines 151-155. We did not report these results in detail because model architecture selection, like hyperparameter tuning, is a routine process and is not typically reported in detail. In previous studies such as https://arxiv.org/pdf/1711.05225.pdf, https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002686, and https://stanfordmlgroup.github.io/competitions/chexpert/, the researchers either simply reported the final architecture selected for yielding the lowest loss or highest accuracy in preliminary analysis, or simply mentioned the names of all architectures that were tried. All previous studies found DenseNet121 to be the optimal architecture for CXR images.

Reviewer #4: The manuscript is good overall. This study based on a deep learning method lays a strong foundation for the potential inclusion of computer-aided pediatric CXR readings in vaccine trials and epidemiology studies. There are some spelling mistakes that need to be corrected, such as "trainig" in line 140.

Response: We would like to thank the reviewer for the positive comment. We have proofread the paper again to correct residual spelling errors.

Submitted filename: Response to Reviewers.docx

1 Jun 2021

Deep Learning for Classification of Pediatric Chest Radiographs by WHO's Standardized Methodology
PONE-D-21-05523R1

Dear Dr. Chen,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Khanh N.Q. Le
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have addressed my queries to satisfaction.
It is recommended to mention the GitHub/other links to the data and codes that would be populated after possible acceptance of the manuscript.

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

10 Jun 2021

PONE-D-21-05523R1
Deep Learning for Classification of Pediatric Chest Radiographs by WHO's Standardized Methodology

Dear Dr. Chen:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Khanh N.Q. Le
Academic Editor
PLOS ONE
References:  24 in total

1.  Early impact of 10-valent pneumococcal conjugate vaccine in childhood pneumonia hospitalizations using primary data from an active population-based surveillance.

Authors:  Sabrina Sgambatti; Ruth Minamisava; Ana Luiza Bierrenbach; Cristiana Maria Toscano; Maria Aparecida Vieira; Gabriela Policena; Ana Lucia Andrade
Journal:  Vaccine       Date:  2015-12-17       Impact factor: 3.641

2.  Effectiveness of heptavalent pneumococcal conjugate vaccine in children younger than 5 years of age for prevention of pneumonia: updated analysis using World Health Organization standardized interpretation of chest radiographs.

Authors:  John Hansen; Steven Black; Henry Shinefield; Thomas Cherian; Jane Benson; Bruce Fireman; Edwin Lewis; Paula Ray; Janelle Lee
Journal:  Pediatr Infect Dis J       Date:  2006-09       Impact factor: 2.129

3.  Standardized interpretation of paediatric chest radiographs for the diagnosis of pneumonia in epidemiological studies.

Authors:  Thomas Cherian; E Kim Mulholland; John B Carlin; Harald Ostensen; Ruhul Amin; Margaret de Campo; David Greenberg; Rosanna Lagos; Marilla Lucero; Shabir A Madhi; Katherine L O'Brien; Steven Obaro; Mark C Steinhoff
Journal:  Bull World Health Organ       Date:  2005-06-24       Impact factor: 9.408

4.  Index for rating diagnostic tests.

Authors:  W. J. Youden
Journal:  Cancer       Date:  1950-01       Impact factor: 6.860

5.  Long-term Association of 13-Valent Pneumococcal Conjugate Vaccine Implementation With Rates of Community-Acquired Pneumonia in Children.

Authors:  Naïm Ouldali; Corinne Levy; Philippe Minodier; Laurence Morin; Sandra Biscardi; Marie Aurel; François Dubos; Marie Alliette Dommergues; Ellia Mezgueldi; Karine Levieux; Fouad Madhi; Laure Hees; Irina Craiu; Chrystèle Gras Le Guen; Elise Launay; Ferielle Zenkhri; Mathie Lorrot; Yves Gillet; Stéphane Béchet; Isabelle Hau; Alain Martinot; Emmanuelle Varon; François Angoulvant; Robert Cohen
Journal:  JAMA Pediatr       Date:  2019-04-01       Impact factor: 16.193

6.  Computer-aided diagnosis for World Health Organization-defined chest radiograph primary-endpoint pneumonia in children.

Authors:  Nasreen Mahomed; Bram van Ginneken; Rick H H M Philipsen; Jaime Melendez; David P Moore; Halvani Moodley; Tanusha Sewchuran; Denny Mathew; Shabir A Madhi
Journal:  Pediatr Radiol       Date:  2020-01-13

7.  Use of Chest Radiography Examination as a Probe for Pneumococcal Conjugate Vaccine Impact on Lower Respiratory Tract Infections in Young Children.

Authors:  Shalom Ben-Shimol; Ron Dagan; Noga Givon-Lavi; Dekel Avital; Jacob Bar-Ziv; David Greenberg
Journal:  Clin Infect Dis       Date:  2020-06-24       Impact factor: 9.079

Review 8.  Pneumococcal conjugate vaccines for preventing vaccine-type invasive pneumococcal disease and X-ray defined pneumonia in children less than two years of age.

Authors:  Marilla G Lucero; Vernoni E Dulalia; Leilani T Nillos; Gail Williams; Rhea Angela N Parreño; Hanna Nohynek; Ian D Riley; Helena Makela
Journal:  Cochrane Database Syst Rev       Date:  2009-10-07

Review 9.  Introduction to the Epidemiologic Considerations, Analytic Methods, and Foundational Results From the Pneumonia Etiology Research for Child Health Study.

Authors:  Katherine L O'Brien; Henry C Baggett; W Abdullah Brooks; Daniel R Feikin; Laura L Hammitt; Stephen R C Howie; Maria Deloria Knoll; Karen L Kotloff; Orin S Levine; Shabir A Madhi; David R Murdoch; J Anthony G Scott; Donald M Thea; Scott L Zeger
Journal:  Clin Infect Dis       Date:  2017-06-15       Impact factor: 9.079

10.  Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists.

Authors:  Pranav Rajpurkar; Jeremy Irvin; Robyn L Ball; Kaylie Zhu; Brandon Yang; Hershel Mehta; Tony Duan; Daisy Ding; Aarti Bagul; Curtis P Langlotz; Bhavik N Patel; Kristen W Yeom; Katie Shpanskaya; Francis G Blankenberg; Jayne Seekins; Timothy J Amrhein; David A Mong; Safwan S Halabi; Evan J Zucker; Andrew Y Ng; Matthew P Lungren
Journal:  PLoS Med       Date:  2018-11-20       Impact factor: 11.069

