Novel loss functions for ensemble-based medical image classification.

Sivaramakrishnan Rajaraman¹, Ghada Zamzmi¹, Sameer K Antani¹.

Abstract

Medical images commonly exhibit multiple abnormalities. Predicting them requires multi-class classifiers whose training and reliable performance can be affected by a combination of factors, such as dataset size, data source, distribution, and the loss function used to train deep neural networks. Currently, the cross-entropy loss remains the de facto loss function for training deep learning classifiers. This loss function, however, asserts equal learning from all classes, leading to a bias toward the majority class. Although the choice of the loss function impacts model performance, to the best of our knowledge, no literature exists that performs a comprehensive analysis and selection of an appropriate loss function toward the classification task under study. In this work, we benchmark various state-of-the-art loss functions, critically analyze model performance, and propose improved loss functions for a multi-class classification task. We select a pediatric chest X-ray (CXR) dataset that includes images with no abnormality (normal), and those exhibiting manifestations consistent with bacterial and viral pneumonia. We construct prediction-level and model-level ensembles to improve classification performance. Our results show that compared to the individual models and the state-of-the-art literature, the weighted averaging of the predictions for top-3 and top-5 model-level ensembles delivered significantly superior classification performance (p < 0.05) in terms of the MCC metric (0.9068, 95% confidence interval (0.8839, 0.9297)). Finally, we performed localization studies to interpret model behavior and confirm that the individual models and ensembles learned task-specific features and highlighted disease-specific regions of interest. The code is available at https://github.com/sivaramakrishnan-rajaraman/multiloss_ensemble_models.

Year: 2021; PMID: 34968393; PMCID: PMC8718001; DOI: 10.1371/journal.pone.0261307

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

Deep learning (DL) has demonstrated superior performance in natural and medical computer vision tasks. Computer-aided diagnostic tools developed with DL models have been widely used in analyzing medical images including chest X-rays (CXRs) and computerized tomography (CT). CXRs have been studied extensively where the models are used to predict manifestations of cardiopulmonary diseases such as pneumonia opacities, pneumothorax, cardiomegaly, Tuberculosis (TB), lung nodules, and, more recently, COVID-19 [1, 2]. Such tools are extremely helpful, particularly in resource-constrained regions where there exists a scarcity of expert radiologists.

The DL model parameters are iteratively modified to minimize the training error using several optimization methods (e.g., stochastic gradient descent). This error is computed using a loss function, also called a cost function, that maps model predictions to their associated costs. Cross-entropy loss is the most commonly used loss function in medical image classification tasks, including CXRs [3-7]. This loss function takes the predicted class probabilities, which lie between 0 and 1, and outputs a non-negative value, where high values indicate high disagreement of the predicted class with the ground truth label. In class-imbalanced medical image classification tasks, training a model to minimize the cross-entropy loss might lead to biased learning since (i) the loss assigns equal weights to all the classes, and (ii) the model would predict the majority of test samples as belonging to the dominant normal class. To mitigate these issues, the authors of [8] proposed a loss function, called focal loss, for object detection tasks. Here, the standard cross-entropy loss function is modified to down-weight the majority background class so the model would focus on learning the minority object samples. Following this study, the focal loss function has been used in several medical image classification studies. 
For example, the authors of [9] trained DL models to minimize the focal loss and improve pulmonary nodule detection and classification performance using CT scans. They observed that the model trained with the focal loss resulted in superior performance with 97.2% accuracy and 96.0% sensitivity. Another study [10] used the focal loss to train the models toward classifying CXRs into normal, bacterial pneumonia, viral pneumonia, or COVID-19 categories. It was observed that the models trained with the focal loss outperformed other models by demonstrating superior values for precision (78.33%), recall (86.09%), and F-score (81.68%). Aside from these studies, the literature does not have a comprehensive study that investigates the effects of loss functions on medical image classification, particularly CXRs. DL models learn a mapping function through error backpropagation and update model weights to minimize error. They can vary in their architecture, hyper-parameters, and training strategy, thereby resulting in varying degrees of bias and variance errors. Ensemble learning, a paradigm of machine learning, helps to (i) reduce prediction variance and achieve improved performance over any individual constituent model, and (ii) increase robustness by reducing the range (spread) of the predictions. There are several ensemble methods reported in the literature including majority voting, simple averaging, weighted averaging, and stacking, among others [11]. Ensemble models have been widely used in medical image classification tasks including CXRs [2, 7, 12–16]. However, these studies trained ensemble models to minimize the de-facto cross-entropy loss in their respective classification tasks. To the best of our knowledge, we observed that no studies reported evaluations on the performance of ensemble DL models trained with other loss functions toward improving classification performance. 
In this study, we aim to demonstrate the benefits of (i) training DL classification models using existing and proposed loss functions and (ii) constructing model ensembles to improve performance in a multi-class classification task that classifies pediatric CXRs as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations. This systematic study is performed as follows. First, we train an EfficientNet-B0-based U-Net model on a collection of CXRs and their associated lung masks [17] to segment lungs in the pediatric pneumonia CXR collection [6]. Lung segmentation helps to exclude irrelevant image regions and learn lung region-specific features. We select the EfficientNet-B0-based model because it delivered state-of-the-art (SOTA) performance in ImageNet classification tasks, with reduced computational complexity [18]. Next, the encoder from the trained EfficientNet-B0-based U-Net model is truncated and appended with classification layers. This is done to transfer CXR modality-specific knowledge for improving performance in the task of classifying CXRs in the pediatric pneumonia CXR dataset into normal, bacterial pneumonia, or viral pneumonia categories. Finally, the top-K (K = 3, 5) performing models are used to construct prediction-level and model-level ensembles. The performance of the individual models, prediction-level, and model-level ensembles are further analyzed for statistical significance. We also performed localization studies to ensure that the individual models and their ensembles learned task-specific features and highlighted the disease-manifested regions of interest (ROIs) in the CXRs.

Materials and methods

Datasets

This retrospective study uses the following two datasets:

Montgomery TB CXRs [19]: This is a publicly available collection of 58 CXRs showing TB-related manifestations and radiologist readings and 80 CXRs showing lungs with no findings. The images and their associated lung masks are deidentified and exempted from the National Institutes of Health (NIH) IRB review (OHSRP#5357). We use this as an independent test set to evaluate the segmentation model proposed in this study.

Pediatric pneumonia [6]: A set of 4273 CXRs showing lungs infected with bacterial and viral pneumonia and 1583 CXRs showing normal lungs are collected from children of 1 to 5 years of age at the Guangzhou Medical Center in China. The author-defined [6] training set contains 1349, 2538, and 1345 CXRs and the test set contains 234, 242, and 148 CXRs showing normal lungs, bacterial pneumonia, and viral pneumonia manifestations, respectively. The CXRs are acquired as a part of routine clinical care, curated by expert radiologists, and made publicly available with IRB approvals. We use this dataset toward classifying CXRs as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations.

Lung segmentation and cropping

As CXR images contain irrelevant regions that do not help in learning classification task-specific features, we segmented the ROI, i.e., the lungs from the CXRs, and used the lung-segmented images for training the classification models. Our review of the literature reveals that U-Net [20] is widely used for segmenting ROIs in natural and medical images. Further, the study of the literature shows that EfficientNet [18] models have achieved superior performance in natural and medical computer vision tasks, as compared to other models, in terms of accuracy, efficiency, and computational complexity. Hence, we used an EfficientNet-B0-based U-Net model [21] to perform pixel-wise segmentation. The EfficientNet-B0-based U-Net model is trained using the CXR collection and their associated lung masks discussed in [17] to minimize the following loss functions: (i) Binary cross-entropy (BCE), (ii) Weighted BCE-Dice [2], (iii) Focal [8], (iv) Tversky [22], and (v) Focal Tversky [23]. We used 10% of the training data for validation with a fixed seed. Each mini-batch of the training data is augmented using random affine transformations such as pixel shifting [-2 +2], horizontal flipping, and rotations [-5 +5] to introduce variability into the training process. The model is trained using an Adam optimizer with an initial learning rate of 1e-3. The learning rate is reduced whenever the validation loss ceased to improve. The model demonstrating the least validation loss is used to predict lung masks of a reduced 512×512 pixel resolution for the CXRs in the Montgomery TB CXR collection. The images are resized using bicubic interpolation from the OpenCV software library. The performance of the segmentation models is evaluated using the following metrics: (i) Segmentation accuracy; (ii) Dice coefficient, and (iii) Intersection over union (IoU). 
We selected the top-3 segmentation models from those that are trained using the aforementioned loss functions based on segmentation accuracy, Dice coefficient, and IoU metrics. The selected models are used to predict the lung masks for the CXRs in the Montgomery CXR collection. These masks are then bitwise-ANDed to produce the final lung mask. The bitwise-AND operation compares each pixel across the masks predicted by the top-3 performing models. Only if the pixel is 1 in all three masks, i.e., belongs to the lung ROI in each of them, is the corresponding pixel in the final mask set to 1; otherwise, it is set to 0. The final lung mask is then overlaid on the original CXR image to delineate the lung boundaries and the bounding box containing the lung pixels is cropped. The resulting lung-cropped image is resized to 512×512 pixel resolution. Then, the cropped CXRs are contrast-enhanced by saturating the top and bottom 1% of all the image pixels followed by normalizing the pixels to the range [0, 1]. Fig 1 shows the diagram of the segmentation module proposed in this study.
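The mask fusion and cropping steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming binary masks stored as lists of rows; the function names (`bitwise_and_masks`, `lung_bounding_box`, `crop`) are hypothetical and are not taken from the authors' code, which works on image arrays.

```python
def bitwise_and_masks(masks):
    """Pixel-wise AND of equally sized binary masks (lists of rows of 0/1)."""
    n_rows, n_cols = len(masks[0]), len(masks[0][0])
    return [
        [int(all(m[r][c] == 1 for m in masks)) for c in range(n_cols)]
        for r in range(n_rows)
    ]


def lung_bounding_box(mask):
    """Return (top, bottom, left, right), inclusive, around the pixels set to 1."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask contains no lung pixels")
    return rows[0], rows[-1], cols[0], cols[-1]


def crop(image, box):
    """Crop a 2-D image (list of rows) to an inclusive bounding box."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

Because a pixel survives only when every model marks it as lung, the fused mask can never be larger than any single predicted mask, which makes the resulting crop conservative.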

Fig 1

Segmentation module.

The U-Net constructed with an EfficientNet-B0-based encoder and symmetrical decoder is trained to minimize the following losses: (i) BCE; (ii) Weighted BCE-Dice, (iii) Focal, (iv) Tversky, and (v) Focal Tversky. The trained models predict lung masks in the Montgomery TB CXR collection. The predictions of the top-3 performing models are bitwise-ANDed to produce the final lung mask.

Classification module

The encoder from the trained EfficientNet-B0-based U-Net model is truncated at the ‘block5c_add’ layer (TensorFlow Keras naming convention) with feature map dimensions of [16, 16, 512]. This approach is followed to transfer CXR modality-specific knowledge to improve performance in the current CXR classification task. The truncated model is appended with the following layers: (i) a zero-padding (ZP) layer, (ii) a convolutional layer with 512 filters, each of size 3×3, (iii) a global averaging pooling (GAP) layer; and (iv) a final dense layer with three neurons and Softmax activation, to classify the pediatric CXRs as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations. We used the train and test splits published in [6] to compare our model performance with the SOTA literature [6, 24]. We allocated 10% of the training data for validation with a fixed seed. The model is trained using a stochastic gradient descent optimizer with an initial learning rate of 1e-3 and momentum of 0.9, to minimize the loss functions discussed in this study. The best-performing model is selected based on the least loss obtained with the validation data. These models are evaluated with the test set, and the performance is recorded in terms of the following metrics: (a) accuracy; (b) AUROC; (c) area under the precision-recall curve (AUPRC); (d) precision; (e) recall; (f) F-score; and (g) MCC. The top-K (K = 3, 5) models that deliver superior performance with the test set are used to construct the ensembles. We constructed prediction-level and model-level ensembles. At the prediction level, the models’ predictions are combined using various ensemble strategies such as majority voting, simple averaging, weighted averaging, and stacking. In a majority voting ensemble, the most voted predictions are considered final for classifying CXRs to their respective classes. 
In a simple averaging ensemble, the individual model predictions are averaged to generate the final prediction. For the weighted averaging ensemble, we propose to optimize the weights that minimize the total logarithmic loss so that the predicted labels converge to the target labels. We iteratively minimized the logarithmic loss using the Sequential Least-Squares Programming (SLSQP) algorithm [25]. In a stacking ensemble, the predictions are fed into a meta-learner that consists of a single hidden layer with 9 and 15 neurons respectively, for the top-3 and top-5 performing models. The weights of the top-K models are frozen and only the meta-learner is trained to optimally combine the models’ predictions. A dense layer with three neurons and Softmax activation is appended to output prediction probabilities. Fig 2 shows the classification and ensemble frameworks proposed in this study.
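The prediction-level strategies can be illustrated with a small sketch. It assumes each model's output is a list of per-sample class-probability lists; the helper names are hypothetical. The weight search itself is not shown: the text uses SLSQP for that, whereas this sketch only evaluates the logarithmic loss that such an optimizer would minimize for a given weight vector.

```python
import math
from collections import Counter


def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=values.__getitem__)


def majority_vote(prob_sets):
    """prob_sets[m][n] is model m's class-probability list for sample n."""
    n_samples = len(prob_sets[0])
    labels = []
    for n in range(n_samples):
        votes = [argmax(model_probs[n]) for model_probs in prob_sets]
        labels.append(Counter(votes).most_common(1)[0][0])
    return labels


def weighted_average(prob_sets, weights):
    """Combine model probabilities; equal weights give simple averaging."""
    n_samples, n_classes = len(prob_sets[0]), len(prob_sets[0][0])
    return [
        [sum(w * p[n][k] for w, p in zip(weights, prob_sets)) for k in range(n_classes)]
        for n in range(n_samples)
    ]


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for y, p in zip(labels, probs)) / len(labels)
```

A weight vector that sums to 1 keeps each combined row a valid probability distribution, which is why a constrained optimizer is a natural fit for the weight search.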

Fig 2

Classification module.

The EfficientNet-B0-based encoder is truncated at the block-5c-add layer and appended with the classification layers to output multi-class prediction probabilities. GAP denotes the global average pooling layer and DCL denotes the deepest convolutional layer in the trained models. The classification model is trained to minimize the various loss functions discussed in this study. The top-K (K = 3, 5) performing models are used to construct prediction-level and model-level ensembles.

For the model-level ensemble, the top-K models are instantiated with their trained weights and truncated at their deepest convolutional layer. The features from these layers are concatenated and appended with a 1×1 convolutional layer, to reduce feature dimensions. This is followed by appending a GAP layer and a dense layer with three neurons and Softmax activation to classify the CXRs as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations. The performance of the individual models, prediction-level ensembles, and model-level ensembles are further compared for statistical significance. All the models are trained and evaluated using TensorFlow Keras 2.4 on a Windows system with an Intel Xeon 3.80 GHz CPU, NVIDIA GeForce GTX 1050 Ti GPU, and CUDA dependencies for GPU acceleration. Statistical significance analysis is performed using R software version 4.1.1.

Classification losses

We experimented with the following loss functions to provide a comprehensive evaluation of their impact on the multi-class classification task under study: (i) Categorical cross-entropy (CCE) loss; (ii) Categorical focal loss [8]; (iii) Kullback-Leibler (KL) divergence loss [26]; (iv) Categorical Hinge loss [27]; (v) Label-smoothed CCE loss [28]; (vi) Label-smoothed categorical focal loss [28], and (vii) Calibrated CCE loss [29]. We also propose several loss functions, as follows, that mitigate the issues with the existing loss functions when applied to the multi-class classification task under study: (i) CCE loss with entropy-based regularization; (ii) Calibrated negative entropy loss, (iii) Calibrated KL divergence loss; (iv) Calibrated categorical focal loss, and (v) Calibrated categorical Hinge loss. The details of the proposed loss functions are discussed below.

(i) CCE with entropy-based regularization

DL models demonstrate low entropy values for the output distributions when they are confident about their predictions [29]. However, under class-imbalanced training conditions, the models might be overconfident about the majority class and classify most of the samples as belonging to this dominant class. This may lead to model overfitting and adversely impact generalization performance. Under these circumstances, a penalty could be introduced in the form of a regularization term that penalizes peaked distributions, thereby reducing overfitting and improving generalization. A model produces a conditional distribution p(y|x) through the Softmax function, over a set of classes y given an input x. The entropy of this conditional distribution is given by,

$$H\big(p(y \mid x)\big) = -\sum_{i} p(y_i \mid x)\,\log p(y_i \mid x) \tag{1}$$

Here, H denotes the entropy term. A regularization term is proposed where the negative entropy is added to the negative log-likelihood to penalize over-confident output distributions. It is given by,

$$L = -\sum \log p(y \mid x) - \beta\, H\big(p(y \mid x)\big) \tag{2}$$

Here, β controls the intensity of the penalty. Through empirical evaluations, we set the value of β = 2. We used this regularization term in the final dense layer as an activity regularizer and trained the model to minimize the CCE loss.
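As a worked illustration of the penalty, the sketch below evaluates the entropy of a Softmax output and the negative log-likelihood with the negative-entropy term added, for a single sample. The function names are hypothetical, and applying the term per sample (rather than as a Keras activity regularizer, as in the text) is a simplifying assumption.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete distribution; 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence_penalized_nll(probs, true_class, beta=2.0):
    """Negative log-likelihood plus beta times the negative entropy."""
    nll = -math.log(probs[true_class])
    return nll - beta * entropy(probs)
```

For a fixed probability on the true class, a flatter distribution over the remaining classes has higher entropy and therefore a lower penalized loss, which is how the term discourages peaked outputs.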

(ii) Calibrated negative entropy loss

We propose an entropy-based loss function where the negative entropy is added as an auxiliary term to the negative log-likelihood term as shown in Eqs [1] and [2] to penalize over-confident output distributions. A model is said to demonstrate poor calibration if it is overconfident or underconfident about its predictions and would not reflect the true occurrence likelihood of the class events. Motivated by [29], we propose to add a regularization term that computes the difference between the accuracy and the predicted probabilities to the entropy-based loss function. This regularization term helps to penalize the model when the entropy-based loss function reduces without a corresponding change in the accuracy. The regularization term forces the accuracy to match the average predicted probabilities, thereby (i) acting as a smoothing parameter that smoothens overconfident or underconfident predictions and (ii) pushing the model to converge to the ideal condition when the accuracy would reflect the true occurrence likelihood. The calibrated negative entropy loss is given by,

$$L_{\mathrm{CNE}} = -\sum \log p(y \mid x) - \beta\, H\big(p(y \mid x)\big) + \lambda\,\mathrm{Diff} \tag{3}$$

Here, β controls the penalty intensity and λ weights the auxiliary term Diff. The auxiliary term difference is calculated for each mini-batch of N samples, as given by,

$$\mathrm{Diff} = \left|\, \frac{1}{N}\sum_{i=1}^{N} c_i - \frac{1}{N}\sum_{i=1}^{N} p(\hat{y}_i \mid x_i) \,\right| \tag{4}$$

Here, ŷ_i denotes the predicted label. The value of c_i is 1 if ŷ_i = y_i; otherwise, c_i is 0. This auxiliary term forces the average value of the predicted probabilities to match the accuracy over all training examples. This pushes the model closer to the ideal situation, where the model accuracy would reflect the true occurrence likelihood of the samples. The auxiliary term serves as a smoothing parameter for predictions with extremely low or high prediction confidences. We tested with different weights for β = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 2] and λ = [0.5, 1, 2, 5, 10, 15, 20]. After empirical evaluations, we set the value of β = 0.001 and λ = 10.
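The auxiliary term can be sketched for one mini-batch as the gap between the batch accuracy and the mean probability assigned to the predicted class. Taking the absolute value of that gap is an assumption made here so the term is non-negative; the function name `calibration_gap` is hypothetical. A training loss would need a differentiable implementation in the framework in use, which this plain-Python arithmetic does not provide.

```python
def calibration_gap(labels, probs):
    """|mini-batch accuracy - mean confidence of the predicted class|."""
    n = len(labels)
    predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]
    accuracy = sum(int(y_hat == y) for y_hat, y in zip(predicted, labels)) / n
    mean_confidence = sum(p[y_hat] for p, y_hat in zip(probs, predicted)) / n
    return abs(accuracy - mean_confidence)
```

In the example of two samples predicted with confidences 0.9 and 0.8 where only one is correct, the accuracy is 0.5 and the mean confidence is 0.85, so the gap of 0.35 flags an overconfident batch.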

(iii) Calibrated KL divergence loss

The KL divergence, also called relative entropy, measures the difference between the observed and actual probability distributions. The KL divergence between two distributions A(x) and B(x) is given by,

$$D_{KL}(A \parallel B) = \sum_{x} A(x)\,\log\frac{A(x)}{B(x)} \tag{5}$$

We propose to benefit from the regularization term mentioned in Eq [4] to smoothen model predictions when trained to minimize the KL divergence loss. We propose the calibrated KL divergence loss where the regularization term in Eq [4] is added to the KL divergence loss. This is done to penalize the model when the KL divergence loss reduces without a corresponding change in the accuracy. The calibrated KL divergence loss is given by,

$$L_{\mathrm{CKL}} = D_{KL}(A \parallel B) + \lambda\,\mathrm{Diff} \tag{6}$$

The auxiliary term difference is calculated for each mini-batch and is given by Eq [4]. We tested with different weights for λ = [0.5, 1, 2, 5, 10, 15, 20]. After empirical evaluations, the value of λ is set to 1.
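A short numerical sketch of the KL divergence between a target distribution and a prediction follows. The function name and the `eps` guard against division by zero are illustrative assumptions.

```python
import math


def kl_divergence(a, b, eps=1e-12):
    """Sum of a_i * log(a_i / b_i); terms with a_i == 0 contribute nothing."""
    return sum(ai * math.log(ai / max(bi, eps)) for ai, bi in zip(a, b) if ai > 0)
```

With a one-hot target, the sum collapses to the negative log-probability of the true class, so the KL divergence coincides with the cross-entropy in that case.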

(iv) Calibrated categorical focal loss

The principal limitation of CCE loss is that the loss asserts equal learning from all the classes. This adversely impacts training and classification performance during class-imbalanced training. This holds for medical images, particularly CXRs, where a class imbalance exists between the majority normal class and other minority disease classes. In this regard, the authors of [8] proposed the focal loss for object detection tasks, in which the standard cross-entropy loss function is modified to down-weight the majority class so that the model would focus on learning the minority classes. In a multi-class classification setting, the categorical focal loss is given by,

$$L_{\mathrm{CFL}} = -\sum_{k=0}^{K-1} y_k\,(1 - \hat{y}_k)^{\gamma}\,\log \hat{y}_k \tag{7}$$

Here, K = 3 denotes the number of classes, k ∈ {0, 1, …, K−1} denotes the class labels for bacterial pneumonia, normal, and viral pneumonia classes respectively, y is the one-hot encoded ground truth, and ŷ is a vector representing an estimated probability distribution over the three classes. The value γ denotes the rate at which the easy samples are down-weighted. The categorical focal loss converges to CCE loss at γ = 0. We propose the calibrated categorical focal loss, where the difference between the accuracy and predicted probabilities is added as a regularization term to penalize the model for overconfident and underconfident predictions when trained to minimize the categorical focal loss. The calibrated categorical focal loss is given by,

$$L_{\mathrm{CCFL}} = -\sum_{k=0}^{K-1} y_k\,(1 - \hat{y}_k)^{\gamma}\,\log \hat{y}_k + \lambda\,\mathrm{Diff} \tag{8}$$

The auxiliary term difference is calculated for each mini-batch and is given by Eq [4]. We tested with different weights for γ = [0.5, 1, 2, 5] and λ = [0.5, 1, 2, 5, 10, 15, 20]. After empirical evaluations, the values of γ and λ are both set to 1.
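The down-weighting effect of γ can be checked numerically with a small sketch of the categorical focal loss for one sample. It assumes a one-hot target and no class-balancing factor, since none is mentioned in the text; the function name and the `eps` clamp are illustrative.

```python
import math


def categorical_focal_loss(y_true, y_pred, gamma=1.0, eps=1e-12):
    """-sum_k y_k * (1 - p_k)^gamma * log(p_k) for a single sample."""
    return -sum(
        t * (1.0 - p) ** gamma * math.log(max(p, eps))
        for t, p in zip(y_true, y_pred)
    )
```

With γ = 0 the modulating factor equals 1 and the value matches the CCE loss; with γ = 1 a sample already predicted with probability 0.9 contributes only a tenth of its CCE loss.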

(v) Calibrated categorical Hinge loss

The Hinge loss is widely used in binary classification problems to produce “maximum-margin” classification [27], particularly with SVM classifiers. This loss could be used in a multi-class classification setting and is given by,

$$L_{\mathrm{Hinge}} = \max\big(0,\ \mathrm{neg} - \mathrm{pos} + 1\big) \tag{9}$$

$$\mathrm{neg} = \max\big((1 - y)\,\hat{y}\big) \tag{10}$$

$$\mathrm{pos} = \sum y\,\hat{y} \tag{11}$$

Here, y and ŷ denote the ground truth one-hot encoded labels and predictions, respectively. We propose the calibrated categorical Hinge loss, where the difference between the accuracy and predicted probabilities is added as an auxiliary term to the categorical Hinge loss. This auxiliary term penalizes the model when the categorical Hinge loss reduces without a corresponding change in the accuracy. The calibrated categorical Hinge loss is given by,

$$L_{\mathrm{CHinge}} = \max\big(0,\ \mathrm{neg} - \mathrm{pos} + 1\big) + \lambda\,\mathrm{Diff} \tag{12}$$

The negative and positive terms are given by Eqs [10] and [11]. The auxiliary term difference is calculated for each mini-batch and is given by Eq [4]. We tested with different weights for λ = [0.5, 1, 2, 5, 10, 15, 20]. After empirical evaluations, the value of λ is set to 10.
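A per-sample sketch of the categorical Hinge loss follows, using the positive and negative terms in the form that Keras uses for its categorical hinge. That this is the exact form used in the study is an assumption based on the stated TensorFlow Keras setup.

```python
def categorical_hinge(y_true, y_pred):
    """max(0, neg - pos + 1) with one-hot y_true and predicted scores y_pred."""
    pos = sum(t * p for t, p in zip(y_true, y_pred))        # score of the true class
    neg = max((1 - t) * p for t, p in zip(y_true, y_pred))  # best wrong-class score
    return max(0.0, neg - pos + 1.0)
```

The loss reaches zero only when the true-class score exceeds the best competing score by a margin of at least 1, which for Softmax outputs means a prediction of exactly 1 for the true class.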

Results

CXR lung segmentation

Recall that an EfficientNet-B0-based U-Net model is trained to minimize BCE, weighted BCE-Dice, focal, Tversky, and focal Tversky loss functions and predict lung masks for the CXRs in the Montgomery TB CXR collection. The lung masks predicted by the top-3 performing models are bitwise-ANDed to produce the final lung mask. The performance of the individual models and the bitwise-ANDed model ensemble is evaluated using segmentation accuracy, IoU, and Dice coefficient as shown in Table 1. We observed that the segmentation models demonstrated higher values for the Dice coefficient compared to the IoU metric due to the way the two functions are defined. The Dice coefficient value is given by twice the area of the intersection of two masks, divided by the sum of the areas of the masks. The IoU divides the same intersection by the area of the union of the masks, so the Dice coefficient is never smaller than the IoU for the same pair of masks. It is observed from Table 1 that, considering individual models, the segmentation model trained to minimize the focal Tversky loss demonstrated superior performance in terms of IoU, Dice coefficient, and accuracy metrics, followed by those trained with Tversky and weighted BCE-Dice losses. These top-3 performing models are used to construct the ensemble. We observed that the IoU, Dice coefficient, and accuracy achieved using the bitwise-ANDed model ensemble are superior compared to any individual constituent model. However, we observed no statistically significant difference in performance (p > 0.05) between the individual models and the ensemble.
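The relation between the two overlap metrics can be verified on a toy pair of masks. The sketch below assumes binary masks given as lists of rows; the function name is hypothetical.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equally sized binary masks."""
    inter = sum(
        p & t
        for p_row, t_row in zip(pred, truth)
        for p, t in zip(p_row, t_row)
    )
    area_pred = sum(sum(row) for row in pred)
    area_truth = sum(sum(row) for row in truth)
    union = area_pred + area_truth - inter
    if union == 0:
        raise ValueError("both masks are empty")
    return 2 * inter / (area_pred + area_truth), inter / union
```

For a single pair of masks the two values are linked by Dice = 2·IoU / (1 + IoU), which is why Dice is at least as large as IoU whenever both are computed on the same prediction.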

Table 1

Segmentation performance achieved by the individual models and the bitwise-ANDed ensemble of the top-3 performing models.

Loss/Method	IoU	Dice	Accuracy
BCE	0.8186±0.0384	0.9571±0.0361	0.9720±0.0096
Weighted BCE-Dice	0.8465±0.0401	0.9601±0.0396	0.9732±0.0104
Focal	0.2601±0.0621	0.9189±0.0527	0.7788±0.0485
Tversky	0.9360±0.0368	0.9624±0.0225	0.9912±0.0102
Focal Tversky	0.9510±0.0415	0.9637±0.0271	0.9925±0.0130
Ensemble	0.9518±0.0462	0.9652±0.0309	0.9927±0.0117

The bold numerical values denote the best performance in respective columns.

We used the top-3 performing models and the bitwise-ANDed ensemble approach to predict lung masks for the CXRs in the pediatric pneumonia CXR collection. As the ground truth lung masks for these CXRs are not made available by the authors of [6], the segmentation performance could not be validated. The predicted lung masks are overlaid on the original CXRs to delineate the lung boundaries and are cropped. The cropped images are resized to 512×512 pixel resolution and used for further analysis (i.e., disease classification).

CXR disease classification

Recall that the encoder from the trained EfficientNet-B0-based U-Net model is truncated and appended with classification layers. This approach is followed to perform a CXR modality-specific knowledge transfer [2, 15, 16, 30] to improve performance in a relevant task of classifying the CXRs in the pediatric pneumonia CXR collection into normal, bacterial pneumonia, or viral pneumonia categories. The classification models are trained to minimize the existing and proposed loss functions in this study. Table 2 summarizes the classification performance achieved by these models. We measured the 95% CI as the exact Clopper–Pearson interval for the MCC metric to test for statistical significance. It is observed that the classification models demonstrated higher values for F-score compared to the MCC metric. F-score provides a balanced measure of precision and recall but could provide a biased estimate since it does not consider TN values. MCC considers TPs, TNs, FPs, and FNs in its computation. The score of MCC lies in the range [-1 +1] where +1 demonstrates a perfect model while -1 demonstrates poor performance. The authors of [31] discuss the benefits of using MCC metric over F-score and accuracy in evaluating classification models. It is observed from Table 2 that the model trained to minimize the calibrated CCE loss demonstrated superior values for accuracy (0.9343), AUROC (0.9928), AUPRC (0.9869), precision (0.9345), recall (0.9343), F-score (0.9338), and MCC (0.8996) metrics. The 95% CI for the MCC metric demonstrated a tighter error margin and hence higher precision as compared to other models. The performance achieved with the calibrated CCE loss is significantly superior (p < 0.05) as compared to those achieved by the models that are trained to minimize the categorical focal and calibrated categorical focal loss functions. Fig 3 shows the confusion matrix, AUROC, and AUPRC curves obtained with the calibrated CCE loss-trained model. 
This performance is followed by the models that are trained to minimize the CCE with entropy-based regularization, calibrated negative entropy, label-smoothed categorical focal, and calibrated categorical Hinge loss functions.
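To make the contrast with the F-score concrete, the sketch below computes a multi-class MCC directly from a confusion matrix, so every cell (including the true negatives of each class) enters the score. It uses the commonly used multi-class generalization of MCC; whether the study used exactly this formulation is an assumption, and the function name is hypothetical.

```python
import math


def multiclass_mcc(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(confusion[i]) for i in range(k)]
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # A zero denominator means all samples fall in one row or one column.
    return numerator / denominator if denominator else 0.0
```

For two classes this expression reduces to the familiar (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) form.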

Table 2

Classification performance achieved by the classification models that are trained using the loss functions discussed in this study.

Loss	Accuracy	AUROC	AUPRC	Precision	Recall	F-Score	MCC (95% CI)
CCE	0.9279	0.9921	0.9857	0.9292	0.9279	0.9282	0.8899 (0.8653, 0.9145)
CCE with entropy-based regularization (β = 2.0)	0.9311	0.9913	0.9844	0.9337	0.9311	0.9319	0.8953 (0.8712, 0.9194)
KL divergence	0.9231	0.99	0.9825	0.9261	0.9231	0.924	0.8831 (0.8578, 0.9084)
Categorical focal (γ = 1)	0.9054	0.984	0.9753	0.9079	0.9054	0.9054	0.8562 (0.8286, 0.8838)
Categorical Hinge	0.9247	0.9892	0.9803	0.928	0.9247	0.9255	0.8858 (0.8608, 0.9108)
Smoothed-CCE (σ = 0.2)	0.9231	0.9899	0.9821	0.9252	0.9231	0.9237	0.8829 (0.8576, 0.9082)
Smoothed-focal (σ = 0.2)	0.9279	0.9847	0.9744	0.9317	0.9279	0.9287	0.8909 (0.8664, 0.9154)
Calibrated-CCE (λ = 10)	0.9343	0.9928	0.9869	0.9345	0.9343	0.9338	0.8996 (0.876, 0.9232)
Calibrated-KL divergence (λ = 1)	0.9215	0.9895	0.9817	0.9239	0.9215	0.9217	0.8807 (0.8552, 0.9062)
Calibrated focal (γ = λ = 1)	0.9167	0.986	0.9777	0.9187	0.9167	0.9164	0.8734 (0.8473, 0.8995)
Calibrated Hinge (λ = 10)	0.9279	0.9894	0.9803	0.9292	0.9279	0.9275	0.8903 (0.8657, 0.9149)
Calibrated negative entropy (β = 1e-3; λ = 10)	0.9311	0.9917	0.9851	0.9316	0.9311	0.9308	0.8947 (0.8706, 0.9188)

Fig 3

Confusion matrix, AUROC, and AUPRC curves obtained using the model that is trained to minimize the calibrated CCE loss function.

The top-K (K = 3, 5) models are selected based on the MCC metric. The values in parentheses denote the 95% CI measured as the exact Clopper–Pearson interval for the MCC metric. Bold numerical values denote superior performance in respective columns.

The top-3 (i.e., models that are trained to minimize the calibrated CCE, CCE with entropy-based regularization, and calibrated negative entropy losses) and top-5 (i.e., models that are trained to minimize the calibrated CCE, CCE with entropy-based regularization, calibrated negative entropy, label-smoothed categorical focal, and calibrated categorical Hinge losses) models are used to construct prediction-level and model-level ensembles. Recall that for the prediction-level ensemble, the models’ predictions are combined using majority voting, simple averaging, weighted averaging, and stacking-based ensemble methods. Table 3 summarizes the classification performance achieved by the prediction-level ensembles.

Table 3

Performance metrics achieved by the prediction-level ensembles using the top-K (K = 3, 5) models.

Models	Method	Accuracy	AUROC	AUPRC	Precision	Recall	F-Score	MCC (95% CI)
Top-3	Max voting	0.9295	0.9471	0.9412	0.9305	0.9295	0.9297	0.8923 (0.8679, 0.9167)
Top-3	Simple averaging	0.9279	0.9924	0.9863	0.9287	0.9279	0.9281	0.8898 (0.8652, 0.9144)
Top-3	Weighted averaging	0.9343	0.9925	0.9865	0.9345	0.9343	0.9338	0.8996 (0.876, 0.9232)
Top-3	Stacking	0.9263	0.99	0.9831	0.9284	0.9263	0.9269	0.8877 (0.8629, 0.9125)
Top-5	Max voting	0.9327	0.9495	0.9439	0.9334	0.9327	0.9327	0.8972 (0.8733, 0.9211)
Top-5	Simple averaging	0.9295	0.9923	0.9863	0.9311	0.9295	0.9298	0.8926 (0.8683, 0.9169)
Top-5	Weighted averaging	0.9359	0.9925	0.9865	0.9375	0.9359	0.9363	0.9024 (0.8791, 0.9157)
Top-5	Stacking	0.9279	0.9873	0.9801	0.9303	0.9279	0.9286	0.8903 (0.8657, 0.9149)

The values in parentheses denote the 95% CI measured as the exact Clopper–Pearson interval for the MCC metric. Bold numerical values denote superior performance in respective columns.

Table 3 shows that the prediction-level ensembles constructed using the top-3 and top-5 performing models demonstrated higher values for the F-score than for the MCC metric, for the reasons discussed before. The weighted averaging ensemble of the top-5 performing models, using the optimal weights [0.40560531, 0.192276399, 0.00356809023, 0.3985502, 1.10927275e-16] calculated using the SLSQP method, achieved superior performance compared to the other ensembles. The 95% CI obtained using the MCC metric demonstrated a tighter error margin, and hence higher precision, compared to the other ensemble methods. However, we observed no statistically significant difference (p > 0.05) in performance across the ensemble methods. Fig 4 shows the confusion matrix, AUROC, and AUPRC curves achieved using the top-5 weighted averaging ensemble.
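To make the weighted averaging step concrete, the sketch below applies the five reported weights to one set of class probabilities. The probabilities and the helper name `weighted_average` are invented for illustration, and this section does not state which weight belongs to which model, so the row order is an assumption. The SLSQP search that produced the weights is not reproduced here.

```python
# Weighted averaging of class probabilities with the top-5 weights reported above.
# The per-model probabilities are invented for illustration only, and the
# assignment of weights to rows is an assumption.

WEIGHTS = [0.40560531, 0.192276399, 0.00356809023, 0.3985502, 1.10927275e-16]

def weighted_average(prob_rows, weights):
    """Return sum_m w_m * p_m for each class (weights assumed to sum to ~1)."""
    n_classes = len(prob_rows[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_rows))
            for c in range(n_classes)]

probs = [  # one row per model: [normal, bacterial, viral]
    [0.05, 0.55, 0.40],
    [0.10, 0.30, 0.60],
    [0.02, 0.08, 0.90],
    [0.04, 0.60, 0.36],
    [0.01, 0.04, 0.95],
]

combined = weighted_average(probs, WEIGHTS)
predicted = max(range(len(combined)), key=lambda c: combined[c])
print("sum of weights:", sum(WEIGHTS))
print("combined probs:", [round(v, 4) for v in combined])
print("predicted class index:", predicted)
```

The reported weights sum to approximately 1, and the third and fifth entries are close to zero, so the combination is effectively driven by the remaining three models.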

Fig 4

Confusion matrix, AUROC, and AUPRC curves obtained by the weighted averaging ensemble of the top-5 performing models.

Recall that the model-level ensembles are constructed using the top-K (K = 3, 5) models by instantiating them with their trained weights and truncating them at their deepest convolutional layers. The feature maps from these layers are concatenated and followed by a 1×1 convolutional layer for feature dimensionality reduction. In our study, the feature maps of the deepest convolutional layers of the models have [16, 16, 512] dimensions. Hence, after concatenation, the feature maps for the top-3 models have [16, 16, 1536] dimensions, and those for the top-5 models have [16, 16, 2560] dimensions. We used 1×1 convolutions to reduce these dimensions to [16, 16, 512]. The 1×1 convolutional layer is followed by a GAP layer and a dense layer with three neurons to classify the CXRs into normal, bacterial pneumonia, or viral pneumonia categories. Table 4 shows the classification performance achieved in this regard. We observed no statistically significant difference (p > 0.05) in performance between the top-3 and top-5 model-level ensembles. We further performed a weighted averaging of the predictions of the top-3 and top-5 model-level ensembles, using the optimal weights [0.3764, 0.6236] calculated with the SLSQP method to improve performance. Fig 5 shows the confusion matrix, AUROC, and AUPRC curves obtained by the weighted averaging ensemble using the predictions of the top-3 and top-5 model-level ensembles. We observed that this ensemble approach demonstrated superior performance for all metrics compared to the individual models and all ensemble methods discussed in this study.
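The data flow of the model-level ensemble head (concatenate, 1×1 convolution, GAP, three-way dense layer) can be sketched in plain Python on toy-sized tensors. This is only a shape-level illustration under stated assumptions: the weights are random rather than trained, the spatial and channel sizes are shrunk from 16 × 16 × 512 to 4 × 4 × 8, no activation is applied after the 1×1 convolution, and a softmax output is assumed. All function names are hypothetical.

```python
import math
import random

random.seed(0)

def rand_map(h, w, c):
    """Random H x W x C feature map (stand-in for a truncated model's output)."""
    return [[[random.random() for _ in range(c)] for _ in range(w)] for _ in range(h)]

def concat_channels(maps):
    """Concatenate feature maps of equal H x W along the channel axis."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum((m[i][j] for m in maps), []) for j in range(w)] for i in range(h)]

def conv1x1(fmap, kernel, bias):
    """1x1 convolution: the same linear map (c_in -> c_out) at every pixel."""
    c_out = len(bias)
    return [[[sum(x * kernel[k][o] for k, x in enumerate(px)) + bias[o]
              for o in range(c_out)] for px in row] for row in fmap]

def global_average_pool(fmap):
    """Average each channel over all spatial positions."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [sum(fmap[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]

def dense_softmax(vec, weights, bias):
    """Dense layer followed by a numerically stable softmax (assumed output)."""
    logits = [sum(v * weights[k][o] for k, v in enumerate(vec)) + bias[o]
              for o in range(len(bias))]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shape(fmap):
    return [len(fmap), len(fmap[0]), len(fmap[0][0])]

# Toy sizes; the study reports 16 x 16 x 512 maps per model.
H, W, C, N_MODELS, N_CLASSES = 4, 4, 8, 3, 3

maps = [rand_map(H, W, C) for _ in range(N_MODELS)]
stacked = concat_channels(maps)                       # [4, 4, 24]
kernel = [[random.uniform(-0.1, 0.1) for _ in range(C)] for _ in range(N_MODELS * C)]
reduced = conv1x1(stacked, kernel, [0.0] * C)         # [4, 4, 8]
pooled = global_average_pool(reduced)                 # 8 values
dense_w = [[random.uniform(-0.1, 0.1) for _ in range(N_CLASSES)] for _ in range(C)]
probs = dense_softmax(pooled, dense_w, [0.0] * N_CLASSES)

print(shape(stacked), shape(reduced), len(pooled), [round(p, 4) for p in probs])
```

At the sizes reported in the text, the same bookkeeping gives 3 × 512 = 1536 and 5 × 512 = 2560 concatenated channels, both reduced to 512 by the 1×1 convolution.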

Table 4

Classification performance achieved by model-level ensembles.

Method	Accuracy	AUROC	AUPRC	Precision	Recall	F-Score	MCC (95% CI)
Top-3	0.9327	0.9933	0.9881	0.9334	0.9327	0.933	0.897 (0.8731, 0.9209)
Top-5	0.9359	0.9928	0.9872	0.9365	0.9359	0.936	0.9019 (0.8785, 0.9253)
Weighted averaging	0.9391	0.9933	0.9881	0.9396	0.9391	0.9392	0.9068 (0.8839, 0.9297)

The values in parentheses denote the 95% CI measured as the exact Clopper–Pearson interval for the MCC metric.

Fig 5

Confusion matrix, AUROC, and AUPRC curves obtained through the weighted averaging ensemble of the predictions of the top-3 and top-5 model-level ensembles.

Table 5 compares the performance achieved with (i) the weighted averaging ensemble of top-3 and top-5 model-level predictions and (ii) the SOTA literature.

Table 5

Comparison of the proposed approach with the SOTA literature.

Study	Accuracy	AUROC	AUPRC	Precision	Recall	F-Score	MCC (95% CI)
Kermany et al. [6]	NA	NA	NA	NA	NA	NA	NA
Rajaraman et al. [24]	0.918	0.939	NA	0.92	0.9	0.91	0.87 (0.8436, 0.8964)
Proposed	0.9391	0.9933	0.9881	0.9396	0.9391	0.9392	0.9068 (0.8839, 0.9297)

The values in parentheses denote the 95% CI measured as the exact Clopper–Pearson interval for the MCC metric.

The authors of [6], who released the pediatric pneumonia CXR dataset, performed binary classification to classify the CXRs as showing normal lungs or other abnormal manifestations. To the best of our knowledge, only the authors of [24] performed a multi-class classification using the train and test splits released by the authors of [6]. We observed that the MCC metric achieved by the weighted averaging ensemble of top-3 and top-5 model-level predictions is significantly superior (p < 0.05) to the MCC metric reported in the literature [24].
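For readers unfamiliar with the exact Clopper–Pearson interval mentioned in the table notes, the following sketch computes it for a generic binomial proportion of k successes out of n trials by bisection on the binomial tail probabilities. It is an illustration only: this section does not state how the MCC value is mapped onto a binomial count or which n was used, so the inputs below (90 out of 100) are hypothetical, as are the function names.

```python
import math

def _log_pmf(i, n, p):
    """log P(X = i) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return min(1.0, sum(math.exp(_log_pmf(i, n, p)) for i in range(k + 1)))

def _solve_increasing(f, iters=100):
    """Bisection root of a function that increases from negative to positive on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k / n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound: P(X >= k | p) = alpha / 2; defined as 0 when k == 0.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2; defined as 1 when k == n.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

# Hypothetical example: 90 successes in 100 trials.
low, high = clopper_pearson(90, 100)
print("95% Clopper-Pearson interval for 90/100:", round(low, 4), round(high, 4))
```

Unlike a normal-approximation interval, the exact interval is generally not symmetric around k / n. For a proportion above 0.5, the lower arm is the longer one.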

Disease ROI localization

We used Grad-CAM tools [32] to localize the disease-manifested ROIs and to ensure that the models learned meaningful features. Fig 6 shows instances of pediatric CXRs with expert ground truth annotations for bacterial and viral pneumonia manifestations, together with the Grad-CAM localizations of the top-5 performing models and the top-5 model-level ensemble. Fig 6 shows that the classification models trained using the existing and proposed loss functions, as well as the top-5 model-level ensemble, highlighted the ROIs showing disease manifestations. The highest activations, observed as the hottest regions in the heatmap, contribute the most to the models’ decisions in classifying the CXRs into their respective categories.
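The core Grad-CAM computation can be sketched on toy data as follows: each channel weight is the spatial mean of the gradient of the class score with respect to that feature map, and the heatmap is the ReLU of the weighted sum of the feature maps. The activations and gradients below are hand-written numbers (in practice the gradients come from an automatic differentiation framework), the final rescaling to [0, 1] is a common visualization convention assumed here, and the function name `grad_cam` is hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM on toy data.

    activations, gradients: lists shaped [K][H][W] for one class score.
    Returns (alphas, heatmap), where alpha_k is the spatial mean of the
    gradients of channel k and heatmap = ReLU(sum_k alpha_k * A_k),
    rescaled so its maximum is 1 (when the maximum is positive).
    """
    k_maps = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    heat = [[max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(k_maps)))
             for j in range(w)] for i in range(h)]
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return alphas, heat

# Two 2 x 2 feature maps and their (hand-written) gradients.
A = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
G = [
    [[0.5, 0.5], [0.5, 0.5]],      # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel 1 argues against it
]

alphas, heatmap = grad_cam(A, G)
print("channel weights:", alphas)
print("heatmap:", heatmap)
```

Positions dominated by the negatively weighted channel are clipped to zero by the ReLU, so only regions that support the class remain hot.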

Fig 6

Grad-CAM-based localization of the disease ROIs.

(a) and (h) denote instances of CXR with expert annotations showing bacterial and viral pneumonia manifestations, respectively. The sub-parts (b), (c), (d), (e), (f), and (g) show Grad-CAM-based ROI localization achieved using the models trained with calibrated CCE, CCE with entropy-based regularization, calibrated negative entropy, label-smoothed categorical focal, calibrated categorical Hinge loss functions, and the top-5 model-level ensemble, respectively, highlighting regions of bacterial pneumonia manifestations. The sub-parts (i), (j), (k), (l), (m), and (n) show the localization achieved using the models in the same order as above, highlighting viral pneumonia manifestations.


Discussion and conclusions

While several studies [33, 34] report using the pediatric pneumonia CXR dataset [6] in a binary classification setting, only the authors of [24] trained models for a multi-class classification task. Further, the studies in [33, 34] used ImageNet-pretrained models, rather than a CXR modality-specific pretrained model, to transfer knowledge to a target CXR classification task. Such transfer of knowledge may not be relevant since the characteristics of natural images are distinct from those of medical images. In this work, we propose to resolve these issues by transferring knowledge from a CXR modality-specific pretrained model to improve performance in a relevant CXR classification task. We trained the models using existing loss functions and also proposed several loss functions. Our experimental results showed that the model trained to minimize the calibrated CCE loss demonstrated superior values for all metrics. This performance is followed by that of the models trained to minimize the proposed losses, such as CCE with entropy-based regularization, calibrated negative entropy, label-smoothed categorical focal, and calibrated categorical Hinge loss. We evaluated the performance of both prediction-level and model-level ensembles. We observed from the experiments that the model-level ensembles demonstrated markedly better performance than the prediction-level ensembles. We further improved performance by (i) deriving optimal weights using the SLSQP method, and (ii) using the derived weights to perform weighted averaging of the predictions of the top-3 and top-5 model-level ensembles. We observed that the weighted averaging ensemble demonstrated superior performance for all metrics compared to the individual models, their ensembles, and the SOTA literature. Finally, we used Grad-CAM-based visualization tools to interpret the learned weights in the individual models and model-level ensembles. We observed that these models precisely localized the ROIs showing disease manifestations, confirming the expert’s knowledge of the problem. Our study combined the benefits of (i) performing CXR modality-specific knowledge transfer, (ii) proposing loss functions that delivered superior classification performance in a multi-class classification setting, and (iii) constructing prediction-level and model-level ensembles to achieve SOTA performance, as shown in Table 5. However, there are a few limitations to this study. For example, novel loss functions could be proposed for classification tasks to train models and their ensembles. Other ensemble methods such as blending and snapshot ensembles could also be attempted to improve performance. It is becoming increasingly viable to deploy ensemble models in real-time for image and video analysis with the advent of low-cost computation, storage solutions, and cloud technology [35]. The methods proposed in this study could be extended to the classification and detection of cardiopulmonary abnormalities [36] including COVID-19, TB, cardiomegaly, and lung nodules, among others.

8 Nov 2021

PONE-D-21-32872
Deep model ensembles with novel loss functions for multi-class medical image classification
PLOS ONE

Dear Dr. Rajaraman,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

ACADEMIC EDITOR: Based on the comments from the reviewers and my own observation I recommend major revisions.

Please submit your revised manuscript by Dec 23 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org.
We look forward to receiving your revised manuscript.

Kind regards,
Thippa Reddy Gadekallu
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.
The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

3. Thank you for stating the following financial disclosure:

"This study is supported by the Intramural Research Program (IRP) of the National Library of Medicine (NLM) and the National Institutes of Health (NIH). The intramural research scientists (authors) at the NIH dictated study design, data collection, data analysis, decision to publish and preparation of the manuscript."

We note that one or more of the authors is affiliated with the funding organization, indicating the funder may have had some role in the design, data collection, analysis or preparation of your manuscript for publication; in other words, the funder played an indirect role through the participation of the co-authors. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please do the following:

a. Review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. These amendments should be made in the online form.

b. Confirm in your cover letter that you agree with the following statement, and we will change the online submission form on your behalf: “The funder provided support in the form of salaries for authors SR, GZ and SA, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"This study is supported by the Intramural Research Program (IRP) of the National Library of Medicine (NLM) and the National Institutes of Health (NIH)."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"This study is supported by the Intramural Research Program (IRP) of the National Library of Medicine (NLM) and the National Institutes of Health (NIH). The intramural research scientists (authors) at the NIH dictated study design, data collection, data analysis, decision to publish and preparation of the manuscript."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Additional Editor Comments:

The authors are suggested to address all the comments carefully. The authors can cite the references suggested by the reviewers only if they are relevant and strengthen the references section.
Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: No
Reviewer #3: Partly

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No
Reviewer #2: I Don't Know
Reviewer #3: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No
Reviewer #2: No
Reviewer #3: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: No
Reviewer #3: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The proposed work presents a deep model ensembling approach with loss functions for multiclass image classification. There have been numerous research works is already conducted in this domain. However, the approach brings some interesting discussions about the application of loss functions in medical images. However, the following revisions are required.

• The title of the paper needs revision. The length of the paper is very long, and I recommend authors to focus on the essential parts and discard the basic stuff such as what is ensemble deep learning, what is statistical analysis, etc.

• Secondly, the manuscript should be focused on the loss functions for multiclass image classification. But the paper discusses so many other things, such as disease classification details. If authors want to keep these contents, then organize the manuscript so that it is easy for readers to read. I recommend, authors to revise the entire manuscript with the focus on “with loss functions for multiclass image classification.”

• The literature review carried out for the proposed work is not up to date. The proposed research demands the referral of some of the latest research works published recently, such as “ReCognizing SUspect and PredictiNg ThE SpRead of Contagion Based on Mobile Phone LoCation DaTa (COUNTERACT): A System of identifying COVID-19 infectious and hazardous sites, detecting disease outbreaks based on the internet of things, edge computing, and artificial intelligence”, “Histogram of Oriented Gradient-Based Fusion of Features for Human Action Recognition in Action Video Sequences”

Reviewer #2: The work lacks novelty. Literature review is poor. Proposed work should be described clearly with clear diagram/algorithm along with discussion. Introduction section and conclusion need to be revised. I strongly suggest the authors to format the content and structure of the paper before submission. This article does not look like a research paper, just like a manual. Simultaneously, the figures included in this paper are not obvious and casual. The authors should refer to some related papers in some venue for revision.

Reviewer #3: Abstract should be concise yet. But should give complete overview of the work and study. Abstract should reflect the background knowledge on the problem addressed need to be added. Abstract should reflect the wide range of applications and its possible solutions need to be added. In Introduction section, the drawbacks of each conventional technique should be described clearly. Introduction section can be extended to add the issues in the context of the existing work What is the motivation of the proposed work? Literature review techniques have to be strengthened by including the issues in the current system and how the author proposes to overcome the same Research gaps, objectives of the proposed work should be clearly justified. The writing of the paper needs a lot of improvement in terms of grammar, spellings, and presentations. The paper needs careful English polishing since there are many typos and poorly written sentences. Authors can use latest related works from reputed journals like IEEE/ACM Transactions, MDPI, Elsevier, Inderscience, Springer, Taylor & Francis etc. and write the references in proper format, from year 2020-21. The authors seem to disregard or neglect some important finding in results that have been achieved in paper. So, elaborate and explain the results in more details. Improve the results and discussion section in paragraph. The conclusion should state scope for future work.

6.
PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No
Reviewer #3: No

19 Nov 2021

Response to the Editor:

We render our sincere thanks to the Editor for arranging peer review and encouraging resubmission of our manuscript. To the best of our knowledge and belief, we have addressed the concerns of the Editor and the reviewers in the revised manuscript.

Q1: Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.

Author response: We have formatted the manuscript per the templates recommended by the Editor.
Q2: We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section. Thank you for stating the following financial disclosure: "This study is supported by the Intramural Research Program (IRP) of the National Library of Medicine (NLM) and the National Institutes of Health (NIH). The intramural research scientists (authors) at the NIH dictated study design, data collection, data analysis, decision to publish, and preparation of the manuscript." We note that one or more of the authors is affiliated with the funding organization, indicating the funder may have had some role in the design, data collection, analysis, or preparation of your manuscript for publication; in other words, the funder played an indirect role through the participation of the co-authors. If the funding organization did not play a role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please do the following:

a. Review your statements relating to the author contributions and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. These amendments should be made in the online form.

b. Confirm in your cover letter that you agree with the following statement, and we will change the online submission form on your behalf: “The funder provided support in the form of salaries for authors SR, GZ, and SA, but did not have any additional role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

Author response: All authors of this manuscript are employed by the National Library of Medicine. Our research is supported by the Intramural Research Program (IRP) of the National Library of Medicine (NLM) and the National Institutes of Health (NIH). We do not have a specific grant number. All authors reviewed the contributions listed in the manuscript. We hereby agree to include the following statements under the “Funding Information and Financial Disclosure” sections in the online submission form. “This study is supported by the Intramural Research Program (IRP) of the National Library of Medicine (NLM) and the National Institutes of Health (NIH). The funder provided support in the form of salaries for authors SR, GZ, and SA, but did not have any additional role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

Q3: Thank you for stating the following in the Acknowledgments Section of your manuscript: “This study is supported by the Intramural Research Program (IRP) of the National Library of Medicine (NLM) and the National Institutes of Health (NIH).” We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement.

Author response: We have removed the Acknowledgment section (and included text) per the Editor’s recommendation.

Response to Reviewer #1:

We render our sincere thanks to the reviewer for the valuable comments and appreciation of our study.
To the best of our knowledge and belief, we have addressed the reviewer’s concerns.

Q1: Is the manuscript technically sound, and do the data support the conclusions? Yes; Has the statistical analysis been performed appropriately and rigorously? No; Have the authors made all data underlying the findings in their manuscript fully available? No; Is the manuscript presented in an intelligible fashion and written in standard English? Yes.

Author response: We wish to reiterate, as indicated in the manuscript, that the data used in this study is publicly available without restriction. The details of the data and their availability are discussed under the Materials and methods section. We compared models’ performance and reported statistical significance in the results. We computed the binomial confidence intervals as the exact Clopper–Pearson interval for the MCC metric to analyze statistical significance. These results are comprehensively discussed in the revised manuscript.

Q2: The proposed work presents a deep model ensembling approach with loss functions for multiclass image classification. There have been numerous research works is already conducted in this domain. However, the approach brings some interesting discussions about the application of loss functions in medical images.

Author response: We sincerely thank the reviewer for the words of appreciation. The study aims to compare multi-class classification performance using the models trained on existing and novel loss functions proposed in this study. We propose several loss functions including cross-entropy loss, negative entropy loss, KL divergence loss, categorical focal loss, and categorical hinge loss, each added with a calibration component (to penalize overconfident/underconfident predictions) and a cross-entropy loss with entropy-based regularization. We demonstrate that, compared to using the de-facto cross-entropy loss function, the proposed loss functions demonstrated superior performance toward this classification task. We further improved performance by constructing prediction- and model-level ensembles. In the process, we obtained state-of-the-art performance in classifying the pediatric CXR dataset into normal, bacterial pneumonia, and viral pneumonia classes.

Q3: However, the following revisions are required. The title of the paper needs revision.

Author response: Agreed. The title is modified as “Novel loss functions for ensemble-based medical image classification” to make it simpler and convey clarity.

Q4: The length of the paper is very long, and I recommend authors to focus on the essential parts and discard the basic stuff such as what is ensemble deep learning, what is statistical analysis, etc.

Author response: Thanks for these insightful comments. Indeed, addressing this comment has helped improve readability in the revised manuscript. The following changes are made to the revised manuscript: (i) Discussions regarding CXR modality-specific knowledge transfer, deep ensemble learning, and statistical analysis are removed from the introduction section for redundancy. (ii) Discussions regarding the existing segmentation losses, segmentation evaluation metrics, and existing classification loss functions are removed but adequate references are provided. (iii) The polar plots are removed for redundancy.

Q5: Secondly, the manuscript should be focused on the loss functions for multiclass image classification. But the paper discusses so many other things, such as disease classification details. If authors want to keep these contents, then organize the manuscript so that it is easy for readers to read. I recommend, authors to revise the entire manuscript with the focus on “with loss functions for multiclass image classification.”

Author response: We sincerely thank the reviewer for these valuable comments.
Addressing Q4 has helped remove redundant information and improve readability. However, this systematic study includes several steps toward attaining state-of-the-art performance in classifying the pediatric CXR data which provides an objective way of evaluating the benefits of this method. These steps include (i) performing lung segmentation to prevent learning irrelevant features in the background, (ii) training and evaluating models with the existing and proposed loss functions, (iii) improving performance by constructing prediction- and model-level ensembles, and (iv) using visualization tools to interpret the learned behavior of the models and the ensemble. We illustrate these steps in Fig. 1 and Fig. 2. We observed that the weighted averaging of the predictions of the top-3 and top-5 model-level ensembles obtained state-of-the-art performance using the pediatric CXR data.

Q6: The literature review carried out for the proposed work is not up to date. The proposed research demands the referral of some of the latest research works published recently, such as “ReCognizing SUspect and PredictiNg ThE SpRead of Contagion Based on Mobile Phone LoCation DaTa (COUNTERACT): A System of identifying COVID-19 infectious and hazardous sites, detecting disease outbreaks based on the internet of things, edge computing, and artificial intelligence”, “Histogram of Oriented Gradient-Based Fusion of Features for Human Action Recognition in Action Video Sequences”

Author response: Thanks. We have cited the COVID-19 study per the reviewer’s suggestions.

[35] Ghayvat H, Awais M, Gope P, Pandya S, Majumdar S. 2021. Recognizing suspect and predicting the spread of contagion based on mobile phone location data (counteract): a system of identifying covid-19 infectious and hazardous sites, detecting disease outbreaks based on the internet of things, edge computing, and artificial intelligence. Sustainable Cities and Society 69(12):102798

Response to Reviewer #2: We thank the reviewer for the valuable comments on this study.

Q1: The work lacks novelty. Literature review is poor. Proposed work should be described clearly with clear diagram/algorithm along with discussion. Introduction section and conclusion need to be revised.

Author response: The principal limitation of the de-facto cross-entropy loss is that it asserts equal learning from all the classes. This adversely impacts training and classification performance during class-imbalanced training. This holds for medical images, particularly CXRs, where a class imbalance exists between the majority normal class and other minority disease classes. Although the choice of the loss function impacts model performance, to the best of our knowledge, we observed that no literature exists that performs a comprehensive analysis and selection of an appropriate loss function toward the classification task under study. The contribution of this study includes a comprehensive statistical evaluation of several existing and proposed loss functions toward a medical image classification task. This guides the researchers regarding making an appropriate selection of a loss function for the task under study. The proposed loss functions could be applied for binary, multi-class, and multi-label classification tasks. We further improve classification performance by constructing an ensemble of models trained with the existing and proposed loss functions. In the process, we observed that the ensemble delivered superior performance compared to the individual models. We made sure to include relevant references from the current year. The references are formatted per PLOS ONE requirements. The citations include those published in reputed journals like IEEE, Elsevier, Springer, and MDPI. The proposed work is briefly discussed in the introduction in lines 80 – 96. Fig. 1 and Fig. 2 illustrate the steps involved in this systematic study. The introduction section has been revised to remove contents for redundancy. The conclusion discusses the benefits and limitations of the current study and the scope for future study.

Q2: I strongly suggest the authors to format the content and structure of the paper before submission. This article does not look like a research paper, just like a manual. Simultaneously, the figures included in this paper are not obvious and casual. The authors should refer to some related papers in some venue for revision.

Author response: Thanks for these insightful comments. We have made the following changes to the manuscript to improve readability.
(i) Discussions regarding CXR modality-specific knowledge transfer, deep ensemble learning, and statistical analysis are removed from the introduction section for redundancy;
(ii) Discussions regarding the existing segmentation losses, segmentation evaluation metrics, and existing classification loss functions are removed but adequate references are provided;
(iii) The polar plots are removed for redundancy;
(iv) We made sure to include relevant references from the current year (2021). The references are formatted per PLOS ONE requirements. The citations include those published in reputed journals from publishers such as IEEE, Elsevier, Springer, and MDPI.
(v) The figures (Fig. 1 and Fig. 2) illustrate the steps involved in this systematic study. Fig. 3, Fig. 4, and Fig. 5 illustrate the performances (in terms of AUROC, AUPRC, and confusion matrix) obtained by the individual models and the prediction- and model-level ensembles. Fig. 6 illustrates Grad-CAM-based localization of the disease ROIs achieved using the trained models and the ensemble. This provides a qualitative analysis of the learned behavior by the individual models and the ensemble.

Response to Reviewer #3: We thank the reviewer for the appreciative and constructive comments on this study.
Q1: Abstract should be concise yet. But should give complete overview of the work and study. Abstract should reflect the background knowledge on the problem addressed need to be added. Abstract should reflect the wide range of applications and its possible solutions need to be added.

Author response: Thanks for these insightful comments. We confirmed that the abstract does not exceed the 300 words count as recommended in the PLOS ONE submission guidelines. We modified the abstract to include the background knowledge about the problem and the proposed solution. The revised abstract is given below. Note that while we provide the link for our code, we will open the site only after the manuscript is published.

Medical images commonly exhibit multiple abnormalities. Predicting them requires multi-class classifiers whose training and desired reliable performance can be affected by a combination of factors, such as, dataset size, data source, distribution, and the loss function used to train deep neural networks. Currently, the cross-entropy loss remains the de-facto loss function for training deep learning classifiers. This loss function, however, asserts equal learning from all classes, leading to a bias toward the majority class. Although the choice of the loss function impacts model performance, to the best of our knowledge, we observed that no literature exists that performs a comprehensive analysis and selection of an appropriate loss function toward the classification task under study. In this work, we benchmark various state-of-the-art loss functions, critically analyze model performance, and propose improved loss functions for a multi-class classification task. We select a pediatric chest X-ray (CXR) dataset that includes images with no abnormality (normal), and those exhibiting manifestations consistent with bacterial and viral pneumonia. We construct prediction-level and model-level ensembles to improve classification performance. Our results show that compared to the individual models and the state-of-the-art literature, the weighted averaging of the predictions for top-3 and top-5 model-level ensembles delivered significantly superior classification performance (p < 0.05) in terms of MCC (0.9068, 95% confidence interval (0.8839, 0.9297)) metric. Finally, we performed localization studies to interpret model behavior and confirm that the individual models and ensembles learned task-specific features and highlighted disease-specific regions of interest. The code is available at https://github.com/sivaramakrishnan-rajaraman/multiloss_ensemble_models.

Q2: In Introduction section, the drawbacks of each conventional technique should be described clearly. Introduction section can be extended to add the issues in the context of the existing work

Author response: Thanks for these comments. The drawbacks of using the de-facto cross-entropy loss function for model training and the need to propose novel loss functions are described in lines 49 – 68. The need for ensemble learning applied is discussed in lines 69 – 79. A brief overview of the proposed methodology is mentioned in lines 80 – 96 in the revised manuscript. The merits, limitations, and scope for future work are discussed in lines 429 – 459.

Q3: What is the motivation of the proposed work?

Author response: The principal limitation of the de-facto cross-entropy loss is that it asserts equal learning from all the classes. This adversely impacts training and classification performance during class-imbalanced training. This holds for medical images, particularly CXRs, where a class imbalance exists between the majority normal class and other minority disease classes. Although the choice of the loss function impacts model performance, to the best of our knowledge, we observed that no literature exists that performs a comprehensive analysis and selection of an appropriate loss function toward the classification task under study.
The contribution of this study includes a comprehensive statistical evaluation of several existing and proposed loss functions toward a medical image classification task. We further improve performance by constructing an ensemble of models trained with diverse loss functions. We observed that, unlike individual models, the weighted averaging of the predictions of top-3 and top-5 model-level ensembles delivered superior performance toward this task. This underscores that an ensemble of models trained with diverse loss functions improves performance compared to using individual models. We demonstrated these results with statistical significance analysis.

Q4: Literature review techniques have to be strengthened by including the issues in the current system and how the author proposes to overcome the same. Research gaps, objectives of the proposed work should be clearly justified.

Author response: The issues with training the models with the de-facto cross-entropy loss function are discussed in lines 49 – 68. Considering class-imbalanced classification tasks that are common in medical images, using the cross-entropy loss and asserting equal learning to all classes would lead to a biased estimate of the performance. To overcome these limitations, the authors of [8] proposed the focal loss function that down weights the majority class and improves the learning of the minority class. Aside from the literature discussed in lines 49 – 68, the literature does not include a comprehensive study that investigates the effects of loss functions on medical image classification, particularly using CXRs. This study aims to provide a comprehensive analysis of using the existing and proposed loss functions to improve performance in a multi-class CXR classification task. We further improved performance through constructing ensembles of models trained with various loss functions. This systematic procedure is discussed in lines 80 – 96 in the revised manuscript.
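
As an illustration of the focal loss idea referred to in the response to Q4, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard formulation FL = -alpha (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. The function names, the defaults gamma = 2.0 and alpha = 1.0, and the toy probabilities are assumptions chosen for illustration. In this form the modulating factor shrinks the loss of confidently classified samples, which in class-imbalanced data tend to belong to the majority class.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Per-sample cross-entropy for one-hot labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_prob), axis=1)


def categorical_focal_loss(y_true, y_prob, gamma=2.0, alpha=1.0, eps=1e-7):
    """Per-sample focal loss: cross-entropy scaled by (1 - p_t) ** gamma.

    gamma and alpha are illustrative defaults (assumptions), not values
    taken from the study discussed above.
    """
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    p_t = np.sum(y_true * y_prob, axis=1)  # probability assigned to the true class
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)


# Toy example: sample 0 is classified confidently, sample 1 poorly.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
y_prob = np.array([[0.9, 0.05, 0.05],
                   [0.5, 0.3, 0.2]])

ce = categorical_cross_entropy(y_true, y_prob)
fl = categorical_focal_loss(y_true, y_prob)

# Ratio of focal loss to cross-entropy equals (1 - p_t) ** gamma per sample.
print("cross-entropy:", ce)
print("focal loss:   ", fl)
print("ratio:        ", fl / ce)
```

With gamma = 2, the confidently classified sample keeps only 1% of its cross-entropy, while the poorly classified one keeps 64%, so training gradients are dominated by the harder samples.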
We observed that the models trained with the proposed loss functions delivered superior classification performance compared to the model trained on the de-facto cross-entropy loss function. The ensemble of the models trained with diverse loss functions achieved state-of-the-art performance using the pediatric CXR data used in this study.

Q5: The writing of the paper needs a lot of improvement in terms of grammar, spellings, and presentations. The paper needs careful English polishing since there are many typos and poorly written sentences.

Author response: Thanks for these comments. We made sure to rectify the typos and grammatical errors and the revised manuscript has been proofread by a native English speaker.

Q6: Authors can use latest related works from reputed journals like IEEE/ACM Transactions, MDPI, Elsevier, Inderscience, Springer, Taylor & Francis etc. and write the references in proper format, from year 2020-21.

Author response: Thanks for these insightful comments. The revised manuscript includes several citations from the current year 2021. The references are formatted per PLOS ONE requirements. The citations include those published in reputed journals like IEEE, Elsevier, Springer, and MDPI.

Q7: The authors seem to disregard or neglect some important finding in results that have been achieved in paper. So, elaborate and explain the results in more details. Improve the results and discussion section in paragraph.

Author response: Thanks for these valuable comments. We made sure to discuss the results obtained in every step of this systematic study, with statistical significance analysis. We also performed qualitative analyses to interpret the learned behavior of the trained models and the ensemble.

Q8: The conclusion should state scope for future work.

Author response: Thanks. We have discussed the scope for future work in lines 453 – 459 in the revised manuscript.

Submitted filename: Response to Reviewers.docx
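
The weighted averaging of predictions mentioned in the responses above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the function name, the three hypothetical models, the class order (normal, bacterial pneumonia, viral pneumonia), and the weights are all invented for the example, and this correspondence does not describe how the actual ensemble weights were obtained.

```python
import numpy as np


def weighted_average_ensemble(prob_list, weights):
    """Combine per-model class probabilities with a normalized weighted average.

    prob_list: list of arrays, each of shape (n_samples, n_classes)
    weights:   one non-negative weight per model (hypothetical values here)
    """
    probs = np.stack(prob_list, axis=0)   # (n_models, n_samples, n_classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize so the weights sum to 1
    return np.tensordot(w, probs, axes=1)  # (n_samples, n_classes)


# Hypothetical softmax outputs of three models for two images.
# Assumed column order: normal, bacterial pneumonia, viral pneumonia.
model_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
model_b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
model_c = np.array([[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]])

ensemble_probs = weighted_average_ensemble(
    [model_a, model_b, model_c], weights=[0.5, 0.3, 0.2]
)
ensemble_labels = ensemble_probs.argmax(axis=1)

print(ensemble_probs)
print(ensemble_labels)
```

In this toy case the ensemble assigns the second image to the third class even though the most heavily weighted model alone would pick the second class, which shows how the combined prediction can differ from any single model's prediction.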
1 Dec 2021

Novel loss functions for ensemble-based medical image classification

PONE-D-21-32872R1

Dear Dr. Rajaraman,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Thippa Reddy Gadekallu
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #3: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Authors have addressed all the concerns. The research work should be shared with the science community.
Reviewer #3: All the comments made by the reviewers are addressed well by the authors. No further comments are required.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Sharnil Pandya
Reviewer #3: No

21 Dec 2021

PONE-D-21-32872R1

Novel loss functions for ensemble-based medical image classification

Dear Dr. Rajaraman:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Thippa Reddy Gadekallu
Academic Editor
PLOS ONE
