Osteoarthritis (OA) is a common degenerative joint disease that may lead to disability in severe cases [1]. Although OA is not lethal, it profoundly affects mobility [2] and the patient's quality of life. The incessant breakdown of cartilage and continuous bone deformation are the main causes of joint failure. Patients with severe (end-stage) OA experience excruciating pain as the joint cartilage degenerates and causes bone-to-bone friction during movement. Arthroplasty, or total knee replacement, is the last option available for knee OA patients to regain their mobility. However, this clinical procedure is invasive and costly. Therefore, diagnosing OA at an early stage is crucial for clinical intervention in halting disease progression and mitigating disability in later stages.

Magnetic Resonance Imaging (MRI) is a safe, non-ionizing imaging technique for visualizing the knee joint's internal derangement, especially in determining OA features of the asymptomatic uninjured knee [3]. The main advantage of MRI over traditional radiography is its capability to evaluate structural changes during disease progression [4] and provide biomarkers for early OA diagnosis [5]. Degeneration of cartilage tissue is one of the main criteria for an early stage of OA as defined by Luyten et al. [6]. Thus, delineating cartilage in biomedical images is crucial because early detection of cartilage defects allows for early medical intervention and leads to better treatment [7-9].

In clinical practice, cartilage delineation is performed manually by a radiologist [2]. Manual delineation is not only time-consuming [2, 10] but also prone to inter- and intra-observer variability [11, 12]. In recent years, deep convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in biomedical image analysis, such as breast cancer analysis [13], bone disease prediction [14], and age assessment [15].
Unlike most conventional machine learning techniques such as fuzzy logic [16], bi-histogram equalization [17], and image registration [18], CNNs require no feature engineering but demand a substantial amount of annotated data and computation power. Fortunately, the computation requirement is no longer a challenge to CNN training with the advancement of graphics cards and cloud computing.

The encoder-decoder pair is the core component in most existing segmentation neural networks. The encoder harvests data into features, whereas the decoder decodes the features to perform pixel-based classification; the encoder is discriminative, whereas the decoder is generative. These encoder-decoder-based CNNs (EDCNNs) have reported remarkable achievements on natural scene images. The current study aims to examine the performance of various EDCNNs in delineating cartilage tissue from MR images.

The contributions of this study are as follows. First, we propose to group the various EDCNNs into different variations or families. To the best of our knowledge, our study is the first to provide a genealogical chart of EDCNNs. Second, we perform a benchmarking process and identify the best EDCNN for delineating knee cartilage tissue within MR images.

This paper is organized as follows. Section 2 briefly explains U-Net, the base version of EDCNN, and its variations; we group different EDCNNs into families according to their unique characteristics. Section 3 illustrates the methodology, including the datasets, the data pre-processing techniques applied, the specifications of model training, and the model assessment strategy. Section 4 evaluates the performance of EDCNNs by reporting the comparative results. Section 5 summarizes the conclusions and future work.
BASE AND VARIATIONS OF EDCNNS
The general architecture of an EDCNN has two paths: a contracting path for context capturing and an expanding path to localize features precisely. U-Net [19] is the first neural network to employ the encoder-decoder pairing scheme in a network design for the segmentation task, making it the first EDCNN. This architecture was inspired by the Fully Convolutional Network [20], a CNN that can perform pixel-wise classification. The encoding path of U-Net is built with repeating blocks, each containing 3×3 convolutional layers, rectified linear units (ReLU) [21], and a 2×2 max-pooling layer with a stride of 2. For each successive block, the feature map resolution is halved, whereas the number of feature channels is doubled. By contrast, the decoding path of U-Net contains up-convolution blocks that up-sample the feature maps while reducing the feature channels by half. Feature maps from the encoding path are concatenated to the decoding path after each respective down- and up-sampling process. These unique and symmetric paths yield a U-shaped architecture.
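As an illustration, one contracting-path stage of this design can be sketched in PyTorch. This is a minimal sketch, not the original U-Net code; the channel counts are our own example, and `padding=1` is used so the 3×3 convolutions preserve the spatial size:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One contracting-path stage: two 3x3 convolutions with ReLU,
    followed by 2x2 max-pooling with stride 2 (halving the resolution)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # padding keeps H x W
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.convs(x)          # kept for concatenation in the decoding path
        return self.pool(skip), skip  # pooled output feeds the next, deeper block
```

Feeding a 384×384 single-channel slice through `EncoderBlock(1, 64)` yields a 192×192 map with 64 channels plus the full-resolution skip tensor, matching the halve-resolution/double-channels pattern described above.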
Variations of EDCNNs
In this study, we refer to U-Net as the "Base" for EDCNNs while grouping its extensions into four variations: "Skip-Connections," "Weight-Initialized," "Auxiliary-Based," and "Cascaded," as shown in (Fig. ).

Base. The original U-Net has a major drawback: the output resolution of the final layer is not the same as that of the input image. The feature maps were cropped at each level of the contracting path because border pixels were lost during each convolution. To overcome this problem, we padded the feature maps in each of the convolution layers, ensuring that the output dimension is equivalent to the input size; the resulting network is referred to as UNetVanilla.

Skip-Connections. The degree of connectivity within a neural network determines the information flow from one layer to another. DenseNet [22] exploits the effects of shortcut connections by directly connecting all layers with one another and performing iterative concatenation of feature maps. The improvement in connectivity helps this network converge faster. Although DenseNet was created for the classification task, a segmentation version, namely, FC-DenseNet [23], was carefully extended from it. FC-DenseNet inherits the following advantages of DenseNet: parameter efficiency, implicit deep supervision, and feature reuse. FC-DenseNet mitigates the large number of parameters by only up-sampling the feature maps created in the preceding dense block. FC-DenseNet has three variants: FC-DenseNet56, FC-DenseNet67, and FC-DenseNet103, with 56, 67, and 103 layers, respectively. Unlike FC-DenseNet, LinkNet [24] provides a different type of linkage between encoder and decoder: the input of each encoder layer is bypassed to the output of the corresponding decoder. This approach aims to recover the lost spatial information that can be utilized by the decoder and its up-sampling operations. Moreover, the decoder uses fewer parameters, as the decoders share knowledge learnt by the encoder at every layer.

Weight-Initialized.
Neural networks are normally trained from scratch, with their weights initialized randomly. A poor initialization will lead to exploding or vanishing weights and gradients. Studies have shown that deep neural networks converge much earlier and avoid these scenarios with a proper initialization strategy [25, 26]. Such strategies initialize weights according to a specific distribution with a formulated pair of mean and standard deviation. Apart from manual initialization, the encoder path can be replaced with the sequential convolution and ReLU layers of a pre-trained CNN. For example, TernausNet [27] and AlbuNet [28] use pre-trained VGG [29] and ResNet-34 [30], respectively, as encoders in the contracting path.

Auxiliary-Based. Apart from introducing new weight initialization strategies and skip-connection schemes, existing studies explore the potential of equipping EDCNNs with auxiliary elements such as Attention Gates (AGs) and recurrent residual modules. AGs are commonly applied in image captioning [31], machine translation [32, 33], and classification tasks [34, 35]. With the help of self-attention gating modules, AttentionUNet [36] shows that a network learns to focus on salient image regions and suppress feature activation in irrelevant regions without introducing substantial computational overhead. By contrast, RecurrentUNet [37] uses the recurrent residual module to accumulate features at different time-steps. This process allows the production of a relatively strong feature representation by extracting essential low-level features. RecurrentAttentionUNet [38] combines both AGs and the recurrent residual module into U-Net. This network takes advantage of three different cores: using U-Net to capture information at multiple scales while integrating low- and high-level features; stacking residual blocks to allow the network to go deeper; and implementing attention modules to change the attention-aware features adaptively.

Cascaded.
Conventional EDCNNs came with a single encoder-decoder pair until the birth of LadderNet [39], an ensemble structure of multiple U-Nets. LadderNet concatenates encoder-decoder pairs, introducing additional paths for information flow and improving the capability of an EDCNN to capture complex features. A weight-sharing strategy was applied to the residual blocks to constrain the increase in trainable parameters caused by the chaining of encoder-decoder pairs.
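The attention gating used by the Auxiliary-Based family can be sketched as follows. This is a simplified additive gate in the spirit of AttentionUNet [36], not the original implementation; the channel sizes and the assumption that the skip features and gating signal share the same spatial resolution are ours:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Simplified additive attention gate: skip-connection features are
    re-weighted by a [0, 1] map computed from the features themselves and a
    gating signal from the decoder, suppressing irrelevant regions."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.wx = nn.Conv2d(x_ch, inter_ch, kernel_size=1)   # project skip features
        self.wg = nn.Conv2d(g_ch, inter_ch, kernel_size=1)   # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)     # collapse to one attention map

    def forward(self, x, g):
        att = torch.sigmoid(self.psi(torch.relu(self.wx(x) + self.wg(g))))
        return x * att  # salient regions pass through; irrelevant ones are damped
```

Because the gate is built from 1×1 convolutions, it adds little computational overhead, which matches the observation above about AttentionUNet.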
EXPERIMENTAL
Comparative Study
All the EDCNNs were trained using the Osteoarthritis Initiative (OAI) dataset, a longitudinal study of knee OA. This dataset covers 4,796 participants, with X-rays and Magnetic Resonance (MR) images of the participants' knees. Although the size of this dataset is enormous, we arbitrarily chose 100 sets of Double Echo Steady State (DESS) MR images and subsequently annotated both the femoral and tibial knee cartilages. Twenty sets of MR images were held out as a control set, while the remaining sets were partitioned into training and validation sets (ratio of 3:1). Isolating the control set prevents it from being exposed to the model during the training and validation process; the goal of a control set is to validate the models without any bias. The training and validation sets contain 570 and 190 images, respectively, while the control set has 189 images.

Unlike natural scene images, medical images are usually stored in the Digital Imaging and Communications in Medicine (DICOM) format. As the OAI dataset is saved as DICOM files, extraction and format conversion are necessary. We performed MR slice extraction and format conversion through Python scripting. The dimensions of the MR slices were maintained at 384 pixels in height and 384 pixels in width.

The EDCNNs were trained using adaptive moment estimation (Adam) with a batch size of 1 for 30 epochs, an initial learning rate of 1e-3, and a weight decay of 1e-4. The learning rate is controlled by a learning rate scheduler along the model training process. The scheduler reduces the learning rate by a factor of 0.1 if no improvement is seen in the validation loss for two consecutive epochs. We also utilized an early stopping algorithm to prevent a model from overfitting: we ceased the model training process if the validation loss remained stagnant for two consecutive epochs while the learning rate had already been reduced to its lower bound of 1e-10.
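The schedule just described (reduce-on-plateau with factor 0.1, patience of two epochs, and a 1e-10 floor that triggers early stopping) can be sketched framework-independently. The class below is our illustration of that logic under those stated hyperparameters, not the exact implementation used in the experiments:

```python
class PlateauController:
    """Sketch of the training schedule described above: divide the learning
    rate by 10 after `patience` epochs without validation-loss improvement,
    and request a stop once the rate has already reached the floor."""
    def __init__(self, lr=1e-3, factor=0.1, patience=2, min_lr=1e-10):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # consecutive epochs without improvement
        self.stop = False          # early-stopping flag

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the new lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                if self.lr <= self.min_lr:
                    self.stop = True  # plateau at the lr floor: cease training
                else:
                    self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

In PyTorch, the reduction step itself corresponds to `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=2, min_lr=1e-10)`, with the stopping check layered on top.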
We conducted model training using PyTorch.

In this study, the model's predicted output image was compared with the manual annotations pixel by pixel. Through this pixel-wise comparison, a confusion matrix, as seen in Table 1, was produced. With the four basic elements (i.e., TP, FP, FN, and TN), different metrics can be used to analyze the model's performance. We evaluated the trained EDCNNs on the isolated control set with three different metrics.

Jaccard Similarity Coefficient (JSC): JSC gauges the similarity and diversity between two finite sample sets. It is measured by dividing the size of the intersection by the size of the union of the sample sets. The formula is shown in equation 1:

JSC = TP / (TP + FP + FN) (1)

Dice Similarity Coefficient (DSC): DSC is the harmonic mean of precision and recall. This value is in the range [0, 1]. DSC differs from JSC, which counts the true positives only once; however, neither JSC nor DSC takes the true negatives into account. The formula is shown in equation 2:

DSC = 2TP / (2TP + FP + FN) (2)

Matthew's Correlation Coefficient (MCC): MCC is commonly used in the field of machine learning as a measure to assess a binary classification task. Unlike JSC and DSC, MCC summarizes all the confusion matrix elements into a single value. MCC returns a value in the range [-1, 1], with a perfect prediction labelled as 1; -1 indicates a completely incorrect prediction, while 0 represents a prediction no better than random. The formula is shown in equation 3:

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) (3)

Table 3 reports the JSC, DSC, and MCC values of each of the EDCNNs.
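All three metrics follow directly from the four confusion-matrix counts; a minimal sketch using the standard definitions:

```python
import math

def jsc(tp, fp, fn):
    """Jaccard Similarity Coefficient: intersection over union."""
    return tp / (tp + fp + fn)

def dsc(tp, fp, fn):
    """Dice Similarity Coefficient: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    """Matthew's Correlation Coefficient: summarizes all four elements
    into one value in [-1, 1]; 0 when any denominator factor vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Note that `tn` appears only in `mcc`, reflecting the point above that JSC and DSC ignore the true negatives.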
RESULTS AND DISCUSSION
The comparative study of EDCNNs can be split into two parts: the models' physical specifications and the models' performance. The first part investigates the model size and the total trainable parameters for each EDCNN, while the second part focuses on the performance metrics.

The size of a model is generally proportional to its number of trainable parameters, i.e., the more trainable parameters, the bigger the model size. According to Table 2, FCDenseNet-56 and LadderNet are the smallest models, at approximately 5 megabytes (MB) and approximately 1.3 million trainable parameters. As mentioned in Section 2.1, the weight-sharing strategy in LadderNet reduces the total trainable parameters, although multiple encoder-decoder blocks are concatenated. By contrast, the auxiliary-based EDCNNs (i.e., AttentionUNet, RecurrentUNet, and RecurrentAttentionUNet) have the largest sizes, with at least 34 million trainable parameters each. However, although a smaller model size is likely to improve the efficiency of model serving, it does not necessarily generate better performance in terms of a model's accuracy and precision.

In terms of JSC, DSC, and MCC, UNetVanilla slightly outperformed FCDenseNet-56 and LadderNet. However, this comes with a disadvantage, because the former has 22 times more trainable parameters than the latter two. With the additional trainable parameters, the training process for UNetVanilla will be longer than those of FCDenseNet-56 and LadderNet. From Table 3, UNetVanilla tops all the performance metrics, although it is only a baseline model. The reasons are as follows.

First, we limited each EDCNN to 30 training epochs, as stated in Section 3. In each epoch, each EDCNN iterates through all images within the training dataset and proceeds to validation at the end of the epoch. The model state with the lowest validation loss is then retrieved.
However, the EDCNNs might not be at their optimum as we limited the training to 30 epochs.

The second possible reason is the difference in the loss function. We implemented Binary Cross Entropy (BCE) with Logits loss, as compared with the Sorensen-Dice [36, 38] or custom-weighted loss functions [19, 24, 27, 28] of the original works. BCE with Logits loss is numerically stable thanks to the log-sum-exp function. This difference might explain why the other EDCNNs could not surpass the performance of the baseline architecture.

The third reason is the inconsistency of the decoder block. Several methods can increase the size of the feature maps in the decoding path, such as interpolation, up-sampling, and transposed convolution. Different approaches were chosen for the EDCNNs on the basis of their original works.

Overfitting is another potential cause for a model to underperform, especially models involving recurrent modules, as the recurrent layer is well known for its high susceptibility to overfitting. Tables 2 and 3 do not report the results of RecurrentUNet and RecurrentAttentionUNet due to overfitting. Apart from the early stopping algorithm, stronger mechanisms must be implemented to reduce the chances of overfitting.

Moreover, the masks of all images were manually annotated, which is subject to a degree of error due to intra- and inter-observer variability. Unlike natural scene images, the pixels of MR images are not color-coded. Thus, segmenting the boundary of tissues is challenging, and the classification of a pixel near a tissue boundary is ambiguous. Meanwhile, we considered the manual annotations to be near "Ground Truth" level, accepting that minor mistakes may exist across the manual masks.

In general, all the EDCNNs reported high scores in all performance metrics. The lowest and highest performance scores across EDCNNs range within 0.77-0.83 for JSC, 0.86-0.91 for DSC, and 0.87-0.90 for MCC. As seen in (Fig. ), all EDCNNs successfully predicted the cartilage regions.
The slight imperfections are the FP and FN pixels at the tips of the cartilage as well as at the boundaries. Referring to the confusion matrix element images, the FC-DenseNets tend to have more False Positive (red) pixels, while TernausNet-16 has the highest number of False Negative (green) pixels.
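The numerical stability of BCE with Logits loss noted above comes from evaluating log(sigmoid(x)) via the log-sum-exp formulation instead of composing the two operations naively. A small demonstration (the specific logit value is arbitrary, chosen only to force float32 underflow):

```python
import torch
import torch.nn.functional as F

# For a large negative logit, sigmoid underflows to exactly 0 in float32,
# so the naive log(sigmoid(x)) collapses to -inf ...
logit = torch.tensor(-200.0)
naive = torch.log(torch.sigmoid(logit))

# ... whereas the fused formulation log(sigmoid(x)) = x - log(1 + exp(x))
# (for x < 0) stays finite and exact:
stable = F.logsigmoid(logit)

print(naive.item(), stable.item())  # -inf -200.0
```

`nn.BCEWithLogitsLoss` applies this same fused computation internally, which is why it is preferred over a separate `sigmoid` followed by `nn.BCELoss`.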
CONCLUSION
This study provided a genealogical chart of EDCNNs by grouping architectures according to their characteristics. It then performed a benchmarking process on 10 EDCNNs to identify the best architecture for segmenting cartilage tissue in MR images. In this comparative study, we compared EDCNNs from two perspectives: the models' physical specifications and their segmentation performance. On the one hand, LadderNet has the fewest trainable parameters, and its model size is only 5 MB. On the other hand, UNetVanilla achieved the best performance, with 0.8369, 0.9108, and 0.9097 for JSC, DSC, and MCC, respectively. Therefore, LadderNet is found to be the most lightweight architecture, while UNetVanilla is the best-performing one. The outcome of this study can serve as a guideline, reference, or even a comparison standard for the task of delineating knee cartilage tissue in MR images for OA analysis. We plan to expand this study in the future by including other variations and designs of EDCNNs and performing further in-depth comparative analysis.
Table 1
Confusion matrix and its four basic elements.

                                     Manual Annotations
Pixel's Class                   Cartilage           Background
Model's Output   Cartilage      True Positive       False Positive
                 Background     False Negative      True Negative
Table 2
Comparing the physical specifications of EDCNNs. The model sizes are represented in MB. The smallest model size is in bold numbers.
Table 3
Comparing the performances of EDCNNs in terms of JSC, DSC, and MCC on the 20 sets of testing images. The scores are tabulated as mean and standard deviation. The results of RecurrentUNet and RecurrentAttentionUNet are excluded because they overfitted and did not produce any high-confidence results. The highest scores are in bold numbers.