Osteoarthritis (OA) is a common degenerative joint disease that may lead to disability in severe cases [1]. Although OA is not lethal, it profoundly affects mobility [2] and the patient's quality of life. The incessant breakdown of cartilage and continuous bone deformation are the main causes of joint failure. Patients with severe (end-stage) OA experience excruciating pain as the joint cartilage degenerates and causes bone-to-bone friction during movement. Arthroplasty, or total knee replacement, is the last option available for knee OA patients to regain their mobility. However, this clinical procedure is invasive and costly. Therefore, diagnosing OA at an early stage is crucial for clinical intervention in halting disease progression and mitigating disability in later stages.

Magnetic Resonance Imaging (MRI) is a safe, non-ionizing imaging technique for visualizing the knee joint's internal derangement, especially in determining OA features of the asymptomatic uninjured knee [3]. The main advantage of MRI over traditional radiography is its capability to evaluate structural changes during disease progression [4] and provide biomarkers for early OA diagnosis [5]. Degeneration of cartilage tissue is one of the main criteria for an early stage of OA as defined by Luyten et al. [6]. Thus, delineating cartilage in biomedical images is crucial because early detection of cartilage defects allows for early medical intervention and leads to better treatment [7-9].

In clinical practice, cartilage delineation is performed manually by a radiologist [2]. Manual delineation is not only time-consuming [2, 10] but also prone to inter- and intra-observer variability [11, 12]. In recent years, deep convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in biomedical image analysis, such as breast cancer analysis [13], bone disease prediction [14], and age assessment [15].
Unlike most conventional machine learning techniques such as fuzzy logic [16], bi-histogram equalization [17], and image registration [18], CNNs require no feature engineering but demand a substantial amount of annotated data and computation power. Fortunately, the computation requirement is no longer a challenge to CNN training with the advancement of graphics cards and cloud computing.

The encoder-decoder pair is the core component in most existing segmentation neural networks. The encoder harvests data into features, whereas the decoder decodes the features to perform pixel-based classification; the encoder is discriminative, whereas the decoder is generative. These encoder-decoder-based CNNs (EDCNNs) have reported remarkable achievements on natural scene images. The current study aims to examine the performance of various EDCNNs in delineating cartilage tissue from MR images.

The contributions of this study are as follows. First, we propose to group the various EDCNNs into different variations or families. To the best of our knowledge, our study is the first to provide a genealogical chart of EDCNNs. Second, we perform a benchmarking process and identify the best EDCNN for delineating knee cartilage tissue within MR images.

This paper is organized as follows. Section 2 briefly explains U-Net, the base version of EDCNN, and its variations; we group different EDCNNs into families according to their unique characteristics. Section 3 illustrates the methodology, including the datasets, the data pre-processing techniques applied, the specifications of model training, and the model assessment strategy. Section 4 evaluates the performance of EDCNNs by reporting the comparative results. Section 5 summarizes the conclusions and future work.
BASE AND VARIATIONS OF EDCNNS
The general architecture of an EDCNN has two paths: a contracting path for context capturing and an expanding path to localize features precisely. U-Net [19] is the first neural network to employ the encoder-decoder pairing scheme in a network design for the segmentation task, making it the first EDCNN. This architecture was inspired by the Fully Convolutional Network [20], a CNN that can perform pixel-wise classification. The encoding path of U-Net is built with repeating blocks, each containing 3×3 convolutional layers, rectified linear units (ReLU) [21], and a 2×2 max-pooling layer with a stride of 2. For each successive block, the feature map resolution is halved, whereas the number of feature channels is doubled. By contrast, the decoding path of U-Net contains up-convolution blocks that up-sample the feature maps while reducing the feature channels by half. Feature maps from the encoding path are concatenated to the decoding path after each respective down- and up-sampling process. These unique and symmetric paths yield a U-shaped architecture.
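As an illustration, one contracting-path stage of this design can be sketched in PyTorch. This is a minimal sketch, not the original U-Net code; the channel counts are our own example, and `padding=1` is used so the 3×3 convolutions preserve the spatial size:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One contracting-path stage: two 3x3 convolutions with ReLU,
    followed by 2x2 max-pooling with stride 2 (halving the resolution)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # padding keeps H x W
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.convs(x)          # kept for concatenation in the decoding path
        return self.pool(skip), skip  # pooled output feeds the next, deeper block
```

Feeding a 384×384 single-channel slice through `EncoderBlock(1, 64)` yields a 192×192 map with 64 channels plus the full-resolution skip tensor, matching the halve-resolution/double-channels pattern described above.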
Variations of EDCNNs
In this study, we refer to U-Net as the "Base" for EDCNNs while grouping its extensions into four variations: "Skip-Connections," "Weight-Initialized," "Auxiliary-Based," and "Cascaded," as shown in (Fig. ).

Base. The original U-Net has a major drawback: the output resolution of the final layer is not the same as that of the input image. The feature maps were cropped at each level of the contracting path because border pixels were lost during each convolution. To overcome this problem, we padded the feature maps in each of the convolution layers, ensuring that the output dimension is equivalent to the input size; the resulting network is referred to as UNetVanilla.

Skip-Connections. The degree of connectivity within a neural network determines the information flow from one layer to another. DenseNet [22] exploits the effects of shortcut connections by directly connecting all layers with one another and performing iterative concatenation of feature maps. The improvement in connectivity helps this network converge faster. Although DenseNet was created for the classification task, a segmentation version, namely, FC-DenseNet [23], was carefully extended from it. FC-DenseNet inherits the following advantages of DenseNet: parameter efficiency, implicit deep supervision, and feature reuse. FC-DenseNet mitigates the large number of parameters by only up-sampling the feature maps created in the preceding dense block. FC-DenseNet has three variants: FC-DenseNet56, FC-DenseNet67, and FC-DenseNet103, with 56, 67, and 103 layers, respectively. Unlike FC-DenseNet, LinkNet [24] provides a different type of linkage between encoder and decoder: the input of each encoder layer is bypassed to the output of the corresponding decoder. This approach aims to recover the lost spatial information that can be utilized by the decoder and its up-sampling operations. Moreover, the decoder uses fewer parameters, as the decoders share knowledge learnt by the encoder at every layer.

Weight-Initialized.
Neural networks are normally trained from scratch, with their weights initialized randomly. A poor initialization will lead to exploding or vanishing weights and gradients. Studies have shown that deep neural networks converge much earlier and avoid these scenarios with a proper initialization strategy [25, 26]. Such strategies initialize weights according to a specific distribution with a formulated pair of mean and standard deviation. Apart from manual initialization, the encoder path can be replaced with the sequential convolution and ReLU layers of a pre-trained CNN. For example, TernausNet [27] and AlbuNet [28] use pre-trained VGG [29] and ResNet-34 [30], respectively, as encoders in the contracting path.

Auxiliary-Based. Apart from introducing new weight initialization strategies and skip-connection schemes, existing studies explore the potential of equipping EDCNNs with auxiliary elements such as Attention Gates (AGs) and recurrent residual modules. AGs are commonly applied in image captioning [31], machine translation [32, 33], and classification tasks [34, 35]. With the help of self-attention gating modules, AttentionUNet [36] shows that a network learns to focus on salient image regions and suppress feature activation in irrelevant regions without introducing substantial computational overhead. By contrast, RecurrentUNet [37] uses the recurrent residual module to accumulate features at different time-steps. This process allows the production of a relatively strong feature representation by extracting essential low-level features. RecurrentAttentionUNet [38] combines both AGs and the recurrent residual module into U-Net. This network takes advantage of three different cores: using U-Net to capture information at multiple scales while integrating low- and high-level features; stacking residual blocks to allow the network to go deeper; and implementing attention modules to change the attention-aware features adaptively.

Cascaded.
Conventional EDCNNs came with a single encoder-decoder pair until the birth of LadderNet [39], an ensemble structure of multiple U-Nets. LadderNet concatenates encoder-decoder pairs, introducing additional paths for information flow and improving the capability of an EDCNN to capture complex features. A weight-sharing strategy was applied to the residual blocks to constrain the increase in trainable parameters caused by the chaining of encoder-decoder pairs.
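The attention gating used by the Auxiliary-Based family can be sketched as follows. This is a simplified additive gate in the spirit of AttentionUNet [36], not the original implementation; the channel sizes and the assumption that the skip features and gating signal share the same spatial resolution are ours:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Simplified additive attention gate: skip-connection features are
    re-weighted by a [0, 1] map computed from the features themselves and a
    gating signal from the decoder, suppressing irrelevant regions."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.wx = nn.Conv2d(x_ch, inter_ch, kernel_size=1)   # project skip features
        self.wg = nn.Conv2d(g_ch, inter_ch, kernel_size=1)   # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)     # collapse to one attention map

    def forward(self, x, g):
        att = torch.sigmoid(self.psi(torch.relu(self.wx(x) + self.wg(g))))
        return x * att  # salient regions pass through; irrelevant ones are damped
```

Because the gate is built from 1×1 convolutions, it adds little computational overhead, which matches the observation above about AttentionUNet.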
EXPERIMENTAL
Comparative Study
All the EDCNNs were trained using the Osteoarthritis Initiative (OAI) dataset, a longitudinal study of knee OA. This dataset covers 4,796 participants, with X-rays and Magnetic Resonance (MR) images of the participants' knees. Although the size of this dataset is enormous, we arbitrarily chose 100 sets of Double Echo Steady State (DESS) MR images and subsequently annotated both the femoral and tibial knee cartilages. Twenty sets of MR images were held out as a control set, while the remaining sets were partitioned into training and validation sets (ratio of 3:1). Isolating the control set prevents it from being exposed to the model during the training and validation process; the goal of a control set is to validate the models without any bias. The training and validation sets contain 570 and 190 images, respectively, while the control set has 189 images.

Unlike natural scene images, medical images are usually stored in the Digital Imaging and Communications in Medicine (DICOM) format. As the OAI dataset is saved as DICOM files, extraction and format conversion are necessary. We performed MR slice extraction and format conversion through Python scripting. The dimensions of the MR slices were maintained at 384 pixels in height and 384 pixels in width.

The EDCNNs were trained using adaptive moment estimation (Adam) with a batch size of 1 for 30 epochs, an initial learning rate of 1e-3, and a weight decay of 1e-4. The learning rate is controlled by a learning rate scheduler along the model training process. The scheduler reduces the learning rate by a factor of 0.1 if no improvement is seen in the validation loss for two consecutive epochs. We also utilized an early stopping algorithm to prevent a model from overfitting: we ceased the model training process if the validation loss remained stagnant for two consecutive epochs while the learning rate had already been reduced to its lower bound of 1e-10.
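The schedule just described (reduce-on-plateau with factor 0.1, patience of two epochs, and a 1e-10 floor that triggers early stopping) can be sketched framework-independently. The class below is our illustration of that logic under those stated hyperparameters, not the exact implementation used in the experiments:

```python
class PlateauController:
    """Sketch of the training schedule described above: divide the learning
    rate by 10 after `patience` epochs without validation-loss improvement,
    and request a stop once the rate has already reached the floor."""
    def __init__(self, lr=1e-3, factor=0.1, patience=2, min_lr=1e-10):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # consecutive epochs without improvement
        self.stop = False          # early-stopping flag

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the new lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                if self.lr <= self.min_lr:
                    self.stop = True  # plateau at the lr floor: cease training
                else:
                    self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

In PyTorch, the reduction step itself corresponds to `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=2, min_lr=1e-10)`, with the stopping check layered on top.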
We conducted model training using PyTorch.

In this study, the model's predicted output image was compared with the manual annotations pixel by pixel. Through this pixel-wise comparison, a confusion matrix, as seen in Table 1, was produced. With the four basic elements (i.e., TP, FP, FN, and TN), different metrics can be used to analyze the model's performance. We evaluated the trained EDCNNs on the isolated control set with three different metrics.

Jaccard Similarity Coefficient (JSC): JSC gauges the similarity and diversity between two finite sample sets. It is measured by dividing the size of the intersection by the size of the union of the sample sets. The formula is shown in equation 1:

JSC = TP / (TP + FP + FN) (1)

Dice Similarity Coefficient (DSC): DSC is the harmonic mean of precision and recall. This value is in the range [0, 1]. DSC differs from JSC, which counts the true positives only once; however, neither JSC nor DSC takes the true negatives into account. The formula is shown in equation 2:

DSC = 2TP / (2TP + FP + FN) (2)

Matthew's Correlation Coefficient (MCC): MCC is commonly used in the field of machine learning as a measure to assess a binary classification task. Unlike JSC and DSC, MCC summarizes all the confusion matrix elements into a single value. MCC returns a value in the range [-1, 1], with a perfect prediction labelled as 1; -1 indicates a completely incorrect prediction, while 0 represents a prediction no better than random. The formula is shown in equation 3:

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) (3)

Table 3 reports the JSC, DSC, and MCC values of each of the EDCNNs.
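All three metrics follow directly from the four confusion-matrix counts; a minimal sketch using the standard definitions:

```python
import math

def jsc(tp, fp, fn):
    """Jaccard Similarity Coefficient: intersection over union."""
    return tp / (tp + fp + fn)

def dsc(tp, fp, fn):
    """Dice Similarity Coefficient: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    """Matthew's Correlation Coefficient: summarizes all four elements
    into one value in [-1, 1]; 0 when any denominator factor vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Note that `tn` appears only in `mcc`, reflecting the point above that JSC and DSC ignore the true negatives.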
RESULTS AND DISCUSSION
The comparative study of EDCNNs can be split into two parts: the models' physical specifications and the models' performance. The first part investigates the model size and the total trainable parameters for each EDCNN, while the second part focuses on the performance metrics.

The size of a model is generally proportional to its number of trainable parameters, i.e., the more trainable parameters, the bigger the model size. According to Table 2, FCDenseNet-56 and LadderNet are the smallest models, at approximately 5 megabytes (MB) and approximately 1.3 million trainable parameters. As mentioned in Section 2.1, the weight-sharing strategy in LadderNet reduces the total trainable parameters, although multiple encoder-decoder blocks are concatenated. By contrast, the auxiliary-based EDCNNs (i.e., AttentionUNet, RecurrentUNet, and RecurrentAttentionUNet) have the largest sizes, with at least 34 million trainable parameters each. However, although a smaller model size is likely to improve the efficiency of model serving, it does not necessarily generate better performance in terms of a model's accuracy and precision.

In terms of JSC, DSC, and MCC, UNetVanilla slightly outperformed FCDenseNet-56 and LadderNet. However, this comes with a disadvantage, because the former has 22 times more trainable parameters than the latter two. With the additional trainable parameters, the training process for UNetVanilla will be longer than those of FCDenseNet-56 and LadderNet. From Table 3, UNetVanilla tops all the performance metrics, although it is only a baseline model. The reasons are as follows.

First, we limited each EDCNN to 30 training epochs, as stated in Section 3. In each epoch, each EDCNN iterates through all images within the training dataset and proceeds to validation at the end of the epoch. The model state with the lowest validation loss is then retrieved.
However, the EDCNNs might not be at their optimum as we limited the training to 30 epochs.

The second possible reason is the difference in the loss function. We implemented Binary Cross Entropy (BCE) with Logits loss, as compared with the Sorensen-Dice [36, 38] or custom-weighted loss functions [19, 24, 27, 28] of the original works. BCE with Logits loss is numerically stable thanks to the log-sum-exp function. This difference might explain why the other EDCNNs could not surpass the performance of the baseline architecture.

The third reason is the inconsistency of the decoder block. Several methods can increase the size of the feature maps in the decoding path, such as interpolation, up-sampling, and transposed convolution. Different approaches were chosen for the EDCNNs on the basis of their original works.

Overfitting is another potential cause for a model to underperform, especially models involving recurrent modules, as the recurrent layer is well known for its high susceptibility to overfitting. Tables 2 and 3 do not report the results of RecurrentUNet and RecurrentAttentionUNet due to overfitting. Apart from the early stopping algorithm, stronger mechanisms must be implemented to reduce the chances of overfitting.

Moreover, the masks of all images were manually annotated, which is subject to a degree of error due to intra- and inter-observer variability. Unlike natural scene images, the pixels of MR images are not color-coded. Thus, segmenting the boundary of tissues is challenging, and the classification of a pixel near a tissue boundary is ambiguous. Meanwhile, we considered the manual annotations to be near "Ground Truth" level, accepting that minor mistakes may exist across the manual masks.

In general, all the EDCNNs reported high scores in all performance metrics. The lowest and highest performance scores across EDCNNs range within 0.77-0.83 for JSC, 0.86-0.91 for DSC, and 0.87-0.90 for MCC. As seen in (Fig. ), all EDCNNs successfully predicted the cartilage regions.
The slight imperfections are the FP and FN pixels at the tips of the cartilage as well as at the boundaries. Referring to the confusion matrix element images, the FC-DenseNets tend to have more False Positive (red) pixels, while TernausNet-16 has the highest number of False Negative (green) pixels.
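The numerical stability of BCE with Logits loss noted above comes from evaluating log(sigmoid(x)) via the log-sum-exp formulation instead of composing the two operations naively. A small demonstration (the specific logit value is arbitrary, chosen only to force float32 underflow):

```python
import torch
import torch.nn.functional as F

# For a large negative logit, sigmoid underflows to exactly 0 in float32,
# so the naive log(sigmoid(x)) collapses to -inf ...
logit = torch.tensor(-200.0)
naive = torch.log(torch.sigmoid(logit))

# ... whereas the fused formulation log(sigmoid(x)) = x - log(1 + exp(x))
# (for x < 0) stays finite and exact:
stable = F.logsigmoid(logit)

print(naive.item(), stable.item())  # -inf -200.0
```

`nn.BCEWithLogitsLoss` applies this same fused computation internally, which is why it is preferred over a separate `sigmoid` followed by `nn.BCELoss`.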
CONCLUSION
This study provided a genealogical chart of EDCNNs by grouping architectures according to their characteristics. It then performed a benchmarking process on 10 EDCNNs to identify the best architecture for segmenting cartilage tissue in MR images. In this comparative study, we compared EDCNNs from two perspectives: the models' physical specifications and their segmentation performance. On the one hand, LadderNet has the fewest trainable parameters, and its model size is only 5 MB. On the other hand, UNetVanilla achieved the best performance, with 0.8369, 0.9108, and 0.9097 for JSC, DSC, and MCC, respectively. Therefore, LadderNet is found to be the most lightweight architecture, while UNetVanilla is the best-performing one. The outcome of this study can serve as a guideline, reference, or even a comparison standard for the task of delineating knee cartilage tissue in MR images for OA analysis. We plan to expand this study in the future by including other variations and designs of EDCNNs and performing further in-depth comparative analysis.
Table 1
Confusion matrix and its four basic elements.

                                     Manual Annotations
Pixel's Class                   Cartilage           Background
Model's Output   Cartilage      True Positive       False Positive
                 Background     False Negative      True Negative
Table 2
Comparing the physical specifications of EDCNNs. The model sizes are represented in MB. The smallest model size is in bold numbers.
Table 3
Comparing the performances of EDCNNs in terms of JSC, DSC, and MCC on the 20 sets of testing images. The scores are tabulated as mean and standard deviation. The results of RecurrentUNet and RecurrentAttentionUNet are excluded because they overfitted and did not produce any high-confidence results. The highest scores are in bold numbers.