| Literature DB >> 32209991 |
Muhammad Arsalan, Muhammad Owais, Tahir Mahmood, Jiho Choi, Kang Ryoung Park.
Abstract
Automatic chest anatomy segmentation plays a key role in computer-aided disease diagnosis, such as for cardiomegaly, pleural effusion, emphysema, and pneumothorax. Among these diseases, cardiomegaly is considered a perilous disease, involving a high risk of sudden cardiac death. It can be diagnosed early by an expert medical practitioner using chest X-Ray (CXR) analysis. The cardiothoracic ratio (CTR) and transverse cardiac diameter (TCD) are the clinical criteria used to estimate the heart size for diagnosing cardiomegaly. Manual estimation of the CTR and diagnosis of related diseases are time-consuming processes that require significant effort from a medical expert. Cardiomegaly and related diseases can be assessed automatically through accurate anatomical semantic segmentation of CXRs using artificial intelligence. Automatic segmentation of the lungs and heart from CXRs is considered an intensive task owing to inferior-quality images and intensity variations under nonideal imaging conditions. Although a few deep learning-based techniques exist for chest anatomy segmentation, most of them consider only single-class lung segmentation with deep, complex architectures that require many trainable parameters. To address these issues, this study presents two multiclass residual mesh-based CXR segmentation networks, X-RayNet-1 and X-RayNet-2, which are specifically designed to provide fine segmentation performance with fewer trainable parameters than conventional deep learning schemes. The proposed methods utilize semantic segmentation to support the diagnostic procedure for related diseases. To evaluate X-RayNet-1 and X-RayNet-2, experiments were performed with the publicly available Japanese Society of Radiological Technology (JSRT) dataset for multiclass segmentation of the lungs, heart, and clavicle bones; two other publicly available datasets, the Montgomery County (MC) and Shenzhen X-Ray (SC) sets, were used to evaluate lung segmentation.
The experimental results showed that X-RayNet-1 achieved fine performance for all datasets and X-RayNet-2 achieved competitive performance with a 75% parameter reduction.
Keywords: Cardiomegaly; Cardiothoracic ratio; Chest anatomy segmentation; X-Ray-Net
Year: 2020 PMID: 32209991 PMCID: PMC7141544 DOI: 10.3390/jcm9030871
Source DB: PubMed Journal: J Clin Med ISSN: 2077-0383 Impact factor: 4.241
Comparison of previous methods and X-RayNet for chest anatomy segmentation.
| Type | Methods | Strength | Weakness |
|---|---|---|---|
| Handcrafted local feature-based methods * | Lung segmentation using Hull-CPLM [ | Selects the ROI for lung detection | Preprocessing is required |
| | Nonrigid registration lung segmentation [ | SIFT-flow modeling for registration provides an advantage | Boundary refinement is required |
| | Probabilistic lung shape model [ | Probabilistic shape model mask helps in shape segmentation | A single threshold creates segmentation errors |
| | Otsu thresholding [ | Excludes the noise area for lung nodule segmentation | Gamma correction is required |
| | Fuzzy c-means clustering [ | Better performance compared with k-means | A lower value of β requires more iterations |
| | Active contour and morphology [ | Active contour can estimate the real lung boundary | The iterative method takes many iterations |
| | Salient point-based lung segmentation [ | Interpolation of salient points approximates the lung boundary well | Results are affected by overlapped regions |
| | Harris corner detector [ | Convolutional mask refines the contour | Edge detection is affected by noise |
| | Region growing [ | Region growing methods approach the real boundary well | ROI is required |
| Learned/deep feature-based methods | Structural correcting adversarial network | Adversarial training is good for a small number of training images | Critic network requires a fully connected layer and consumes many parameters |
| | Domain adaptation [ | Domain adaptation is good for enhancing segmentation performance | FCN-based segmentation consumes many parameters |
| | Lung segmentation by criss-cross attention [ | Image-to-image translation is used for augmentation | Three separate deep models (ResNet101, U-Net, and MUNIT) are used |
| | Similar structure to AlexNet [ | Semantic segmentation is close to the real boundary | Patch-based deep learning scheme is computationally expensive |
| | FCN, U-Net, and SegNet for CXR segmentation [ | Semantic segmentation provides good results for multiclass segmentation | FCN consumes many trainable parameters owing to the fully connected layer |
| | U-Net [ | U-Net is popular for medical image segmentation | Preprocessing is required |
| | Mask-RCNN [ | Efficient multiclass segmentation is performed | Region proposals are also required with pixel-wise annotation |
| | ResNet [ | Dropping the 5th convolutional block from VGG-16 reduces the number of parameters | Clavicle bone segmentation is not considered |
| | X-RayNet | 12 residual mesh streams enhance features to provide good segmentation performance | Data augmentation is required to artificially increase the amount of data |
* Handcrafted local features are obtained with conventional image processing schemes.
Figure 1. Flowchart of the proposed method.
Key architectural differences between X-RayNet and previous approaches.
| Method | Other Architectures | X-RayNet |
|---|---|---|
| ResNet [ | Only adjacent convolutional layers have residual skip paths | Both adjacent and nonadjacent layers have residual skip connections, including paths between the encoder and decoder |
| | 1 × 1 convolution is employed as a bottleneck layer in all ResNet variants | 1 × 1 convolution is used to connect three blocks of the decoder based on nonidentity mapping |
| | Max-pooling layers are without indices information | Max-pool to max-unpool indices information is shared between the corresponding encoder and decoder blocks |
| | All variants use fully connected layers for classification | Fully connected layers are not used, making the network a fully convolutional network (FCN) for semantic segmentation |
| | Average pooling is employed at the end of the network | Max-pooling and max-unpooling layers are used in each encoder and decoder block |
| IrisDenseNet [ | Encoder and decoder consist of 13 convolutional layers each, for a total of 26 convolutional layers | Encoder and decoder consist of eight and nine (3 × 3) convolutional layers, respectively |
| | Uses dense connectivity in the encoder with depth-wise concatenation | Residual connectivity between encoder and decoder by elementwise addition |
| | The first two blocks have two convolutional layers and the remaining blocks have three convolutional layers in the encoder and decoder | Two convolutional layers in each encoder and decoder convolutional block, with one convolutional layer at the end of the network to produce the respective class masks |
| | The decoder is the same as the VGG-16 network, without feature reuse by dense connectivity | Both encoder and decoder use residual mesh connectivity for feature reuse |
| FRED-Net [ | Only uses residual skip connections between adjacent convolutional layers of the same block | Uses residual skip connections for adjacent convolutional layers and between the encoder and decoder externally |
| | There is no skip connection between encoder and decoder | Inner and outer residual connections for spatial information flow |
| | The overall network has six skip paths | The overall network has 12 residual skip paths that create the residual mesh |
| | The overall network is based on nonidentity mapping | Among the 12 residual paths that create the residual mesh, nine use identity mapping and three use nonidentity mapping |
| | ReLU is used after the elementwise addition, representing post-activation only | The network is based on pre- and post-activation |
| SegNet [ | 26 convolutional layers | 17 convolutional layers |
| | No residual connectivity, which causes the vanishing gradient problem | The vanishing gradient problem is handled by the residual mesh |
| | Each block has a different number of convolutional layers | All blocks have the same two convolutional layers |
| | The 512-depth block is used twice to increase the depth of the network | The 512-depth block is used once in X-RayNet-1 and is not used in X-RayNet-2 |
| OR-Skip-Net [ | There is no internal connectivity between the convolutional layers in the encoder and decoder | Both encoder and decoder convolutional layers are connected with the residual mesh for feature empowerment |
| | The outer skip connections use nonidentity mapping | The encoder-to-decoder connections use identity mapping |
| | Only pre-activation is used, as ReLU exists before the elementwise addition | The network is based on pre- and post-activation |
| | Four residual connections are used | 12 residual skip connections are used |
| Vess-Net [ | 16 convolutional layers are used | 16 convolutions are used, with an extra convolution in the decoder for fine edge connectivity |
| | The first convolutional layer has no internal or external residual connection | The features from the first convolutional layer are important for edge information of a minor class such as the clavicle bones; therefore, it is internally and externally connected |
| | All convolutional layers are internally connected with each other inside the encoder and decoder with nonidentity mapping | Most of the internal layers of the encoder and decoder are connected using identity mapping |
| | 10 residual paths | 12 residual paths |
| U-Net [ | 23 convolutional layers are used | 17 convolutional layers are used |
| | Up-convolutions are used in the expansive part for upsampling | The unpool layer in combination with normal convolution is used for upsampling |
| | 1 × 1 convolution is used at the end of the network | 1 × 1 convolution is only used in the decoder internal residual connections |
| | Feature concatenation is utilized for empowerment | Elementwise feature addition is utilized for feature empowerment |
| | Cropping is required owing to border pixel loss during convolution | The feature map size is controlled by indices information transfer between pooling and unpooling layers |
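The identity versus nonidentity mapping distinction drawn in the table above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the authors' implementation: the 1 × 1 convolution is reduced to a per-pixel channel-mixing matrix multiply, and a `tanh` stands in for a convolution-plus-activation block.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature map: height x width x channels
x = rng.standard_normal((4, 4, 8))

def conv_1x1(feat, weight):
    # A 1 x 1 convolution is a per-pixel channel mixing:
    # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)
    return feat @ weight

# Identity-mapped skip: channel depths already match,
# so the skip adds x as-is (no extra parameters).
f_x = np.tanh(x)                # stand-in for conv + activation
identity_out = f_x + x          # elementwise addition

# Nonidentity-mapped skip: depths differ, so a 1 x 1 conv
# projects the skip path before the addition (as the decoder's
# inner nonidentity streams do).
w = rng.standard_normal((8, 16))
g_x = np.tanh(conv_1x1(x, w))   # main path now has 16 channels
skip = conv_1x1(x, w)           # 1 x 1 projection on the skip path
nonidentity_out = g_x + skip

assert identity_out.shape == (4, 4, 8)
assert nonidentity_out.shape == (4, 4, 16)
```

Elementwise addition (rather than concatenation, as in U-Net or IrisDenseNet) keeps the channel count, and therefore the parameter count of subsequent convolutions, unchanged.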
Figure 2. X-RayNet residual mesh schematic.
Figure 3. Proposed X-RayNet architecture for chest X-Ray (CXR) semantic segmentation: (a) X-RayNet-1 without filter reduction and (b) X-RayNet-2 with filter reduction.
X-RayNet encoder with residual mesh and the feature map size of each of the following: EB, EC, OIS, IIS, and pool (indicating encoder block, encoder convolution, outer identity stream, inner identity stream, and pooling layer, respectively). A layer marked with “**” includes batch normalization (BN) and a ReLU unit, whereas “*” indicates that only BN is included with the layer. The table is based on an input image size of 350 × 350 × 3.
| Block | Name/Size | Number of Filters | Output Feature Map Size (Width × Height × Number of Channels) | Number of Trainable Parameters (EC + BN) |
|---|---|---|---|---|
| EB-1 | EC-1_1 **/3 × 3 × 3 | 64 | 350 × 350 × 64 | 1792 + 128 |
| | EC-1_2 /3 × 3 × 64 | 64 | | 36,928 |
| | E-Add-1 (EC-1_1 + EC-1_2) using IIS | - | | |
| | BN + ReLU | | | 128 |
| | Pool-1/2 × 2 To decoder (OIS-2) | - | 175 × 175 × 64 | - |
| EB-2 | EC-2_1 **/3 × 3 × 64 To E-Add-2 | 128 | 175 × 175 × 128 | 73,856 + 256 |
| | EC-2_2 */3 × 3 × 128 | 128 | | 147,584 |
| | E-Add-2 (EC-2_1 + EC-2_2) using IIS | - | | - |
| | BN + ReLU | | | 256 |
| | * Pool-2/2 × 2 To decoder (OIS-3) | - | 87 × 87 × 128 | - |
| EB-3 | EC-3_1 **/3 × 3 × 128 To E-Add-3 | 256 | 87 × 87 × 256 | 295,168 + 512 |
| | EC-3_2 /3 × 3 × 256 | 256 | | 590,080 + 512 |
| | E-Add-3 (EC-3_1 + EC-3_2) using IIS | - | | - |
| | BN + ReLU | | | |
| | * Pool-3/2 × 2 To decoder (OIS-4) | - | 43 × 43 × 256 | - |
| EB-4 | EC-4_1 **/3 × 3 × 256 To E-Add-4 | 512 | 43 × 43 × 512 | 1,180,160 + 1024 |
| | EC-4_2 */3 × 3 × 512 | 512 | | 2,359,808 |
| | E-Add-4 (EC-4_1 + EC-4_2) using IIS | - | | - |
| | BN + ReLU | | | 1024 |
| | * Pool-4/2 × 2 | - | 21 × 21 × 512 | - |
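The per-layer parameter counts in the encoder table follow the standard formulas: a k × k convolution with bias contributes k·k·C_in·C_out + C_out parameters, and batch normalization contributes 2 × channels (one learned scale and one shift per channel). A short check against the table entries:

```python
def conv_params(k, c_in, c_out):
    """Trainable parameters of a k x k convolution with bias."""
    return k * k * c_in * c_out + c_out

def bn_params(channels):
    """Batch normalization learns one scale and one shift per channel."""
    return 2 * channels

# Entries from the encoder table above:
assert conv_params(3, 3, 64) == 1792         # EC-1_1: 3 x 3 x 3 -> 64
assert bn_params(64) == 128                  # BN after EC-1_1
assert conv_params(3, 64, 64) == 36928       # EC-1_2
assert conv_params(3, 64, 128) == 73856      # EC-2_1
assert conv_params(3, 128, 128) == 147584    # EC-2_2
assert conv_params(3, 128, 256) == 295168    # EC-3_1
assert conv_params(3, 256, 256) == 590080    # EC-3_2
assert conv_params(3, 256, 512) == 1180160   # EC-4_1
assert conv_params(3, 512, 512) == 2359808   # EC-4_2
```

Every convolutional entry in the table is reproduced exactly by these two formulas.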
X-RayNet decoder with residual mesh and the feature map size of each of the following: DB, DC, OIS, INIS, and unpool (indicating decoder block, decoder convolution, outer identity stream, inner nonidentity stream, and unpooling layer, respectively). A layer marked with “**” includes batch normalization (BN) and a ReLU unit, whereas “*” indicates that only BN is included with the layer; “^” shows that the path comes from the corresponding encoder block through the OIS (OIS-1 to OIS-4). MConv represents the last convolutional layer, which generates the class masks. The table is based on an input image size of 350 × 350 × 3.
| Block | Name/Size | Number of Filters | Output Feature Map Size (Width × Height × Number of Channels) | Number of Trainable Parameters |
|---|---|---|---|---|
| | Unpool-4 | - | 43 × 43 × 512 | - |
| DB-4 | DConv-4_2 **/3 × 3 × 512 | 512 | | 2,359,808 + 1024 |
| | INIS-4 */1 × 1 × 512 | 256 | 43 × 43 × 256 | 131,328 + 512 |
| | DConv-4_1 */3 × 3 × 512 | 256 | | 1,179,904 |
| | Add-5 | - | | - |
| | BN + ReLU | | | 512 |
| | * Unpool-3 | - | 87 × 87 × 256 | - |
| DB-3 | DConv-3_2 **/3 × 3 × 256 | 256 | | 590,080 + 512 |
| | INIS-3 */1 × 1 × 256 | 128 | 87 × 87 × 128 | 32,896 + 256 |
| | DConv-3_1 **/3 × 3 × 256 | 128 | | 295,040 |
| | Add-6 | - | | - |
| | BN + ReLU | | | 256 |
| | * Unpool-2 | - | 175 × 175 × 128 | - |
| DB-2 | DConv-2_2 **/3 × 3 × 128 | 128 | | 147,584 + 256 |
| | INIS-2 */1 × 1 × 128 | 64 | 175 × 175 × 64 | 8256 + 128 |
| | DConv-2_1 **/3 × 3 × 128 | 64 | | 73,792 |
| | Add-7 | - | | - |
| | BN + ReLU | | | 128 |
| | * Unpool-1 | - | 350 × 350 × 64 | - |
| DB-1 | DConv-1_2 **/3 × 3 × 64 | 64 | | 36,928 + 128 |
| | DConv-1_1 /3 × 3 × 64 | 64 | | 36,928 |
| | Add-8 | - | | |
| | MConv **/3 × 3 × 64 | 4 | 350 × 350 × 4 | 2308 |
| | BN + ReLU | | | 8 |
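The index transfer between the encoder's max-pooling layers and the decoder's unpooling layers (which keeps feature map sizes consistent without cropping or learned upsampling) can be illustrated with a minimal NumPy sketch. This is a toy single-channel 2-D version for illustration only, not the authors' code:

```python
import numpy as np

def max_pool_2x2_with_indices(x):
    """2 x 2 max pooling that also records the argmax positions,
    as the encoder pooling layers do."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    indices = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            k = int(window.argmax())          # flat index within window
            pooled[i, j] = window.flat[k]
            indices[i, j] = (2 * i + k // 2) * w + (2 * j + k % 2)
    return pooled, indices

def max_unpool_2x2(pooled, indices, out_shape):
    """Places each pooled value back at its recorded position;
    all other locations stay zero (no learned interpolation)."""
    out = np.zeros(out_shape)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 6., 3., 2.],
              [7., 1., 4., 8.]])
p, idx = max_pool_2x2_with_indices(x)
u = max_unpool_2x2(p, idx, x.shape)
assert p.tolist() == [[4.0, 5.0], [7.0, 8.0]]
assert u[1, 0] == 4.0 and u[3, 3] == 8.0      # maxima restored in place
```

Because the unpooled map is exactly the pooled map's size doubled, with maxima restored at their original spatial locations, each decoder block recovers the spatial layout of its corresponding encoder block for free.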
Figure 4. Sample CXR images and ground truths from the Japanese Society of Radiological Technology (JSRT) dataset.
Figure 5. Data augmentation strategy used to artificially increase the training data; H-Flip represents a horizontal flip.
Figure 6. Training loss and accuracy curves (per epoch) for X-RayNet.
Figure 7. Examples of chest anatomical structure segmentation by X-RayNet for the JSRT dataset: (a) original CXR image; (b) ground-truth mask; (c) predicted mask by X-RayNet, showing false positives (FP, in black for each class), false negatives (FN, in yellow for each class), and true positives (TP, in blue, green, and red for the lung, heart, and clavicle bone classes, respectively). CTR_P and CTR_G represent the CTR predicted by the proposed method and the CTR obtained from the ground-truth mask, respectively.
Accuracies of X-RayNet and existing methods for the JSRT dataset (unit: %).
| Type | Method | Lungs Acc | Lungs J | Lungs D | Heart Acc | Heart J | Heart D | Clavicle Bones Acc | Clavicle Bones J | Clavicle Bones D |
|---|---|---|---|---|---|---|---|---|---|---|
| Handcrafted local feature-based methods | Peng et al. [ | 97.0 | 93.6 | 96.7 | - | - | - | - | - | - |
| | Candemir et al. [ | - | 95.4 | 96.7 | - | - | - | - | - | - |
| | Jangam et al. [ | - | 95.6 | 97.4 | - | - | - | - | - | - |
| | Wan Ahmed et al. [ | 95.77 | - | - | - | - | - | - | - | - |
| | Vital et al. [ | - | - | 95.9 | - | - | - | - | - | - |
| | Iakovidis et al. [ | - | - | 91.66 | - | - | - | - | - | - |
| | Chondro et al. [ | - | 96.3 | - | - | - | - | - | - | - |
| | Hybrid voting [ | - | 94.9 | - | - | 86.0 | - | - | 73.6 | - |
| | PC post-processed [ | - | 94.5 | - | - | 82.4 | - | - | 61.5 | - |
| | Human observer [ | - | 94.6 | - | - | 87.8 | - | - | 89.6 | - |
| | PC [ | - | 93.8 | - | - | 81.1 | - | - | 61.8 | - |
| | Hybrid ASM/PC [ | - | 93.4 | - | - | 83.6 | - | - | 66.3 | - |
| | Hybrid AAM/PC [ | - | 93.3 | - | - | 82.7 | - | - | 61.3 | - |
| | ASM tuned [ | - | 92.7 | - | - | 81.4 | - | - | 73.4 | - |
| | AAM whiskers BFGS [ | - | 92.2 | - | - | 83.4 | - | - | 64.2 | - |
| | ASM default [ | - | 90.3 | - | - | 79.3 | - | - | 69.0 | - |
| | AAM whiskers [ | - | 91.3 | - | - | 81.3 | - | - | 62.5 | - |
| | AAM default [ | - | 84.7 | - | - | 77.5 | - | - | 50.5 | - |
| | Mean shape [ | - | 71.3 | - | - | 64.3 | - | - | 30.3 | - |
| | Dawoud [ | - | 94.0 | - | - | - | - | - | - | - |
| | Coppini et al. [ | - | 92.7 | 95.5 | - | - | - | - | - | - |
| Deep feature-based methods | Dai et al. FCN [ | - | 92.9 | 96.3 | - | 86.5 | 92.7 | - | - | - |
| | Dong et al. [ | - | - | 95.5 | - | - | 90.2 | - | - | - |
| | Mittal et al. [ | 98.73 | 95.10 | - | - | - | - | - | - | - |
| | Oliveira et al. FCN [ | - | 95.05 | 97.45 | - | 89.25 | 94.24 | - | 75.52 | 85.90 |
| | Oliveira et al. U-Net [ | - | 96.02 | 97.96 | - | 89.21 | 94.16 | - | 86.54 | 92.58 |
| | Oliveira et al. SegNet [ | - | 95.54 | 97.71 | - | 89.64 | 94.44 | - | 87.30 | 93.08 |
| | Novikov et al. InvertedNet [ | - | 94.9 | 97.4 | - | 88.8 | 94.1 | - | 83.3 | 91.0 |
| | ContextNet-1 [ | - | 95.8 | - | - | - | - | - | - | - |
| | ContextNet-2 [ | - | 96.5 | - | - | - | - | - | - | - |
| | ResNet50 (512, C = 4) ~* [ | - | 93.9 | 96.8 | - | 88.3 | 93.7 | - | 79.4 | 88.3 |
| | ResNet50 (512, C = 4) * [ | - | 95.3 | 97.6 | - | 89.4 | 94.3 | - | 84.9 | 91.8 |
| | ResNet50 (512, C = 6) * [ | - | 94.5 | 97.2 | - | 89.3 | 94.3 | - | 84.3 | 91.5 |
| | ResNet50 (512, C = 8) * [ | - | 94.9 | 97.4 | - | 89.7 | 94.5 | - | 84.7 | 91.6 |
| | ResNet101 (512, C = 4) * [ | - | 95.3 | 97.6 | - | 90.4 | 94.9 | - | 85.2 | 92.0 |
| | ResNet50 (256, C = 4) * [ | - | 95.0 | 97.4 | - | 89.8 | 94.6 | - | 82.3 | 90.2 |
| | ResNet101 (256, C = 4) * [ | - | 94.9 | 97.4 | - | 90.1 | 94.7 | - | 79.6 | 88.5 |
| | BFPN [ | - | 87.0 | 93.0 | - | 82.0 | 91.0 | - | - | - |
| | OR-Skip-Net [ | 98.92 | 96.14 | 98.02 | 98.94 | 88.8 | 94.01 | 99.7 | 83.79 | 91.07 |
| | X-RayNet-1 (proposed method) | 99.06 | 96.65 | 98.29 | 99.16 | 90.99 | 95.22 | 99.8 | 88.72 | 93.94 |
| | X-RayNet-2 (proposed method) | 98.93 | 96.14 | 98.02 | 98.96 | 89.30 | 94.25 | 99.8 | 86.65 | 92.73 |
~ represents an experiment without data augmentation. * ResNet50 and ResNet101 are used as backbone networks for Mask-RCNN; 512/256 indicates an input image size of (512 × 512)/(256 × 256), and C represents the number of convolutional layers in the mask prediction branch of Mask-RCNN by Wang et al. [47]. Acc, accuracy; J, Jaccard index; D, Dice score.
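For reference, the three metrics reported in this and the following tables (Acc, Jaccard, and Dice) can be computed from binary masks as below. This is a standard-definition sketch with toy masks, not the evaluation code used in the paper:

```python
import numpy as np

def jaccard(pred, gt):
    """Jaccard index: intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def dice(pred, gt):
    """Dice score: 2|A intersect B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum())

def accuracy(pred, gt):
    """Fraction of pixels labeled correctly (TP + TN over all pixels)."""
    return (pred == gt).mean()

# Toy masks: intersection = 3 px, union = 4 px, |pred| = 3, |gt| = 4.
pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 1, 0], [1, 1, 0]], dtype=bool)
assert jaccard(pred, gt) == 0.75
assert dice(pred, gt) == 6 / 7
assert accuracy(pred, gt) == 5 / 6
```

Note that Dice is always at least as large as Jaccard for the same pair of masks, which is why the D columns above consistently exceed the J columns.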
Figure 8. Examples of X-Ray images from (a) the Montgomery County chest X-Ray set (MC) and (b) the Shenzhen chest X-Ray set (SC), with corresponding ground truths.
Figure 9. Examples of lung segmentation by X-RayNet for the MC dataset: (a) original image; (b) ground-truth mask; (c) segmented image by X-RayNet (TP in blue, FP in black, and FN in yellow).
Figure 10. Examples of lung segmentation by X-RayNet for the SC dataset: (a) original image; (b) ground-truth mask; (c) segmented image by X-RayNet (TP in blue, FP in black, and FN in yellow).
Accuracies of X-RayNet and other methods for the Montgomery County (MC) dataset (unit: %).
| Type | Method | Acc | J | D |
|---|---|---|---|---|
| Handcrafted local feature-based methods | Candemir et al. [ | - | 94.1 | 96.0 |
| | Peng et al. [ | 97.0 | - | - |
| | Vajda et al. [ | 69.0 | - | - |
| Learned/deep feature-based methods | Souza et al. [ | 96.97 | 88.07 | 96.97 |
| | Feature selection with BN [ | 77.0 | - | - |
| | Feature selection with MLP [ | 79.0 | - | - |
| | Feature selection with RF [ | 81.0 | - | - |
| | Feature selection and Vote [ | 83.0 | - | - |
| | Bayesian feature pyramid network [ | - | 87.0 | 93.0 |
| | X-RayNet-1 (proposed method) | 99.11 | 96.36 | 98.14 |
| | X-RayNet-2 (proposed method) | 98.72 | 94.96 | 97.40 |
* The results for [64] and [65] are taken from [2]. BN, batch normalization; MLP, multilayer perceptron; RF, random forest; Acc, accuracy; J, Jaccard index; D, Dice score.
Accuracies of X-RayNet and other methods for the Shenzhen X-ray set (SC) dataset (unit: %).
| Type | Method | Acc | J | D |
|---|---|---|---|---|
| Handcrafted local feature-based methods | Peng et al. [ | 97.0 | - | - |
| | Vajda et al. [ | 92.0 | - | - |
| Learned/deep feature-based methods | Feature selection with BN [ | 81.0 | - | - |
| | Feature selection with MLP [ | 88.0 | - | - |
| | Feature selection with RF [ | 89.0 | - | - |
| | Feature selection and Vote [ | 91.0 | - | - |
| | Bayesian feature pyramid network [ | - | 87.0 | 93.0 |
| | X-RayNet-1 (proposed method) | 97.70 | 91.82 | 95.64 |
| | X-RayNet-2 (proposed method) | 97.32 | 90.56 | 95.0 |
* The results for [64] and [65] are taken from [2]. Acc, accuracy; J, Jaccard index; D, Dice score.
Accuracies of X-RayNet trained on MC and tested on the SC dataset and vice versa (unit: %).
| Method | Train | Test | Acc | J | D |
|---|---|---|---|---|---|
| X-RayNet-1 | MC | SC | 96.27 | 87.74 | 93.24 |
| X-RayNet-1 | SC | MC | 98.10 | 92.52 | 96.06 |
Figure 11. Sample image of chest anatomy segmentation for the pixel count: (a) original image; (b) predicted mask by X-RayNet (FP in black for each class, FN in yellow for each class, and TP in blue, green, and red for the lung, heart, and clavicle bone classes, respectively); (c) procedure for calculating the CTR. CTR_P and CTR_G represent the CTR predicted by the proposed method and that obtained from the ground-truth mask, respectively.
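The CTR can be approximated from the predicted class masks by comparing the maximum transverse pixel widths of the heart and thoracic (lung) regions. The sketch below is a simplified pixel-count reading of the procedure with hypothetical toy masks; it ignores the clinical refinement of measuring the left and right cardiac extents from the midline separately, and it is not the authors' implementation:

```python
import numpy as np

def cardiothoracic_ratio(heart_mask, lungs_mask):
    """CTR approximated as maximum transverse cardiac width divided by
    maximum thoracic width, both measured in pixel columns of the
    respective binary class masks."""
    def max_width(mask):
        # Columns in which the class appears anywhere.
        cols = np.where(mask.any(axis=0))[0]
        return cols.max() - cols.min() + 1 if cols.size else 0
    return max_width(heart_mask) / max_width(lungs_mask)

# Hypothetical 1-pixel-tall masks: thorax spans 10 columns, heart spans 4.
lungs = np.zeros((1, 12), dtype=bool); lungs[0, 1:11] = True
heart = np.zeros((1, 12), dtype=bool); heart[0, 4:8] = True
assert cardiothoracic_ratio(heart, lungs) == 0.4
```

A CTR above roughly 0.5 is the conventional screening threshold for cardiomegaly, which is why accurate heart and lung boundaries matter more here than overall pixel accuracy.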