Jun Zhao1, Xiaosong Zhou1, Guohua Shi1, Ning Xiao1, Kai Song1, Juanjuan Zhao1, Rui Hao2, Keqin Li3. 1. College of Information and Computer, Taiyuan University of Technology, Taiyuan, China. 2. College of Information, Shanxi University of Finance and Economics, Taiyuan, China. 3. Hunan University and State University of New York, Albany, NY USA.
Abstract
Deep convolutional networks have been widely used for various medical image processing tasks. However, the performance of existing learning-based networks is still limited due to the lack of large training datasets. When a general deep model is directly deployed to a new dataset with heterogeneous features, the effect of domain shifts is usually ignored, and performance degradation problems occur. In this work, by designing the semantic consistency generative adversarial network (SCGAN), we propose a new multimodal domain adaptation method for medical image diagnosis. SCGAN performs cross-domain collaborative alignment of ultrasound images and domain knowledge. Specifically, we utilize a self-attention mechanism for adversarial learning between dual domains to overcome visual differences across modal data and preserve the domain invariance of the extracted semantic features. In particular, we embed nested metric learning in the semantic information space, thus enhancing the semantic consistency of cross-modal features. Furthermore, the adversarial learning of our network is guided by a discrepancy loss for encouraging the learning of semantic-level content and a regularization term for enhancing network generalization. We evaluate our method on a thyroid ultrasound image dataset for benign and malignant diagnosis of nodules. The experimental results of a comprehensive study show that the accuracy of the SCGAN method for the classification of thyroid nodules reaches 94.30%, and the AUC reaches 97.02%. These results are significantly better than the state-of-the-art methods.
Thyroid nodules, described as abnormal growths of glandular tissue, are the most common thyroid disorder [2]. Over the past 30 years, thyroid cancer has been one of the most prevalent and fastest-growing cancers of all types [15]. Therefore, early diagnosis of the benignity or malignancy of nodules is essential to reduce the morbidity and mortality of thyroid cancer [8]. Ultrasonography has become the most preferred choice for diagnosing benign and malignant thyroid nodules. However, there are still some challenges in analyzing thyroid ultrasound images. First, ultrasound images are susceptible to speckle noise and echo fluctuations, making the texture distribution in ultrasound images blurred and non-uniform [37]. Second, the diagnosis of thyroid ultrasound images is subjective and highly dependent on the physicians’ extensive experience and cognitive ability [14]. Conversely, the use of computer-aided diagnosis systems (CADs) can significantly reduce physicians’ workload and misdiagnosis rate. Thyroid image classification has become a research hotspot for computer-aided thyroid disease diagnosis [5].
Traditional methods of thyroid nodule classification
Acharya et al. [1] used the Gabor transform to extract the features of thyroid benign and malignant images and compared the classification performance of SVM, MLP, KNN, and C4.5 classifiers. Raghavendra et al. [26] extracted high-order spectral (HOS) entropy features from particle swarm optimization (PSO) and support vector machine (SVM) models and distinguished benign and malignant lesions. Prochazka et al. [23] used dual-threshold binary decomposition to extract direction-independent features for random forest (RF) and SVM classifiers. The traditional training method is computationally inexpensive and does not require a large number of training images. However, these methods still have apparent limitations: 1) they rely on many manually extracted image features and on classifier selection, 2) the process is tedious and unstable, and 3) they may generalize poorly.
Thyroid nodules classification method based on deep learning
Compared with traditional methods, deep learning methods can extract global and local features more accurately. In 2017, Ma et al. [16] applied a convolutional neural network for the first time to identify benign and malignant thyroid nodules. Wang et al. [36] designed an effective EM algorithm to train a CNN-based nodule classification model. Zhou et al. [50] proposed an online transfer learning (OTL) method to improve the diagnostic effect of ultrasound examination of thyroid nodules. Wang et al. [37] extracted multiple image features with different angles in one inspection for an attention-based feature aggregation network.
All the above methods are trained and evaluated on single-modality data. In contrast, actual medical imaging practice is expected to fuse data from different domains. Two problems arise in the construction of such models: 1) The scale of medical datasets remains a significant bottleneck for deep learning models. Data collection and manual annotation for each new modality or new domain are both time-consuming and expensive. For thyroid imaging in particular, few large-scale image datasets exist due to the specificity of the thyroid location. 2) Different types of data follow different distributions, a phenomenon known as dataset deviation or domain shift. Deep networks trained on a large labeled dataset therefore do not generalize well to new datasets and new tasks, which significantly degrades the generalization performance of the model.
We adopt a domain adaptation (DA) algorithm [38] to address the above challenges. A DA algorithm aims to learn a model from the source domain data distribution that also works well for target domains with a different but related data distribution. The principle behind DA is that the source and target domains can learn collaboratively and transfer their learned knowledge to each other during the entire training process, making the model robust to noise in the data.
Currently, there is no work on effectively using cross-modal data to construct a DA framework for nodule diagnosis in thyroid ultrasound images.
In general, the working pattern of the ultrasound physician is to combine information from both ultrasonography reports and ultrasound images and then to come up with a diagnosis. This working pattern stimulated our interest in exploring the content of the reports. We find that the performance of image generation and image classification tasks can be improved by transferring the semantic-intensive feature representation associated with the images in the reports. In contrast, existing models lack the reasoning ability to imitate a physician's interpretation of semantic information and ignore important domain and expert knowledge [41] related to the specific task of thyroid diagnosis. Therefore, in our approach, we incorporate disease keywords extracted from ultrasonography reports as textual information in multimodal data, as shown in Fig. 1.
Fig. 1
From left to right, the original ultrasound images of the four benign/malignant nodules randomly sampled in the dataset, the corresponding “Ultrasound Findings” in the ultrasonography report, and the domain knowledge are shown. Among them, the red text description is based on the relevant disease keywords selected by TI-RADS as the standard
In this paper, we propose a new multi-task cascaded deep learning framework for diagnosing thyroid ultrasound images. First, we propose a self-attention-based semantic consistency generative adversarial network as a domain adaptation backbone to improve the quality of generated images. Second, to jointly analyze multimodal data features, the critical domain knowledge extracted from ultrasonography reports is fed into the generator structure through text modeling to promote the semantic consistency of generated images. Finally, the network integrates a modified classification model, ResNet-50, which uses combined features to classify benign or malignant thyroid nodules in ultrasound images.
The main contributions of this paper are summarized as follows:
1) We propose an effective model: the semantic consistency generative adversarial network (SCGAN). To the best of our knowledge, this work is the first to apply cross-modal domain adaptation based on generative adversarial networks to the classification task of benign and malignant thyroid nodules.
2) We propose a new cross-modal alignment self-attention module (CASAM) to facilitate domain adaptation and achieve higher generative performance. The semantic alignment layer is used in CASAM to efficiently guide the semantic alignment process of image and knowledge features.
3) We introduce two advanced techniques: a visual discrepancy loss that dynamically balances the need for the generator to learn domain-invariant features, and a cross-domain fusion zero-centered gradient penalty (CD-GP) that is incorporated into the discriminator to synthesize images that are more realistic and semantically consistent with the knowledge.
4) Extensive experiments show that our proposed SCGAN achieves good results in the ultrasound image generation task for thyroid nodules and is well validated by the image classification results.
The rest of the paper is organized as follows. We present related work on domain adaptation, generative adversarial networks, and attention mechanisms for medical images in Section 2. The details of our approach are presented in Section 3. Section 4 describes our thyroid ultrasound image dataset and experimental evaluation results to validate the effectiveness of our approach. Finally, the conclusion and future work are given in Section 5.
Related work
Domain Adaptation
In the context of medical image analysis, most prospective studies on domain adaptation have focused on adjusting data distribution from various clinical centers, scanning protocols, and scanning sites. Dou et al. [6] pioneered a plug-and-play adversarial domain adaptation network (PnP-AdaNet), which combines multiple adversarial learning domain adaptation layers to spatially align the latent features of the target domain and the source domain. They tested it on cardiac MRI/CT images. Zhang et al. [49] introduced a collaborative unsupervised domain adaptation (CoUDA) algorithm for medical image diagnosis. This algorithm uses the collective intelligence of two peer subnets to conduct transferability-aware domain adaptation on whole-slide images (WSI) and microscopy images (MSI) of colon datasets. However, it is often difficult to find a source domain with the same feature and categorical space as the target domain. Therefore, this paper focuses on more realistic and challenging scenarios to address the correlation problems of cross-domain data observed in different feature spaces, namely heterogeneous domain adaptation (HDA) [44].
Generative adversarial network
Learning domain-invariant representations for classification tasks from the source dataset to the target dataset has been extensively studied [38] using generative adversarial networks [45]. Chen et al. [7] investigated the domain adaptation framework SIFA, which applies a deep supervision mechanism of synergistic image and feature alignment to handle domain shift through adversarial learning, and conducted extensive experiments on bidirectional cross-modality adaptation on multiple tasks. Ren et al. [27] considered the joint feature distribution between the source and target domain images and classified histological images obtained in different staining procedures via adversarial learning. Gu et al. [11] explored a two-step progressive transfer learning technique to improve the recognition performance of cross-domain skin diseases and, at the same time, adopted cycle-consistent adversarial learning to extend the model to cross-modal learning tasks such as melanoma detection.
Attention Mechanism
Although existing adversarial domain adaptation methods are effective in different tasks, the semantic correlation between domains has not been elucidated yet. Nowadays, the attention mechanism has become a necessary element for capturing the inter-domain dependencies of a model. Wang et al. [40] added transferable attention for the domain adaptation (TADA) model and focused its application on core regions to enhance the transferability of images. Wang et al. [34] argued that complementing the attention branch in the Thorax-Net enhances the correlation between class labels and pathological abnormal locations. Furthermore, the three attention modules [35] can be merged into a unified framework for joint learning of channels, elements, and scales. For thyroid ultrasound nodule diagnosis, we demonstrate an improved version of a well-established self-attention mechanism that further improves diagnostic performance by helping to localize important regions of ultrasound images and enhancing the correlation of cross-domain features.
Methods
This section illustrates the proposed semantic consistency generative adversarial network (SCGAN) for ultrasound image nodule classification. First, we introduce the selection criteria of domain knowledge and the processing of its integration into deep networks. Second, we present the overall structure of SCGAN, including the composition of the generator and discriminator, and focus on the contribution of the cross-modal alignment self-attention module to semantic consistency. Then, we explain the proposed visual discrepancy loss and regularization method. Finally, we give details of the modifications of the classifier.
Domain knowledge
Ultrasonography report preprocessing
The ultrasonography report [13] summarizes all clinical findings and physician impressions identified during the ultrasound examination. Ultrasonography reports usually contain comprehensive patient information, but they may also contain descriptions that are inconclusive or irrelevant to the disease. For example, in the “Ultrasound Findings” of the ultrasonography report, as shown in Fig. 1, normal/abnormal conditions are recorded for each site of the thyroid examination, such as location, size, and severity of the nodules. Besides, patients’ personal information, medical history, and suspicious findings may lead to additional or follow-up studies. Therefore, parsing the content of ultrasonography reports is a complex and challenging task.
The Thyroid Imaging Reporting and Data System (TI-RADS) [32] provides standardized terminology to describe thyroid nodule features in ultrasound images. Using TI-RADS as a guide, we screen the disease keywords in ultrasonography reports, such as boundary, calcification, and echo pattern, as domain knowledge. By learning text embedding, this domain knowledge can facilitate the acquisition of semantic information in ultrasonography reports and improve the diagnostic performance of the leading classification tasks.
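The keyword screening step can be illustrated with a minimal sketch. The keyword list below contains only the three terms named above and is a hypothetical stand-in for a full TI-RADS vocabulary; the function name `screen_keywords`, the substring matching rule, and the sample report are likewise assumptions made for illustration, not the procedure used in this work.

```python
# Hypothetical keyword list: only the three terms named in the text.
TIRADS_KEYWORDS = ("boundary", "calcification", "echo pattern")

def screen_keywords(findings, keywords=TIRADS_KEYWORDS):
    """Return the keywords found in a findings text, ordered by first appearance."""
    text = findings.lower()
    # Plain substring matching, so "microcalcifications" counts as "calcification".
    hits = [(text.find(kw), kw) for kw in keywords if kw in text]
    return [kw for _, kw in sorted(hits)]

# Invented example sentence, not taken from the dataset.
report = "Hypoechoic nodule in the left lobe with unclear boundary; microcalcifications are seen."
print(screen_keywords(report))
```

Substring matching is deliberately crude; a practical system would also need to handle negation and synonyms.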
LSTM for Text Modeling
We use a pre-trained text encoder ϕ to learn the semantic information described by domain knowledge. Each textual description t is encoded as one-hot vectors that are then mapped to embeddings and enriched with contextual information. The text embedding ϕ(t) is fed into the LSTM proposed by [10]. At each time step, the LSTM takes the current element of the text embedding sequence as input and iteratively applies its transfer function to generate the hidden state h:
which allows the extraction of high-dimensional semantic vectors from domain knowledge. Domain knowledge contains meaningful disease features, and the key is to maintain diversity and independence among them. To this end, we extract the hidden state corresponding to each disease keyword and obtain a text representation sequence. The advantage of this strategy is that it enables the network to select relevant semantic features adaptively, ensuring that they are helpful for disease labeling (as shown in the experimental results).
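The recurrence described above can be sketched with a standard single-layer LSTM cell written in NumPy. This is a generic LSTM forward pass under stated assumptions, not the pre-trained encoder used in this work: the weights are random, the sizes are toy values, and the keyword positions in `keyword_positions` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_hidden_states(embeddings, W, U, b):
    """Run a standard LSTM over embeddings of shape (T, d); return hidden states (T, h)."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x in embeddings:
        # W: (4h, d), U: (4h, h), b: (4h,) hold the four gates stacked together.
        gates = W @ x + U @ h + b
        i, f, o, g = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell state update
        h = sigmoid(o) * np.tanh(c)                   # hidden state update
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d, hdim = 5, 4, 3                      # toy sizes, chosen only for the demo
embeddings = rng.normal(size=(T, d))
W = rng.normal(size=(4 * hdim, d))
U = rng.normal(size=(4 * hdim, hdim))
b = np.zeros(4 * hdim)

states = lstm_hidden_states(embeddings, W, U, b)
keyword_positions = [1, 3]                # hypothetical token indices of disease keywords
text_repr = states[keyword_positions]     # one hidden state per keyword
print(text_repr.shape)
```

Selecting one hidden state per keyword, rather than only the final state, keeps a separate vector for each disease feature.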
Our network architecture is shown in Fig. 2. It consists of a pre-trained text encoder, a domain adaptation generator, and a discriminator. The generator is trained to generate images from the text describing the content, and the discriminator is trained to determine the authenticity of the images conditional on the semantics defined by the given text. We use the following notation: the domain adaptation generator G maps a noise input and an embedded text representation to an image, and the discriminator D maps an image and an embedded text representation to a decision in {0,1}.
Fig. 2
Overview of our proposed SCGAN, consisting of a text encoder (top left), a domain adaptation generator G (bottom left), a discriminator D (top right), and a classifier (bottom right). G has two inputs, the noise vector Z and the text representation generated by the text encoder, both of which are fed into upblocks (gray boxes) for cross-domain fusion. The CASAM contained in the upblocks promotes semantic alignment during the fusion process. Similarly, D distinguishes the authenticity of an image through a series of downblocks (gray boxes), taking the synthesized image and the real image as input. The classifier is the modified classification model ResNet-50. The losses shown are the hinge version of the adversarial loss, the visual discrepancy loss, and the cross-domain fusion zero-centered gradient penalty function
The generator G has two inputs: the text representation sequence of the source domain and a noise vector Z sampled from a Gaussian distribution, which guarantees the diversity of the generated images. First, Z is fed into a fully connected layer and then sent through a series of upblocks, which upsample the image features and integrate semantic information with image features during the image generation process. G uses upblocks as its network backbone; each includes convolutional layers, a self-attention layer, residual blocks, and an upsample layer. The self-attention layer brings more non-linearity to G, which is conducive to generating semantically consistent images from different textual descriptions. G thus synthesizes realistic pseudo-target-domain images from Z and the text representation. The synthesized image is then regularized using the visual discrepancy loss to be consistent with the corresponding region in the original image.
The discriminator D competes with G by distinguishing between the synthetic pseudo-target-domain image and the real target-domain image. D converts the input image into a feature map and downsamples it through a series of downblocks. Here, the intermediate layers of D have a smaller receptive field that forces G to pay more attention to finer details. The last few layers generally derive information from a larger image region and guide G to produce an image with better global consistency. Then the text representation is replicated and spliced onto the image features. Formally, D has to distinguish three kinds of input pairs that include text: real images with matching text, real images with mismatched text, and synthetic images.
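The replicate-and-splice step in D can be sketched as follows. This is a minimal NumPy illustration assuming a single text vector and a channels-last feature map; the function name `splice_text` and all shapes are hypothetical and chosen only for the demo.

```python
import numpy as np

def splice_text(image_features, text_vec):
    """Tile a text vector over every spatial location and concatenate along channels."""
    h, w, _ = image_features.shape
    tiled = np.broadcast_to(text_vec, (h, w, text_vec.shape[0]))
    return np.concatenate([image_features, tiled], axis=-1)

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 8))   # toy downsampled image features (H, W, C)
text = rng.normal(size=(6,))        # toy text representation
joint = splice_text(feat, text)
print(joint.shape)                  # (4, 4, 14)
```

After splicing, every spatial position carries the same text vector, so later convolutions can score image content against the text at each location.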
B. Cross-modal alignment self-attention module
To handle the data heterogeneity between the source-domain text representation and the target-domain images, we propose the cross-modal alignment self-attention module (CASAM). The self-attention module efficiently computes long-range dependencies between features, allowing the generator to model the relationship between widely separated spatial regions effectively. CASAM leverages semantic association to guide the alignment process while generating attention over important image features and the text representation, which provides a more prominent and meaningful embedding for image generation tasks.
As shown in Fig. 3, the module accepts two inputs: the image feature map F and the text representation sequence. First, following the attention mechanism adopted in AttnGAN [42], the three-dimensional image features F (width × height × channel) are flattened into a two-dimensional sequence (wh × channel, where wh = width × height) and transformed into the query feature map Q to facilitate the calculation of attention. The formula is as follows:
Fig. 3
Details of the cross-modal alignment self-attention module (CASAM). The semantic alignment layer can focus on the source domain features corresponding to the target domain pixels. denotes dot product operation
Two convolution layers with 1 × 1 filters are applied to the text representation sequence to generate feature maps K and V, respectively:
Intuitively, the key K focuses on matching with Q, while the other projection value V can be better optimized to refine Q to obtain better F.
We add a semantic alignment layer (SAL) to the module to strengthen the semantic relevance between Q and K by metric learning [22]. Here, we use:
as the geometric similarity to measure the relationship between the latent feature spaces of Q and K. To build a reasonable distance metric [4, 19, 22], the cosine similarity [9, 18, 33] is chosen in this paper as:
The cosine similarity focuses on the similarity description of semantic classes. For the feature vector of each image subregion of Q, the better the alignment, the shorter the distance.
Attention weights over the feature maps Q and K are generated to achieve a more discriminative feature representation, and the attention map A is obtained as:
The aggregation operation is defined as follows:
where the more refined features are captured by the dot product between A and V for feature adaptation. The obtained attention weights are normalized using the softmax function to convert the values into relative probabilities. The features are updated by collecting the attention weights of each acquired feature and the original feature mapping to obtain contextual information.
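The steps of the module (flattening, projection to Q, K, and V, cosine similarity, softmax normalization, and aggregation) can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the exact CASAM formulation: K and V are assumed to be projected from the text representation sequence, a 1 × 1 convolution is written as a per-position linear map, the projection matrices are random, and the residual combination with the original feature map is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(F, T, Wq, Wk, Wv, eps=1e-8):
    """F: (H, W, C) image features; T: (L, D) text sequence (assumed source of K and V)."""
    H, W, C = F.shape
    Q = F.reshape(H * W, C) @ Wq            # flatten to (HW, C), then project to (HW, d)
    K = T @ Wk                              # (L, d)
    V = T @ Wv                              # (L, d)
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + eps)
    sim = Qn @ Kn.T                         # cosine similarity, shape (HW, L)
    A = softmax(sim, axis=1)                # attention over text positions per pixel
    out = A @ V                             # aggregate text values for each pixel
    return out.reshape(H, W, -1), A, sim

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 4, 8))              # toy image feature map
T = rng.normal(size=(5, 6))                 # toy text representation sequence
Wq = rng.normal(size=(8, 16))
Wk = rng.normal(size=(6, 16))
Wv = rng.normal(size=(6, 16))
refined, A, sim = cross_modal_attention(F, T, Wq, Wk, Wv)
print(refined.shape, A.shape)
```

Because cosine similarity is bounded in [-1, 1], the softmax here is relatively flat; an implementation may scale the similarities before normalization, which is not specified in the text.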
C. Visual discrepancy loss
We propose a new visual discrepancy loss for the generator. The visual discrepancy loss encourages the generator to capture disparity features. Without it, the requirement for G to learn domain-invariant information is weaker. Thus, co-training with the visual discrepancy loss is an implicit facilitator for improving network adaptation and plays a crucial role in improving the quality and consistency of the final generated images. The loss is defined as the L2 norm of the difference between the feature maps of the real image and the generated image:
where ϕ(⋅) represents the process of extracting image feature maps.
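A minimal numeric sketch of this loss is given below. The feature extractor ϕ is replaced by a hypothetical stand-in (2 × 2 average pooling), since the actual extractor is a learned network; whether the norm is squared or averaged over a batch is not specified here, so the plain L2 norm is assumed.

```python
import numpy as np

def extract_features(img):
    """Hypothetical stand-in for the feature extractor: 2x2 average pooling."""
    h, w = img.shape
    return img[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def visual_discrepancy_loss(real_img, fake_img):
    """L2 norm of the difference between the feature maps of a real and a generated image."""
    diff = extract_features(real_img) - extract_features(fake_img)
    return float(np.sqrt(np.sum(diff ** 2)))

rng = np.random.default_rng(0)
real = rng.normal(size=(64, 64))
fake = real + 0.1 * rng.normal(size=(64, 64))   # a slightly perturbed "generated" image
print(visual_discrepancy_loss(real, fake))
```

The loss is zero only when the two feature maps coincide, so minimizing it pulls the generated image toward the real one in feature space.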
D. Cross-domain fusion zero-centered gradient penalty
Recently, Mescheder et al. [17] introduced a zero-centered gradient penalty, a regularization term that penalizes the gradient of the discriminator with respect to its input. Extending it to our domain adaptation task, we propose a cross-domain fusion zero-centered gradient penalty (CD-GP) function to improve the discriminator’s generalization capability. We impose penalty terms on the real and generated data, respectively:
where α and β are hyperparameters that balance the effect of the gradient penalty and cannot both be zero.
Compared with adding discriminators to ensure the semantic consistency of the generated images, our CD-GP does not introduce additional networks to compute the semantic similarity and therefore does not increase the complexity of the domain adaptation process or the number of training parameters.
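The idea of penalizing input gradients on both real and generated data can be sketched with a toy discriminator whose gradient is available in closed form. Everything specific here is an assumption: the discriminator D(x) = tanh(w · x) is a hypothetical stand-in, the squared gradient norm follows the common zero-centered form, and the text conditioning of CD-GP is left out.

```python
import numpy as np

def grad_D(x, w):
    """Analytic input gradient of the toy discriminator D(x) = tanh(w . x)."""
    s = np.tanh(x @ w)                        # (N,)
    return (1.0 - s ** 2)[:, None] * w        # (N, d)

def zero_centered_penalty(real, fake, w, alpha=1.0, beta=1.0):
    """alpha * E||grad D(real)||^2 + beta * E||grad D(fake)||^2 (assumed form)."""
    if alpha == 0 and beta == 0:
        raise ValueError("alpha and beta cannot both be zero")
    pen_real = np.mean(np.sum(grad_D(real, w) ** 2, axis=1))
    pen_fake = np.mean(np.sum(grad_D(fake, w) ** 2, axis=1))
    return float(alpha * pen_real + beta * pen_fake)

rng = np.random.default_rng(0)
w = rng.normal(size=(8,))
real = rng.normal(size=(16, 8))
fake = rng.normal(size=(16, 8))
print(zero_centered_penalty(real, fake, w, alpha=1.0, beta=0.5))
```

In a deep learning framework the gradient would come from automatic differentiation rather than a closed form; the penalty itself adds no extra network.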
E. Objective function
To stabilize the training process of SCGAN and help it converge, inspired by the SAGAN architecture [47], we evaluate the authenticity of the generated images and their consistency with the input semantics by minimizing the hinge version of the adversarial loss [3]. Formally, D has two outputs: the unconditional image score and the conditional image score. Correspondingly, D has an unconditional and a conditional objective function:
The expectations in these objectives are taken over the real data distribution, the generated data distribution, and the mismatching data distribution.
On the other side, G is trained to generate images that trick D into giving high scores, that is, images that are visually realistic and match the text. Similarly, G minimizes an unconditional and a conditional objective function:
Taking into account the adversarial loss, the visual discrepancy loss, and the cross-domain fusion zero-centered gradient penalty, our total loss is defined as the weighted sum of these losses, as follows:
λ1 and λ2 are regularization parameters that balance the trade-off between the visual discrepancy loss, the gradient penalty, and the other terms.
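For reference, the commonly used hinge adversarial loss and a weighted total loss can be sketched as follows. The hinge terms follow the standard form; how the mismatched-text pairs enter the objective, and which term each of λ1 and λ2 multiplies, are assumptions made for illustration. The example weights are arbitrary.

```python
import numpy as np

def d_hinge_loss(real_scores, fake_scores, mismatch_scores=None):
    """Standard hinge loss for D; mismatched pairs are treated like fakes (an assumption)."""
    loss = np.mean(np.maximum(0.0, 1.0 - real_scores))
    loss += np.mean(np.maximum(0.0, 1.0 + fake_scores))
    if mismatch_scores is not None:
        loss += np.mean(np.maximum(0.0, 1.0 + mismatch_scores))
    return float(loss)

def g_hinge_loss(fake_scores):
    """Standard hinge loss for G: raise D's score on generated images."""
    return float(-np.mean(fake_scores))

def total_loss(adv, visual_discrepancy, gradient_penalty, lam1, lam2):
    """Weighted sum of the three loss terms (assignment of lam1 and lam2 is assumed)."""
    return adv + lam1 * visual_discrepancy + lam2 * gradient_penalty

real_scores = np.array([2.0, 0.5])
fake_scores = np.array([-2.0, 0.0])
print(d_hinge_loss(real_scores, fake_scores))   # 0.75
print(g_hinge_loss(fake_scores))                # 1.0
```

The hinge form stops rewarding D once a sample is scored beyond the margin of 1, which tends to keep its gradients informative for G.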
Modified classification model ResNet-50
Each residual block of the ResNet-50 [12] network uses a bottleneck structure, which helps overcome the problem of gradient disappearance in large models. To adapt the ResNet-50 network to our problem of classifying benign and malignant nodules, the base layer of the model is frozen, and then custom layers are added to form the final framework. Therefore, we remove its last fully connected layer and add three fully connected layers of 2048, 1024, and 2 neurons, respectively. The weights of the final fully connected layer are fine-tuned by using a back-propagation technique which uses a gradient descent optimization algorithm to minimize the cost function. The final output of the model is obtained using the sigmoid activation function.
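The custom head described above can be sketched as a plain forward pass. The frozen backbone is replaced by random 2048-dimensional vectors (the size of the pooled ResNet-50 output); the ReLU activations between the hidden layers and the weight initialization are assumptions, as the text specifies only the layer sizes and the final sigmoid.

```python
import numpy as np

def init_head(in_dim=2048, sizes=(2048, 1024, 2), seed=0):
    """Create weights for three fully connected layers of 2048, 1024, and 2 neurons."""
    rng = np.random.default_rng(seed)
    params, d = [], in_dim
    for n in sizes:
        scale = np.float32(np.sqrt(2.0 / d))          # He-style scaling (assumed)
        W = rng.standard_normal((d, n), dtype=np.float32) * scale
        params.append((W, np.zeros(n, dtype=np.float32)))
        d = n
    return params

def head_forward(features, params):
    """Apply the head: ReLU on hidden layers (assumed), sigmoid on the output layer."""
    x = features
    for W, b in params[:-1]:
        x = np.maximum(0.0, x @ W + b)
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

rng = np.random.default_rng(1)
backbone_features = rng.standard_normal((4, 2048), dtype=np.float32)  # stand-in for frozen ResNet-50 output
params = init_head()
probs = head_forward(backbone_features, params)
print(probs.shape)   # (4, 2)
```

Only the head parameters would receive gradient updates during fine-tuning; the backbone stays fixed.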
Result
Datasets
Our study uses a dataset provided by the local hospital, comprising ultrasound examination images and ultrasonography reports of 1083 patients. The hospital institutional review board approved the entire collection process. Due to the variable size of nodules, we exclude nodules with tumor size < 0.60 cm or > 3.00 cm and include 1937 nodules from ultrasound examinations in the final analysis. Their available ultrasonography reports correlate with the ultrasound findings of 867 patients. Ultrasound images are screened by experienced thyroid ultrasound physicians (physicians with more than eight years of experience in thyroid ultrasound imaging) based on suspicious features in TI-RADS: solid components, hypoechoic or markedly hypoechoic appearance, microgranular or irregular margins, microcalcifications, and ultra-wide shapes. “Ultrasound Findings” are classified into two categories: benign or malignant. There are 1032 benign nodules and 905 malignant nodules. We evaluate the model independently using nested 10-fold cross-validation. The dataset is divided into a training set, a validation set, and a test set. The training and test sets are divided by patient, and there is no overlap between the two. The training set and the validation set together consist of 1800 images, with approximately 10% of these held out as the validation set. The test set consists of 137 images. The dataset distribution is shown in Table 1.
Table 1
Distribution of data in our dataset
                 Benign   Malignant   Total
Training Set     865      754         1619
Validation Set   97       84          181
Test Set         70       67          137
Total            1032     905         1937
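A patient-level split, in which all images of one patient fall on the same side, can be sketched as follows. The function name, the image-to-patient mapping, and the test fraction are hypothetical; the counts in `TABLE_1` are copied from Table 1 and used only to check that the rows and columns add up.

```python
import random

def split_by_patient(image_to_patient, test_fraction=0.1, seed=0):
    """Split images so that no patient contributes to both the training and test sets."""
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [img for img, p in image_to_patient.items() if p not in test_patients]
    test = [img for img, p in image_to_patient.items() if p in test_patients]
    return train, test

# Toy mapping: 20 images from 10 hypothetical patients, two images each.
image_to_patient = {f"img{i}": f"p{i // 2}" for i in range(20)}
train, test = split_by_patient(image_to_patient, test_fraction=0.2)

# (benign, malignant) counts per split, as listed in Table 1.
TABLE_1 = {"train": (865, 754), "val": (97, 84), "test": (70, 67)}
benign_total = sum(b for b, _ in TABLE_1.values())
malignant_total = sum(m for _, m in TABLE_1.values())
print(len(train), len(test), benign_total, malignant_total)
```

Splitting by patient rather than by image prevents near-duplicate views of the same nodule from leaking between training and test data.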
Ultrasound Image Preprocessing
As shown in Fig. 4, to extract regions of interest (ROIs) containing nodules, the metadata text (e.g., information about the scanner, location, and patient) placed on the images is discarded to obtain the actual ultrasound image regions. We measure the horizontal and vertical diameters of all nodules so that the nodule with the cross marker symbols is in the center of the patch image. Finally, each patch is zero-padded to a square of 64 × 64 pixels to maintain the image aspect ratio, and the pixels in the image are normalized to zero mean and unit variance.
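The padding and normalization steps can be sketched in NumPy as follows. The sketch assumes a grayscale patch no larger than 64 × 64 pixels that is centered within the padded square; how larger patches are handled is not covered here, and per-image (rather than dataset-wide) normalization is an assumption.

```python
import numpy as np

def pad_to_square(patch, size=64):
    """Zero-pad a 2-D patch to size x size, keeping the patch centered."""
    h, w = patch.shape
    if h > size or w > size:
        raise ValueError("patch is larger than the target size; resize it first")
    top = (size - h) // 2
    left = (size - w) // 2
    out = np.zeros((size, size), dtype=float)
    out[top:top + h, left:left + w] = patch
    return out

def normalize(img, eps=1e-8):
    """Normalize pixels to zero mean and unit variance."""
    return (img - img.mean()) / (img.std() + eps)

patch = np.ones((30, 50))            # toy ROI with a non-square aspect ratio
padded = pad_to_square(patch)
normalized = normalize(padded)
print(padded.shape, round(float(normalized.mean()), 6), round(float(normalized.std()), 6))
```

Padding instead of stretching keeps the nodule's width-to-height ratio intact, which matters because shape is itself a diagnostic feature.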
Fig. 4
The overall process of image preprocessing. The blue dashed line indicates the vertical diameter of the nodule, and the green dashed line indicates the horizontal diameter
Evaluation metrics
Classification results are quantitatively evaluated by the mean and standard deviation of the obtained accuracy, sensitivity/recall, specificity, and area under the receiver operating characteristic curve (AUC). In this paper, the inception score (IS) [31] is chosen to measure the quality of the images generated by SCGAN. IS is a classical metric for evaluating GANs. Since IS does not reflect whether the generated images depend well on the given text representation, we combine it with physician evaluation. The semantic consistency of SCGAN is evaluated by experienced ultrasound physicians, who compare the generated images with the corresponding domain knowledge description. The physicians perform two tasks: one is to judge the authenticity of the image and determine whether the image matches the corresponding semantic information; the other is to diagnose the benignity or malignancy of the nodule.
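The classification metrics can be computed from predictions as in the following sketch. Malignant is taken as the positive class (label 1), which is an assumption, and AUC is computed with the pairwise ranking definition; the labels and scores are invented toy values.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(y_true, scores):
    """AUC as the probability that a positive is scored above a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.8, 0.1]
metrics = classification_metrics(y_true, y_pred)
print(metrics, auc(y_true, scores))
```

Reporting sensitivity and specificity alongside accuracy separates missed malignant nodules from false alarms on benign ones.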
Implementation details
The entire network is implemented using the TensorFlow framework based on Python 3.6 and trained on a workstation with Ubuntu 18.04 LTS, a 2.90 GHz Intel(R) Xeon(R) W-2102 CPU, and two NVIDIA GTX Titan XP GPUs. For the text encoder, the dimension is set to 128, and the length of words is set to 30. In order to compare with previous work, the parameters of our text encoder are fixed during training. In the generator, the dimension of is set to 512. In the experiments, the network is trained using the Adam optimizer with β1 = 0.9 and β2 = 0.999. On our dataset, training runs for 300 epochs with a minibatch size of 16. We choose a learning rate of 1e-3 for the classifier and 2e-4 for the rest of the architecture. The learning rate decays by a factor of 0.5 every 100 epochs. In the testing phase, the target domain image enters the classifier directly for classification, without involving the GAN or other algorithm components.
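The learning rate schedule can be sketched as a step decay. Whether the decay is applied as discrete steps at epochs 100 and 200 or as a smooth curve is not stated, so the stepwise form below is an assumption; the function name is hypothetical.

```python
def learning_rate(base_lr, epoch, decay=0.5, step=100):
    """Step decay: multiply the base learning rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

for epoch in (0, 99, 100, 200, 299):
    # Classifier uses a base rate of 1e-3, the rest of the architecture 2e-4.
    print(epoch, learning_rate(1e-3, epoch), learning_rate(2e-4, epoch))
```

Over 300 epochs, this yields three plateaus for each part of the network.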
Experimental setup and analysis
For our proposed method, we set up three variants: (1) remove the CASAM of SCGAN and the two additional loss functions, that is, directly concatenate the text representation with the image features in G (DAGAN); (2) use CASAM to fuse the text representation and image features in G, to test the contribution of CASAM in improving domain adaptation (the CASAM-only variant of SCGAN); (3) remove only the visual discrepancy loss of SCGAN (the variant of SCGAN without the discrepancy loss). The effectiveness of our proposed method is demonstrated through the experiments described below.
In our research, the SCGAN model is based on an intelligent combination of knowledge and images. To evaluate the advantages of this cross-modal domain adaptation approach for feature extraction, we first construct a GAN model for nodule feature extraction using only images as input. The experimental results obtained by different classification methods are shown in Table 2 and Fig. 5. Here, the SCGAN model is simplified to DCGAN [25] when no domain knowledge is added, and only unimodal data is used for feature extraction. The accuracy, sensitivity, specificity, and AUC obtained using the DCGAN+modified ResNet-50 model are 85.26 ± 1.62, 87.46 ± 3.14, 83.14 ± 1.69, and 84.80 ± 2.51, respectively. By adding class labels to DCGAN as auxiliary information to form ACGAN [20], the ACGAN+modified ResNet-50 model shows a slight improvement in all metrics. However, the classification performance of the above methods is far inferior to that of the GAN models using a multimodal combination of domain knowledge and images. Among them, the DAGAN model, the most basic variant of SCGAN, has metrics of 89.93 ± 0.88, 91.34 ± 1.93, 88.57 ± 0.90, and 92.98 ± 1.78, respectively. Compared with DAGAN, the metrics of SCGAN improve by a further 4.37, 2.09, 6.57, and 4.04 percentage points, respectively. This suggests that integrating the domain knowledge from ultrasonography reports into the deep learning model can effectively improve the classification performance of nodules.
It can also be concluded that the standard deviation of the classification results is smaller when domain knowledge is used, which means that the inclusion of domain knowledge can effectively improve the stability of nodule classification. In addition, to verify the classification stability of SCGAN on unbalanced samples, we randomly reduce the number of malignant nodules by half while keeping the parameters of the pre-trained model fixed; this setting is denoted as SCGAN⋆. On the unbalanced dataset, each metric fluctuates only slightly, and the classification performance of SCGAN remains excellent.
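The unbalanced-sample setting described above (randomly halving the malignant class) can be sketched as follows. This is only an illustration under stated assumptions: the helper `halve_class` is hypothetical, the label convention (1 = malignant) is assumed, and the sample counts are taken from Table 9 purely for the example, since the source does not say to which split the reduction was applied.

```python
import random


def halve_class(samples, labels, target_label, seed=0):
    """Randomly drop half of the samples carrying target_label; keep all others."""
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target_label]
    kept_target = set(rng.sample(target_idx, len(target_idx) // 2))
    keep = [i for i, y in enumerate(labels) if y != target_label or i in kept_target]
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Illustrative counts: 905 malignant (label 1) and 1032 benign (label 0).
labels = [1] * 905 + [0] * 1032
samples = list(range(len(labels)))          # stand-ins for image identifiers
sub_samples, sub_labels = halve_class(samples, labels, target_label=1)
print(sub_labels.count(1), sub_labels.count(0))
```

Fixing the random seed keeps the reduced subset reproducible across runs, which matters when the mean and standard deviation of the metrics are compared between the balanced and unbalanced settings.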
Table 2
Comparison of the classification performance of different image generation models with our SCGAN

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC (%) |
| --- | --- | --- | --- | --- |
| DCGAN [25] | 85.26 ± 1.62 | 87.46 ± 3.14 | 83.14 ± 1.69 | 84.80 ± 2.51 |
| ACGAN [20] | 87.45 ± 1.33 | 90.45 ± 2.76 | 84.57 ± 1.39 | 90.39 ± 2.09 |
| DAGAN | 89.93 ± 0.88 | 91.34 ± 1.93 | 88.57 ± 0.90 | 92.98 ± 1.78 |
| SCGAN⋆ | 94.26 ± 0.63 | 94.37 ± 0.20 | 94.89 ± 0.53 | 96.79 ± 0.53 |
| SCGAN | 94.30 ± 0.48 | 93.43 ± 0.35 | 95.14 ± 0.42 | 97.02 ± 0.57 |
Fig. 5
ROC analysis of different image generation models with our SCGAN and its variants for thyroid nodule classification
Ablation Study
To evaluate whether the self-attention mechanism helps the domain adaptation process generate higher-quality and semantically consistent images, we use both direct concatenation (i.e., DAGAN) and CASAM alignment (i.e., the CASAM variant of SCGAN) for the cross-domain fusion of the text representation and images in G. Compared with DAGAN, the CASAM variant further improves the quantitative performance, as shown in Tables 3 and 5, indicating that aligning the domains in a brute-force manner does not resolve the strong heterogeneity that exists between them. DAGAN is essentially a pixel-level superposition of data from two different modalities. Mixing data that originate from different imaging principles affects the feature extractor's judgment of the feature distribution of the target data. Conversely, CASAM does not affect the independence of the feature distribution of each domain. In particular, the semantic alignment layer calculates the similarity between the generated image and the textual description before generating new image features. It can discover the semantic relationship between each pixel and the words, mapping the image features to the corresponding fine-grained text representation.
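The pixel-to-word similarity computed by a semantic alignment layer can be illustrated with a generic cross-modal attention step. The sketch below is not the CASAM implementation, whose exact form is not given in this section: it assumes dot-product similarity, a softmax over words, and image features already projected to the 128-dimensional text space. The 16 × 16 feature map and the function name `word_pixel_attention` are hypothetical; the 30 words and 128 dimensions follow the implementation details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def word_pixel_attention(image_feats, word_feats):
    """Attend from every image location to every word.

    image_feats: (N, D) array, one D-dim feature per pixel/location.
    word_feats:  (T, D) array, one D-dim embedding per word.
    Returns (context, attn): context is (N, D), attn is (N, T).
    """
    sim = image_feats @ word_feats.T   # (N, T) pixel-word similarity
    attn = softmax(sim, axis=1)        # each pixel distributes weight over the words
    context = attn @ word_feats        # (N, D) word context aligned to each pixel
    return context, attn


rng = np.random.default_rng(0)
image_feats = rng.normal(size=(16 * 16, 128))   # hypothetical 16x16 feature map
word_feats = rng.normal(size=(30, 128))         # 30 words, 128-dim text encoder
context, attn = word_pixel_attention(image_feats, word_feats)
print(context.shape, attn.shape)
```

Because each modality keeps its own feature tensor and only the attention weights couple them, this style of fusion leaves the per-domain feature distributions separate, in contrast to direct concatenation.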
Table 3
Classification performance comparison of our SCGAN and its variants
Also, we quantitatively and qualitatively investigate the effects of the two additional loss terms. Compared with the CASAM-only variant, the variant that removes only the visual discrepancy loss adds a gradient penalty to the discriminator to ensure the quality of the generated images. This is because the gradient penalty drives the gradient at the penalized inputs toward the lowest point of the loss function curve while keeping the adjacent regions smooth, whereas other input images, such as I, are placed at the high points of the curve. As shown in Fig. 8, the IS is significantly improved, indicating that the gradient penalty gives the generator a more explicit convergence target, guiding it to generate images that are more realistic and semantically consistent with the ultrasonography reports. Further, our proposed SCGAN adds the visual discrepancy loss to learn discrepancy features. In principle, in our cross-modal domain adaptation task, the data of the two modalities differ in the visual layer but converge in the semantic layer. If the generator learned only the low-level visual features of the source domain, the prediction results mapped to the target domain would deviate from our expectations and be penalized by the adversarial loss. However, the results of our SCGAN converge significantly, indicating that G learns high-level semantic features. Thus, the discrepancy loss can in turn encourage G to deceive D under domain shift, requiring G to capture high-level, domain-invariant semantic features across the source and target domains. As shown in Table 3, the accuracy and specificity are significantly improved, and Table 5 shows that the IS is also boosted. The loss curves of the generator and discriminator in SCGAN are shown in Fig. 6. The experimental results support the validity of our techniques.
Fig. 8
Box and whisker plot analysis of the inception score (IS) for different image generation methods
Fig. 6
The loss curve of generator and discriminator in SCGAN
Specifically, we tune the parameters λ1 and λ2 in the loss function in Equation (15), and the results are shown in Table 4 and Fig. 7. The IS increases from 4.14 to 4.23 when λ2 is changed from 0 to 2. Meanwhile, the IS increases further to 4.26 when λ1 is changed from 0 to 0.2, verifying the effectiveness of combining these two techniques. The IS decreases when λ2 is changed from 2 to 4 or λ1 is changed from 0.2 to 0.5. A possible reason is that the penalty becomes too large, leading to the loss of some more important features. Therefore, in SCGAN, we set λ1 and λ2 to 0.2 and 2, respectively.
Table 4
Ablation study of loss function parameter adjustment results

| SCGAN (λ1, λ2) | (0, 0) | (0, 2) | (0, 4) | (0.2, 2) | (0.5, 2) |
| --- | --- | --- | --- | --- | --- |
| Inception Score | 4.14 ± 0.23 | 4.23 ± 0.15 | 3.97 ± 0.30 | 4.26 ± 0.18 | 4.21 ± 0.37 |
Fig. 7
Inception score (IS) analysis for different parameters of the loss function
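For reference, the inception score used in these comparisons is the exponential of the average KL divergence between the per-image class distribution p(y|x) and the marginal class distribution p(y). The sketch below is a minimal illustration of that formula only: it assumes the class probabilities are already available (in practice they come from a pre-trained Inception network), omits the usual averaging over several splits, and uses a hypothetical function name.

```python
import math


def inception_score(probs, eps=1e-12):
    """IS = exp( mean_x KL( p(y|x) || p(y) ) ).

    probs: list of per-image class-probability lists, each summing to 1.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(
            pj * (math.log(pj + eps) - math.log(marginal[j] + eps))
            for j, pj in enumerate(p)
            if pj > 0
        )
    return math.exp(kl_sum / n)


# Toy inputs: identical uncertain predictions versus confident, diverse ones.
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
```

The score grows when individual predictions are confident and the set of predictions is diverse, which is why it is read as a joint proxy for image quality and diversity.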
Architecture Analysis
Table 5 reports the IS of SCGAN and the other compared methods. Our model obtains the best score, significantly improving the IS from 2.58 to 4.26. Compared with DCGAN and ACGAN, we believe that the inclusion of domain information guides the generator during image generation, which gives the generator less freedom to generate images from random noise and reduces the uncertainty of the image generation process. In contrast, the multi-generator and multi-discriminator structures in StackGAN [48] and AttnGAN [42] make the quality of the images generated in the initial layer affect the final refinement, so their results are poorer. In conclusion, SCGAN can generate visually more realistic images with higher quality and better diversity than existing methods (Fig. 8).
Table 5
The inception score (IS) of our proposed SCGAN and its variants compared with different image generation methods
In Table 6, we compare our loss with the losses used in different methods. For example, SD-GAN [46] proposed a contrastive loss to improve the consistency between images generated from the same text description. Oord et al. [21] optimized the InfoNCE loss, a bound on the mutual information between two signals, and obtained useful representations. Wang et al. [39] used a triplet loss to make video patches from the same trajectory closer in the embedding space than random patches. However, the contrastive loss and the InfoNCE loss require the positive and negative matching pairs of each sample to be sampled separately, whereas our loss does not need to mine negative samples, which reduces the complexity of training. Adding the triplet loss to the baseline reduces the quality score of the generated images. This result suggests that the stronger disentanglement of the triplet loss may separate the connections between features too much and reduce the smoothness of interpolation.
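To make the comparison concrete, the sketch below gives generic textbook forms of the triplet loss and the InfoNCE loss. It does not reproduce the exact formulations of [21], [39], or [46], nor the loss proposed in this work; the margin, the temperature, the squared Euclidean distance, and the use of in-batch negatives are assumptions made for the example.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge on squared distances: max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def info_nce_loss(queries, keys, temperature=0.1):
    """InfoNCE with in-batch negatives: queries[i] matches keys[i] only."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                        # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives on the diagonal


rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 128))                      # hypothetical image embeddings
text_emb = image_emb + 0.1 * rng.normal(size=(8, 128))     # matching text embeddings
shuffled = text_emb[rng.permutation(8)]                    # stand-in negatives

print(triplet_loss(image_emb, text_emb, shuffled))
print(info_nce_loss(image_emb, text_emb))
```

Both losses need explicit negatives (a sampled negative per triplet, or every non-matching key in the batch), which is the sampling cost that the text contrasts with the proposed loss.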
Table 6
Comparison of the inception score (IS) of our loss and other losses
Table 7 gives the performance metrics of the SCGAN classification model when pre-trained with VGGNet [29], GoogLeNet [30], ResNet-50, ResNet-101, and ResNet-152, respectively. The results show that the highest accuracy is achieved using ResNet-50. Moreover, the trade-off between classification results and network optimization is crucial. Considering the dimensionality and parameter complexity of deeper networks such as ResNet-101 and ResNet-152, and the relatively stable performance obtained with the ResNet series, we choose ResNet-50. Therefore, we use the modified ResNet-50 model to train on our dataset and use it as the baseline classification method.
Table 7
Performance of our classification models of SCGAN when using pre-trained VGGNet, GoogLeNet, ResNet-50, ResNet-101, and ResNet-152, respectively

| CNNs | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC (%) |
| --- | --- | --- | --- | --- |
| VGGNet [29] | 78.83 | 79.10 | 78.57 | 81.81 |
| GoogLeNet [30] | 81.75 | 83.58 | 80.00 | 84.95 |
| ResNet-50 | 84.67 | 88.06 | 81.43 | 88.28 |
| ResNet-101 | 83.21 | 86.57 | 80.00 | 86.93 |
| ResNet-152 | 82.48 | 85.70 | 80.00 | 86.14 |
Figure 9 visualizes 24 images generated using DAGAN, CASAM (i.e., the variant of SCGAN), and SCGAN. On visual inspection, malignant nodules, compared with benign nodules, contain calcifications (abnormal white spots) and irregular edge contours. In terms of the quality of the generated images, DAGAN without domain adaptation synthesizes nodules with irregular shapes, a rough texture distribution, and a lack of rich details. In contrast, the details of the nodules generated with CASAM gradually become clearer. However, the marginal area of some nodules changes greatly, which may be related to the scarcity of semantic information about the margins. The images generated by our SCGAN model are visually convincing: the internal grayscale differences of the nodules are obvious, and the tissue texture is clear. Benign nodules have smooth borders, and malignant nodules are accompanied by clustered calcifications. This shows that the effective combination of CASAM and the two losses can help ensure the quality of the generated images, including the shape and texture distribution of the nodules.
Fig. 9
Visualization results of 24 images generated using DAGAN, CASAM (i.e., variant SCGAN), and SCGAN
Physician Evaluation
We collaborate with three senior physicians who treat thyroid diseases. The whole process is divided into two parts. First, the three physicians independently determine the authenticity of the images, the semantic consistency, and the benignity or malignancy of the nodules and give their respective diagnoses. Then, in the second part, the three physicians discuss the cases and give the final consultation results. The accuracy of each physician and the mean values are shown in Table 8. Overall, our proposed model performs better than the ultrasound physicians. In determining the authenticity of the ultrasound images, the highest individual score of the three physicians is 75.67%, and the consultation score is higher than the average of the three physicians. For the diagnosis of benign and malignant nodules, the consultation score is higher than each of the three independent scores, and the overall accuracy is higher than that of authenticity discrimination. We discussed the results further with the physicians and analyzed them in detail. Physicians judge nodules with apparent benign or malignant features more accurately, such as those with a regular shape, clear borders, or obvious calcification. In contrast, for nodules with an irregular shape, blurred or uneven borders, or hypoechogenicity, physicians need to observe the cases over time in conjunction with review results. The rate of misdiagnosis by physicians is higher when the symptoms resemble those of thyroiditis or multiple endocrine adenomas. Therefore, compared with individual judgment, physician consultation brings more diagnostic experience to the definitive classification result. The physicians also indicate that discriminating the authenticity of an image is more challenging than discriminating the benignity or malignancy of a nodule, which demonstrates the effectiveness of SCGAN's image generation capabilities.
While discriminating the authenticity of the images, the physicians also evaluate the semantic consistency of the images. The results show that the image features match their associated knowledge descriptions, demonstrating the strength of our model in acquiring high-level semantic features.
Table 8
Evaluation results of three physicians on the authenticity of the images and the benignity and malignancy of the nodules

| Physician | Real/Generated image: Accuracy (%) | Benign/Malignant nodule: Accuracy (%) | Benign/Malignant nodule: Sensitivity (%) | Benign/Malignant nodule: Specificity (%) |
| --- | --- | --- | --- | --- |
| Physician 1 | 68.00 | 89.67 | 90.00 | 89.33 |
| Physician 2 | 75.67 | 91.33 | 90.67 | 92.00 |
| Physician 3 | 70.33 | 90.67 | 84.67 | 96.67 |
| Average of physicians | 71.33 | 90.56 | 88.45 | 92.67 |
| Consultation score | 73.67 | 93.67 | 93.33 | 94.00 |
Comparison with State-of-the-Art Methods
Table 9 shows the performance comparison between the proposed SCGAN and nine state-of-the-art classification methods for thyroid nodules. The results show that the proposed model achieves competitive classification performance. Since most of the datasets used for training in these papers are private and the code is not open source, it is impossible to directly compare SCGAN with the other methods on the same datasets. Therefore, Table 9 lists the performance of these methods as recorded in the original published literature. Refs. [1, 23, 26] record the classification results of traditional methods. Refs. [36, 37, 50] record the classification results of deep learning methods under a single modality.
The remaining three methods are similar to our proposed SCGAN in that they all extract features from multimodal data. Among them, Yang et al. [43] and Qin et al. [24] both fuse features from conventional ultrasound images with features from elasticity images. The former used information from the different modalities to train DScGANS models to facilitate the diagnosis of benign and malignant thyroid nodules. The latter focused on comparing the effects of different fusion strategies and different classification network structures on classification performance. Compared to Qin et al., our method has higher sensitivity and similar accuracy, specificity, and AUC. However, all the above methods are constrained by the limited availability of annotated data. In contrast, Shi et al. [28] used standardized terminology to assist in the extraction of ultrasound image features in KACGAN to facilitate thyroid nodule image enhancement. This method is similar to our idea, but our method does better in cross-modal alignment and obtains higher metric values. As mentioned above, cross-domain fusion using multimodal data to improve classification performance has become a trend in thyroid nodule diagnosis.
Table 9
Performance comparison of the SCGAN model with nine other existing methods for thyroid nodule classification

| Methods | Modality | Sample | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Acharya et al. [1] | US (Texture features + C4.5) | 48 malignant, 223 benign | 94.30 | Not Given | Not Given | Not Given |
| Raghavendra et al. [26] | US (HOS + PSO + SVM) | 56 malignant, 288 benign | 97.71 | Not Given | Not Given | Not Given |
| Prochazka et al. [23] | US (histogram features + RF) | 20 malignant, 40 benign | 95.00 | 95.00 | 95.00 | 97.12 |
| Wang et al. [37] | US | 341 malignant, 705 benign | 87.32 ± 0.0007 | 84.22 ± 0.0023 | Not Given | 90.06 ± 0.0007 |
| Wang et al. [36] | US | 524 malignant, 470 benign | 88.25 | 90.00 | 86.50 | 92.86 |
| Zhou et al. [50] | US | 1311 malignant, 4291 benign | Not Given | 98.70 | 98.80 | 98.00 |
| DScGANS [43] | US+USE | 1489 malignant, 1601 benign | 90.5 ± 0.06 | 88.1 ± 0.08 | 92.6 ± 0.07 | 91.4 ± 0.04 |
| Qin et al. [24] | US+USE | 617 malignant, 539 benign | 94.7 ± 0.53 | 92.77 ± 1.04 | 97.96 ± 1.13 | 98.77 ± 1.05 |
| KACGAN [28] | US+Text | 905 malignant, 1032 benign | 91.46 ± 0.46 | 90.63 ± 0.38 | 92.65 ± 0.16 | 95.32 |
| Proposed SCGAN | US+Text | 905 malignant, 1032 benign | 94.30 ± 0.48 | 93.43 ± 0.35 | 95.14 ± 0.42 | 97.02 ± 0.57 |

a Values are expressed as mean ± standard deviation
b Std is not provided in some sources
c US means ultrasound image, USE means ultrasound elasticity image
Conclusion
In this paper, we propose a new deeply fused semantic consistency generative adversarial network (SCGAN) to diagnose benign and malignant nodules in thyroid ultrasound images. The method organically combines image features with textual information. The domain adaptation of these two cross-modal data sources is accomplished jointly through the self-attention mechanism and metric learning, using their semantic consistency to reduce domain shifts during training. Two new techniques added to guide the hinge loss in adversarial learning promote the convergence of the network and improve the quality of the generated images. The experimental results demonstrate the effectiveness of our SCGAN in improving the performance of target domain classification networks and indicate its potential for clinical application.
In future work, we will develop a training model that can be applied to more types of ultrasound images and domain knowledge. For example, the inclusion of richer knowledge, such as blood flow signals or ultrasound elasticity images, may improve diagnostic accuracy. In addition, the embedding of our domain knowledge relies on pre-trained text encoders, which, because medical datasets differ from natural datasets, require parameter tuning. In the next step, we will add an attention mechanism to the text encoder to pursue state-of-the-art performance.