Convolutional Neural Network-Based Computer-Assisted Diagnosis of Hashimoto's Thyroiditis on Ultrasound.

Wanjun Zhao¹, Qingbo Kang², Feiyan Qian³, Kang Li², Jingqiang Zhu¹, Buyun Ma⁴.

Abstract

PURPOSE: This study investigates the performance of deep learning models for the automated, computer-assisted diagnosis (CAD) of Hashimoto's thyroiditis (HT) using real-world data from clinical ultrasound examinations.
METHODS: We retrospectively collected ultrasound images from patients with and without HT from 2 hospitals in China between September 2008 and February 2018. Images were divided into a training set (80%) and a validation set (20%). We ensembled 9 convolutional neural networks (CNNs) as the final model (HT-CAD) for HT classification. The model's diagnostic performance was validated and compared between the validation sets of the 2 hospitals. We also compared the accuracy of HT-CAD against that of senior and junior radiologists. HT-CAD performance was further evaluated in subgroups defined by thyroid hormone level (hyperthyroidism, hypothyroidism, and euthyroidism).
RESULTS: 39 280 ultrasound images from 21 118 patients were included in this study. The accuracy, sensitivity, and specificity of the HT-CAD model were 0.892, 0.890, and 0.895, respectively. HT-CAD performance was not significantly different between the 2 hospitals. The HT-CAD model achieved higher performance than senior radiologists (P < 0.001), with a nearly 9% accuracy improvement. HT-CAD had similar accuracy (range 0.871-0.894) for the 3 subgroups based on thyroid hormone level.
CONCLUSION: The HT-CAD strategy based on CNN significantly improved the radiologists' diagnostic accuracy of HT. Our model demonstrates good performance and robustness in different hospitals and for different thyroid hormone levels.

Keywords: Hashimoto’s thyroiditis; artificial intelligence; convolutional neural networks; radiologists; ultrasound

Year: 2022 PMID: 34907442 PMCID: PMC8947219 DOI: 10.1210/clinem/dgab870

Source DB: PubMed Journal: J Clin Endocrinol Metab ISSN: 0021-972X Impact factor: 5.958

Hashimoto’s thyroiditis (HT) is a typical organ-specific autoimmune disease and the most common form of chronic lymphocytic thyroiditis (1). HT is characterized by autoimmune-mediated destruction of the thyroid gland, involving apoptosis of thyroid epithelial cells, diffuse lymphocytic infiltration of the thyroid by predominantly thyroid-specific B and T cells, and follicular destruction. These changes typically result in painless enlargement of the thyroid gland, fibroblastic proliferation, calcification, and vascular proliferation, and HT is the main cause of primary hypothyroidism in the United States (1). With the increasing attention of the general population to thyroid diseases, the prevalence of HT has been increasing in the past few decades (2). The incidence of HT is >0.3 to 1.5 cases per 1000 per year (3), whereas at autopsy, 40% to 45% of women and 20% of men are diagnosed with HT in the United Kingdom and the United States (4). In China, more than 40 million people have primary hypothyroidism, which is caused by HT in 80% of them (5). Hence, early diagnosis of HT is important for monitoring the disease course more efficiently, tailoring treatment protocols, and delaying thyroid failure. Conventionally, HT diagnosis is confirmed by demonstrating the presence of autoantibodies to thyroglobulin (TgAb) and thyroid peroxidase (TPOAb) (6). However, the serological presentation can vary significantly, and a considerable proportion of patients (10%-15%) have low or even zero autoantibody levels (6). Moreover, the invasive nature of fine-needle aspiration (FNA) biopsy diminishes its applicability and appropriateness in the clinical diagnosis of benign disorders (7). Thus, ultrasound (US; or ultrasonography), an essential noninvasive imaging modality, can help physicians reach a clinical diagnosis efficiently.
However, compared with other thyroid disorders, the US characteristics of HT are more difficult to distinguish because different HT pathologies can exhibit different US features (8). This variability may be related to hypoechogenicity, which arises when inflammatory cells infiltrate the thyroid. Moreover, pseudo-nodules and inhomogeneous parenchyma have also been observed in HT (9). The US features of nodular HT can vary significantly and can sometimes be associated with other benign and malignant thyroid nodules (10). These findings imply that the sonographic differences between normal and HT images are subtle and that powerful features are needed to distinguish them. Hence, it is crucial to improve the accuracy of US diagnosis so that this imaging modality can serve as the primary screening method for HT. Conventional image recognition techniques, such as analysis of grayscale histograms and computerized grayscale US, are limited by the fact that echogenicity varies with the US settings and with the stage of the disease (10). Recently, computer-assisted diagnosis (CAD) with artificial intelligence (AI) has been widely developed for automated, efficient US image analysis, using a standardized image acquisition procedure to train and develop the CAD deep learning model (11). The deep convolutional neural network (CNN), a deep learning technique, has proved effective in the assessment of medical images (12). A CNN applies multiple layers of image analysis filters: each layer generates feature maps by systematically convolving multiple filters across its input, and these feature maps then serve as the input to the subsequent layer. The network takes the image pixels as its input and produces the desired classification as its output.
However, most existing CAD systems for US have been used to differentiate benign from malignant thyroid nodules rather than to diagnose diffuse thyroid diseases (13). Thus, in this study, we aimed to evaluate the capability of deep learning models with CNNs to provide an automated diagnosis of HT using real-world US data from clinical thyroid US examinations. Four CNN architectures in 9 versions were compared and ensembled. Eventually, we selected the CAD of HT (HT-CAD) as the most accurate model and evaluated its performance in the diagnosis of different HT types.
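To make the notion of a feature map concrete, the following is a minimal sketch of a single convolution filter sliding across an image, written in plain Python. The toy image and the hand-written edge filter are hypothetical illustrations; in a trained CNN the filter values are learned from data, and this sketch is not taken from the study's code.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Like most deep learning libraries, this computes a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Toy 4 x 4 "image": dark left half, bright right half.
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical filter that responds to a dark-to-bright vertical edge.
edge_filter = [
    [-1, 1],
    [-1, 1],
]
edge_map = relu(conv2d_valid(toy_image, edge_filter))
```

The resulting 3 × 3 feature map is nonzero only in the middle column, where the filter overlaps the edge. A subsequent layer would take such maps as its input.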

Methods

Patients and Ultrasound Images

This retrospective multicohort study was approved by the institutional review board of the West China Hospital, Sichuan University, Sichuan, China (no. 20210217), and the requirement to obtain informed consent was waived. US images of HT and non-HT individuals were collected from 2 hospitals in China from September 2008 to February 2018. The eligibility criteria for US images were (1) conventional US examination performed before biopsy and surgical treatment, (2) presence or absence of HT established by biopsy or postsurgical pathology, and (3) age ≥18 years. The exclusion criteria were (1) images without thyroid tissue and (2) images in which thyroid nodules accounted for more than 50% of the thyroid tissue. All US images were acquired using either an IU22 (Philips, Eindhoven, The Netherlands) or a DC-80s (Mindray, Shenzhen, China) system in its default thyroid examination mode. Both systems were equipped with 5 to 13 MHz linear probes. All patients were examined in supine position with their backs extended, thus providing good exposure of their lower thyroid margins. Both thyroid lobes and the isthmus were scanned in longitudinal and transverse planes, which were acquired according to the American College of Radiology accreditation standards (14). Two senior thyroid radiologists with ≥10 years of clinical experience performed all examinations. HT or non-HT diagnosis was based on the histopathological findings of FNA biopsy or thyroid surgery. According to Mizukami et al (15), only cases associated with lymphoplasmacytic infiltration with germinative center formation, oxyphilic cell metaplasia (Hürthle), atrophy, and fibrosis of thyroid follicles were classified as HT. The following variables were also considered: results of thyroid function tests (thyroid-stimulating hormone, free triiodothyronine, and free thyroxine) and the levels of anti-TgAb (antibody ID:AB_1875964) and TPOAb (antibody ID:AB_10698496).
The reference TgAb and TPOAb ranges were <115 IU/mL and <34 IU/mL, respectively.

Data Preprocessing

All thyroid US images extracted from the thyroid imaging database at the 2 hospital sites were in JPEG format. To maintain a high quality of US images, all thyroid images were screened, and low-quality images containing severe artifacts or significant image resolution reductions were removed. All images were screened by 2 radiologists with ≥5 years of experience in US imaging. We divided all thyroid images into a training set (80%) and a validation set (20%). Since the number of training thyroid US images in our data set was limited, extensive data augmentation operations, including rotation, flipping, scaling, random brightness transform, random contrast, random gamma transform, and perspective transform, were performed during neural network training. Data augmentation was used to increase both the size and the diversity of the training data set. Gaussian filters (16) were used to transform the input image in the data augmentation step. After data augmentation, all images were resized to 512 × 512 pixels, and a mean normalization was performed as follows:

$$X' = \frac{X - \mu}{\sigma}$$

where X represents the original US image, μ represents the mean pixel value, and σ denotes the standard deviation among all training images. Consequently, X' reflected the normalized images used for network training.
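The normalization step can be sketched as follows, assuming that μ and σ are computed once over every pixel of the training images and then applied to each image. The function names and the two 2 × 2 toy images are hypothetical; the study's actual preprocessing code is not given in the source.

```python
import math


def dataset_mean_std(images):
    """Mean and population standard deviation over every pixel of every image."""
    pixels = [p for img in images for row in img for p in row]
    mu = sum(pixels) / len(pixels)
    variance = sum((p - mu) ** 2 for p in pixels) / len(pixels)
    return mu, math.sqrt(variance)


def normalize(image, mu, sigma):
    """Apply (X - mu) / sigma to each pixel of one image."""
    return [[(p - mu) / sigma for p in row] for row in image]


# Two toy grayscale "training images" (real inputs would be 512 x 512).
train_images = [
    [[0, 50], [100, 150]],
    [[200, 250], [50, 0]],
]
mu, sigma = dataset_mean_std(train_images)
normalized_images = [normalize(img, mu, sigma) for img in train_images]
```

After this transform the pooled training pixels have zero mean and unit standard deviation. Validation images would be normalized with the same training-set μ and σ.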

Network Architecture

In this study, CNNs (17) were used to train the deep learning algorithm, in which image input features were mapped to the hidden layers, comprising multiple convolutional, pooling, and fully connected layers. This algorithm learns hierarchical representations from the input imaging data, and a trained model makes predictions on the input data. The filters in the convolutional layers of our CNN models (18) were learned automatically and directly from the image data. We evaluated various representative and commonly used CNN architectures for HT classification on the basis of US images, including the Visual Geometry Group (VGG) network (19), residual network (20), dense network (21), and efficient network (22). All these networks are milestones in the development of CNN architectures and serve as dominant baselines in many image classification tasks, including medical imaging. Layers are functional units of neural networks, in which abstract features of the input images are learned and subsequently stored. We used the VGG model with 19 layers (VGG19); residual network models with 18, 50, and 152 layers; dense network models with 169 and 264 layers; and efficient network model versions b0, b4, and b7. Additionally, to further improve classification performance and generalization ability, we used model ensemble and test time augmentation (TTA) techniques (23). More specifically, majority voting was used for the model ensemble. TTA means that, during model inference, the horizontal and vertical flips of the original image are fed into the trained models in addition to the original image, and the average of the results is taken as the final result. We used cross entropy as the loss function:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where N denotes the total number of training images; y_i represents the ground truth label of the ith image [ie, 1 for positive class (HT) and 0 for negative class (non-HT)]; and p_i stands for the probability that the ith image is positive as predicted by the model.
All models were implemented in the PyTorch framework (24); we also utilized weights pretrained on ImageNet to accelerate model convergence. The Adam optimizer with an initial learning rate of 0.0003 was used to train all networks. The learning rate was halved when the loss on the validation set had not declined for 20 epochs. The training batch size for all models was 16. For each network, we selected the model with the lowest validation loss for performance comparison. Thyroid images from the validation set were provided to 3 junior US radiologists (1-3 years of experience) and 3 senior radiologists (>10 years of experience) who were blinded to the classification and did not review any other images from the same patients acquired during the original US examination. Their diagnostic performances were then compared with those of the best CNN models.
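The TTA and majority-voting steps described above can be sketched in plain Python. This is an illustration under assumptions rather than the authors' implementation: each "model" is a hypothetical callable that returns an HT probability for an image, and the 0.5 decision threshold is assumed because the source does not state one.

```python
def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vflip(image):
    """Mirror the image top to bottom."""
    return image[::-1]


def predict_with_tta(model, image):
    """Average the HT probability over the original image and its two flips."""
    views = [image, hflip(image), vflip(image)]
    probs = [model(view) for view in views]
    return sum(probs) / len(probs)


def majority_vote(models, image, threshold=0.5):
    """Return 1 (HT) if more than half of the models vote HT, otherwise 0 (non-HT)."""
    votes = [1 if predict_with_tta(m, image) >= threshold else 0 for m in models]
    return 1 if sum(votes) * 2 > len(votes) else 0


def make_toy_model(weight):
    """Hypothetical stand-in for a trained CNN: scores an image by its top-left pixel."""
    def model(image):
        return min(1.0, max(0.0, weight * image[0][0]))
    return model


toy_image = [[0.9, 0.1], [0.2, 0.4]]
toy_models = [make_toy_model(w) for w in (1, 2, 3)]
ensemble_label = majority_vote(toy_models, toy_image)
```

With an odd number of members, such as the 9 networks used in the study, a simple majority can never tie.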
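The learning-rate rule in this paragraph can be written as a small state machine. The sketch below is a plain-Python illustration; the class name and its exact counting convention are assumptions. PyTorch offers `torch.optim.lr_scheduler.ReduceLROnPlateau` for this kind of schedule, but the source does not say whether the authors used it.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without a new best validation loss."""

    def __init__(self, initial_lr=0.0003, factor=0.5, patience=20):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best_loss:
            # New best loss: remember it and restart the stale-epoch counter.
            self.best_loss = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


scheduler = HalveOnPlateau()
scheduler.step(1.0)            # first epoch sets the best loss
for _ in range(20):            # 20 further epochs with no improvement
    current_lr = scheduler.step(1.0)
```

In this run the rate drops from 0.0003 to 0.00015 on the twentieth epoch without improvement.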

Performance Evaluation and Statistical Analysis

These models were developed using Python 3.4.3. We compared the accuracy of the CNN models. Analysis of receiver operating characteristic curves was performed to calculate the optimal area under the curve (AUC) for HT and normal thyroid tissues. Differences among various AUCs were compared using the DeLong test (25). Sensitivity, specificity, positive and negative predictive values (PPV and NPV, respectively), and the F1 value were also calculated (26). To evaluate the classification agreement between HTs and non-HTs, the Fleiss’s κ value (27) was calculated for each set. All statistical analyses were performed using the SPSS software for Windows, version 20.0 (SPSS, Chicago, IL, USA) and R Language (version 3.5.2). All P-values were 2-tailed, and a P-value of <0.05 was considered statistically significant.
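As a worked illustration of how these metrics relate to one another, the sketch below derives them from a 2 × 2 confusion matrix. The counts are not reported in the source: they are back-calculated, as an assumption, from the validation-set class sizes (4133 HT and 3723 non-HT images) and the reported HT-CAD sensitivity (0.890) and specificity (0.895). The agreement statistic computed here is Cohen's κ, used as a stand-in for the Fleiss's κ named in the text.

```python
def binary_metrics(tp, fp, tn, fn):
    """Diagnostic metrics from true/false positive and negative counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    # F1 for each class, then their unweighted average.
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fp + fn)
    f1_avg = (f1_pos + f1_neg) / 2
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1_avg": f1_avg,
        "kappa": kappa,
    }


# Approximate counts reconstructed from the reported rates (an assumption).
ht_cad = binary_metrics(tp=3678, fp=391, tn=3332, fn=455)
for name, value in ht_cad.items():
    print(f"{name}: {value:.3f}")
```

Under these reconstructed counts, the derived accuracy, PPV, NPV, averaged F1, and κ agree with the values reported for the ensemble model with TTA to within rounding.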

Results

Baseline Characteristics

Between September 2008 and February 2018, 56 720 potential US images were retrospectively collected in this study. Among them, 17 440 images were excluded on the basis of our inclusion and exclusion criteria. Consequently, a total of 39 280 US images of 21 118 patients were finally included in this study. We randomly assigned 31 424 US images from 14 889 patients to the training set (16 533 images with HT and 14 890 images without HT) and 7856 images from 6229 patients to the validation set (4133 images with HT and 3723 images without HT). Table 1 shows the baseline characteristics of the training and validation sets. The clinical characteristics of the patients did not differ significantly between the 2 sets.
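The source does not describe how the random split was carried out. The patient counts of the two sets (14 889 and 6229) add up to the 21 118 patients included, which is consistent with each patient contributing images to only one set. A minimal sketch of such a patient-level split follows; the record layout, function name, and seed are hypothetical. Note that it fixes the fraction of patients, not of images, so it would not by itself reproduce the exact 80/20 image ratio reported.

```python
import random


def split_by_patient(image_records, val_fraction=0.2, seed=0):
    """Split (patient_id, image_id) records so that no patient appears in both sets."""
    patient_ids = sorted({pid for pid, _ in image_records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_val = round(len(patient_ids) * val_fraction)
    val_patients = set(patient_ids[:n_val])
    train = [rec for rec in image_records if rec[0] not in val_patients]
    val = [rec for rec in image_records if rec[0] in val_patients]
    return train, val


# Toy data: 10 patients with 4 images each.
toy_records = [(f"p{i % 10}", f"img{i}") for i in range(40)]
train_set, val_set = split_by_patient(toy_records)
```

Keeping all images of a patient on one side of the split prevents near-duplicate views of the same thyroid from inflating validation performance.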

Table 1.

Baseline characteristics of patients with HT or non-HT in training set and validation set

	HT				Non-HT
	All	Training Set	Validation set		All	Training Set	Validation set
Items	(n = 10 739)	(n = 7463)	(n = 3276)	P-value	(n = 10 379)	(n = 7426)	(n = 2953)	P-value
Age	36.92±14.81	37.00±14.85	36.68±14.80	0.949	44.73±14.85	44.61±14.78	45.48±14.84	0.880
Gender				0.406				0.437
Male	2684	1835	849		2594	1840	754
Female	8055	5578	2477		7785	5586	2199
TSH	4.87±17.39	4.97±17.63	4.79±13.85	0.720	5.35±51.56	5.44±52.72	4.82±45.02	0.694
FT3	15.40±88.00	14.96±79.95	16.76±104.15	0.464	8.75±39.20	8.66±39.15	9.27±44.67	0.640
FT4	34.79±164.36	34.19±158.36	36.44±175.98	0.666	20.12±44.70	20.15±54.00	19.94±31.42	0.904
TgAb	845.75±1099.25	843.45±1092.08	861.68±1189.79	0.646	32.87±129.61	32.21±126.00	33.03±137.08	0.870
TPOAb	373.53±859.29	371.35±859.17	380.10±877.72	0.736	13.19±10.03	13.21±10.00	13.05±9.49	0.579
Hyperthyroidism	2093 (19.49)	1476 (20.05)	617 (18.83)	0.267	2004 (19.31)	1424 (19.18)	580 (19.64)	0.607
Hypothyroidism	2863 (26.66)	1969 (26.38)	894 (27.29)	0.340	2736 (26.36)	1975 (26.60)	761 (25.77)	0.403

Qualitative variables are in n (%), and quantitative variables are in mean ± SD.

Abbreviations: FT3, free triiodothyronine; FT4, free thyroxine; TgAb, thyroglobulin antibodies; TPOAb, thyroid peroxidase antibodies; TSH, thyroid-stimulating hormone.

Diagnostic Accuracy of the CNN Models

Figure 1 shows the flowchart with all the related processes performed in this study. The 9 basic CNN models and the models with TTA achieved high performance in terms of identifying HTs in the validation set (shown in Table 2). Finally, the ensemble model with the TTA, which we called the HT-CAD model, demonstrated the highest diagnostic accuracy when compared with the other basic CNN models or models with TTA. The accuracy, sensitivity, specificity, PPV, NPV, AUC, F1, and κ value of the ensemble model with TTA were 0.892, 0.890, 0.895, 0.904, 0.880, 0.940, 0.892, and 0.784, respectively.

Figure 1.

Table 2.

Diagnostic performance of the final ensembled model and the 9 basic version convolutional neural network models with test time augmentation

Model	Accuracy	Sensitivity	Specificity	PPV	NPV	AUC	F1 (avg)	κ value
VGG19	0.842	0.835	0.850	0.860	0.823	0.900	0.842	0.684
VGG19 (TTA)	0.851	0.845	0.858	0.868	0.833	0.918	0.851	0.702
ResNet18	0.850	0.846	0.856	0.867	0.833	0.917	0.850	0.700
ResNet18 (TTA)	0.865	0.867	0.862	0.875	0.854	0.928	0.864	0.729
ResNet50	0.860	0.855	0.865	0.875	0.843	0.922	0.859	0.719
ResNet50 (TTA)	0.870	0.867	0.874	0.884	0.856	0.931	0.870	0.740
ResNet152	0.864	0.859	0.868	0.879	0.848	0.926	0.863	0.727
ResNet152 (TTA)	0.874	0.871	0.878	0.888	0.859	0.932	0.874	0.748
DenseNet169	0.860	0.856	0.865	0.875	0.844	0.923	0.860	0.720
DenseNet169 (TTA)	0.871	0.863	0.879	0.888	0.852	0.931	0.870	0.741
DenseNet264	0.866	0.860	0.874	0.883	0.849	0.930	0.866	0.732
DenseNet264 (TTA)	0.876	0.867	0.886	0.894	0.857	0.932	0.876	0.752
EfficientNet-b0	0.864	0.874	0.853	0.868	0.859	0.924	0.864	0.727
EfficientNet-b0 (TTA)	0.874	0.879	0.869	0.882	0.867	0.933	0.874	0.748
EfficientNet-b4	0.870	0.879	0.860	0.875	0.865	0.930	0.870	0.739
EfficientNet-b4 (TTA)	0.878	0.882	0.874	0.886	0.870	0.935	0.878	0.756
EfficientNet-b7	0.874	0.880	0.868	0.881	0.867	0.933	0.874	0.748
EfficientNet-b7 (TTA)	0.881	0.885	0.877	0.889	0.873	0.937	0.881	0.762
Ensemble model	0.889	0.887	0.892	0.901	0.877	0.938	0.889	0.778
Ensemble (TTA) model	0.892	0.890	0.895	0.904	0.880	0.940	0.892	0.784

Abbreviations: DenseNet, Dense Network; EfficientNet, Efficient Network; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve; κ value, the Fleiss’s κ value; ResNet, Residual Network; TTA, test time augmentation; VGG, Visual Geometry Group Network.

Flowchart of the procedures in the development of deep learning models for Hashimoto’s thyroiditis (HT) diagnosis on ultrasound. Using data sets from 2 hospitals, the deep learning model with convolutional neural networks was trained to differentiate HT. Abbreviations: FN, false negative; FP, false positive; NPV, negative predictive value; PPV, positive predictive value; TN, true negative; TP, true positive.

Performance of the HT-CAD in Different Hospitals

As HT-CAD showed the best performance over other models, we further investigated whether its diagnostic accuracy was influenced by different hospital settings (Fig. 2, Table 3). In the validation set of 2 hospitals, the HT-CAD offered almost the same levels of accuracy (0.901 vs 0.887) for the 2 subgroups. Thus, our method achieved similar levels of performance in 2 different clinical settings. Comparisons pertaining to sensitivity, specificity, PPV, NPV, and AUC also confirmed that HT-CAD application to US images acquired from different US equipment and evaluated by different technicians did not demonstrate any statistically significant differences between hospitals (all Ps > 0.05).

Figure 2.

Receiver operating characteristic (ROC) curves of the HT-CAD model in different hospitals. The orange line shows the performance of the HT-CAD model on all validation images, including images from Hospital A and Hospital B; the area under the curve (AUC) is 0.940. The green line indicates the performance of the HT-CAD model on images from Hospital A, and the AUC is 0.949. The purple line indicates the performance of the HT-CAD model on images from Hospital B, and the AUC is 0.936. There is no statistical difference (P > 0.05).

Table 3.

Comparison of the performance of HT-CAD between the 2 hospitals

	Accuracy (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	PPV	NPV	AUC	F1 (avg)	κ value
Performance
All	0.892 (0.881-0.902)	0.890 (0.868-0.911)	0.895 (0.874-0.913)	0.904	0.880	0.940	0.892	0.784
Hospital A	0.901 (0.890-0.911)	0.898 (0.878-0.916)	0.902 (0.884-0.919)	0.892	0.892	0.949	0.886	0.798
Hospital B	0.887 (0.876-0.898)	0.884 (0.866-0.903)	0.891 (0.875-0.909)	0.911	0.873	0.936	0.896	0.780
P-value
All vs Hospital A	0.127	0.135	0.188	—	—	—	—	—
All vs Hospital B	0.314	0.265	0.377	—	—	—	—	—
Hospital A vs Hospital B	0.071	0.069	0.104	—	—	—	—	—

Abbreviations: AUC, area under the curve; κ value, Fleiss’s κ value; NPV, negative predictive value; PPV, positive predictive value.

Comparison Between the HT-CAD Model and Radiologists

Three senior and 3 junior US radiologists who were blinded to cytology data performed differential diagnoses using US images from the validation set. Table 4 shows the radiologists’ performances. The accuracy, sensitivity, specificity, PPV, NPV, AUC, F1, and κ value of the senior radiologists were 0.801, 0.805, 0.797, 0.805, 0.797, 0.801, 0.805, and 0.602, respectively. By contrast, the accuracy, sensitivity, specificity, PPV, NPV, AUC, F1, and κ value of the junior radiologists were 0.653, 0.660, 0.647, 0.662, 0.646, 0.654, 0.661, and 0.308, respectively. Senior radiologists thus outperformed their junior colleagues, with a significant accuracy improvement of nearly 15 percentage points in the validation set (P < 0.001). However, when these skilled senior radiologists were compared with the HT-CAD model, our model achieved higher performance in identifying HT patients (P < 0.001), with an accuracy improvement of approximately 9 percentage points (0.892 vs 0.801).

Table 4.

The comparison of diagnostic performance between HT-CAD and senior or junior radiologists

	Accuracy (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	PPV	NPV	AUC	F1 (avg)	κ value
Performance
CNN model	0.892 (0.881-0.902)	0.890 (0.868-0.911)	0.895 (0.874-0.913)	0.904	0.880	0.940	0.892	0.784
Radiologists
Senior	0.801 (0.784-0.818)	0.805 (0.786-0.822)	0.797 (0.778-0.814)	0.805	0.797	0.801	0.805	0.602
Junior	0.654 (0.639-0.667)	0.660 (0.644-0.676)	0.647 (0.626-0.667)	0.662	0.646	0.654	0.661	0.308
P-value
Senior vs CNN model	<0.001	<0.001	<0.001
Junior vs CNN model	<0.001	<0.001	<0.001
Senior vs junior	<0.001	<0.001	<0.001

Abbreviations: PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve; κ value, Fleiss’s κ value; CNN, convolutional neural networks.

Performance of the Model on Different Thyroid Hormone Levels

Considering that differences in thyroid hormone levels may affect the acquired US images, we analyzed subgroups based on thyroid hormone level (Table 5, Fig. 3). HT-CAD accuracy in the hyperthyroidism, hypothyroidism, and euthyroidism subgroups was 0.871, 0.888, and 0.894, respectively, whereas HT-CAD sensitivity in these subgroups was 0.911, 0.883, and 0.896, respectively. Finally, HT-CAD specificity in the hyperthyroidism, hypothyroidism, and euthyroidism subgroups was 0.674, 0.874, and 0.908, respectively.

Table 5.

Comparison of HT-CAD performance in different subgroups by thyroid hormone level

	Accuracy (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	PPV	NPV	AUC	F1 (avg)	κ value
Performance
All	0.892 (0.881-0.902)	0.890 (0.868-0.911)	0.895 (0.874-0.913)	0.904	0.880	0.940	0.892	0.784
Group A (with hyperthyroidism)	0.871 (0.861-0.880)	0.911 (0.893-0.929)	0.674 (0.656-0.692)	0.922	0.660	0.861	0.920	0.586
Group B (with hypothyroidism)	0.888 (0.877-0.897)	0.883 (0.861-0.905)	0.874 (0.854-0.891)	0.950	0.754	0.931	0.920	0.731
Group C (with euthyroidism)	0.894 (0.884-0.902)	0.896 (0.874-0.915)	0.908 (0.889-0.925)	0.879	0.908	0.947	0.887	0.787
P-value
All vs Group A	<0.001	<0.001	<0.001				—	—
All vs Group B	0.384	0.219	0.003				—	—
All vs Group C	0.625	0.247	0.084				—	—
Group A vs Group C	<0.001	<0.001	<0.001				—	—
Group B vs Group C	0.289	0.084	<0.001				—	—
Group A vs Group B	0.005	<0.001	<0.001				—	—

Abbreviations: AUC, area under the curve; κ value, Fleiss’s κ value; NPV, negative predictive value; PPV, positive predictive value.

Figure 3.

Receiver operating characteristic (ROC) curves of the HT-CAD model on different thyroid hormone levels. (A) The ROC curve of the HT-CAD model in the hyperthyroidism subgroup. (B) The ROC curve of the HT-CAD model in the hypothyroidism subgroup. (C) The ROC curve of the HT-CAD model in the euthyroidism subgroup. The red dots indicate the diagnostic sensitivities and specificities of senior radiologists. The green dots indicate the diagnostic sensitivities and specificities of junior radiologists. Compared to the senior and junior radiologists, the HT-CAD model showed the better diagnostic performance in the hyperthyroidism, hypothyroidism, and euthyroidism subgroups.

HT-CAD had similar accuracy (0.871-0.894) and sensitivity (0.883-0.911) in all 3 thyroid hormone level subgroups. By contrast, specificity varied significantly among the 3 subgroups (0.674-0.908). Furthermore, compared with the hypothyroidism and euthyroidism subgroups, the hyperthyroidism subgroup demonstrated the lowest accuracy and specificity and the highest sensitivity, with statistically significant differences (all Ps < 0.05). Among the 3 subgroups, the hypothyroidism subgroup had the lowest sensitivity and the euthyroidism subgroup had the highest accuracy and specificity.

HT-CAD Model Visualization

The regions that were automatically extracted and learned by the HT-CAD model were mapped and visualized in pseudocolor on the corresponding pixels (Fig. 4). The resulting heat map was strongly associated with the decisions made by the HT-CAD model. The heat map can not only distinguish between HT and non-HT US images but also identify the area of the thyroid tissue with the most typical characteristics of HT in the US image, thus distinguishing this area from normal thyroid tissue. Our results show that the edge fitting of the HT-CAD model heat map was approximately consistent with clinical judgment.
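The source does not name the visualization technique. One common way to produce such heat maps is class activation mapping (CAM), in which the final convolutional feature maps are combined using the weights of the predicted class. The following is a minimal sketch of that idea, assumed here for illustration only; the feature maps and weights are toy values.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps, clipped at zero and rescaled to [0, 1]."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [
        [
            sum(w * fm[i][j] for w, fm in zip(class_weights, feature_maps))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Keep only evidence in favor of the class.
    cam = [[max(0.0, v) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak == 0:
        return cam
    return [[v / peak for v in row] for row in cam]


# Two toy 2 x 2 feature maps and hypothetical class weights.
toy_feature_maps = [
    [[1, 0], [0, 0]],
    [[0, 0], [0, 2]],
]
toy_weights = [0.5, 1.0]
heat_map = class_activation_map(toy_feature_maps, toy_weights)
```

In practice the low-resolution map would be upsampled to the image size and overlaid in pseudocolor on the US image.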

Figure 4.

Visualization of HT-CAD model of Hashimoto’s thyroiditis (HT). (A and C) The original ultrasonic images of HT patients. (B and D) Heat map of HT-CAD model based on 2 HT ultrasonic images.

Discussion

HT is now considered the most common autoimmune disease and endocrine disorder in the developed countries and the main cause of hypothyroidism (28). Although histological findings of diffuse lymphocytic infiltration with numerous lymphoid follicles and germinal centers remain the gold standard for HT diagnosis (29), FNA is rarely performed separately, and it is practically never applied for HT diagnostic purposes. Presently, HT diagnosis is commonly established by the identification of a combination of clinical features, such as positive TPOAb and TgAb (30), which is not completely reliable. US, as the main imaging examination related to thyroid diseases, can be a very promising modality in the primary screening of HT(8). However, HT is more difficult for radiologists to recognize compared with nodules. Additionally, HT is commonly combined with thyroid nodules, such as nodule goiter, adenoma, and cancer (31), whereas HT with thyroid nodules can induce significant interference in the US diagnosis of thyroid cancer (32). Thus, improving the accuracy of ultrasonic identification of HT can play a prominent role, which may not be limited to the early diagnosis of the disease only, and can also aid in the identification of thyroid nodules. To date, only a limited number of studies have implemented CAD techniques for HT detection. Three previous studies (33-35) used an image-processing algorithm to segment into the ultrasonic regions of HT by homogeneous or inhomogeneous texture information; however, without using the deep learning method. The accuracy of these methods in the diagnosis of HT was between 80% and 84.6%, which is far from being clinically satisfactory. Furthermore, Ma et al (36) used the CNN algorithm in the diagnosis of HT from single-photon emission computerized tomography images but not from US images. To our knowledge, this study is the first to evaluate the deep learning algorithm as an aid for HT ultrasonic diagnosis. 
We focused on developing a deep learning AI-assisted strategy for clinical diagnosis regarding HT. The HT-CAD model not only improved the ultrasonic diagnostic accuracy of HT, reaching 89.2%, but also managed to identify the region of HT in the thyroid tissue, which can provide efficient and rapid help to radiologists in the diagnostic processes. The ultrasonic characteristics of HT were heterogeneous and vague, thus leading to difficulties in the accurate recognition and consistent interpretation of HT by radiologists. By contrast, the deep learning method offers significant advantages in terms of overcoming heterogeneity issues using an automated learning procedure. The diagnostic reproducibility of the AI model was due to the consistency offered by the deep learning technique. We strictly screened ultrasonic images retrospectively with pathological results as the training and validation data sets, and we eliminated images with only clinical diagnosis, which provided a high quality image data basis for the deep learning model. Additionally, it is worth noting that our HT-CAD adopts ATT and ensembles technologies, which, in turn, increases the richness and diversity of the training data set, integrating the advantages, and finding more learned characteristics of the combined CNN models. Our HT-CAD model was validated in 2 hospitals with no statistical difference between them, which confirmed the stability and robustness of HT-CAD. However, more hospital centers are needed for a more extensive validation of this model. Furthermore, the accuracy of HT-CAD in this study was significantly higher than those of both junior and senior radiologists. This finding suggests that the use of HT-CAD by radiologists in the evaluation of HT can greatly improve the accuracy of ultrasonic diagnosis, thus facilitating early disease screening, detection, and interventions. 
Owing to the high computing speed of a computer, the HT-CAD model can assess all images, allowing radiologists to work far more efficiently. Importantly, in our subgroup analysis the accuracy of HT-CAD for different thyroid hormone levels remained as high as 87.1% to 89.4%, which demonstrates the robustness of this model. These findings also indicate that different thyroid hormone levels have a negligible effect on the subsequent ultrasonic diagnosis of HT. The sensitivity of our model remained stable across the hyperthyroidism, hypothyroidism, and euthyroidism subgroups, ranging from 88.3% to 91.1%. Consequently, the missed diagnosis rate remained low, and HT-CAD is well suited to HT screening. By contrast, the specificity of HT-CAD differed significantly between these 3 subgroups (67.4%, 87.4%, and 90.8%, respectively). Specificity in the hypothyroid and euthyroid subgroups was significantly higher than that of the senior radiologists, which indicates that HT-CAD had a low error rate in these subgroups and can deliver good clinical performance with a low risk of misdiagnosis. However, the specificity of HT-CAD in the hyperthyroidism subgroup was lower than that of the senior radiologists. This could be explained by the fact that patients with hyperthyroidism have rich thyroid blood flow and markedly irregular glandular hyperplasia, which makes it easier for patients without HT to be classified as having the imaging characteristics of HT. This is also consistent with the difficulty that US physicians have in identifying HT by visual inspection. Our future work will focus on improving the accuracy of our model in this type of patient.
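The link drawn above between sensitivity and missed diagnoses, and between specificity and misdiagnoses, follows directly from the confusion-matrix definitions. The sketch below shows the arithmetic; the counts are hypothetical round numbers chosen for readability and are not taken from any subgroup in this study, and the function name is illustrative only.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute standard diagnostic metrics from confusion-matrix counts.

    tp: HT cases called HT        fn: HT cases called non-HT (missed)
    tn: non-HT cases called non-HT  fp: non-HT cases called HT (misdiagnosed)
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "missed_diagnosis_rate": 1 - sensitivity,  # share of HT cases missed
        "misdiagnosis_rate": 1 - specificity,      # share of non-HT cases flagged as HT
    }

# Hypothetical counts (assumption): 100 patients with HT and 100 without.
example = diagnostic_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Read this way, a sensitivity of 88.3% to 91.1% corresponds to missing roughly 9% to 12% of HT cases, while a specificity of 67.4% means roughly one in three patients without HT in that subgroup would be flagged as HT.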
A CNN model learns multiple levels of feature representations from the input data through a deep architecture of many convolutional layers. Unlike manual identification, the image features learned and recognized by the CNN model are not 2-dimensional but high dimensional. Visualizing the features extracted by a CNN model would make its classifications more reliable and acceptable to clinicians; this is an active research direction in which great progress has been made (37). Studies have shown that radiologists’ accuracy improved significantly when reading with CAD (38). CAD systems have been approved by the U.S. Food and Drug Administration for use as a second opinion but not as a primary reader or prescreener (39,40). Therefore, although HT-CAD cannot simply replace the manual diagnosis of HT, it could improve the ability of US radiologists to perform accurate, efficient, and early diagnosis of this disease.
Our study has several limitations that must be considered. First, all images in the training and validation sets came from patients who underwent pathological examination (FNA/surgery) rather than from a normal screening setting for HT. The different prevalence might significantly affect the accuracy of our model in different populations, which could in turn limit the generalizability of our results. Second, our model was developed by reviewing US images from only 2 hospitals. A larger data set acquired from different hospitals with different types or models of US equipment is necessary to create a more comprehensive training set. The performance of our AI system is expected to improve greatly with the inclusion of more data; thus, it is necessary to expand our data sets with real-world data from other hospitals. Third, the US images used to train the model also included images of patients who had HT with thyroid nodules. In this study, we did not analyze the interference of such images on the accuracy of our model. The ultrasonic diagnosis of HT in patients with thyroid nodules can be very challenging. Consequently, as a future step in our research, we intend to investigate the interference and influence of HT with thyroid nodules on our model, which would enable us to extend our results and perform AI model-assisted diagnosis in these patients as well.
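The prevalence limitation noted above can be made concrete with a short calculation. The sketch below holds the overall sensitivity (0.890) and specificity (0.895) reported in the abstract fixed and recomputes accuracy and positive predictive value at several prevalences. It is a sketch under assumptions: the prevalence values are illustrative and not from this study, the function names are hypothetical, and sensitivity and specificity may themselves shift in a screening population, which this calculation ignores.

```python
def accuracy_at_prevalence(sensitivity, specificity, prevalence):
    """Accuracy is a prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1 - prevalence)

def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value: share of positive calls that are true HT."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)

SENSITIVITY = 0.890  # overall value reported in the abstract
SPECIFICITY = 0.895  # overall value reported in the abstract

# Illustrative prevalences (assumption), from screening-like to enriched cohorts.
for prevalence in (0.05, 0.20, 0.50):
    acc = accuracy_at_prevalence(SENSITIVITY, SPECIFICITY, prevalence)
    ppv = ppv_at_prevalence(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f}  accuracy={acc:.3f}  PPV={ppv:.3f}")
```

Under these assumptions, accuracy moves only slightly because sensitivity and specificity are nearly equal, whereas the positive predictive value depends strongly on prevalence, which is why validation in a true screening population matters.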

Conclusion

In conclusion, the HT-CAD strategy based on CNN significantly improved the radiologists’ diagnostic accuracy of HT. HT-CAD demonstrated good performance and robustness across different hospitals and thyroid hormone levels. A larger HT database is needed to further improve the diagnostic accuracy for HT. Overall, the HT-CAD model is a valuable method for the diagnosis of HT and should be tested in prospective clinical trials.
