Predicting Malignancy in Pediatric Thyroid Nodules: Early Experience With Machine Learning for Clinical Decision Support.

Lebohang Radebe^1,2, Daniëlle C M van der Kaay^3, Jonathan D Wasserman^4,5, Anna Goldenberg^1,2,6,7.

Abstract

OBJECTIVE: To develop a machine learning tool to integrate clinical data for the prediction of non-benign thyroid cytology and histology. CONTEXT: Papillary thyroid carcinoma is the most common endocrine malignancy. Since most nodules are benign, the challenge for the clinician is to identify those most likely to harbor malignancy while limiting exposure to surgical risks among those with benign nodules.
METHODS: Random forests (augmented to select features based on our clinical measure of interest), in conjunction with interpretable rule sets, were used on demographic, ultrasound, and biopsy data of thyroid nodules from children younger than 18 years at a tertiary pediatric hospital. Accuracy, false-positive rate (FPR), false-negative rate (FNR), and area under the receiver operator curve (AUROC) are reported.
RESULTS: Our models predict nonbenign cytology and malignant histology better than historical outcomes. Specifically, for biopsy prediction in 67 patients (28 with benign and 39 with nonbenign histology), we expect a 68.04% improvement in the FPR, an 11.90% increase in accuracy, and a 24.85% increase in AUROC. For surgery prediction in 53 patients (42 with benign and 11 with nonbenign histology), we expect a 23.22% decrease in FPR, a 32.19% increase in accuracy, and a 3.84% decrease in AUROC. All differences are absolute, in percentage points. This improvement comes at the expense of the FNR: we expect that 10.27% of patients with nonbenign cytology would be steered away from biopsy and 11.67% of those with malignancy away from surgery. Given the small number of patients, these improvements are cross-validated estimates and have not been tested on an independent test set.
CONCLUSION: This work presents a first attempt at developing an interpretable machine learning based clinical tool to aid clinicians. Future work will involve sourcing more data and developing probabilistic estimates for predictions.

Keywords: thyroid nodule malignancy

Year: 2021 PMID: 34160618 PMCID: PMC8824766 DOI: 10.1210/clinem/dgab435

Source DB: PubMed Journal: J Clin Endocrinol Metab ISSN: 0021-972X Impact factor: 5.958

Papillary thyroid carcinoma (PTC) is the most common endocrine malignancy. PTC is also the most common malignancy overall in young women aged 15 to 29 years. The annual incidence is rising worldwide, with a female predominance emerging in early adolescence (1). Although thyroid nodules in children are significantly more likely to be malignant than in adults, even in children, roughly 70% of nodules are benign (2, 3). The challenge to the clinician is therefore to stratify those patients most likely to harbor malignancy and to prioritize surgery for these patients, while limiting exposure to surgical risks among those with benign nodules. Clinical evaluation, ultrasound, and fine-needle aspiration cytology are fundamental modalities in assessing the likelihood that a nodule is malignant, yet each is associated with a substantial proportion of indeterminate results and each lacks the ability to accurately exclude malignancy in a substantial proportion of nodules (4-7). While some clinical features, including rock-hard texture and presence of cervical adenopathy, are closely associated with malignancy, there are no absolute predictors of benign disease. Gannon and colleagues (5) reviewed a large cohort of pediatric thyroid nodules and determined that ultrasound alone cannot satisfactorily identify sonographic features to adequately exclude malignancy. Similarly, ultrasound cannot be used to refine malignancy risk for cytologically indeterminate pediatric nodules (8). Finally, in a recent meta-analysis of pediatric cytopathology series, 5% to 28% of fine-needle aspiration biopsy (FNAB) specimens were nondiagnostic and 3.3% to 38% were cytologically indeterminate (7). In several recent pediatric case series, including one at our own institution, the malignancy rate among surgically managed patients with thyroid nodules was 51% to 64% (2, 7, 9). 
This reflects a high rate of surgery for benign disease and highlights the inherent limitations of current interpretation of preoperative diagnostic modalities in accurately identifying patients at high risk of malignancy for surgery. In an effort to refine preoperative risk assessment, the McGill Thyroid Nodule Score (MTNS) was developed to integrate clinical, sonographic, and cytologic data (10, 11). It uses expert-defined weighted parameters to generate a likelihood score. This has recently been applied to 2 small pediatric series and appears to show some promise (12, 13). These works adopt an expert-opinion approach to selecting included variables and relative weighting, rather than a top-down derivation of the predictive score based on the primary data. Using a retrospectively acquired data set of 198 pediatric patients, we sought to derive a computational model to integrate clinical, radiological, and cytological features to refine prediction of malignancy, so as to stratify those patients most likely to harbor malignancy and to identify those who may qualify for expectant management.
Machine learning (ML), a subfield of artificial intelligence, has become increasingly popular in medicine because of its ability to learn highly complex patterns from data. Popular applications include disease identification, for example, through predicting cancer using radiological images (14); understanding the human genome, for example, through recognizing patterns in DNA sequences and understanding the mechanisms of gene expression using large, complex genetic and genomic data sets (15); and improving patient health outcomes, for example, through optimizing caregiver workflows and resources or predicting patients most at risk for adverse events (16). These applications have produced state-of-the-art results, frequently surpassing clinical intuition.
Random forest (RF) is a type of ML method that offers interpretability while retaining predictive power of other ML algorithms (17). RFs have been applied to medical data sets for which prediction power as well as interpretability are often necessary. For example, RFs were used to predict early rejection in kidney transplantation patients using features including: demographic data (eg, sex, age), number of years on dialysis, cytometry crossmatch, and total number of human leukocyte antigen mismatches between donor and recipient (18). Using a small sample size of just 80 patients, RFs were able to not only predict which patients were more likely to suffer rejection with an accuracy of 85%, but they were able to identify key risk factors associated with acute rejection. With respect to thyroid cancer diagnosis, RFs were used to classify nodules as benign vs malignant using tissue microarray data of 100 benign and 105 malignant thyroid lesions (19). In this instance, the RFs achieved an accuracy of 91% and were used not only for prediction but for understanding which variables were important for distinguishing between benign and malignant nodules. In another study, an RF classifier performed better than a radiologist at predicting malignancy of thyroid nodules using 11 sonographic features extracted from 2064 adult thyroid nodules (20). In this case, the RF classifier achieved an area under the curve of 0.938 compared to the radiologist’s area under the curve of 0.843. While the advantages of applying RFs in the medical realm are clear, our work pushes their application a step further. This paper presents interpretable and easily implementable diagnostic aids for physicians who want to predict the need for thyroid biopsies and surgery for pediatric patients. To our knowledge, our work is the first to approach this task using demographic, ultrasound, and biopsy data already collected from patient records. 
Furthermore, our computational pipeline selects only the most statistically important variables for the construction of the final model, a model that transforms the RF into an even more interpretable rule set (21). Each rule is presented to the physician with an associated accuracy—namely, the number of patients used to construct that rule, and the respective rates of benign and nonbenign histology. The innovation of the present rule set is that there does not exist a single decision tree that could capture the complexity of the proposed decision-making process yet the clarity of the proposed model is on par with a decision tree. These historical data are then used to generate predictions about future patients based on their similarity to previous patients, thus demystifying the model. Additionally, we can refine and strengthen the confidence of rules as more data are observed in the clinic, thereby ensuring that the resulting model is the state of the art as it continues to be used in the clinic. Ultimately, we envision broad clinical adoption of such models to aid in the determination of the need for biopsy and surgery, resulting in substantial reduction of the burden of unnecessary thyroidectomy and improved quality of life for the population of patients with thyroid nodules.

Materials and Methods

Experimental Design

This study was approved by the SickKids Research Ethics Board. Consent was waived for retrospective chart review. Medical records were reviewed for 198 consecutive patients with thyroid masses at a single tertiary care institution. As an inherent limitation of retrospective studies, complete data were available for a minority of cases. Exclusion criteria included inadequate quality of original ultrasound (primarily for older studies) or absence of data recorded in the medical chart. In addition, historically, a significant proportion of patients proceeded directly to thyroidectomy without prior biopsy. This was predicated on a recognition of high rates of malignancy when compared to adult nodules and lack of data supporting use of FNAB in children. This practice has since been superseded at our institution, but is still reflected in this historical data set. In total, 55 out of 198 patients (27.78%) proceeded directly to thyroidectomy, and therefore all were excluded from analysis. Where available, preoperative ultrasounds were prospectively reviewed using prespecified criteria by 3 radiologists, blinded to surgical histology and clinical outcome, as previously described (4). Of the remaining 143 patients, preoperative ultrasound studies of adequate diagnostic quality were available for 140 patients (69 patients with malignant nodules and 71 benign) and thus were potentially eligible to be included in our study subject to missingness. Final diagnosis was determined based on surgical histology. For nonoperative cases, a minimum 2-year follow-up without disease progression was established as the criterion for likely benign disease.

Predicting the Need for Biopsy and Surgery

Following retrospective chart analysis, we dichotomized the cohort into those with thyroid malignancy (based on surgical histology) or with presumed benign disease (based on either histology or, for patients managed nonoperatively, absence of clinical progression after a minimum of 2 years’ follow-up). The objective of this analysis was to test the hypothesis that a computational approach could use preoperative variables to predict nodules unlikely to be benign, thus “needing” surgery. We then applied ML algorithms to identify those factors most predictive of malignant histology. In a subsequent iteration, we “blinded” the algorithms to cytopathology and applied a similar approach to the prediction of cytopathology, based exclusively on clinical parameters and sonographic features. The rationale for doing so was to identify whether a computational model could achieve a satisfactory prediction of nodules unlikely to have benign cytology. Those nodules predicted by the model to have Bethesda 3 to 6 cytopathology were categorized as “nonbenign” cytology. Derivation of test performance is detailed later. We assessed this data set using the MTNS, as modified for pediatrics by Canfarotta et al (12). Inasmuch as the present series did not assess nodules for interval growth, this variable was excluded from the MTNS calculations.

Statistical Analysis

In our study, we had one patient with new-onset Hashimoto thyroiditis who presented with a thyroid mass and was contemporaneously found to have a thyrotropin level of 89.60 (when the range without this patient was [0.01-17.00]), and one person with a purely cystic composition of the nodule. These are too few patients (outliers) for an algorithm to draw conclusions. From a modeling perspective, outliers are excluded because it is dangerous to assume their presentation is a consistent pattern across all patients and therefore build this pattern into the model if we have only 1 or 2 patients who fit the criteria. Given that our data set is small, withholding a separate test set was not feasible. Thus, we used k-fold cross-validation (CV) to estimate the performance we can expect on a withheld test set. CV works by partitioning patients into different groups, using k-1 folds for training and retaining one for testing, and repeating for all folds. To address the class imbalance for the surgery prediction model, the majority class was downsampled to create a class imbalance ratio of at most 40:60.
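The resampling scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors’ code: the function names are hypothetical, k = 5 is chosen only for the example (the text does not state k), and the 40:60 limit is read as minority:majority, so that the majority class is capped at 1.5 times the minority count. Whether downsampling was applied within each training fold or once up front is not specified in the text.

```python
import random


def k_fold_indices(n_patients, k, seed=0):
    """Shuffle patient indices and split them into k nearly equal folds."""
    indices = list(range(n_patients))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    for held_out, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


def downsample_majority(labels, minority_parts=40, majority_parts=60, seed=0):
    """Return sorted indices with the majority class randomly reduced so the
    minority:majority ratio is at least minority_parts:majority_parts.
    Assumes exactly two classes are present in `labels`."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority, majority = sorted(by_class.values(), key=len)
    # Integer arithmetic avoids floating-point surprises at the boundary.
    cap = len(minority) * majority_parts // minority_parts
    kept = random.Random(seed).sample(majority, min(cap, len(majority)))
    return sorted(minority + kept)


# Illustrative use with k = 5 (an assumption) on a cohort of 53 patients.
folds = k_fold_indices(53, k=5)
splits = list(train_test_splits(folds))
```

With the surgery cohort sizes reported in the Results (42 benign, 11 malignant), this cap would retain 16 benign patients alongside all 11 malignant ones.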

Discovering Complex Patterns in the Data

We built an RF approach to discover complex patterns between variables in the training data: The algorithm creates sets of rules that are applied sequentially to iteratively split a cohort of patients into smaller and smaller groups based on sets of similar characteristics (17). These complex patterns are used to group patients of similar histology so that the final rule applied separates benign from malignant patients. Individual trees in the forest are not interpretable in isolation and may seem counterintuitive when viewed individually. To address this limitation of RFs, our final models are interpretable rule sets derived from the RFs, as described later. A new test patient is classified as having likely benign or likely nonbenign histology by feeding the patient’s information through the rule sets.

Creating Models for a Clinical Setting

Given the need for use in a clinical setting, we created an easy-to-use, interpretable model. We did this by using only the most predictive variables to reduce the size and complexity of the model while retaining predictive accuracy. In our clinical context, failing to detect a malignant nodule (a false negative [FN]) is worse than performing a biopsy or surgery on what turns out to be a benign nodule (a false positive [FP]); the former could result in a patient who has cancer not receiving adequate intervention. Thus, we redefined importance in the standard RF mean-decrease-in-accuracy feature importance algorithm: instead of treating FNs and FPs as equally bad, an important predictor prioritizes decreasing FNs first, followed by FPs. In this way, we first minimized the false-negative rate (FNR), the proportion of people with malignant nodules in whom the model fails to detect malignancy (FN/[FN + true positive]), followed by the false-positive rate (FPR), the proportion of people with benign nodules who incorrectly undergo a biopsy or surgery (FP/[FP + true negative]). Our procedure consists of 3 steps. First, individual feature performance was calculated in the same way as in RF, using out-of-bag observations (17), but with FNR and FPR as the importance metrics. Second, we ranked features by sorting them based on the largest decrease in FNR followed by FPR. The feature with the smallest decrease was deemed least important (17). If features were tied for the smallest decrease in FNR and FPR, we randomly selected one of them to remove. Third, having established an importance measure, we then selected the optimal subset of features for prediction using backwards feature elimination (22) and the mean decrease in FNR/FPR criteria described earlier. We started with the full set of demographic, ultrasound, and biopsy predictors and iteratively removed the least informative feature at each step.
The optimal number of features was determined by looking at the size of the predictor set that resulted in the lowest FNR followed by lowest FPR during the backwards feature elimination process; if there was a tie, the model with the smallest number of predictors was chosen. After performing CV to estimate the performance we can expect on a withheld test set, we reperformed the whole procedure on the entire data set and the model with the optimal number of important predictors was selected as the final model.
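A minimal Python sketch of the backward elimination loop follows. It assumes two caller-supplied functions, here called evaluate and importance (hypothetical names), that return the cross-validated (FNR, FPR) of a model built on a feature subset and the per-feature decrease in FNR and FPR; the RF training itself is left out. Ties are broken deterministically for brevity, whereas the procedure above breaks them at random. The numbers in the usage lines are toy values, not results from this study.

```python
def backward_elimination(features, evaluate, importance):
    """Drop the least important feature one at a time; return the best subset.

    evaluate(subset)   -> (fnr, fpr) estimated for a model using `subset`
    importance(subset) -> {feature: (fnr_decrease, fpr_decrease)}
    Both callables are hypothetical stand-ins supplied by the caller.
    """
    current = list(features)
    candidates = []
    while current:
        fnr, fpr = evaluate(current)
        candidates.append((fnr, fpr, len(current), list(current)))
        if len(current) == 1:
            break
        scores = importance(current)
        # The smallest decrease in FNR, then in FPR, marks the least important
        # feature. Ties fall to the first feature here (the text uses a random pick).
        least = min(current, key=lambda f: scores[f])
        current.remove(least)
    # Lowest FNR first, then lowest FPR, then the fewest predictors.
    return min(candidates, key=lambda c: c[:3])[3]


# Toy example with made-up performance and importance values.
toy_perf = {("a", "b", "c"): (0.10, 0.30), ("a", "b"): (0.10, 0.20), ("a",): (0.25, 0.10)}
toy_imp = {"a": (0.30, 0.10), "b": (0.10, 0.20), "c": (0.00, 0.05)}
best = backward_elimination(
    ["a", "b", "c"],
    evaluate=lambda subset: toy_perf[tuple(subset)],
    importance=lambda subset: {f: toy_imp[f] for f in subset},
)
```

In the toy run, the two-feature subset is selected: it ties the full set on FNR but has a lower FPR, and the single-feature model is rejected because its FNR is higher.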

Applicability to Clinical Settings

Our model is intended for use in the clinical domain, where interpreting the exact role of individual features in a prediction can be difficult for each individual patient. Thus, following model building and assessment of feature importance, we extracted a set of rules to represent the forest. These rules have the advantage of being shorter and interpretable by clinicians. We used inTrees, a package in R, to perform this function (21). inTrees extracts rules from RFs based on minimizing the error rate and maximizing the frequency of observations that were used to construct the rule. Thus, we use it to transform our RF into rule sets. From a clinical perspective, determining the likelihood or increase in odds of having malignant histology is preferable to strictly assigning binary labels. One challenge with rule sets is that they assign hard labels to each rule. In addition, given the sample size, accurate probabilities cannot be derived for our rule set. In lieu of probabilities, the relative frequencies of the number of patients classified by each rule are used as an indication of the accuracy of each rule. This means that when a patient is classified as likely benign histology by our final rule set, the clinician will be able to see the number of historical patients who were classified using this rule. In this way, the rule set acts as a diagnostic aid to clinicians, not as a black box that merely outputs labels.

Results

Predicting the Need for Biopsy

Of the 140 potentially eligible patients in the entire cohort, only 67 had no missing values for all entries, and thus had sufficient data for biopsy prediction. Of these patients, 28 had benign histology and 39 had nonbenign. Table 1 shows historical patient outcomes. If a nodule was suspicious (based on clinical suspicion and/or sonographic features), biopsy was performed. If nodules were strongly suspected of being benign, biopsy was deferred. Outcome was then determined by histology and/or clinical follow-up as defined in “Materials and Methods.”

Table 1.

Historical cytologic outcomes based on predictions made on clinical impressions for all patients

		Prediction based on clinical impression
		Benign	Nonbenign/Uncertain
Cytology or 2-year follow-up	Benign	5	23
	Nonbenign	0	39

If a nodule was suspicious and therefore deemed nonbenign or uncertain (based on clinical suspicion and/or sonographic features), biopsy was performed. If nodules were strongly suspected of being benign, biopsy was deferred. Outcome was then determined by histology and/or clinical follow-up as defined in “Materials and Methods.”

Of the 5 nodules without biopsy (and with adequate follow-up), none progressed to malignancy, suggesting adequate stratification of very low-risk nodules. We acknowledge a significant negative bias here, as most patients with very low-risk presentation (for example, unambiguous colloid cysts) may not have had ongoing follow-up at the tertiary site and would have been excluded from this analysis. Thus, this category is certainly underrepresented. Among patients who were felt to merit FNAB, 24 had benign cytology (Bethesda 2) and 39 had indeterminate or malignant cytology. We asked whether a computational model could better identify those patients who would ultimately go on to have benign cytology, based solely on clinical variables and ultrasound, thereby avoiding biopsy altogether. Table 2 describes the outcome of these models derived from the retrospective data. Both models, RF and rule set, reduced the biopsy rate at the expense of the FN rate (FNs being nodules for which biopsy was not recommended but that would turn out to be malignant).

Table 2.

Machine learning prediction of benign and nonbenign cytology

	Accuracy, % (± SD)	False-negative rate, % (± SD)	False-positive rate, % (± SD)	Area under receiver operator curve, % (± SD)
Historical practice (clinical formulation)	65.67	0.00	82.14	58.93
Random forest	83.55 ± 1.58	12.50 ± 4.79	21.43 ± 9.22	83.04 ± 2.48
Rule set	77.57 ± 5.07	10.27 ± 6.78	14.10 ± 5.43	83.78 ± 4.46

Compares the results of historical practice to random forest classifier and the simplified rule set using 4 measures of performance.
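The historical-practice row of Table 2 follows directly from the counts in Table 1. The sketch below recomputes it in Python; the function name is hypothetical. It assumes that the historical AUROC was obtained as the average of sensitivity and specificity, which is what the area under an ROC curve reduces to for a single hard decision. The text does not state this, but the assumption reproduces the reported 58.93%.

```python
def rates(tn, fp, fn, tp):
    """Accuracy, FNR, FPR, and single-threshold AUROC, all in percent."""
    total = tn + fp + fn + tp
    fnr = fn / (fn + tp)  # missed nonbenign nodules among all nonbenign
    fpr = fp / (fp + tn)  # benign nodules sent to biopsy among all benign
    return {
        "accuracy": round(100 * (tp + tn) / total, 2),
        "fnr": round(100 * fnr, 2),
        "fpr": round(100 * fpr, 2),
        # With one hard decision the ROC curve has a single interior point,
        # so its area equals (sensitivity + specificity) / 2.
        "auroc": round(100 * ((1 - fnr) + (1 - fpr)) / 2, 2),
    }


# Table 1: 5 benign deferred, 23 benign biopsied,
# 0 nonbenign deferred, 39 nonbenign biopsied.
historical_biopsy = rates(tn=5, fp=23, fn=0, tp=39)
```

The same function applied to the counts in Table 5 (13, 29, 0, 11) gives the historical-practice row of Table 6.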

The rule set is the final derived model. Also shown are the test characteristics of the RF model. Our model was able to identify the need for biopsy with an accuracy of 77.57% (± 5.07% SD) compared with the historical accuracy of 65.67%, an increase of 11.90%. For all patients who have nonbenign cytology, we expect 10.27% (± 6.78% SD) would be incorrectly identified as not needing a biopsy compared to a historical rate of 0.00%. This means our prediction would perform worse than the clinical impression alone by 10.27%. For those with benign histology, we expect 14.10% (± 5.43% SD) would be incorrectly identified as needing a biopsy compared to a historical rate of 82.14%. This means our prediction would perform better than the historical rate by 68.04%. The area under the receiver operator curve (AUROC) is 83.78% (± 4.46% SD) compared to 58.93% for historical practice, indicating a 24.85% increase in performance. In summary, using the rule set model to identify nodules for biopsy would reduce the number of biopsies performed unnecessarily (for benign nodules) while maintaining a low miss rate for nodules with nonbenign cytology. Our final biopsy rule set, trained using all the historical data, is presented in Table 3. A comparison of historical outcomes with current practice (based on sonographic risk assessment) (23) and the rule set is summarized in Table 4. We also asked whether the MTNS, as modified for pediatrics (12), could discriminate between benign and nonbenign cytology. The MTNS, however, relies heavily on the results of cytology to generate a score to predict malignancy; thus, it cannot be used in its current incarnation to ascertain the need for biopsy.
We analyzed the data from our series using the MTNS after excluding the cytology scores; the results are presented in Supplementary Figure 1 (24).

Table 3.

Example of a final rule set to determine indication for biopsy

Rule No.	Rule	Decision	Historical No. of patients that correctly satisfy specific rule	Historical No. of patients that correctly satisfy all rules
1	Composition of nodule is entirely solid	Likely nonbenign–recommend biopsy	12/12	12/12
	LNs appear normal
	Tumor is unifocal
2	Nodule > 50% cystic	Likely benign–defer biopsy	7/7	19/19
	Margin is regular
	LNs appear normal
3	Composition of nodule is entirely solid	Likely nonbenign–recommend biopsy	4/4	23/23
	Hypoechoic halo is either absent OR complete but not partial
	Margin is irregular/microlobulated/spiculated
4	Composition of nodule is entirely solid	Likely nonbenign–recommend biopsy	9/10	32/33
	Margin is indistinct
	Tumor is unifocal OR multifocal (unilaterally)
5^a	Composition of nodule is mixed solid/cystic < 50% cyst	Likely benign–defer biopsy	10/11	42/44
	Hypoechoic halo is absent OR complete (ie, not partial)
	Tumor is unifocal OR multifocal (bilaterally)
6	Composition of nodule is entirely solid	Likely nonbenign–recommend biopsy	6/7	48/51
	LNs are enlarged but normal appearing OR are suspicious for metastasis
7^a	LNs are not visualized or are visualized contralateral to primary tumor (but not ipsilaterally)	Likely benign–defer biopsy	3/4	51/55
	Tumor is multifocal (unilaterally)
8	Composition of nodule is entirely solid	Likely nonbenign–recommend biopsy	3/4	54/59
	Margin is indistinct
	Tumor is multifocal and bilateral
9	Otherwise	Likely nonbenign–recommend biopsy	3/8	57/67

Abbreviation: LN, lymph node.

^a These misclassifications are the least acceptable type of error: patients classified as likely benign when they were not.
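The last column of Table 3 is the running total of the per-rule counts, and the rule set column of Table 4 can be recovered from the same counts. The short Python check below illustrates both; the variable names are hypothetical, and it assumes that a patient counted as correct under a defer rule is benign and under a biopsy rule is nonbenign.

```python
from itertools import accumulate

# (decision, correct, matched) for rules 1 to 9, copied from Table 3 in order.
RULES = [
    ("biopsy", 12, 12), ("defer", 7, 7), ("biopsy", 4, 4),
    ("biopsy", 9, 10), ("defer", 10, 11), ("biopsy", 6, 7),
    ("defer", 3, 4), ("biopsy", 3, 4), ("biopsy", 3, 8),
]

# Running totals reproduce the "all rules" column of Table 3.
cumulative_correct = list(accumulate(c for _, c, _ in RULES))
cumulative_matched = list(accumulate(m for _, _, m in RULES))

# Split by true class to recover the rule set column of Table 4.
benign_defer = sum(c for d, c, m in RULES if d == "defer")
nonbenign_defer = sum(m - c for d, c, m in RULES if d == "defer")
nonbenign_biopsy = sum(c for d, c, m in RULES if d == "biopsy")
benign_biopsy = sum(m - c for d, c, m in RULES if d == "biopsy")
```

Under these assumptions the totals are 20 benign nodules deferred and 8 biopsied, and 2 nonbenign nodules deferred and 37 biopsied, matching the rule set entries of Table 4.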

Table 4.

Comparison of decision making for the need for biopsy according to historical practice, current practice and our biopsy rule set model

Cytology or 2-year follow-up^a	Predicted to not need/need biopsy
	Historical data set	All patients evaluated according to current practice^b	Rule set model
Benign	5^a/23	13/15	20/8
Nonbenign	0/39	3/36	2/37

^a Only patients with minimum 2-year follow-up were included.

^b Decision to pursue biopsy based on clinical and sonographic features.

Modeling Histology Among Patients With Benign, Insufficient or Indeterminate Biopsy Results

We also asked whether an ML model could predict malignant histology among nodules with nonmalignant cytology (Bethesda 1-5). The rationale for this was that any patient with malignant (Bethesda 6) cytology would de facto merit surgery. The converse, at least in children, is not necessarily true, in that the FNR of benign fine-needle aspiration cytology (malignant histology in a nodule with benign biopsy) is higher than in adults (6, 7). Additionally, sampling error may lead to missed malignancy among large nodules greater than 3 cm. As such, we elected to include nodules with benign (Bethesda 2) cytology in our analysis. Included in these data were 3 clinically “FN” nodules with benign cytology, which were ultimately demonstrated to harbor malignancy, based on surgical histology. Of the 140 potentially eligible patients, 53 had benign, insufficient, or indeterminate biopsy results and sufficient data to build a model. Of these, 42 had benign histology and 11 were malignant. Table 5 shows patient outcomes according to historical practice. The malignancy rate in those with nonmalignant cytology who underwent surgery was 11 out of 40 (27.5%). Stated otherwise, 72.5% of these patients underwent surgery for benign disease that might have been avoided had more accurate preoperative stratification been available. We therefore set out to determine whether a predictive model could reduce this rate. The results of our final rule set are included in Table 6. Our model predicted malignancy with an accuracy of 77.47% (± 2.71% SD) compared with the historical accuracy of 45.28%, an increase of 32.19%. If this model were to replace current practice, we expect that 11.67% (± 1.32% SD) of patients with malignant histology would be triaged to nonoperative management, compared with 0.00% under historical practice.
While at face value this seems a high “miss” rate, it must be interpreted in the context of the typically indolent nature of papillary thyroid carcinoma in children, for whom opportunity for surgical salvage with excellent outcomes exists. Ongoing follow-up of such patients would still afford the opportunity for surgical cure with progression of underlying disease, while nonprogressive disease could be monitored indefinitely.

Table 5.

Historical outcomes based on predictions made on clinical impressions for patients with nonmalignant cytology (Bethesda 1-5)

		Prediction based on clinical impression
		Suspected benign—managed nonoperatively	Uncertain (cannot exclude malignancy) —underwent surgery
Histology	Benign	13	29
	Malignant	0	11

If a nodule was uncertain (based on clinical suspicion and/or sonographic features and/or biopsy results), surgery was performed. If nodules were strongly suspected of being benign, surgery was deferred. Outcome was then determined by histology and/or clinical follow-up as defined in “Materials and Methods”.

Table 6.

Machine learning prediction for benign vs nonbenign histology

	Accuracy, % (± SD)	False-negative rate, % (± SD)	False-positive rate, % (± SD)	Area under receiver operator curve, % (± SD)
Historical practice	45.28	0.00	69.05	65.48
Random forest classifier	83.24 ± 4.33	29.17 ± 17.18	14.09 ± 8.79	78.37 ± 4.96
Rule set	77.47 ± 2.71	11.67 ± 1.32	45.83 ± 20.83	61.64 ± 10.28

Compares the results of historical practice to random forest classifier and the simplified rule set using 4 measures of performance.

This model would still endorse “unnecessary” surgical management in 45.83% of patients with benign histology; however, this reflects a reduction of 23.22% over historical practice. In other words, this model would spare roughly 1 patient in 4 an unnecessary surgery. The AUROC is 61.64% (± 10.28% SD) compared to a historical value of 65.48%, indicating a 3.84% decrease in performance. Our final surgery rule set, trained using all the historical data, is presented in Table 7. We compare the historical data with predictions based on the “modified” MTNS, current practice based on American Thyroid Association criteria (23), and the rule set model in Table 8.
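The differences quoted here are absolute, in percentage points, taken between the rule set and historical practice rows of Table 6. A small Python check, with hypothetical variable names, makes the arithmetic explicit, including the “1 patient in 4” figure.

```python
# Rows of Table 6, in percent.
historical = {"accuracy": 45.28, "fnr": 0.00, "fpr": 69.05, "auroc": 65.48}
rule_set = {"accuracy": 77.47, "fnr": 11.67, "fpr": 45.83, "auroc": 61.64}

# Absolute change in percentage points (rule set minus historical practice).
change = {name: round(rule_set[name] - historical[name], 2) for name in historical}

# Number of patients with benign nodules evaluated per unnecessary surgery avoided.
benign_per_surgery_avoided = round(100 / -change["fpr"], 1)
```

A reduction of 23.22 points in the FPR corresponds to about one avoided surgery for every 4.3 patients with benign nodules, which is the basis for the approximation in the text.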

Table 7.

Final surgery decisional rule set (after biopsy)

Rule No.	Rule	Decision	Historical No. of patients that correctly satisfy specific rule	Historical No. of patients that correctly satisfy all rules
1	Margin is regular	Likely benign–defer surgery	27/27	27/27
	Cytology is benign or inadequate (Bethesda 1 or 2)
2	There are no echogenic foci	Likely benign–defer surgery	5/5	32/32
	Solid component is hypoechoic or markedly hypoechoic
	LNs are not visualized or are visualized contralateral to the primary tumor (but not ipsilaterally)
	Cytology is benign (Bethesda 2)
3	LNs are not visualized or are visualized contralateral to primary tumor (but not ipsilaterally)	Likely benign–defer surgery	4/4	36/36
	Cytology is inadequate (Bethesda 1)
4	Solid component is hypoechoic or markedly hypoechoic	Likely nonbenign–consider surgery	6/7	42/43
	LNs are not visualized or are visualized contralateral to the primary tumor (but not ipsilaterally)
	Cytology is indeterminate (Bethesda 3-5)
5	Solid component is isoechoic, hyperechoic or mixed echogenicity	Likely nonbenign–consider surgery	3/5	45/48
	Hypoechoic halo is absent
	Margin is irregular/microlobulated/spiculated OR indistinct
	Cytology is benign or indeterminate (Bethesda 2-5)
6	Otherwise	Likely nonbenign–consider surgery	2/5	47/53

Abbreviation: LN, lymph node.
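
To make the sequential logic of Table 7 concrete, the following is a minimal sketch of the rule set as an ordered decision list, in which the first matching rule determines the suggestion. It is not the authors' implementation: the Nodule fields, their string encodings, and the function name are hypothetical, and it assumes that the conditions listed within a rule must all hold together.

```python
from dataclasses import dataclass

DEFER = "Likely benign: defer surgery"
CONSIDER = "Likely nonbenign: consider surgery"

@dataclass(frozen=True)
class Nodule:
    # Hypothetical encoding of the features named in Table 7.
    margin: str              # "regular", "irregular", "microlobulated", "spiculated", "indistinct"
    echogenic_foci: bool     # True if echogenic foci are present
    solid_echogenicity: str  # "hypoechoic", "markedly hypoechoic", "isoechoic", "hyperechoic", "mixed"
    ipsilateral_lns: bool    # True if lymph nodes are visualized ipsilateral to the tumor
    hypoechoic_halo: bool    # True if a hypoechoic halo is present
    bethesda: int            # cytology category, 1 to 5 in this rule set

def surgery_rule(n: Nodule) -> tuple[int, str]:
    """Return (rule number, suggestion) for the first Table 7 rule that matches."""
    hypo = n.solid_echogenicity in {"hypoechoic", "markedly hypoechoic"}
    non_hypo = n.solid_echogenicity in {"isoechoic", "hyperechoic", "mixed"}
    bad_margin = n.margin in {"irregular", "microlobulated", "spiculated", "indistinct"}
    no_ipsi = not n.ipsilateral_lns  # not visualized, or contralateral only

    if n.margin == "regular" and n.bethesda in (1, 2):
        return 1, DEFER
    if not n.echogenic_foci and hypo and no_ipsi and n.bethesda == 2:
        return 2, DEFER
    if no_ipsi and n.bethesda == 1:
        return 3, DEFER
    if hypo and no_ipsi and 3 <= n.bethesda <= 5:
        return 4, CONSIDER
    if non_hypo and not n.hypoechoic_halo and bad_margin and 2 <= n.bethesda <= 5:
        return 5, CONSIDER
    return 6, CONSIDER  # "Otherwise"

example = Nodule("irregular", False, "hypoechoic", False, False, 4)
print(surgery_rule(example))
```

Because the rules are evaluated in order, a nodule that satisfies an earlier rule never reaches a later one, which is why the last column of Table 7 accumulates patients from rule to rule.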

Table 8.

Comparison of decision making for the need for surgery according to historical practice, Modified McGill Thyroid Nodule Score, cytology alone, and our surgery rule set model

Histology or 2-year follow-up	Predicted to not need/need surgery
	All patients evaluated according to historical practice	Modified McGill Thyroid Nodule Score ≥ 8/ ≥ 9	Cytology alone	Rule set model
Benign	13/29	36/6	40/14	36/6
		39/3
Nonbenign	0/11	3/33	2/22	0/11
		4/32


Discussion

This approach represents a first-pass effort at applying an ML solution to identifying those patients and those nodules that would most benefit from biopsy and from surgical intervention. While not appropriate for clinical decision-making at present, these analyses demonstrate an opportunity for applying computational approaches to retrospective data to refine clinical decision-making. These models are presently limited by an unacceptable “miss rate.” Ongoing refinement of the models and larger multi-institutional data sets may help reduce this rate to a clinically acceptable level. At no point would we envision an ML approach superseding clinical intuition, experience, and data integration. Rather, this would be an ancillary tool to help refine and supplement diagnostic modalities, which themselves have imperfect predictive capacity, as evidenced by the high operative rates for benign disease.

Improvement in Prediction

Our models predict nonbenign cytology and malignant histology better than historical practice, with lower FPRs (electing for biopsy or surgery in the context of benign nodules), higher accuracy, and, for the biopsy model, a higher AUROC. Specifically, our biopsy predictions see improvement across all 3 measures, most notably in the FPR; we expect a 68.04% improvement in the FPR, an 11.90% increase in accuracy, and a 24.85% increase in the AUROC. This indicates that our biopsy model, which comprises a simple set of rules, is expected to outperform historical practice in determining the need for biopsy. In addition, there are significant refinements in identifying those patients with indeterminate or inadequate cytology most likely to harbor malignancy, with a notable increase in accuracy; we expect a 23.22% decrease in FPR, a 32.19% increase in accuracy, and a 3.84% decrease in the AUROC. For most complex problems in the medical realm, researchers and physicians alike accept trade-offs between minimizing the FNRs and FPRs. For both our models, the improvements came at the expense of increases in FNR, the most clinically unacceptable type of error. Specifically, we expect that 10.27% of patients with nonbenign cytology would be discouraged from undergoing biopsy, and 11.67% of those with malignancy would be incorrectly steered away from surgery if relying on the prediction models alone. With relatively small cohort sizes, our models are limited in their ability to distinguish between benign and malignant disease. Given this limitation, we are encouraged by these preliminary results, as they demonstrate that learning is taking place and point to future directions of research. Thus, while the increase in the FNR is problematic, we expect our models to improve with increasing sample size.
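
As a rough check on how these measures relate to one another, the sketch below recomputes the historical surgery metrics from the historical-practice column of Table 8 (benign 13/29 and nonbenign 0/11, read as "not need/need surgery"). The function name is hypothetical, and treating the AUROC of a yes/no decision as the mean of sensitivity and specificity is an assumption about how the historical value was derived, although it does reproduce the reported 65.48%.

```python
def decision_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Metrics for a hard (yes/no) decision, with 'positive' meaning surgery."""
    fpr = fp / (fp + tn)   # benign patients sent to surgery
    fnr = fn / (fn + tp)   # nonbenign patients steered away from surgery
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    # Assumption: for a single operating point, the area under the ROC
    # curve reduces to the mean of specificity and sensitivity.
    auroc = ((1 - fpr) + (1 - fnr)) / 2
    return {"fpr": fpr, "fnr": fnr, "accuracy": accuracy, "auroc": auroc}

# Historical practice, Table 8: benign 13 deferred / 29 operated,
# nonbenign 0 deferred / 11 operated.
historical = decision_metrics(tn=13, fp=29, fn=0, tp=11)

for name, value in historical.items():
    print(f"{name}: {100 * value:.2f}%")

# Gap between the historical FPR and the reported rule set FPR of 45.83%,
# in percentage points.
fpr_gap = round(100 * historical["fpr"] - 45.83, 2)
print(f"FPR reduction: {fpr_gap} percentage points")
```

Under this reading, the reported 23.22% decrease in FPR is an absolute difference in percentage points (69.05% historically vs 45.83% for the rule set) rather than a relative change.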

Generalizability and Interpretability

To anticipate how well these models will perform, it is important to consider the concept of model overfit. Overfitting is a commonly encountered obstacle in ML that affects the ability of models to generalize to unseen data. Specifically, lack of generalizability occurs when a model fits the training data so closely that it learns patterns that are not present in the overall population. Instead, these patterns are idiosyncrasies of a particular set of patients. In our case, this would mean our models perform well on our cohort but would fail to predict well on 2 types of unseen patients: those who will be seen at our institution in the future and those from other institutions. Given the rarity of pediatric thyroid cancer, our models are trained on a relatively small number of patients. Because the sample size is so small and RF does well at extracting patterns from the data, we want to reduce the likelihood that the RF is overfitting. Simplifying the RF to the rule set accomplishes 2 things: First, the model is more likely to generalize to withheld data, and second, the model is more interpretable. The improvement in the metrics measured demonstrates the improvement in generalizability we anticipated when converting from RFs to rule sets. Specifically, predicting malignant histology using the rule set instead of the RF results in a substantial improvement of 17.5% in the FNR, the metric we are most concerned with limiting. Thus, for patients with malignant histology, the rule set performs better at predicting malignancy. With respect to other metrics, the overall accuracy and AUROC decreased by 5.77% and 3.84%, respectively, because the improvements in FNR came at the expense of inappropriately including patients with benign histology. While the FPR increased by 31.74%, it remains 23.22% better than historical performance, indicating an overall increase in the quality of the predictions in terms of the metrics we are most concerned with.
In terms of predicting nonbenign cytology, converting from RFs to rule sets improved the FNR and FPR by 2.23% and 7.33%, respectively, with a marginal improvement of 0.74% in the AUROC, indicating a slight increase in performance when converting from the RF to the rule set. Second, rule sets are more straightforward to follow and more actionable in the context of clinical care than are RFs. As a list of sequential rules that split the patients into groups based on their characteristics, the rule sets can be examined on 2 levels. First, the rules themselves are based on the relationship between the different features extracted from ultrasound and biopsy results. Thus, combinations of features can be examined to gain deeper insight into how these features are related on a biological level. Second, each rule is presented to the clinician with the number of patients from the training data whose outcomes were predicted by this rule, providing the clinician an indication of the rule’s historical accuracy. These historical data can then be used to make predictions on the likelihood of benign or nonbenign histology based on the similarity of a particular patient to previous patients, thereby demystifying the model and providing an interpretable way of predicting the need for biopsies and surgeries.
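
The per-rule counts described here can be tabulated directly. The short sketch below takes the "correctly satisfy specific rule" counts from Table 7 and derives each rule's historical accuracy together with the running totals shown in the table's last column; the variable names are illustrative only.

```python
from itertools import accumulate

# (rule number, correctly predicted, patients captured), copied from Table 7.
rule_counts = [(1, 27, 27), (2, 5, 5), (3, 4, 4), (4, 6, 7), (5, 3, 5), (6, 2, 5)]

# Running totals across the ordered rules.
cumulative_correct = list(accumulate(correct for _, correct, _ in rule_counts))
cumulative_total = list(accumulate(total for _, _, total in rule_counts))

for (rule, correct, total), cum_c, cum_t in zip(
    rule_counts, cumulative_correct, cumulative_total
):
    print(
        f"Rule {rule}: {correct}/{total} ({100 * correct / total:.0f}%), "
        f"cumulative {cum_c}/{cum_t}"
    )
```

The running totals end at 47/53, matching Table 7, and the later rules, which capture only 5 to 7 patients each, carry most of the errors.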

Limitations and Next Steps

We acknowledge 2 major limitations related to the small sample sizes used to build these models. First, these models may be learning patterns from a cohort of patients that are fundamentally different from patients at other institutions or patients who will be seen in the future. When discussing generalizability, we addressed how it can be improved by using a rule set instead of an RF; in this approach, predictions are improved by changing the type of model we are using. While using a different model addresses the issue of overfitting, it does not fix biases that are learned because of differences between cohorts of patients. These differences can be overcome only by training on data that are representative of all types of pediatric thyroid patients, and thus would require more training data. Thus, validation on a large external data set will be an important subsequent step in the refinement of this ML approach. Second, we want to provide clinicians a probabilistic estimate of whether the risk-benefit balance favors biopsy or surgery: Either a likelihood estimate or a change in the odds of a patient benefitting from a biopsy or surgery is more clinically useful than hard predictions of likely benign vs likely malignant histology. When discussing interpretability, we addressed how we partly overcame this challenge: Each rule has an associated relative frequency of patients from the training data who were captured by that rule, providing the clinician an indication of the rule’s historical accuracy. While these relative frequencies are helpful because they provide a clinician historical evidence as to the accuracy of a particular rule, generating a probabilistic interpretation would still be ideal because it is the more clinically relevant measure. However, this probabilistic interpretation is hindered by the small sample size because currently a consistent probability estimate cannot be generated. Given the rarity of pediatric thyroid cancer, sourcing more data can be a challenge.
Thus, future work will involve collecting more data to refit the models and improve prediction accuracy, testing these models on an external data set, and generating consistent probability estimates for the rule set.
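
To illustrate why small per-rule counts make probability estimates unstable, the sketch below attaches a 95% Wilson score interval to two of the rule accuracies in Table 7. The Wilson interval is a standard choice introduced here purely for illustration; it is not a method described in this study, and the function name is hypothetical.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Rule 1 (27/27) and rule 5 (3/5) from Table 7.
for label, correct, total in [("Rule 1", 27, 27), ("Rule 5", 3, 5)]:
    low, high = wilson_interval(correct, total)
    print(f"{label}: {correct}/{total}, 95% interval {100 * low:.0f}% to {100 * high:.0f}%")
```

A rule supported by only 5 patients is compatible with a very wide range of underlying accuracies, which is consistent with the need for more data before probabilistic outputs can be reported.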

Conclusion

This study summarizes initial experiences using an ML approach to integrate clinical and sonographic data to model cytologic outcomes and to integrate clinical, sonographic, and cytologic data to model likelihood of malignancy. In routine practice, clinicians integrate these data to identify those patients most appropriate for biopsy and/or surgery; however, this “gestalt” approach is limited by the experience of the clinician and the completeness of the available data. While the present retrospective study did not generate a model adequate to replace existing practice, largely because of the limited size of the cohort with sufficient data points, the improved accuracy and AUROC for identifying biopsy and surgical candidates are encouraging. This serves as a proof of principle that systematic data ascertainment and mathematical modeling may eventually facilitate the development of a powerful tool to help guide clinical decision-making and to avoid unnecessary interventions. Expansion of the training data sets using additional pediatric and adult data may accomplish this goal.

18 in total

Review 1. Management Guidelines for Children with Thyroid Nodules and Differentiated Thyroid Cancer.

Authors: Gary L Francis; Steven G Waguespack; Andrew J Bauer; Peter Angelos; Salvatore Benvenga; Janete M Cerutti; Catherine A Dinauer; Jill Hamilton; Ian D Hay; Markus Luster; Marguerite T Parisi; Marianna Rachmiel; Geoffrey B Thompson; Shunichi Yamashita
Journal: Thyroid Date: 2015-07 Impact factor: 6.568

2. Increase in the incidence of differentiated thyroid carcinoma in children, adolescents, and young adults: a population-based study.

Authors: Lucas Bonachi Vergamini; A Lindsay Frazier; Fernanda Laurinavicius Abrantes; Karina Braga Ribeiro; Carlos Rodriguez-Galindo
Journal: J Pediatr Date: 2014-03-12 Impact factor: 4.406

3. Pediatric thyroid FNA biopsy: Outcomes and impact on management over 24 years at a tertiary care center.

Authors: Elmira Amirazodi; Evan J Propst; Catherine T Chung; Dimitri A Parra; Jonathan D Wasserman
Journal: Cancer Cytopathol Date: 2016-07-14 Impact factor: 5.284

4. Utility of adult-based ultrasound malignancy risk stratifications in pediatric thyroid nodules.

Authors: Claudia Martinez-Rios; Alan Daneman; Lydia Bajno; Danielle C M van der Kaay; Rahim Moineddin; Jonathan D Wasserman
Journal: Pediatr Radiol Date: 2017-10-05

Review 5. Machine Learning in Medical Imaging.

Authors: Maryellen L Giger
Journal: J Am Coll Radiol Date: 2018-02-02 Impact factor: 5.532