Giulia Brigante, Clara Lazzaretti, Elia Paradiso, Federico Nuzzo, Martina Sitti, Frank Tüttelmann, Gabriele Moretti, Roberto Silvestri, Federica Gemignani, Asta Försti, Kari Hemminki, Rossella Elisei, Cristina Romei, Eric Adriano Zizzi, Marco Agostino Deriu, Manuela Simoni, Stefano Landi, Livio Casarini.
Abstract
To identify a distinctive genetic combination predisposing to differentiated thyroid carcinoma (DTC), we selected a set of single nucleotide polymorphisms (SNPs) associated with DTC risk, using a polygenic risk score (PRS), Bayesian statistics and a machine learning (ML) classifier to describe cases and controls in three different datasets. Dataset 1 (649 DTC, 431 controls) had been previously genotyped in a genome-wide association study (GWAS) on Italian DTC. Dataset 2 (234 DTC, 101 controls) and dataset 3 (404 DTC, 392 controls) were genotyped ad hoc. Associations of 171 SNPs reported to predispose to DTC in candidate studies were extracted from the GWAS of dataset 1, followed by replication in dataset 2 of the SNPs associated with DTC risk (P < 0.05). The reliability of the identified SNPs was confirmed by PRS and Bayesian statistics after merging the three datasets. The SNPs were then used to describe the case/control state of individuals with an ML classifier. Starting from 171 SNPs associated with DTC, 15 were positive in both datasets 1 and 2. Using these markers, PRS revealed that individuals in the fifth quintile had an approximately seven-fold higher risk of DTC than those in the first. Bayesian inference confirmed that the selected 15 SNPs differentiate cases from controls. Results were corroborated by ML, which reached a maximum AUC of about 0.7. A restricted selection of only 15 DTC-associated SNPs is able to describe the inner genetic structure of Italian individuals, and ML allows a fair prediction of case or control status based solely on the individual genetic background.
Keywords: differentiated thyroid cancer; machine learning; single nucleotide polymorphism
Year: 2022 PMID: 35976137 PMCID: PMC9513665 DOI: 10.1530/ETJ-22-0058
Source DB: PubMed Journal: Eur Thyroid J ISSN: 2235-0640
Figure 1. Datasets and project pipeline. (A) Summary of dataset composition, highlighting the progressive refinement of the SNP selection process. Dataset 1 SNPs were extracted from a GWAS (12), while datasets 2 and 3 were genotyped ad hoc for potentially informative SNPs. The 34 SNPs significantly associated with DTC in dataset 1 were genotyped ad hoc and checked for relevance in the independent dataset 2. Then, the 15 SNPs associated with DTC in both datasets 1 and 2 were further genotyped ad hoc in the independent dataset 3. (B) Procedure for statistical SNP discovery and subsequent ML implementation. After SNP selection, we tested the capability of the 15 selected SNPs to provide a DTC genetic signature in the merged datasets 1, 2 and 3 with Bayesian statistics for population genetics. Then, ML methods were run to confirm the case/control state of individuals using the selected 15 SNPs as input variables. An extended dataset was built by merging the two largest datasets (1 and 3) and randomly splitting it into ‘training’ data (80% of the merged dataset) and ‘testing’ data (20%). After finding the most effective ML algorithms, a validation analysis was performed on dataset 2. Shaded colours highlight the datasets involved. Yellow = dataset 1; green = dataset 2; light-blue = dataset 3.
Figure 2. SNP selection. (A) Criteria used for SNP selection. (B) SNP subsets. Among the 171 SNPs selected from the literature (Supplementary Table 1), only 34 were associated with DTC in dataset 1 (P < 0.05; Supplementary Table 2) and genotyped in dataset 2. Fifteen SNPs were finally selected from dataset 2 as variables for ML analysis and genotyped in dataset 3 (bold). Panels A and B have matched colours and letters.
Summary of training, testing and validation ML datasets.
| ML dataset | Origin of datasetᵃ | No. of individuals | Cases (%) | Controls (%) |
|---|---|---|---|---|
| Training | 80% Datasets 1 + 3 | 1086 | 58.2 | 41.8 |
| Testing | 20% Datasets 1 + 3 | 272 | 62.1 | 37.9 |
| Validation | 100% Dataset 2 | 201 | 65.7 | 34.3 |
ᵃ Summary of datasets after removing individuals with missing genotype data (% cases; % controls): dataset 1 = 949 (59.5; 40.5); dataset 2 = 201 (65.7; 34.3); dataset 3 = 409 (57.7; 42.3).
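The 80%/20% partition summarised above can be sketched in a few lines. This is only an illustration of a random split, assuming a simple shuffle with a fixed seed; the function name `split_train_test`, the seed and the placeholder individual IDs are hypothetical and are not taken from the study.

```python
import random

def split_train_test(individuals, test_fraction=0.2, seed=0):
    """Shuffle the individuals and return (training, testing) lists."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder IDs standing in for individuals with complete genotypes
# (949 in dataset 1 and 409 in dataset 3, as in the table footnote).
dataset_1 = [("dataset1", i) for i in range(949)]
dataset_3 = [("dataset3", i) for i in range(409)]

merged = dataset_1 + dataset_3
train, test = split_train_test(merged)
print(len(train), len(test))  # 1086 272
```

With 1358 merged individuals, a 20% test fraction yields 272 testing and 1086 training individuals, matching the counts in the table. Dataset 2 is kept out of this split entirely so that it can serve as an independent validation set.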
Characteristics of study population.
| Characteristic | Dataset 1 cases (n = 649) | Dataset 1 controls (n = 431) | Dataset 2 cases (n = 234) | Dataset 2 controls (n = 101) | Dataset 3 cases (n = 404) | Dataset 3 controls (n = 392) |
|---|---|---|---|---|---|---|
| Females (%) | 507 (78%) | 320 (74%) | 167 (71%) | 61 (60%) | 287 (71%) | 243 (62%) |
| Age (years) | 37.8 ± 0.85 | 46.8 ± 0.97 | 49.7 ± 14.0 | 43.7 ± 11.4 | 44.8 ± 12.7 | 43.8 ± 9.6 |
| Weight (kg) | 71.0 ± 1.31 | 70.4 ± 1.37 | 74.5 ± 15.0 | 71.3 ± 16.1 | 79.3 ± 17.9 | 69.2 ± 14.5 |
| BMI (kg/m²) | 25.3 ± 0.39 | 25.2 ± 0.38 | 26.9 ± 4.9 | 26.1 ± 5.0 | 27.5 ± 4.7 | 23.9 ± 3.7 |
Values are expressed as number and percentages (%) or average and standard error.
List of the 15 SNPs associated with DTC in datasets 1 and 2.
| SNP ID | Genomic location | Gene description |
|---|---|---|
| rs965513 | chr9:97793827 | Papillary thyroid carcinoma susceptibility candidate 2/Forkhead box E1 |
| rs3758249 | chr9:97851858 | Papillary thyroid carcinoma susceptibility candidate 2/Forkhead box E1 |
| rs7048394 | chr9:97843151 | Papillary thyroid carcinoma susceptibility candidate 2/Forkhead box E1 |
| rs944289 | chr14:36180040 | Papillary thyroid carcinoma susceptibility candidate 3 |
| rs6759952 | chr2:217406996 | Disrupted in renal carcinoma 3 |
| rs966423 | chr2:217445617 | Disrupted in renal carcinoma 3 |
| rs1203952 | chr20:22633494 | Forkhead box A2 |
| rs10238549 | chr7:110540965 | Inner mitochondrial membrane peptidase subunit 2 |
| rs7800391 | chr7:110568186 | Inner mitochondrial membrane peptidase subunit 2 |
| rs1799814 | chr15:74720646 | Aryl hydrocarbon hydroxylase |
| rs7617304 | chr3:158745312 | Retinoic acid receptor responder 1 |
| rs4808708 | chr19:17890877 | Solute carrier family 5 member 5 |
| rs10781500 | chr9:136374886 | Caspase recruitment domain-containing protein 9 |
| rs1061758 | chr9:34652333 | Interleukin 11 receptor subunit alpha |
| rs10877887 | chr12:62603400 | Long intergenic non-protein coding RNA 1465/microRNA Let-7i |
Odds ratio estimates for quintiles of the 15-SNP PRS. DTC status in the three merged datasets was considered, using the bottom quintile (0–20%) as the reference group. The multivariate logistic regression model adjusted the ORs for age, BMI and gender. wPRS, weighted polygenic risk score; PRS, unweighted polygenic risk score.
| Quintile | wPRS ORadj | wPRS 95% CI | wPRS P value | PRS ORadj | PRS 95% CI | PRS P value |
|---|---|---|---|---|---|---|
| I | Reference | | | Reference | | |
| II | 2.12 | 1.55–2.91 | 2.92 × 10⁻⁶ | 1.43 | 1.04–1.97 | 0.0282 |
| III | 2.52 | 1.84–3.44 | 7.02 × 10⁻⁹ | 2.55 | 1.90–3.40 | 2.87 × 10⁻¹⁰ |
| IV | 3.15 | 2.30–4.32 | 9.65 × 10⁻¹³ | 3.04 | 2.26–4.09 | 2.02 × 10⁻¹³ |
| V | 6.87 | 4.90–9.64 | 6.12 × 10⁻²⁹ | 5.84 | 4.18–8.15 | 3.75 × 10⁻²⁵ |
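As a rough illustration of how scores of this kind are commonly computed, the sketch below builds an unweighted PRS (count of risk alleles) and a weighted PRS (counts multiplied by per-SNP log odds ratios), assigns rank-based quintiles, and derives a crude odds ratio against the first quintile. It is not the study's analysis: the ORs in the table come from a multivariate logistic regression adjusted for age, BMI and gender, whereas this sketch is unadjusted, and all function names, weights and counts are hypothetical.

```python
import math

def prs(risk_allele_counts, log_odds=None):
    """Score one individual.

    risk_allele_counts: dict SNP ID -> number of risk alleles (0, 1 or 2).
    log_odds: optional dict SNP ID -> ln(OR); if given, the weighted score is returned.
    """
    if log_odds is None:
        return sum(risk_allele_counts.values())
    return sum(log_odds[snp] * n for snp, n in risk_allele_counts.items())

def quintile_labels(scores):
    """Assign quintiles 1..5 by rank (ties are broken by input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(scores) + 1
    return labels

def odds_ratio_with_ci(cases_q, controls_q, cases_ref, controls_ref, z=1.96):
    """Crude (unadjusted) OR versus the reference quintile, with a Wald 95% CI."""
    odds_ratio = (cases_q * controls_ref) / (controls_q * cases_ref)
    se = math.sqrt(1 / cases_q + 1 / controls_q + 1 / cases_ref + 1 / controls_ref)
    low = math.exp(math.log(odds_ratio) - z * se)
    high = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, low, high

# Hypothetical genotype and weights, for illustration only.
genotype = {"rs965513": 2, "rs944289": 1}
weights = {"rs965513": 0.5, "rs944289": 0.2}
print(prs(genotype), prs(genotype, weights))
print(quintile_labels([0.1, 0.9, 0.4, 0.7, 0.2, 1.3, 0.8, 0.3, 1.1, 0.6]))
print(odds_ratio_with_ci(60, 40, 30, 70))
```

Reproducing the adjusted ORs would require fitting a logistic regression with the quintile indicator and the covariates, which is beyond this sketch.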
Figure 3. DTC-associated genetic structure of cases and healthy controls. The bar plot was calculated by the STRUCTURE software in the merged datasets 1, 2 and 3. Each individual is represented by a vertical line, in which colours indicate the contribution of each of the k = 5 components to the individual genetic background. Cases and controls were ordered for graphical reasons, showing different genetic profiles at a glance, although indicating a certain degree of admixture.
Classification metrics of AdaBoost classifier on all datasets.
| Metric | Training set | Test set | Validation set |
|---|---|---|---|
| NPV | 0.65 | 0.56 | 0.52 |
| PPV | 0.64 | 0.66 | 0.70 |
| Sensitivity | 0.88 | 0.87 | 0.85 |
| Specificity | 0.29 | 0.27 | 0.32 |
| Accuracy | 64% | 64% | 67% |
| F1-score | 0.74 | 0.75 | 0.77 |
| F0.5-score | 0.67 | 0.70 | 0.73 |
| F2-score | 0.82 | 0.82 | 0.82 |
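Fβ scores are defined as: $F_\beta = (1+\beta^2)\,\dfrac{\mathrm{PPV}\cdot\mathrm{Sensitivity}}{\beta^2\cdot\mathrm{PPV}+\mathrm{Sensitivity}}$.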
NPV, negative predictive value = TN/(TN+FN); PPV, positive predictive value = TP/(TP+FP).
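The metrics in the table can be recomputed from a confusion matrix, and the Fβ scores from PPV and sensitivity. The sketch below assumes the standard Fβ definition, with PPV as precision and sensitivity as recall; the function names and the confusion-matrix counts are hypothetical, while the PPV and sensitivity passed to `f_beta` are the rounded validation-set values from the table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Basic metrics from confusion-matrix counts."""
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "Accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

def f_beta(ppv, sensitivity, beta):
    """Standard F-beta score: beta > 1 favours sensitivity, beta < 1 favours PPV."""
    b2 = beta ** 2
    return (1 + b2) * ppv * sensitivity / (b2 * ppv + sensitivity)

# Hypothetical counts, for illustration only.
print(classification_metrics(tp=8, fp=2, tn=6, fn=4))

# Validation-set PPV (0.70) and sensitivity (0.85) as reported in the table.
for beta in (1, 0.5, 2):
    print(beta, round(f_beta(0.70, 0.85, beta), 2))
```

With the rounded validation-set inputs this gives 0.77, 0.73 and 0.82 for F1, F0.5 and F2, in agreement with the table. Small discrepancies can appear for the other columns because the published PPV and sensitivity are themselves rounded.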
Figure 4. Results from the ML-based DTC prediction and SNP relative importance. (A) ROC curves obtained on all datasets with the AdaBoost model. The dashed line represents random choice. (B) Relative feature importance of all variables (SNPs) in the AdaBoost model. Data are normalized to the most important feature. Suffix ‘_2’ indicates the second allele. Feature importance is calculated as an average over the individual classifiers used for probability calibration.