Joan Gil1,2,3, Montserrat Marques-Pamies4, Miguel Sampedro3,5, Susan M Webb2,3, Guillermo Serra6, Isabel Salinas4, Alberto Blanco7, Elena Valassi2,3,4, Cristina Carrato8, Antonio Picó3,9,10, Araceli García-Martínez3,9, Luciana Martel-Duguech2, Teresa Sardon11, Andreu Simó-Servat12, Betina Biagetti13, Carles Villabona14, Rosa Cámara15, Carmen Fajardo-Montañana16, Cristina Álvarez-Escolá17, Cristina Lamas18, Clara V Alvarez19, Ignacio Bernabéu20, Mónica Marazuela3,5, Mireia Jordà21, Manel Puig-Domingo22,23,24,25.
Abstract
Predicting which acromegaly patients could benefit from somatostatin receptor ligands (SRL) is essential for personalized medicine. Although many biomarkers linked to SRL response have been identified, there is no consensus criterion on how to assign this pharmacologic treatment according to biomarker levels. Our aim is to provide better predictive tools for accurately stratifying acromegaly patients by their ability to respond to SRL. We took advantage of a multicenter study of 71 acromegaly patients and used advanced mathematical modelling to predict SRL response by combining molecular and clinical information. Different models of patient stratification were obtained, with much higher accuracy when the studied cohort was fragmented according to relevant clinical characteristics. Considering all the models, we propose a patient stratification based on the extrasellar growth of the tumor, sex, age and the expression of E-cadherin, GHRL, IN1-GHRL, DRD2, SSTR5 and PEBP1, with accuracies ranging from 71% to 95%. In conclusion, data mining could be very useful for implementing personalized medicine in acromegaly through interdisciplinary work between computer science, mathematics, biology and medicine. This new methodology opens the door to more precise and personalized medicine for acromegaly patients.
Year: 2022 PMID: 35643771 PMCID: PMC9148300 DOI: 10.1038/s41598-022-12955-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Biomarker data mining analysis procedure. First, a Data Cleaning process was performed to eliminate outliers, uninformative variables, missing values, and duplicate variables. Next, the cleaned data set was used to train the model in the Data Mining process, which is subdivided into different mathematical sub-processes: Feature Normalization, Feature Selection, Feature Transformation, Feature Extraction, Ensemble Classifier, Base Classifier, Backward Feature Removal and Validation. Feature Normalization guarantees that the values of all variables are in the same range. Feature Selection picks the input variables that show the strongest relationship with the outcome. Feature Transformation consists of the mathematical transformations of the input data required by the Base Classifiers. It was not necessary to apply Feature Extraction to reduce the number of random variables. Different algorithms generated different Base Classifiers with good performance, and Ensemble Classifiers were able to improve on the performance of the Base Classifiers. Finally, the Validation process, which estimates the accuracy of the predictive model, was performed on the original database by two methods: 10-fold cross-validation (K-Fold) and leave-one-out.
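The validation step in Figure 1 can be illustrated with a minimal sketch. It assumes a nearest-centroid classifier as a stand-in for the Base Classifiers (it is not one of the algorithms named in the source), and all function names and the toy cohort are hypothetical. The point it shows is that normalization and model fitting are repeated on each training split, so the held-out samples never influence the model that scores them.

```python
import math
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)
Model = Tuple[List[float], List[float], Dict[int, List[float]]]


def fit_nearest_centroid(train: Sequence[Sample]) -> Model:
    """Normalize features on the training rows only, then store one centroid per class."""
    columns = list(zip(*(x for x, _ in train)))
    means = [sum(c) / len(c) for c in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    by_class: Dict[int, List[List[float]]] = {}
    for x, y in train:
        by_class.setdefault(y, []).append([(v - m) / s for v, m, s in zip(x, means, stds)])
    centroids = {y: [sum(c) / len(c) for c in zip(*rows)] for y, rows in by_class.items()}
    return means, stds, centroids


def predict(model: Model, x: Sequence[float]) -> int:
    """Assign the class whose centroid is closest in normalized feature space."""
    means, stds, centroids = model
    z = [(v - m) / s for v, m, s in zip(x, means, stds)]
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(z, centroids[y])))


def k_fold_accuracy(data: Sequence[Sample], k: int) -> float:
    """Hold out each of k interleaved folds once and pool the correct predictions."""
    folds = [list(data[i::k]) for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = fit_nearest_centroid(train)
        correct += sum(predict(model, x) == y for x, y in held_out)
    return correct / len(data)


def leave_one_out_accuracy(data: Sequence[Sample]) -> float:
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_accuracy(data, len(data))


# Invented toy cohort: two features per patient, label 1 = responder, 0 = non-responder.
TOY_COHORT: List[Sample] = [
    ([0.10, 1.0], 0), ([0.20, 1.2], 0), ([0.15, 0.9], 0), ([0.30, 1.1], 0),
    ([0.90, 2.0], 1), ([1.00, 2.2], 1), ([0.80, 1.9], 1), ([1.10, 2.1], 1),
]

print("4-fold accuracy:", k_fold_accuracy(TOY_COHORT, 4))
print("leave-one-out accuracy:", leave_one_out_accuracy(TOY_COHORT))
```

With a cohort as small as the one described in the abstract, leave-one-out uses almost all samples for training in every round, which is one plausible reason to report it alongside K-Fold.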
Figure 2. Representation of different possible models resulting from the data mining analysis in the whole cohort. (A) Sampling distribution graph representing the distribution of CR and NR patients for E-cadherin expression. When the classifier contains only one variable, we used a one-variable brute force technique. The discriminant function is a constant, determined as the threshold value that separates samples from the two groups with the best accuracy (marked by the dotted red line). (B) Sampling distribution graph in 2D representing the distribution of CR and NR patients for the expression of AIP and E-cadherin. The blue line is the mathematical function, defined by the values of the classifier, that separates NR from CR patients. As this classifier is composed of two variables, each dimension of the graph stands for one variable. The variables were selected by the Lasso method and the model was built with the multilayer perceptron (MLP) methodology. (C) Sampling distribution graph in 2D representing the distribution of CR and NR patients for the expression of SSTR2, E-cadherin and AIP. As this classifier is composed of more than two variables, each dimension of the graph stands for one of the two main components after performing a principal component analysis (PCA). The blue line is the mathematical function that separates CR from NR patients. The variables were selected by the Wilcoxon method and the model was built with the multilayer perceptron (MLP) methodology.
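The single-variable case in panel A (a constant threshold chosen for the best accuracy) can be sketched as an exhaustive search. This is a minimal illustration, not the authors' implementation: the candidate thresholds (midpoints between consecutive distinct values), the choice to test both directions, the function name and the toy values are all assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def brute_force_threshold(
    values: Sequence[float], labels: Sequence[int]
) -> Tuple[Optional[float], int, float]:
    """Return (threshold, direction, accuracy) for the best single cut-off.

    direction = 1 predicts class 1 when value > threshold,
    direction = -1 predicts class 1 when value < threshold.
    """
    distinct = sorted(set(values))
    # Assumed candidate set: midpoints between consecutive distinct values.
    candidates: List[float] = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best: Tuple[Optional[float], int, float] = (None, 1, 0.0)
    for threshold in candidates:
        for direction in (1, -1):
            predictions = [
                int(v > threshold) if direction == 1 else int(v < threshold)
                for v in values
            ]
            accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
            if accuracy > best[2]:  # keep the first threshold reaching the best accuracy
                best = (threshold, direction, accuracy)
    return best


# Invented expression values; label 1 and label 0 stand for the two response groups.
expression = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.3, 1.0]
group = [0, 0, 0, 0, 1, 1, 1, 0]

threshold, direction, accuracy = brute_force_threshold(expression, group)
print(threshold, direction, accuracy)
```

Because the threshold is tuned on the same samples it is scored on, the accuracy from this search is optimistic unless it is re-estimated inside a validation loop.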
Mathematical methods explored during the different processes included in the Data Mining strategy.
| Sub-process | Algorithm | References |
|---|---|---|
| Backward feature removal | Backward elimination | [ |
| Base classifier | Elastic net | [ |
| | K-nearest neighbors (K-NN) | [ |
| | Boosted Generalized Additive Models (B-GAM) | [ |
| | Tree | [ |
| | Support vector machine (SVM) | [ |
| | Multilayer perceptron (MLP) | [ |
| | MLP ensemble | [ |
| | Linear search | [ |
| | Linear regression | [ |
| | Quadratic | [ |
| | Random linear | [ |
| | Generalized linear model binomial | [ |
| | Ridge regression | [ |
| | Naïve Bayes | [ |
| | Lasso regression | [ |
| | Radial basis function (RBF) | [ |
| Cost function | Accuracy | [ |
| | Balanced accuracy | [ |
| | Balanced cost matrix | [ |
| | Cost matrix | [ |
| | F1 score | [ |
| | Matthews correlation coefficient (MCC) | [ |
| | Area Under Curve (AUC) | [ |
| Dimensionality reduction | Principal component analysis (PCA) | [ |
| | t-distributed Stochastic Neighbor Embedding (t-SNE) | [ |
| | Multidimensional scaling (MDS) | [ |
| | Hessian locally linear embedding (HLLE) | [ |
| | Isomap | [ |
| | Latent Dirichlet allocation (LDA) | [ |
| | Locally linear embedding (LLE) | [ |
| | Sammon projection | [ |
| | LandMark ISOMAP (L-ISOMAP) | [ |
| | Laplacian | [ |
| | Gaussian process latent variable model (GPLVM) | [ |
| | Kernel PCA | [ |
| | Independent component analysis (ICA) | [ |
| | Non-negative matrix factorization (NMF) | [ |
| | Factor analysis | [ |
| | Probabilistic principal component analysis (PPCA) | [ |
| | Local tangent space alignment (LTSA) | [ |
| Ensemble classifier | Bootstrap | [ |
| | Bootstrap respecting prevalence | [ |
| | Balanced bootstrap | [ |
| Ensemble method | Bootstrap | [ |
| | Bootstrap respecting prevalence | [ |
| | Balanced bootstrap | [ |
| Feature selection | K-nearest neighbors (K-NN) | [ |
| | Receiver operating characteristic (ROC) | [ |
| | Bhattacharyya | [ |
| | Ridge regression | [ |
| | Wilcoxon | [ |
| | Wilcoxon + correlation | [ |
| | minimum Redundancy Maximum Relevance (mRMR) Mean discretized | [ |
| | Boolean balanced three-valued logic rules | [ |
| | Sequential floating forward selection (SFFS) | [ |
| | Support vector machines recursive feature elimination (SVM-RFE) | [ |
| | Random forest | [ |
| | Chow-Liu | [ |
| | Simple regression | [ |
| | ReliefF | [ |
| | Random generalized linear model | [ |
| | One variable brute force | [ |
| | Bhattacharyya + Correlation | [ |
| | Entropy | [ |
| | Entropy + Correlation | [ |
| | Mattest | [ |
| | T-test | [ |
| | T-test + Correlation | [ |
| | minimum Redundancy Maximum Relevance (mRMR) | [ |
| | Lasso | [ |
| | Elastic net | [ |
| | Double Cross-Validation regression | [ |
| Feature transformation | Sigmoid | [ |
| | Gaussian: the value obtained after applying a Gaussian function | |
| | No value transformation | |
| | The original value multiplied by itself | |
| | The square root of the original value | |
| Multiclass classifier | Generalized coding | [ |
| | One versus all (OVA) binary classifiers applied | |
| | One versus one (OVO) binary classifiers applied | |
| Normalization | Sigmoidal mean variance | [ |
| | Trimmed mean variance | [ |
| | Mean variance | |
| | Median dispersion | |
| | Min Max: each value is divided by the difference between the maximum and the minimum value | |
| | Winsorizing mean variance | |
| Validation | Bootstrap | [ |
| | K-Fold | [ |
| | LeaveOneOut (LOO) | [ |
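Several of the cost functions listed in the table can be written directly from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function names are hypothetical, and the example counts describe an assumed classifier that labels every patient in an 18 vs 1 subgroup as the majority class. It shows why balanced accuracy and MCC are useful next to plain accuracy when groups are very unequal.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of sensitivity and specificity, so each class weighs the same."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def f1_score(tp: int, tn: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Assumed scenario: 18 positives, 1 negative, and every sample predicted positive.
counts = dict(tp=18, tn=0, fp=1, fn=0)
print("accuracy          ", round(accuracy(**counts), 3))
print("balanced accuracy ", balanced_accuracy(**counts))
print("F1 score          ", round(f1_score(**counts), 3))
print("MCC               ", mcc(**counts))
```

In this scenario plain accuracy is close to 95% even though the classifier never identifies the minority class, while balanced accuracy falls to chance level and MCC to zero.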
Clinical categorical variables related to SRL response.
| Variable | Group | CRᵃ | PRᵃ | NRᵃ | Pearson χ2 p-valueᵇ |
|---|---|---|---|---|---|
| Presurgical hypopituitarism | Yes | 42% | 15% | 55% | |
| | No | 68% | 85% | 45% | |
| Presurgical visual alterations | Yes | 13% | 27% | 19% | 0.62 |
| | No | 87% | 73% | 81% | |
| T2 signal intensity | Hypointense | 31% | 22% | 36% | 0.90 |
| | Isointense | 38% | 56% | 36% | |
| | Hyperintense | 31% | 22% | 28% | |
| T1 signal intensity | Hypointense | 61% | 40% | 53% | 0.75 |
| | Isointense | 39% | 50% | 38% | |
| | Hyperintense | 0% | 10% | 8% | |
| Gender | Male | 46% | 35% | 62% | 0.07 |
| | Female | 54% | 65% | 38% | |
| GNAS mutation | Mutated | 29% | 38% | 36% | 0.83 |
| | WT | 71% | 62% | 64% | |
| Sinus invasion | Yes | 22% | 35% | 59% | 0.05 |
| | No | 78% | 65% | 41% | |
| Extrasellar growth | Yes | 48% | 60% | 95% | |
| | No | 52% | 40% | 5% | |
ᵃ The CR, PR and NR columns (SRL response) give, within each response group, the percentage of patients with or without the clinical condition.
ᵇ Pearson χ2 p-values are shown. Statistically significant values (p-value < 0.05) are reported in bold.
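As an illustration of how a Pearson χ2 test on such a table works, the sketch below computes the statistic for a 2 x 3 contingency table of counts. The counts are read from the "Extrasellar growth" rows of the fragmented-cohort table further down (CR 11 vs 12, PR 9 vs 6, NR 19 vs 1 for Yes vs No), which agree with the 48%, 60% and 95% shown above; whether the published test used exactly these counts is an assumption. The function names are hypothetical, and the closed-form p-value is valid only for two degrees of freedom.

```python
import math
from typing import List, Sequence, Tuple


def pearson_chi2(table: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """Return the Pearson chi-square statistic and degrees of freedom."""
    row_totals: List[int] = [sum(row) for row in table]
    col_totals: List[int] = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi2_p_value_2dof(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)


# Rows: extrasellar growth Yes / No. Columns: CR, PR, NR (counts, not percentages).
EXTRASELLAR_COUNTS = [
    [11, 9, 19],
    [12, 6, 1],
]

statistic, dof = pearson_chi2(EXTRASELLAR_COUNTS)
print("chi2 =", round(statistic, 3), "dof =", dof)
print("p =", chi2_p_value_2dof(statistic))
```

For tables with other dimensions the p-value needs the general χ2 survival function, which a statistics library would normally provide.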
Clinical numerical variables showing differences between the evaluated comparisons.
| Variable | CR + PR vs NR: p-value | CR + PR vs NR: Log2FC | CR vs NR: p-value | CR vs NR: Log2FC | PR vs NR: p-value | PR vs NR: Log2FC | CR vs PR: p-value | CR vs PR: Log2FC |
|---|---|---|---|---|---|---|---|---|
| IGF1 diagnosis | 0.722 | |||||||
| IGF1 index diagnosis | 0.838 | 0.04 | ||||||
| GH diagnosis | 0.590 | 1.04 | 0.134 | 0.94 | 0.429 | 1.17 | 0.134 | |
| GH after OGTT | 0.622 | 1.27 | 0.728 | 1.29 | 0.633 | 1.25 | 0.941 | 0.03 |
| BMI diagnosis | 0.452 | 0.316 | ||||||
| Maximum diameter | 0.178 | 0.532 | 0.708 | |||||
| Age diagnosis | 0.197 | 0.14 | 0.272 | 0.13 | 0.802 | 0.276 | 0.16 | |
The clinical numerical variables that were tested: IGF1 levels measured at diagnosis in each center, IGF1 index at diagnosis, GH levels measured at diagnosis in each center, GH levels measured after a 75 g oral glucose load (OGTT), BMI (body mass index) at diagnosis, maximum tumor diameter on MRI measured in each center, and the age of the patient at diagnosis. T-test or Wilcoxon-test p-values are shown. Statistically significant values (p-value < 0.05) are reported in bold, and values with p-value < 0.1 in italic. Log2FC: Log2 fold change.
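For readers unfamiliar with the Log2FC columns, a log2 fold change can be sketched as below. The source does not state which summary statistic or which direction of the ratio was used, so taking the ratio of group means with the first group in the numerator is an assumption, and the function name and values are invented.

```python
import math
from statistics import mean
from typing import Sequence


def log2_fold_change(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """log2 of the ratio of group means (assumed definition: mean(a) / mean(b))."""
    return math.log2(mean(group_a) / mean(group_b))


# Invented measurements for two response groups.
responders = [8.0, 12.0]
non_responders = [4.0, 6.0]

# A value of 1 means the first group's mean is twice the second's; -1 means half.
print(log2_fold_change(responders, non_responders))
```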
Best classifiers in the whole cohort.
| Evaluated comparison | Panel of classifiers | ACC | p-value |
|---|---|---|---|
| CR + PR vs NR | E-cadherin | 62.61% | 0.027 |
| | | 67.26% | 0.002 |
| | | 69.95% | 0.001 |
| CR vs NR | | 69.23% | 0.006 |
| | E-cadherin | 73.08% | 0.001 |
| | | 75.00% | < 0.001 |
| | | 75.00% | < 0.001 |
| PR vs NR | | 67.87% | 0.02 |
| | | 69.68% | 0.004 |
| CR vs PR | E-cadherin | 65.84% | 0.028 |
| | | 69.68% | 0.004 |
All individual classifiers and those panels with 2 or 3 classifiers that display an improvement in accuracy are presented in this table. ACC: Accuracy.
Best classifiers in patients with or without SRL presurgical treatment, extrasellar growth, sinus invasion, biological sex and GNAS mutational status.
| Fragmenting condition | Evaluated comparison | Fragmented population Nᵃ | Best panel of classifiers | ACC | p-value |
|---|---|---|---|---|---|
| A. SRL presurgical treatment | CR + PR vs NR | No (9 vs 7) | | 88.89% | 0.003 |
| | | Yes (33 vs 19) | | 70.65% | 0.001 |
| | CR vs NR | No (6 vs 7) | Age + | 100.00% | 5.83E−04 |
| | | Yes (20 vs 19) | | 76.97% | 9.43E−04 |
| | PR vs NR | No (3 vs 7) | Not found | – | – |
| | | Yes (13 vs 19) | | 74.29% | 0.003 |
| | CR vs PR | No (6 vs 3) | | 100% | 0.012 |
| | | Yes (20 vs 13) | | 76.82% | 4.02E−04 |
| B. Extrasellar growth | CR + PR vs NR | No (18 vs 1) | Not found | – | – |
| | | Yes (20 vs 19) | | 71.32% | 0.005 |
| | CR vs NR | No (12 vs 1) | Not found | – | – |
| | | Yes (11 vs 19) | Not found | – | – |
| | PR vs NR | No (6 vs 1) | Not found | – | – |
| | | Yes (9 vs 19) | Not found | – | – |
| | CR vs PR | No (12 vs 6) | | 87.50% | 0.004 |
| | | Yes (11 vs 9) | | 79.80% | 0.012 |
| C. Sinus invasion | CR + PR vs NR | No (26 vs 7) | Not found | – | – |
| | | Yes (12 vs 10) | | 77.50% | 0.015 |
| | CR vs NR | No (18 vs 7) | | 81.75% | 0.007 |
| | | Yes (5 vs 10) | | 85.00% | 0.017 |
| | PR vs NR | No (8 vs 7) | Ki-67 + | 85.71% | 0.007 |
| | | Yes (7 vs 10) | Not found | – | – |
| | CR vs PR | No (18 vs 8) | | 86.61% | 0.009 |
| | | Yes (5 vs 7) | Not found | – | – |
| D. Gender | CR + PR vs NR | Female (25 vs 10) | | 73.78% | 0.007 |
| | | Male (18 vs 16) | Age + E-cadherin | 80.83% | 0.001 |
| | CR vs NR | Female (14 vs 10) | | 79.76% | 0.005 |
| | | Male (12 vs 16) | Age + | 85.45% | 4.91E−04 |
| | PR vs NR | Female (11 vs 10) | Not found | – | – |
| | | Male (6 vs 16) | | 85.35% | 0.003 |
| | CR vs PR | Female (14 vs 11) | | 74.68% | 0.016 |
| | | Male (12 vs 6) | | 80.00% | 0.018 |
| E. | CR + PR vs NR | WT (19 vs 14) | | 77.07% | 0.003 |
| | | Mutated (10 vs 5) | Not found | – | – |
| | CR vs NR | WT (10 vs 14) | Not found | – | – |
| | | Mutated (5 vs 5) | | 90.00% | 0.024 |
| | PR vs NR | WT (9 vs 14) | | 72.22% | 0.014 |
| | | Mutated (5 vs 5) | Not found | – | – |
| | CR vs PR | WT (10 vs 9) | | 84.44% | 0.004 |
| | | Mutated (5 vs 5) | Not found | – | – |
| F. Hypointense T2 signal | CR + PR vs NR | NO HYPO (23 vs 15) | | 74.18% | 0.008 |
| | | HYPO (14 vs 8) | | 75.00% | 0.040 |
| | CR vs NR | NO HYPO (13 vs 15) | | 88.46% | 8.75E−05 |
| | | HYPO (9 vs 8) | E-cadherin | 87.50% | 0.003 |
| | PR vs NR | NO HYPO (10 vs 15) | Age + | 76.79% | 0.022 |
| | | HYPO (5 vs 8) | Not found | – | – |
| | CR vs PR | NO HYPO (10 vs 9) | | 85.04% | 0.001 |
| | | HYPO (5 vs 5) | Not found | – | – |
For each subgroup, the best panel(s) of classifiers in each comparison are shown, limited to panels whose accuracy is higher than the maximum achieved by classifiers on the whole cohort without fragmentation. ᵃ The groups in the third column refer to the condition in the first column. ACC: accuracy.
Figure 3. Best therapeutic decision-tree algorithms based on mathematical modelling. (A) Decision tree to determine the first-line drug for a given acromegaly patient based on extrasellar tumor growth and molecular information. A patient without extrasellar growth is automatically classified as CR/PR without any molecular analysis (the NR category is discarded with an accuracy of 95%). Then, by measuring the gene expression of SSTR5 and PEBP1, a clinician would be able to assign the right treatment with an accuracy of 87.5%. If the tumor has extrasellar growth, the gene expression of GHRL should be measured. If levels are < 0.008 or > 0.04, the patient is classified as NR with an accuracy of 71.3%; if levels are between 0.008 and 0.04, the patient is classified as CR/PR. Then, by measuring the gene expression of SSTR5, IN1GHRL and E-cadherin, a clinician would be able to assign the right treatment with an accuracy of 79.8%. When classifiers are composed of more than one variable (e.g. SSTR5 and PEBP1, or SSTR5, IN1GHRL and E-cadherin), the distribution of CR and PR patients is defined by a mathematical function (the blue line in the scatter plots) that separates CR from PR patients (blue and pink dots in the scatter plots, respectively). The details of the scatter plots and the mathematical models can be found in Supplementary Figures S1-S3.
(B) Decision tree exploiting molecular differences according to sex to accurately treat an acromegaly patient. If the patient is male, the expression of E-cadherin should be measured; together with age, it can classify the patient as NR with an accuracy of 80.8%. If the patient is classified as CR/PR, the expression of the short and long DRD2 isoforms should be analyzed; together with E-cadherin, it allows the right treatment to be assigned with an accuracy of 80.0%. If the patient is female, the expression of PEBP1 and GHRL should be measured, which allows the patient to be classified as NR with an accuracy of 73.8%. If the patient is classified as CR/PR, the expression of the short and long DRD2 isoforms should be analyzed; together with E-cadherin, it allows the right treatment to be assigned with an accuracy of 74.7%. The details of the scatter plots and the mathematical models can be found in Supplementary Figures S4-S7. ACC: accuracy; CR: complete responder; PR: partial responder; NR: non-responder.
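The first stage of the panel A decision tree can be written as a short function. This is a sketch of the rule as stated in the caption, not the authors' software: the names `FirstStage` and `first_stage_panel_a` are hypothetical, treating GHRL values exactly equal to 0.008 or 0.04 as "between" is an assumption, and the second-stage separating functions are not reproduced because they are given only in the Supplementary Figures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# GHRL expression cut-offs quoted in the Figure 3 caption (panel A).
GHRL_LOW = 0.008
GHRL_HIGH = 0.04


@dataclass(frozen=True)
class FirstStage:
    label: str                     # "NR" or "CR/PR"
    reported_accuracy: float       # accuracy quoted in the caption for this step
    next_markers: Tuple[str, ...]  # markers the caption says to measure next


def first_stage_panel_a(
    extrasellar_growth: bool, ghrl_expression: Optional[float] = None
) -> FirstStage:
    """Apply the first split of panel A: extrasellar growth, then GHRL expression."""
    if not extrasellar_growth:
        # No molecular analysis is needed to discard the NR category.
        return FirstStage("CR/PR", 0.95, ("SSTR5", "PEBP1"))
    if ghrl_expression is None:
        raise ValueError("GHRL expression is required when extrasellar growth is present")
    if ghrl_expression < GHRL_LOW or ghrl_expression > GHRL_HIGH:
        return FirstStage("NR", 0.713, ())
    # Assumption: the boundary values themselves count as "between".
    return FirstStage("CR/PR", 0.713, ("SSTR5", "IN1GHRL", "E-cadherin"))


print(first_stage_panel_a(False))
print(first_stage_panel_a(True, 0.02))
print(first_stage_panel_a(True, 0.10))
```

Returning the markers for the next step keeps the two-stage structure of the figure visible: the CR vs PR decision is a separate classifier that runs only for patients not labelled NR.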