Benjamin Vittrant, Mickael Leclercq, Marie-Laure Martin-Magniette, Colin Collins, Alain Bergeron, Yves Fradet, Arnaud Droit.
Abstract
Determining which treatment to provide to men with prostate cancer (PCa) is a major challenge for clinicians. Currently, clinical risk stratification for PCa is based on clinico-pathological variables such as Gleason grade, stage, and prostate-specific antigen (PSA) levels. Transcriptomic data, however, have the potential to enable more precise approaches to predicting the evolution of the disease. High-quality RNA sequencing (RNA-seq) datasets with clinical data and follow-up long enough to allow discovery of biochemical recurrence (BCR) biomarkers are small and rare. In this study, we propose a machine learning approach that is robust to batch effects and enables the discovery of highly predictive signatures despite using small datasets. Gene expression data were extracted from three RNA-seq datasets totaling 171 PCa patients. Data were re-analyzed using a single pipeline to ensure uniformity. A total of 14 classifiers were tested with various parameters to identify the best model and gene signature to predict BCR. Using a random forest model, we identified a signature composed of only three genes (JUN, HES4, PPDPF) that predicts BCR with better accuracy [74.2%, balanced error rate (BER) = 27%] than the clinico-pathological variables (69.2%, BER = 32%) currently used to predict PCa evolution. This score is in the range of studies that predicted BCR in single cohorts with a higher number of patients. We showed that it is possible to merge and analyze different small, heterogeneous datasets together to obtain a better signature than if they were analyzed individually, thus reducing the need for very large cohorts. This study demonstrates the feasibility of combining different small datasets into a larger one to identify a predictive genomic signature that would benefit PCa patients.
Keywords: RNA-seq; biochemical recurrence; machine learning; predictive signature; prostate cancer; random forest
Year: 2020 PMID: 33324443 PMCID: PMC7723980 DOI: 10.3389/fgene.2020.550894
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Pipeline workflow. The quality of the raw sequencing files is assessed, and the reads are then trimmed to remove adaptors. Patient metadata are then filtered to keep only BCR patients with long follow-up. Retained sequences are then mapped, quantified, and normalized. Finally, a machine learning approach is used to analyze the data and obtain a predictive gene expression signature and a model.
Baseline characteristics of the cohorts.
| Variable | Value | TCGA | GSE54460 | VPCC | Total |
| Patients (n) | | 52 | 96 | 23 | 171 |
| Gleason score | 5 | 0 | 1 | 3 | 4 |
| | 6 | 2 | 9 | 12 | 23 |
| | 7 | 14 | 72 | 4 | 90 |
| | ≤7 (subtotal) | 16 | 82 | 19 | 117 |
| | 8 | 9 | 9 | 1 | 19 |
| | 9 | 27 | 5 | 2 | 34 |
| | 10 | 0 | 0 | 1 | 1 |
| | NA | 0 | 0 | 0 | 0 |
| | 8–10 (subtotal) | 36 | 14 | 4 | 54 |
| | Total | 52 | 96 | 23 | 171 |
| T stage | T1C | 0 | 14 | 0 | 14 |
| | T2 | 0 | 7 | 0 | 7 |
| | T2A | 1 | 21 | 3 | 25 |
| | T2B | 2 | 10 | 0 | 12 |
| | T2C | 9 | 26 | 17 | 52 |
| | T3 | 0 | 2 | 0 | 2 |
| | T3A | 16 | 5 | 2 | 23 |
| | T3B | 24 | 9 | 1 | 34 |
| | T4 | 0 | 1 | 0 | 1 |
| | NA | 0 | 1 | 0 | 1 |
| | Total | 52 | 96 | 23 | 171 |
| BCR | NO | 14 | 54 | 5 | 73 |
| | YES | 38 | 42 | 18 | 98 |
| | Total | 52 | 96 | 23 | 171 |
| PSA | ≤10 | 31 | 64 | 21 | 116 |
| | 10–20 | 16 | 17 | 1 | 34 |
| | ≥20 | 5 | 12 | 1 | 18 |
| | NA | 0 | 3 | 0 | 3 |
| | Total | 52 | 96 | 23 | 171 |
FIGURE 2. Summary of gene expression in each dataset: (A) expression values; (B) log of the expression values.
FIGURE 3. Machine learning feature selection and model evaluation workflow.
Performance measures.
| Performance metric | Formula |
| Sensitivity | TP/(TP + FN) |
| Specificity | TN/(TN + FP) |
| Accuracy | (TP + TN)/(TP + TN + FP + FN) × 100 |
| MCC | (TP × TN − FP × FN)/√[(TP + FP)(TP + FN)(TN + FP)(TN + FN)] |
| BER | 1 − 0.5 × (Sensitivity + Specificity) |
| MMCE | mean(response ≠ truth) |
FIGURE 4. Machine learning algorithms comparison. The BER results of our 13 benchmarked algorithms are presented. The last model is a featureless control case.
FIGURE 5. Of the 400 genes tested, the best trade-off between number of genes and performance is obtained with fewer than 20 genes in our model.
Feature selection benchmark.
| No. of features | BER | MMCE | MCC | ACC | Gene name | ENSG |
| 1 | 0.40 | 0.39 | 0.20 | 0.60 | PPDPF | ENSG00000125534 |
| 2 | 0.32 | 0.30 | 0.38 | 0.69 | HES4 | ENSG00000188290 |
| 3 | 0.28 | 0.28 | 0.48 | 0.74 | JUN | ENSG00000177606 |
| 4 | 0.28 | 0.26 | 0.47 | 0.73 | GNB2 | ENSG00000172354 |
| 5 | 0.28 | 0.26 | 0.48 | 0.74 | PYROXD2 | ENSG00000119943 |
| 6 | 0.25 | 0.23 | 0.53 | 0.77 | MAP3K2 | ENSG00000169967 |
| 7 | 0.27 | 0.25 | 0.50 | 0.75 | RPL28 | ENSG00000108107 |
| 8 | 0.25 | 0.23 | 0.53 | 0.77 | DHCR24 | ENSG00000116133 |
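The table reads as cumulative: each row adds the named gene to the genes in the rows above it. As an illustration of how a small signature can be read off such a benchmark, the sketch below picks the smallest subset whose BER is within a tolerance of the best BER. The selection rule, the function name, and the tolerance value are assumptions made for this example, not the authors' stated criterion; the gene names and BER values are copied from the table.

```python
from typing import Sequence

# Genes in the order they are added, and the BER after each addition (from the table).
GENES = ["PPDPF", "HES4", "JUN", "GNB2", "PYROXD2", "MAP3K2", "RPL28", "DHCR24"]
BER = [0.40, 0.32, 0.28, 0.28, 0.28, 0.25, 0.27, 0.25]


def smallest_adequate_subset(bers: Sequence[float], tolerance: float) -> int:
    """Return the smallest number of features whose BER is within
    `tolerance` of the best BER (hypothetical selection rule)."""
    if not bers:
        raise ValueError("bers must not be empty")
    best = min(bers)
    for k, ber in enumerate(bers, start=1):
        if ber - best <= tolerance:
            return k
    return len(bers)  # not reached: the best value always satisfies the test


k = smallest_adequate_subset(BER, tolerance=0.05)  # tolerance is an assumed value
signature = GENES[:k]
```

With this assumed tolerance the rule stops at three genes; a tolerance of zero would instead require the first subset that reaches the lowest BER in the table.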
FIGURE 6. Evolution of the balanced error rate (BER) as random forest (RF) parameters are varied. Four RF hyper-parameters were tested in a grid search, each varied in turn while the others were kept at their default values. The results were then used in an Irace search to find optimal parameters. (A) ntree, number of decision trees; (B) mtry, number of variables randomly sampled as candidates at each split; (C) maxnodes, maximum number of terminal nodes; (D) nodesize, minimum number of samples allowed in a terminal node.
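The one-parameter-at-a-time sweep described in this caption can be sketched as follows. The default values, candidate grids, and the `toy_ber` scoring function are hypothetical placeholders; in practice the scoring function would train a random forest with the given parameters and return its cross-validated BER.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical defaults and candidate values, not the grid used in the study.
DEFAULTS = {"ntree": 500, "mtry": 1, "maxnodes": 32, "nodesize": 1}
GRID = {
    "ntree": [100, 500, 1000],
    "mtry": [1, 2, 3],
    "maxnodes": [8, 32, 128],
    "nodesize": [1, 5, 10],
}


def one_at_a_time_sweep(
    evaluate: Callable[[Dict[str, int]], float],
    defaults: Dict[str, int],
    grid: Dict[str, List[int]],
) -> Dict[str, List[Tuple[int, float]]]:
    """Vary each hyper-parameter in turn, keeping the others at their defaults."""
    results: Dict[str, List[Tuple[int, float]]] = {}
    for name, values in grid.items():
        results[name] = []
        for value in values:
            params = dict(defaults)
            params[name] = value
            results[name].append((value, evaluate(params)))
    return results


def toy_ber(params: Dict[str, int]) -> float:
    """Stand-in for a cross-validated BER; NOT a real model."""
    return 0.25 + 0.01 * abs(params["mtry"] - 1) + 1e-5 * abs(params["ntree"] - 500)


sweep = one_at_a_time_sweep(toy_ber, DEFAULTS, GRID)
# Value with the lowest BER for each parameter, taken independently.
best = {name: min(pairs, key=lambda p: p[1])[0] for name, pairs in sweep.items()}
```

A sweep of this kind shows the sensitivity of BER to each parameter separately but ignores interactions between parameters, which is one reason to follow it with a joint search.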
FIGURE 7. ROC curve for the three-gene model.
FIGURE 8. Log2-transformed distribution of normalized read counts for the three-gene signature in each cohort.
Comparison of model performance using clinical data, omics data, or both.
| Metric | Omics | Clinic | Omics + Clinic |
| BER | 0.27 | 0.32 | 0.28 |
| MMCE | 0.257 | 0.306 | 0.265 |
| MCC | 0.474 | 0.373 | 0.457 |
| ACC | 0.742 | 0.692 | 0.734 |
| ntree | 187 | 1402 | 667 |
| mtry | 1 | 3 | 1 |
| maxnodes | 881 | 30 | 25 |
| nodesize | 1 | 4 | 6 |
FIGURE 9. Performance obtained using leave-one-group-out validation. (A) Model trained on GSE54460 and VPCC, then tested on TCGA. (B) Model trained on TCGA and VPCC, then tested on GSE54460. (C) Model trained on GSE54460 and TCGA, then tested on VPCC. (D) Combined dataset evaluated by the subsampling method described in “Validation Strategy.”
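The cohort-wise splits in panels (A) to (C) can be sketched as below: each cohort is held out in turn and the model is trained on the other two. The function name is hypothetical, and only the index bookkeeping is shown, with no model fitting; the cohort sizes are those given in the baseline table (TCGA 52, GSE54460 96, VPCC 23).

```python
from typing import Iterator, List, Tuple


def leave_one_group_out(
    groups: List[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held-out group, training indices, test indices) for each group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# One cohort label per patient, using the cohort sizes from the baseline table.
cohorts = ["TCGA"] * 52 + ["GSE54460"] * 96 + ["VPCC"] * 23

# Number of training and test patients for each held-out cohort.
split_sizes = {
    name: (len(train), len(test))
    for name, train, test in leave_one_group_out(cohorts)
}
```

Holding out VPCC leaves 148 training patients but only 23 test patients, so the estimate for that panel rests on a much smaller test set than the other two.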