Mauro Nascimben1,2, Lia Rimondini3, Davide Corà3,4, Manolo Venturin5.
Abstract
INTRODUCTION: Bladder cancer assessment with non-invasive gene expression signatures facilitates the detection of patients at risk and the surveillance of their status, avoiding the discomfort of cystoscopy. To achieve accurate cancer estimation, analysis pipelines for gene expression data (GED) may integrate a sequence of machine learning and biostatistical techniques to model complex characteristics of pathological patterns.
Keywords: Data-driven biomarker research; Non-linear dimension reduction; Polygenic risk modeling; Tree ensemble embedding
Year: 2022 PMID: 36175974 PMCID: PMC9523990 DOI: 10.1186/s13040-022-00306-w
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 4.079
Fig. 1 Descriptive information on the patient cohort included in the dataset. Clockwise from the top left, the pie charts show lineage, the percentage of males and females, the proportion of patients dead or alive, and the tumor stage at the time of data collection
Fig. 2 The boxplots depict expression levels for the hub and seed genes before preprocessing
Fig. 3 The heatmap reports the Pearson product-moment correlation coefficients of expression levels for the hub and seed genes before preprocessing
Fig. 4 Outline of the analysis pipeline to produce complete and partial forest embeddings
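
The forest-embedding step outlined in Fig. 4 is not detailed in this excerpt. As a rough sketch only, assuming an unsupervised tree ensemble such as scikit-learn's `RandomTreesEmbedding` (an assumption, not necessarily the authors' method) and a hypothetical random matrix in place of real expression data, each sample can be encoded by the leaves it reaches in every tree:

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(0)
# Hypothetical stand-in for expression data: 50 samples x 25 genes.
X_expr = rng.normal(size=(50, 25))

# Assumed settings; the source does not state the forest size or depth.
forest_emb = RandomTreesEmbedding(n_estimators=20, max_depth=4, random_state=0)

# Sparse one-hot matrix: one active leaf indicator per tree and per sample.
leaf_codes = forest_emb.fit_transform(X_expr)

# Every sample lands in exactly one leaf of each tree.
row_sums = np.asarray(leaf_codes.sum(axis=1)).ravel()
```

The binary leaf codes produced this way are the kind of input for which set-based distances such as Hamming are commonly used in a later dimension-reduction step.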
Fig. 5 Overview of initial GED discretizations applied as preprocessing
Number of examples in each class after preprocessing
| | IIa | IIIa | IVa | IId | IIId | IVd | Total |
|---|---|---|---|---|---|---|---|
| Original dataset | 88 | 80 | 47 | 36 | 53 | 82 | 386 |
| Log-z | 69 | 72 | 87 | 87 | 82 | 77 | 474 |
| Uniform | 73 | 75 | 82 | 84 | 80 | 70 | 464 |
| Normal | 75 | 73 | 84 | 86 | 82 | 76 | 476 |
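
The three preprocessing pipelines (Log-z, Uniform, Normal) are named but not defined in this excerpt. Purely as an illustration, and assuming that "Log-z" denotes a log transform followed by per-gene z-scoring (the function name, the pseudocount, and the toy data are all hypothetical), such a transform could be sketched as:

```python
import numpy as np

def log_z(expr: np.ndarray, pseudocount: float = 1.0) -> np.ndarray:
    """Log2-transform expression values, then z-score each gene (column)."""
    logged = np.log2(expr + pseudocount)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # leave constant genes at zero instead of dividing by 0
    return (logged - mean) / std

rng = np.random.default_rng(0)
# Hypothetical toy matrix: 8 samples x 3 genes of non-negative values.
toy_expr = rng.gamma(shape=2.0, scale=50.0, size=(8, 3))
toy_z = log_z(toy_expr)
```

After the transform, every gene column has mean 0 and unit standard deviation, which puts genes with very different expression ranges on a common scale.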
t-SNE parameters
| Parameter | Abbreviation | Levels |
|---|---|---|
| Angular size for Barnes-Hut | | 8 |
| Early exaggeration | EE | 8 |
| Learning rate | LR | 14 |
| Metric for distance between instances | Metr | 9 |
| Perplexity | Perp | 11 |
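
The five parameters listed above correspond to keyword arguments of scikit-learn's `TSNE`. The snippet below is a minimal sketch under the assumption that the study's settings map onto that implementation; the parameter values are illustrative and the input is a hypothetical random matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical data: 60 samples x 25 features.
X_toy = rng.normal(size=(60, 25))

tsne = TSNE(
    n_components=2,
    method="barnes_hut",      # the angular size only applies to Barnes-Hut
    angle=0.35,               # angular size for Barnes-Hut
    early_exaggeration=20.0,  # EE
    learning_rate=17.0,       # LR
    metric="correlation",     # Metr
    perplexity=10.0,          # Perp (must be smaller than the sample count)
    init="random",
    random_state=0,
)
embedding = tsne.fit_transform(X_toy)  # shape: (n_samples, 2)
```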
UMAP parameters
| Parameter | Abbreviation | Levels |
|---|---|---|
| Learning rate | LR | 8 |
| Metric for high dimensional space distances calculation | Metr | 8 |
| Number of nearest neighbors assumed at local level | LC | 5 |
| Dispersion of points on manifold | MiD | 5 |
| Size of neighboring sample points in manifold estimation | NN | 6 |
| During optimization, ratio of negative samples per positive example | NSR | 3 |
| Negative samples penalization while optimizing in low dimension | RS | 4 |
| Ratio of fuzzy set operations to obtain global fuzzy simplicial sets | Mix | 5 |
| Spread out scale of embedded points | Sp | 5 |
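
For orientation, the nine abbreviations above can be matched to keyword arguments of the `umap-learn` package. The mapping below is my reading of the parameter descriptions, not something stated in the source, and the example row uses illustrative values; `UMAP_KWARG` and `to_umap_kwargs` are hypothetical helper names:

```python
# Abbreviation -> umap-learn keyword argument (assumed mapping).
UMAP_KWARG = {
    "LR": "learning_rate",
    "Metr": "metric",
    "LC": "local_connectivity",
    "MiD": "min_dist",
    "NN": "n_neighbors",
    "NSR": "negative_sample_rate",
    "RS": "repulsion_strength",
    "Mix": "set_op_mix_ratio",
    "Sp": "spread",
}

def to_umap_kwargs(row: dict) -> dict:
    """Translate a row keyed by table abbreviations into UMAP keyword arguments."""
    unknown = set(row) - set(UMAP_KWARG)
    if unknown:
        raise KeyError(f"unknown abbreviations: {sorted(unknown)}")
    return {UMAP_KWARG[abbr]: value for abbr, value in row.items()}

# Illustrative values only; min_dist must not exceed spread in umap-learn.
example_row = {"LR": 0.1, "LC": 1, "Metr": "hamming", "MiD": 0.2, "NN": 8,
               "NSR": 7, "RS": 2, "Mix": 0.5, "Sp": 0.25}
umap_kwargs = to_umap_kwargs(example_row)
```

With `umap-learn` installed, the result could be passed on as `umap.UMAP(n_components=2, **umap_kwargs)`.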
t-SNE summary table
| Emb. | Transf. | Angle | EE | LR | Metr. | Perp. | Clust. | Clust. param. 1ᵃ | Clust. param. 2ᵃ | Sil. | CHI | DBI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full | Log-z | 0.35 | 20 | 17 | corr | 10 | hdbscan | min cl s=50 | min s=1 | 0.761 | 2831.66 | 0.343 |
| Full | Unif | 0.35 | 12 | 100 | corr | 10 | hdbscan | min cl s=50 | min s=1 | 0.805 | 3515.45 | 0.269 |
| Full | Norm | 0.57 | 16 | 50 | cheb | 25 | birch | bf=5 | th=0.2 | 0.463 | 600.45 | 0.774 |
| Part. | Log-z | 0.57 | 20 | 200 | corr | 25 | birch | bf=54 | th=0.73 | 0.417 | 601.63 | 0.781 |
| Part. | Unif | 0.57 | 24 | 25 | cheb | 20 | birch | bf=80 | th=0.26 | 0.416 | 549.99 | 0.801 |
| Part. | Norm | 0.57 | 20 | 1000 | cheb | 25 | SC | neighbors=10 | - | 0.355 | 409.97 | 0.828 |
ᵃ MIN CL S: smallest size grouping; TH: threshold; BF: branching factor; MIN S: minimal samples
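
The Sil., CHI and DBI columns are internal clustering scores (silhouette, Calinski-Harabasz index, Davies-Bouldin index). A minimal sketch of how such scores can be obtained with scikit-learn follows; the blob data and the Birch settings are assumptions chosen for illustration, not the study's configuration:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Hypothetical 2-D points standing in for a low-dimensional embedding.
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# BF (branching factor) and TH (threshold) are the two Birch parameters.
labels = Birch(branching_factor=50, threshold=0.2, n_clusters=3).fit_predict(points)

sil = silhouette_score(points, labels)         # in [-1, 1], higher is better
chi = calinski_harabasz_score(points, labels)  # unbounded, higher is better
dbi = davies_bouldin_score(points, labels)     # >= 0, lower is better
```

The three scores point in different directions: a good partition has a high silhouette and CHI together with a low DBI.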
UMAP summary table
| Emb. | Transf. | LR | LC | Metr. | MiD | NN | NSR | RS | Mix | Sp | Clust. | Clust. param. 1ᵃ | Clust. param. 2ᵃ | Sil. | CHI | DBI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full | Log-z | 0.1 | 1 | mink | 0.2 | 8 | 7 | 2 | 0.5 | 0.25 | AP | pref=-34.3 | damp=0.714 | 0.619 | 2686.95 | 0.505 |
| Full | Unif | 10 | 1 | hamm | 0.5 | 50 | 5 | 1 | 0.25 | 4 | mb k-m | bat s=10 | - | 0.836 | 13263.93 | 0.246 |
| Full | Norm | 10 | 4 | hamm | 0.2 | 15 | 7 | 1 | 0.75 | 0.25 | SC | | - | 0.602 | 1302.76 | 0.577 |
| Part. | Log-z | 5 | 1 | hamm | 0.05 | 50 | 9 | 2 | 0.1 | 2 | birch | bf=13 | th=0.2 | 0.601 | 1151.23 | 0.495 |
| Part. | Unif | 0.1 | 1 | hamm | 0.01 | 30 | 5 | 3 | 0.1 | 0.25 | AP | pref=-34.3 | damp=0.71 | 0.729 | 3384.71 | 0.414 |
| Part. | Norm | 0.1 | 1 | hamm | 0.2 | 50 | 5 | 3 | 0.25 | 1 | birch | bf=78 | th=0.2 | 0.541 | 936.61 | 0.598 |
ᵃ PREF: preferences for each point; BAT S: size of the mini batches; DAMP: damping factor; TH: threshold; kernel coefficient of radial basis function
Fig. 6 GED full embedding generating prognostic maps using tSNE Log-z values (on the left), and Uniform UMAP transformation (on the right)
Fig. 7 The scatterplot displays the first two principal components of expression levels for the hub and seed genes before preprocessing. Total explained variance is
Fig. 8 External evaluation metrics on partially embedded data
External evaluation metrics
| Metric | t-SNE Log-z | t-SNE Unif | t-SNE Norm | UMAP Log-z | UMAP Unif | UMAP Norm |
|---|---|---|---|---|---|---|
| Fowlkes-Mallows index | -56.3 | -33.0 | -52.3 | -41.5 | -21.6 | -42.9 |
| Adjusted Rand index | -68.2 | -39.7 | -63.3 | -51.0 | -25.9 | -52.8 |
| Adjusted Mutual Information | -60.6 | -37.3 | -54.5 | -42.3 | -27.9 | -42.3 |
| Normalized Mutual Information | -59.7 | -36.7 | -53.7 | -41.6 | -27.5 | -41.6 |
| Homogeneity | -60.2 | -37.1 | -54.4 | -43.0 | -27.5 | -43.2 |
| Completeness | -59.2 | -36.2 | -53.0 | -40.2 | -27.5 | -39.9 |
| Harmonic mean (V-measure) | -59.7 | -36.7 | -53.7 | -41.6 | -27.5 | -41.6 |
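
The seven external metrics in this table all compare a clustering against reference labels, and each has a counterpart in `sklearn.metrics`. The sketch below uses hypothetical toy labels and a hypothetical helper name (`external_scores`); it only illustrates the metric definitions and does not reproduce the values above:

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    fowlkes_mallows_score,
    homogeneity_completeness_v_measure,
    normalized_mutual_info_score,
)

def external_scores(y_true, y_pred) -> dict:
    """Compute the seven external clustering metrics for one partition."""
    homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(
        y_true, y_pred
    )
    return {
        "FMI": fowlkes_mallows_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
        "AMI": adjusted_mutual_info_score(y_true, y_pred),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "Homogeneity": homogeneity,
        "Completeness": completeness,
        "V-measure": v_measure,
    }

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# Same partition with permuted cluster names: every metric is invariant to this.
perfect = external_scores(truth, [2, 2, 2, 0, 0, 0, 1, 1, 1])
# One sample per class moved to a different cluster.
noisy = external_scores(truth, [0, 0, 1, 1, 1, 2, 2, 2, 0])
```

With scikit-learn's default arithmetic averaging, normalized mutual information coincides with the V-measure, which is consistent with the identical NMI and V-measure rows in the table.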
t-SNE Parameter space exploration
| Pipeline | Available combinat. | R (reduced set) | Selected parameters (reduced set) | Regressor (reduced set) | R (all 5 parameters) | Regressor (all 5 parameters) |
|---|---|---|---|---|---|---|
| Uniform | 615 | 0.968 | | ETRᵃ | 0.961 | ETR |
| Log-z | 601 | 0.835 | Metr, Perp | Bagging | 0.814 | ETR |
| Normal | 608 | 0.884 | LR, Metr, Perp | Votingᵇ | 0.821 | ETR |
ᵃ Meta estimator fitting 100 randomized decision trees
ᵇ Averaged individual predictions of Bagging, Random Forest and Gradient Boosting regressors
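
The two footnoted regressors match scikit-learn's `ExtraTreesRegressor` with 100 trees and a `VotingRegressor` that averages Bagging, Random Forest and Gradient Boosting predictions. The sketch below fits both to a synthetic, hypothetical "parameters to quality score" problem. The R column is not defined in this excerpt, so the example reports cross-validated R² as an assumption:

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical stand-in: 200 parameter combinations x 5 encoded parameters.
params = rng.uniform(size=(200, 5))
# Synthetic quality score that depends on the first two parameters only.
quality = np.sin(3 * params[:, 0]) + params[:, 1] ** 2 + 0.05 * rng.normal(size=200)

models = {
    "ETR": ExtraTreesRegressor(n_estimators=100, random_state=0),
    "Voting": VotingRegressor([
        ("bag", BaggingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
    ]),
}

# Mean 5-fold cross-validated R^2 for each regressor.
r2 = {
    name: cross_val_score(model, params, quality, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
```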
UMAP Parameter space exploration
| Pipeline | Available combinat. | R (reduced set) | Selected parameters (reduced set) | Regressor (reduced set) | R (all 9 parameters) | Regressor (all 9 parameters) |
|---|---|---|---|---|---|---|
| Uniform | 2173 | 0.825 | LR, LC, Metr, MiD, NN, RS, Mix, Sp | HGBRᵃ | 0.820 | HGBR |
| Log-z | 2170 | 0.825 | LR, LC, Metr, MiD, NN, RS, Mix, Sp, NSR | HGBR | 0.825 | HGBR |
| Normal | 2173 | 0.810 | LR, LC, Metr, MiD, NN, RS, Mix, Sp | HGBR | 0.803 | HGBR |
ᵃ Histogram-based Gradient Boosting Regression Tree
Balanced accuracy of Random Forest and Dummy classifiers on GED preprocessed with the discretization pipelines vs. GED (accuracies are expressed as percentages)
| Pipeline | RF Bal. Acc. | Dummy Bal. Acc. |
|---|---|---|
| tSNE Uniform | ||
| tSNE Log-z | ||
| tSNE Normal | ||
| UMAP Uniform | ||
| UMAP Log-z | ||
| UMAP Normal | ||
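
The comparison in this table pairs a Random Forest with a Dummy baseline under balanced accuracy. A minimal sketch of such a comparison follows, assuming scikit-learn, 5-fold cross-validation, and a synthetic six-class dataset in place of the real GED (all of these are assumptions; `balanced_acc_percent` is a hypothetical helper):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 360 samples, 25 features, 6 classes.
X_cls, y_cls = make_classification(
    n_samples=360, n_features=25, n_informative=8, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

def balanced_acc_percent(model) -> float:
    """Mean 5-fold balanced accuracy, expressed as a percentage."""
    scores = cross_val_score(model, X_cls, y_cls, cv=5, scoring="balanced_accuracy")
    return 100.0 * scores.mean()

rf_acc = balanced_acc_percent(RandomForestClassifier(n_estimators=100, random_state=0))
# A majority-class baseline has recall 1 for one class and 0 for the others.
dummy_acc = balanced_acc_percent(DummyClassifier(strategy="most_frequent"))
```

Balanced accuracy is the mean of per-class recalls, so a majority-class baseline scores 100/6 percent on six classes regardless of class imbalance, which makes it a convenient reference level.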
Random Forest balanced accuracies during gene relevance investigation (as percentages)
| Pipeline | RF PI | RF RFECV | Dummy PI | Dummy RFECV |
|---|---|---|---|---|
| tSNE Uniform | ||||
| tSNE Log-z | ||||
| tSNE Normal | ||||
| UMAP Uniform | ||||
| UMAP Log-z | ||||
| UMAP Normal | ||||
| Average | ||||
Occurrences of the genes ranked most important by the six pipelines. The last column sums the number of times each gene was top ranked by the PI and RFECV procedures
| Gene | Type | Occur. PI top ranked | Occur. RFECV top ranked | Tot. occur. top ranked |
|---|---|---|---|---|
| KPNA2 | HUB | 6 | 6 | 12 |
| KIF11 | HUB | 6 | 6 | 12 |
| DMD | SEED | 6 | 6 | 12 |
| SLMAP | SEED | 6 | 6 | 12 |
| TAGLN | SEED | 6 | 6 | 12 |
| SH3BGR | SEED | 6 | 6 | 12 |
| CCNB1 | HUB | 6 | 6 | 12 |
| CDK1 | HUB | 6 | 6 | 12 |
| KIF20A | HUB | 5 | 6 | 11 |
| CDC20 | HUB | 6 | 5 | 11 |
| CRYAB | HUB | 6 | 5 | 11 |
| MAD2L1 | HUB | 4 | 6 | 10 |
| AURKA | HUB | 4 | 6 | 10 |
| AP2S1 | SEED | 4 | 6 | 10 |
| TUBA1C | SEED | 4 | 6 | 10 |
| TCEAL2 | SEED | 3 | 6 | 9 |
| PLAU | SEED | 2 | 6 | 8 |
| ATP2B4 | SEED | 2 | 6 | 8 |
| KIF2C | HUB | 1 | 5 | 6 |
| CASQ2 | HUB | 0 | 6 | 6 |
| TPM1 | HUB | 0 | 5 | 5 |
| CCNA2 | HUB | 0 | 3 | 3 |
| UBE2C | HUB | 0 | 3 | 3 |
| HJURP | SEED | 0 | 1 | 1 |
| SBSPON | SEED | 0 | 1 | 1 |
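
The table counts agreement between two feature-relevance procedures, PI and RFECV. As a sketch under stated assumptions, the example below uses scikit-learn's `permutation_importance` and `RFECV` on synthetic data, reading PI as permutation importance and treating a positive mean importance as "relevant"; neither the expansion of PI nor this selection rule is confirmed by the excerpt:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 "genes", 4 of them informative, 3 classes.
X_g, y_g = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_g, y_g, stratify=y_g, random_state=0)

# Permutation importance, scored on held-out data.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
pi = permutation_importance(
    forest, X_te, y_te, scoring="balanced_accuracy", n_repeats=5, random_state=0
)
pi_selected = set(np.flatnonzero(pi.importances_mean > 0))  # assumed rule

# Recursive feature elimination with cross-validation.
rfecv = RFECV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    step=1, cv=3, scoring="balanced_accuracy",
).fit(X_tr, y_tr)
rfecv_selected = set(np.flatnonzero(rfecv.support_))

# Features that both procedures retain.
agreed = sorted(pi_selected & rfecv_selected)
```

Repeating this over several pipelines and counting how often each feature appears in `pi_selected` and `rfecv_selected` would yield occurrence counts of the kind tabulated above.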
Fig. 9 Barplot of gene relevance in categorizing the prognosis of the patients (agreement between RFECV and PI methods)
Average computational times (in seconds) for each operation performed in the analysis pipeline during the complete experimental embedding
| Pipeline | Forest Emb. | Dim. Red. | Clustering | Param. Comb. |
|---|---|---|---|---|
| tSNE Uniform | | | | 615 |
| tSNE Log-z | | | | 601 |
| tSNE Normal | | | | 608 |
| UMAP Uniform | | | | 2173 |
| UMAP Log-z | | | | 2170 |
| UMAP Normal | | | | 2173 |