Ross G Murphy, Alan Gilmore, Seedevi Senevirathne, Paul G O'Reilly, Melissa LaBonte Wilson, Suneil Jain, Darragh G McArt.
Abstract
The development of gene signatures is key for delivering personalized medicine, yet only a few signatures are available for clinical use in cancer patients. Gene signature discovery tends to revolve around identifying a single signature. However, it has been shown that various highly predictive signatures can be produced from the same dataset. This study assumes that presenting the top-ranked signatures will allow greater effort to be directed at selecting gene signatures for validation on external datasets and for clinical translation. Particle swarm optimization (PSO) is an evolutionary algorithm often used as a search strategy and largely represented as binary PSO (BPSO) in this domain. BPSO, however, fails to produce succinct feature sets for complex optimization problems, which affects its overall runtime and optimization performance. Enhanced BPSO (EBPSO) was developed to overcome these shortcomings. This study therefore validates unique candidate gene signatures for different underlying biology from EBPSO on transcriptomics cohorts. EBPSO was consistently seen to be as accurate as BPSO with substantially smaller feature signatures and significantly faster runtimes. 100% accuracy was achieved in all but two of the selected data sets. Using clinical transcriptomics cohorts, EBPSO has demonstrated the ability to identify accurate, succinct, and significantly prognostic signatures that are distinct from one another. This has been proposed as a promising alternative to overcome the issues regarding traditional single gene signature generation. Interpretation of key genes within the signatures provided biological insights into the associated functions that were well correlated to their cancer type.
Keywords: Artificial Intelligence; Biomarker Discovery; Cancer; Machine Learning; Particle Swarm Optimization; Transcriptomics
Year: 2022 PMID: 36249564 PMCID: PMC9556859 DOI: 10.1016/j.csbj.2022.09.033
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Summary of the simulated and selected clinical transcriptomics cohorts to validate EBPSO and BPSO.
| Dataset name | Samples | Features | Classes | Class balance | Class separation | Informative features |
|---|---|---|---|---|---|---|
| Binary class simulated | 200 | 500 | 2 | 100 (50%) / 100 (50%) | 5 | 20 |
| Multi-class simulated | 200 | 500 | 3 | 67 (33.5%)/ 67 (33.5%) / 66 (33%) | 5 | 20 |
| DLBCL | 77 | 7129 | 2 | 58 (75%) / 19 (25%) | --- | --- |
| Breast | 31 | 54,675 | 2 | 17 (55%) / 14 (45%) | --- | --- |
| FASTMAN | 248 | 19,453 | 2 | 64 (26%) / 184 (74%) | --- | --- |
Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization; BPSO, Binary Particle Swarm Optimization; DLBCL, Diffuse Large B-Cell Lymphoma.
Fig. 1. High-level schematic showing the data flow from the full transcriptomics cohort to Flask-EBPSO. The limma R package selects the top 250 differentially expressed genes to be used in EBPSO. After each iteration of EBPSO, Flask-EBPSO visualizes the top candidate gene signatures through hierarchical clustering heatmaps and ROC curves. This is repeated until the specified number of EBPSO iterations has been reached. Abbreviations: EBPSO, enhanced binary particle swarm optimization; ROC, receiver operating characteristic.
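The search step in this data flow can be illustrated with a minimal sketch of plain binary PSO (sigmoid transfer function on the velocity), written with the Python standard library only. This is not the EBPSO algorithm and not the PySwarms implementation; the cost function `toy_cost`, the index list `informative`, and all parameter values (`w`, `c1`, `c2`, `v_max`, swarm size) are hypothetical stand-ins chosen for illustration. In a real run the cost would come from a classifier evaluated on the selected genes.

```python
import math
import random


def sigmoid(v):
    """Transfer function mapping a velocity to a bit-flip probability."""
    return 1.0 / (1.0 + math.exp(-v))


def toy_cost(mask, informative, alpha=0.9):
    """Hypothetical cost: missed informative features plus a size penalty.

    Stands in for (classification error + signature-size penalty).
    """
    n_selected = sum(mask)
    if n_selected == 0:
        return 1.0  # an empty signature is treated as the worst case
    hits = sum(mask[i] for i in informative)
    miss_rate = 1.0 - hits / len(informative)
    return alpha * miss_rate + (1.0 - alpha) * n_selected / len(mask)


def binary_pso(cost, n_features, n_particles=20, n_iter=100,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Plain binary PSO; returns (best mask, best cost, cost history)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    history = [gbest_cost]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))  # clamp velocity
                # Each bit is resampled with probability sigmoid(velocity).
                pos[i][d] = 1 if rng.random() < sigmoid(vel[i][d]) else 0
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
        history.append(gbest_cost)
    return gbest, gbest_cost, history


# Toy usage: 30 candidate features, of which the first four are informative.
informative = [0, 1, 2, 3]
best_mask, best_cost, history = binary_pso(
    lambda m: toy_cost(m, informative), n_features=30)
print(sum(best_mask), "features selected, cost", round(best_cost, 3))
```

The `history` list plays the role of the cost-history curves in panel D of the later figures: it records the best cost found so far after each iteration, so it can never increase.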
Comparison of the best candidate gene signatures produced by EBPSO and BPSO on simulated and real patient gene expression data sets. Note that runtimes for the simulated datasets and the GSE116918 FASTMAN dataset are reported as single-CPU runs, although these were run on four CPUs.
| Dataset | Statistic | EBPSO Best | EBPSO Average | EBPSO S.D. | BPSO Best | BPSO Average | BPSO S.D. |
|---|---|---|---|---|---|---|---|
| Binary class simulated | Accuracy (%) | 99.5 | 99.5 | 0 | 99.5 | 99.5 | 0 |
| | Genes | 5 | 9.7 | 2.8 | 154 | 157.1 | 2 |
| | Time (min) | 1129 | 1185 | 37.7 | 7363 | 7426 | 44.5 |
| Multi-class simulated | Accuracy (%) | 99 | 96.5 | 1.9 | 99 | 99 | 0 |
| | Genes | 77 | 56.7 | 59.8 | 155 | 159.6 | 3.5 |
| | Time (min) | 1447 | 1465 | 49 | 10202.1 | 10264.7 | 48.1 |
| DLBCL | Accuracy (%) | 100 | 100 | 0 | 100 | 99.2 | 0.7 |
| | Genes | 5 | 5.4 | 0.8 | 69 | 69.2 | 5.2 |
| | Time (min) | 68.9 | 70.5 | 1.1 | 207 | 204.8 | 5.8 |
| GSE43358 | Accuracy (%) | 100 | 100 | 0 | 100 | 100 | 0 |
| | Genes | 2 | 2.3 | 0.5 | 57 | 63.6 | 3.3 |
| | Time (min) | 17.8 | 18.2 | 0.3 | 29.2 | 29.4 | 0.3 |
| GSE116918 FASTMAN | Accuracy (%) | 85.1 | 83.8 | 0.9 | 90 | 88.8 | 0.8 |
| | Genes | 15 | 12.3 | 7 | 98 | 92.8 | 6.1 |
| | Time (min) | 1464 | 1515 | 58 | 5522 | 5451 | 87.8 |
Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization; BPSO, Binary Particle Swarm Optimization; S.D., Standard Deviation.
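To make the size and runtime differences in this comparison concrete, the short sketch below computes BPSO-to-EBPSO ratios from the "Average" columns of the table. The dictionary layout and the helper name `fold_reduction` are illustrative choices, not part of the original study; only the numbers are taken from the table. Accuracy is deliberately left out, so these ratios should be read alongside the accuracy rows (for GSE116918 FASTMAN, the BPSO average accuracy is higher).

```python
# Average values copied from the comparison table, as (EBPSO, BPSO) pairs.
averages = {
    "Binary class simulated": {"genes": (9.7, 157.1), "time_min": (1185, 7426)},
    "Multi-class simulated": {"genes": (56.7, 159.6), "time_min": (1465, 10264.7)},
    "DLBCL": {"genes": (5.4, 69.2), "time_min": (70.5, 204.8)},
    "GSE43358": {"genes": (2.3, 63.6), "time_min": (18.2, 29.4)},
    "GSE116918 FASTMAN": {"genes": (12.3, 92.8), "time_min": (1515, 5451)},
}


def fold_reduction(ebpso, bpso):
    """How many times larger the BPSO value is than the EBPSO value."""
    return bpso / ebpso


ratios = {
    name: {metric: fold_reduction(*pair) for metric, pair in metrics.items()}
    for name, metrics in averages.items()
}

for name, r in ratios.items():
    print(f"{name}: {r['genes']:.1f}x fewer genes, "
          f"{r['time_min']:.1f}x shorter runtime")
```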
Fig. 2. EBPSO and PySwarms BPSO on a simulated dataset of 200 samples, 500 features, two classes, and 20 informative features with a class separation of five. A. Hierarchical clustering of the 20 informative features towards the two classes. B. Hierarchical clustering of the candidate signature selected by EBPSO. C. Hierarchical clustering of the candidate signature selected by PySwarms BPSO. D. Cost history over 500 iterations for EBPSO (solid) and PySwarms BPSO (dashed). Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization; BPSO, Binary Particle Swarm Optimization.
Fig. 3. EBPSO and PySwarms BPSO on a simulated dataset of 200 samples, 500 features, three classes, and 20 informative features with a class separation of five. A. Hierarchical clustering of the 20 informative features towards the three classes. B. Hierarchical clustering of the candidate signature selected by EBPSO. C. Hierarchical clustering of the candidate signature selected by PySwarms BPSO. D. Cost history over 500 iterations for EBPSO (solid) and PySwarms BPSO (dashed). Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization; BPSO, Binary Particle Swarm Optimization.
Fig. 4. EBPSO and PySwarms BPSO on the DLBCL data set. A. Hierarchical clustering of eight previously identified features for best class separation towards the two classes. B. Hierarchical clustering of the candidate signature selected by EBPSO. C. Hierarchical clustering of the candidate signature selected by PySwarms BPSO. D. Cost history over 500 iterations for EBPSO (solid) and PySwarms BPSO (dashed). E. Venn diagram comparing the top three candidate signatures from a single run of EBPSO and the eight previously identified features in A. Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization; BPSO, Binary Particle Swarm Optimization; DLBCL, Diffuse Large B-Cell Lymphoma.
Summary statistics from limma for the gene features selected by the top three ranked gene signatures selected by EBPSO on the DLBCL dataset for DLBCL vs FL. Feature names in bold relate to features that have appeared in more than one of the top three ranked gene signatures from EBPSO.
| Signature rank | Feature name | Gene | LogFC | p-value | Adj(p-value) |
|---|---|---|---|---|---|
| 1st ranked | | | −1.557 | 3.436e-11 | 6.123e-08 |
| | | | 1.321 | 1.791e-09 | 8.51e-07 |
| | | | −1.724 | 6.651e-09 | 1.6e-06 |
| | | | 2.037 | 3.634e-07 | 3.365e-05 |
| | D26069_at | ACAP2 | −0.722 | 2.233e-06 | 1.373e-04 |
| | | | −1.324 | 9.046e-06 | 4.243e-04 |
| | | | 0.778 | 1.143e-05 | 4.97e-04 |
| 2nd ranked | | | −1.557 | 3.436e-11 | 6.123e-08 |
| | | | 1.321 | 1.791e-09 | 8.51e-07 |
| | | | −1.724 | 6.651e-09 | 1.6e-06 |
| | | | 2.037 | 3.634e-07 | 3.365e-05 |
| | M22960_at | CTSA | 0.75 | 1.01e-06 | 7.12e-05 |
| | | | −1.324 | 9.046e-06 | 4.243e-04 |
| | | | 0.778 | 1.143e-05 | 4.97e-04 |
| 3rd ranked | M74093_at | CCNE1 | 1.87 | 1.448e-11 | 3.44e-08 |
| | M14328_s_at | ENO1 | 0.899 | 1.191e-09 | 6.531e-07 |
| | M23323_s_at | CD3E | −0.924 | 2.203e-09 | 9.551e-07 |
| | L19437_at | TALDO1 | 0.749 | 3.311e-07 | 3.147e-05 |
| | Z35227_at | RHOH | −0.978 | 6.402e-07 | 5.015e-05 |
| | X66867_cds1_at | MAX | −1.133 | 1.451e-05 | 5.868e-04 |
| | HG4258-HT4528_at | --- | −0.967 | 2.878e-05 | 9.634e-04 |
Abbreviations: LogFC, Log Fold Change; Adj(p-value), Adjusted p-value; TRIB2, Tribbles Pseudokinase 2; HMGA1, High Mobility Group AT-hook 1; CSRP2, Cysteine and glycine Rich Protein 2; CKS1B, CDC28 protein Kinase regulatory Subunit 1B; ACAP2, ArfGAP with Coiled-coil, Ankyrin repeat and PH domains 2; ZFP36L2, ZFP36 ring finger protein Like 2; MANF, Mesencephalic Astrocyte derived Neurotrophic Factor; CTSA, Cathepsin A; CCNE1, Cyclin E1; ENO1, Enolase 1; CD3E, CD3e molecule; TALDO1, Transaldolase 1; RHOH, Ras Homolog family member H; MAX, MYC Associated factor X.
Fig. 5. EBPSO and PySwarms BPSO on GSE43358. A. Hierarchical clustering of the top ten features selected on p-value from limma. B. Hierarchical clustering of the candidate signature selected by EBPSO. C. Hierarchical clustering of the candidate signature selected by PySwarms BPSO. D. Cost history over 500 iterations for EBPSO (solid) and PySwarms BPSO (dashed). E. Venn diagram comparing the top three candidate signatures from a single run of EBPSO and the top ten features selected from limma in A. Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization; BPSO, Binary Particle Swarm Optimization.
Summary statistics from limma for the gene features selected by the top three ranked gene signatures selected by EBPSO on GSE43358 for TNBC vs HER2 breast cancer subtypes.
| Signature rank | Feature name | Gene | LogFC | p-value | Adj(p-value) |
|---|---|---|---|---|---|
| 1st ranked | 211026_s_at | MGLL | −0.994 | 2.308e-05 | 0.007 |
| | 218440_at | MCC1 | 0.827 | 1.791e-09 | 0.007 |
| | 201728_s_at | KIAA0100 | −0.921 | 3.253e-05 | 0.008 |
| 2nd ranked | 219344_at | SLC29A3 | −0.625 | 5.835e-06 | 0.003 |
| | 227279_at | TCEAL3 | −0.981 | 6.056e-06 | 0.003 |
| | 224809_x_at | TINF2 | −0.373 | 8.197e-06 | 0.004 |
| 3rd ranked | 221732_at | CANT1 | −0.94 | 1.914e-06 | 0.002 |
| | 222400_s_at | ADI1 | 0.58 | 4.098e-05 | 0.009 |
| | 223344_s_at | MS4A7 | −0.858 | 4.175e-05 | 0.009 |
Abbreviations: LogFC, Log Fold Change; Adj(p-value), Adjusted p-value; MGLL, Monoglyceride Lipase; MCC1, Methylcrotonyl-CoA Carboxylase subunit 1; SLC29A3, solute Carrier family 29 member 3; TCEAL3, Transcription Elongation factor A Like 3; TINF2, TERF1 Interacting Nuclear Factor 2; CANT1, Calcium Activated Nucleotidase 1; ADI1, Acireductone Dioxygenase 1; MS4A7, Membrane Spanning 4-domains A7.
Fig. 6. EBPSO and PySwarms BPSO on GSE116918 FASTMAN. A. Hierarchical clustering of the top ten features selected based on p-value from limma. B. Hierarchical clustering of the candidate signature selected by EBPSO. C. Hierarchical clustering of the candidate signature selected by PySwarms BPSO. D. Cost history over 500 iterations for EBPSO (solid) and PySwarms BPSO (dashed). E. Venn diagram comparing the top three candidate signatures from a single run of EBPSO and the top ten features selected from limma in A. Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization; BPSO, Binary Particle Swarm Optimization.
Summary statistics from limma for the gene features selected by the top three ranked gene signatures selected by EBPSO on GSE116918 FASTMAN for biochemical failure status. Feature names in bold relate to features that have appeared in more than one of the top three ranked gene signatures from EBPSO. Abbreviations for the official gene symbols can be seen in Appendix B.
| Signature rank | Feature name | LogFC | p-value | Adj(p-value) |
|---|---|---|---|---|
| 1st ranked | FAP | 0.599 | 4.069e-07 | 0.008 |
| | MX1 | 0.331 | 1.424e-05 | 0.046 |
| | RAB27B | −0.507 | 2.844e-05 | 0.061 |
| | RGS16 | 0.341 | 5.305e-05 | 0.069 |
| | UACA | 0.225 | 0.0003 | 0.137 |
| | ADC | 0.188 | 0.0003 | 0.137 |
| | IGFBP3 | 0.247 | 0.0004 | 0.158 |
| | AKAP7 | −0.226 | 0.0012 | 0.247 |
| | FCER1G | 0.208 | 0.0024 | 0.302 |
| | ANO10 | 0.214 | 0.0038 | 0.362 |
| | COL1A2 | 0.323 | 0.004 | 0.364 |
| | THBS1 | 0.164 | 0.0047 | 0.394 |
| | GLIPR1 | 0.193 | 0.0049 | 0.402 |
| | | −0.199 | 0.0064 | 0.419 |
| | MFSD4 | −0.209 | 0.007 | 0.432 |
| | APPBP2 | −0.149 | 0.0131 | 0.498 |
| | PDIA5 | −0.2 | 0.0138 | 0.505 |
| | OR51D1 | 0.203 | 0.0198 | 0.579 |
| 2nd ranked | RTCA | −0.274 | 0.0002 | 0.107 |
| | TLL1 | 0.218 | 0.0002 | 0.112 |
| | SIAH1 | −0.318 | 0.002 | 0.271 |
| | ZNF382 | 0.229 | 0.002 | 0.281 |
| | PPAPDC1B | −0.344 | 0.002 | 0.294 |
| | ASPN | 0.382 | 0.003 | 0.345 |
| | CHRNA2 | −0.325 | 0.003 | 0.357 |
| | LYZ | 0.455 | 0.004 | 0.362 |
| | PCED1A | 0.151 | 0.005 | 0.394 |
| | RBP7 | −0.223 | 0.008 | 0.442 |
| | DESI2 | 0.139 | 0.013 | 0.501 |
| | | −0.52 | 0.014 | 0.511 |
| | SYNPO | 0.153 | 0.019 | 0.575 |
| 3rd ranked | ORL51L1 | −0.257 | 0.0007 | 0.197 |
| | FNDC1 | 0.602 | 0.0008 | 0.217 |
| | IFI44L | 0.343 | 0.0009 | 0.217 |
| | TMEM138 | 0.139 | 0.001 | 0.236 |
| | TRPM8 | −0.482 | 0.001 | 0.236 |
| | ZNF702P | −0.376 | 0.001 | 0.245 |
| | PDCD1LG2 | 0.135 | 0.001 | 0.245 |
| | HOX19 | −0.352 | 0.002 | 0.271 |
| | CTSD | 0.323 | 0.002 | 0.274 |
| | MRPL17 | −0.292 | 0.003 | 0.319 |
| | SAMD3 | 0.332 | 0.003 | 0.341 |
| | TMSB10 | 0.283 | 0.004 | 0.362 |
| | SUSD4 | −0.24 | 0.004 | 0.362 |
| | MPEG1 | 0.256 | 0.004 | 0.371 |
| | SLFN5 | 0.207 | 0.006 | 0.412 |
| | | −0.199 | 0.006 | 0.419 |
| | PRSS27 | 0.208 | 0.008 | 0.441 |
| | | −0.52 | 0.014 | 0.511 |
| | CRIP1 | 0.136 | 0.017 | 0.558 |
| | ZNF613 | −0.216 | 0.018 | 0.567 |
| | AIFM1 | −0.139 | 0.02 | 0.581 |
| | GFM2 | −0.173 | 0.025 | 0.611 |
Abbreviations: LogFC, Log Fold Change; Adj(p-value), Adjusted p-value.
Fig. 7. Survival analysis of hierarchical clustering patient subgroups with selected gene signatures from EBPSO and PySwarms BPSO on GSE116918 FASTMAN. A. Kaplan-Meier plot of the subgroups for the top ten features selected based on p-value from limma. B. Kaplan-Meier plot of the subgroups for the candidate signature selected by EBPSO. C. Kaplan-Meier plot of the subgroups for the candidate signature selected by PySwarms BPSO. D. Forest plot of the signatures from limma, EBPSO (TopEBPSO), and PySwarms BPSO, and the top three ranked candidate signatures from a single run of EBPSO (EBPSO1-3; see Fig. 6E). Abbreviations: DEG, Differentially Expressed Gene; BCR, Biochemical Recurrence; EBPSO, Enhanced Binary Particle Swarm Optimization; BPSO, Binary Particle Swarm Optimization.
Fig. 8. EBPSO web-based analytical application with Flask. A. Upload/View Data homepage for loaded data sets to be previewed or deleted and new data sets to be loaded into the application’s file system. B. Monitor Signatures page displaying: visualization options and an interactive gene signature performance cost leaderboard (left panel); real-time visualizations (middle panel); and user input parameters, including loading historical application runs and a button for downloading the signatures from a completed application run (right panel). Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization.
Candidate gene signature selection results over ten runs of EBPSO on simulated datasets. These simulated datasets were created based on binary and multi-class classification tasks.
| Run | Binary class: Accuracy (%) | Binary class: Genes | Multi-class: Accuracy (%) | Multi-class: Genes |
|---|---|---|---|---|
| 1 | 99.5 | 8 | 97.5 | 12 |
| 2 | 99.5 | 11 | 96.5 | 57 |
| 3 | 99.5 | 16 | 84 | 50 |
| 4 | 99.5 | 8 | 93.5 | 40 |
| 5 | 99.5 | 9 | 96.5 | 25 |
| 6 | 99.5 | 5 | 99 | 218 |
| 7 | 99.5 | 10 | 99 | 77 |
| 8 | 99.5 | 11 | 95 | 29 |
| 9 | 99.5 | 10 | 97.5 | 20 |
| 10 | 99.5 | 9 | 96 | 19 |
| Average ± S.D. | 99.5 ± 0 | 9.7 ± 2.8 | 96.5 ± 1.9 | 56.7 ± 59.8 |
Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization.
Candidate gene signature selection results over ten runs of EBPSO on real patient gene expression cohorts DLBCL, GSE43358, and GSE116918 FASTMAN.
| Run | DLBCL: Accuracy (%) | DLBCL: Genes | GSE43358: Accuracy (%) | GSE43358: Genes | GSE116918 FASTMAN: Accuracy (%) | GSE116918 FASTMAN: Genes |
|---|---|---|---|---|---|---|
| 1 | 100 | 5 | 100 | 2 | 83.5 | 4 |
| 2 | 100 | 5 | 100 | 2 | 83.5 | 11 |
| 3 | 100 | 5 | 100 | 3 | 84.7 | 22 |
| 4 | 100 | 7 | 100 | 2 | 92.7 | 10 |
| 5 | 100 | 5 | 100 | 2 | 85.1 | 26 |
| 6 | 100 | 7 | 100 | 2 | 83.5 | 11 |
| 7 | 100 | 5 | 100 | 2 | 83.9 | 6 |
| 8 | 100 | 5 | 100 | 2 | 85.1 | 15 |
| 9 | 100 | 5 | 100 | 3 | 83.1 | 11 |
| 10 | 100 | 5 | 100 | 3 | 82.7 | 7 |
| Average ± S.D. | 100 ± 0 | 5.4 ± 0.8 | 100 ± 0 | 2.3 ± 0.5 | 83.8 ± 0.9 | 12.3 ± 7 |
Abbreviations: EBPSO, Enhanced Binary Particle Swarm Optimization; DLBCL, Diffuse Large B-Cell Lymphoma.
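The "Average ± S.D." row for the signature sizes can be reproduced from the ten per-run gene counts. The sketch below assumes the arithmetic mean and the sample standard deviation (n − 1 denominator), which is what the reported values appear consistent with; the variable and function names are illustrative only.

```python
from statistics import mean, stdev

# Per-run signature sizes (Genes columns) copied from the table above.
genes_per_run = {
    "DLBCL": [5, 5, 5, 7, 5, 7, 5, 5, 5, 5],
    "GSE43358": [2, 2, 3, 2, 2, 2, 2, 2, 3, 3],
    "GSE116918 FASTMAN": [4, 11, 22, 10, 26, 11, 6, 15, 11, 7],
}


def summarize(values):
    """Return (mean, sample standard deviation), each rounded to one decimal."""
    return round(mean(values), 1), round(stdev(values), 1)


summary = {name: summarize(values) for name, values in genes_per_run.items()}

for name, (avg, sd) in summary.items():
    print(f"{name}: {avg} ± {sd} genes")
```

Using the population standard deviation (`statistics.pstdev`) instead would give slightly smaller values, so the choice of estimator matters when comparing against the reported row.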
Abbreviations of the official gene symbols from Table 5.
| Signature rank | Feature name | Full gene name |
|---|---|---|
| 1st ranked | FAP | Fibroblast Activation Protein alpha |
| | MX1 | MX dynamin like GTPase 1 |
| | RAB27B | RAB27B, member RAS oncogene family |
| | RGS16 | Regulator of G protein Signaling 16 |
| | UACA | Uveal Autoantigen with Coiled-coil domains and Ankyrin repeats |
| | ADC | --- |
| | IGFBP3 | Insulin like Growth Factor Binding Protein 3 |
| | AKAP7 | A-Kinase Anchoring Protein 7 |
| | FCER1G | FC Epsilon Receptor IG |
| | ANO10 | Anoctamin 10 |
| | COL1A2 | Collagen type I Alpha 2 chain |
| | THBS1 | Thrombospondin 1 |
| | GLIPR1 | GLI Pathogenesis Related 1 |
| | MFSD4 | Major Facilitator Superfamily Domain containing 4 |
| | APPBP2 | Amyloid beta Precursor Protein Binding Protein 2 |
| | PDIA5 | Protein Disulfide Isomerase family A member 5 |
| | OR51D1 | Olfactory Receptor family 51 subfamily D member 1 |
| 2nd ranked | RTCA | RNA 3′-Terminal phosphate Cyclase |
| | TLL1 | Tolloid Like 1 |
| | SIAH1 | SIAH E3 ubiquitin protein ligase 1 |
| | ZNF382 | Zinc Finger protein 382 |
| | PPAPDC1B | Phosphatidic Acid Phosphatase type 2 Domain Containing 1B |
| | ASPN | Asporin |
| | CHRNA2 | Cholinergic Receptor Nicotinic Alpha 2 subunit |
| | LYZ | Lysozyme |
| | PCED1A | PC-Esterase Domain containing 1A |
| | RBP7 | Retinol Binding Protein 7 |
| | DESI2 | Desumoylating Isopeptidase 2 |
| | SYNPO | Synaptopodin |
| 3rd ranked | ORL51L1 | --- |
| | FNDC1 | Fibronectin type III Domain Containing 1 |
| | IFI44L | Interferon Induced protein 44 Like |
| | TMEM138 | Transmembrane protein 138 |
| | TRPM8 | Transient Receptor Potential cation channel subfamily M member 8 |
| | ZNF702P | Zinc Finger protein 702, Pseudogene |
| | PDCD1LG2 | Programmed Cell Death 1 Ligand 2 |
| | HOX19 | Homeobox-leucine zipper protein HOX19 |
| | CTSD | Cathepsin D |
| | MRPL17 | Mitochondrial Ribosomal Protein L17 |
| | SAMD3 | Sterile Alpha Motif Domain containing 3 |
| | TMSB10 | Thymosin Beta 10 |
| | SUSD4 | Sushi Domain containing 4 |
| | MPEG1 | Macrophage Expressed 1 |
| | SLFN5 | Schlafen Family member 5 |
| | PRSS27 | Serine Protease 27 |
| | CRIP1 | Cysteine Rich Protein 1 |
| | ZNF613 | Zinc Finger protein 613 |
| | AIFM1 | Apoptosis Inducing Factor Mitochondria associated 1 |
| | GFM2 | GTP dependent ribosome recycling Factor Mitochondrial 2 |