Jing Wang, Qinglong Zhang, Junshan Han, Yanpeng Zhao, Caiyun Zhao, Bowei Yan, Chong Dai, Lianlian Wu, Yuqi Wen, Yixin Zhang, Dongjin Leng, Zhongming Wang, Xiaoxi Yang, Song He, Xiaochen Bo.
Abstract
Synthetic lethality (SL) occurs between two genes when the inactivation of either gene alone has no effect on cell survival but the inactivation of both genes results in cell death. SL-based therapy has become one of the most promising targeted cancer therapies in the last decade, as PARP inhibitors have achieved great success in the clinic. The key to exploiting SL-based cancer therapy is the identification of robust SL pairs. Although many wet-lab-based methods have been developed to screen SL pairs, known SL pairs account for less than 0.1% of all potential pairs because of the large number of human gene combinations. Computational prediction methods complement wet-lab-based methods by effectively reducing the search space of SL pairs. In this paper, we review the recent applications of computational methods and commonly used databases for SL prediction. First, we introduce the concept of SL and its screening methods. Second, various SL-related data resources are summarized. Then, computational methods for SL prediction, including statistical-based methods, network-based methods, classical machine learning methods and deep learning methods, are summarized. In particular, we elaborate on the negative sampling methods applied in these models. Next, representative tools for SL prediction are introduced. Finally, the challenges and future work for SL prediction are discussed.
Keywords: computational methods; deep learning; machine learning; synthetic lethality
Year: 2022 PMID: 35352098 PMCID: PMC9116379 DOI: 10.1093/bib/bbac106
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 13.994
Figure 1. The concept of SL and SDL.
Some recent clinical trials related to SL (https://clinicaltrials.gov/ct2/home). All of the listed agents are inhibitors.
| Agent | Target gene | Mutated/overexpressed gene | Cancer type | Phase and trial identifier | First posted |
|---|---|---|---|---|---|
| Olaparib | PARP | BRCA1/2 | Platinum sensitive relapsed ovarian cancer and metastatic breast cancer | IV, NCT04330040 | 1 April 2020 |
| Niraparib | | | Advanced pancreatic adenocarcinoma | II, NCT03601923 | 26 July 2018 |
| Rucaparib | | | Metastatic and recurrent endometrial cancer | II, NCT03617679 | 6 August 2018 |
| Talazoparib | | | Leukemia | I, NCT03974217 | 4 June 2019 |
| AZD6738 | ATR | TP53 | Recurrent, persistent or progressive myelodysplastic syndrome (MDS) or chronic myelomonocytic leukemia | I, NCT03770429 | 10 December 2018 |
| BAY1895344 | ATR | ATM | Advanced solid tumors and lymphomas (ATM loss and/or ATM deleterious mutations will be included) | I, NCT03188965 | 16 June 2017 |
| SRA737 | CHK1 | CCNE1, TP53, BRCA1, BRCA2, MYC, RAD50 | Advanced solid tumors or Non-Hodgkin’s Lymphoma | I and II, NCT02797964 | 14 June 2016 |
| Prexasertib (LY2606368) | | BRCA | BRCA1/2 mutation associated breast or ovarian cancer, triple-negative breast cancer, and high grade serous ovarian cancer | II, NCT02203513 | 30 June 2014 |
| | | MYC, CCNE1, Rb, FBXW7, BRCA1, BRCA2, PALB2, RAD51C, RAD51D, ATR, ATM, CHK2 | Advanced solid tumors | II, NCT02873975 | 22 August 2016 |
| Adavosertib (AZD1775) | WEE1 | TP53 | Uterine Serous Carcinoma | II, NCT04590248 | 19 October 2020 |
| | | SETD2 | Advanced/metastatic solid tumors | II, NCT03284385 | 15 September 2017 |
| | | BRCA | Advanced refractory cancers/lymphomas/multiple myeloma | II, NCT04439227 | 19 June 2020 |
| CYC140 | PLK1 | KRAS | Advanced leukemias or Myelodysplastic syndromes | I, NCT03884829 | 21 March 2019 |
| BI 6727 | | | Advanced, nonresectable and/or metastatic solid tumor | I, NCT01145885 | 17 June 2010 |
| GSK461364 | | | Advanced solid tumor or Non-Hodgkin’s lymphoma that has relapsed or is refractory to standard therapies | I, NCT00536835 | 28 September 2007 |
| Sotorasib (AMG 510) | CD274/PD-L1 | Stage IV non-small cell lung cancer | II, NCT04933695 | 22 June 2021 | |
| AZD2014 | 4EBP1 | MYC | High-risk prostate cancer | I, NCT02064608 | 17 February 2014 |
| CC-115 | | | Advanced solid tumors, and hematologic malignancies | I, NCT01353625 | 13 May 2011 |
| AZD4573 | CDK9 | | Relapsed/refractory hematological malignancies | I, NCT03263637 | 28 August 2017 |
| TP-1287 | | | Advanced solid tumors, sarcoma | I, NCT03604783 | 27 July 2018 |
| P276-00 | | | Stage III (unresectable) or stage IV metastatic melanoma | II, NCT00835419 | 3 February 2009 |
Statistics of label databases reviewed in this paper
| Database | Methods | Description | Species and No. of SL pairs | Website | Latest update |
|---|---|---|---|---|---|
| SynLethDB V2 | DAISY, text mining, large-scale screening techniques | Comprehensive database for SL | | | 2020 |
| BioGRID V 4.4.201 | Experiments and literature mining | Genetic interactions from all major model organisms and humans | Major model organisms and humans | | 1 September 2021 |
| Syn-lethality | Manually curated SL pairs for human cancer from the literature (113); SL pairs for human cancer inferred from yeast (1114) | Integrates experimentally discovered and verified human SL gene pairs into a network | | | |
| GenomeRNAi | RNAi | Genetic interactions detected by GenomeRNAi | | | 27 November 2017 |
| DAISY | Computational prediction | Statistically inferring SL pairs | | | |
| The Cellmap | Yeast screening | Database of genetic interaction for | | | May 2016 |
| Laufer | RNAi | Combinatorial RNAi and high-throughput imaging | Human cell lines: HCT116, HeLa | | |
| Vizeacoumar | | A negative genetic interaction map in isogenic cancer cell lines | 6 isogenic cancer cell lines (KRAS, PTTG1, PTEN, MUS81, BLM) | Supporting information | |
| Shen | CRISPR screening | Combinatorial CRISPR screening | Human cell lines HeLa: 52, A549: 57, 293T: 59 | 293 T - | |
| GImap | | Combinatorial CRISPR screening | Human cell lines Jurkat: 454, K562: 1678 | | 22 July 2018 |
| Najm | | Combinatorial CRISPR screening | Human cell lines A375, HT29, OVCAR8, 786O, A549, Meljuso | | |
| Zhao | | Metabolic gene networks through combinatorial CRISPR screening | Human cell lines A549, HeLa | Supporting information | |
| GEMINI | Computational prediction | A variational Bayesian approach to identify genetic interactions from combinatorial CRISPR screening | Sensitive lethal interactions and sensitive recovery interactions for four combinatorial CRISPR studies | Supporting information | |
| Wan | | Application of GEMINI to identify genetic interactions | Human cell lines A549: 126, A375: 18, HT29: 18 | | |
| Slorth | | Predict SL pairs using an RF classifier | | | June 2019 |
| CGIdb | | Identify potential SL pairs for specific cancer types from TCGA and functional screen data | | | 2019 |
| Srivas | Drug screening | Evaluate thousands of TSG-drug combinations | Yeast: 1420, HeLa: 127 | Supporting information | 2016 |
Note: BioGRID, Biological General Repository for Interaction Datasets; DAISY, Data mining SL identification pipeline; TCGA, The Cancer Genome Atlas; TSG, tumor suppressor genes; H. sapiens, Homo sapiens; S. cerevisiae, Saccharomyces cerevisiae; D. melanogaster, Drosophila melanogaster; M. musculus, Mus musculus; C. elegans, Caenorhabditis elegans; S. pombe, Schizosaccharomyces pombe.
Statistics of feature databases reviewed in this paper
| Database | Statistics | Website | Latest update |
|---|---|---|---|
| | Gene sequence data: 233 642 893 | | 15 October 2021 |
| UniProt release 2021_03 | Protein sequence data: 219 740 215 | | 2 June 2021 |
| GO release 2021-10-26 | 43 832 GO terms; 7 827 176 annotations | | 26 October 2021 |
| KEGG Release 100.0 | Pathway maps: seven categories, 548 maps | | 1 October 2021 |
| MSigDB V7.4 | Pathway comembership | | April 2021 |
| CTD | Gene-pathway annotations: 135 789 | | 5 October 2021 |
| LINCS Data Portal 3.0 | 978 landmark genes under different perturbations | | June 2021 |
| PhyloGene | | | 2015 |
| CORUM 3.0 | Mammalian protein complexes: 4274 | | 9 March 2018 |
| STRING 11.5 | PPIs: more than 20 billion | | 12 August 2021 |
| HPRD release 9 | PPIs: 41 327 | | 13 April 2010 |
| HIPPIE v2.0 | Confidence scored and annotated PPIs: over 270 000 | | 14 February 2019 |
Note: UniProt, The Universal Protein Resource; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; MSigDB, The Molecular Signatures Database; CTD, The Comparative Toxicogenomics Database; CORUM, The comprehensive resource of mammalian protein complexes; STRING, Search Tool for the Retrieval of Interacting Genes/Proteins database; HPRD, Human Protein Reference Database; HIPPIE, Human Integrated Protein–Protein Interaction reference; PPI, protein–protein interaction.
Statistics of other related SL databases reviewed in this paper
| Database | Description | Website | Latest update |
|---|---|---|---|
| The cancer dependency map | Databases based on large-scale single gene knockout | | 19 August 2021 |
| TCGA | Cancer genomics and mutation databases | | 29 October 2021 |
| CCLE | | | 2019 |
| COSMIC v95 | | | 24 November 2021 |
| InParanoid 8 | Orthology analysis | | December 2013 |
| OrthoMCL-DB | | | 8 September 2021 |
Note: TCGA, The Cancer Genome Atlas; CCLE, Cancer Cell Line Encyclopedia; COSMIC, Catalogue of Somatic Mutations in Cancer.
Summary of SL prediction methods and representative models
| Methods and representative models | Description | Advantages | Disadvantages | Application scenarios |
|---|---|---|---|---|
| Statistical-based methods | Fit existing data based on certain hypotheses | Take a systems biology perspective; do not require known SL data | The selection of hypothesis or threshold is highly subjective and unstable | Known SL data are insufficient |
| | Identifies SL interactions in cancer through three statistical procedures in parallel | Comprehensible to biologists; mines data from clinical cancer samples | The biological data are at times noisy and inaccurate | Identification of clinically relevant SL interactions in cancer |
| Network-based methods | Study SL pairs from the perspective of biological networks | Add network structure information to gain a more comprehensive, global understanding of genes | Network data are incomplete and contain considerable noise | Known SL data are insufficient |
| | Predicts enzymatic SDLs from a genome-scale metabolic model (GSMM) | The first computational method that captures enzymatic SDL effects in metabolic networks; uncovers the mechanisms behind SDLs | Does not integrate additional data sources such as patient-specific omics data | Identifies SDLs that have a significant impact on tumors in clinical settings |
| | Rapidly identifies SL pairs in metabolic networks | Overcomes the issue of computational complexity | Does not identify human SL gene pairs | Identifies higher-order SL pairs in metabolic networks |
| Classic ML methods | Learn general patterns from a limited set of known SL data and use those patterns to make predictions about unknown or unobserved SL gene pairs | Good performance on small data sets; effectively integrate multidimensional feature data | Rely on manually generated features, which requires understanding the features that represent the data; lack negative samples | Require known SL data and feature data of high quality |
| | RF-based model to predict paralog SL pairs | Makes interpretable predictions for paralog SL pairs | Restricted to the identification of paralog SL pairs | Identifies context-specific paralog SL pairs |
| | A GRSMF model | Is data-adaptive and avoids determining the dimension of the latent space | Focuses on mapping genes to latent representations and cannot aggregate information from neighbor genes | Negative samples are insufficient |
| Deep learning methods | Use a multistep feature transformation to obtain a feature representation of the original data, which is then fed into the prediction function to obtain the final result | Discover deep features for representation learning and pattern recognition from large datasets; do not require manual feature extraction | Demand a large amount of data and computational resources; limited by the quality and quantity of the data, which contain many false positives and false negatives; hard to train; poor interpretability; lack negative samples | Require sufficient known SL data and feature data of high quality |
| | A semisupervised neural network method | Utilizes unlabeled SL data to predict cell-line-specific SL pairs; demonstrates that L1000 expression profiles are effective feature data for SL prediction | Limited sample space and cell lines | Predicts cell-line-specific SL pairs; labeled SL samples are insufficient |
| | A dual-dropout GCN method | Uses an SL dataset of better quality; aggregates information from neighbor genes | Focuses solely on known SL pairs and ignores other data sources of genes | There are sufficient SL samples of high quality and insufficient feature data |
Summary of studies involved in this review
| Category | Study | Published year | Algorithms | SL data | Feature data | Program code |
|---|---|---|---|---|---|---|
| Statistical-based methods | Li | 2011 | MLE | SGD | Domain relationships | |
| | Zhang | 2012 | MLE | SGD | Protein sequences | |
| | Conde-Pueyo | 2009 | Homologous mapping | BioGRID | Somatic mutations, GO annotation, drugs and their gene targets | |
| | Lee | 2013 | Homologous mapping | BioGRID | Homology information, gene expression information | |
| | Deshpande | 2013 | Homologous mapping | Literature | Homology information | |
| | Kirzinger | 2019 | Homologous mapping | | Gene expression data, homology information | |
| | Jerby-Arnon | 2014 | DAISY | | SCNA and mutation profiles, gene essentiality profiles, gene expression profiles | |
| | Srihari | 2015 | Statistical analysis | | Genomic copy-number and gene expression | |
| | Guo | 2016 | Statistical analysis | BioGRID | | |
| | Wang | 2019 | Statistical analysis | SynLethDB | Somatic mutation information, shRNA data, yeast genetic interactions | |
| | Lee | 2018 | ISLE | | SCNA, gene expression, mutation and survival data | |
| | Wang | 2013 | The univariate | | Gene expression | |
| | Chang | 2016 | Statistical analysis | Literature | Gene expression | |
| | Feng | 2019 | Statistical analysis | | Genomics and patient survival data | |
| | Sinha | 2017 | MiSL | | Mutation, copy number and gene expression | |
| | Yang | 2021 | SiLi | | Large-scale sequencing data | |
| Network-based methods | Kranthi | 2013 | PPI networks | | PPIs | |
| | Jacunski | 2015 | PPI networks | BioGRID | PPIs, functional annotations | |
| | Ku | 2020 | PPI networks | | PPIs, pathways | |
| | Zhang | 2015 | Signaling networks | | Signaling data | |
| | Liu | 2018 | Signaling networks | SynLethDB | PPIs | |
| | Apaolaza | 2017 | Metabolic networks | | Gene expression data | |
| | Megchelenbrink et al. | 2015 | IDLE | | The human metabolic network | |
| | Pratapa | 2015 | Fast-SL | | Genome-scale metabolic networks | |
| Classic ML methods | Paladugu | 2008 | SVM | Literature | PPI network | |
| | Wu | 2021 | k-NN | SynLethDB | Seven similarities of gene pairs (gene expression, protein sequence, PPI, copathway, GO biological process, GO cellular component and GO molecular function) | |
| | Yin | 2019 | DT | SynLethDB | Mutation, CNV and clinical data of breast cancer | |
| | Pandey | 2010 | MNMC | SGD | PPIs, functional annotations, pathways, mutant phenotype, protein phylogenetic profiles, sequence similarity of genes and proteins | |
| | Wu | 2014 | Ensemble learning | BioGRID | Semantic similarity, PPIs, sequence orthologs, co-complex membership, co-pathway membership, gene expression correlation, common/interacting domains, the number of domains | |
| | Das | 2019 | DiscoverSL (RF) | SynLethDB | Mutation, gene expression, copy number alteration, gene-pathway information | |
| | Li | 2019 | RF | Shen | GO term and KEGG pathway | |
| | Benstead-Hume | 2019 | RF | BioGRID | PPIs | |
| | De Kegel | 2021 | RF | | Shared PPIs, evolutionary conservation, etc. | |
| | Benfatto | | PARIS (RF) | | CRISPR screens with genomics and transcriptomics data | |
| | Huang | 2019 | GRSMF (Matrix factorization) | SynLethDB | GO similarity matrix | |
| | Liany | 2020 | CMF (Matrix factorization) | SynLethDB | Essentiality profile, mRNA gene expression, SCNA level, pairwise coexpression | |
| | Liu | 2020 | SL2MF (Matrix factorization) | SynLethDB | PPI similarity, GO similarity | |
| Deep learning methods | Wan | 2020 | Neural network | Shen et al. study | L1000 gene expression profiles | |
| | Cai | 2020 | GCN | SynLethDB | | |
| | Long | 2021 | GAT | SynLethDB | GO semantic similarity, PPIs | |
| | Hao | 2021 | GAE | SynLethDB | GO similarity matrix, PPIs, coexpression, mutual exclusion score, copathway | |
| | Zhang | 2021 | KG | SynLethDB | Three relationships (different cancer types and their mutant genes, drugs and targets, drugs and their indications) | |
| | Wang | 2021 | KG | SynLethDB | The relationships of genes, drugs and compounds | |
Note: SVM, support vector machine; DT, decision tree; k-NN, k-nearest neighbors; RF, random forest; GCN, graph convolutional network; GAT, graph attention network; GAE, graph autoencoder; KG, knowledge graph; MLE, maximum likelihood estimation; ISLE, identification of clinically relevant synthetic lethality; MiSL, mining synthetic lethals; SiLi, statistical inference-based synthetic lethality identification; IDLE, identifying dosage lethality effects; MNMC, multi-network and multi-classifier; PARIS, PAn-canceR Inferred Synthetic lethalities; GRSMF, graph regularized self-representative matrix factorization; CMF, collective matrix factorization; SGD, Saccharomyces Genome Database; SCNA, somatic copy number alterations.
Performance scores and validation scheme of the methods involved in this review
| Study | Algorithms | Validation scheme | AUROC | AUPRC | ACC | F1 | MCC | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|---|
| Pratapa | SVM | 10-fold cross-validation | 0.796 | |||||||
| Wu | k-NN | 10-fold cross-validation | 0.848 | 0.861 | 0.764 | 0.739 | 0.825 | 0.670 | ||
| Pandey | MNMC | 10-fold cross-validation | 0.897 | |||||||
| Wu | Ensemble learning | 5-fold cross-validation | 0.871 | |||||||
| Li | RF | 10-fold cross-validation | 0.532 | |||||||
| Benstead-Hume | RF | 5-fold cross-validation | 0.889 | |||||||
| Liu | Logistic matrix factorization | 5-fold cross-validation | 0.848 | 0.239 | ||||||
| Huang | Matrix factorization | 5-fold cross-validation | 0.923 | |||||||
| Liany | CMF | 3-fold cross-validation | 0.980 | 0.980 | ||||||
| Wan | Neural network | 5-fold cross-validation | 0.969 | 0.880 | 0.959 | 0.866 | 0.872 | 0.903 | 0.968 | |
| Cai | GCN | 5-fold cross-validation | 0.878 | 0.344 | 0.552 | |||||
| Long | GAT | 5-fold cross-validation | 0.937 | 0.948 | ||||||
| Hao | GAE | 5-fold cross-validation | 0.917 | 0.942 | 0.871 | |||||
| Wang | KG | 5-fold cross-validation | 0.947 | 0.956 | 0.887 |
Note: AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; ACC, accuracy; MCC, Matthews correlation coefficient.
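As a reading aid for the metric columns above, the following is a minimal Python sketch of how the threshold-based scores (ACC, F1, MCC, precision, sensitivity, specificity) and AUROC are conventionally computed for a binary SL/non-SL classifier. The function names and the toy labels are illustrative assumptions, not code from any of the reviewed studies; AUPRC is omitted for brevity.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = SL pair, 0 = non-SL pair)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def threshold_metrics(y_true, y_pred):
    """Return the threshold-dependent scores as a dict; undefined ratios become 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "ACC": acc, "F1": f1, "MCC": mcc, "Precision": precision,
        "Sensitivity": sensitivity, "Specificity": specificity,
    }


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half (O(n_pos * n_neg), fine for small sets)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example with made-up labels and scores (not data from the review).
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
toy_metrics = threshold_metrics(toy_true, toy_pred)
toy_auroc = auroc(toy_true, toy_scores)
```

Because known SL pairs are a small fraction of all gene pairs, accuracy alone can look high for a model that predicts almost nothing as SL, which is one reason studies also report MCC and AUPRC.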
Figure 3. Flowcharts of selected typical methods reviewed in this manuscript. (A) DAISY [13], a statistical-based method. (B) MNMC [72], an ensemble classifier. (C) SL2MF [29], a logistic matrix factorization method. (D) DDGCN [31], a GCN-based method. (E) KG4SL [80], a KG-based method.
Figure 2. Workflow of ML methods used in SL prediction. SVM, support vector machine; DT, decision tree; RF, random forest.
Figure 4. Negative sampling methods. (A) Randomly selecting unknown gene pairs as negative samples. (B) Extracting gene pairs with certain GI scores from GI databases as negative samples.
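The two negative sampling strategies named in the Figure 4 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the `seed` parameter, the gene symbols and the score interval `[low, high]` are all hypothetical, and the caption does not specify which GI score range the reviewed models actually use.

```python
import random


def sample_negative_pairs(genes, known_sl_pairs, n_samples, seed=0):
    """Strategy (A): draw gene pairs at random and keep those that are not
    known SL pairs. Pairs are unordered, so (a, b) and (b, a) are the same."""
    rng = random.Random(seed)
    gene_list = sorted(set(genes))
    gene_set = set(gene_list)
    known = {frozenset(pair) for pair in known_sl_pairs}
    # Number of unknown pairs available inside this gene universe.
    all_pairs = len(gene_list) * (len(gene_list) - 1) // 2
    known_inside = sum(1 for p in known if len(p) == 2 and p <= gene_set)
    if n_samples > all_pairs - known_inside:
        raise ValueError("not enough unknown gene pairs to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        a, b = rng.sample(gene_list, 2)  # two distinct genes
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)


def negatives_from_gi_scores(gi_scores, low, high):
    """Strategy (B): keep pairs whose genetic interaction (GI) score lies in a
    caller-chosen interval [low, high]. The interval is an assumption here."""
    return sorted(
        tuple(sorted(pair))
        for pair, score in gi_scores.items()
        if low <= score <= high
    )


# Toy inputs (illustrative only, not taken from any database in this review).
toy_genes = ["ATR", "BRCA1", "BRCA2", "PARP1", "TP53"]
toy_known = [("BRCA1", "PARP1"), ("PARP1", "BRCA2")]
toy_negatives = sample_negative_pairs(toy_genes, toy_known, n_samples=4, seed=1)
toy_gi = {("ATR", "TP53"): -0.6, ("BRCA1", "TP53"): 0.05, ("ATR", "BRCA2"): 0.3}
toy_gi_negatives = negatives_from_gi_scores(toy_gi, low=0.0, high=1.0)
```

Random sampling as in strategy (A) treats every unlabeled pair as a non-SL pair, so some sampled negatives may in fact be undiscovered SL pairs; strategy (B) trades that risk for a dependence on the coverage and scoring of the GI database.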
Tools and applications reviewed in this study
| Tool | Description | Availability | Website |
|---|---|---|---|
| G2G | Predict SL interactions based on mapping genes to GO terms | Online | |
| SPAGE-Finder | Predict SL interactions from TCGA data | Online | |
| SynLeGG | Predict SL interactions utilizing multiSEp gene expression clusters to partition CRISPR essentiality scores and mutations from whole-exome sequencing | Online | |
| SL-BioDP | Predict SL interactions from hallmark cancer pathways by mining cancer’s genomic and chemical interactions | Online | |
| DiscoverSL | R package for multiomic data-driven prediction of SL interactions in cancer | Standalone | |
| ISLE | Identify the most likely clinically relevant SL interactions by mining TCGA cohort | Standalone | |
| GEMINI | Identify SL interactions from combinatorial CRISPR experiments | Standalone | |
| Fast-SL | Identify synthetic lethal sets in metabolic networks | Standalone | |
Note: SynLeGG, Synthetic Lethality using Gene expression and Genomics; SL-BioDP, Synthetic Lethality BioDiscovery portal.