Literature DB >> 35352098

Computational methods, databases and tools for synthetic lethality prediction.

Jing Wang1, Qinglong Zhang1, Junshan Han1, Yanpeng Zhao1, Caiyun Zhao1, Bowei Yan1, Chong Dai1, Lianlian Wu1, Yuqi Wen1, Yixin Zhang1, Dongjin Leng1, Zhongming Wang1, Xiaoxi Yang1, Song He1, Xiaochen Bo1.   

Abstract

Synthetic lethality (SL) occurs between two genes when the inactivation of either gene alone has no effect on cell survival but the inactivation of both genes results in cell death. SL-based therapy has become one of the most promising targeted cancer therapies in the last decade as PARP inhibitors achieve great success in the clinic. The key point to exploiting SL-based cancer therapy is the identification of robust SL pairs. Although many wet-lab-based methods have been developed to screen SL pairs, known SL pairs are less than 0.1% of all potential pairs due to large number of human gene combinations. Computational prediction methods complement wet-lab-based methods to effectively reduce the search space of SL pairs. In this paper, we review the recent applications of computational methods and commonly used databases for SL prediction. First, we introduce the concept of SL and its screening methods. Second, various SL-related data resources are summarized. Then, computational methods including statistical-based methods, network-based methods, classical machine learning methods and deep learning methods for SL prediction are summarized. In particular, we elaborate on the negative sampling methods applied in these models. Next, representative tools for SL prediction are introduced. Finally, the challenges and future work for SL prediction are discussed.
© The Author(s) 2022. Published by Oxford University Press.

Entities:  

Keywords:  computational methods; deep learning; machine learning; synthetic lethality

Mesh:

Year:  2022        PMID: 35352098      PMCID: PMC9116379          DOI: 10.1093/bib/bbac106

Source DB:  PubMed          Journal:  Brief Bioinform        ISSN: 1467-5463            Impact factor:   13.994


Introduction

Synthetic lethality (SL) is originally defined as the setting in which abnormal expression of either of two genes alone has little effect on cell viability but abnormalities in the expression of both genes concurrently lead to cell death [1]. Basically, ‘SL’ can be categorized into two classes: (i) SL, which occurs between genes with loss-of-function mutations (gene A) and their partner gene (gene B). (ii) Synthetic dosage lethality (SDL), which occurs between the overexpressed gene (gene A) and their partner gene (gene B) [2] (Figure 1). In cancer, the application of SL has the following significances: (i) SL provides an approach for targeted therapy. Abnormalities of gene A can be regarded as cancer-specific biomarkers and pharmacological inhibition of gene B leads to the selective killing of cancer cells [3]. (ii) SL expands the space of druggable targets. SL points the way to indirect targeting the genes that are not classically ‘druggable,’ owing to their molecular structure or because they are loss of function mutations [1, 4]. poly(ADP-ribose) polymerase inhibitor (PARPi) is the first successful clinical example based on SL [1, 5–7] and SL-based therapy has been regarded as one of the most effective anticancer treatments in the last decade [8]. The encouraging results of PARPi led to an increasing amount of drug candidates focusing on SL interactions. We list some recent clinical trials related to SL interactions in Table 1 [8, 9].
Figure 1

The concept of SL and SDL.

Table 1

Some recent clinical trials related to SL (https://clinicaltrials.gov/ct2/home). All of the listed agents are inhibitors

AgentTarget geneMutate/overexpressed geneCancer type Phase and ClinicalTrials.gov identifier First posted
OlaparibPARPBRCA1/2Platinum sensitive relapsed ovarian cancer and metastatic breast cancerIV, NCT043300401 April 2020
NiraparibAdvanced pancreatic adenocarcinomaII, NCT0360192326 July 2018
RucaparibMetastatic and recurrent endometrial cancerII, NCT036176796 August 2018
TalazoparibLeukemiaI, NCT039742174 June 2019
AZD6738ATRTP53Recurrent, persistent or progressive myelodysplastic syndrome (MDS) or chronic myelomonocytic leukemiaI, NCT0377042910 December 2018
BAY1895344ATRATMAdvanced solid tumors and lymphomas (ATM loss and/or ATM deleterious mutations will be included)I, NCT0318896516 June 2017
SRA737CHK1CCNE1, TP53, BRCA1, BRCA2, MYC, RAD50Advanced solid tumors or Non-Hodgkin’s LymphomaI and II, NCT0279796414 June 2016
Prexasertib (LY2606368)BRCABRCA1/2 mutation associated breast or ovarian cancer, triple-negative breast cancer, and high grade serous ovarian cancerII, NCT0220351330 June 2014
MYC, CCNE1, Rb, FBXW7, BRCA1, BRCA2, PALB2, RAD51C, RAD51D, ATR, ATM, CHK2Advanced solid tumorsII, NCT0287397522 August 2016
Adavosertib (AZD1775)WEE1TP53Uterine Serous CarcinomaII, NCT0459024819 October 2020
SETD2Advanced/metastatic solid tumorsII, NCT0328438515 September 2017
BRCAAdvanced refractory cancers/lymphomas/multiple myelomaII, NCT0443922719 June 2020
CYC140PLK1KRASAdvanced leukemias or Myelodysplastic syndromesI, NCT0388482921 March 2019
BI 6727Advanced, nonresectable and/or metastatic solid tumorI, NCT0114588517 June 2010
GSK461364Advanced solid tumor or Non-Hodgkin’s lymphoma that has relapsed or is refractory to standard therapiesI, NCT0053683528 September 2007
Sotorasib (AMG 510)CD274/PD-L1Stage IV non-small cell lung cancerII, NCT0493369522 June 2021
AZD20144EBP1MYCHigh-risk prostate cancerI, NCT0206460817 February 2014
CC-115Advanced solid tumors, and hematologic malignanciesI, NCT0135362513 May 2011
AZD4573CDK9Relapsed/refractory hematological malignanciesI, NCT0326363728 August 2017
TP-1287Advanced solid tumors SarcomaI, NCT0360478327 July 2018
P276-00Stage III (unresectable) or stage IV metastatic melanomaII, NCT008354193 February 2009
The concept of SL and SDL. Some recent clinical trials related to SL (https://clinicaltrials.gov/ct2/home). All of the listed agents are inhibitors Despite the attractive concept of SL-based therapeutics, only PARPi has progressed to the clinic so far. A major hurdle might be the identification of clinically relevant, robust SL pairs [10]. Identifying potential SL gene pairs is mainly achieved by two methods: laboratory-based methods and computational-based methods. The most common laboratory-based methods include yeast screening, drug screening, RNA interference (RNAi) screening, clustered regularly interspaced short palindromic repeat (CRISPR) screening [8]. The limitation of yeast screening is that only a small portion of yeast genes (∼2000) have human orthologs [11], which limits the potential of this method. Drug screening tests drugs on various cell lines with specific mutations to identify SL gene pairs. SL gene pairs identified by drug screening would be easier to achieve clinical translation. However, the effect and specificity of drug inhibition tend to be lower than gene knockdown [8] and SL–gene pairs are limited in druggable gene targets. With the advent of RNAi and CRISPR/Cas9 technology, it is now possible to screen human cells for SL gene pairs. However, due to a large number of gene combinations (~200 million in a mammalian cell) [12], it is impractical to screen all potential SL pairs by these laboratory-based methods. To overcome the abovementioned disadvantages, a variety of computational methods have been proposed, which can reduce the search space of SL gene pairs. These methods can be divided into four categories: statistics-based methods, network-based methods, classic machine learning-based methods and deep learning-based methods. Statistics-based methods are based on certain hypotheses to predict SL gene pairs. For instance, Jerby-Arnon et al. [13] developed a data-driven model called data mining synthetic lethality identification pipeline (DAISY) for SL prediction based on the assumption that SL genes tend to be co-expressed but seldom coinactivation. Network-based methods identify SL gene pairs through constructing protein–protein interaction (PPI) [14-17] network, signaling network [18, 19] or metabolism network [20-22]. With the rapid development of machine learning, various algorithms have been applied for SL prediction, including random forest (RF) [23-27], matrix factorization [28-30] and so on. Deep learning-based methods have recently emerged as useful methods to identify SL gene pairs, especially graph neural network (GNN) [31-33]. The rest of this review is organized as follows. The next section introduces SL-related databases, including label databases, feature databases and other related databases. The third section summarizes the computational methods for SL prediction. After that, negative sampling methods applied in these computational methods are explained in the fourth section. The subsequent section introduces available tools to predict SL interactions. Finally, challenges and future work are discussed in the last section.

SL-related databases

Due to the development of high-throughput screening technologies, a large amount of SL data have been identified. Many databases are developed to gather SL pairs, which are listed in Table 2. Among these databases, SynLethDB [34] is a unique comprehensive database for SL. Other databases are based on yeast screening, RNAi screening, CRISPR screening, computational prediction and drug screening.
Table 2

Statistics of label databases reviewed in this paper

DatabaseMethodsDescriptionSpecies and No. of SL pairsWebsiteLatest update
SynLethDB V2 [34]DAISY, text mining, large-scale screening techniquesComprehensive database for SL H. sapiens: 35943 S. cerevisiae: 14000 D. melanogaster: 439 M. musculus: 381 C. elegans: 105 http://synlethdb.sist.shanghaitech.edu.cn/v2/#/2020
BioGRID V 4.4.201 [35–37]Experiments and literature miningGenetic interactions from all major model organisms and humansMajor model organisms and humans http://www.thebiogrid.org 1 September 2021
Syn-lethality [38]Manually curated SL pairs for human cancer from the literatures (113) SL pairs for human cancer inferred from yeast (1114)Integrates experimentally discovered and verified human SL gene pairs into a network H. sapiens: 1227 http://www.ntu.edu.sg/home/zhengjie/software/Syn-Lethality/ (NTU staff’s personal web pages)
GenomeRNAi [39]RNAiGenetic interactions detected by GenomeRNAi H. sapiens Drosophila http://www.genomernai.org/ 27 November 2017
DAISY [13]Computational predictionStatistically inferring SL pairs H. sapiens: 2816 http://www.cs.tau.ac.il/∼livnatje/SL_network.zip
The Cellmap [40]Yeast screeningDatabase of genetic interaction for S. cerevisiae S. cerevisiae: ~10 000 (GI score < −0.35) http://thecellmap.org May 2016
Laufer et al. study [42]RNAiCombinatorial RNAi and high-throughput imagingHuman cell lines: HCT116 HeLa http://www.bioconductor.org/packages/2.12/data/experiment/html/HD2013SGI.html
Vizeacoumar et al. study [43]A negative genetic interaction map in isogenic cancer cell lines6 isogenic cancer cell lines (KRAS, PTTG1, PTEN, MUS81, BLM)Support Information http://kimLab1.ccbr.utoronto.ca/projects/cancer_essential/ or http://moffatlab.ccbr.utoronto.ca/resources.php
Shen et al. study [44]CRISPR screeningCombinatorial CRISPR screeningHuman cell lines HeLa: 52 A549: 57293 T: 59293 T - http://www.ndexbio.org/#/newNetwork/199f9bb1-c3eb-11e6-8e29-06603eb7f303 A549 - http://www.ndexbio.org/#/newNetwork/ec8bdae3-c3c9-11e6-8e29-06603eb7f303; HeLa - http://www.ndexbio.org/#/newNetwork/e50ee3c2-c3d4-11e6-8e29-06603eb7f303.
GImap [12]Combinatorial CRISPR screeningHuman cell lines Jurkat: 454 K562:1678 https://data.mendeley.com/datasets/rdzk59n6j4/1 22 July 2018
Najm et al. study [45]Combinatorial CRISPR screeningHuman cell lines A375, HT29, OVCAR8, 786O, A549, Meljuso
Zhao et al. study [46]Metabolic gene networks through combinatorial CRISPR screeningHuman cell lines A549 HeLaSupport information
GEMINI [47]Computational predictionA variational Bayesian approach to identify genetic interactions from combinatorial CRISPR screeningSensitive lethal interactions and sensitive recovery interactions for four combinatorial CRISPR studiesSupport information
Wan et al. study [41]Application of GEMINI to identify genetic interactionsHuman cell lines A549: 126 A375: 18 HT29: 18 https://github.com/FangpingWan/EXP2SL/tree/master/GEMINI
Slorth [25]Predict SL pairs in a RF classifier H. sapiens: 518636 S. cerevisiae: 372560 D. melanogaster: 93002 S. pombe: 52594 C. elegans: 56908 http://slorth.biochem.sussex.ac.uk Jun, 2019
CGIdb [48]Identify potential SL pairs for specific cancer types from TCGA and functional screen data H. sapiens: 10 637 http://www.medsysbio.org/CGIdb 2019
Srivas et al. study [49]Drug screeningEvaluate thousands of TSG-drug combinationsYeast: 1420 HeLa: 127Support information2016

Note: BioGRID, Biological General Repository for Interaction Datasets; DAISY, Data mining SL identification pipeline; TCGA, The Cancer Genome Atlas; TSG, tumor suppressor genes; H. sapiens, Homo sapiens; S. cerevisiae, Saccharomyces cerevisiae; D. melanogaster, Drosophila melanogaster; M. musculus, Mus musculus; C. elegans, Caenorhabditis elegans; S. pombe, Schizosaccharomyces pombe.

Statistics of label databases reviewed in this paper Note: BioGRID, Biological General Repository for Interaction Datasets; DAISY, Data mining SL identification pipeline; TCGA, The Cancer Genome Atlas; TSG, tumor suppressor genes; H. sapiens, Homo sapiens; S. cerevisiae, Saccharomyces cerevisiae; D. melanogaster, Drosophila melanogaster; M. musculus, Mus musculus; C. elegans, Caenorhabditis elegans; S. pombe, Schizosaccharomyces pombe. In addition, we list 12 commonly used feature databases to be fed into computational models in Table 3. These databases comprise genes’ or proteins’ sequence property (GenBank, Unitprot), functional property [gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), molecular signatures database (MSigDB), Comparative Toxicogenomics Database (CTD), LINCS, PhyloGene and comprehensive resource of mammalian protein complexes (CORUM)] and topological property in PPI (Search Tool for the Retrieval of Interacting Genes/Proteins (STRING), Human Protein Reference Database (HPRD) and Human Integrated Protein-Protein Interaction Reference (HIPPIE)).
Table 3

Statistics of feature databases reviewed in this paper

DatabaseStatisticsWebsiteLatest update
GenBank release 246.0 [112]Gene sequence data: 233 642 893 www.ncbi.nlm.nih.gov/genbank/ 15 October 2021
Unitprot release 2021_03 [113]Protein sequence data: 219 740 215 https://www.uniprot.org/ 2 June 2021
GO release 2021-10-26 [114]43 832 GO terms 7 827 176 annotations http://geneontology.org/ 26 October 2021
KEGG Release 100.0 [115]Pathway maps: seven categories, 548 maps http://www.kegg.jp/ 1 October 2021
MSigDB V7.4 [116]Pathway comembership http://www.broadinstitute.org/msigdb April 2021
CTD [117]Gene-pathway annotations: 135 789 http://ctdbase.org/ 5 October 2021
LINCS Data Portal 3.0 [118]978 landmark genes under different perturbations https://lincsproject.org/LINCS/ June 2021
PhyloGene [119] http://genetics.mgh.harvard.edu/phylogene/ 2015
CORUM 3.0 [120]Mammalian protein complexes: 4274 http://mips.helmholtz-muenchen.de/corum/ 9 March 2018
STRING 11.5 [121]PPIs: more than 20 billion https://string-db.org/ 12 August 2021
HPRD release 9 [122]PPIs: 41 327 http://www.hprd.org/ 13 April 2010
HIPPIE v2.0 [123]Confidence scored and annotated PPIs: over 270 000 http://cbdm.uni-mainz.de/hippie/ 14 February 2019

Note: UniProt, The Universal Protein Resource; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; MsigDB, The Molecular Signatures Database; CTD, The Comparative Toxicogenomics Database; CORUM, The comprehensive resource of mammalian protein complexes; STRING, Search Tool for the Retrieval of Interacting Genes/Proteins database; HPRD, Human Protein Reference Database; HIPPIE, Human Integrated Protein–Protein Interaction reference; PPI, Protein–protein interaction.

Statistics of feature databases reviewed in this paper Note: UniProt, The Universal Protein Resource; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; MsigDB, The Molecular Signatures Database; CTD, The Comparative Toxicogenomics Database; CORUM, The comprehensive resource of mammalian protein complexes; STRING, Search Tool for the Retrieval of Interacting Genes/Proteins database; HPRD, Human Protein Reference Database; HIPPIE, Human Integrated Protein–Protein Interaction reference; PPI, Protein–protein interaction. Other databases provide data of large-scale single-gene knockout, cancer genomics and mutations and orthology analysis. SL interactions can be identified from the first two kinds of databases and can be inferred with the help of orthology analysis databases. The statistics of these databases are listed in Table 4.
Table 4

Statistics of other related SL databases reviewed in this paper

DatabaseDescriptionWebsiteLatest update
The cancer dependency map [124]Databases based on large-scale single gene knockout https://depmap.org/portal/ 19 August 2021
TCGACancer genomics and mutation databases https://www.cancer.gov/tcga 29 October 2021
CCLE [125] https://sites.broadinstitute.org/ccle 2019
COSMIC v95 [126] https://cancer.sanger.ac.uk/cosmic 24 November 2021
InParanoid 8 [127]Orthology analysis https://inparanoid.sbc.su.se/cgi-bin/index.cgi December 2013
OrthoMCL-DB [128, 129] https://orthomcl.org/orthomcl/app/ 8 September 2021

Note: TCGA, The Cancer Genome Atlas; CCLE, Cancer Cell Line Encyclopedia; COSMIC, Catalogue of Somatic Mutations in Cancer.

Statistics of other related SL databases reviewed in this paper Note: TCGA, The Cancer Genome Atlas; CCLE, Cancer Cell Line Encyclopedia; COSMIC, Catalogue of Somatic Mutations in Cancer. A brief introduction of each feature database and other database is shown in the supplementary file.

Comprehensive label databases

SynLethDB

SynLethDB [34] is a comprehensive database for SL and it has two versions so far. SynLethDB 1.0 was released in 2015 and SynLethDB 2.0 was updated in 2020. SL pairs are collected from multiple sources, including manual curations from literatures, three SL-related databases (BioGRID [35-37], Syn-lethality [38] and GenomeRNAi [39]), bispecific shRNA screening (DECIPHER), computational predictions (DAISY [13]) and text mining data for five species (human, mouse, fruit fly, worm and yeast). A brief introduction of the integrated databases is provided in the supplementary file. SynLethDB provides a webserver to calculate the confidence score for each SL pair by integrating individual scores derived from different evidence sources. In addition, the latest version adds SynLethKG. It is a comprehensive knowledge graph (KG) of SL including 11 types of biomedical entities and 27 types of relationships, representing features and relationships between genes, cancers and drugs.

Label databases based on yeast screening

The CellMap

The CellMap [40] is a web-based database of genetic interaction for Saccharomyces cerevisiae released in 2016. Through constructing over 23 million double mutants, ~350 000 positive and ~550 000 negative genetic interactions are identified. Three different interaction maps are constructed: nonessential × nonessential (N×N), essential × nonessential (E×N) and essential × essential (E×E) genetic network. The number of SL pairs between nonessential genes (~10 000) is estimated by applying an extreme negative interaction score threshold (<−0.35) to the N × N dataset.

Label databases based on RNAi screening

SL gene pairs can be experimentally obtained by the comparison of single and dual mutants in the same assay. Compared to double mutant yeast strains, which can be developed through high-throughput mating methodologies, it is more challenging to develop human cell lines with double mutations [10, 41]. Early human double perturbation screening used combinatorial siRNA knockdowns or siRNA knockdown under specific mutations to obtain genetic interaction data. SL pairs can be extracted from these data based on a specific indicator.

Laufer et al.’s study

Laufer et al. [42] performed 51 680 combinatorial RNAi experiments and identified genetic interactions for one or more of 11 phenotypes between 2376 gene pairs in human colon cancer cells.

Vizeacoumar et al.’s study

Vizeacoumar et al. [43] identified negative genetic interaction partners of five specific driver-mutated genes across a set of isogenic cancer cell lines through pooled shRNA screening. A total of 826 genetic interactions are tested and 200 negative genetic interactions (24.2%) are confirmed. They generate a genetic interaction network consisting of 2014 nodes and 2617 edges.

Label databases based on CRISPR screening

With the advances in CRISPR technology, it is now possible to systematically map SL networks in human cancer cells using combinatorial CRISPR screening.

Shen et al.’s study

Shen et al. [44] developed a high-throughput CRISPR screening approach for targeting single and pairwise genes. They screened all possible pairs of 73 cancer genes in three human cell lines, with totally 152 SL gene pairs were identified.

GI map

Horlbeck et al. [12] systematically screened 222 784 gene pairs from two human cancer cell lines through the CRISPR interference method, and constructed a large genetic interaction (GI) map.

Najm et al.’s study

Najm et al. [45] developed a dual-Cas9 platform to screen genetic interactions across six human cell lines and examined SL interactions among them.

Zhao et al.’s study

Zhao et al. [46] probed metabolic gene networks through combinatorial CRISPR screening developed by Shen et al. [44]. They interrogated a set of 51 genes in A549 and HeLa cells, which are involved in glycolysis and pentose phosphate pathways.

Label databases based on computational prediction

GEMINI and Wan et al.’s study

Zamanighomi et al. develop GEMINI refer to section Tools and applications to identify sensitive lethal and sensitive recovery interactions from combinatorial CRISPR screening. Wan et al. [41] used GEMINI to identify SL interactions from the combinatorial CRISPR experiments in three cell lines. They provide SL gene pairs with both SL relationships and L1000 gene expression profiles.

Slorth

Benstead-Hume et al. [25] extracted various features from PPI networks for use in a RF classifier to predict SL and SDL pairs both within and across five species. All predicted pairs can be obtained in the Slorth database released in 2019.

Cancer genetic interaction database

Han et al. [48] developed an algorithm to identify potential SL interactions for specific cancer types from The Cancer Genome Atlas (TCGA) refer to the Supplementary data and functional screening data. As a result, 10 637 SL interactions are detected. They integrate SL interactions predicted by other studies and construct the Cancer Genetic Interaction database (CGIdb).

Label databases based on drug screening

Srivas et al.’s study

Srivas et al. [49] exploited ~169 000 potential interactions between tumor suppressor genes (TSG) orthologs and druggable genes in yeast. Under the guidance of the strongest signal, they screened thousands of TSG–drug pairs in HeLa cells and construct conserved SL interaction networks.

Computational methods for SL prediction

The increasing volume of biological data and the rapid development of computer technology have paved the way to develop computational methods for SL prediction. The principle behind computational methods is to utilize biological knowledge that is confirmed to be able to determine known SL interactions, thus providing valuable insights into identifying more SL interactions from genes of interest [50]. Moreover, they show an impressive ability in SL prediction. In general, computational methods can be divided into (i) statistical-based methods, (ii) network-based methods, (iii) classic machine learning (ML) methods and (iv) deep learning methods. Due to the various principles of these methods, they have their own merits and demerits, which are listed in Table 5. Summary of studies involved in this review are shown in Table 6 and their performance scores are summarized in Table 7.
Table 5

Summary of SL prediction methods and representative models

Methods and representative modelsDescriptionAdvantagesDisadvantagesApplication scenarios
Statistical-based methodsFit existing data based on certain hypothesisFrom the perspective of systems biology Do not require known SL dataThe selection of hypothesis or threshold is highly subjective and unstableThere are insufficient known SL data
DAISY [13]Identifies SL interactions in cancer through three statistical procedures in parallelComprehendible to biologists Mining data from clinical cancer samplesThe biological data are at times noisy and inaccurateIdentification of clinical-related SL interactions in cancer
Network-based methodsStudy SL pairs from the perspective of biological networkAdd network structure information to gain a more comprehensive understanding of genes globallyNetwork data are incomplete and contains a lot of noisesThere are insufficient known SL data
IDLE [21]Predicts enzymatic SDLs from a GSMMThe first computational method that captures enzymatic SDL effects in metabolic networks Uncovers the mechanisms behind SDLsDoes not integrate more data source such as patient-specific omics dataIdentifies SDLs that have a significant impact on tumor in clinical settings
Fast-SL [22]Rapidly identifies SL pairs in metabolic networksOvercomes the issue of computational complexityDoes not identify human SL gene pairsIdentifies higher order SL pairs in metabolic network
Classic ML methodsLearn general patterns from a limited set of known SL data and use those patterns to make predictions about unknown or unobserved SL gene pairsGood performance on small data sets Effectively integrate multidimensional feature dataManually generated features and need to understand the features that represent the data Lacks of negative samplesRequire known SL data and feature data of high quality
De Kegel et al. study [26]RF-based model to predict paralog SL pairsMakes interpretable predictions for paralog SL pairsRestricted in the identification of paralog SL pairsIdentifies context-specific paralog SL pairs
GRSMF [28]A GRSMF modelHas the ability of data-adaptiveness and avoids determining the dimension of the latent spaceFocuses on mapping genes to latent representations and cannot aggregate information from neighbor genesThere are not enough negative samples
Deep learning methodsUse a multistep feature transformation to obtain a feature representation of the original data, and further input into the prediction function to obtain the final resultDiscover deep features for representation learning and pattern recognition from large dataset Does not require manual feature extraction.Demand a large amount of data and computational resources. Limited by the quality and quantity of the data, which contain many false positives and false negatives. It is hard to train the model. Poor interpretability Lack of negative samplesRequire sufficient known SL data and feature data of high quality
EXP2SL [41]A semisupervised neural network methodUtilizes unlabeled SL data to predict cell-line-specific SL pairs Demonstrates that L1000 expression profiles are effective features data for SL predictionLimited sample space and cell linesPredicts cell-line specific SL pairs There are insufficient labeled SL samples
DDGCN [31]A dual-dropout GCN methodUses SL dataset with better quality Aggregates information from neighbor genesFocuses solely on known SL pairs and ignores other data sources of genesThere are sufficient SL samples of high quality and insufficient feature data
Table 6

Summary of studies involved in this review

CategoryStudyPublished yearAlgorithmsSL dataFeature dataProgram code
Statistical-based methodsLi et al. [51]2011MLESGD [130]Domain relationships
Zhang et al. [52]2012MLESGD [130]Protein sequences
Conde-Pueyo et al. [53]2009Homologous mappingBioGRID [35–37]Somatic mutations, GO annotation, drugs and their gene targets
Lee et al. [54]2013Homologous mappingBioGRID [35–37]Homology information, gene expression information
Deshpande et al. [55]2013Homologous mappingLiteratures [56]Homology information
Kirzinger et al. [16]2019Homologous mappingGene expression data, homology information
Jerby-Arnon et al. [13]2014DAISYSCNA and mutation profiles, gene essentiality profiles, gene expression profiles
Srihari et al. [58]2015Statistical analysisGenomic copy-number and gene expression
Guo et al. [34]2016Statistical analysisBioGRID [35–37], Syn-Lethality [38] GenomeRNAi [39] DAISY [13] The DECIPHER Project, http://histone.sce.ntu.edu.sg/SynLethDB/
Wang et al. [59]2019Statistical analysisSynLethDB [34] and Literatures [15, 49, 58, 61, 131]Somatic mutation information, shRNA data, yeast genetic interactions
Lee et al. [60]2018ISLESCNA, gene expression, mutation and survival data https://github.com/jooslee/ISLE/
Wang et al. [61]2013The univariate F-test or t-testGene expression
Chang et al. [62]2016Statistical analysisLiteratures [5, 6, 132, 133]Gene expression
Feng et al. [63]2019Statistical analysisGenomics and patient survival data
Sinha et al. [65]2017MiSLMutation, copy number and gene expression https://purl.stanford.edu/ny450yx7231
Yang et al. [64]2021SiLiLarge-scale sequencing data
Network-based methodsKranthi et al. [15]2013PPI networksPPIs
Jacunski et al. [14]2015PPI networksBioGRID [35–37]PPIs, functional annotations
Ku et al. [17]2020PPI networksPPIs, pathways
Zhang et al. [19]2015Signaling networksSignaling data
Liu et al. [18]2018Signaling networksSynLethDB [34]PPIs
Apaolaza et al. [20]2017Metabolic networksGene expression data
Megchelenbrink et al. [21]2015IDLEThe human metabolic network
Pratapa et al. [22]2015Fast-SLGenome-scale metabolic networks https://github.com/RamanLab/FastSL
Classic ML methodsPaladugu et al. [67]2008SVMLiteratures [134] [134–136]PPI network
Wu et al. [71]2021k-NNSynLethDB [34]Seven similarities of gene pairs (gene expression, protein sequence, PPI, copathway, GO biological process, GO cellular component and GO molecular function)
Yin et al. [69]2019DTSynLethDB [34]Mutation, CNV and clinical data of breast cancer
Pandey et al. [72]2010MNMCSGD [130]PPIs, functional annotations, Pathways, mutant phenotype, proteins phylogenetic profiles, sequence similarity of genes and proteins
Wu et al. [73]2014Ensemble learningBioGRID [35–37]Semantic similarity, PPIs, sequence orthologs, semantic similarity, co-complex membership, co-pathway membership, gene expression correlation, Common/interacting domains, the number of domains
Das et al. [23]2019DiscoverSL (RF)SynLethDB [34]Mutation, gene expression, copy number alteration, gene-pathway information https://github.com/shaoli86/DiscoverSL/releases/tag/V1.0
Li et al. [24]2019RFShen et al. study [44]GO term and KEGG pathway
Benstead-Hume et al. [25]2019RFBioGRID [35–37]PPIs
De Kegel et al. [26]2021RFShared PPIs, evolutionary conservation, etc. https://github.com/cancergenetics/paralog_SL_prediction; https://doi.org/10.5281/zenodo.5139973
Benfatto et al. [27]PARIS (RF)CRISPR screens with genomics and transcriptomics data https://github.com/sbenfatto/PARIS
Huang et al. [28]2019GRSMF (Matrix factorization)SynLethDB [34]GO similarity matrix https://github.com/Oyl-CityU/GRSMF
Liany et al. [30]2020CMF (Matrix factorization)SynLethDB [34]Essentiality Profile, mRNA gene expression, SCNA level, pairwise coexpression https://github.com/lianyh
Liu et al. [29]2020SL2MF (Matrix factorizationSynLethDB [34]PPI similarity, GO similarity
Deep learning methodsWan et al. [41]2020Neural networkShen et al. study [44] GI map [12] Najm et al. study [45] Zhao et al. study [46]L1000 gene expression profiles [118] https://github.com/FangpingWan/EXP2SL
Cai et al. [31]2020GCNSynLethDB [34] https://github.com/CXX1113/Dual-DropoutGCN
Long et al. [32]2021GATSynLethDB [34], SynLethDB- v2.0 (http://synlethdb.sist.shanghaitech.edu.cn/v2)GO semantic similarity, PPIs https://github.com/longyahui/GCATSL
Hao et al. [33]2021GAESynLethDB [34]GO similarity matrix, PPIs, coexpression、mutual exclusion score、copathway https://github.com/DiNg1011/SLMGAE
Zhang et al. [2]2021KGSynlethDB [34], Jerby-Arnon et al. [13]Three relationships (different cancer types and their mutant genes, drugs and targets, drugs and their indications)
Wang et al. [80]2021KGSynLethDB [34], SynLethDB- v2.0 (http://synlethdb.sist.shanghaitech.edu.cn/v2)The relationships of genes, drugs and compounds

Note: SVM, support vector machine; DT, Decision tree; k-NN, k-nearest neighbors; RF, random forest; GCN, graph convolutional network; GAT, graph attention network; GAE, graph autoencoder; KG, knowledge graphs; MLE, maximum likelihood estimation; ISLE, identification of clinically relevant synthetic lethality; MiSL, mining synthetic lethals; SiLi, statistical inference-based synthetic lethality identification; IDLE, identifying dosage lethality effects; MNMC, multi-network and multi-classifier; PARIS, PAn-canceR Inferred Synthetic lethalities; GRSMF, graph regularized self-representative matrix factorization; CMF, collective matrix factorization; SGD, saccharomyces genome database; SCNA, somatic copy number alterations.

Table 7

Performance scores and validation scheme of the methods involved in this review

StudyAlgorithmsValidation schemeAUROCAUPRCACCF1MCCPrecisionSensitivitySpecificity
Pratapa et al. [22]SVM10-fold cross-validation0.796
Wu et al. [71]k-NN10-fold cross-validation0.8480.8610.7640.7390.8250.670
Pandey et al. [72]MNMC10-fold cross-validation0.897
Wu et al. [73]Ensemble learning5-fold cross-validation0.871
Li et al. [24]RF10-fold cross-validation0.532
Benstead-Hume et al. [25]RF5-fold cross-validation0.889
Liu et al. [29]Logistic matrix factorization5-fold cross-validation0.8480.239
Huang et al. [28]Matrix factorization5-fold cross-validation0.923
Liany et al. [30]CMF3-fold cross-validation0.9800.980
Wan et al. [41]Neural network5-fold cross-validation0.9690.8800.9590.8660.8720.9030.968
Cai et al. [31]GCN5-fold cross-validation0.8780.3440.552
Long et al. [32]GAT5-fold cross-validation0.9370.948
Hao et al. [33]GAE5-fold cross-validation0.9170.9420.871
Wang et al. [80]KG5-fold cross-validation0.9470.9560.887

Notes: AUROC, area under receiver optimizer characteristics curve; AUPRC, area under precision-recall curve; ACC, accuracy; MCC, Matthews correlation coefficient.

Summary of SL prediction methods and representative models Summary of studies involved in this review Note: SVM, support vector machine; DT, Decision tree; k-NN, k-nearest neighbors; RF, random forest; GCN, graph convolutional network; GAT, graph attention network; GAE, graph autoencoder; KG, knowledge graphs; MLE, maximum likelihood estimation; ISLE, identification of clinically relevant synthetic lethality; MiSL, mining synthetic lethals; SiLi, statistical inference-based synthetic lethality identification; IDLE, identifying dosage lethality effects; MNMC, multi-network and multi-classifier; PARIS, PAn-canceR Inferred Synthetic lethalities; GRSMF, graph regularized self-representative matrix factorization; CMF, collective matrix factorization; SGD, saccharomyces genome database; SCNA, somatic copy number alterations. Performance scores and validation scheme of the methods involved in this review Notes: AUROC, area under receiver optimizer characteristics curve; AUPRC, area under precision-recall curve; ACC, accuracy; MCC, Matthews correlation coefficient.

Statistical-based methods

This section focuses on the related works of statistical methods on the SL prediction task. Based on the knowledge of systems biology, statistical-based methods learn to fit existing SL data using particular assumptions. The assumptions are based on prior biological knowledge, such as the fact that SL genes are frequently co-expressed, having similar functions, or exhibiting mutual exclusivity with respect to specific genetic events. Models based on these assumptions are usually explainable as they can reveal statistical regularities between gene pairs at the phylogenetic level to some extent, but the accuracy of these models greatly depends on the prior statistical assumptions.

Prediction of SL gene pairs for yeast

Earlier studies mainly focused on identifying SL pairs in yeast, due to the limited access to human SL pairs. For instance, yeast SL pairs can be predicted by maximum likelihood estimation (MLE) method using the domain genetic interaction probabilities [51] or genetic interactions of significant short polypeptide clusters [52]. Furthermore, SL gene pairs of humans or other species can be predicted through yeast orthology mapping [16, 53–55]. However, orthology mapping has two major limitations. First, only a small portion of yeast genes have human orthologs as humans are evolutionarily distant from yeast. Second, SL relationships may develop independently across species [14].

Global prediction of SL gene pairs for human

With the rapidly accumulation of human genome data, global human SL interactions prediction has started to be investigated. Among them, DAISY is the most representative approach. DAISY is a data-driven computational pipeline based on large amounts of cancer genomic data proposed by Jerby-Arnon et al. [13] in 2014. They identify SL interactions in cancer through three statistical procedures in parallel (Figure 3A): (i) Genomic survival of the fittest. It is based on the observation that the coinactivition of SL pairs leads to cell death. Therefore, SL pairs can be selected by identifying gene coinactivation events that occur substantially less than expected. (ii) shRNA-based functional examination. It is based on the fact that knocking out the SL gene is lethal to cells when its SL partner gene is inactive. This can be implemented by an integrated analysis of shRNA essentiality screening, their somatic copy number alterations (SCNA) and transcriptomic profiles. (iii) Pairwise gene coexpression. SL pairs are likely to be involved in closely associated biological processes and hence tend to be co-expressed [56, 57]. Then a cancer genome-wide SL interactions network is constructed from SL gene pairs identified by all the three procedures. DAISY successfully identifies SL pairs by capturing the results obtained from large-scale genomic data and shRNA screens, but these data are at times noisy and inaccurate.
Figure 3

The flowcharts of some selected typical methods reviewed in this manuscript. (A) DAISY [13], a statistical-based method. (B) MNMC [72], an ensemble classifier. (C) SL2MF [29], a logistic matrix factorization method. (D) DDGCN [31], a GCN-based method. (E) KG4SL [80], a KG-based method.

Other researches have also developed some valuable statistical inferring methods and more assumptions have been proposed for SL prediction. For example, gene pairs altered in a mutually exclusive pattern are likely to be SL pairs [58, 59]; SL pairs upon coinactivation may exhibit prolonged patients’ survival [60]; SL pairs tend to have high phylogenetic similarity [60].

SL prediction for specific genes or cancers

Wang et al. [61] identified differentially expressed genes between tumors with and without functional p53 mutations by univariate F-test or t-test. The genes which exhibit higher relative expression in p53 mutated tumors were further selected as the candidate SL partner genes for p53. Chang et al. [62] selected lung adenocarcinoma-dependent genes through computing gene expression of lung adenocarcinoma versus nontumorous tissues, and then associated with five clinical factors to obtain predicted SL pairs. Feng et al. [63] developed an integrated computational pipeline based on ISLE (identification of clinically relevant SL) [60], which determine SL partner genes of GNAQ following four aspects: molecular condition (differentially overexpressed genes), clinical condition (genes associated with poor prognosis), phenotypic condition (more essential genes) and druggable condition. Recently, Yang et al. [64] inferred SL gene pairs in liver cancer based on DAISY and ISLE, which contain five inference analyses (functional similarity, differential gene expression, pairwise gene coexpression, pairwise survival and rank aggregation). Sinha et al. [65] proposed a computational pipeline called Mining Synthetic Lethals (MiSL) to identify mutation-specific SL pairs for specific cancers. Their basic assumption is that SL partner genes of a mutated gene tend to be amplified more frequently or deleted at a lower frequency in primary tumor samples containing the mutated gene.

Network-based methods

This section focuses on the network-based methods for SL prediction. Compared with statistical methods, network-based methods provide a more comprehensive understanding of genes in the entire biological network and improve our understanding of the mechanisms of SL. Currently, network-based methods predict SL pairs through constructing biological networks (PPI networks, signaling networks or metabolic networks), then analyzing the topological characteristics of genes in biological networks and assessing the network changes in response to knocking out gene pairs.

P‌PI network-based methods

Kranthi et al. [15] pointed out that the connectivity of the protein in the PPI network and the structure of the network are related to its functional characteristics. In general, the protein nodes with high degrees are usually functionally basic, and a lack of them would lead to lethality. Based on this, they developed graph information centrality measures in biological systems to identify SL gene pairs. They modified the information centrality method by knocking out two nodes. However, this method does not take the efficiency changes of knocking out a single node in the network into account, as the network changes may be caused by knocking out one gene at times [18]. Jacunski et al. [14] evaluated the connectivity homology by calculating the network parameters in the PPI network and designing an SL prediction model based on connectivity homology. Ku et al. [17] identified functionally distinct KRAS SL subnetworks or modules based on the MCODE clustering algorithm in the PPI network, all of which can be traced back to a specific pathway or protein complex.

Signaling network-based methods

Zhang et al. [19] predicted SL gene pairs by combining a data-driven method with the knowledge of pathway information from signaling networks to mimic the influence of single gene knockdown and double genes knockdown to cell viability. Gene pairs are considered as potential SL pairs when double genes knockdown significantly increase the likelihood of cell death, whereas single gene knockdown does not. Liu et al. [18] constructed human cancer signaling network (HCSN) by calculating the shortest path between no cancer gene and cancer gene pairs. Then they screened SL pairs from HCSN by three procedures: network-based method (according to the distance between cancer genes and noncancer genes), frequency-based method and function-based method. This method screens SL pairs by a multistep strategy, thus it might get better results.

Metabolic network-based methods

Apaolaza et al. [20] developed a genetic minimal cut set (gMCS)-based method to predict SL interactions and revealed a potential mechanism explaining the effect of specific gene knockout to disrupt cell growth. gMCS refers to minimal sets of reactions, the removal of which will invalidate the function of specific metabolic tasks. Megchelenbrink et al. [21] presented a network modeling method called identifying dosage lethality effects (IDLE). IDLE predicts enzymatic SDLs from a genome-scale model of metabolism (GSMM). For each pair of enzymes (A, B) in the human GSMM, they predicted SDL by measuring the growth reduction level caused by changing the enzyme flux of A and B. IDLE identifies SDLs in clinical settings, but it does not integrate more data sources such as patient-specific omics data. In addition, Pratapa et al. [22] developed Fast-SL, an algorithm to rapidly identify SL gene sets in metabolic networks. The algorithm overcomes the issue of computational complexity encountered in previous methods by iteratively narrowing the searching space for SLs, thus substantially reducing the computational time. Indeed, network-based methods can only integrate one or more interaction networks among genes. Relationships between genes and other entities like patients cannot be directly modeled [30]. In addition, they cannot utilize other data that contain related information about SL, such as sequence and function properties of genes. What is more, they do not use the existing SL samples so the underlying patterns of known SL pairs are not being exploited.

Classic ML methods

This section mainly introduces some classic ML methods for SL prediction tasks. Compared with the network-based methods, ML methods can effectively integrate multidimensional data and achieve feature learning through parameter fitting, providing more comprehensive information for SL prediction. Classical ML methods attempt to reveal the patterns of observed samples that cannot be acquired through principle analysis, in order to achieve reliable prediction of unknown data (Figure 2). There are two main types of classical ML methods: supervised learning and unsupervised learning. Most SL prediction tasks adopt supervised ML models. Support vector machine (SVM), Decision Tree (DT), k-nearest neighbors (k-NN), RF, ensemble classifier and matrix factorization are applied for SL prediction.
Figure 2

Workflow of ML methods used in SL prediction. SVM refers to support vector machine. DT refers to Decision Tree. RF refers to random forest.

Workflow of ML methods used in SL prediction. SVM refers to support vector machine. DT refers to Decision Tree. RF refers to random forest.

Support vector machine

The principle of SVM is to create an optimal decision boundary that maximizes the distance between two classes [66]. Paladugu et al. [67] proposed an SVM model that uses topological properties of two genes in a PPI network as features for SL prediction in yeast.

Decision Tree

DT model creates tree-like structure for classification purpose, where each internal node corresponds to a test of a feature and each leaf node refers to a classification result [68]. Yin et al. [69] predicted SL interactions in breast cancer based on DT. Two features [mutation coverage and copy number variations (CNV)] are classified and optimized by experimentally validated SL pairs, which are used to predict SL interactions based on DT.

K- nearest neighbors

K-NN algorithm is a nonparametric method [70] that classifies unknown samples by a plurality voting of its neighbors. Wu et al. [71] proposed a k-NN model to achieve the similarity-based classification of gene pairs. The basic hypothesis of this model is that unknown gene pairs which exhibit high levels of similarity to the known SL pairs are more likely to be potential SL pairs.

Ensemble classifiers

Ensemble classifiers achieve predictions by integrating the results of several independently trained weak models on the same samples. The integrated models outperform those of separate models. How to choose the independent weaker models and how to integrate their learning results are the main challenges of this algorithm. Pandey et al. [72] defined a large number of features for characterizing SL interactions from diverse data sources. Then, they design an integrated multi-network and multi-classifier (MNMC) framework composed of six different classifiers to predict yeast SL gene pairs. Wu et al. [73] also developed an ensemble algorithm (MetaSL) that integrates RF, DT, SVM and other ML classifiers based on a variety of biological features. Compared with MNMC [72], MetaSL assigns different weights to different classifiers according to their performances in the training process. Thus the prediction results are based on a weighted consensus from the participating classifiers. However, the limitation of this study is that inter-dependence exists among the input features.

Random forest

RF [74] actually belongs to ensemble classifiers, but all of the integrated classifiers are DTs. RF achieves strong predictive power by combining the simplicity of DTs with the flexibility and powerful functions of ensemble classifiers. Besides, it can cope with high dimensional (containing many features) data without feature selection as it is able to randomly select a subset of features. Das et al. [23] developed an RF-based R package DiscoverSL to predict SL interactions in cancers using multiomics cancer data. Li et al. [24] encoded genes as enrichment scores based on GO terms and KEGG annotation and a gene pair is represented by numerous features derived from their enrichment scores. Following this, they utilized SL label data to build an RF-based prediction model with optimized functional features. In particular, the maximum relevance and minimum redundancy method [75] is used to generate a ranked feature list and incremental feature selection method is applied to select the most appropriate number of features. Benstead-Hume et al. [25] also extracted features from the graph in the PPI network and use the RF model to predict SL gene pairs. Considering paralog pairs share functionality similarities and are more likely to be SL pairs, De Kegel et al. [26] developed an RF classifier to predict paralog SL pairs. Specifically, they applied TreeExplainer [76] to compute the influence of each feature on a specific prediction, so the classifier is able to make interpretable predictions. Benfatto et al. [27] developed an algorithm called PAn-canceR Inferred Synthetic lethalities (PARIS) that can address the importance of individual gene deficiency in explaining their dependencies in multiple cancer cells. The core of the PARIS algorithm lies in the feature selection step, achieved by RF through assigning importance scores to each mutation and expression feature based on CRISPR screening data across multiple cancer cell lines.

Matrix factorization

The classic ML methods described above are based on a supervised learning frame that requires both positive and negative training samples. However, SL prediction tasks lack real negative samples, as the majority of them are randomly selected from unknown samples, which may pick up false negative data. Matrix factorization methods effectively avoid this defect by capturing the underlying mechanisms of SL samples and integrating relevant information. Matrix factorization aims to decompose an input matrix into the product of two low-rank matrices, and then the data-missing matrix is filled with data obtained through model training. Huang et al. [28] designed a graph regularized self-representative matrix factorization (GRSMF) model which uses the linear representation of matrix X’s rows and columns to decompose itself. What is more, authors integrate GO similarity matrix data as a graph regularization term to address the sparse input data and improve the prediction accuracy. Compared with the conventional matrix factorization, GRSMF has the ability of data-adaptiveness and avoids determining the dimension of the latent space. To further differentiate the importance weights between SL pairs and unknown pairs, Liu et al. [29] proposed a logistic matrix factorization model, called SL2MF (Figure 3C), to learn latent representations of SL pairs. The combination of the latent vectors determines the probability of SL pairs. Moreover, they apply neighborhood regularization to constrain the latent vector, based on the hypothesis that genes with similar GO or PPI properties should be factorized into similar latent vectors. In addition, conventional matrix factorization methods have limited capability on complicated heterogeneous data. To address this issue, Liany et al. [30] improved the collective matrix factorization (CMF) method through three measures. The first two measures rely on a transformation (principal components analysis and graph features). The third measure is to extend the model by using matrix-specific weights. This modified model figures out the problem that conventional CMF cannot learn the unique representation of each entity when multiple input matrices contain the same entity types. The flowcharts of some selected typical methods reviewed in this manuscript. (A) DAISY [13], a statistical-based method. (B) MNMC [72], an ensemble classifier. (C) SL2MF [29], a logistic matrix factorization method. (D) DDGCN [31], a GCN-based method. (E) KG4SL [80], a KG-based method.

Deep learning methods

This section discusses the application of deep learning in SL prediction. Deep learning is a subset of ML methods. Compared to classical ML methods that extract features manually based on knowledge, deep network structures can better capture nonlinear and complex relationships between inputs and outputs, allowing them to identify complex patterns behind the data. Interdependent relationships always exist in biological entities and processes, which are often inherently noisy and occur at multiple scales. Therefore, biological data can be well suited for deep learning. Neural networks are the most commonly employed models in deep learning as they show high significant fit for complex nonlinear problems [111]. Given the fact that most SL pairs are cell-line specific, Wan et al. [41] develop a semisupervised neural network method called EXP2SL to identify SL pairs. For a pair of gene, they use cell line shRNA perturbation (LINCS L1000 project) gene expression profile to construct 978-dimensional features as inputs of the encoding layers and predict the potential SL pairs. EXP2SL is the first model to predict cell-line-specific SL pairs and it makes full use of unlabeled SL data. Currently, there are plenty of deep learning methods, among which the following two types are the most frequently and effectively used models in SL prediction: GNN and KG embedding.

Graph neural network

GNN can efficiently capture the structures of graph and model complex relations between neighbor nodes in the graph. Three archetypes of GNN are adopted in SL prediction, including graph convolutional network (GCN), graph attention network (GAT) and graph auto-encoder (GAE).

Graph convolutional network

GCN is an extension of convolutional neural network on graph structure. Compared with matrix factorization mentioned in the previous section, GCN can capture the information of neighbor nodes in the graph. Cai et al. [31] apply GCN to SL prediction and propose a model called dual-dropout GCN (DDGCN) (Figure 3D). DDGCN can aggregate information of neighboring genes in a graph by convolution operators. Furthermore, researchers adopt a dual dropout regularization technique [77] during the training process to avoid overfitting due to the sparse SL data. However, there are two limitations in this study. First, DDGCN only utilizes the information of the known SL pairs and lacks information on other features. Second, DDGCN does not assign different weights to different neighbors.

Graph attention network

GAT [78], which is able to assign different weights to each neighbor nodes, adopts attention mechanism to counter the shortcomings of GCN. Long et al. [32] developed a Graph Contextual Attention Network model called GCATSL that effectively integrates multiple biological data for SL prediction. After constructing multiple gene feature graphs with different data source as model’s inputs, a dual attention mechanism (node-level and feature-level) is designed for each feature to capture local and global neighbors’ importance to learn their representations. Multilayer perceptron is further exploited to aggregate the extracted features with original features.

Graph auto-encoder

GAE extends the idea of autoencoder to a graph. The node embeddings in the graph can be obtained through the encoder–decoder structure. In general, GAE uses GCN as the encoder. After inputting the topology and node information of the graph into the encoder, the inner product is adopted as the decoder to reconstruct the original graph. Hao et al. [33] combined GCN with autoencoder to construct a multiview graph autoencoder (SLMGAE) with a variety of data for SL prediction. SLMGAE takes SL graph as main view and graphs of other data (PPI, GO, etc.) as support views. Multiple GAEs are applied to graph reconstruction and GCN is used as the encoder. SLMGAE is able to integrate various data sources of genes in a GNN based framework and differentiate each data source by an attention mechanism.

Knowledge graph

The network embedding-based methods mentioned above integrate the information of multiple or heterogeneous biological networks, but in essence, there is no unified consideration for different relationship types. KG demonstrates excellent performance to this problem, which is a kind of knowledge-rich heterogeneous network composed of interconnected entities and relevant properties. It embeds the rich entities and the relationship information into the continuous vector space with low dimension [79], so as to facilitate computation while retaining the structural information. Due to the complexity and diversity of biological information, KG performs well in biological tasks such as SL prediction. Wang et al. [80] constructed a KG algorithm (KG4SL) for SL prediction (Figure 3E). The algorithm consists of three modules. First, a gene-specific weighted subgraph is generated for each gene. Second, gene representation is updated by aggregating its neighbors’ representations in its weighted subgraph. Third, SL score can be calculated through the inner product of the two genes’ aggregation result. However, this method may not fully integrate the neighborhood topological structures when generating a gene-specific weighted subgraph due to the large degrees of some nodes. In addition, some neighbors might be uninformative and promiscuous in the process of message passing. Zhang et al. [2] developed the Synthetic Lethality Knowledge Graph (SLKG), which integrates three types of entities (genes, drugs and diseases) and four types of relationships. Drug repositioning is achieved by defining three core scoring functions: SLScore (SDLScore), DrugScore and CancerScore. SLScore is calculated by integrating different SL evidences. In general, the above four methods have their own characteristics. Statistics-based methods and network-based methods are usually interpretable for novel predictions and do not require known SL samples. Statistics-based methods are based on statistical assumptions on the biological data, thus the accuracy of the assumptions and the quality of the biological data are needed. For example, considering SL gene pairs tend to be coexpression and seldom coinactiviton, DAISY [13] identifies SL gene pairs from large-scale genomic data. Network-based methods are based on a deep knowledge of a single or heterogeneous biological network, often accompanied by some creative concepts to identify the potential nodes that play an important role in the biological network. For instance, Kranthi et al. [15] developed a graph information centrality to identify SL gene pairs from human cancer protein interaction network. ML methods are trained on known SL samples. Classic ML methods tend to get better results than deep learning methods in small and medium-sized data sets (below hundreds or tens of thousands of samples) due to their fewer hyperparameters. Deep learning can achieve a better prediction under a big data set and it can extract high-level features from the data based on its complex network structure and a large number of parameters. However, due to the end-to-end learning process, the intermediate process of deep learning is a black box with good performance but a lack of interpretability [81]. This will lead to great uncertainty and unreliability when applied in biological or medical practice.

Negative sampling methods

SL computational models normally follow a supervised learning framework. Experimental data are composed of positive and negative samples. Positive samples can be extracted from databases in Table 2. How to prepare negative samples is one of the challenges for SL prediction. Randomly picking up unknown gene pairs is a commonly used method for negative sampling [32, 80]. This approach is relatively simple and can obtain enough negative samples. However, as shown in Figure 4, this method may pick up unidentified positive samples. Mislabeled data would lead to the worse performance of the model.
Figure 4

Negative sampling methods. (A) Randomly picking up unknown gene pairs as negative samples. (B) Extracting gene pairs from GI databases with certain GI scores as negative samples.

Negative sampling methods. (A) Randomly picking up unknown gene pairs as negative samples. (B) Extracting gene pairs from GI databases with certain GI scores as negative samples. Another negative sampling method is extracting gene pairs from GI databases with certain GI scores as negative samples [33, 41]. For instance, Hao et al. [33] extracted negative SL samples with GI scores around 0 and positive SL samples with GI scores below −3. This negative sampling method avoids introducing potential positive samples, but the number of negative samples is relatively small.

Tools and applications

To predict or identify SL interactions based on various data, number of easy-to-use tools have been developed. In this section, a brief introduction of these representative tools is presented. Further details of these tools are listed in Table 8.
Table 8

Tools and applications reviewed in this study

ToolDescriptionAvailabilityWebsite
G2GPredict SL interactions based on mapping genes to GO termsOnline http://bnet.cs.tau.ac.il/g2g/
SPAGE-FinderPredict SL interactions from TCGA dataOnline https://amagen.shinyapps.io/spage/
SynLeGGPredict SL interactions utilizing multiSEp gene expression clusters to Partition CRISPR essentiality scores and mutations from whole-exome sequencingOnline www.overton-lab.uk/synlegg
SL-BioDPPredict SL interactions from hallmark cancer pathways by mining cancer’s genomic and chemical interactionsOnline https://sl-biodp.nci.nih.gov/sl_index.php
DiscoverSLR package for multiomic data-driven prediction of SL interactions in cancerStandalone https://github.com/shaoli86/DiscoverSL/releases/tag/V1.0
ISLEIdentify the most likely clinically relevant SL interactions by mining TCGA cohortStandalone https://github.com/jooslee/ISLE/
GEMINIIdentify SL interactions from combinatorial CRISPR experimentsStandalone https://github.com/sellerslab/gemini
Fast-SLidentify synthetic lethal sets in metabolic networksStandalone https://github.com/RamanLab/FastSL

Note: SynLeGG, Synthetic Lethality using Gene expression and Genomics; SL-BioDP, Synthetic Lethality BioDiscovery portal.

Tools and applications reviewed in this study Note: SynLeGG, Synthetic Lethality using Gene expression and Genomics; SL-BioDP, Synthetic Lethality BioDiscovery portal.

G2G

G2G is a web server for the human SL interactions prediction published by Almozlino et al. [82]. The web server provides access to predicting phenotypes of paired gene deletions by an improved algorithm based on RF. Followed by submitting a source gene and a target gene, the phenotype for that gene pair can be computed. Furthermore, users can submit only one gene and then G2G returns all predicted interacting genes according to their neighbor relationships in the PPI network.

Synthetic lethality bio discovery portal and discover SL

Synthetic Lethality Bio Discovery Portal is a comprehensive web tool to predict SL [23] interactions from hallmark cancer pathways through mining genetic and chemical interactions in cancer. The web tool was developed by Deng et al. [83] in 2019 based on the previous statistical approach DiscoverSL (refer to section Statistical-based methods). Users can search the web tool from three modes: ‘GENES’ (including 623 commonly mutated cancer genes), ‘CANCER’ (including 18 histology types) and ‘DRUG’. In addition, the ‘INFERRED DRUG SYNERGY’ provides potential synergistic drug combinations.

SPAGE finder

Magen et al. [84] define ‘survival-associated pairwise gene expression states’ (SPAGEs) as pairs of genes whose co-expression levels are related to cell survival. They present a data-driven pipeline named SPAGE finder that identifies 71 946 SPAGEs from TCGA data, spanning 12 distinct types and a small portion of which are SL pairs. They provide a webserver visualizing the SPAGEs identified by the original manuscript, and allowing input or upload a gene list file of comma-separated gene names which will be rendered on the left panel.

Synthetic lethality using gene expression and genomics

Synthetic Lethality using Gene expression and Genomics (SynLeGG) is a web server developed by Wappett et al. [85] in 2021. SynLeGG utilizes MultiSEp algorithm to partition gene expression to discover SL-related characteristics. It predicts genetic dependency relationships including SL spanning 30 tissues and 783 cancer cell lines.

GEMINI

GEMINI [47] is an R package based on the variational Bayesian method to identify genetic interactions from combinatorial CRISPR perturbation studies. Scoring systems related to the individual and combined effects are defined to identify SL interactions.

CHALLENGES AND FUTURE WORK

Traditional genetically targeted cancer therapies normally focus on targeting gene products that are mutated or overexpressed in specific cancer types. However, from a drug discovery perspective, the loss-of-function mutations are much harder to target, and the same is true for several undruggable overexpressed genes. Fortunately, SL provides an avenue for the treatments of these targets as they facilitate the indirect targeting of nondruggable genes through the identification of a second druggable target that can interact with the primary genes [10]. Despite a marked increase in the identification of SL gene pairs, relatively few SL drug candidates have entered into clinic, and the field remains largely in its infancy [4]. Computational methods hold great prospects in this field but still remain some challenges. In this section, we will discuss these challenges and possible work in the future, mainly including biological issues and data and algorithm issues.

Biological issues

Expand the concept of SL

The conventional concept of SL is defined as the interaction between two genes. With a deeper understanding of SL, some studies expand the concept of it.

SL interactions among multiple genes

Most present studies focus on identifying SL interactions between two genes. However, the biological genetic interactions are complex and it is imperative to identify multiple genetic interactions. Kuzmin et al. [86] scored trigenic interactions in ~200 000 yeast and identified 3196 trigenic negative interactions. The global trigenic interaction network is estimated nearly 100-fold larger than the digenic network. Pratapa et al. [22] develop Fast-SL to identify high order SL interactions, including triplets and quadruplets. Prediction of SL interactions between multiple genes may be one of the challenges in the future.

Soft SL

Ryan et al. [87] consider that SL can be divided into hard SL and soft SL. Conventional SL is called hard SL. Soft SL exists between gene A and gene B but can be rescued by other genes. These reverse effects are called synthetic rescue (SR) or synthetic viability (SV). Gu et al. [88] identified candidate SR (SV) pairs by applying a statistical-based method and demonstrated that SR (SV) enables the prediction of drug resistance. The integration of SL and SR (SV) may result in higher reproducibility of SL prediction, thus future work for SL prediction should take SR (SV) into consideration.

Phenotype-centric SL

Conventionally, the mutated genes are utilized to distinguish cancer cells from normal cells and pharmacological inhibition of their partner genes is commonly adopted for SL-based cancer therapies. However, this concept can be extended. Akimov et al. [89] point out that the main determinant of any SL interaction is the phenotype alteration caused by a specific mutation or molecular perturbation. Therefore, considering the polygenic nature of the phenotype, they propose that phenotype might be a more robust differentiating context for SL interactions. SL interaction between WRN gene and microsatellite instability phenotype is an example phenotype-centric SL [90, 91]. The identification of more phenotype-centric SLs is a meaningful work in the future.

SL interactions between two signals

The integrated signaling system is critical for cell survival. Within it, various pathways interact with each other for survival and disrupting signals involved in multiple pathways is a practice of SL [92]. From the perspective of signals, we can get a deeper understanding of the biological mechanism of SL. In this regard, SL interactions can expand from genes to any signals, such as epigenetic regulators [93]. Integrating different types of signals to predict SL interactions would be crucial for future researches.

Expand the application of SL

At present, the main application of SL is still focused on the discovery of new anticancer targets. However, some researches indicate that SL would be applied in a wider range. These studies are explained in this section, which may give reference to the researchers in this field.

Nononcological diseases

SL has been successfully applied to identify anti-cancer targets but has found limited use in other diseases. There have been some researches probed into nononcological diseases, such as bacterial infection [94-97], malaria [98] and virus infection [99]. Computational methods may assist further application of SL in more nononcological diseases in the future.

Biological mechanisms revealed by SL

The essence of SL is a kind of genetic interaction and the analysis of SL can provide mechanistic insight into genes. Lippert et al. predict gene function from SL networks [100]. Guell et al. [101] analyze and categorize SL gene pairs in metabolic networks, and unveil plasticity and redundancy are indispensable mechanisms for biological systems. Cheng et al. analyze the role of SL in cancer risk and their findings support a possible role for SL in tumorigenesis [102]. In the future, as more SL pairs would be discovered, more biological mechanisms about genes will be revealed.

Drug repositioning researches

In spite of various promising computational methods that have been developed to identify SL interactions, drug repositioning researches based on SL have seldom been explored. After all, the ultimate goal of identifying novel SL pairs is to develop novel tumor target therapy. Recently, Zhang et al. [2] develop SLKG, a comprehensive KG aimed at providing the computational basis to tumor therapies based on SL. They demonstrate that SLKG is able to identify the optimal repurposing drugs and drug combinations. Future efforts are expected for these pioneer studies to achieve the clinical translation of SL.

Other application

Some researchers explored wider applications of SL. It is reported that [103, 104] SL interaction may be a new approach in chemoprevention of cancer, but this approach is to a great extent in its infancy. Additionally, Lee et al. [105] developed a precision oncology framework to predict patients’ cancer therapy response based on SL and SR interactions.

Data and algorithm issues

Data quality

The quality of training data is crucial to SL prediction. However, high false-positive and false-negative ratios often be observed in SL data generated by high-throughput screenings. In addition, positive SL samples could be negative under certain conditions. All of the abovementioned issues could lead to label inaccuracy and inconsistency. Besides, the performance of current models used to predict SL interactions is difficult to assess due to a lack of a gold standard source of human SL pairs. Therefore, to preprocess the SL data before applying it in computational models, establishing a gold standard source of SL pairs is necessary.

Sparse data and imbalanced samples

Due to the limited technology, known SL pairs are less than 0.1% of all potential pairs [31], which lead to two issues concerning training data. The first issue is sparse data. When applying these sparse data in ML models, overfitting tends to occur. Cai et al. [31] propose dual forms of dropout in their DDGCN model to avoid overfitting problems. For future work, more SL gene pairs would be identified by the cooperation of biological and computational researchers to address the sparse data. Moreover, computational models fitted better to sparse data should be developed. The second issue is imbalanced samples. The performance of the model deteriorates as the imbalance between the two classes increases. To address this problem, appropriate evaluation metrics should be adopted. Area under precision-recall curve is a more effective metric than area under receiver optimizer characteristics curve when applied on highly skewed tasks [31, 84]. Matthews correlation coefficient [106] has also been successfully used in SL prediction study [24] of which the samples are highly imbalanced. Besides, Li et al. [24] generated pseudopositive SL samples by synthetic minority oversampling technique method, which is designed to generate a number of predefined new samples from samples of minority class [107]. With these studies, we are one step closer to resolve the issue of imbalanced samples, and researchers would be inspired to explore more innovative solutions in the future researches, such as developing computational models that fit the imbalanced SL data better.

Lack of informative features

The mechanism behind SL is complex and cannot be generalized. Li et al. [9] propose a novel SL classification based on the specificity of its biological mechanism, which contains organelle level, pathway level, gene level and conditional SL. Conducting feature selection according to its biological mechanism before training a model is essential. In this way, informative features could be selected, which would help to explore more efficient computational models.

Lack of interpretability

Most ML approaches have not achieved clinical practice owing to lack of interpretability [108]. These models are regarded as ‘black boxes’, which optimize prediction accuracy without understanding the biological mechanisms behind the predicted results [81]. To resolve these difficulties, model interpretation is now a fast-growing subfield of ML methods [109]. Several efforts have been made on this issue for genetic prediction [110, 111]. For SL prediction, interpretable models have not yet been reported. More attention should be paid in developing interpretable models for SL prediction in the future.

Conclusion

Identification of SL gene pairs is imperative as it can provide novel targets for targeted therapy. However, the search space of gene combinations is too large to be investigated experimentally. Computational methods have been advanced to complement experimental approaches, which can reduce the search space of SL gene pairs. This review provides a comprehensive overview of computational methods, databases and tools for SL prediction. It introduces six types of label databases, three types of feature databases, three types of other related databases and six tools for SL prediction. Moreover, four types of computational methods with a detailed description of strengths and weaknesses have been summarized. In addition, we highlight several challenges in this field, some of which may inspire the future researches. Computational methods for SL can accelerate the discovery of novel SL-based targeted cancer therapies. This study reviews six types of label databases, three types of feature databases, three types of other related databases and six tools for SL. The related information and links of all databases are provided. Computational methods including statistical-based methods, network-based methods, classic machine learning methods and deep learning methods are introduced, and their merits and demerits are discussed. The challenges include biological issues and data and algorithm issues. Expanding the concept of SL and expanding the application of SL are discussed in the section of biological issues. In addition, data quality, sparse data and imbalanced samples, lack of informative features and lack of interpretability require further exploration in future studies. Click here for additional data file.
  129 in total

1.  A Platform of Synthetic Lethal Gene Interaction Networks Reveals that the GNAQ Uveal Melanoma Oncogene Controls the Hippo Pathway through FAK.

Authors:  Xiaodong Feng; Nadia Arang; Damiano Cosimo Rigiracciolo; Joo Sang Lee; Huwate Yeerna; Zhiyong Wang; Simone Lubrano; Ayush Kishore; Jonathan A Pachter; Gabriele M König; Marcello Maggiolini; Evi Kostenis; David D Schlaepfer; Pablo Tamayo; Qianming Chen; Eytan Ruppin; J Silvio Gutkind
Journal:  Cancer Cell       Date:  2019-02-14       Impact factor: 31.743

2.  A landscape of synthetic viable interactions in cancer.

Authors:  Yunyan Gu; Ruiping Wang; Yue Han; Wenbin Zhou; Zhangxiang Zhao; Tingting Chen; Yuanyuan Zhang; Fuduan Peng; Haihai Liang; Lishuang Qi; Wenyuan Zhao; Da Yang; Zheng Guo
Journal:  Brief Bioinform       Date:  2018-07-20       Impact factor: 11.622

Review 3.  Hereditary cancer syndromes as model systems for chemopreventive agent development.

Authors:  Farzana L Walcott; Jigar Patel; Ronald Lubet; Luz Rodriguez; Kathleen A Calzone
Journal:  Semin Oncol       Date:  2015-09-07       Impact factor: 4.929

4.  Combinatorial CRISPR-Cas9 Metabolic Screens Reveal Critical Redox Control Points Dependent on the KEAP1-NRF2 Regulatory Axis.

Authors:  Dongxin Zhao; Mehmet G Badur; Jens Luebeck; Jose H Magaña; Amanda Birmingham; Roman Sasik; Christopher S Ahn; Trey Ideker; Christian M Metallo; Prashant Mali
Journal:  Mol Cell       Date:  2018-02-15       Impact factor: 17.970

Review 5.  Synthetic Lethality and Cancer - Penetrance as the Major Barrier.

Authors:  Colm J Ryan; Ilirjana Bajrami; Christopher J Lord
Journal:  Trends Cancer       Date:  2018-09-15

Review 6.  Epigenetic synthetic lethality approaches in cancer therapy.

Authors:  Haoshen Yang; Wei Cui; Lihui Wang
Journal:  Clin Epigenetics       Date:  2019-10-07       Impact factor: 6.551

7.  Genetic Interaction-Based Biomarkers Identification for Drug Resistance and Sensitivity in Cancer Cells.

Authors:  Yue Han; Chengyu Wang; Qi Dong; Tingting Chen; Fan Yang; Yaoyao Liu; Bo Chen; Zhangxiang Zhao; Lishuang Qi; Wenyuan Zhao; Haihai Liang; Zheng Guo; Yunyan Gu
Journal:  Mol Ther Nucleic Acids       Date:  2019-07-17       Impact factor: 8.886

8.  CORUM: the comprehensive resource of mammalian protein complexes-2019.

Authors:  Madalina Giurgiu; Julian Reinhard; Barbara Brauner; Irmtraud Dunger-Kaltenbach; Gisela Fobo; Goar Frishman; Corinna Montrone; Andreas Ruepp
Journal:  Nucleic Acids Res       Date:  2019-01-08       Impact factor: 16.971

9.  G2G: A web-server for the prediction of human synthetic lethal interactions.

Authors:  Yom Tov Almozlino; Iftah Peretz; Martin Kupiec; Roded Sharan
Journal:  Comput Struct Biotechnol J       Date:  2020-04-27       Impact factor: 7.271

10.  Synthetic dysmobility screen unveils an integrated STK40-YAP-MAPK system driving cell migration.

Authors:  Ling-Yea Yu; Ting-Jen Tseng; Hsuan-Chao Lin; Chi-Lin Hsu; Ting-Xuan Lu; Chia-Jung Tsai; Yu-Chiao Lin; I Chu; Chien-Tzu Peng; Hou-Jen Chen; Feng-Chiao Tsai
Journal:  Sci Adv       Date:  2021-07-28       Impact factor: 14.136

View more
  1 in total

1.  SLOAD: a comprehensive database of cancer-specific synthetic lethal interactions for precision cancer therapy via multi-omics analysis.

Authors:  Li Guo; Yuyang Dou; Daoliang Xia; Zibo Yin; Yangyang Xiang; Lulu Luo; Yuting Zhang; Jun Wang; Tingming Liang
Journal:  Database (Oxford)       Date:  2022-08-27       Impact factor: 4.462

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.