
Computational approaches for predicting variant impact: An overview from resources, principles to applications.

Ye Liu1, William S B Yeung1,2, Philip C N Chiu1,2, Dandan Cao1.   

Abstract

One objective of human genetics is to unveil the variants that contribute to human diseases. With the rapid development and wide use of next-generation sequencing (NGS), massive genomic sequence data have been created, making personal genetic information available. Conventional experimental evidence is critical in establishing the relationship between sequence variants and phenotype, but it is generated with low efficiency. Because comprehensive databases and resources presenting clinical and experimental evidence on the genotype-phenotype relationship are lacking, and because variants identified by NGS keep accumulating, many computational tools that predict the impact of variants on phenotype have been developed to bridge the gap. In this review, we present a brief introduction to and discussion of the computational approaches for variant impact prediction. We mainly focus on approaches for predicting the impact of non-synonymous single nucleotide variants (nsSNVs) and categorize them into six classes in an innovative manner. Their underlying rationale and constraints, together with the concerns and remedies raised by comparative studies, are discussed. We also present how the predictive approaches are employed in different research settings. Although diverse constraints exist, computational predictive approaches are indispensable in exploring the genotype-phenotype relationship.
Copyright © 2022 Liu, Yeung, Chiu and Cao.

Keywords:  genotype-phenotype relationship; human genetics; in silico prediction; nonsynonymous variants; variant impact

Year:  2022        PMID: 36246661      PMCID: PMC9559863          DOI: 10.3389/fgene.2022.981005

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.772


1 Introduction

One of the primary goals of human genetics is to discover the genetic variants associated with the onset and progression of human disease. The challenge is a “needle in a haystack” problem: how to pinpoint the potential causative variants from millions of individual variants (Genomes Project et al., 2015) spread over the newly assembled, gapless 3.055 billion-base pair human genome sequence (Nurk et al., 2022). Earlier efforts to achieve this goal, such as linkage analysis and genome-wide association studies, were not effective enough in identifying causative candidates and had poor clinical predictive value (Tam et al., 2019). Over the last decade, next-generation sequencing (NGS) has been extensively utilized in biomedical research because of its substantially reduced cost and its generation of large volumes of data. According to the fact sheets on genomic cost provided by the National Human Genome Research Institute (NHGRI) (KA., 2021), NGS technology achieved a one hundred-fold cost reduction compared to Sanger sequencing, and the price is currently less than $1,000 per human genome. Nowadays, NGS platforms can finish one run within 2 days, producing billions of reads for up to 48 samples (Hu et al., 2021). From the raw NGS data, files in a standard, well-recognized variant format can be generated using an upstream analysis pipeline (Kanzi et al., 2020). However, the downstream step of fishing out disease-causing variants from among ∼50,000 variants from whole-exome sequencing (WES), or even millions of variants from whole-genome sequencing (WGS), is the most challenging part (Eberle et al., 2017; Koboldt, 2020). There are plenty of data resources storing evidence-supported genotype-phenotype relationship information. To a certain extent, clinicians and researchers are able to utilize these records to interpret the formation, progress, diagnosis and treatment of diseases from a genetic perspective.
However, even the most well-recognized databases, such as ClinVar (Landrum et al., 2020), contain only around 14,000 highly confident variants with evidence evaluated by genetic experts, which is a small fraction of the huge number of variants identified by NGS. This situation dramatically reduces the clinical utility of genetics. In addition, it poses great challenges for understanding the differential actions of genes among individuals, populations and species, as well as for deciphering the genotype-phenotype relationship (Orgogozo et al., 2015). To address these issues, computational tools for predicting variant impact have emerged. They can help bridge the gap between the vast amount of genomic data generated and the limited known genetic evidence, and ultimately establish potential genotype-phenotype relationships for newly identified variants. Variant call format (VCF) files store the identified variants and provide the genomic position, nucleotide substitution, assessed quality score, genotype and other relevant information derived from alignment and variant calling (Danecek et al., 2011). Based on this information, variant annotation can map variants to specific genes or transcripts, classify them into different types and infer their likely consequences (Wang et al., 2010; Cingolani et al., 2012; McLaren et al., 2016). Variants causing sequence alteration are mainly categorized into four types: insertion, deletion, single nucleotide variant (SNV) and other substitution, including multiple nucleotide variant (MNV) (Eilbeck et al., 2005). Among them, SNVs are the most frequently identified (Genomes Project et al., 2015; Lek et al., 2016) and annotated (Cunningham et al., 2015). SNVs comprise non-synonymous SNVs (nsSNVs) and synonymous SNVs (sSNVs).
Compared to sSNVs, nsSNVs, which cause an amino acid change according to the codons used in protein translation, are estimated to be present at higher frequency in individuals, with an excess of deleteriousness (Genomes Project et al., 2012). Therefore, in this review, we focus on the computational approaches developed to infer the impact of nsSNVs in coding regions. The database resources utilized by the majority of the predictive methods (which we refer to as predictors throughout this review) are introduced first. Following that, we discuss the underlying motivation and constraints of those predictors, according to which we group them into six categories in an innovative manner. We also present their predictive performance and the concerns raised by assessment studies. Finally, we demonstrate the application performance of the predictors in large-scale studies, as well as their ability to reveal genotype-phenotype associations.
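The distinction between sSNVs and nsSNVs drawn above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a caller who already knows the reference codon and the position of the substitution within it; the function name `classify_coding_snv` is hypothetical and is not taken from any annotation tool cited in this review.

```python
from itertools import product

# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon: str, pos_in_codon: int, alt_base: str) -> str:
    """Label a single-base substitution in a codon (hypothetical helper).

    ref_codon    : reference codon on the coding strand, e.g. "GAG"
    pos_in_codon : 0-based position of the substituted base (0, 1 or 2)
    alt_base     : alternative base at that position
    """
    ref_codon = ref_codon.upper()
    alt_base = alt_base.upper()
    if len(ref_codon) != 3 or pos_in_codon not in (0, 1, 2):
        raise ValueError("expected a 3-base codon and a position in 0..2")
    if alt_base not in BASES or alt_base == ref_codon[pos_in_codon]:
        raise ValueError("alt_base must be a different base from T, C, A, G")

    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    # Same translated residue -> synonymous; any change (including stop) -> non-synonymous.
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"


if __name__ == "__main__":
    print(classify_coding_snv("GAG", 1, "T"))  # Glu -> Val
    print(classify_coding_snv("CTT", 2, "C"))  # Leu -> Leu
```

Real annotation tools additionally handle strand, transcript choice and splice context, which this sketch deliberately ignores.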

2 Database resources for variant predictors

Models are not created out of thin air; rather, they are designed to identify hidden correlations in massive volumes of real data, allowing data to be interpreted and used to generate predictions. Since the launch of the Human Genome Project in the 1990s, various relevant databases and knowledgebases have been established and maintained by academic institutions, organizations, consortia, and communities to collect, store, and retrieve records pertaining to genetic, clinical, and phenotypic information. They provide sufficient accessible evidence to reliably demonstrate the genotype-phenotype association, which explains the functional and pathogenic importance of genetic variations (Johnston and Biesecker, 2013). Databases can be categorized according to their scope, purpose, and scale. Several reviews (Thorisson et al., 2009; Brookes and Robinson, 2015; Zhang et al., 2019; Banck et al., 2021; Katsonis et al., 2022) have provided comprehensive details of the content, usage, comparisons, and limitations of those databases. In this section, we briefly review the most frequently used databases (Table 1) containing sequence information, population-scale data, phenotype ontology, and clinical and experimental evidence.
TABLE 1

Summary of resources for human genotypes and phenotypes relationships.

| Type of data | Name | Full name | Techniques | Type of variants | Targeted diseases | Website | Containing entries (as of writing in June 2022) | Composition | First publication year | Last update (as of writing in June 2022) | Accessibility | Publications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Protein data | UniProt | Universal protein resource | Curated | | General | https://www.uniprot.org/ | 567,483 entries in Swiss-Prot and 231,354,261 entries in TrEMBL | UniProt Knowledgebase, UniProt Reference Clusters, and UniProt Archive | 1997 | 2 February 2021 | Free | UniProt (2021) |
| Protein information | UniProtKB | UniProt Knowledgebase | Curated | | General | https://www.uniprot.org/uniprot/ | | Swiss-Prot and TrEMBL | | 22 November 2021 | Free | UniProt (2021) |
| Protein sequences | UniRef | UniProt Reference Clusters | Curated | | General | https://www.uniprot.org/uniref/ | | UniRef100, 90, 50 | | 29 November 2021 | Free | UniProt (2021) |
| Protein sequences | UniParc | UniProt Archive | Curated | | General | https://www.uniprot.org/uniparc/ | | | | 24 March 2022 | Free | UniProt (2021) |
| Protein, DNA and RNA structural data | PDB | Protein data bank | Structural data from X-ray, NMR, electron microscopy | | General | https://www.rcsb.org/ | 191,565 Biological Macromolecular Structures | | 1971 | 14 June 2022 | Free | Berman et al. (2000) |
| Protein data with thermodynamic parameters | ProThermDB | Thermodynamic Database for Proteins and Mutants | Curated | | General | https://web.iitm.ac.in/bioinfo2/prothermdb/index.html | ∼0.12 million thermodynamic data obtained for different organisms and cell lines, >32,000 entries, ∼20,000 mutations | | 1999 | 22 September 2021 | Free | Nikam et al. (2021) |
| Protein data | ONGene | | Curated | | Cancer | https://ongene.bioinfo-minzhao.org/index.html | 803 oncogenes | | 2016 | | Free | Liu et al. (2017) |
| Protein data | TSGene2.0 | Tumor suppressor gene database | Curated | | Cancer | https://bioinfo.uth.edu/TSGene/ | 1217 human tumor suppressor genes | | 2012 | 4 January 2016 | Free | Zhao et al. (2016) |
| Population data | 1000 Genome Project | | WGS | SNVs, indels | General | https://www.internationalgenome.org/ | Genotypes for 2,504 healthy donor samples from 26 populations | | 2008 | 1 October 2015 | Free | Sudmant et al. (2015) |
| Population data | GnomAD (previously ExAC) | Genome aggregation database | WGS, WES | SNVs, indels | General | https://gnomad.broadinstitute.org/ | 76,156 genomes data of diverse ancestries in v3.1 and 141,456 individuals exomes or genomes data in v2 | | 2014 | 21 January 2022 | Free | Karczewski et al. (2020) |
| Population data | ESP | The NHLBI exome sequencing project | WES | SNVs, indels | Disease-, phenotype-related | https://evs.gs.washington.edu/EVS/ | 6,503 unrelated individual exome data | | 2011 | 23 April 2019 | Free | Fu et al. (2013) |
| Population data | UK Biobank | | | | Disease-, phenotype-related | https://www.ukbiobank.ac.uk/ | 49,960 exome data | | 2006 | 19 March 2019 | Registration fee needed | |
| Population data | UK10K | | WGS, WES | | Healthy and disease-related cohorts | https://www.uk10k.org/ | Nearly 10,000 individuals in UK population | Whole genome, Neurodevelopment, Obesity, Rare Diseases Sample Sets | 2010 | 1 October 2015 | Access control | Consortium et al. (2015) |
| Phenotype and genotype data | OMIM | Online Mendelian Inheritance in Man | Classification | | Disease-, phenotype-, gene-related | https://www.omim.org/ | 26,446 entries, including all known mendelian disorders and over 16,000 genes | | 1960 | 27 May 2022 | Free | Amberger et al. (2019) |
| Phenotype and genotype data | Orphanet | The portal for rare diseases and orphan drugs | Classification | | Disease-, phenotype-related | https://www.orpha.net/ | 6,172 diseases, 5,835 genes | | 1997 | 31 May 2022 | Free | |
| Ontology | HPO | Human phenotype ontology | Classification | | Disease-, phenotype-, gene-related | https://hpo.jax.org/ | >13,000 terms, >156,000 annotations | | 2008 | 14 April 2022 | Free | Kohler et al. (2021) |
| Ontology | GO | Gene ontology | Classification | | Gene-specific | http://geneontology.org/ | 7,510,543 annotations | Molecular Function, Cellular Component, and Biological Process | 2000 | 16 May 2022 | Free | Ashburner et al. (2000); Gene Ontology (2021) |
| Ontology | Mammalian Phenotype Ontology | | Classification | | Phenotype-related | https://bioportal.bioontology.org/ontologies/MP/?p=summary | 14,716 classes | | 2005 | 14 June 2022 | Free | Smith et al. (2005) |
| Genomic data | HGMD | Human gene mutation database | Curated | SNVs, indels | Disease-, phenotype-related | http://www.hgmd.cf.ac.uk/ac/index.php | 352,731 mutation entries | | 1996 | 31 May 2022 | Registration needed | Maffucci et al. (2019) |
| Genomic data | VariBench | A benchmark database for variations | Curated | SNVs, indels | | http://structure.bmc.lu.se/VariBench/index.php | | VariBench datasets include disease-causing missense variations, neutral high frequency SNPs, protein stability affecting missense variations, variations affecting transcription factor binding sites, variations affecting splice sites | 2012 | | Free | Sasidharan Nair and Vihinen (2013) |
| Genomic data | VariSNP | | Curated | SNVs, indels | | http://structure.bmc.lu.se/VariSNP/index.php | 145,435,955 variants | Datasets selected from dbSNP which were filtered for disease-related variants found in ClinVar, Swiss-Prot and PhenCode | 2014 | 16 February 2017 | Free | Schaafsma and Vihinen (2015) |
| Genomic data | dbSNP | Single nucleotide polymorphism database | Curated | SNVs, indels, retroposable element insertions and microsatellite repeat variations | General | https://www.ncbi.nlm.nih.gov/snp/ | 1,085,850,277 refSNP | | 1999 | 26 May 2020 | Free | Sherry et al. (2001) |
| Genomic data | ClinVar | | Curated | SNVs, indels | Disease-, phenotype-, gene-related | https://www.ncbi.nlm.nih.gov/clinvar/ | 1,540,318 unique variation records | | 2013 | 5 May 2022 | Free | Landrum et al. (2020) |
| Genomic data | ClinGen | | Curated | SNVs, indels | Disease-, phenotype-related | https://clinicalgenome.org/ | Unique 3692 variants in unique 2278 genes | | 2013 | 1 April 2022 | Free | Rehm et al. (2015) |
| Genomic data | DoCM | Database of Curated Mutations | Curated | SNVs, indels | Cancer | http://www.docm.info/ | 1,364 variants among 122 disease types | | 2014 | | Free | Ainscough et al. (2016) |
| Genomic data | VKGL | Vereniging klinisch genetische | Curated | SNVs, indels | Disease-, phenotype-related | https://vkgl.molgeniscloud.org/ | 188,502 variants | | 2018 | December 2021 | Free | Fokkema et al. (2019) |
| Genomic data | CIViC | Clinical interpretation of variants in cancer | Curated | SNVs, indels, SVs | Cancer | https://civicdb.org/welcome | 3165 variants, 470 genes with clinical interpretation | | 2015 | 1 May 2022 | Free | Griffith et al. (2017) |
| Genomic data | COSMIC | Catalogue of somatic mutations in cancer | Curated | SNVs, indels | Cancer | https://cancer.sanger.ac.uk/cosmic | 29,399,170 variants, 1,207,190 CNVs, 19,422 fusions | | 2004 | 31 May 2022 | Free | Tate et al. (2019) |
| Genomic data | LOVD3.0 | Leiden open variation database 3.0 | Curated | SNVs, indels | Disease-, phenotype-related | https://www.lovd.nl/3.0/home | 800,780 variants | | 2002 | 17 August 2021 | Free | Fokkema et al. (2021) |
| Genomic data | InSiGHT | The International Society for Gastrointestinal Hereditary Tumours | Curated | SNVs, indels | Gene-specific | http://insight-database.org/ | 35,644 variant entries from 9 genes related to gastrointestinal tumours | Variants are automatically sourced from LOVD3 | 2005 | | Free | Fokkema et al. (2021) |
| Genomic data | HuVarBase | Human variants database | Curated | Missense, nonsense, insertion, deletion | Disease-, phenotype-related | https://www.iitm.ac.in/bioinfo/huvarbase/index.php | 774,863 variants from 18,318 proteins, including 702,048 disease-causing and 72,815 neutral variants | Sources from 1000 Genomes, ClinVar, COSMIC, Humsavar, SwissVar, MutHTP, PROXiMATE | 2018 | 15 October 2018 | Free | Ganesan et al. (2019) |
| Genomic data | DVD | Deafness variation database | Curated | SNVs, indels | Deafness-related | https://deafnessvariationdatabase.org/ | 223 genes | Sources from ClinVar, dbNSFP, gnomAD, VEP, CADD, dbSNP, Population Analysis and others | 2018 | 4 January 2021 | Free | Azaiez et al. (2018) |
| Genomic data | METABRIC | Molecular Taxonomy of Breast Cancer International Consortium | Targeted NGS | SNVs, indels | Breast cancer | Mutation details can be retrieved from https://www.cbioportal.org/study/summary?id=brca_metabric | Mutation data in 173 genes from 2433 primary breast tumor samples and 650 normal controls | Genomic mutation data, copy number aberration (CNA), gene expression and long-term clinical follow-up data | 2012 | | Free | Curtis et al. (2012); Pereira et al. (2016) |
| Genomic data | TCGA-BRCA | | WES | SNVs, indels | Breast cancer | https://portal.gdc.cancer.gov/projects/TCGA-BRCA | Mutation data from WES of 817 Breast Invasive Carcinoma tumor/normal pairs | Genomic mutation data, copy number aberration (CNA), gene expression and long-term clinical follow-up data | 2012 | 8 October 2015 | Free | Ciriello et al. (2015) |
| Genomic data | BRCA1 dataset | | Saturation genome editing assays | SNVs | BRCA1 gene | https://sge.gs.washington.edu/BRCA1/ | 3,893 SNVs located within or near 13 exons that encode for the RING and BRCT domains of BRCA1 (exons 2–5 and 15–23, respectively) | | 2018 | | Free | Findlay et al. (2018) |
| Genomic data | VarCards | | Curated | SNVs, indels | General | http://varcards.biols.ac.cn/ | 110,154,363 SNVs, and 1,223,370 indels in coding regions or splicing sites | Variant-level and gene-level resources | 2016 | 28 June 2020 | Free | Li et al. (2018) |

2.1 Sequence resources

GenBank (Sayers et al., 2022), hosted by the National Institutes of Health (NIH), the European Nucleotide Archive (ENA) (Baker et al., 2000), hosted by the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), and the DNA Data Bank of Japan (DDBJ) (Okido et al., 2022) are the most widely used sequence databases, storing over 2.5 billion nucleotide sequences for over 504,000 formally described species. They serve as a basis for genetic analysis, since aligning clean reads to the reference genome is an indispensable step in NGS analysis. As the sequences of a plethora of species become accessible, protein sequences with 100%, 90%, and 50% identity are grouped into clusters that are stored in informative databases such as UniProt Reference Clusters (UniRef) (Suzek et al., 2007), a branch of the Universal Protein Resource (UniProt) (UniProt, 2021). These clusters are utilized to build multiple sequence alignment (MSA) sets, which form the basis of homologous sequence-based approaches.
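To make concrete how an MSA can carry conservation information, the following Python sketch computes the Shannon entropy of each alignment column, where low entropy indicates a conserved position. This is only an illustration: the toy alignment is hypothetical, gaps are simply ignored, and no sequence weighting is applied, so it does not reproduce the scoring of any specific predictor discussed in this review.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of one MSA column, ignoring gap characters."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0  # assumption: an all-gap column is reported as zero entropy
    total = len(residues)
    return sum(
        -(n / total) * math.log2(n / total)
        for n in Counter(residues).values()
    )


def msa_entropies(msa: list[str]) -> list[float]:
    """Per-column entropy for an alignment given as equal-length strings."""
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy("".join(col)) for col in zip(*msa)]


if __name__ == "__main__":
    # Hypothetical four-sequence alignment: columns 0 and 3 are fully conserved.
    toy_msa = ["MKVL", "MRVL", "MKIL", "MRIL"]
    for i, h in enumerate(msa_entropies(toy_msa)):
        print(f"column {i}: entropy = {h:.3f} bits")
```

A substitution at a zero-entropy column would, under the homology assumption, be treated as more suspicious than one at a variable column.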

2.2 Population resources

Several worldwide population projects exist, including the NCBI dbSNP (Smigielski et al., 2000), the 1000 Genomes Project (1KGP) (Sudmant et al., 2015), HapMap (International HapMap, 2003), UK10K (Consortium et al., 2015), the Genome Aggregation Database (gnomAD) (Karczewski et al., 2020), and the NHLBI GO Exome Sequencing Project (ESP) (Fu et al., 2013). With their progress and completion, their reports are now public and offer an exquisite view of the landscape of human genetic variants, ranging from common to extremely rare ones. They also provide valuable information that allows variants to be examined between and within subpopulations of different ethnicities or disease status, such as heart, lung and blood disorders. Furthermore, the minor allele frequency (MAF) obtained from these databases is often a useful indicator for variant prioritization and serves as an important feature for building prediction models.
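As a minimal sketch of how MAF can be used for prioritization, the Python snippet below derives the MAF of a biallelic site from allele counts and flags rare variants. The function names and the 1% cutoff are assumptions chosen for illustration only; appropriate thresholds depend on the disease model and the reference population.

```python
def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """MAF of a biallelic site from the alternate allele count.

    total_alleles is the number of called alleles (2 x diploid samples).
    """
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        raise ValueError("need 0 <= alt_count <= total_alleles and total_alleles > 0")
    alt_freq = alt_count / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def is_rare(alt_count: int, total_alleles: int, threshold: float = 0.01) -> bool:
    """Flag a variant as rare; the 1% default is an illustrative assumption."""
    return minor_allele_frequency(alt_count, total_alleles) < threshold


if __name__ == "__main__":
    # Hypothetical allele counts, not taken from any database.
    print(minor_allele_frequency(5, 1000), is_rare(5, 1000))
    print(minor_allele_frequency(300, 1000), is_rare(300, 1000))
```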

2.3 Phenotype resources

Phenotype databases describe phenotypes and illnesses in conjunction with genetic information. The most widely known are OMIM (Online Mendelian Inheritance in Man) (Amberger et al., 2019) and Orphanet (Ayme et al., 1998). Their goal is to offer high-quality information on common and rare diseases or phenotypes in order to comprehensively review the genotype-phenotype association. To assist the investigation of connections between phenotypes and genes and to describe diseases in an algorithm-friendly data structure, ontology databases such as the Human Phenotype Ontology (HPO) (Robinson et al., 2008), the Mammalian Phenotype Ontology (Smith et al., 2005), and the Gene Ontology (GO) (Ashburner et al., 2000) were developed. They are designed to annotate clinical phenotypes and genes with well-structured, computation-friendly, precise, and accurate terminology. Overall, these databases provide valuable insights for the prioritization and interpretation of genetic data.

2.4 Clinical genetic resources

Several databases curate genetic data together with clinical significance information. These databases are also known as Locus-Specific Databases (LSDBs). Data and entries are often curated from the literature and clinical trials. LSDBs range in scale from a single gene with roughly 4000 variants (Findlay et al., 2018) to hundreds of millions of variants (Schaafsma and Vihinen, 2015). The goal of LSDBs is to unambiguously and accurately define and categorize genotype-phenotype correlations, to understand gene functions and effects, to provide a map of genetic distribution across populations and diseases, and to assist clinicians and diagnostic laboratories in conducting further validation assays by providing detailed information on the molecular characteristics, pathogenicity, and effects of variants (Greenblatt et al., 2008). A well-curated and annotated LSDB is a valuable resource for constructing and evaluating prediction models. Bear in mind, however, that the same variant may appear in different LSDBs, sometimes with contradictory classifications of clinical impact, because of inconsistent rules and the subjective opinions of different curators. Phenotype- or disease-specific LSDBs have also been established, such as DVD (Deafness Variation Database) (Azaiez et al., 2018) for deafness, RAPID (Resource of Asian Primary Immunodeficiency Diseases) (Keerthikumar et al., 2009) for primary immunodeficiency diseases, InSiGHT (The International Society for Gastrointestinal Hereditary Tumors) (Fokkema et al., 2021) for gastrointestinal tumors, and fabry-database.org (Saito et al., 2011) for Fabry disease. Thanks to the effort of the Leiden Open source Variation Database (LOVD) (Fokkema et al., 2021) platform, a comprehensive list of public LSDBs is presented with details that allow researchers and clinicians to retrieve gene and mutation information from different resources.
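Because overlapping variants with contradictory classifications can contaminate training and benchmark sets, it can help to flag them before model building. The Python sketch below shows one simple way to do so; the record layout, variant identifiers and source names are hypothetical placeholders, and real data would first require harmonizing variant nomenclature and classification vocabularies across databases.

```python
from collections import defaultdict
from typing import Iterable


def find_conflicts(
    records: Iterable[tuple[str, str, str]],
) -> dict[str, dict[str, str]]:
    """Return variants whose classification differs between sources.

    Each record is (variant_id, source_name, classification). Labels are
    compared case-insensitively; no other normalization is attempted.
    """
    by_variant: dict[str, dict[str, str]] = defaultdict(dict)
    for variant_id, source, classification in records:
        by_variant[variant_id][source] = classification.strip().lower()
    return {
        variant_id: calls
        for variant_id, calls in by_variant.items()
        if len(set(calls.values())) > 1
    }


if __name__ == "__main__":
    # Hypothetical records; "db_a" and "db_b" stand in for two curated resources.
    toy_records = [
        ("var1", "db_a", "Pathogenic"),
        ("var1", "db_b", "Benign"),
        ("var2", "db_a", "Benign"),
        ("var2", "db_b", "benign"),
    ]
    for variant_id, calls in find_conflicts(toy_records).items():
        print(variant_id, calls)
```

Whether conflicting variants are then excluded or resolved by a review rule is a design choice that this sketch leaves open.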

3 Various variant predictors

Each predictor has a unique biochemical or biological basis. It is important to remember that the outcomes of predictors built on different bases have different implications. The terms “dangerous,” “pathogenic,” “conservative,” and “damaging” do not necessarily denote causality for a specific phenotype or condition. Knowing the principles and drawbacks of each type of predictor aids in correctly interpreting the variants. Variant impact predictors can be categorized in different ways: machine-learning (ML) and non-ML models, based on the algorithms used; homologous sequence-based and structure-based models, based on the features used in prediction; and supervised and unsupervised ML models. Unlike the sequence-, structure- and meta-methodology categorization used in other reviews (Hassan et al., 2019b; Yazar and Ozbek, 2021), we introduce an innovative categorization here based on the characteristics and included features of each type (Figure 1). We discuss these categories by outlining the rationale behind the predictors and provide an overview of their constraints. Later, we discuss predictor performance evaluation and underline current concerns and remedies. Details for each tool are presented in Supplementary Table S1 and Table 2.
FIGURE 1

Summarized workflow of variant impact predictors. The protein structure and protein features of the BRCA1 BRCT mutant M1775K are retrieved from published studies (Birrane, 2006; Tischkowitz et al., 2008). The minor allele frequency (MAF) information of variant rs41293463 (chr17-43051071-A-C (GRCh38)) was retrieved from gnomAD (Genome Aggregation Database).

TABLE 2

Representative disease-, phenotype-, and gene-specific variant impact predictors.

| Characteristic category | Name | Type of variants | Targeted disease/phenotype/gene | # of genes | Website | Distribution (web-server/stand-alone) | First publication | Programming language | Algorithm/model | Features | Dataset for modeling | Classification index | Classification | Additional data | Publication |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Meta-predictor | VIPPID (Variant Impact Predictor for PIDs) | Missense | Primary immunodeficiency (PID) diseases | 146 | https://mylab.shinyapps.io/VIPPID/ | Web and stand-alone | April 2022 | Perl, R | Conditional Inference Forest | 85 features including AA, exonic, protein structural, conservation, and 20 pre-existing prediction tools | 4,865 disease-associated variants from Asian Primary Immunodeficiency Diseases (RAPID) database, HGMD and ClinVar; 4,237 neutral variants from gnomAD | Classifier | Pathogenic/non-pathogenic | 26 reviewed P/LP variants of known PID pathogenic genes from 1318 patients cohort and 39 validated in-house variants | Fang et al. (2022) |
| Meta-predictor | CanPredict | Missense | Cancer | | http://www.canpredict.org/ or http://www.cgl.ucsf.edu/Research/genentech/canpredict/ (both are not accessible) | | May 2007 | R | RF | SIFT, Pfam-based LogR.E-value and GO Similarity Score (GOSS) metrics | | Classifier | Likely cancer/likely non-cancer/not determined | | Kaminker et al. (2007) |
| Meta-predictor | PolyPhen-HCM | Missense | Hypertrophic cardiomyopathy | 6 | http://genetics.bwh.harvard.edu/hcm/ | Pre-computed results | February 2011 | | Naïve Bayes classifier | Prediction scores, protein structure comparison score | 74 curated variants from the literature and manually classified by Laboratory for Molecular Medicine standard variant-assessment pipeline (41 pathogenic, 26 benign) | Classifier | Pathogenic/benign/no call | | Jordan et al. (2011) |
| Meta-predictor | CardioBoost | Missense | Cardiomyopathies and arrhythmias | 22 | https://www.cardiodb.org/cardioboost/ | Pre-computed results | October 2020 | R | 2 Adaptive Boosting (Adaboost) classifiers | 76 functional features | CM datasets: 356 rare P/LP variants from 9,007 clinical CM patients, 302 rare missense variants in CM genes from 2,090 healthy controls. Inherited arrhythmia dataset: 252 P/LP in arrhythmia-associated genes from ClinVar, 237 rare missense variants in arrhythmia genes from 2,090 healthy controls | Pathogenicity score | Disease-causing/VUS/Benign | 4 datasets from ClinVar, HGMD, Oxford Medical Genetics Laboratory (OMGL), a large registry of HCM patients, SHaRe | Zhang et al. (2021) |
| Multiple features | GENESIS (GENe-specific EnSemble grId Search) | Variants of uncertain clinical significance | Catecholaminergic polymorphic ventricular tachycardia and long QT syndrome (LQTS) | 4 | https://github.com/rachellea/medgenetics | Stand-alone and pre-computed results | March 2022 | Python | Logistic regression and multilayer perceptron model | 8 kinds of features including AA features, domain, conservation, rate of evolution, signal-to-noise ratio, and a position-specific scoring matrix (PSSM) score | 717 pathogenic variants and 3,164 benign variants curated from the literature | Probabilities of pathogenicity | Pathogenic/VUS/benign | 925 VUS classified according to ACMG | Draelos et al. (2022) |
| Multiple features | CACNA1F-vp | Missense | X-linked incomplete Congenital Stationary Night Blindness (iCSNB) | 1 | https://github.com/shalawsallah/CACNA1F-variants-analysis | Stand-alone | April 2020 | Python | Logistic regression model | Variant-level features and structural features | 72 disease-implicated from HGMD or MGDL database, 322 benign variants from gnomAD | Probabilities of pathogenicity | Pathogenic/benign | - | Sallah et al. (2020) |
| Optimized PON-P2 | PON-MMR2 | AA substitution | Mismatch repair (MMR) | 4 | http://structure.bmc.lu.se/PON-MMR2/ | Web and stand-alone | September 2015 | R | RF | 5 features: sequence conservation, physical and biochemical properties of AA | 109 pathogenic, 99 neutral, 354 VUS from InSiGHT database and VariBench | Probabilities of pathogenicity | Pathogenic/VUS/benign | 354 VUS dataset | Niroula and Vihinen (2015) |
| Optimized MAPP | CoDP (Combination of Different Properties of MSH6 protein) | Missense | Lynch syndrome (LS) | 1 | http://cib.cf.ocha.ac.jp/CoDP/ | Web | April 2013 | | Logistic regression model | MSA, phylogenetic tree, structural properties, MAPP, SIFT, PolyPhen2 | 294 missense variants from InSiGHT, MMRUV, UniProt, dbSNP, ESP, HapMap Project, 1KGP and literature | Probabilities of pathogenicity | Likely LS/Unlikely LS | 260 unclassified variants dataset | Terui et al. (2013) |
| Meta-predictor with MAF as features | DvPred | nsSNVs | Genetic hearing loss (HL) | 157 | https://github.com/WCH-IRD/DVPred/tree/main/DVPred_score | Stand-alone and pre-computed results | February 2022 | Python | Gradient boosting decision tree (GBDT) | 65 features include conservation scores, prediction scores, MAF, gene intolerance scores and other features | 1,318 P/LP and 4,628 B/LB from China Deafness Genetics Consortium (CDGC), Deafness Variation Database (DVD), ClinVar, HGMD | DvPred score | Deleterious/neutral | 463 pathogenic and 454 benign variants from new version of CDGC and ClinVar | Bu et al. (2022) |
| Meta-predictor | NBDriver | Missense | Cancer | 58 | https://github.com/RamanLab/NBDriver | Stand-alone | May 2021 | Python | RF, extra trees (ET) classifier, generative KDE classifier | 3 types of features: one-hot encoding, overlapping k-mers, 27 genomic features | 5,265 disease-associated variants from five literature sources | Classifier | | | Banerjee et al. (2021) |
| Combination of rule-based and meta-predictor | CancerVar | Exon variants, CNVs, indels | Cancer | 1911 | https://cancervar.wglab.org/index.php | Web, stand-alone and pre-computed results | May 2022 | Python | Semi-supervised generative adversarial network used in scoring method OPAI | 12 clinical evidence prediction scores and 23 precomputed scores by other computational tools | 13 million variants from 7 cancer knowledgebases | OPAI score | Oncogenic/benign | 4 datasets from OncoKB and CIViC, IARC and literature | Li et al. (2022) |

*VUS, variant of uncertain significance.


3.1 Types of tools and their principles

3.1.1 Homologous sequence-based predictors

This class of predictors is derived from comparative genomics. The assumption is straightforward: under natural selection, amino acid changes at conserved positions, as determined by homologous sequence searching across species, are more likely to be “deleterious” than changes at non-conserved positions, which are deemed “tolerant” (Cooper and Shendure, 2011). Methodologically, these predictors first construct the multiple sequence alignment (MSA), either by grouping multiple protein sequences with a given similarity from BLAST alignment (Altschul et al., 1990), or by retrieving custom-selected sequences from the aforementioned genomic databases (Section 2.1) for multiple alignment using MULTIZ (Blanchette et al., 2004) or MUSCLE (Edgar, 2004). Based on the MSA, a position-specific scoring matrix (PSSM) (Gribskov et al., 1987) is computed to generate the prediction outcome as a probability score (Ng and Henikoff, 2001), a likelihood ratio (Chun and Fay, 2009), the average distance between the targeted species and others in subfamilies (Choi et al., 2012), or an entropy difference (Reva et al., 2007; Hopf et al., 2017). The predictive outcomes are normally continuous values, with a threshold recommended by the designers and validated on mutation datasets. Apart from computing scores using empirically rational equations, ML algorithms are commonly utilized as classifiers. Classical models include the support vector machine (SVM) (Capriotti et al., 2006) and the hidden Markov model (HMM) (Thomas et al., 2003; Siepel et al., 2005; Garber et al., 2009; Pollard et al., 2010; Shihab et al., 2013). Although they are both ML techniques, the attributes they employ are distinct. For example, PhD-SNP (Capriotti et al., 2006) converts the MSA and mutation into a 40-element feature vector for an SVM. The 40 features are composed of two parts: the first 20 elements explicitly define the mutated residues, with -1 for the wild-type residue, 1 for the mutant residue, and 0 for the others.
The second set of 20 elements represents the sequence environment of the mutation, namely the frequency of each of the 20 amino acid residues within a window of 20 amino acids centered on the targeted site. Unlike an unweighted and balanced MSA, an HMM is a probabilistic profile of the MSA that captures position-specific information (Krogh et al., 1994). Two different configurations of HMM have been used. One assumes three hidden states, “match,” “insertion,” and “deletion,” to build a profile-HMM of the MSA (Thomas et al., 2003; Shihab et al., 2013), while the other considers two hidden states, “conserved” and “non-conserved,” according to the phylogenetic information from tree topologies (Siepel et al., 2005; Garber et al., 2009; Pollard et al., 2010). More recently, a novel unsupervised ML model has been utilized to discover patterns and correlations between absolute locations in the MSA, allowing direct observation of both conservation and coevolution (Riesselman et al., 2018; Frazer et al., 2021). This deep generative model captures the latent structure of the MSA using variational autoencoders (VAEs), which proved to be an outstanding model for separating the β-lactamase protein family at the phylum level (Detlefsen et al., 2022). By assuming that the observed data s are generated from a latent variable z, the decoder part of the VAE models the conditional probability p(s|z). Correspondingly, the encoder part is a neural network that models the approximate posterior distribution q(z|s) (Riesselman et al., 2018; Frazer et al., 2021). The predictions of ML models are normally given as log odds ratio scores between the probabilities of “substitution” and “wild-type,” or of “conserved” and “non-conserved.” In other words, relative to the wild-type or neutral model, a higher score indicates a less expected substitution and therefore a position under stronger evolutionary constraint. There are two considerations regarding homologous sequence-based predictions (Eilbeck et al., 2017).
Firstly, many known disease-causing alleles that reside in poorly conserved or non-conserved regions will be falsely classified as neutral by these predictors. Secondly, the tools are inadequate for predicting the effect of stop-gain and frameshift variants, since such variants are not represented in the sequences of other organisms in the MSA (Eilbeck et al., 2017). Stop-gain and frameshift variants are rated as “HIGH” impact on the biological sequence by annotation tools, e.g., VEP (McLaren et al., 2010) and SnpEff (Cingolani et al., 2012), but the impact on the protein is not always concordant. Such changes seem to be tolerated, especially when they are located near the C-terminus of the protein (MacArthur et al., 2012). Some frameshift variants, even in the homozygous state, are frequently observed in the population, suggesting a limited impact on human health (Eilbeck et al., 2017). Thus, additional information such as protein structure might help improve the predictive power and efficiency of the predictors, which will be discussed in more detail in the following subsections.
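
The 40-feature encoding described above for PhD-SNP can be illustrated with a minimal sketch. This is not the authors' implementation: the residue ordering, the function names, the use of a single sequence instead of an MSA, and the exact window boundaries (including truncation at the sequence ends) are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues in a fixed (arbitrary) order


def encode_mutation(wild_type: str, mutant: str) -> list:
    """First 20 features: -1 at the wild-type residue, 1 at the mutant, 0 elsewhere."""
    if wild_type == mutant:
        raise ValueError("wild-type and mutant residues must differ")
    return [-1 if aa == wild_type else 1 if aa == mutant else 0 for aa in AMINO_ACIDS]


def encode_environment(sequence: str, position: int, window: int = 20) -> list:
    """Second 20 features: residue frequencies in a window around `position` (0-based).

    Assumption: the window spans position - window // 2 up to (not including)
    position + window // 2 and is truncated at the sequence ends.
    """
    half = window // 2
    segment = sequence[max(0, position - half):position + half]
    return [segment.count(aa) / len(segment) for aa in AMINO_ACIDS]


def encode_variant(sequence: str, position: int, mutant: str) -> list:
    """Concatenate both parts into a 40-element input vector."""
    wild_type = sequence[position]
    return encode_mutation(wild_type, mutant) + encode_environment(sequence, position)


# Hypothetical 22-residue sequence with an A-to-V substitution at index 3.
features = encode_variant("MKTAYIAKQRQISFVKSHFSRQ", 3, "V")
print(len(features))  # 40
```

A vector of this kind would then be passed to a classifier such as an SVM; the frequencies sum to 1 only when the window contains standard residues exclusively.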

3.1.2 Structure-based predictors

Apart from the primary structure of a protein, its folding and stability are also essential for normal protein function. Early findings of variants that affect protein structure and lead to aberrant phenotypes date back to the 1950s, when an amino acid substitution in the half molecule of hemoglobin was discovered to cause sickle cell anemia (Ingram, 1957). Since then, thousands of mutations (Giardine et al., 2014) have been described that affect the function (increased (Jones et al., 1979) or decreased (Bonaventura and Riggs, 1968) oxygen affinity), stability (Martinez et al., 1977) and conformation (Moo-Penn et al., 1988) of hemoglobin. Missense variants can also affect protein expression (Haraksingh and Snyder, 2013), post-translational modification (Kim et al., 2015) or binding affinity (Pires et al., 2015; Morningstar-Kywi et al., 2021). An estimated ∼75% of disease-causing variants directly lead to protein destabilization, making protein stability the major contributor to disease pathology (Yue et al., 2005), whereas ∼7% of variants in disease datasets also have a functional role (Yue et al., 2014). Disease-causing mutations are not uniformly distributed. In comparison to polymorphisms, they predominantly affect the core of the protein, and ∼70% are found in structurally and functionally essential regions (Sunyaev et al., 2000; de Beer et al., 2013). Protein-protein interfaces are hot spots for disease-causing nsSNVs (David et al., 2012; Petukh et al., 2015). Within interfaces, disease-causing variants were 49% more likely (interface core vs interface rim odds ratio (OR) 1.49, 95% CI 1.24–1.80, p < 0.00001) to be found in the interface core than in the rim, possibly owing to differences in energy contribution to protein stability and in physicochemical and evolutionary properties (David and Sternberg, 2015). 
Typically, the impact of nsSNVs on protein stability is estimated by computing the difference in Gibbs free energy change (∆∆Gf) caused by an amino acid substitution. Physical effective energy functions, statistical potential functions, and empirically defined potential functions are the three types of energy-computing methodologies (Guerois et al., 2002). Because the first is computationally intensive, the latter two are more frequently used. Structure-based predictors of protein stability mainly rely on empirical potentials that integrate physical and statistical structure-related energy components (Guerois et al., 2002), and on ML techniques (Dehouck et al., 2009; Laimer et al., 2015). In theory, these approaches should give greater insight into mutation effects than homologous sequence-based predictors, since they are built on the direct impact of a mutation on protein structure and function. In practice, however, structure-based predictors are still limited by the imbalance and intrinsic variability of the thermodynamic data and by their prediction performance (Sanavia et al., 2020). On one hand, although the Protein Data Bank (PDB) (Berman et al., 2000) contains over 50,000 human protein records, many of them are redundant, and together they cover only 70% of the reference human proteome at a sequence identity level higher than 30% (Somody et al., 2017). The development of AlphaFold2 (Tunyasuvunakool et al., 2021) has to some extent increased protein structure coverage, but its capability to predict the impact of single mutations is questionable (Pak et al., 2021; Buel and Walters, 2022). On the other hand, sequence-based techniques, under certain circumstances, outperform structure-based stability prediction tools (Hoie et al., 2022). Thus, combining sequence with structural information may help improve the prediction of variant impact.
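
For reference, the stability change mentioned above can be written as follows. The superscripts and the sign convention (mutant minus wild type) are assumptions, and the subscript f is read here as "folding"; individual tools may report the opposite sign.

```latex
\Delta\Delta G_f = \Delta G_f^{\mathrm{mut}} - \Delta G_f^{\mathrm{wt}}
```

Under this convention, and with the folding free energy negative for a stable protein, a positive value corresponds to a destabilizing substitution.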

3.1.3 Sequence and structure combination-based predictors

Approaches in this category consider both the previously described homologous sequence information and protein structure information. Predictions benefit from the combination of homologous sequence information (e.g., conservation scores) and structural features, such as hydropathy, polarity, backbone angles and electrostatic interactions, supplemented with energy features and biochemical features such as the solvent-accessible surface area of the interface (Kulshreshtha et al., 2016). These features are sometimes transformed or selected before model training to achieve high prediction efficiency, and hundreds of features may be incorporated into the final model (Niroula et al., 2015). Algorithmically, supervised ML approaches including SVM (Calabrese et al., 2009; Li et al., 2009), naïve Bayes classifier (Adzhubei et al., 2010), neural network (NN) (Hecht et al., 2015), random forest (RF) (Carter et al., 2013; Niroula et al., 2015) and boosted tree regression (Zhou et al., 2016) are commonly applied in these multi-feature predictors.

3.1.4 Meta-predictors

Meta-predictors are tools that make predictions by integrating the results of pre-existing predictors. The term “meta-” sometimes corresponds to the term “consensus” in other studies (Bendl et al., 2014). The basic idea behind meta-predictors is to leverage the potentially complementary performance of selected predictors in classifying variants. Meta-predictors offer two main improvements over the aforementioned counterparts. First, they provide a comprehensive evaluation of the selected pre-existing tools. Each predictor has its own metric and scale, which makes scores difficult to compare across predictors and hinders their simultaneous use. Meta-predictors interpret the scores from the selected tools in their own way, by transforming them to a comparable range as normalized scores (Bendl et al., 2014) or to binary values (Gonzalez-Perez and Lopez-Bigas, 2011). Second, meta-predictors can improve prediction performance by integrating prediction scores from different predictors, which helps avoid the bias and poor generalization of single predictors (Kircher et al., 2014). For missing values, where some of the pre-existing tools fail to produce a prediction, some meta-predictors impute them using the deleterious/neutral threshold (Capriotti et al., 2013), the average score (Kircher et al., 2014; Quang et al., 2015), a fixed score (Quinodoz et al., 2022), the maximal pathogenic score (Jagadeesh et al., 2016), or a flexible imputation using the average value of the k-nearest neighbors (Ioannidis et al., 2016) or Bayesian principal component analysis (BPCA) (Dong et al., 2015). There is currently no gold standard for imputation. Although machine-learning imputation appears to be more accurate (Brock et al., 2008; Wei et al., 2018), meta-predictor builders reported that missing values account for less than 10% of their training and testing datasets (Dong et al., 2015), so the choice of imputation method makes little difference. 
Although performance studies suggest that meta-predictors surpass their counterparts (Tian et al., 2019), concerns regarding circularity have been raised, which are discussed in Section 3.3.
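
The two preprocessing steps described above, rescaling component scores to a comparable range and imputing missing predictions, can be sketched as follows. The three tools, their score ranges and orientations are hypothetical, and the unweighted average used as the final score is only a placeholder assumption: published meta-predictors typically feed the processed scores into a trained ML model.

```python
from statistics import mean

# Hypothetical score ranges and orientations for three component tools.
TOOL_SCALES = {
    "tool_a": {"low": 0.0, "high": 1.0, "higher_is_deleterious": False},
    "tool_b": {"low": 0.0, "high": 1.0, "higher_is_deleterious": True},
    "tool_c": {"low": -10.0, "high": 10.0, "higher_is_deleterious": True},
}


def normalize(tool, raw):
    """Rescale a raw score to [0, 1] so that 1 always means 'most deleterious'."""
    scale = TOOL_SCALES[tool]
    unit = (raw - scale["low"]) / (scale["high"] - scale["low"])
    return unit if scale["higher_is_deleterious"] else 1.0 - unit


def impute_with_mean(rows):
    """Replace missing (None) scores with the mean of the observed scores of that tool.

    Raises statistics.StatisticsError if a tool has no observed score at all.
    """
    tool_means = {
        tool: mean(row[tool] for row in rows if row[tool] is not None)
        for tool in TOOL_SCALES
    }
    return [
        {tool: row[tool] if row[tool] is not None else tool_means[tool] for tool in TOOL_SCALES}
        for row in rows
    ]


def consensus_scores(raw_rows):
    """Normalize, impute, then average the component scores of each variant."""
    normalized = [
        {tool: None if row.get(tool) is None else normalize(tool, row[tool]) for tool in TOOL_SCALES}
        for row in raw_rows
    ]
    return [mean(row.values()) for row in impute_with_mean(normalized)]
```

Mean imputation is only one of the options listed above; a threshold value, a fixed score or a k-nearest-neighbor estimate could be substituted inside `impute_with_mean` without changing the rest of the sketch.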

3.1.5 Combining population data

A polymorphism is defined as an alteration in DNA sequence found in the general population at a minor allele frequency (MAF) greater than 1%. According to the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) guidelines for clinical variant interpretation, a variant with >5% MAF is considered stand-alone evidence for a benign interpretation for a rare Mendelian disorder (Richards et al., 2015). This is supported by the “neutral theory”, which defines neutral variants as those established in the population through random genetic drift, causing neither harmful nor beneficial effects on the survival of individual organisms (Kimura, 1979). When training and validating predictors, variants with an allele frequency higher than a specified cutoff (e.g., 5% or 1%) in population-scale databases are usually labeled as benign or neutral. However, predictors differ greatly in their capacity and specificity for predicting neutral variants. For example, PON-P2 (Niroula et al., 2015) had a specificity of 95%, while the poorest predictor incorrectly categorized more than one-third of polymorphisms as disease-causing (Niroula and Vihinen, 2019). Classifying the impact of variants according to their MAF has been further debated under different hypotheses, including “rare variant for Mendelian disease” (Pritchard and Cox, 2002), “common disease, common variant” (CDCV) and “common disease/rare variant” (CDRV). Researchers now have access to an exquisitely detailed view of the landscape of common and rare human genetic variants. Another issue to be careful with when using MAF is that it depends largely on the population size and varies among subpopulations, leading to population stratification (Eilbeck et al., 2017). For example, rs79444516, which is common in the African population (13%), is extremely rare in European and Asian populations, with MAF <0.05%. When estimated in the mixed population, the MAF is 1.2%, which can confound classification. 
The variation in MAF for the same variant caused by differences in sample size could be largely mitigated with the completion of huge population-scale projects. To make better use of MAF in prediction, ClinPred (Alirezaie et al., 2018), a meta-predictor using an ML approach, employs MAFs from diverse populations as part of its features, instead of classifying variants based on a single arbitrary MAF cutoff. Together with feature scores from 16 pre-existing tools, ClinPred is trained on clinically curated pathogenic and benign datasets and outperforms other meta-predictors when applied to datasets of rare diseases and cancer (Alirezaie et al., 2018). Therefore, MAFs from population data can enhance prediction. Similarly, a growing number of tools (Chennen et al., 2020; do Nascimento et al., 2020; Lai et al., 2020; Li et al., 2020) integrate MAFs as predictive features and achieve competitive performance in pathogenicity prediction.
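
The stratification issue illustrated with rs79444516 can be made concrete with a small calculation. In the sketch below, the per-population frequencies follow the text (using 0.05% as a stand-in for "<0.05%"), while the chromosome counts are hypothetical and were chosen only so that the pooled value lands near the 1.2% quoted above; the 1% cutoff and the function names are likewise assumptions.

```python
def pooled_maf(populations):
    """Sample-size-weighted allele frequency across populations.

    `populations` maps a population name to (maf, number_of_chromosomes).
    """
    minor_alleles = sum(maf * n for maf, n in populations.values())
    chromosomes = sum(n for _, n in populations.values())
    return minor_alleles / chromosomes


def label_by_cutoff(maf, cutoff=0.01):
    """Label a variant with a single, arbitrary MAF cutoff."""
    return "common" if maf > cutoff else "rare"


# Frequencies from the text; chromosome counts are hypothetical.
populations = {
    "African": (0.13, 1_000),
    "European": (0.0005, 5_000),
    "Asian": (0.0005, 5_000),
}

mixed = pooled_maf(populations)
for name, (maf, _) in populations.items():
    print(name, label_by_cutoff(maf))
print("mixed", round(mixed, 4), label_by_cutoff(mixed))
```

With these counts the same variant is labeled differently depending on which frequency is consulted, which is the motivation for supplying population-specific MAFs as separate features rather than applying one cutoff to a pooled value.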

3.1.6 Disease-, phenotype-, gene-specific predictors

The ultimate goal of variant prediction tools is to accelerate the development of precision medicine. The majority of the strategies discussed above aim to estimate disease occurrence based on the assumption that changes in protein function lead to a decrease in organismal fitness (Boucher et al., 2016). They are trained on large-scale datasets in a genome-wide and pan-disease manner, neglecting the differences among diseases and making the prediction results suboptimal (Dorfman et al., 2010). Therefore, with the need to precisely estimate the impact of variants on a specific disease or phenotype, a class of disease-, gene-, and phenotype-specific predictors has emerged. The phenotype-targeting predictors range widely from common cardiac (Zhang et al., 2021), cancer (Kaminker et al., 2007), and neurodegenerative diseases (Ahmed et al., 2015), to rare diseases, such as methylmalonic acidemia (Peng et al., 2019), X-linked incomplete congenital stationary night blindness (Sallah et al., 2020) and Pompe disease (Adhikari, 2019). More details are presented in Table 2. There are over 13,000 terms defined in HPO. While LSDBs such as COSMIC, CIViC and OncoKB for cancer provide benchmarked variant datasets that can be used to construct disease-specific predictors, limited datasets are available for the majority of phenotypes. For a particular disease or phenotype, the training and validation datasets can be prepared by curating variants and genes from the literature, or by re-analyzing unpublished sequencing data from case-control cohorts, followed by manual classification using recognized guidelines such as ACMG/AMP. The scale of the curated databases for different diseases/phenotypes varies in gene number (from one to hundreds) and variant number (from thousands to millions), which largely depends on the number of relevant publications. With the curated databases, most predictors in this category use ML methodology. They can be grouped into three classes. 
The first class overlaps with the previously mentioned categories but has a distinct characteristic. It includes sequence and structure combination-based predictors (Sallah et al., 2020; Draelos et al., 2022), which share the same strategy as the predictors in Section 3.1.3, and meta-predictors (Jordan et al., 2011; Bu et al., 2022), which are similar to those discussed in Section 3.1.4. The distinct characteristic is the selection of training and validation datasets. In addition, because the disease-related genes are known, a sub-model can be constructed for each gene, resulting in better prediction performance (Fang et al., 2022). The second class aims to optimize a pre-existing predictor, usually a sequence-based model, by reconstructing the MSAs and phylogenetic tree of the targeted gene(s) (Niroula and Vihinen, 2015; Fortuno et al., 2018). These predictors share the same strategy as their precursors but differ in feature selection. The third class predicts variants in a comprehensive and robust way by using an additional rule-based classification system. For example, CancerVar (Li et al., 2022) integrates rule-based categorization with ML-based meta-predictor scores to interpret the predicted clinical significance. These well-calibrated and tailored predictors perform particularly well on targeted disease-specific sequencing panels (Peng et al., 2019). In contrast, their ability to generalize is questionable. When using these tools, keep the key target phenotypes and genes in mind.

3.2 Performance assessment of predictors

As dozens of predictors exist, choosing the appropriate one(s) is challenging for end users. Several assessment criteria, such as sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and Matthews correlation coefficient (MCC), are commonly used to demonstrate model performance (Vihinen, 2012). The values for sensitivity, specificity, PPV, NPV, and accuracy range from 0 to 1, with higher values indicating better performance. MCC takes true and false positives and negatives into account and ranges from -1 to 1, with 1 indicating perfect prediction. Furthermore, receiver operating characteristic (ROC) analysis, a visual measure, is frequently used to intuitively compare the area under the ROC curve (AUC) of multiple predictors (Vihinen, 2012). For non-intersecting curves, an AUC closer to 1 suggests better overall performance, while a value of 0.5 indicates random, uninformative classification. Most predictors, when developed, are assessed on their own training and validation datasets and show excellent or acceptable performance. However, evaluation on consensus datasets is more informative for tool selection. Dozens of comparison studies have assessed the performance of selected tools using different benchmark datasets. In performance evaluations of pathogenicity-computation methods for missense variants, meta-predictors such as REVEL and Meta-SNP generally perform better and provide stronger evidence for clinical interpretation (Accetturo et al., 2020; Cubuk et al., 2021; Anderson and Lassmann, 2022). In an assessment of 23 predictors, Li et al. (2018) revealed that meta-predictors achieved higher AUC than sequence-based and structure-based predictors on the ClinVar benchmark dataset, indicating better performance of meta-predictors. 
However, on somatic variant and PPARG gene benchmark datasets, meta-predictors and structure-based predictors exhibited comparable performance (AUC>0.8) (Li et al., 2018) and were superior to homologous sequence-based predictors (AUC>0.7). Hassan et al. (2019a) revealed that a meta-predictor integrating 4 pre-existing prediction scores outperformed 8 other predictors, achieving ∼10%, 20%, 15% improvement in specificity, sensitivity and AUC, respectively. The performance of the different categories is not always consistent and is sometimes contradictory. Poon’s study (Poon, 2021) on BRCA1/2 datasets revealed that the performance of SIFT and PolyPhen2 differed among genes. Meléndez-Aranda et al. (2019) compared the performance of 6 in silico tools on 215 missense mutations in the hemophilia B causative gene F9, and the results showed that the most popular tool, SIFT, was the most accurate. In contrast, when applied to a dataset containing 4319 somatic missense variants, the performance of SIFT was sub-optimal (Suybeng et al., 2020). It is therefore critical to understand the testing data and the predictive goal before selecting tools. To address this confusing situation and objectively determine the appropriate usage and accuracy of predictors, the Critical Assessment of Genome Interpretation (CAGI) (Andreoletti et al., 2019) community started its experiments in 2010. To date, six editions with 63 challenges have been held and over 50 articles released. Participants predict the phenotypic impact of unpublished genetic variants that CAGI collects from experimental and clinical laboratories. Independent assessors then test the predictions against experimentally characterized phenotypes, and the results are presented at the CAGI conference and published in special journal issues. 
The challenges released cover a wide range of topics, from nsSNVs to splicing variants, and from disease panels to databases of curated variants. However, the actual outcome is frequently far more complex than the initial objective of the challenges. Predictors with superior performance in one challenge may fail to call the pathogenicity of variants in other datasets (Katsonis and Lichtarge, 2019; Savojardo et al., 2019). Complex gene datasets led to divergent predictions and confounding outcomes, raising concerns that experimental errors might underlie the disagreement (Miller et al., 2019). All of the above suggests caution when interpreting evaluation results.
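
The assessment criteria listed at the start of this section can be computed from the four cells of a confusion matrix. The sketch below uses the standard definitions; the function names, the choice of "pathogenic" as the positive class, the convention that higher scores mean more pathogenic, and the NaN returned for empty denominators are assumptions. The AUC is computed through its rank interpretation, the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


def auc_from_scores(pathogenic_scores, benign_scores):
    """AUC as the fraction of (pathogenic, benign) pairs ranked correctly; ties count 0.5."""
    pairs = len(pathogenic_scores) * len(benign_scores)
    wins = sum(
        1.0 if p > b else 0.5 if p == b else 0.0
        for p in pathogenic_scores
        for b in benign_scores
    )
    return _ratio(wins, pairs)


# Hypothetical counts for one predictor on one benchmark.
print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))
```

Because PPV, NPV and accuracy depend on the pathogenic-to-benign ratio of the benchmark, values computed on datasets with different compositions are not directly comparable.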

3.3 Concerns of current predictors and remedies

The majority of predictors are trained, validated and tested using benchmark sets of variants with explicit classification labels. When evaluating 10 predictors across major public databases, Grimm et al. (2015) raised concerns about “circularity” in the use of predictors and the conduct of comparative studies. The term “circularity” refers to the situation in which the same variants are used repeatedly for both training and evaluating models. “Type 1 circularity” refers to the overlap between training and evaluation datasets, particularly for supervised ML-based predictors, which results in poor generalization to new data (Grimm et al., 2015). Selecting predictions from unsupervised tools as features or filtering out overlapping variants during training might help minimize “type 1 circularity” during model construction (Alirezaie et al., 2018; Won et al., 2021). Furthermore, avoiding overuse of individual datasets (Vihinen, 2013; Weber et al., 2019) and choosing benchmark databases that address the overlap issue (Sasidharan Nair and Vihinen, 2013; Sarkar et al., 2020) also help when conducting comparative studies. Grimm et al. (2015) observed that weighted FatHMM (Shihab et al., 2013) achieved outstanding performance on 2 datasets but showed a severe drop in performance on a subset of SwissVar. They found that the ratio of pathogenic to neutral variants in the same protein family was the key element of the weighting scheme, so that in genes with a higher ratio, a higher pathogenic score was assigned to both neutral and pathogenic variants (Grimm et al., 2015). This strategy made weighted FatHMM statistically successful on some datasets, but ultimately inappropriate. They therefore defined “type 2 circularity” as the circumstance in which all variants from the same gene are jointly labeled as pathogenic or neutral. 
To address this problem, it was suggested to use datasets with an appropriate pathogenic-to-neutral ratio and to avoid genes with exclusively pathogenic or neutral variants when reporting performance (Bu et al., 2022; Quinodoz et al., 2022). Another concern is “collinearity,” which generally occurs with regression models. “Collinearity” refers to the circumstance in which a significant correlation between two or more feature variables causes problems in estimating independent regression coefficients and leads to redundancy in the set of variables (Bayman and Dexter, 2021). This situation might be mitigated via feature selection and estimator modification (Zheng et al., 2020; Chan et al., 2022). From another perspective, “collinearity” may be less of a problem because more complex machine learning algorithms, including SVM, random forest, and neural networks, handle large-scale and multi-collinear datasets better (Dong et al., 2015; Perez-Enciso and Zingaretti, 2019).
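
The two remedies for circularity described above can be expressed as simple dataset filters. This is a minimal sketch, not a procedure taken from the cited studies: the record layout (a dictionary with hypothetical `key`, `gene` and `label` fields) and the function names are assumptions.

```python
from collections import defaultdict


def remove_training_overlap(evaluation_set, training_keys):
    """Type 1 circularity: drop evaluation variants that were also used for training."""
    return [v for v in evaluation_set if v["key"] not in training_keys]


def drop_single_label_genes(variants):
    """Type 2 circularity: drop genes whose variants all carry the same label."""
    labels_by_gene = defaultdict(set)
    for v in variants:
        labels_by_gene[v["gene"]].add(v["label"])
    return [v for v in variants if len(labels_by_gene[v["gene"]]) > 1]


# Hypothetical evaluation records; keys follow a chromosome-position-ref-alt pattern.
evaluation = [
    {"key": "1-100-A-G", "gene": "GENE1", "label": "pathogenic"},
    {"key": "1-200-C-T", "gene": "GENE1", "label": "neutral"},
    {"key": "2-300-G-A", "gene": "GENE2", "label": "pathogenic"},
    {"key": "3-500-A-C", "gene": "GENE3", "label": "neutral"},
]
training_keys = {"3-500-A-C"}

cleaned = drop_single_label_genes(remove_training_overlap(evaluation, training_keys))
print([v["key"] for v in cleaned])
```

Applying the overlap filter first matters here, because removing training variants can leave a gene with only one remaining label.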

4 Application

In silico approaches, which combine mathematical strategies with expert opinion, allow researchers to analyze the biological meaning of genetic data efficiently and economically (Trisilowati and Mallet, 2012). In silico predictors of variant effect aid in genome interpretation. The prediction-based categorization provides insight into variant characterization and prioritization. In large-scale population studies, in silico predictors aid in variant classification for pattern overview and comparison at the subpopulation level. For example, Palmer et al. (2022) subdivided missense variants by SIFT and PolyPhen2 predictions in research on bipolar disorder (BD) and revealed a clear enrichment of ultra-rare harmful missense variation outside of constrained missense regions, particularly in bipolar II disorder (BD2). This observation contrasted with the findings in schizophrenia cases (Singh et al., 2022), in which the enrichment lay within constrained missense regions. The authors speculated that this signal may capture something distinct to mood disorders relative to psychotic disorders (Palmer et al., 2022). In large-scale populations, in silico predictors also facilitate the detection of variant-level signals of natural selection in people living in extreme environments or with a diverse geographic distribution. Deng et al. (2019) ranked variants by calculating a functional importance score (FIS) from four in silico predictors. Based on the ranking of adaptive genetic variants, they revealed that a missense variant, rs116983452, in a seldom-studied gene, TMEM247, was the most differentiated functional variant between Tibetan and non-Tibetan populations (Deng et al., 2019). In a study of the non-homogeneous Taiwanese Han population, integrated selection of allele favored by evolution (iSAFE) was combined with the CADD functional impact score to identify 16 natural-selection signals by geographic distribution that were unambiguously localized to 5 single genes (Lo et al., 2021). 
Meanwhile, in the western Roma population, Font-Porterias et al. (2021) categorized missense variants based on GERP, PolyPhen2 and CADD, revealing a significant difference in the proportion of common deleterious variants between Roma and non-Roma populations. Furthermore, runs of homozygosity (ROH), which are continuous homozygous regions of the DNA sequence, exhibit ancestry-specific patterns of accumulation of deleterious homozygotes. In addition to characterization in population-level studies, predictors have been widely used to prioritize disease-causing candidates in case-control or pedigree studies, ultimately leading to the identification of genotype-phenotype associations. Predictors commonly assist two strategies for variant prioritization. Several frameworks and platforms are listed in Table 3.
TABLE 3

Representative prioritization frameworks and tools.

| Characteristic category | Name | Type of targeted variants* | Website | Distribution (web-server/stand-alone) | First publication | Last update | Programming language | Algorithm/modules | Input type | Dataset for modeling | Publications |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| User-defined rule-based | VCF.Filter | SNVs, indels | https://biomedical-sequencing.at/VCFFilter/ | Web and stand-alone | July 2017 | | Java | Filter cohort, prioritize on pedigree and search variant in cohort modules | VCF files, targeted regions, cohort allele frequencies, pedigree information | | Muller et al. (2017) |
| User-defined rule-based | BiERapp | SNVs, indels, CNVs, MNVs, SVs | http://bioinfo.cipf.es/apps-beta/bierapp/2.0.0/#home | Web and stand-alone | April 2014 | | HTML5 and JS | CellBase annotation, consecutive filtering strategy | Multi-sample VCF files | | Aleman et al. (2014) |
| User-defined rule-based | KGGSeq | SNVs, indel, CNVs | http://pmglab.top/kggseq/ | Stand-alone | January 2012 | 1 January 2022 | Java | 5 major modules: quality control, filtration, annotation, pathogenic prediction and statistic tests | VCF files, pedigree information | 7,296 disease-causing variants from OMIM and 48,089 neutral variants | Li et al. (2012); Li et al. (2017) |
| User-defined rule-based | VPOT (variant prioritization ordering tool) | SNVs, indel | https://github.com/VCCRI/VPOT/ | Stand-alone | November 2019 | 27 October 2021 | Python | 2 steps: prioritization of variants based on user-defined parameters, post-processing of variant priority ordered list | ANNOVAR annotated VCF or TXT files, inheritance model | | Ip et al. (2019) |
| ACMG guideline based | TAPES | SNVs, indel | https://github.com/a-xavier/tapes | Stand-alone | October 2019 | | Python | Bayesian classification framework | VCF files | | Xavier et al. (2019) |
| ACMG guideline based | InterVar | SNVs, indel | https://github.com/WGLab/InterVar, http://wintervar.wglab.org/ | Web, stand-alone and pre-computed results | February 2017 | 13 June 2022 | Python | Automated or manual scoring system. Manual review and adjustment on specific criteria | Annotated or unannotated VCF files | | Li and Wang (2017) |
| ACMG guideline related | VarFish | SNVs, indels | https://varfish-kiosk.bihealth.org/, https://github.com/bihealth/varfish-server | Web and stand-alone | July 2020 | June 2022 | Python | Quality control, database- and user-based annotation, filtering interface, joint filtering of multiple cases | VCF files, optional pedigree information | - | Holtgrewe et al. (2020) |
| Phenotype-driven | Exomiser | SNVs, indels | https://www.sanger.ac.uk/tool/exomiser/ | Stand-alone | November 2015 | November 2021 | Java | Filtering and prioritization based on logistic regression model. Four prioritization methods: PHIVE, PhenIX, ExomeWalker, hiPHIVE | VCF files, HPO terms, optional pedigree information | | Smedley et al. (2015) |
| Phenotype-driven | eXtasy | nsSNVs | https://extasy.esat.kuleuven.be/ | Web and stand-alone | September 2013 | | Ruby | RF | VCF files, HPO terms | 24,454 disease-causing nsSNV from HGMD associated with 1,142 HPO terms. Control datasets: common polymorphisms and rare variants from 1KGP, rare variants in in-house control samples | Sifrim et al. (2013) |
| Phenotype-driven | AMELIE (Automatic Mendelian Literature Evaluation) | Missense, stopgain, splicing, indels, duplication | https://amelie.stanford.edu/ | Web and stand-alone | May 2020 | May 2021 | | Natural language processing (NLP) and logistic regression classifier | VCF files, HPO terms | A set of 681 simulated patients using data from OMIM, ClinVar and 1KGP | Birgmeier et al. (2020) |
| Phenotype-driven | Phen-Gen | Missense, nonsense, splice site and indels | https://github.com/pkuerten/phen-gen | Stand-alone | September 2014 | | Perl | Random walk–with–restart algorithm, Bayesian framework based on genotype and phenotype data | VCF files, HPO terms | HGMD 2011.4 datasets | Javed et al. (2014) |
| Phenotype-driven | LIRICAL (LIkelihood Ratio Interpretation of Clinical AbnormaLities) | SNVs, indels | https://github.com/TheJacksonLaboratory/LIRICAL | Stand-alone | September 2020 | September 2021 | Java | Likelihood-ratio | VCF files, HPO terms | | Robinson et al. (2020) |
| Phenotype only | Phrank (phenotype ranking) | | https://bitbucket.org/bejerano/phrank/src/master/ | Stand-alone | February 2019 | | Python | Boolean Bayesian network | HPO terms | Knowledgebase of gene-disease-phenotype relationships, HPO-A | Jagadeesh et al. (2019) |
| Phenotype only | PhenoRank | | https://github.com/alexjcornish/PhenoRank | Stand-alone | June 2018 | | Python | Phenotypic similarity measured by simGIC, gene scores calculation by random walk with restart (RWR) method | HPO terms | 5,685 unique associations between 4,729 diseases and 3,713 genes from ClinVar, OMIM and UniProtKB | Cornish et al. (2018) |
| Phenotype only | Phen2Gene | | https://phen2gene.wglab.org/, https://github.com/WGLab/Phen2Gene | Web and stand-alone | June 2020 | March 2021 | Python | Weighting by skewness | HPO terms | HPO–gene annotation files downloaded from the Jackson Laboratory for Genomic Medicine; gene-disease databases OMIM, ClinVar, Orphanet, GeneReviews; gene-gene relationship databases HPRD, HGNC, Biosystem, HTRI | Zhao et al. (2020) |
First, empirical criteria are used to filter variants. Starting from high-quality variants, many studies (Ma et al., 2013; Blue et al., 2014) performed prioritization based on in silico predictions, MAF in population databases and control groups, inheritance pattern, and functional effect. With this method, fewer than 10 variants are distilled from the hundreds of thousands obtained from WES analysis. After validation by orthogonal assays (e.g., Sanger sequencing), true-positive causal candidates are examined for their functional effect on the protein in vitro and/or in vivo. The variant-phenotype relationship is thereby thoroughly investigated. Several user-friendly rule-based frameworks (Coutant et al., 2012; Li et al., 2012; Aleman et al., 2014; Muller et al., 2017) have been built to make the filtering procedure easier to implement. Researchers can set their own criteria and obtain the findings in readable files with detailed annotation information. The prioritization can also be supplemented by adopting consensus recommendations, such as the ACMG/AMP standards and guidelines (Richards et al., 2015). The guidelines include a comprehensive set of definitions and criteria for variant interpretation, ranging from standardized nomenclature to evidence-based rating, and yield a five-tier classification. Results from in silico predictors count as “supporting” evidence for benign or pathogenic classification. Some automatic tools (Li and Wang, 2017; Xavier et al., 2019) have also been developed for variant classification based on the guidelines, although manual classification by professional geneticists is still deemed the gold standard. The second strategy refers to phenotype-driven frameworks, which combine phenotype and variant data for prioritization and interpretation. Clinical diagnosis would be straightforward when the disease is known. 
However, before a candidate disease is identified, explaining a set of clinical features is challenging owing to the absence or presence of unrelated features and the varying degrees of specificity (Kohler et al., 2009). To extract standardized and normalized phenotypic terms from sparse clinical abnormalities in case studies, tools such as Phenomizer (Kohler et al., 2009) and Doc2HPO (Liu et al., 2019) are recommended to map the clinical symptoms to the list of known disorders and estimate the significance of each disease match. Prediction scores from in silico predictors are integrated into this kind of framework as “pathogenicity” or “deleteriousness” features. Most phenotype-driven tools (Sifrim et al., 2013; Javed et al., 2014; Smedley et al., 2015; Birgmeier et al., 2020; Robinson et al., 2020) require variant files and HPO terms as input, while some tools (Cornish et al., 2018; Jagadeesh et al., 2019; Zhao et al., 2020) require only HPO terms. In a recent comparative study, Yuan et al. (2022) investigated the causal-gene prioritization performance of both types on two benchmark datasets and revealed that the former performed better overall than the latter. Their results also indicated that multiple phenotype-driven tools are complementary and that a viable integrated strategy may improve diagnostic efficiency (Yuan et al., 2022).
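
The first, rule-based strategy described above (filtering on population frequency, functional effect and in silico predictions, then ranking the survivors) can be sketched as follows. This does not reproduce any of the frameworks in Table 3: the record fields, the consequence labels, the thresholds and the ranking rule are hypothetical, and inheritance-pattern filtering is omitted for brevity.

```python
# Hypothetical set of consequence labels regarded as potentially damaging.
DAMAGING_CONSEQUENCES = {"missense", "stop_gained", "frameshift", "splice_site"}


def deleterious_calls(variant):
    """Number of in silico predictors that call the variant deleterious."""
    return sum(1 for call in variant["predictions"].values() if call)


def passes_filters(variant, max_maf=0.01, min_calls=2):
    """User-defined rules: rare, potentially damaging, and supported by predictors."""
    return (
        variant["maf"] <= max_maf
        and variant["consequence"] in DAMAGING_CONSEQUENCES
        and deleterious_calls(variant) >= min_calls
    )


def prioritize(variants, max_maf=0.01, min_calls=2):
    """Filter, then rank by predictor support (descending) and MAF (ascending)."""
    kept = [v for v in variants if passes_filters(v, max_maf, min_calls)]
    return sorted(kept, key=lambda v: (-deleterious_calls(v), v["maf"]))


# Hypothetical annotated variants; tool names are placeholders.
variants = [
    {"id": "v1", "maf": 0.0001, "consequence": "missense",
     "predictions": {"tool_a": True, "tool_b": True, "tool_c": False}},
    {"id": "v2", "maf": 0.2, "consequence": "missense",
     "predictions": {"tool_a": True, "tool_b": True, "tool_c": True}},
    {"id": "v3", "maf": 0.0005, "consequence": "stop_gained",
     "predictions": {"tool_a": True, "tool_b": True, "tool_c": True}},
]
print([v["id"] for v in prioritize(variants)])
```

In practice the thresholds would be chosen per study, and the shortlisted candidates would still require orthogonal validation and functional follow-up as described in the text.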

5 Discussion

In this review, we first summarized the database resources frequently used during predictor development. We then discussed the rationale, necessity and limitations of the newly categorized predictors: homologous sequence-based, structure-based, combined sequence- and structure-based, meta-predictors, population-based, and gene-, phenotype- and disease-specific predictors. Predictor performance, limitations and possible remedies were then outlined. Finally, the application of the predictors in real studies was presented, demonstrating their efficient assistance in variant characterization and prioritization, as well as in the discovery of genotype-phenotype associations. When building predictors, unambiguously labeled datasets are critical. Avoiding overlapping and contradictory data, as well as balancing the positive-negative ratio in training and validation datasets, helps minimize the negative influence of circularity. Further examination of the collinearity among feature variables will facilitate the optimization of prediction models, even though some algorithms are not affected by it. Among the predictors, meta-predictors generally outperform the others; however, their prediction performance is considerably reduced on some disease-specific datasets, raising concern about their application, especially in clinical settings (Schiemann and Stowell, 2016; Mahmood et al., 2017). Employing disease-, gene- or phenotype-specific predictors can solve this issue to an extent. When selecting predictors for a particular study, effort should be made to check whether the genes and phenotypes on which a predictor was calibrated match the research question, and to understand the scope and predictive performance of each predictor. We also look forward to more specialized predictors tailored to a variety of phenotypes covering both common and rare diseases. 
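
The dataset checks mentioned above (overlap between training and evaluation data, contradictory labels, and the positive-negative ratio) can be sketched in a few lines of Python. This is only an illustrative sketch: the representation of a record as a `(variant_id, label)` pair, the label names and the function names are assumptions, not part of any cited benchmark or tool.

```python
from collections import Counter


def remove_overlap(train, test):
    """Drop evaluation variants that also occur in the training set."""
    train_ids = {vid for vid, _ in train}
    return [(vid, label) for vid, label in test if vid not in train_ids]


def conflicting_ids(records):
    """Return variant ids that carry more than one distinct label."""
    labels = {}
    for vid, label in records:
        labels.setdefault(vid, set()).add(label)
    return {vid for vid, seen in labels.items() if len(seen) > 1}


def positive_fraction(records, positive="pathogenic"):
    """Fraction of records labeled as positive (0.0 for an empty dataset)."""
    counts = Counter(label for _, label in records)
    total = sum(counts.values())
    return counts[positive] / total if total else 0.0
```

Running such checks before training and benchmarking gives a quick indication of whether an evaluation set is independent of the training data and whether the two classes are reasonably balanced.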
According to the Variation Ontology (VariO) (Vihinen, 2014), variant impacts at the protein level can be annotated with effects on function, structure and property. Variant effects on protein function or properties can be classified as follows: abundance, which includes gene dosage, expression, degradation and mis-localization; activity, which includes enzymatic, kinetic and regulatory effects; enzymatic specificity; and molecular affinity (Vihinen, 2021). Most of the above-mentioned predictors compute the probability of a pathogenic effect on protein function and structure in a broad sense, rather than the effects on protein abundance, activity or affinity separately. This points to a challenging future direction for variant predictor development. At the current stage, the correlation between predicted pathogenic effects on protein function or structure and abnormal clinical outcomes is validated by experimental evidence. For certain phenotypes, an evident enrichment of deleterious variants has been observed in a set of disease-related genes, such as the increased mutational burden in essential genes in autism spectrum disorder (Ji et al., 2016), in WNT signaling genes in myelomeningocele (Hebert et al., 2020), and in a set of 5 genes in epilepsy (Leu et al., 2015). The gap between the genes with an observed higher burden and the clinical phenotype is then bridged by functional or mechanistic experimental studies. For example, meiocytes carrying the pathogenic mutation p.S167L in HSF2BP, found in patients with premature ovarian insufficiency (POI) from one family, showed a reduced number of foci formed by the recombinases RAD51/DMC1, leading to a crossover defect; this provided insight into the molecular mechanism of the mutation in POI and subfertility (Felipe-Medina et al., 2020). Currently, variant impact predictors are insufficient for indicating the molecular mechanism of pathogenicity. 
However, advances in protein structure prediction may assist the interpretation of pathogenic variants, since structural information gives useful insights for evaluating variant impact on proteins or biological systems (Diwan et al., 2021). The impact of mutations on protein synthesis includes transcriptional and translational influences. For SNVs, the impact on transcription involves changes in transcript sequence and effects on gene regulation (Haraksingh and Snyder, 2013). Tools for predicting the impact on gene regulation have been systematically reviewed in other studies (Li et al., 2015; Ohno et al., 2018; Rojano et al., 2019; Canson et al., 2020). In terms of translation, SNV-induced amino acid substitutions cause abnormalities in protein structure and function, and the corresponding prediction methods have been explored in this review. The deeper association between SNVs and protein folding and post-translational modification is still being investigated. With the development of the cutting-edge structure prediction tool AlphaFold2, the fraction of the human proteome lacking structural information has narrowed to less than 30% (Porta-Pardo et al., 2022). However, examples showed that AlphaFold2 was not capable of predicting the structural changes caused by pathogenic mutations, particularly those with an experimentally proven destabilizing effect (Buel and Walters, 2022). This limitation may stem from the bioinformatics and physical methodologies used in modeling, as well as from the reliance on protein sequence and PDB structure data rather than on the fundamental driving forces of protein folding (Jumper et al., 2021; Buel and Walters, 2022). The AlphaFold team is presently considering solutions for new mutations, which may better predict the transition from the unfolded to the folded state based on protein physics rather than sequence evolution (Callaway, 2022). 
We anticipate that its success will usher in a new age of human genetic research, including the acceleration of in silico functional and mechanistic investigations of genotype-phenotype associations. Finally, although variant effect predictors greatly help genomic interpretation, end-users should keep in mind that a predictor only assists clinical diagnosis and is merely a starting point (Eilbeck et al., 2017). A predicted damaging effect is not equivalent to pathogenicity, which calls for caution in the use of these predictors. In addition, under some circumstances, predicted scores that overstate the effect of uncommon mutations will cause inflated estimates, affecting specificity and sensitivity (Lanktree et al., 2018). Therefore, experimental validation, the gold standard in variant impact evaluation, remains indispensable.
  254 in total

1.  SIFT missense predictions for genomes.

Authors:  Robert Vaser; Swarnaseetha Adusumalli; Sim Ngak Leng; Mile Sikic; Pauline C Ng
Journal:  Nat Protoc       Date:  2015-12-03       Impact factor: 13.491

2.  Pathogenicity of the BRCA1 missense variant M1775K is determined by the disruption of the BRCT phosphopeptide-binding pocket: a multi-modal approach.

Authors:  Marc Tischkowitz; Nancy Hamel; Marcelo A Carvalho; Gabriel Birrane; Aditi Soni; Erik H van Beers; Simon A Joosse; Nora Wong; David Novak; Louise A Quenneville; Scott A Grist; Petra M Nederlof; David E Goldgar; Sean V Tavtigian; Alvaro N Monteiro; John A A Ladias; William D Foulkes
Journal:  Eur J Hum Genet       Date:  2008-02-20       Impact factor: 4.246

Review 3.  A review study: Computational techniques for expecting the impact of non-synonymous single nucleotide variants in human diseases.

Authors:  Marwa S Hassan; A A Shaalan; M I Dessouky; Abdelaziz E Abdelnaiem; Mahmoud ElHefnawi
Journal:  Gene       Date:  2018-09-18       Impact factor: 3.688

Review 4.  Computational approaches for predicting mutant protein stability.

Authors:  Shweta Kulshreshtha; Vigi Chaudhary; Girish K Goswami; Nidhi Mathur
Journal:  J Comput Aided Mol Des       Date:  2016-05-09       Impact factor: 3.686

5.  Hemoglobin Kansas, a human hemoglobin with a neutral amino acid substitution and an abnormal oxygen equilibrium.

Authors:  J Bonaventura; A Riggs
Journal:  J Biol Chem       Date:  1968-03-10       Impact factor: 5.157

6.  Multicollinearity in Logistic Regression Models.

Authors:  Emine Ozgur Bayman; Franklin Dexter
Journal:  Anesth Analg       Date:  2021-08-01       Impact factor: 5.108

7.  Can AlphaFold2 predict the impact of missense mutations on structure?

Authors:  Gwen R Buel; Kylie J Walters
Journal:  Nat Struct Mol Biol       Date:  2022-01       Impact factor: 15.369

8.  Interpretable Clinical Genomics with a Likelihood Ratio Paradigm.

Authors:  Peter N Robinson; Vida Ravanmehr; Julius O B Jacobsen; Daniel Danis; Xingmin Aaron Zhang; Leigh C Carmody; Michael A Gargano; Courtney L Thaxton; Guy Karlebach; Justin Reese; Manuel Holtgrewe; Sebastian Köhler; Julie A McMurry; Melissa A Haendel; Damian Smedley
Journal:  Am J Hum Genet       Date:  2020-08-04       Impact factor: 11.025

9.  A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases.

Authors:  Miao-Xin Li; Hong-Sheng Gui; Johnny S H Kwan; Su-Ying Bao; Pak C Sham
Journal:  Nucleic Acids Res       Date:  2012-01-12       Impact factor: 16.971

10.  An expanded phenotype centric benchmark of variant prioritisation tools.

Authors:  Denise Anderson; Timo Lassmann
Journal:  Hum Mutat       Date:  2022-03-09       Impact factor: 4.700
