Ye Liu, William S B Yeung, Philip C N Chiu, Dandan Cao.
Abstract
One objective of human genetics is to identify the variants that contribute to human diseases. With the rapid development and wide use of next-generation sequencing (NGS), massive amounts of genomic sequence data have been generated, making personal genetic information available. Conventional experimental evidence is critical for establishing the relationship between sequence variants and phenotype, but obtaining it is inefficient. Because comprehensive databases and resources presenting clinical and experimental evidence on genotype-phenotype relationships are lacking, and because variants discovered by NGS keep accumulating, a variety of computational tools that predict the impact of variants on phenotype have been developed to bridge the gap. In this review, we briefly introduce and discuss the computational approaches for variant impact prediction. Using a novel scheme, we focus mainly on approaches for predicting the impact of non-synonymous single-nucleotide variants (nsSNVs) and categorize them into six classes. Their underlying rationale and constraints, together with the concerns and remedies raised by comparative studies, are discussed. We also describe how the predictive approaches have been employed in different research settings. Although they are subject to diverse constraints, computational predictive approaches are indispensable for exploring genotype-phenotype relationships.
Keywords: genotype-phenotype relationship; human genetics; in silico prediction; nonsynonymous variants; variant impact
Year: 2022 PMID: 36246661 PMCID: PMC9559863 DOI: 10.3389/fgene.2022.981005
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
Summary of resources for human genotype-phenotype relationships.
| Type of data | Name | Full name | Techniques | Type of variants | Targeted diseases | Website | Number of entries (as of June 2022) | Composition | First publication year | Last update (as of June 2022) | Access | Publications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Protein data | UniProt | Universal protein resource | Curated | — | General | | 567,483 entries in Swiss-Prot and 231,354,261 entries in TrEMBL | UniProt Knowledgebase, UniProt Reference Clusters, and UniProt Archive | 1997 | 2 February 2021 | Free | |
| Protein information | UniProtKB | UniProt Knowledgebase | Curated | — | General | | — | Swiss-Prot and TrEMBL | — | 22 November 2021 | Free | |
| Protein sequences | UniRef | UniProt Reference Clusters | Curated | — | General | | — | UniRef100, 90, 50 | — | 29 November 2021 | Free | |
| Protein sequences | UniParc | UniProt Archive | Curated | — | General | | — | — | — | 24 March 2022 | Free | |
| Protein, DNA and RNA structural data | PDB | Protein data bank | Structural data from X-ray, NMR, electron microscopy | — | General | | 191,565 biological macromolecular structures | — | 1971 | 14 June 2022 | Free | |
| Protein data with thermodynamic parameters | ProThermDB | Thermodynamic Database for Proteins and Mutants | Curated | — | General | | ∼0.12 million thermodynamic data obtained for different organisms and cell lines, >32,000 entries, ∼20,000 mutations | — | 1999 | 22 September 2021 | Free | |
| Protein data | ONGene | — | Curated | — | Cancer | | 803 oncogenes | — | 2016 | — | Free | |
| Protein data | TSGene2.0 | Tumor suppressor gene database | Curated | — | Cancer | | 1,217 human tumor suppressor genes | — | 2012 | 4 January 2016 | Free | |
| Population data | 1000 Genomes Project | — | WGS | SNVs, indels | General | | Genotypes for 2,504 healthy donor samples from 26 populations | — | 2008 | 1 October 2015 | Free | |
| Population data | gnomAD (previously ExAC) | Genome aggregation database | WGS, WES | SNVs, indels | General | | 76,156 genomes of diverse ancestries in v3.1 and exomes or genomes of 141,456 individuals in v2 | — | 2014 | 21 January 2022 | Free | |
| Population data | ESP | The NHLBI exome sequencing project | WES | SNVs, indels | Disease-, phenotype-related | | Exome data from 6,503 unrelated individuals | — | 2011 | 23 April 2019 | Free | |
| Population data | UK Biobank | — | — | — | Disease-, phenotype-related | | 49,960 exome data | — | 2006 | 19 March 2019 | Registration fee needed | — |
| Population data | UK10K | — | WGS, WES | — | Healthy and disease-related cohorts | | Nearly 10,000 individuals in the UK population | Whole genome, Neurodevelopment, Obesity, Rare Diseases Sample Sets | 2010 | 1 October 2015 | Access control | |
| Phenotype and genotype data | OMIM | Online Mendelian Inheritance in Man | Classification | — | Disease-, phenotype-, gene-related | | 26,446 entries, including all known Mendelian disorders and over 16,000 genes | — | 1960 | 27 May 2022 | Free | |
| Phenotype and genotype data | Orphanet | The portal for rare diseases and orphan drugs | Classification | — | Disease-, phenotype-related | | 6,172 diseases, 5,835 genes | — | 1997 | 31 May 2022 | Free | — |
| Ontology | HPO | Human phenotype ontology | Classification | — | Disease-, phenotype-, gene-related | | >13,000 terms, >156,000 annotations | — | 2008 | 14 April 2022 | Free | |
| Ontology | GO | Gene ontology | Classification | — | Gene-specific | | 7,510,543 annotations | Molecular Function, Cellular Component, and Biological Process | 2000 | 16 May 2022 | Free | |
| Ontology | Mammalian Phenotype Ontology | — | Classification | — | Phenotype-related | | 14,716 classes | — | 2005 | 14 June 2022 | Free | |
| Genomic data | HGMD | Human gene mutation database | Curated | SNVs, indels | Disease-, phenotype-related | | 352,731 mutation entries | 352,731 mutation entries | 1996 | 31 May 2022 | Registration needed | |
| Genomic data | VariBench | A benchmark database for variations | Curated | SNVs, indels | — | | — | VariBench datasets include disease-causing missense variations, neutral high-frequency SNPs, missense variations affecting protein stability, variations affecting transcription factor binding sites, and variations affecting splice sites | 2012 | — | Free | |
| Genomic data | VariSNP | — | Curated | SNVs, indels | — | | 145,435,955 variants | Datasets selected from dbSNP and filtered for disease-related variants found in ClinVar, Swiss-Prot and PhenCode | 2014 | 16 February 2017 | Free | |
| Genomic data | dbSNP | Single nucleotide polymorphism database | Curated | SNVs, indels, retroposable element insertions and microsatellite repeat variations | General | | 1,085,850,277 refSNPs | — | 1999 | 26 May 2020 | Free | |
| Genomic data | ClinVar | — | Curated | SNVs, indels | Disease-, phenotype-, gene-related | | 1,540,318 unique variation records | — | 2013 | 5 May 2022 | Free | |
| Genomic data | ClinGen | — | Curated | SNVs, indels | Disease-, phenotype-related | | 3,692 unique variants in 2,278 unique genes | — | 2013 | 1 April 2022 | Free | |
| Genomic data | DoCM | Database of Curated Mutations | Curated | SNVs, indels | Cancer | | 1,364 variants among 122 disease types | — | 2014 | — | Free | |
| Genomic data | VKGL | Vereniging klinisch genetische | Curated | SNVs, indels | Disease-, phenotype-related | | 188,502 variants | — | 2018 | December 2021 | Free | |
| Genomic data | CIViC | Clinical interpretation of variants in cancer | Curated | SNVs, indels, SVs | Cancer | | 3,165 variants, 470 genes with clinical interpretation | — | 2015 | 1 May 2022 | Free | |
| Genomic data | COSMIC | Catalogue of somatic mutations in cancer | Curated | SNVs, indels | Cancer | | 29,399,170 variants, 1,207,190 CNVs, 19,422 fusions | — | 2004 | 31 May 2022 | Free | |
| Genomic data | LOVD3.0 | Leiden open variation database 3.0 | Curated | SNVs, indels | Disease-, phenotype-related | | 800,780 variants | — | 2002 | 17 August 2021 | Free | |
| Genomic data | InSiGHT | The International Society for Gastrointestinal Hereditary Tumours | Curated | SNVs, indels | Gene-specific | | 35,644 variant entries from 9 genes related to gastrointestinal tumours | Variants are automatically sourced from LOVD3 | 2005 | — | Free | |
| Genomic data | HuVarBase | Human variants database | Curated | Missense, nonsense, insertion, deletion | Disease-, phenotype-related | | 774,863 variants from 18,318 proteins, including 702,048 disease-causing and 72,815 neutral variants | Sourced from 1000 Genomes, ClinVar, COSMIC, Humsavar, SwissVar, MutHTP, PROXiMATE | 2018 | 15 October 2018 | Free | |
| Genomic data | DVD | Deafness variation database | Curated | SNVs, indels | Deafness-related | | 223 genes | Sourced from ClinVar, dbNSFP, gnomAD, VEP, CADD, dbSNP, Population Analysis and others | 2018 | 4 January 2021 | Free | |
| Genomic data | METABRIC | Molecular Taxonomy of Breast Cancer International Consortium | Targeted NGS | SNVs, indels | Breast cancer | Mutation details can be retrieved from | Mutation data in 173 genes from 2,433 primary breast tumor samples and 650 normal controls | Genomic mutation data, copy number aberration (CNA), gene expression and long-term clinical follow-up data | 2012 | — | Free | |
| Genomic data | TCGA-BRCA | — | WES | SNVs, indels | Breast cancer | | Mutation data from WES of 817 Breast Invasive Carcinoma tumor/normal pairs | Genomic mutation data, copy number aberration (CNA), gene expression and long-term clinical follow-up data | 2012 | 8 October 2015 | Free | |
| Genomic data | | — | Saturation genome editing assays | SNVs | | | 3,893 SNVs located within or near 13 exons that encode the RING and BRCT domains of BRCA1 (exons 2–5 and 15–23, respectively) | — | 2018 | — | Free | |
| Genomic data | VarCards | — | Curated | SNVs, indels | General | | 110,154,363 SNVs and 1,223,370 indels in coding regions or splicing sites | Variant-level and gene-level resources | 2016 | 28 June 2020 | Free | |
FIGURE 1. Summarized workflow of variant impact predictors. The protein structure and protein features of the BRCA1 BRCT mutant M1775K are retrieved from published studies (Birrane, 2006; Tischkowitz et al., 2008). The minor allele frequency (MAF) of variant rs41293463 (chr17-43051071-A-C, GRCh38) was retrieved from gnomAD (Genome Aggregation Database).
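The figure caption identifies a variant by a `chrom-pos-ref-alt` string and refers to its minor allele frequency. The following is a minimal Python sketch of how such an identifier could be parsed and how a MAF follows from an allele count and allele number. The class and function names are hypothetical, and the counts in the usage lines are made-up illustrative numbers, not gnomAD data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Parse an identifier such as 'chr17-43051071-A-C' (hypothetical helper)."""
    parts = variant_id.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected chrom-pos-ref-alt, got {variant_id!r}")
    chrom, pos, ref, alt = parts
    # Drop an optional 'chr' prefix so '17' and 'chr17' compare equal.
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return Variant(chrom, int(pos), ref.upper(), alt.upper())


def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Frequency of the alternate allele among all called alleles."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    if not 0 <= allele_count <= allele_number:
        raise ValueError("allele_count must lie between 0 and allele_number")
    return allele_count / allele_number


def minor_allele_frequency(alt_af: float) -> float:
    """For a biallelic site, the MAF is the smaller of the two allele frequencies."""
    return min(alt_af, 1.0 - alt_af)


# Usage with made-up counts (NOT real gnomAD values).
variant = parse_variant_id("chr17-43051071-A-C")
maf = minor_allele_frequency(allele_frequency(allele_count=3, allele_number=150_000))
```

Normalizing the chromosome prefix at parse time is a design choice that keeps lookups consistent across resources that differ in whether they write `chr17` or `17`.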
Representative disease-, phenotype-, and gene-specific variant impact predictors.
| Characteristic category | Name | Type of variants | Targeted disease/phenotype/gene | # of genes | Website | Distribution (web-server/stand-alone) | First publication | Programming language | Algorithm/model | Features | Dataset for modeling | Classification index | Classification | Additional data | Publication |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Meta-predictor | VIPPID (Variant Impact Predictor for PIDs) | Missense | Primary immunodeficiency (PID) diseases | 146 | | Web and stand-alone | April 2022 | Perl, R | Conditional Inference Forest | 85 features including AA, exonic, protein structural, conservation, and 20 pre-existing prediction tools | 4,865 disease-associated variants from Asian Primary Immunodeficiency Diseases (RAPID) database, HGMD and ClinVar; 4,237 neutral variants from gnomAD | Classifier | Pathogenic/non-pathogenic | 26 reviewed P/LP variants of known PID pathogenic genes from a cohort of 1,318 patients and 39 validated in-house variants | |
| Meta-predictor | CanPredict | Missense | Cancer | — | | — | May 2007 | R | RF | SIFT, Pfam-based LogR.E-value and GO Similarity Score (GOSS) metrics | — | Classifier | Likely cancer/likely non-cancer/not determined | — | |
| Meta-predictor | PolyPhen-HCM | Missense | Hypertrophic cardiomyopathy | 6 | | Pre-computed results | February 2011 | — | Naïve Bayes classifier | Prediction scores, protein structure comparison score | 74 variants curated from the literature and manually classified by the Laboratory for Molecular Medicine standard variant-assessment pipeline (41 pathogenic, 26 benign) | Classifier | Pathogenic/benign/no call | — | |
| Meta-predictor | CardioBoost | Missense | Cardiomyopathies and arrhythmias | 22 | | Pre-computed results | October 2020 | R | 2 Adaptive Boosting (AdaBoost) classifiers | 76 functional features | CM datasets: 356 rare P/LP variants from 9,007 clinical CM patients, 302 rare missense variants in CM genes from 2,090 healthy controls. Inherited arrhythmia dataset: 252 P/LP variants in arrhythmia-associated genes from ClinVar, 237 rare missense variants in arrhythmia genes from 2,090 healthy controls | Pathogenicity score | Disease-causing/VUS/benign | 4 datasets from ClinVar, HGMD, Oxford Medical Genetics Laboratory (OMGL), a large registry of HCM patients, SHaRe | |
| Multiple features | GENESIS (GENe-specific EnSemble grId Search) | Variants of uncertain clinical significance | Catecholaminergic polymorphic ventricular tachycardia and long QT syndrome (LQTS) | 4 | | Stand-alone and pre-computed results | March 2022 | Python | Logistic regression and multilayer perceptron model | 8 kinds of features including AA features, domain, conservation, rate of evolution, signal-to-noise ratio, and a position-specific scoring matrix (PSSM) score | 717 pathogenic variants and 3,164 benign variants curated from the literature | Probabilities of pathogenicity | Pathogenic/VUS/benign | 925 VUS classified according to ACMG | |
| Multiple features | CACNA1F-vp | Missense | X-linked incomplete Congenital Stationary Night Blindness (iCSNB) | 1 | | Stand-alone | April 2020 | Python | Logistic regression model | Variant-level features and structural features | 72 disease-implicated variants from the HGMD or MGDL database, 322 benign variants from gnomAD | Probabilities of pathogenicity | Pathogenic/benign | — | |
| Optimized PON-P2 | PON-MMR2 | AA substitution | Mismatch repair (MMR) | 4 | | Web and stand-alone | September 2015 | R | RF | 5 features: sequence conservation, physical and biochemical properties of AA | 109 pathogenic, 99 neutral, 354 VUS from InSiGHT database and VariBench | Probabilities of pathogenicity | Pathogenic/VUS/benign | 354 VUS dataset | |
| Optimized MAPP | CoDP (Combination of Different Properties of MSH6 protein) | Missense | Lynch syndrome (LS) | 1 | | Web | April 2013 | — | Logistic regression model | MSA, phylogenetic tree, structural properties, MAPP, SIFT, PolyPhen2 | 294 missense variants from InSiGHT, MMRUV, UniProt, dbSNP, ESP, HapMap Project, 1KGP and literature | Probabilities of pathogenicity | Likely LS/unlikely LS | 260 unclassified variants dataset | |
| Meta-predictor with MAF as features | DvPred | nsSNVs | Genetic hearing loss (HL) | 157 | | Stand-alone and pre-computed results | February 2022 | Python | Gradient boosting decision tree (GBDT) | 65 features including conservation scores, prediction scores, MAF, gene intolerance scores and other features | 1,318 P/LP and 4,628 B/LB variants from China Deafness Genetics Consortium (CDGC), Deafness Variation Database (DVD), ClinVar, HGMD | DvPred score | Deleterious/neutral | 463 pathogenic and 454 benign variants from a new version of CDGC and ClinVar | |
| Meta-predictor | NBDriver | Missense | Cancer | 58 | | Stand-alone | May 2021 | Python | RF, extra trees (ET) classifier, generative KDE classifier | 3 types of features: one-hot encoding, overlapping k-mers, 27 genomic features | 5,265 disease-associated variants from five published studies | Classifier | — | — | |
| Combination of rule-based and meta-predictor | CancerVar | Exon variants, CNVs, indels | Cancer | 1,911 | | Web, stand-alone and pre-computed results | May 2022 | Python | Semi-supervised generative adversarial network used in scoring method OPAI | 12 clinical evidence prediction scores and 23 precomputed scores from other computational tools | 13 million variants from 7 cancer knowledgebases | OPAI score | Oncogenic/benign | 4 datasets from OncoKB, CIViC, IARC and the literature | |
*VUS, variant of uncertain significance.
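Several predictors in the table above report a probability of pathogenicity and then assign one of three labels (pathogenic, VUS, benign). A minimal Python sketch of that final thresholding step is given here. The function name and the cutoffs 0.2 and 0.8 are illustrative assumptions and are not the thresholds of any listed tool.

```python
def classify_variant(prob_pathogenic: float,
                     benign_below: float = 0.2,
                     pathogenic_above: float = 0.8) -> str:
    """Map a pathogenicity probability to a three-way label.

    The default cutoffs are hypothetical; real tools calibrate their own.
    """
    if not 0.0 <= prob_pathogenic <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if not benign_below < pathogenic_above:
        raise ValueError("benign_below must be smaller than pathogenic_above")
    if prob_pathogenic >= pathogenic_above:
        return "pathogenic"
    if prob_pathogenic <= benign_below:
        return "benign"
    # Anything between the two cutoffs is left unresolved.
    return "VUS"


# Usage with made-up probabilities.
labels = [classify_variant(p) for p in (0.05, 0.50, 0.93)]
```

Keeping an explicit middle band, instead of a single cutoff at 0.5, lets a tool decline to call variants for which the evidence is weak.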
Representative prioritization frameworks and tools.
| Characteristic category | Name | Type of targeted variants* | Website | Distribution (web-server/stand-alone) | First publication | Last update | Programming language | Algorithm/modules | Input type | Dataset for modeling | Publications |
|---|---|---|---|---|---|---|---|---|---|---|---|
| User-defined rule-based | VCF.Filter | SNVs, indels | | Web and stand-alone | July 2017 | — | Java | Filter cohort, prioritize on pedigree and search variant in cohort modules | VCF files, targeted regions, cohort allele frequencies, pedigree information | — | |
| User-defined rule-based | BiERapp | SNVs, indels, CNVs, MNVs, SVs | | Web and stand-alone | April 2014 | — | HTML5 and JS | CellBase annotation, consecutive filtering strategy | Multi-sample VCF files | — | |
| User-defined rule-based | KGGSeq | SNVs, indels, CNVs | | Stand-alone | January 2012 | 1 January 2022 | Java | 5 major modules: quality control, filtration, annotation, pathogenic prediction and statistical tests | VCF files, pedigree information | 7,296 disease-causing variants from OMIM and 48,089 neutral variants | |
| User-defined rule-based | VPOT (variant prioritization ordering tool) | SNVs, indels | | Stand-alone | November 2019 | 27 October 2021 | Python | 2 steps: prioritization of variants based on user-defined parameters, post-processing of the priority-ordered variant list | ANNOVAR-annotated VCF or TXT files, inheritance model | — | |
| ACMG guideline based | TAPES | SNVs, indels | | Stand-alone | October 2019 | — | Python | Bayesian classification framework | VCF files | — | |
| ACMG guideline based | InterVar | SNVs, indels | | Web, stand-alone and pre-computed results | February 2017 | 13 June 2022 | Python | Automated or manual scoring system, with manual review and adjustment of specific criteria | Annotated or unannotated VCF files | — | |
| ACMG guideline related | VarFish | SNVs, indels | | Web and stand-alone | July 2020 | June 2022 | Python | Quality control, database- and user-based annotation, filtering interface, joint filtering of multiple cases | VCF files, optional pedigree information | — | |
| Phenotype-driven | Exomiser | SNVs, indels | | Stand-alone | November 2015 | November 2021 | Java | Filtering and prioritization based on a logistic regression model. The four prioritization methods are PHIVE, PhenIX, ExomeWalker, and hiPHIVE. | VCF files, HPO terms, optional pedigree information | — | |
| Phenotype-driven | eXtasy | nsSNVs | | Web and stand-alone | September 2013 | — | Ruby | RF | VCF files, HPO terms | 24,454 disease-causing nsSNVs from HGMD associated with 1,142 HPO terms. Control datasets: common polymorphisms and rare variants from 1KGP, rare variants in in-house control samples | |
| Phenotype-driven | AMELIE (Automatic Mendelian Literature Evaluation) | Missense, stopgain, splicing, indels, duplication | | Web and stand-alone | May 2020 | May 2021 | — | Natural language processing (NLP) and logistic regression classifier | VCF files, HPO terms | A set of 681 simulated patients using data from OMIM, ClinVar and 1KGP | |
| Phenotype-driven | Phen-Gen | Missense, nonsense, splice site and indels | | Stand-alone | September 2014 | — | Perl | Random walk with restart algorithm, Bayesian framework based on genotype and phenotype data | VCF files, HPO terms | HGMD 2011.4 datasets | |
| Phenotype-driven | LIRICAL (LIkelihood Ratio Interpretation of Clinical AbnormaLities) | SNVs, indels | | Stand-alone | September 2020 | September 2021 | Java | Likelihood-ratio | VCF files, HPO terms | — | |
| Phenotype only | Phrank (phenotype ranking) | — | | Stand-alone | February 2019 | — | Python | Boolean Bayesian network | HPO terms | Knowledgebase of gene-disease-phenotype relationships, HPO-A | |
| Phenotype only | PhenoRank | — | | Stand-alone | June 2018 | — | Python | Phenotypic similarity measured by simGIC, gene score calculation by the random walk with restart (RWR) method | HPO terms | 5,685 unique associations between 4,729 diseases and 3,713 genes from ClinVar, OMIM and UniProtKB | |
| Phenotype only | Phen2Gene | — | | Web and stand-alone | June 2020 | March 2021 | Python | Weighting by skewness | HPO terms | HPO–gene annotation files downloaded from the Jackson Laboratory for Genomic Medicine; gene-disease databases OMIM, ClinVar, Orphanet, GeneReviews; gene-gene relationship databases HPRD, HGNC, Biosystem, HTRI | |
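Two of the tools listed above (Phen-Gen and PhenoRank) are described as using a random walk with restart (RWR) to score genes. The following is a minimal, generic Python sketch of RWR on a small undirected network. It is not the implementation used by either tool, and the gene names, network edges, and restart probability are hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for each node.

    adjacency: dict mapping node -> set of neighbour nodes (all must be keys).
    seeds:     nodes the walker jumps back to with probability `restart`.
    """
    nodes = sorted(adjacency)
    seeds = set(seeds)
    if not seeds or not seeds <= set(nodes):
        raise ValueError("seeds must be a non-empty subset of the network nodes")
    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Isolated node: return its probability mass to the seeds.
                for m in nodes:
                    nxt[m] += (1.0 - restart) * p[n] * p0[m]
                continue
            share = (1.0 - restart) * p[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical four-gene network; GENE_A plays the role of a known disease gene.
network = {
    "GENE_A": {"GENE_B", "GENE_C"},
    "GENE_B": {"GENE_A", "GENE_C"},
    "GENE_C": {"GENE_A", "GENE_B", "GENE_D"},
    "GENE_D": {"GENE_C"},
}
scores = random_walk_with_restart(network, seeds={"GENE_A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Genes that are well connected to the seed accumulate more probability mass, so sorting by score gives a proximity-based candidate ranking. The restart probability controls how local the ranking stays around the seeds.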