Yitian Zhou, Kohei Fujikura, Souren Mkrtchian, Volker M Lauschke.
Abstract
Up to half of all patients do not respond to pharmacological treatment as intended. A substantial fraction of these inter-individual differences is due to heritable factors, and a growing number of associations between genetic variations and drug response phenotypes have been identified. Importantly, the rapid progress in Next Generation Sequencing technologies in recent years has unveiled the true complexity of the genetic landscape in pharmacogenes, with tens of thousands of rare genetic variants. As each individual was found to harbor numerous such rare variants, they are anticipated to be important contributors to the genetically encoded inter-individual variability in drug effects. The fundamental challenge, however, is their functional interpretation: the sheer scale of the problem renders systematic experimental characterization of these variants currently unfeasible. Here, we review concepts and important progress in the development of computational prediction methods that allow the effects of amino acid sequence alterations in drug metabolizing enzymes and transporters to be evaluated. In addition, we discuss recent advances in the interpretation of functional effects of non-coding variants, such as variations in splice sites, regulatory regions and miRNA binding sites. We anticipate that these methodologies will provide a useful toolkit to facilitate the integration of the vast extent of rare genetic variability into drug response predictions in a precision medicine framework.
Keywords: ADME; NGS; noncoding variation; personalized medicine; pharmacogenomics; precision medicine; rare variant analysis; variant effect prediction
Year: 2018 PMID: 30564131 PMCID: PMC6288784 DOI: 10.3389/fphar.2018.01437
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.810
Figure 1. Overview of features that can be assessed by current computational prediction methods. Different parameters and features are assessed for genetic variants depending on whether they are localized in putatively regulatory sequences, untranslated regions (UTR) of the gene, its coding sequences (CDS) or within introns. ESE/ESS, exonic splicing enhancer/silencer; ISE/ISS, intronic splicing enhancer/silencer; NMD, nonsense-mediated decay; RBP, RNA binding protein.
Methods to predict the functional effect of missense variants based on sequence information.
| Algorithm | Model type | Description | Training or evaluation data | Reference |
| SIFT | Direct | Prediction of functionality based on sequence conservation metrics that make use of Dirichlet priors | Variants from protein specific studies (LacI, HIV-1 Protease and Bacteriophage T4 Lysozyme) | Ng and Henikoff, |
| PANTHER | HMM | Sequence conservation analysis using HMM | Variants from HGMD and dbSNP as deleterious and functionally neutral variants, respectively | Thomas et al., |
| MAPP | Direct | Quantification of the physicochemical characteristics at each position of the amino acid sequence based on observed evolutionary variation | Protein specific studies (LacI, HIV-1 Protease, HIV reverse transcriptase and Bacteriophage T4 Lysozyme) | Stone and Sidow, |
| PhastCons | HMM | Identification of conserved elements using a two-state phylogenetic HMM | Calibration on genomes from four model species (human, D. melanogaster, C. elegans, and S. cerevisiae) | Siepel et al., |
| SNPs3D | SVM | Variant effect prediction based on amino acid sequence conservation metrics and folded state stability of protein structure | Variants from HGMD and dbSNP as deleterious and functionally neutral variants, respectively | Yue et al., |
| PhD-SNP | SVM | Prediction of variant pathogenicity based on sequence profiles | Variants from HumVar and HumVarProf datasets | Capriotti et al., |
| SiPhy | HMM | Sequence conservation analysis using HMM | ENCODE Phase I regions | Garber et al., |
| LRT | Direct | Evolutionary conservation model across 32 vertebrates | Variants in three sequenced human genomes | Chun and Fay, |
| SNPs&GO | SVM | Variant effect prediction based on sequence information, evolutionary conservation and defined gene ontology score | Variants from SwissProt | Calabrese et al., |
| B-SIFT | Direct | Sequence conservation metrics that calculate the difference between wild-type and mutant allele | Variants from SwissProt database and protein specific study (Dnase I) | Lee et al., |
| PolyPhen-2 | NB | Sequence conservation and structural parameters such as hydrophobic propensity and B factor | Variants from HumDiv and HumVar from UniProt Database | Adzhubei et al., |
| MutationTaster | NB | Prediction of mutation pathogenicity based on evolutionary conservation, splice-site changes, loss of protein features and changes that affect expression levels | Variants from OMIM database, HGMD and the literature as pathogenic set and neutral variants from dbSNP as controls | Schwarz et al., |
| MutationAssessor | Direct | Evolutionary conservation patterns within protein families and across species using combinatorial entropy | Variants from UniProt database (HumSaVar) | Reva et al., |
| Condel | Direct | Integration of five algorithms (SIFT, PolyPhen-2, MAPP, MutationAssessor, and Log R Pfam E-value) into a single output score | Variants from HumVar, HumDiv, Cosmic database, IARC TP53 database | González-Pérez and López-Bigas, |
| PROVEAN | Direct | Alignment-based score that can also assess in-frame insertions, deletions, and multiple amino acid substitutions | Missense variants and indels, replacements from UniProt database | Choi et al., |
| FATHMM | HMM | Identification of pathogenic variants based on sequence conservation, protein domain-based information and species-specific pathogenicity weights. Also suitable for prediction of non-coding variations. | Variants from the HGMD and Uniprot databases | Shihab et al., |
| VEST | RF | Prioritization of variants underlying Mendelian diseases | Rare variants from HGMD database as pathogenic set and variants from ESP | Carter et al., |
| Evolutionary Action | Direct | Prediction of variant effects on evolutionary fitness using a formal genotype-phenotype perturbation equation | Variants from 1000 Genomes Project | Katsonis and Lichtarge, |
| MetaSVM | SVM | Ensemble score integrating nine functionality predictors (SIFT, PolyPhen-2, GERP++, MutationTaster, MutationAssessor, FATHMM, LRT, SiPhy and PhyloP) | Variants causing Mendelian diseases as pathogenic set and variants that are not associated with any phenotypes as controls, all from Uniprot database | Dong et al., |
| MetaLR | RM | Same as MetaSVM but using logistic regression instead of SVM. | | Dong et al., |
| SuSPect | SVM | Sequence conservation metrics, structure features and additional network information | Variants from Humsavar database | Yates et al., |
| PredictSNP | EL | Ensemble score integrating six functionality predictors (MAPP, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP) | Variants mainly from SwissProt, HGMD, dbSNP and Humsavar database | Bendl et al., |
| SNAP2 | NN | Prediction of amino acid variations based on amino acid properties, predicted binding residues, predicted disordered and low-complexity regions, proximity to N- and C-terminus, statistical contact potentials, co-evolving positions, secondary structure and solvent accessibility | Variants from PMD, Swiss-Prot, OMIM, HumVar and protein specific data sets (LacI) | Hecht et al., |
| REVEL | RF | Ensemble method tailored specifically for the prediction of rare genetic variant effects integrating MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons | Variants from HGMD as pathogenic set and neutral variants from ESP as controls | Ioannidis et al., |
| ConSurf | Empirical Bayesian method and maximum likelihood estimation | Mapping of evolutionarily conserved residues on protein surfaces by estimating the evolutionary rates of each nucleic acid and amino acid sequence position using multiple sequence alignments. Also offers RNA secondary structure predictions. | Protein with at least five known 3D structure homologs and precise annotation of their functional sites (with different nature) | Ashkenazy et al., |
| VIPUR | RM | Combination of sequence- and structure-based features to identify and functionally interpret deleterious variants | Variants from HumDiv and UniProt with clear evidence of protein disruption | Baugh et al., |
| Envision | GTB | Decision tree ensemble-based tool using a stochastic gradient boosting learning algorithm | Variants from nine large-scale experimental mutagenesis datasets in eight proteins | Gray et al., |
| EVmutation | Direct | Unsupervised method exploiting sequence conservation by incorporating interaction information between all pairs of residues in protein | 34 data sets from 21 proteins and a tRNA gene extracted from 27 publications | Hopf et al., |
| PredSAV | GTB | Identification of pathogenic variants based on sequence, structure, residue-contact networks as well as structural neighborhood features | Human variants from Uniprot and OMIM as pathogenic set and Ensembl variants as neutral controls | Pan et al., |
| SNPMuSiC | NN | Structure stability-based prediction implementing PoPMuSiC and HoTMuSiC on the basis of 13 statistical potentials (distance potentials, solvent accessibility potentials and torsion potentials) and 2 biophysical characteristics (solvent accessibility of the mutated residue and difference in volume) | Variants from dbSNP, SwissVar and HumSaVar datasets | Ancien et al., |
| DEOGEN2 | RF | Integration of 11 scores and metrics into one meta-score, considering evolutionary features, folding predictions, domain information as well as gene features to identify deleterious variants | Training and test on variants from the UniProt Humsavar16 dataset | Raimondi et al., |
| ADME prediction framework | Direct | Integration of prediction scores from five orthogonal algorithms (LRT, MutationAssessor, PROVEAN, VEST3 and CADD) using parameters optimized for pharmacogenes | Training and validation specifically on experimentally characterized pharmacogenetic data sets from 43 ADME genes | Zhou et al., |
HMM, hidden Markov model; SVM, support vector machine; NB, naïve Bayes classifier; EL, ensemble learning; RF, random forest; RM, regression model; NN, neural networks; GTB, gradient tree boosting; HGMD, Human Gene Mutation Database; OMIM, Online Mendelian Inheritance in Man; ESP, Exome Sequencing Project; PMD, Protein Mutant Database.
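Several entries in the table above (for example Condel, MetaSVM and the ADME prediction framework) combine the outputs of multiple predictors into one score. The following is a minimal sketch of that general idea using simple majority-style voting. It is not the method of any listed tool: the predictor names, thresholds and score directions are hypothetical placeholders chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predictor:
    name: str
    threshold: float          # hypothetical cutoff, not a published value
    higher_is_damaging: bool  # score direction differs between tools


def is_damaging(predictor: Predictor, score: float) -> bool:
    """Binarize one raw score using the predictor's own cutoff and direction."""
    if predictor.higher_is_damaging:
        return score >= predictor.threshold
    return score <= predictor.threshold


def ensemble_score(predictors: list[Predictor], scores: dict[str, float]) -> float:
    """Fraction of predictors with an available score that call the variant damaging."""
    votes = [is_damaging(p, scores[p.name]) for p in predictors if p.name in scores]
    if not votes:
        raise ValueError("no scores available for this variant")
    return sum(votes) / len(votes)


# Hypothetical predictors; a real framework would tune each cutoff on
# experimentally characterized variants.
PREDICTORS = [
    Predictor("tool_a", 0.5, True),
    Predictor("tool_b", -2.5, False),
    Predictor("tool_c", 20.0, True),
]

variant_scores = {"tool_a": 0.9, "tool_b": -4.0, "tool_c": 5.0}
print(ensemble_score(PREDICTORS, variant_scores))
```

Predictors without a score for a given variant are skipped rather than counted as neutral, so the result is always a fraction of the tools that actually returned a value.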
Methods to predict the functional effect of missense variants based primarily on structural features.
| Algorithm | Model type | Description | Training or evaluation data | Reference |
| SDM | Direct | Predicts variant effects on thermal protein stability using conformationally constrained environment-specific substitution tables derived from 2,054 protein family sequence and structure alignments from the TOCCATA database | Validated on 2,690 SNVs from 132 different protein structures. | Topham et al., |
| I-Mutant | SVM | Protein structure or sequence-based prediction of point mutation effects on protein stability | Training and testing on thermodynamic experimental data of free energy changes of protein stability upon mutation from the ProTherm database | Capriotti et al., |
| HOPE | Direct | Analyzes the structural and functional effects of point mutations based on available crystal structures, homology modeling and sequence information. | Evaluated using case studies. | Venselaar et al., |
| mCSM | RM | Translation of distance patterns between atoms into graph-based signatures providing data that is complementary to potential energy based approaches | Prediction of protein stability changes, protein-protein and protein-nucleic acid interactions and pathogenicity based on an array of preexisting experimental data sets | Pires et al., |
| DUET | SVM | SVM predictor that integrates mCSM and SDM in a consensus prediction | Benchmarking against mCSM and SDM alone on a p53 data set. | Pires et al., |
| STRUM | GTB | Predicts variant effects on protein stability based on 3D models constructed by iterative threading assembly refinement simulations | Evaluated on 3,421 experimentally determined mutations distributed across 150 proteins. | Quan et al., |
| ELASPIC | GTB | Predicts effects of mutations on protein folding and protein–protein interactions using homology modeling of domains and domain–domain interactions | Performance analysis via case study using EP300 mutations found in COSMIC | Witvliet et al., |
| SAAFEC | RM | Prediction of effects of amino acid changes on folding free energy using a Molecular Mechanics Poisson-Boltzmann approach | Training and testing on thermodynamic experimental data of free energy changes of protein stability upon mutation from the ProTherm database | Getov et al., |
SVM, support vector machine; RM, regression model; GTB, gradient tree boosting.
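Many of the structure-based tools above report a change in folding free energy upon mutation. As a rough illustration of how such a value can be turned into a qualitative call, the sketch below computes the difference between mutant and wild-type folding free energies and bins it. The sign convention (positive values are destabilizing) and the 1.0 kcal/mol neutral band are assumptions made here; individual tools use their own conventions and cutoffs.

```python
def ddg(dg_wild_type: float, dg_mutant: float) -> float:
    """Change in folding free energy (kcal/mol), mutant minus wild type.

    Assumes folding free energies where more negative means more stable,
    so a positive result indicates destabilization.
    """
    return dg_mutant - dg_wild_type


def classify_stability(ddg_value: float, neutral_band: float = 1.0) -> str:
    """Bin a free energy change; neutral_band is an illustrative cutoff."""
    if ddg_value > neutral_band:
        return "destabilizing"
    if ddg_value < -neutral_band:
        return "stabilizing"
    return "neutral"


change = ddg(dg_wild_type=-8.0, dg_mutant=-6.5)
print(change, classify_stability(change))
```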
Tools for the prediction of variant effects on splicing, transcript levels or translation.
| Algorithm | Prediction target | Description | Training or evaluation data | Reference |
| NMD Classifier | NMD | Prediction of NMD for a given transcript based on comparison to most similar coding transcript | Simulation-based evaluation based on screening artificial transcript structure-altering events | Hsu et al., |
| NNSplice | Splicing (splice sites) | Sequence splice site analysis using HMM | Distinguish splice site sequences from sequences in the neighborhood of real splice sites | Reese et al., |
| MaxEntScan | Splicing (splice sites) | Splice site analysis by modeling short sequence motifs using the maximum entropy principle with constraints estimated from available data. | 1,821 transcripts unambiguously aligned across the entire coding region, spanning a total of 12,715 introns | Yeo and Burge, |
| GeneSplicer | Splicing (splice sites) | Splice site prediction using maximal dependence decomposition with the addition of a Markov model to capture dependencies among neighboring bases | Annotated genes from the Exon-Intron Database | Pertea et al., |
| SplicePort | Splicing (splice sites) | Splice site prediction using C-modified least squares learning based on positional and compositional sequence features | Training on 4,000 pre-mRNA human RefSeq sequences and test on B2Hum data set | Dogan et al., |
| Skippy | Splicing (regulatory sequences) | Prediction of variants causing exon skipping, exon inclusion or ectopic splice site activation based on sequence information, proximity to splice junctions and evolutionary constraint of the peri-variant region | Multiple exonic splicing regulatory elements datasets as positive data and HapMap variants as splicing-neutral variants | Woolfe et al., |
| MutPred Splice | Splicing (regulatory sequences) | Prediction of auxiliary splice sequences using multiple variant-, flanking exon- and gene-based features | Splicing variants from HGMD as pathogenic set and non-splicing variants from both HGMD and 1000G as neutral controls | Mort et al., |
| scSNVEL | Splicing (splice sites) | Ensemble prediction combining 8 algorithms using random forest learning | Splice variants from HGMD, SpliceDisease database and DBASS as pathogenic set and variants not implicated in splicing from both HGMD and 1000G as controls | Jian et al., |
| SPANR | Splicing (splice sites and splice regulatory sequences) | Integrating 1,393 sequence features from each exon and its neighboring introns and exons to identify splice sites as well as intronic and exonic splice regulators | 10,689 exons that displayed evidence of alternative splicing | Xiong et al., |
| CryptSplice | Splicing (splice sites) | Prediction of cryptic splice-site activation using an SVM model | Sequences from the annotated NN269 and HS3D splice datasets with positive sequence in splice sites and control sequence outside splice sites | Lee et al., |
| Corvelo | Splicing (branch points) | Analysis of splice site sequence conservation and position bias using SVM | A set of 8,156 conserved putative branch point sequences from 7 mammalian species | Corvelo et al., |
| BPP | Splicing (branch points) | Identification of branch point motifs by integrating information on the branch point sequence and the polypyrimidine tract | Intron sequences longer than 300 nucleotides | Zhang et al., |
| TurboFold | Splicing (pre-mRNA structure) | Probabilistic method that integrates comparative sequence analyses with thermodynamic folding models | Thorough benchmarking against three methods that estimate base pairing probabilities and eight tools for structural predictions based on known RNA structures | Harmanci et al., |
| CentroidFold | Splicing (pre-mRNA structure) | RNA secondary structure prediction using the γ-centroid estimator | Validation based on 151 experimentally determined RNA structures | Sato et al., |
| mrSNP | miRNA binding | miRNA binding energy calculations for reference and variant containing sequence and report of binding difference | Evaluation based on variants that map to miRNA targets predicted by TargetScan | Deveci et al., |
| PinPor | RBP binding | Bayesian network approach that incorporates information about sequence features, stabilization of RNA secondary structure and evolutionary conservation | Inframe indels from HGMD as pathogenic and common indels from 1000G as neutral controls | Zhang et al., |
HGMD, Human Gene Mutation Database; 1000G = 1000 Genomes Project; DBASS, Database for Aberrant Splice Sites; NMD, nonsense-mediated decay; HMM, hidden Markov model; RBP, RNA binding protein.
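Splice-site tools in the table above score short sequence motifs and compare the reference and variant alleles. The sketch below illustrates that comparison with a plain position weight matrix (PWM), which treats positions as independent. This is a deliberately simpler technique than the maximum entropy or Markov models used by the listed tools, and the four training sequences are toy placeholders rather than real annotated donor sites.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites: list[str], pseudocount: float = 1.0, background: float = 0.25):
    """Build a log2-odds PWM from equal-length example sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for position in range(length):
        counts = Counter(site[position] for site in sites)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, sequence: str) -> float:
    """Sum of per-position log-odds for a sequence of the PWM's length."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence length must match the PWM length")
    return sum(column[base] for column, base in zip(pwm, sequence))


def delta_score(pwm, reference: str, variant: str) -> float:
    """Variant minus reference score; negative values suggest a weakened site."""
    return score(pwm, variant) - score(pwm, reference)


# Toy donor-like sequences for illustration only.
TOY_DONOR_SITES = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]

pwm = build_pwm(TOY_DONOR_SITES)
print(delta_score(pwm, "CAGGTAAGT", "CAGATAAGT"))
```

The pseudocount keeps unseen bases from producing a log of zero, which matters when the example set is as small as the one used here.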
Algorithms for the functional interpretation of regulatory variants.
| Algorithm | Model type | Prediction target | Training data or approach | Features | Reference |
| FATHMM | HMM | Pathogenic variants | HGMD regulatory variants as pathogenic set and common 1000G variants as controls | Evolutionary conservation data (PhastCons and PhyloP), chromatin accessibility (DNase-HSS and FAIRE-Seq), TF binding and histone modification ChIP-Seq data, genome segmentation, frequency data (1000G and ESP) and information about genic and sequence context | Shihab et al., |
| GWAVA | RF | Pathogenic variants | HGMD regulatory variants as pathogenic set and common 1000G variants as controls | Evolutionary conservation data (GERP), chromatin accessibility (DNase-HSS and FAIRE-Seq), TF binding and histone modification ChIP-Seq data, genome segmentation, frequency data (1000G) and information about genic and sequence context | Ritchie et al., |
| CADD | SVM | Deleterious variants | Sites with MAF < 5% at which the human genome differed from the inferred human-chimp ancestral genome and an equal number of simulated variants | Evolutionary conservation data (GERP++, PhastCons and PhyloP), chromatin accessibility (DNase-HSS and FAIRE-Seq), TF binding and histone modification ChIP-Seq data, genome segmentation, frequency data (1000G and ESP) and information about genic and sequence context | Kircher et al., |
| DANN | NN | Deleterious variants | Same as CADD but using deep neural networks instead of linear SVM. | | Quang et al., |
| DeepSEA | NN | Variants that affect gene expression | HGMD regulatory variants, eQTLs and NHGRI GWAS phenotype-associated SNPs | Evolutionary conservation data (GERP++, PhastCons and PhyloP), chromatin accessibility (DNase-HSS and FAIRE-Seq), TF binding and histone modification ChIP-Seq data | Zhou and Troyanskaya, |
| gkm-SVM | SVM | Variants that affect gene expression | Tissue-specific enhancer sequences marked by H3K4me1 from length-, GC content- and repeat-matched random control | Definition of tissue-specific regulatory dictionary based on chromatin accessibility (DNase-HSS) and H3K4me1 ChIP-Seq data | Lee et al., |
| fitCons | INSIGHT | Prediction of | Unsupervised classifier that clusters genomic regions on the basis of functional genomic data and then estimates a probability of fitness consequences for each group from associated patterns of genetic polymorphism and divergence. | Evolutionary conservation data (GERP, PhastCons and PhyloP), chromatin accessibility (DNase-HSS), TF binding and histone modification ChIP-Seq data, genome segmentation and RNA-Seq data | Gulko et al., |
| GenoCanyon | US | Identification of functional regions | Unsupervised classifier based on the estimated proportion of functional regions in the human genome. | Evolutionary conservation data (GERP and PhyloP), chromatin accessibility (DNase-HSS and FAIRE-Seq), TF binding and histone modification ChIP-Seq data | Lu et al., |
| DIVAN | EL | Disease-specific risk variants | Disease-specific regulatory NHGRI GWAS SNPs and common 1000G variants or benign GWAS SNPs as controls | Chromatin accessibility (DNase-HSS and FAIRE-Seq), TF binding and histone modification ChIP-Seq data | Chen et al., |
| Genomiser | RF | Mendelian disease | Sites with MAF < 5% at which the human genome differed from the inferred human-chimp ancestral genome as functionally neutral variation and 453 positive variants based on literature review | Evolutionary conservation data (GERP++, PhastCons and PhyloP), chromatin accessibility (DNase-HSS), TF binding and histone modification ChIP-Seq data, frequency data (1000G and ESP) and information about enhancer context from FANTOM5 | Smedley et al., |
| Eigen | US | Effect of variants on gene expression and disease risk | Unsupervised classifier based on the blockwise conditional independence between annotations given the functional impact of the variant. | Evolutionary conservation data (GERP, PhastCons and PhyloP), chromatin accessibility (DNase-HSS and FAIRE-Seq), TF binding and histone modification ChIP-Seq data and frequency data (1000G) | Ionita-Laza et al., |
RF, random forest; SVM, support vector machine; HMM, hidden Markov model; EL, ensemble learning; NN, neural networks; US, unsupervised learning; INSIGHT, Inference of Natural Selection from Interspersed Genomically Coherent Elements (Gronau et al.).
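The name gkm-SVM refers to gapped k-mer features, that is, short sequence words in which only some positions are informative. As a hedged illustration of the feature extraction step alone (no SVM is trained here), the sketch below counts gapped k-mers in a sequence and lists the patterns whose counts differ between a reference and a variant allele. The window length, number of informative positions and example sequences are arbitrary choices for demonstration.

```python
from collections import Counter
from itertools import combinations


def gapped_kmers(sequence: str, word_length: int = 4, informative: int = 3) -> Counter:
    """Count gapped k-mers: windows of word_length with 'informative' kept positions.

    Positions that are not kept are masked with 'N'.
    """
    if not 0 < informative <= word_length:
        raise ValueError("informative must be between 1 and word_length")
    counts: Counter = Counter()
    for start in range(len(sequence) - word_length + 1):
        window = sequence[start:start + word_length]
        for keep in combinations(range(word_length), informative):
            pattern = "".join(
                window[i] if i in keep else "N" for i in range(word_length)
            )
            counts[pattern] += 1
    return counts


def changed_features(reference: str, variant: str, **kwargs) -> dict[str, int]:
    """Patterns whose counts differ between alleles (variant minus reference)."""
    ref_counts = gapped_kmers(reference, **kwargs)
    var_counts = gapped_kmers(variant, **kwargs)
    patterns = set(ref_counts) | set(var_counts)
    diffs = {p: var_counts[p] - ref_counts[p] for p in patterns}
    return {p: d for p, d in diffs.items() if d != 0}


print(sorted(gapped_kmers("ACGT").items()))
print(changed_features("ACGTA", "ACCTA"))
```

Because masked positions tolerate mismatches, a single-nucleotide change alters only the patterns that keep the affected position, which is what makes this representation useful for comparing alleles.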
Figure 2. The past, present and future of pharmacogenetic phenotype predictions. (A) Conventionally, pharmacogenetic predictions were based on the interrogation of few common candidate SNPs, whose functional effects were predicted based on extensive literature evidence, resulting in high predictive accuracy but only few considered variations. (B) With increasing prevalence of whole exome sequencing (WES), a multitude of pharmacogenetic variants with unknown functional relevance are identified. These variants can be interpreted using computational methods. However, current algorithms are generally trained to detect the pathogenicity rather than the functionality of queried variants, resulting in overall relatively low predictive accuracy. Furthermore, only effects of missense and nonsense variants are evaluated. (C) In the near future, whole genome sequencing (WGS) will become the predominant genotyping methodology, revealing not only coding variants but also variants in regulatory regions and introns. To facilitate interpretation of these data, we envision that pharmacogenetic predictors will be directly trained on functionally annotated ADME data sets. Emerging technologies, such as deep mutational scanning for the systematic interrogation of missense variants or mutagenesis screens in microphysiological systems (MPS) for the characterization of variants in regulatory regions, provide powerful tools to generate these data, boosting the predictive performance of data-hungry machine learning tools. These advances make it possible to go beyond the interpretation of missense and nonsense variants and to include non-coding and regulatory variations in pharmacogenetic assessments.