Kymberleigh A Pagel, Vikas Pejaver, Guan Ning Lin, Hyun-Jun Nam, Matthew Mort, David N Cooper, Jonathan Sebat, Lilia M Iakoucheva, Sean D Mooney, Predrag Radivojac.
Abstract
MOTIVATION: Loss-of-function genetic variants are frequently associated with severe clinical phenotypes, yet many are present in the genomes of healthy individuals. The available methods to assess the impact of these variants rely primarily upon evolutionary conservation with little to no consideration of the structural and functional implications for the protein. They further do not provide information to the user regarding specific molecular alterations potentially causative of disease.
Year: 2017 PMID: 28882004 PMCID: PMC5870554 DOI: 10.1093/bioinformatics/btx272
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Number of variants (proteins) present in each data set
| | Disease | Neutral | Total |
|---|---|---|---|
| Frameshift | 18 116 (1545) | 90 135 (13 427) | 108 251 (13 713) |
| Stop gain | 14 318 (1681) | 7960 (4990) | 22 278 (6137) |
| Total | 32 434 (1995) | 98 095 (13 605) | |
The set of canonical sequences was derived from UniProt (Suzek et al., 2007). The number of available stop-loss variants was too small to be included in this work.
Fig. 1. Illustration of the impacted portions of the protein for loss-of-function variants. The impacted region can be shorter or longer for the mutant protein (if translated); its length is zero for the stop gain variants.
Predicted structural and functional features
| Property category | Predicted features |
|---|---|
| Structure and dynamics | Three classes |
| Signal peptide and transmembrane regions | Seven classes—N- and C-termini of signal peptide, signal helix, signal peptide cleavage site, transmembrane segment, cytoplasmic and non-cytoplasmic loops |
| Enzyme activity | Catalytic residues |
| Regulation | Allosteric residues |
| Macromolecular binding | DNA |
| Metal-binding | Cd; Ca; Co; Cu; Fe; Mg; Mn; Ni; K; Na; Zn |
| Post-translational modification (PTM) | Acetylation, ADP-ribosylation, Amidation, Carboxylation, Disulfide linkage, Farnesylation, Geranylgeranylation, Glycosylation (C-linked, N-linked and O-linked), GPI anchor amidation, Hydroxylation, Methylation, Myristoylation, N-terminal acetylation, Palmitoylation, Phosphorylation, Proteolytic cleavage, Pyrrolidone carboxylic acid, Sulfation, SUMOylation, Ubiquitylation |
| Motifs | From PROSITE |
Indicates in-house predictors.
Fig. 2. Approximate probability density functions of an average conservation measure of the wildtype protein (A) at the amino side of the variant and (B) at the carboxyl side of the variant for pathogenic (blue) and neutral (orange) variants. To ensure clarity, we omit proteins that contain both pathogenic and neutral variants from this figure. Proteins harboring disease variants are generally more conserved at both sides of the variant.
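The quantity plotted in Fig. 2 can be illustrated with a minimal sketch that averages a per-residue conservation profile on each side of a variant. The function name, the 0-based position convention, the choice to count the variant residue on the carboxyl side, and the example scores are all assumptions made for illustration; the source does not specify how the measure is computed.

```python
from statistics import mean

def flanking_conservation(scores, variant_pos):
    """Return (amino_side_mean, carboxyl_side_mean) for a 0-based variant position.

    `scores` is a hypothetical per-residue conservation profile of the
    wild-type protein. Residues before `variant_pos` form the amino side;
    residues from `variant_pos` onward form the carboxyl side (an assumption).
    """
    amino = scores[:variant_pos]
    carboxyl = scores[variant_pos:]
    # An empty side (variant at the very start or end) has no defined average.
    amino_mean = mean(amino) if amino else None
    carboxyl_mean = mean(carboxyl) if carboxyl else None
    return amino_mean, carboxyl_mean

# Hypothetical profile: well conserved N-terminal half, poorly conserved C-terminal half.
profile = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
n_side, c_side = flanking_conservation(profile, 3)
```

Collecting these two averages over many variants, separately for pathogenic and neutral sets, would give the samples from which densities like those in panels (A) and (B) could be estimated.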
Fig. 3. Homology profiles for the types of loss-of-function variants. Each plot shows the average number of sequences affected by pathogenic and putatively neutral variants, within a particular range of global sequence identity against human and mouse genomes.
Fig. 4. Receiver operating characteristic (ROC) curves and areas under the ROC curve (AUC). (A) Cross-validation performance of MutPred-LOF with per-variant, per-protein, and per-cluster cross-validation; (B) Cross-validation performance of MutPred-LOF for frameshifting and stop gain variants separately; (C) The performance for other methods based upon the testing set. Black curves represent the performance of MutPred-LOFʹ. The dotted line represents performance of each model on the subset of variants from bi-class proteins; (D) Proportion of high-scoring de novo variants implicated in neurodevelopmental disorders in the case and control datasets based upon 5 and 10% false positive rate thresholds. The P-value derived from Fisher's exact test is shown above.
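The comparison in panel (D) can be sketched in two steps: pick a score cutoff that corresponds to a chosen false positive rate on neutral variants, then compare the counts of high-scoring variants in cases and controls with Fisher's exact test. The sketch below uses only the standard library; the function names and all scores are hypothetical, and the two-sided test shown is one common formulation rather than necessarily the one used in the paper.

```python
from math import comb

def fpr_threshold(neutral_scores, fpr):
    """Cutoff such that roughly a fraction `fpr` of neutral scores lie strictly above it."""
    ranked = sorted(neutral_scores, reverse=True)
    k = int(fpr * len(ranked))  # number of neutral variants allowed above the cutoff
    return ranked[k]

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical prediction scores, for illustration only.
neutral = [i / 20 for i in range(20)]
cutoff = fpr_threshold(neutral, 0.05)
case_scores = [0.95, 0.97, 0.50, 0.99]
control_scores = [0.20, 0.30, 0.96, 0.10]

case_high = sum(s > cutoff for s in case_scores)
control_high = sum(s > cutoff for s in control_scores)
p_value = fisher_exact_two_sided(
    case_high, len(case_scores) - case_high,
    control_high, len(control_scores) - control_high,
)
```

With tied scores at the cutoff the realized false positive rate can differ from the nominal one, so a real analysis would need an explicit tie-handling rule.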
Per-feature evaluation: top ten performing feature sets
| Feature set | AUC (full model) |
|---|---|
| Predicted GO Terms | 0.729 |
| Maximum conservation | 0.707 |
| Metal binding | 0.660 |
| Structure and dynamics | 0.652 |
| Enzyme activity | 0.645 |
| Regulation | 0.641 |
| Macromolecular binding | 0.633 |
| Homology counts | 0.614 |
| Post-translational modification | 0.611 |
| Signal peptide and transmembrane | 0.610 |
For each set of features, we train ensembles of neural networks with the same parameters in all models. The performance (AUC) of a model trained on a single feature set is used as an estimate of how that feature set performs on its own.
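The per-feature-set comparison can be sketched as follows, assuming that held-out prediction scores are already available for a model trained on each feature set. The sketch does not train neural network ensembles; it only computes a rank-based AUC and orders the feature sets by it. The feature set names `A` and `B`, the labels, and the scores are hypothetical.

```python
def auc(labels, scores):
    """Rank-based AUC: probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_feature_sets(labels, scores_by_feature_set):
    """Return (feature_set, AUC) pairs sorted from best to worst AUC."""
    results = {name: auc(labels, s) for name, s in scores_by_feature_set.items()}
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical held-out scores from two single-feature-set models.
labels = [1, 1, 0, 0]  # 1 = pathogenic, 0 = neutral
held_out_scores = {
    "B": [0.9, 0.2, 0.8, 0.1],
    "A": [0.9, 0.8, 0.3, 0.1],
}
ranking = rank_feature_sets(labels, held_out_scores)
```

The pairwise formulation is quadratic in the number of variants, which is fine for illustration; a sort-based computation would be preferable at the scale of the data sets described here.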