Hoan Nguyen, Tien-Dao Luu, Olivier Poch, Julie D Thompson.
Abstract
Understanding the effects of genetic variation on the phenotype of an individual is a major goal of biomedical research, especially for the development of diagnostics and effective therapeutic solutions. In this work, we describe the use of a recent knowledge discovery from database (KDD) approach using inductive logic programming (ILP) to automatically extract knowledge about human monogenic diseases. We extracted background knowledge from MSV3d, a database of all human missense variants mapped to 3D protein structure. In this study, we identified 8,117 mutations in 805 proteins with known three-dimensional structures that were known to be involved in human monogenic disease. Our results help to improve our understanding of the relationships between structural, functional or evolutionary features and deleterious mutations. Our inferred rules can also be applied to predict the impact of any single amino acid replacement on the function of a protein. The interpretable rules are available at http://decrypthon.igbmc.fr/kd4v/.
Keywords: SNP prediction; genotype-phenotype relation; human monogenic disease; inductive logic programming
Year: 2013 PMID: 23589683 PMCID: PMC3615990 DOI: 10.4137/BBI.S11184
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Main steps for an ILP application: (i) mutation selection from MSV3d, (ii) definition of negative/positive examples in the training set, (iii) background knowledge creation, (iv) selection of the ILP system, (v) selection of the ILP parameters (number of nodes, noise, etc.) and optimization of the predicates in the background knowledge, (vi) model evaluation using K-fold cross-validation, and (vii) the final rules used for interpretation.
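Step (vi) can be illustrated with a minimal sketch of a K-fold split. This is not the authors' evaluation code: the function name `k_fold_splits`, the round-robin assignment to folds and the mutation identifiers are all assumptions made for illustration, and no shuffling or class stratification is attempted.

```python
def k_fold_splits(examples, k=3):
    """Yield (train, test) pairs so that each example is tested exactly once."""
    # Round-robin assignment to folds (an assumption; real setups usually shuffle).
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


# Hypothetical mutation identifiers, not taken from MSV3d.
examples = ["m1", "m2", "m3", "m4", "m5", "m6"]
splits = list(k_fold_splits(examples, k=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Rules would be induced on each `train` part and scored on the matching `test` part, with the scores averaged over the K folds.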
Figure 2. Definition of neighbouring residues.
Notes: For the mutated residue, Asn180 of protein Q13496, a sphere of radius 10 Å is drawn with the residue at the centre. Any residues that lie within the sphere are defined as neighbours.
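The neighbour definition can be sketched as a simple distance test. The sketch below assumes each residue is represented by a single 3D coordinate and that the boundary of the sphere is included; the source states neither choice. The function name `neighbours_within` and all coordinates are hypothetical.

```python
import math


def neighbours_within(centre, residues, radius=10.0):
    """Return ids of residues whose coordinates lie within `radius` of `centre`."""
    return [
        res_id
        for res_id, coords in residues.items()
        if math.dist(centre, coords) <= radius  # boundary inclusion is an assumption
    ]


# Hypothetical coordinates, not taken from any real structure.
mutated = (0.0, 0.0, 0.0)
others = {
    "Leu181": (3.8, 0.0, 0.0),
    "Gly210": (6.0, 6.0, 6.0),
    "Arg300": (12.0, 0.0, 0.0),
}
print(neighbours_within(mutated, others))
```

With these made-up coordinates only `Leu181` falls inside the sphere; `Gly210` lies about 10.4 units from the centre and is excluded.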
Figure 3. Mutation data model.
Notes: Each missense mutation is characterised by physico-chemical features (size, charge, polarity, hydrophobicity, etc.), evolutionary information and 3D structural features. In addition, it may have one or more neighbouring residues, each of which belongs to a single class, based on Koolman’s classification.
Figure 4. Construction of background knowledge from MSV3d.
Notes: Each mutation in the database is identified by a unique identifier ‘id’ and the values of each. Modeh defines the head of a hypothesised clause, while Modeb declares the predicates that can occur in the body of a hypothesised clause. The asterisk * in the mode declarations indicates that the corresponding predicate can be called many times during the construction of a hypothesised clause.
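As a rough illustration of what such mode declarations look like, the sketch below formats a few of them as Aleph directives. The helper `mode_declaration`, the head predicate name `deleterious` and the recall value 1 for the head are assumptions for illustration; the argument modes follow Aleph's convention, where `+` marks an input variable, `-` an output variable and `#` a constant.

```python
def mode_declaration(kind, recall, predicate, args):
    """Format one Aleph mode declaration as a Prolog directive string."""
    if kind not in ("modeh", "modeb"):
        raise ValueError("kind must be 'modeh' or 'modeb'")
    return f":- {kind}({recall}, {predicate}({', '.join(args)}))."


declarations = [
    # 'deleterious' is a hypothetical head predicate name.
    mode_declaration("modeh", 1, "deleterious", ["+mutationid"]),
    # '*' lets the predicate be called many times in a clause body.
    mode_declaration("modeb", "*", "modif_size", ["+mutationid", "#value"]),
    mode_declaration("modeb", "*", "conservation_wt", ["+mutationid", "-conservationwt"]),
]
print("\n".join(declarations))
```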
Predicates used as background knowledge.
| Category | Predicate | Description |
| --- | --- | --- |
| Physico-chemical changes induced by the substitution | modif_size(+mutationid, #value) | Size, charge, polarity and hydrophobicity modifications |
| | g_p(+mutationid, #gp) | Loss or gain of a glycine or proline |
| Evolutionary features | conservation_wt(+mutationid, −conservationwt) | Percentage of the wild type residue in the alignment column |
| | conservation_mut(+mutationid, −conservationmut) | Percentage of the mutant residue in the alignment column |
| | freq_at_pos(+mutationid, −freqatpos) | Number of known mutations at this position |
| | cluster_5res_size(+mutationid, −cluster5ressize) | Number of mutations at a distance of less than 5 residues in the sequence |
| Structural features | secondary_struc(+mutationid, #secondary_struc) | Secondary structure element (helix, sheet, no) |
| | gain_contact(+mutationid, −gaincontact) | Contacts between the wild type residue and its direct 3D neighbours, based on the wild type 3D model; contacts between the mutant residue and its direct 3D neighbours, based on the mutant 3D model |
| | gain_n1_contact(+mutationid, −gainn1contact) | Contacts between residues in contact with the wild type residue and their direct 3D neighbours, based on the wild type 3D model; contacts between residues in contact with the mutant residue and their direct 3D neighbours, based on the mutant 3D model |
| | wt_accessibility(+mutationid, −wtacc) | Accessibility of the wild type/mutant residue |
| | cluster3d_10(+mutationid, −cluster3d10) | Number of mutations in the 3D cluster at 10, 20 and 30 Å |
| | stability_decrease(+mutationid) | The change in protein relative stability upon mutation |
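To make one of these features concrete, here is a minimal sketch of how a value for `cluster_5res_size` might be computed from sequence positions. It assumes that the mutation's own position is not counted and that "less than 5 residues" is a strict inequality; the source does not specify either, and the positions are hypothetical.

```python
def cluster_5res_size(position, mutation_positions, window=5):
    """Count other known mutation positions closer than `window` residues in sequence."""
    return sum(
        1
        for other in mutation_positions
        # Excluding the position itself is an assumption.
        if other != position and abs(other - position) < window
    )


# Hypothetical sequence positions of known mutations in one protein.
known_positions = [176, 178, 180, 184, 185, 240]
print(cluster_5res_size(180, known_positions))
```

For position 180 the count is 3 (positions 176, 178 and 184); position 185 is exactly 5 residues away and is excluded under the strict inequality.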
Results of 3-fold cross-validation comparing different values of the noise parameter.
| Noise = 0.5% | 87.97 | 50.89 | 76.25 |
| Noise = 0.75% | 87.79 | 50.89 | 76.10 |
| Noise = 1% | 89.34 | 46.97 | 75.65 |
| Noise = 2% | 90.52 | 45.26 | 75.72 |
| Noise = 3% | 92.08 | 42.08 | 75.47 |
| Noise = 4% | 91.64 | 43.67 | 75.83 |
Note: Gmean = geometric mean of accuracies.
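A geometric mean of two accuracies is the square root of their product. The sketch below assumes the two accuracies are those measured on the positive and negative examples, which the note does not state; the input values are hypothetical and are not taken from the table above.

```python
import math


def gmean(positive_accuracy, negative_accuracy):
    """Geometric mean of two accuracies given in percent."""
    return math.sqrt(positive_accuracy * negative_accuracy)


# Hypothetical accuracies (percent).
print(gmean(90.0, 40.0))
```

Unlike the arithmetic mean (65.0 for these inputs), the geometric mean (60.0) penalises a large gap between the two classes, which makes it a common choice for imbalanced data.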
Figure 5. Part of a screenshot with four induced rules obtained using Aleph with noise = 0.5%, minpos = 5, nodes = 50,000.
Notes: Users can click on the + icon to see the covered examples. The keyword “sub_family_conservation” was used as a filter in this screenshot.
The top five predicates found in the rules defining deleterious or neutral mutations.
| secondary_struc | 12.7% |
| conservation_class | 11.0% |
| modif_charge | 7.6% |
| cluster_5res_size | 6.3% |
| conservation_wt | 6.1% |
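A ranking of this kind can be obtained by counting how often each predicate occurs across the induced rules. The sketch below assumes the percentages are shares of all predicate occurrences, which the source does not confirm; the rules are hypothetical and are represented simply as lists of predicate names.

```python
from collections import Counter


def predicate_shares(rules):
    """Return each predicate's share (percent) of all predicate occurrences."""
    counts = Counter(pred for rule in rules for pred in rule)
    total = sum(counts.values())
    # most_common() orders the predicates from most to least frequent.
    return {pred: 100.0 * n / total for pred, n in counts.most_common()}


# Hypothetical rule bodies, not the rules induced in the study.
rules = [
    ["secondary_struc", "conservation_wt"],
    ["secondary_struc", "modif_charge"],
    ["secondary_struc"],
]
shares = predicate_shares(rules)
for pred, share in shares.items():
    print(f"{pred}: {share:.1f}%")
```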