Pier Luigi Martelli, Piero Fariselli, Eva Balzani, Rita Casadio.
Abstract
BACKGROUND: Various computational methods are currently available to classify a protein variation as disease-associated or not. However, data derived from recent technological advances make it feasible to extend the annotation of disease-associated variations to include specific phenotypes. Here we tackle the problem of distinguishing between genetic variations associated with cancer and variations associated with other genetic diseases.
Year: 2012 PMID: 22759656 PMCID: PMC3372458 DOI: 10.1186/1471-2164-13-S4-S8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Dataset of variations adopted for training/testing the method
| Disease type | Number of proteins | Number of variations (cancer) | Number of variations (other diseases) |
|---|---|---|---|
| Cancer | 77 | 689 | - |
| Other diseases | 495 | - | 5026 |
| Cancer and other diseases | 20 | 358 | 405 |
| Total | 592 | 1047 | 5431 |
Prediction of the disease type by protein similarity
| % Id | % prot | Sp(C) | Sp(N) | Sn(C) | Sn(N) | MCC | Q2 |
|---|---|---|---|---|---|---|---|
| ≤30 | 10 | 0.13 | 0.94 | 0.25 | 0.87 | 0.09 | 0.83 |
| ≤40 | 75 | 0.26 | 0.85 | 0.29 | 0.83 | 0.11 | 0.74 |
| ≤50 | 95 | 0.28 | 0.87 | 0.41 | 0.80 | 0.18 | 0.73 |
| ≤60 | 99 | 0.34 | 0.88 | 0.47 | 0.82 | 0.26 | 0.76 |
| ≤70 | 100 | 0.35 | 0.89 | 0.46 | 0.83 | 0.26 | 0.77 |
| ≤80 | 100 | 0.36 | 0.89 | 0.48 | 0.83 | 0.29 | 0.78 |
| ≤90 | 100 | 0.36 | 0.89 | 0.48 | 0.83 | 0.29 | 0.78 |
% prot = percentage of proteins that can be annotated at a given similarity threshold. % Id = threshold on the sequence identity of the best hit retrieved by a BLAST search against our dataset. For a definition of classes and scoring indexes see section: Measuring the performance.
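The scoring indexes reported in these tables can be computed from a two-class confusion matrix. The sketch below is illustrative only: the function name `binary_scores` is hypothetical, and the definitions are assumptions, since the section "Measuring the performance" is not reproduced here. In particular, it assumes that Sn is the fraction of a class that is correctly recovered and that Sp is the fraction of predictions for a class that are correct (a precision-like index); the original definitions may differ.

```python
import math


def binary_scores(tp, fp, tn, fn):
    """Return Q2, MCC and per-class Sn/Sp for a two-class confusion matrix.

    Class C (cancer) is treated as the positive class and class N as the
    negative class. All definitions here are assumptions, not taken from
    the source document.
    """
    def ratio(num, den):
        # Guard against empty classes or empty prediction sets.
        return num / den if den else 0.0

    total = tp + fp + tn + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Q2": ratio(tp + tn, total),            # overall accuracy
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
        "Sn(C)": ratio(tp, tp + fn),            # coverage of class C
        "Sp(C)": ratio(tp, tp + fp),            # correct fraction of C predictions
        "Sn(N)": ratio(tn, tn + fp),            # coverage of class N
        "Sp(N)": ratio(tn, tn + fn),            # correct fraction of N predictions
    }


if __name__ == "__main__":
    # Made-up counts, used only to show the call.
    for name, value in binary_scores(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.2f}")
```

With unbalanced classes, as in this dataset, Q2 can stay high while MCC and Sp(C) are low, which is why the tables report all of these indexes together.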
Prediction of the disease type by protein function
| GO sub-ontology | % var | Q | MCC | Sp(C) | Sn(C) | Sp(O) | Sn(O) |
|---|---|---|---|---|---|---|---|
| C | 89 | 0.76 | 0.3 | 0.58 | 0.34 | 0.79 | 0.91 |
| F | 98 | 0.77 | 0.45 | 0.83 | 0.39 | 0.75 | 0.96 |
| P | 96 | 0.89 | 0.63 | 0.79 | 0.62 | 0.9 | 0.96 |
| CFP | 100 | 0.75 | 0.52 | 0.40 | 0.97 | 0.99 | 0.71 |
GO sub-ontology: C = cellular component, F = molecular function, P = biological process. % var = percentage of predicted variations. Predicted classes: C = Cancer; O = Other genetic diseases. For a definition of classes and scoring indexes see section: Measuring the performance.
Most discriminative GO annotations
| GO-term | Description |
|---|---|
| GO:0032301 | MutSalpha complex |
| GO:0032300 | Mismatch repair complex |
| GO:0032302 | MutSbeta complex |
| GO:0005773 | Vacuole |
| GO:0005764 | Lysosome |
| GO:0000323 | Lytic vacuole |
| GO:0030877 | Beta-catenin destruction complex |
| GO:0016328 | Lateral plasma membrane |
| GO:0034747 | Axin-APC-beta-catenin-GSK3B complex |
| GO:0030983 | Mismatched DNA binding |
| GO:0032137 | Guanine/thymine mispair binding |
| GO:0032134 | Mispaired DNA binding |
| GO:0030291 | Protein serine/threonine kinase inhibitor activity |
| GO:0016538 | Cyclin-dependent protein kinase regulator activity |
| GO:0004861 | Cyclin-dependent protein kinase inhibitor activity |
| GO:0019887 | Protein kinase regulator activity |
| GO:0019207 | Kinase regulator activity |
| GO:0005099 | Ras GTPase activator activity |
| GO:0006298 | Mismatch repair |
| GO:0044271 | Cellular nitrogen compound biosynthetic process |
| GO:0006301 | Postreplication repair |
| GO:0046395 | Carboxylic acid catabolic process |
| GO:0016054 | Organic acid catabolic process |
| GO:0009310 | Amine catabolic process |
| GO:0070507 | Regulation of microtubule cytoskeleton organization |
| GO:0032886 | Regulation of microtubule-based process |
| GO:0009063 | Cellular amino acid catabolic process |
Prediction of the disease type with an SVM-based method
| Encoding | Q | MCC | Sp(C) | Sn(C) | Sp(O) | Sn(O) |
|---|---|---|---|---|---|---|
| mut_E_W1 | 0.59 | 0.12 | 0.21 | 0.55 | 0.88 | 0.60 |
| mut_E_W5 | 0.64 | 0.17 | 0.24 | 0.55 | 0.88 | 0.66 |
| OnlyGO | 0.89 | 0.60 | 0.7 | 0.63 | 0.93 | 0.95 |
| mut_GO_E | 0.89 | 0.60 | 0.68 | 0.65 | 0.93 | 0.94 |
| mut_GO_E_W1 | 0.89 | 0.58 | 0.68 | 0.64 | 0.93 | 0.94 |
| mut_GO_E_W5 | 0.89 | 0.58 | 0.68 | 0.64 | 0.93 | 0.94 |
| mut_GO | 0.90 | 0.61 | 0.69 | 0.66 | 0.93 | 0.94 |
mut = a 20-element vector that encodes the variation type; Wx = an input sequence window of size x centered on the variation; E = the evolutionary information on the variation, obtained by extracting the 4 columns that represent the wild-type and the mutant residues from the PSI-BLAST PSSM/PROFILE output (-Q option); GO = the 3 GO scores. For a definition of classes and scoring indexes see section: Measuring the performance.
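As an illustration of the `mut` and `mut_GO` encodings, the following minimal sketch builds a 20-element variation vector and appends three GO scores to it. The source states only that `mut` has 20 elements and that GO contributes 3 scores; the residue ordering, the convention of -1 for the wild-type residue and +1 for the mutant residue, and the function names are assumptions made for this example.

```python
# Assumed residue order (alphabetical by one-letter code); the source
# does not specify the ordering used by the authors.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_mut(wild_type, mutant):
    """Encode a substitution as a 20-element vector.

    Assumed convention: -1 at the wild-type position, +1 at the mutant
    position, 0 elsewhere.
    """
    if wild_type == mutant:
        raise ValueError("wild-type and mutant residues must differ")
    vector = [0.0] * len(AMINO_ACIDS)
    # str.index raises ValueError for letters outside the 20 standard residues.
    vector[AMINO_ACIDS.index(wild_type)] = -1.0
    vector[AMINO_ACIDS.index(mutant)] = 1.0
    return vector


def encode_mut_go(wild_type, mutant, go_scores):
    """Concatenate the variation vector with the 3 GO scores (23 features)."""
    if len(go_scores) != 3:
        raise ValueError("expected exactly 3 GO scores")
    return encode_mut(wild_type, mutant) + [float(s) for s in go_scores]


if __name__ == "__main__":
    # Hypothetical variation and GO scores, used only to show the call.
    features = encode_mut_go("R", "W", [0.2, -0.1, 0.7])
    print(len(features), features)
```

The resulting feature list could then be passed to any SVM implementation; the window (Wx) and evolutionary (E) components are left out of this sketch because their exact layout is not given in this section.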
Cross-validation performance of an SVM-based predictor in cascade with SNPs&GO
| Method | Q | MCC | Sp(C) | Sn(C) | Sp(O) | Sn(O) |
|---|---|---|---|---|---|---|
| mut_GO+(SNP&GO) | 0.92 | 0.67 | 0.79 | 0.65 | 0.94 | 0.97 |
| mut_GO | 0.90 | 0.61 | 0.69 | 0.66 | 0.93 | 0.94 |
For a definition of classes and scoring indexes see section: Measuring the performance.