Pedro R Costa, Marcio L Acencio, Ney Lemke.
Abstract
BACKGROUND: The genome-wide identification of both morbid genes, i.e., those genes whose mutations cause hereditary human diseases, and druggable genes, i.e., genes coding for proteins whose modulation by small molecules elicits phenotypic effects, requires experimental approaches that are time-consuming and laborious. Thus, a computational approach which could accurately predict such genes on a genome-wide scale would be invaluable for accelerating the pace of discovery of causal relationships between genes and diseases as well as the determination of druggability of gene products.
Year: 2010 PMID: 21210975 PMCID: PMC3045802 DOI: 10.1186/1471-2164-11-S5-S9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Table 1. Classifier performance measures for prediction of morbid and druggable genes
| Performance measure | Normal, median [min,max] 1 | Shuffled, median [min,max] 1 | N | W | Critical value of W 2 |
|---|---|---|---|---|---|
| Prediction of morbid genes | | | | | |
| Precision | 0.658 [0.648,0.679] | 0.495 [0.473,0.522] | 10 | 0 | 8 * |
| Recall | 0.648 [0.632,0.657] | 0.502 [0.471,0.521] | 10 | 0 | 8 * |
| AUC | 0.716 [0.706,0.729] | 0.498 [0.462,0.526] | 10 | 0 | 8 * |
| Prediction of druggable genes | | | | | |
| Precision | 0.748 [0.72,0.763] | 0.5 [0.451,0.556] | 10 | 0 | 8 * |
| Recall | 0.782 [0.732,0.809] | 0.492 [0.447,0.564] | 10 | 0 | 8 * |
| AUC | 0.820 [0.801,0.835] | 0.500 [0.43,0.546] | 10 | 0 | 8 * |
1 Of 10 datasets
2 According to table of critical values for W in [6]
* Difference statistically significant
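The three trailing numbers in each row (for example 10, 0, 8) read naturally as a sample size, a W statistic and its critical value. The sketch below assumes, since the test is not named in this record, that W is the Wilcoxon signed-rank statistic taken as the smaller of the positive and negative rank sums. The function name and the AUC values are hypothetical and only illustrate the calculation.

```python
def signed_rank_w(normal, shuffled):
    """Return (N, W) for paired samples; W is the smaller signed-rank sum."""
    # Pairs with zero difference are dropped, so N can be smaller than the
    # number of datasets.
    diffs = sorted((a - b for a, b in zip(normal, shuffled) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        # Tied absolute differences share the mean of the ranks they span.
        j = i
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = mean_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return len(diffs), min(w_plus, w_minus)


# Hypothetical per-dataset AUCs (not taken from the paper).
normal_auc = [0.716, 0.706, 0.729, 0.712, 0.720, 0.718, 0.709, 0.725, 0.714, 0.717]
shuffled_auc = [0.498, 0.462, 0.526, 0.501, 0.489, 0.510, 0.475, 0.505, 0.493, 0.499]
n, w = signed_rank_w(normal_auc, shuffled_auc)
print(n, w)
```

When every normal value exceeds its shuffled counterpart, the negative rank sum is empty and W is 0, which is the pattern reported in the rows above. The resulting W would then be compared against the tabulated critical value for the same N.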
Figure 1. Frequency distribution of known morbid genes per intervals of morbidity scores. Bars show the frequency distribution of known morbid genes (in percent) per 0.2 intervals of normal morbidity scores. The blue-shaded area represents the frequency distribution of known morbid genes (in percent) per intervals of shuffled morbidity scores.
Figure 2. Frequency distribution of known druggable genes per intervals of druggability scores. Bars show the frequency distribution of known druggable genes (in percent) per 0.2 intervals of normal druggability scores. The blue-shaded area represents the frequency distribution of known druggable genes (in percent) per intervals of shuffled druggability scores.
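The binning described in the two captions can be sketched as follows, assuming scores lie in [0, 1] and that a score of exactly 1.0 falls into the last interval. The function name and the example scores are hypothetical.

```python
def percent_per_interval(scores, width=0.2):
    """Percentage of scores falling in each consecutive interval of `width`."""
    n_bins = round(1 / width)
    counts = [0] * n_bins
    for s in scores:
        # Clamp so that a score of exactly 1.0 lands in the last interval.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return [100 * c / len(scores) for c in counts]


# Hypothetical scores of known morbid genes (not taken from the paper).
example_scores = [0.05, 0.3, 0.45, 0.7, 0.85, 0.9, 0.95, 0.99]
print(percent_per_interval(example_scores))
```

Running the same function on scores from shuffled datasets would give the comparison distribution that the captions describe as the blue-shaded area.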
Table 2. Statistical comparison of performances of classifiers trained on normal and without-one-feature morbidity datasets
| Missing feature 1 | Median AUC [min,max] 2 | N | W | Critical value of W 3 |
|---|---|---|---|---|
| | 0.715 [0.705,0.726] | 10 | 26 | 8 |
| | 0.714 [0.707,0.727] | 10 | 26 | 8 |
| | 0.713 [0.707,0.729] | 10 | 25 | 8 |
| | 0.714 [0.703,0.726] | 9 | 18 | 6 |
| | 0.716 [0.705,0.729] | 10 | 26 | 10 |
| | 0.713 [0.701,0.724] | 10 | 13 | 8 |
| | 0.711 [0.704,0.727] | 10 | 24 | 8 |
| | 0.714 [0.707,0.727] | 10 | 25 | 8 |
| | 0.716 [0.708,0.731] | 10 | 25 | 8 |
| | 0.714 [0.707,0.727] | 9 | 21 | 6 |
| | 0.714 [0.707,0.728] | 9 | 21 | 6 |
| | 0.715 [0.706,0.727] | 10 | 25 | 8 |
| | 0.709 [0.701,0.719] | 10 | 7 | 8 * |
| | 0.715 [0.704,0.727] | 10 | 27 | 8 |
| Unknown | 0.713 [0.701,0.725] | 10 | 18 | 8 |
| Cytoplasm | 0.715 [0.706,0.728] | 10 | 26 | 8 |
| Endoplasmic reticulum | 0.716 [0.705,0.727] | 10 | 26 | 8 |
| Mitochondrion | 0.714 [0.706,0.728] | 10 | 24 | 8 |
| Nucleus | 0.715 [0.704,0.728] | 10 | 24 | 8 |
| Other localization | 0.714 [0.704,0.726] | 10 | 21 | 8 |
| Cellular component | 0.714 [0.705,0.727] | 9 | 21 | 6 |
| Extracellular space | 0.710 [0.7,0.723] | 10 | 14 | 8 |
| Golgi apparatus | 0.715 [0.706,0.728] | 10 | 26 | 8 |
| Median AUC [min,max] for normal datasets: 0.716 [0.706,0.729] | | | | |
1 See “Methods” and Additional file 1 for a description of features
2 Of 10 datasets
3 According to table of critical values for W in [6]
4 The number of tissues (out of 32) in which the gene is expressed at least 5 transcripts per million (tpm) according to Reverter et al. [33]
5 The average expression in tpm among all the tissues in which the gene is expressed according to Reverter et al. [33]
* Difference statistically significant
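Each AUC cell summarizes 10 datasets as a median with its minimum and maximum. A minimal sketch of how such a cell could be produced is shown below. The function name and the AUC values are hypothetical, and the three-decimal formatting is an assumption (the table itself drops trailing zeros in places).

```python
from statistics import median


def summarize(values):
    """Format values as 'median [min,max]' with three decimals."""
    return f"{median(values):.3f} [{min(values):.3f},{max(values):.3f}]"


# Hypothetical AUCs from 10 datasets (not taken from the paper).
aucs = [0.706, 0.710, 0.712, 0.714, 0.715, 0.717, 0.718, 0.720, 0.725, 0.729]
print(summarize(aucs))
```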
Table 3. Statistical comparison of performances of classifiers trained on normal and without-one-feature druggability datasets
| Missing feature 1 | Median AUC [min,max] 2 | N | W | Critical value of W 3 |
|---|---|---|---|---|
| | 0.819 [0.798,0.835] | 10 | 27 | 8 |
| | 0.817 [0.803,0.834] | 10 | 26 | 8 |
| | 0.817 [0.801,0.832] | 9 | 20 | 6 |
| | 0.818 [0.799,0.83] | 9 | 18 | 6 |
| | 0.818 [0.801,0.833] | 10 | 26 | 8 |
| | 0.821 [0.799,0.836] | 10 | 21 | 8 |
| | 0.819 [0.8,0.836] | 10 | 27 | 8 |
| | 0.814 [0.797,0.832] | 10 | 18 | 8 |
| | 0.821 [0.804,0.837] | 10 | 25 | 8 |
| | 0.819 [0.803,0.833] | 10 | 25 | 8 |
| | 0.82 [0.791,0.833] | 10 | 26 | 8 |
| | 0.818 [0.802,0.83] | 9 | 19 | 6 |
| | 0.806 [0.795,0.832] | 9 | 11 | 6 |
| | 0.814 [0.799,0.835] | 10 | 23 | 8 |
| Unknown | 0.816 [0.796,0.832] | 9 | 12 | 6 |
| Cytoplasm | 0.814 [0.794,0.834] | 10 | 20 | 8 |
| Endoplasmic reticulum | 0.820 [0.799,0.834] | 10 | 27 | 8 |
| Mitochondrion | 0.820 [0.796,0.831] | 9 | 22 | 6 |
| Nucleus | 0.816 [0.793,0.831] | 10 | 20 | 8 |
| Other localization | 0.821 [0.802,0.837] | 9 | 20 | 6 |
| Cellular component | 0.82 [0.801,0.835] | 10 | 25 | 8 |
| Extracellular space | 0.817 [0.8,0.837] | 10 | 26 | 8 |
| Golgi apparatus | 0.812 [0.8,0.834] | 10 | 24 | 8 |
| Plasma membrane | 0.781 [0.762,0.816] | 10 | 1 | 8 * |
| Median AUC [min,max] for normal datasets: 0.820 [0.801,0.835] | | | | |
1 See “Methods” and Additional file 1 for a description of features
2 Of 10 datasets
3 According to table of critical values for W in [6]
4 The number of tissues (out of 32) in which the gene is expressed at least 5 transcripts per million (tpm) according to Reverter et al. [33]
5 The average expression in tpm among all the tissues in which the gene is expressed according to Reverter et al. [33]
* Difference statistically significant
Table 4. List of the human genes in the INHGI with the 10 highest morbidity scores
| Gene | Normal morbidity score (median [min,max]) 1 | Shuffled morbidity score (median [min,max]) 1 | N | W | Critical value of W 2 | Morbidity evidence 3 |
|---|---|---|---|---|---|---|
| TFRC | 0.880 [0.576,0.939] | 0.568 [0.447,0.678] | 10 | 1 | 8 * | 5941956 |
| ITGA5 | 0.875 [0.635,0.916] | 0.491 [0.377,0.631] | 10 | 0 | 8 * | No evidence |
| LTF | 0.868 [0.803,0.913] | 0.509 [0.356,0.642] | 10 | 0 | 8 * | 19258923 |
| SFTPD | 0.866 [0.618,0.923] | 0.565 [0.458,0.682] | 10 | 2 | 8 * | 19590686 |
| THBS1 | 0.865 [0.831,0.918] | 0.511 [0.354,0.566] | 10 | 0 | 8 * | 18178577 |
| TIMP2 | 0.860 [0.603,0.92] | 0.574 [0.388,0.609] | 10 | 0 | 8 * | 19933216 |
| TGFB2 | 0.857 [0.565,0.918] | 0.526 [0.407,0.707] | 10 | 3 | 8 * | 19258923 |
| CGA | 0.856 [0.62,0.916] | 0.535 [0.283,0.656] | 10 | 0 | 8 * | 19730683 |
| SPP1 | 0.856 [0.577,0.887] | 0.564 [0.34,0.696] | 10 | 0 | 8 * | 15868370 |
| FLT1 | 0.854 [0.61,0.931] | 0.527 [0.424,0.715] | 10 | 3 | 8 * | 19741061 |
| NOL3 | 0.850 [0.647,0.875] | 0.576 [0.31,0.651] | 10 | 1 | 8 * | 19773279 |
1 Of 10 scores
2 According to table of critical values for W in [6]
3 PubMed IDs of the most recent article(s) clearly stating a gene-disease association
* Difference statistically significant
Table 5. List of the human genes in the INHGI with the 10 highest druggability scores
| Gene | Normal druggability score (median [min,max]) 1 | Shuffled druggability score (median [min,max]) 1 | N | W | Critical value of W 2 | Druggability evidence 3 |
|---|---|---|---|---|---|---|
| HLA-F | 0.887 [0.803,0.915] | 0.530 [0.427,0.584] | 10 | 0 | 8 * | No evidence |
| PLAU 4 | 0.886 [0.808,0.907] | 0.561 [0.387,0.675] | 10 | 0 | 8 * | 19301652 |
| CD8A 4 | 0.885 [0.871,0.902] | 0.56 [0.37,0.664] | 10 | 0 | 8 * | No evidence |
| CD19 4 | 0.880 [0.751,0.907] | 0.562 [0.38,0.628] | 10 | 0 | 8 * | 19509168 |
| ITGAM 4 | 0.878 [0.614,0.887] | 0.534 [0.36,0.656] | 10 | 1 | 8 * | 11931348 |
| THBS1 5 | 0.875 [0.53,0.9] | 0.532 [0.293,0.592] | 10 | 0 | 8 * | 17878288 |
| ITGAX | 0.873 [0.784,0.897] | 0.539 [0.422,0.691] | 10 | 0 | 8 * | No evidence |
| CXCR5 | 0.871 [0.755,0.895] | 0.537 [0.49,0.59] | 10 | 0 | 8 * | 17652619 |
| EBI3 | 0.871 [0.801,0.888] | 0.529 [0.391,0.626] | 10 | 0 | 8 * | 19556516 |
| IL6 4 | 0.87 [0.766,0.893] | 0.591 [0.361,0.643] | 10 | 0 | 8 * | 17465721 |
| TIMP2 5 | 0.869 [0.645,0.916] | 0.584 [0.34,0.701] | 10 | 0 | 8 * | 10985804 |
1 Of 10 scores
2 According to table of critical values for W in [6]
3 PubMed IDs of the most recent articles clearly stating that such genes may be drug target candidates
4 Morbid genes according to Morbid Map [46]
5 Genes among those with 10 highest morbidity scores (Table 4)
* Difference statistically significant
Figure 3. Decision tree generated by training the J48 algorithm on the normal morbidity datasets. This decision tree was generated by training the J48 algorithm on the normal morbidity datasets (see “Methods”). The uppermost ellipse is the root node of the tree, which represents the most important condition for discriminating morbid genes from non-morbid genes. In this case, that condition is the number of transcription factors regulating the gene (regin). The remaining ellipses are internal nodes that represent additional conditions for considering a gene as morbid or non-morbid. In the left branch of the tree, these conditions are a central position in a metabolic pathway (inbetmet), the extracellular or plasma membrane localization of the respective encoded proteins, and the tendency of encoded proteins to form clusters with others (c). The rectangles depict genes that, under certain conditions (represented by the root node and internal nodes), are predominantly classified as morbid (True) or non-morbid (Unknown). In the round brackets inside the rectangles, the number before the slash indicates the total number of genes that are actually morbid or non-morbid, and the number after the slash indicates how many genes were incorrectly predicted.
Figure 4. Decision tree generated by training the J48 algorithm on the normal druggability datasets. This decision tree was generated by training the J48 algorithm on the normal druggability datasets (see “Methods”). The uppermost ellipse is the root node of the tree, which represents the most important condition for discriminating druggable genes from non-druggable genes. In this case, that condition is the plasma membrane localization of encoded proteins. The remaining ellipses are internal nodes that represent additional conditions for considering a gene as druggable or non-druggable. In the left branch of the tree, these conditions are a central position in a transcriptional regulatory circuitry (inbetreg) and being an enzyme (metin). The rectangles depict genes that, under certain conditions (represented by the root node and internal nodes), are predominantly classified as druggable (True) or non-druggable (Unknown). In the round brackets inside the rectangles, the number before the slash indicates the total number of genes that are actually druggable or non-druggable, and the number after the slash indicates how many genes were incorrectly predicted.
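The bracketed leaf annotations described in the decision tree captions can be turned into a per-leaf accuracy. The sketch below assumes leaf labels of the form `True (120.0/30.0)`, and treats a leaf written without a slash as having no misclassified genes. The label strings, counts and function name are hypothetical and are not read from the figures.

```python
import re

# Class label, then "(total)" or "(total/errors)" in round brackets.
LEAF = re.compile(r"^(\w+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)$")


def leaf_stats(label):
    """Return (predicted_class, total, errors, accuracy) for one leaf label."""
    match = LEAF.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized leaf label: {label!r}")
    predicted = match.group(1)
    total = float(match.group(2))
    # A leaf without a slash is assumed to have no incorrect predictions.
    errors = float(match.group(3) or 0)
    return predicted, total, errors, (total - errors) / total


# Hypothetical leaves (not the counts shown in the figures).
print(leaf_stats("True (120.0/30.0)"))
print(leaf_stats("Unknown (50.0)"))
```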