Baeksoo Kim, Jihoon Jo, Jonghyun Han, Chungoo Park, Hyunju Lee.
Abstract
BACKGROUND: Computational approaches to identifying drug targets are expected to reduce the time and effort required for drug development. Advances in genomics and proteomics provide an opportunity to uncover the properties of druggable genomes. Although several studies have sought to distinguish drug targets from non-drug targets, they focus mainly on the sequences and functional roles of proteins. Many other properties of proteins have not been fully investigated.
Keywords: Bioinformatics; Drug target; Proteomics
Year: 2017 PMID: 28617227 PMCID: PMC5471946 DOI: 10.1186/s12859-017-1639-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Flowchart of study design
Number of proteins for each dataset
| | Set A | Set B | Set C | Set D |
|---|---|---|---|---|
| Number of | 1578 | 792 | 1578 | 792 |
| Number of | 17,575 | 8361 | 15,691 | 7949 |
Fig. 2 Analysis of widely used properties. Asterisks (*) indicate the p-value of the statistical test: one asterisk means p < 0.05, two asterisks mean p < 0.001, and three asterisks mean p < 0.0001. (a) Amino acid groups. (b) Primary enzyme class. (c) Subcellular location
Fig. 3 Analysis of gene ontology annotation. The line graph shows the number of genes belonging to the corresponding GO term, and the bar graph shows the negative base-2 logarithm (-log2) of the p-value calculated via DAVID. (a) Biological processes. (b) Cellular component. (c) Molecular function
Fig. 4 Analysis of PTMs. For phosphorylation, “S” indicates a phosphorylation site on serine, “T” on threonine, and “Y” on tyrosine. Asterisks (*) indicate the p-value of the statistical test: one asterisk means p < 0.05, two asterisks mean p < 0.001, and three asterisks mean p < 0.0001. (a) Average proportion. (b) Average proportion in solvent accessible protein
Fig. 5 Analysis of newly proposed properties. Asterisks (*) indicate the p-value of the statistical test: one asterisk means p < 0.05, two asterisks mean p < 0.001, and three asterisks mean p < 0.0001. (a) Essential proteins. (b) Expression levels. (c) Tissue specificity
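The asterisk convention stated in the captions of Figs. 2, 4, and 5 can be written as a small lookup. This is a minimal sketch of the thresholds only; the function name `significance_stars` is hypothetical and not from the source, and the statistical tests that produce the p-values are not shown.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the asterisk label described in the figure captions.

    Thresholds follow the captions: p < 0.05 -> "*", p < 0.001 -> "**",
    p < 0.0001 -> "***". Values of 0.05 or more get no asterisk.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Check the strictest threshold first so the most asterisks win.
    if p_value < 0.0001:
        return "***"
    if p_value < 0.001:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


if __name__ == "__main__":
    for p in (0.2, 0.03, 0.0005, 0.00005):
        print(p, repr(significance_stars(p)))
```

Note that the two-asterisk threshold in the captions is 0.001, not the 0.01 used in some other conventions, so a p-value of 0.005 receives a single asterisk here.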
Results of drug target protein prediction using machine learning methods
| SVM | Recall | Precision | F1 |
|---|---|---|---|
| Set A, | 0.7326 | 0.6594 | 0.6941 |
| Set A, | 0.7516 | 0.7422 | 0.7469 |
| Set A, | 0.7947 | 0.6681 | 0.7259 |
| Set A, | 0.8137 | 0.6982 | 0.7515 |
| Set B, | 0.7866 | 0.6416 | 0.7067 |
| Set B, | 0.7374 | 0.6496 | 0.6907 |
| Set B, | 0.7424 | 0.6585 | 0.6979 |
| Set B, | 0.8018 | 0.6580 | 0.7228 |
| Set C, | 0.7516 | 0.7808 | 0.7659 |
| Set C, | 0.7972 | 0.8003 | 0.7987 |
| Set C, | 0.8137 | 0.7965 | 0.8050 |
| Set C, | | | |
| Set D, | 0.7820 | 0.7367 | 0.7587 |
| Set D, | 0.8083 | 0.7588 | 0.7828 |
| Set D, | 0.8120 | 0.7500 | 0.7798 |
| Set D, | 0.8271 | 0.7710 | 0.7981 |
| RF | | | |
| Set A, | 0.7541 | 0.7682 | 0.7605 |
| Set A, | 0.6483 | 0.8130 | 0.7260 |
| Set A, | 0.7936 | 0.6763 | 0.7299 |
| Set A, | 0.8229 | 0.6986 | 0.7556 |
| Set B, | 0.7821 | 0.6547 | 0.7124 |
| Set B, | 0.7490 | 0.6493 | 0.6953 |
| Set B, | 0.7551 | 0.7805 | 0.7677 |
| Set B, | 0.8076 | 0.6767 | 0.7363 |
| Set C, | 0.7847 | 0.7358 | 0.7589 |
| Set C, | 0.8165 | 0.7960 | 0.8057 |
| Set C, | 0.8292 | 0.8118 | 0.8200 |
| Set C, | | | |
| Set D, | 0.7885 | 0.7409 | 0.7636 |
| Set D, | 0.8343 | 0.7564 | 0.7934 |
| Set D, | 0.8305 | 0.7550 | 0.7908 |
| Set D, | 0.8382 | 0.7818 | 0.8088 |
Feature sets W and N represent widely used and newly proposed properties, respectively. W′ and N′ represent statistically significant widely used and newly proposed properties, respectively.
Underlined bold numbers indicate the highest value for each evaluation metric.
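The F1 column in the prediction table is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it for three SVM rows copied from the table (the first two Set A rows and the first Set C row) as a consistency check. The helper name `f1_from` and the tolerance are assumptions for illustration, and the source does not state how the metrics were aggregated, so a small mismatch in some other row would not by itself indicate an error.

```python
def f1_from(recall: float, precision: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    total = precision + recall
    if total == 0:
        return 0.0  # convention: F1 is 0 when both metrics are 0
    return 2 * precision * recall / total


# (recall, precision, reported F1) copied from three SVM rows of the table.
svm_rows = [
    (0.7326, 0.6594, 0.6941),  # Set A, first row
    (0.7516, 0.7422, 0.7469),  # Set A, second row
    (0.7516, 0.7808, 0.7659),  # Set C, first row
]

if __name__ == "__main__":
    for recall, precision, reported in svm_rows:
        computed = f1_from(recall, precision)
        # Agreement within 1e-4 is consistent with four-decimal rounding.
        agrees = abs(computed - reported) < 1e-4
        print(f"R={recall} P={precision} F1={computed:.4f} "
              f"reported={reported} agrees={agrees}")
```

Because F1 is a harmonic mean, it always lies between precision and recall and sits closer to the smaller of the two, which is why a row with high recall but lower precision does not score a high F1.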