Hakime Öztürk, Elif Ozkirimli, Arzucan Özgür.
Abstract
Motivation: The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified Molecular Input Line Entry System (SMILES)-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in a protein clustering task using the TransClust and MCL algorithms. It is compared with two protein representation methods that utilize protein sequence, Basic Local Alignment Search Tool (BLAST) and ProtVec, and with two compound fingerprint-based protein representation methods.
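The ligand-based description above can be illustrated with a minimal sketch. It assumes that a ligand vector is the mean of the embeddings of its SMILES words and that a protein vector is the mean of its ligand vectors. The averaging scheme, the cosine similarity, the function names and the toy two-dimensional embeddings are all assumptions made for illustration, not details stated in this record.

```python
import math


def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def ligand_vector(words, embeddings, dim):
    """Average the embeddings of a ligand's SMILES words, skipping unknown words."""
    return mean_vector([embeddings[w] for w in words if w in embeddings], dim)


def protein_vector(ligands, embeddings, dim):
    """Average the vectors of all ligands known to bind the protein."""
    return mean_vector([ligand_vector(ws, embeddings, dim) for ws in ligands], dim)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


# Hypothetical embeddings for made-up SMILES words (not trained values).
toy_embeddings = {"CCO": [1.0, 0.0], "C=O": [0.0, 1.0], "CCN": [1.0, 1.0]}

# Each protein is given as a list of ligands; each ligand is a list of words.
protein_a = protein_vector([["CCO", "C=O"], ["CCN"]], toy_embeddings, dim=2)
protein_b = protein_vector([["CCN"]], toy_embeddings, dim=2)
protein_c = protein_vector([["CCO"]], toy_embeddings, dim=2)

print(cosine_similarity(protein_a, protein_b))  # close to 1: same direction
print(cosine_similarity(protein_a, protein_c))  # lower: ligand profiles differ
```

In this sketch a ligand with no known words contributes a zero vector to the protein average; a real pipeline might drop such ligands instead.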
Year: 2018 PMID: 29949957 PMCID: PMC6022674 DOI: 10.1093/bioinformatics/bty287
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Extraction of protein–ligand interactions. As an example protein, Cardiac Myosin Binding Protein C is provided as input with its corresponding SCOP ID: d1gxea_
Fig. 2. Representation of biological and chemical words
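The biological and chemical words in the second figure caption can be illustrated with a short sketch. One common way to obtain such words is to slide a fixed-width window over a protein sequence or a SMILES string. The sketch below assumes overlapping windows; the window widths and the helper name `overlapping_words` are illustrative choices, not values taken from the paper.

```python
def overlapping_words(text: str, k: int) -> list[str]:
    """Return every length-k substring of text, moving one character at a time.

    Returns an empty list when text is shorter than k.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [text[i:i + k] for i in range(len(text) - k + 1)]


# Biological words from a short, made-up amino-acid sequence.
print(overlapping_words("MKTAY", 3))    # ['MKT', 'KTA', 'TAY']

# Chemical words from the SMILES string of acetic acid.
print(overlapping_words("CC(=O)O", 4))  # ['CC(=', 'C(=O', '(=O)', '=O)O']
```

The same function serves both alphabets because it treats its input as a plain character string; it does not parse SMILES syntax, so multi-character atoms such as `Cl` can be split across words.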
Distribution of families and super-families in the A-50 dataset before and after filtering
| Dataset | Number of Sequences | Super-families | Families |
|---|---|---|---|
| Before filtering | 10 816 | 1080 | 2109 |
| After filtering | 1639 | 425 | 652 |
Distribution of the top-10 most frequent super-families and families with known ligand interactions
| Rank | Super-family | No. of prots. | Family | No. of prots. |
|---|---|---|---|---|
| 1 | Protein kinase-like (d.144.1) | 47 | Protein kinases, catalytic subunit (d.144.1.7) | 39 |
| 2 | P-loop containing nucleoside triphosphate hydrolases (c.37.1) | 43 | Fibronectin type III (b.1.2.1) | 28 |
| 3 | Immunoglobulin (b.1.1) | 41 | Eukaryotic proteases (b.47.1.2) | 25 |
| 4 | NAD(P)-binding Rossmann-fold domain (c.2.1) | 32 | EGF-type module (g.3.11.1) | 24 |
| 5 | Trypsin-like serine proteases (b.47.1) | 31 | Immunoglobulin I set (b.1.1.4) | 23 |
| 6 | Fibronectin type III (b.1.2) | 28 | SH2 domain (d.93.1.1) | 22 |
| 7 | EGF/Laminin (g.3.11) | 27 | Nuclear receptor ligand-binding domain (a.123.1.1) | 18 |
| 8 | SH2 domain (d.93.1) | 22 | Cyclin (a.74.1.1) | 15 |
| 9 | Cysteine proteinases (d.3.1) | 20 | Pleckstrin-homology domain (b.55.1.1) | 15 |
| 10 | Nuclear receptor ligand-binding domain (a.123.1) | 19 | Tyrosine-dependent oxidoreductases (c.2.1.2) | 15 |
Performance of the TransClust algorithm in super-family and family clustering for all protein similarity computation methods with Precision, Recall and F-measure values
| Method | Dataset | Super-family: No. clusters | Super-family: Precision | Super-family: Recall | Super-family: F-measure | Family: No. clusters | Family: Precision | Family: Recall | Family: F-measure |
|---|---|---|---|---|---|---|---|---|---|
| Protein sequence based | | | | | | | | | |
| BLAST ( | A-50 | 1596 | 0.997 | 0.261 | 0.350 | 1636 | 1.0 | 0.399 | 0.500 |
| BLAST (identity) | A-50 | 606 | 0.861 | 0.550 | 0.595 | 660 | 0.781 | 0.668 | 0.631 |
| Protein Word frequency | A-50 | 708 | 0.952 | 0.621 | | 688 | 0.844 | 0.777 | |
| ProtVec Avg (word) | A-50 | 655 | 0.927 | 0.620 | 0.681 | 704 | 0.845 | 0.757 | 0.739 |
| ProtVec Avg (char) | A-50 | 707 | 0.940 | 0.603 | 0.674 | 707 | 0.842 | 0.746 | 0.729 |
| ProtVec MinMax (word) | A-50 | 586 | 0.891 | 0.623 | 0.667 | 704 | 0.829 | 0.741 | 0.718 |
| Ligand based | | | | | | | | | |
| SMILES Word frequency | A-50 | 801 | 0.951 | 0.548 | 0.624 | 957 | 0.934 | 0.658 | 0.704 |
| SMILESVec (word, chembl) | A-50 | 621 | 0.921 | 0.621 | 0.677 | 730 | 0.855 | 0.744 | 0.735 |
| SMILESVec (word, pubchem) | A-50 | 573 | 0.888 | 0.627 | 0.668 | 692 | 0.839 | 0.751 | 0.730 |
| SMILESVec (word, combined) | A-50 | 617 | 0.923 | 0.627 | 0.675 | 764 | 0.873 | 0.732 | 0.735 |
| SMILESVec (char, chembl) | A-50 | 636 | 0.920 | 0.621 | 0.678 | 710 | 0.844 | 0.743 | 0.729 |
| SMILESVec (char, pubchem) | A-50 | 714 | 0.941 | 0.600 | 0.671 | 715 | 0.845 | 0.744 | 0.729 |
| SMILESVec (char, combined) | A-50 | 712 | 0.949 | 0.602 | 0.675 | 712 | 0.850 | 0.749 | |
| MACCS | A-50 | 589 | 0.909 | 0.629 | | 683 | 0.839 | 0.757 | 0.736 |
| ECFP6 | A-50 | 611 | 0.917 | 0.627 | | 725 | 0.860 | 0.746 | 0.733 |
Note: The best F-measure values for the protein sequence-based and ligand-based methods are shown in bold.
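This record does not state how the clustering scores in the table above are computed. The reported F-measures are not the harmonic mean of the reported precision and recall (for BLAST (identity) at the super-family level that would be about 0.671, not 0.595), which suggests the scores are aggregated per family or per cluster. The sketch below shows one common definition, a class-size-weighted best-match F-measure; it is an assumption for illustration and may differ from the definition used in the paper. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def best_match_f_measure(true_labels, cluster_ids):
    """Class-size-weighted average of each class's best F-measure over all clusters."""
    n = len(true_labels)
    if n == 0 or n != len(cluster_ids):
        raise ValueError("inputs must be non-empty and of equal length")
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # Number of items shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_ids))
    best = defaultdict(float)
    for (label, cluster), shared in overlap.items():
        precision = shared / cluster_sizes[cluster]
        recall = shared / class_sizes[label]
        f_score = 2 * precision * recall / (precision + recall)
        best[label] = max(best[label], f_score)
    return sum(class_sizes[label] * best[label] for label in class_sizes) / n


# Toy example: four proteins from two hypothetical families.
families = ["kinase", "kinase", "protease", "protease"]
print(best_match_f_measure(families, [0, 0, 1, 1]))  # perfect clustering
print(best_match_f_measure(families, [0, 0, 0, 0]))  # everything merged
print(best_match_f_measure(families, [0, 1, 2, 3]))  # everything split
```

Under this definition, merging all proteins into one cluster keeps recall high but lowers precision, while splitting every protein into its own cluster does the opposite; this is the same trade-off visible in the cluster counts of the table.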
Performance of the MCL algorithm in super-family and family clustering for all protein similarity computation methods with Precision, Recall and F-measure values
| Method | Dataset | Super-family: No. clusters | Super-family: Precision | Super-family: Recall | Super-family: F-measure | Family: No. clusters | Family: Precision | Family: Recall | Family: F-measure |
|---|---|---|---|---|---|---|---|---|---|
| Protein sequence based | | | | | | | | | |
| BLAST ( | A-50 | 728 | 0.792 | 0.271 | 0.290 | 728 | 0.687 | 0.406 | 0.379 |
| BLAST (identity) | A-50 | 783 | 0.882 | 0.496 | 0.540 | 783 | 0.803 | 0.622 | 0.592 |
| Protein Word frequency | A-50 | 411 | 0.769 | 0.625 | 0.590 | 411 | 0.643 | 0.767 | 0.606 |
| ProtVec Avg (word) | A-50 | 1001 | 0.964 | 0.514 | | 1001 | 0.909 | 0.639 | |
| ProtVec Avg (char) | A-50 | 1017 | 0.964 | 0.508 | 0.590 | 1017 | 0.910 | 0.633 | 0.662 |
| ProtVec MinMax (word) | A-50 | 1014 | 0.964 | 0.508 | 0.590 | 1014 | 0.909 | 0.634 | 0.662 |
| Ligand based | | | | | | | | | |
| SMILES Word frequency | A-50 | 312 | 0.630 | 0.550 | 0.470 | 312 | 0.497 | 0.686 | 0.475 |
| SMILESVec (word, chembl) | A-50 | 867 | 0.937 | 0.544 | | 867 | 0.870 | 0.672 | 0.667 |
| SMILESVec (word, pubchem) | A-50 | 857 | 0.931 | 0.544 | 0.604 | 857 | 0.861 | 0.673 | 0.664 |
| SMILESVec (word, combined) | A-50 | 894 | 0.940 | 0.540 | 0.607 | 894 | 0.877 | 0.666 | 0.668 |
| SMILESVec (char, chembl) | A-50 | 999 | 0.962 | 0.514 | 0.596 | 999 | 0.908 | 0.641 | 0.668 |
| SMILESVec (char, pubchem) | A-50 | 977 | 0.958 | 0.514 | 0.595 | 977 | 0.900 | 0.643 | 0.667 |
| SMILESVec (char, combined) | A-50 | 1006 | 0.963 | 0.514 | 0.595 | 1006 | 0.909 | 0.641 | |
| MACCS | A-50 | 874 | 0.936 | 0.540 | 0.606 | 874 | 0.866 | 0.668 | 0.667 |
| ECFP6 | A-50 | 618 | 0.863 | 0.582 | 0.599 | 618 | 0.762 | 0.710 | 0.631 |
Note: The best F-measure values for the protein sequence-based and ligand-based methods are shown in bold.
Pearson correlation between protein similarity methods
| Method A | Method B | Pearson correlation |
|---|---|---|
| BLAST ( | BLAST (identity) | −0.109 |
| BLAST ( | Protein word frequency | −0.250 |
| BLAST ( | ProtVec (avg) | −0.291 |
| BLAST ( | SMILESVec (word, chembl) | −0.335 |
| BLAST ( | SMILESVec (char, chembl) | −0.207 |
| BLAST ( | MACCS | −0.336 |
| SMILESVec (word, chembl) | MACCS | 0.895 |
| SMILESVec (char, pubchem) | MACCS | 0.590 |
| SMILESVec (word, chembl) | SMILESVec (char, pubchem) | 0.682 |
| SMILESVec (word, chembl) | ECFP6 | 0.933 |
| ECFP6 | MACCS | 0.898 |
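The correlations in the table above compare pairs of similarity methods. As a minimal sketch, the Pearson coefficient can be computed over the scores that two methods assign to the same protein pairs. That the paper correlates pairwise similarity scores in this way is an assumption; the helper names and the toy 3×3 similarity matrices below are hypothetical.

```python
import math


def pairwise_scores(similarity):
    """Flatten the upper triangle (i < j) of a square similarity matrix."""
    n = len(similarity)
    return [similarity[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


# Toy similarity matrices for three proteins under two hypothetical methods.
method_a = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.4], [0.2, 0.4, 1.0]]
method_b = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]]

print(pearson(pairwise_scores(method_a), pairwise_scores(method_b)))
```

Only the upper triangle is used so that each protein pair is counted once and the self-similarities on the diagonal do not inflate the correlation.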