Ehsaneddin Asgari, Mohammad R K Mofrad.
Abstract
We introduce a new representation and feature extraction method for biological sequences. We call the general representation bio-vectors (BioVec), with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences; it can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93% ± 0.06% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database and a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that accurate information about protein structure can be determined by providing only sequence data to this model. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest.
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data are available at the Life Language Processing website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.
Year: 2015 PMID: 26555596 PMCID: PMC4640716 DOI: 10.1371/journal.pone.0141287
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Protein sequence splitting.
In order to prepare the training data, each protein sequence will be represented as three sequences (1, 2, 3) of 3-grams.
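The splitting described in this caption can be sketched as follows. This is a minimal illustration, assuming the three sequences are obtained by shifting the reading frame by 0, 1, and 2 residues and then cutting non-overlapping 3-grams; the function name `split_into_3gram_sequences` and the example sequence are hypothetical and do not come from the paper.

```python
def split_into_3gram_sequences(sequence, n=3):
    """Return n lists of non-overlapping n-grams, one per reading-frame offset.

    Assumption: trailing residues that do not fill a complete n-gram are dropped.
    """
    frames = []
    for offset in range(n):
        frame = [
            sequence[i:i + n]
            for i in range(offset, len(sequence) - n + 1, n)
        ]
        frames.append(frame)
    return frames


# Hypothetical example sequence (not from the paper).
for k, frame in enumerate(split_into_3gram_sequences("MAFSAEDVLK"), start=1):
    print(k, frame)
```

Each of the three resulting lists can then be treated as a "sentence" of 3-gram "words" when preparing training data.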
Fig 2. Normalized distributions of biochemical and biophysical properties in protein-space.
In these plots, each point represents a 3-gram (a word of three residues), and the colors indicate the scale for each property. Data points are projected from a 100-dimensional space to a 2D space using t-SNE. As shown, words with similar properties are automatically clustered together, meaning that the properties are smoothly distributed in this space.
Using Lipschitz number to evaluate the continuity of ProtVec with respect to biophysical and biochemical properties.
| Property | Lipschitz number (protein-space) | Lipschitz number (scrambled space) | Ratio |
|---|---|---|---|
| Mass | 0.3137 | 0.6605 | 0.4750 |
| Volume | 0.3742 | 0.6699 | 0.5586 |
| Van der Waals volume | 0.3629 | 0.6431 | 0.5643 |
| Polarity | 0.4757 | 1.2551 | 0.3790 |
| Hydrophobicity | 0.608 | 1.448 | 0.4203 |
| Charge | 0.8733 | 1.3620 | 0.6412 |
| Average | 0.50 | 1.01 | 0.51 |
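
A smaller Lipschitz number means the property changes more smoothly across the embedding, and the Ratio column divides the protein-space value by the scrambled-space value. The sketch below estimates the number as the largest ratio of property difference to embedding distance over all pairs of points, which is the standard definition of a Lipschitz constant on a finite set. Whether the paper applies additional normalization, and how it constructs the scrambled space, is not stated in this excerpt; here the scrambled space is assumed to mean the same property values reassigned to shuffled points. The function name and all toy data are hypothetical.

```python
import math
from itertools import combinations


def lipschitz_number(vectors, values):
    """Largest |f(a) - f(b)| / ||a - b|| over all pairs of distinct points."""
    best = 0.0
    for (va, fa), (vb, fb) in combinations(zip(vectors, values), 2):
        dist = math.dist(va, vb)
        if dist > 0:  # skip coincident points to avoid division by zero
            best = max(best, abs(fa - fb) / dist)
    return best


# Toy 2-D embedding with a property that varies smoothly along x (hypothetical data).
vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
smooth = [0.0, 1.0, 2.0, 3.0]
scrambled = [0.0, 3.0, 1.0, 2.0]  # same values assigned to shuffled positions

k_smooth = lipschitz_number(vectors, smooth)
k_scrambled = lipschitz_number(vectors, scrambled)
print(k_smooth, k_scrambled, k_smooth / k_scrambled)
```

In this toy case the smooth assignment yields the smaller number, so the ratio falls below 1, which is the same qualitative pattern the table reports for every property.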
Performance of protein family classification using SVM and ProtVec over some of the most frequent families in Swiss-Prot.
Families are sorted with respect to their frequency in Swiss-Prot.
| Family name | Training instances: # of positive sequences | Training instances: # of negative sequences | Specificity | Sensitivity | Accuracy |
|---|---|---|---|---|---|
| 50S ribosome-binding GTPase | 3,084 | 3,084 | 0.95 | 0.93 | 0.94 |
| Helicase conserved C-terminal domain | 2,518 | 2,518 | 0.83 | 0.80 | 0.82 |
| ATP synthase alpha-beta family, nucleotide-binding domain | 2,387 | 2,387 | 0.98 | 0.97 | 0.97 |
| 7 transmembrane receptor (rhodopsin family) | 1,820 | 1,820 | 0.95 | 0.96 | 0.95 |
| Amino acid kinase family | 1,750 | 1,750 | 0.91 | 0.92 | 0.91 |
| ATPase family associated with various cellular activities (AAA) | 1,711 | 1,711 | 0.92 | 0.90 | 0.91 |
| tRNA synthetases class I (I, L, M and V) | 1,634 | 1,634 | 0.97 | 0.97 | 0.97 |
| tRNA synthetases class II (D, K and N) | 1,419 | 1,419 | 0.88 | 0.83 | 0.85 |
| Major Facilitator Superfamily | 1,303 | 1,303 | 0.95 | 0.97 | 0.96 |
| Hsp70 protein | 1,272 | 1,272 | 0.97 | 0.97 | 0.97 |
| NADH-Ubiquinone-plastoquinone (complex I), various chains | 1,251 | 1,251 | 0.97 | 0.97 | 0.97 |
| Histidine biosynthesis protein | 1,248 | 1,248 | 0.96 | 0.97 | 0.97 |
| TCP-1-cpn60 chaperonin family | 1,246 | 1,246 | 0.95 | 0.96 | 0.95 |
| EPSP synthase (3-phosphoshikimate 1-carboxyvinyltransferase) | 1,207 | 1,207 | 0.96 | 0.96 | 0.96 |
| Aldehyde dehydrogenase family | 1,200 | 1,200 | 0.93 | 0.94 | 0.94 |
| Shikimate–quinate 5-dehydrogenase | 1,128 | 1,128 | 0.87 | 0.89 | 0.88 |
| GHMP kinases N terminal domain | 1,120 | 1,120 | 0.88 | 0.92 | 0.90 |
| Ribosomal protein S2 | 1,083 | 1,083 | 0.95 | 0.96 | 0.95 |
| Ribosomal protein S4–S9 N-terminal domain | 1,072 | 1,072 | 0.95 | 0.97 | 0.96 |
| Ribosomal protein L16p-L10e | 1,053 | 1,053 | 0.95 | 0.96 | 0.96 |
| KOW motif | 1,047 | 1,047 | 0.93 | 0.95 | 0.94 |
| Uncharacterized protein family UPF0004 | 1,044 | 1,044 | 0.95 | 0.97 | 0.96 |
| Ribosomal protein S12-S23 | 1,016 | 1,016 | 0.94 | 0.98 | 0.96 |
| GHMP kinases C terminal | 1,011 | 1,011 | 0.88 | 0.92 | 0.90 |
| Ribosomal protein S14p-S29e | 997 | 997 | 0.93 | 0.98 | 0.95 |
| Ribosomal protein S11 | 980 | 980 | 0.96 | 0.98 | 0.97 |
| UvrB-uvrC motif | 968 | 968 | 0.94 | 0.96 | 0.95 |
| Ribosomal protein L33 | 958 | 958 | 0.96 | 0.98 | 0.97 |
| BRCA1 C Terminus (BRCT) domain | 956 | 956 | 0.94 | 0.95 | 0.95 |
| RF-1 domain | 950 | 950 | 0.93 | 0.97 | 0.95 |
| Ankyrin repeats (3 copies) | 944 | 944 | 0.89 | 0.88 | 0.88 |
| Ribosomal protein L20 | 932 | 932 | 0.96 | 0.99 | 0.97 |
| RNA polymerase beta subunit | 912 | 912 | 0.94 | 0.97 | 0.95 |
| Ribosomal protein S18 | 908 | 908 | 0.93 | 0.97 | 0.95 |
| ATP synthase B-B CF(0) | 900 | 900 | 0.92 | 0.94 | 0.93 |
| Peptidase family M20-M25-M40 | 889 | 889 | 0.92 | 0.93 | 0.93 |
| Ribosomal protein L18e-L15 | 887 | 887 | 0.93 | 0.96 | 0.95 |
| Glucose inhibited division protein A | 886 | 886 | 0.95 | 0.96 | 0.95 |
| NADH-ubiquinone-plastoquinone oxidoreductase chain 4L | 885 | 885 | 0.94 | 0.97 | 0.96 |
| lactate-malate dehydrogenase, NAD binding domain | 880 | 880 | 0.92 | 0.94 | 0.93 |
| HD domain | 879 | 879 | 0.93 | 0.93 | 0.93 |
| Ribosomal protein S10p-S20e | 873 | 873 | 0.95 | 0.97 | 0.96 |
| Pyridoxal-phosphate dependent enzyme | 870 | 870 | 0.91 | 0.91 | 0.91 |
| Ribosomal L18p-L5e family | 860 | 860 | 0.93 | 0.96 | 0.94 |
| Ribosomal protein L3 | 855 | 855 | 0.94 | 0.97 | 0.96 |
| tRNA synthetases class I (M) | 843 | 843 | 0.94 | 0.96 | 0.95 |
| UbiA prenyltransferase family | 841 | 841 | 0.94 | 0.95 | 0.95 |
| Ribosomal protein L4–L1 family | 841 | 841 | 0.94 | 0.95 | 0.95 |
| Ribosomal protein S16 | 840 | 840 | 0.93 | 0.97 | 0.95 |
| Ribosomal protein S13-S18 | 840 | 840 | 0.94 | 0.97 | 0.95 |
| MraW methylase family | 837 | 837 | 0.95 | 0.98 | 0.96 |
| Ribosomal L32p protein family | 825 | 825 | 0.94 | 0.97 | 0.95 |
| Elongation factor TS | 819 | 819 | 0.94 | 0.97 | 0.96 |
| Tetrahydrofolate dehydrogenase-cyclohydrolase, catalytic domain | 817 | 817 | 0.94 | 0.96 | 0.95 |
| ATP synthase delta (OSCP) subunit | 813 | 813 | 0.93 | 0.96 | 0.94 |
| tRNA synthetases class I (C) catalytic domain | 812 | 812 | 0.95 | 0.97 | 0.96 |
| SecA Wing and Scaffold domain | 805 | 805 | 0.95 | 0.97 | 0.96 |
| Ribonuclease HII | 795 | 795 | 0.93 | 0.94 | 0.93 |
| Ribosomal protein L31 | 795 | 795 | 0.97 | 0.99 | 0.98 |
| Ribosomal L27 protein | 794 | 794 | 0.98 | 0.99 | 0.99 |
| IPP transferase | 794 | 794 | 0.93 | 0.95 | 0.94 |
| GTP-binding protein LepA C-terminus | 793 | 793 | 0.96 | 0.98 | 0.97 |
| Ribosomal protein L17 | 791 | 791 | 0.92 | 0.96 | 0.94 |
| Ribosomal protein L23 | 790 | 790 | 0.91 | 0.96 | 0.94 |
| Ribosomal protein L10 | 781 | 781 | 0.90 | 0.92 | 0.91 |
| Ribosomal protein L19 | 780 | 780 | 0.94 | 0.97 | 0.95 |
| Ribosomal protein S20 | 774 | 774 | 0.95 | 0.97 | 0.96 |
| Ribosomal protein L35 | 769 | 769 | 0.93 | 0.97 | 0.95 |
| Phosphoglucomutase-phosphomannomutase, C-terminal domain | 768 | 768 | 0.92 | 0.96 | 0.94 |
| AMP-binding enzyme | 767 | 767 | 0.87 | 0.89 | 0.88 |
| Ribosomal prokaryotic L21 protein | 766 | 766 | 0.93 | 0.96 | 0.95 |
| tRNA methyl transferase | 759 | 759 | 0.94 | 0.96 | 0.95 |
| Ribosomal L29 protein | 757 | 757 | 0.95 | 0.97 | 0.96 |
| Glycosyl transferase family, a-b domain | 754 | 754 | 0.90 | 0.91 | 0.91 |
| Translation initiation factor IF-2, N-terminal region | 750 | 750 | 0.96 | 0.98 | 0.97 |
| Ribosomal L28 family | 749 | 749 | 0.93 | 0.98 | 0.95 |
| Glycosyl transferase family 4 | 739 | 739 | 0.96 | 0.98 | 0.97 |
| tRNA synthetases class I (R) | 736 | 736 | 0.93 | 0.96 | 0.95 |
| Bacterial trigger factor protein (TF) C-terminus | 733 | 733 | 0.95 | 0.96 | 0.95 |
| For the first 1,000 families | 261,149 | 261,149 | 0.92 | 0.95 | 0.94 |
| For the first 2,000 families | 293,957 | 293,957 | 0.90 | 0.96 | 0.93 |
| For the first 3,000 families | 308,292 | 308,292 | 0.89 | 0.96 | 0.92 |
| For the first 4,000 families | 316,135 | 316,135 | 0.87 | 0.96 | 0.91 |
| Weighted average for all 7,027 families | 324,018 | 324,018 | 0.91 | 0.95 | 0.93 |
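
The three metrics in this table follow from confusion-matrix counts. The sketch below uses the standard definitions; the counts are hypothetical and chosen only to illustrate that, with equal numbers of positive and negative sequences as in this table, accuracy equals the mean of sensitivity and specificity.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard sensitivity, specificity, and accuracy from confusion counts."""
    sensitivity = tp / (tp + fn)   # fraction of positives correctly identified
    specificity = tn / (tn + fp)   # fraction of negatives correctly identified
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for a balanced set of 1,000 positives and 1,000 negatives.
sens, spec, acc = classification_metrics(tp=930, fn=70, tn=950, fp=50)
print(sens, spec, acc)
```
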
Fig 3. Visualization of protein sequences using ProtVec can characterize FG-Nups versus DisProt disordered sequences and structured sequences.
Column (a) compares the 2D histogram of FG-Nup sequences (bottom) with the 2D histogram of FG-Nup disordered regions (top). Column (b) compares the 2D histograms of two random sets of structured sequences with the same average length as the FG-Nups. Column (c) compares the 2D histogram of DisProt sequences (bottom) with the 2D histogram of DisProt disordered regions (top).
The performance of FG-Nups disordered protein classification in a 10-fold cross-validation using SVM.
| Sensitivity | Specificity | Accuracy |
|---|---|---|
| 0.9987 | 0.9974 | 0.9981 |
Fig 4. Classification of FG-Nups versus PDB structured sequences.
In this figure, each point represents a protein projected into a 2D space.