Abstract
BACKGROUND: Over the past decades, the rapid growth of protein sequence data in public databases has driven the development of many machine learning methods that predict the physicochemical properties or functions of proteins from amino acid sequence features. However, prediction performance often suffers from a lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample problem in computer vision and natural language processing, whereas few pre-training techniques have been designed specifically for protein sequences.
Keywords: Pre-training; ProtPlat; Protein sequence classification; Web server
Year: 2022 PMID: 35148686 PMCID: PMC8832758 DOI: 10.1186/s12859-022-04604-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Numbers of protein families with different numbers of sequences in Pfam
| # protein sequences | # protein families |
|---|---|
| < 100 | 5474 |
| < 200 | 7433 |
| < 300 | 8775 |
| < 500 | 10,523 |
| ≥ 500 | 7249 |
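
The counts in this table can be read as a measure of how common small families are. The sketch below is only illustrative: it assumes the "<" rows are cumulative (each row includes the rows above it) and that the "< 500" and "≥ 500" rows together cover every family counted. The variable names are hypothetical.

```python
# Cumulative counts from the Pfam table: families with fewer than N sequences.
cumulative = {100: 5474, 200: 7433, 300: 8775, 500: 10523}
at_least_500 = 7249

# Assumption: "< 500" and ">= 500" together cover every family counted.
total_families = cumulative[500] + at_least_500

# Turn cumulative counts into disjoint bins: <100, 100-199, 200-299, 300-499.
bins = {}
previous = 0
for threshold, count in cumulative.items():
    bins[threshold] = count - previous
    previous = count

share_under_500 = cumulative[500] / total_families
print(f"{total_families} families, {share_under_500:.1%} with fewer than 500 sequences")
```

Under these assumptions, more than half of the families fall below 500 sequences, which is the small-sample setting the abstract describes.
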
Dataset of type III secreted effectors
| Dataset | T3SE | non-T3SE | Total |
|---|---|---|---|
| Train # | 241 | 284 | 525 |
| Test # | 46 | 92 | 138 |
Datasets of protein subcellular localization*
| Dataset | cy | mi | nu | sp | Total |
|---|---|---|---|---|---|
| Animals_train | 302 | 153 | 803 | 632 | 1890 |
| Animals_test | 137 | 35 | 363 | 172 | 707 |
| Fungi_train | 181 | 177 | 589 | 72 | 1019 |
| Fungi_test | 30 | 11 | 122 | 16 | 179 |
| Plants_train | 52 | 57 | 60 | 35 | 204 |
| Plants_test | 6 | 10 | 61 | 6 | 83 |
*cy denotes cytoplasm, mi denotes mitochondrion, nu denotes nucleus, and sp denotes secretory pathway
Datasets of signal peptides
| Dataset | Sec/SPI | Others | Total |
|---|---|---|---|
| Archaea_train | 10 | 45 | 55 |
| Archaea_test | 50 | 132 | 182 |
| Eukaryotes_train | 2404 | 7409 | 9813 |
| Eukaryotes_test | 210 | 7247 | 7457 |
| Gram-negative_train | 419 | 1126 | 1545 |
| Gram-negative_test | 90 | 693 | 783 |
| Gram-positive_train | 164 | 370 | 534 |
| Gram-positive_test | 25 | 364 | 389 |
Fig. 1 The data collection and filtering process of the Pfam database
Fig. 2 Model architecture of ProtPlat. The k-mer embeddings are fed into the neural network and learned by the hidden layers. The output label is produced by a hierarchical softmax function
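
The data flow in the Fig. 2 caption can be illustrated with a toy forward pass. This is a minimal sketch, not the ProtPlat implementation: the class `TinyKmerClassifier`, the averaging of k-mer embeddings, the small embedding size, and the random untrained weights are all assumptions made for illustration, and a flat softmax stands in for the hierarchical softmax named in the caption. Training is omitted.

```python
import math
import random


def kmers(seq, k=3):
    """Overlapping k-mers of a sequence (the window slides by one residue)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyKmerClassifier:
    """Hypothetical toy model: mean k-mer embedding -> linear layer -> softmax."""

    def __init__(self, n_labels, dim=8, k=3, seed=0):
        self.k, self.dim = k, dim
        self.rng = random.Random(seed)
        self.emb = {}  # k-mer -> embedding vector, created lazily
        # Output weights: one row of `dim` weights per label (untrained).
        self.out = [[self.rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(n_labels)]

    def _embed(self, kmer):
        if kmer not in self.emb:
            self.emb[kmer] = [self.rng.uniform(-0.1, 0.1)
                              for _ in range(self.dim)]
        return self.emb[kmer]

    def predict_proba(self, seq):
        vecs = [self._embed(km) for km in kmers(seq, self.k)]
        # Average the k-mer embeddings into one fixed-length sequence vector.
        hidden = [sum(col) / len(vecs) for col in zip(*vecs)]
        scores = [sum(w * h for w, h in zip(row, hidden)) for row in self.out]
        # Flat softmax here; the caption describes a hierarchical softmax.
        return softmax(scores)


model = TinyKmerClassifier(n_labels=2)
probs = model.predict_proba("MKTAYIAKQR")
```

The call returns one probability per label. Because the weights are random, the values carry no biological meaning; the sketch only shows how a variable-length sequence becomes a fixed-length vector before classification.
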
Fig. 3 The web server interface of ProtPlat
Hyperparameter settings for pre-training in ProtPlat
| Hyperparameter | Value |
|---|---|
| k | 3 |
| Epoch | 70 |
| Learning rate | 0.15 |
| Dim. of embeddings | 100 |
| Dim. of hidden layer | 100 |
Fig. 4 F1 score comparison between models with and without pre-training
Performance comparison for the prediction of type III secreted effectors
| Model | ACC | F1 score |
|---|---|---|
| ProtPlat | ||
| WEDeepT3 | 0.812 | 0.705 |
| BPBAac | 0.609 | 0.339 |
| EffectiveT3 | 0.696 | 0.512 |
| T3_MM | 0.718 | 0.581 |
| DeepT3 | 0.594 | 0.486 |
| Bastion3 | 0.739 | 0.673 |
| BEAN 2.0 | 0.761 | 0.692 |
Performance comparison for protein subcellular location prediction
| Model | Animals ACC | Animals F1 | Fungi ACC | Fungi F1 | Plants ACC | Plants F1 |
|---|---|---|---|---|---|---|
| Euk-mPLoc | 0.61 | 0.54 | 0.6 | 0.56 | 0.46 | 0.37 |
| WoLF PSORT | 0.7 | 0.67 | 0.5 | 0.51 | 0.57 | 0.46 |
| LOCTree | 0.62 | 0.58 | 0.47 | 0.43 | 0.7 | 0.58 |
| BaCeILo | 0.64 | 0.66 | 0.57 | 0.6 | 0.69 | 0.56 |
| MultiLoc2-HighRes | 0.68 | 0.71 | 0.53 | 0.58 | 0.62 | 0.54 |
| MultiLoc2-LowRes | 0.73 | 0.6 | 0.61 | 0.64 | ||
| YLoc + | 0.58 | 0.67 | 0.48 | 0.51 | 0.58 | 0.49 |
| YLoc-HighRes | 0.74 | 0.69 | 0.56 | 0.51 | 0.58 | 0.54 |
| YLoc-LowRes | 0.75 | 0.56 | 0.61 | 0.71 | 0.58 | |
| ProtPlat | 0.66 | 0.66 | 0.72 | |||
Performance comparison of signal peptide prediction
| Model | Archaea Pre | Archaea Rec | Archaea F1 | Eukaryotes Pre | Eukaryotes Rec | Eukaryotes F1 | Gram-negative Pre | Gram-negative Rec | Gram-negative F1 | Gram-positive Pre | Gram-positive Rec | Gram-positive F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SignalP 5.0 | 0.771 | 0.660 | 0.711 | 0.729 | 0.733 | 0.737 | ||||||
| DeepSig | – | – | – | 0.604 | 0.624 | 0.614 | 0.131 | 0.600 | 0.215 | 0.073 | 0.760 | 0.133 |
| LipoP | 0.484 | 0.480 | 0.482 | 0.159 | 0.343 | 0.217 | 0.327 | 0.733 | 0.452 | 0.153 | 0.600 | 0.244 |
| Philius | 0.425 | 0.580 | 0.491 | 0.151 | 0.619 | 0.243 | 0.106 | 0.700 | 0.184 | 0.054 | 0.600 | 0.099 |
| Phobius | 0.395 | 0.540 | 0.456 | 0.226 | 0.667 | 0.338 | 0.098 | 0.644 | 0.170 | 0.054 | 0.600 | 0.099 |
| PolyPhobius | 0.395 | 0.560 | 0.463 | 0.176 | 0.681 | 0.280 | 0.097 | 0.644 | 0.169 | 0.060 | 0.680 | 0.110 |
| PrediSi | – | – | – | 0.273 | 0.652 | 0.385 | 0.144 | 0.722 | 0.240 | 0.062 | 0.640 | 0.113 |
| PRED-LIPO | 0.455 | 0.480 | 0.467 | 0.069 | 0.095 | 0.080 | 0.212 | 0.467 | 0.292 | 0.216 | 0.760 | 0.336 |
| PRED-SIGNAL | 0.489 | 0.607 | 0.066 | 0.224 | 0.102 | 0.076 | 0.444 | 0.130 | 0.060 | 0.680 | 0.110 | |
| PRED-TAT | 0.493 | 0.580 | 0.533 | 0.080 | 0.410 | 0.134 | 0.125 | 0.711 | 0.213 | 0.082 | 0.720 | 0.147 |
| Signal-3L 2.0 | – | – | – | 0.322 | 0.648 | 0.430 | 0.113 | 0.644 | 0.192 | 0.074 | 0.800 | 0.135 |
| Signal-CF | – | – | – | 0.105 | 0.652 | 0.181 | 0.102 | 0.689 | 0.178 | 0.059 | 0.720 | 0.109 |
| SOSUIsignal | – | – | – | 0.037 | 0.176 | 0.061 | 0.040 | 0.267 | 0.070 | 0.018 | 0.200 | 0.033 |
| SPEPlip | – | – | – | 0.366 | 0.710 | 0.483 | 0.276 | 0.611 | 0.380 | 0.187 | 0.680 | 0.293 |
| SPOCTOPUS | – | – | – | 0.120 | 0.390 | 0.184 | 0.067 | 0.467 | 0.117 | 0.056 | 0.640 | 0.103 |
| TOPCONS2 | 0.366 | 0.480 | 0.415 | 0.107 | 0.371 | 0.166 | 0.081 | 0.544 | 0.141 | 0.022 | 0.240 | 0.040 |
| ProtPlat | 0.627 | 0.636 | 0.698 | 0.728 | 0.550 | 0.668 | 0.603 | |||||
*Pre denotes precision and Rec denotes recall. The precision and recall of the baseline methods are extracted from SignalP 5.0 [15]
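
The F1 columns in the signal peptide table follow from the precision and recall columns as their harmonic mean. The short sketch below recomputes two fully populated rows as a consistency check; the function name `f1_score` is chosen here for illustration and is not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# DeepSig on the Eukaryotes test set: Pre 0.604, Rec 0.624 in the table.
deepsig_eukaryotes = round(f1_score(0.604, 0.624), 3)

# LipoP on the Archaea test set: Pre 0.484, Rec 0.480 in the table.
lipop_archaea = round(f1_score(0.484, 0.480), 3)
```

Both results match the tabulated F1 values (0.614 and 0.482). Because the harmonic mean is pulled toward the smaller of its two inputs, methods with high recall but very low precision still score a low F1.
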
Comparison of the F1 Scores between two segmentation methods
| Dataset | Non-overlapping segmentation | Overlapping segmentation |
|---|---|---|
| T3SE | 0.792 | |
| Animals | 0.623 | |
| Fungi | 0.688 | |
| Plants | 0.671 | |
| Archaea | 0.679 | |
| Eukaryotes | 0.680 | |
| Gram-negative | 0.713 | |
| Gram-positive | 0.558 |
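
The two segmentation methods compared in this table differ in how the k-mer window moves along the sequence. The sketch below shows one plausible reading, assuming the overlapping method slides the window by one residue and the non-overlapping method slides it by k residues from a fixed starting position. The function names, the `offset` parameter, and the example sequence are hypothetical.

```python
def overlapping_kmers(seq, k=3):
    """Slide a window of length k one residue at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_kmers(seq, k=3, offset=0):
    """Cut the sequence into consecutive blocks of length k.

    Residues left over at the end (fewer than k) are dropped.
    """
    return [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]


seq = "MKTAYIAKQR"  # made-up 10-residue example
overlapping = overlapping_kmers(seq)
non_overlapping = non_overlapping_kmers(seq)
```

For this 10-residue example, the overlapping method yields eight 3-mers and the non-overlapping method yields three, with the trailing residue discarded. Overlapping segmentation therefore produces roughly k times as many tokens per sequence.
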
Fig. 5 Comparison of F1 scores obtained by ProtPlat with different values of k
Accuracy of different pre-trained representations
| Dataset | Training No. | ProtPlat | SeqVec | ProtTrans |
|---|---|---|---|---|
| DeepLoc | 11,085 | 0.537 | 0.565 | |
| T3SE | 525 | 0.823 | 0.821 | |
| Animals | 1890 | 0.665 | 0.685 | |
| Fungi | 1019 | 0.706 | 0.727 | |
| Plants | 204 | 0.718 | 0.738 | |
| Archaea | 55 | 0.718 | 0.714 | |
| Eukaryotes | 9813 | 0.695 | 0.721 | |
| Gram-negative | 1545 | 0.755 | 0.772 | |
| Gram-positive | 534 | 0.607 | 0.614 |