Mohammad Neamul Kabir1, Limsoon Wong2.
Abstract
BACKGROUND: Current protein family modeling methods like profile Hidden Markov Model (pHMM), k-mer based methods, and deep learning-based methods do not provide very accurate protein function prediction for proteins in the twilight zone, due to low sequence similarity to reference proteins with known functions.
Keywords: Ensemble classifier; Protein function prediction; Sequence homology; Support vector machine; Twilight zone sequence
Year: 2022 PMID: 35287576 PMCID: PMC8919565 DOI: 10.1186/s12859-022-04626-w
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Detailed description of the three subsets of the COG database based on threshold number of members in each family
| Name | Min no. of members | No. of families | No. of proteins |
|---|---|---|---|
| COG-500-1074 | 500 | 1074 | 1,129,428 |
| COG-250-1796 | 250 | 1796 | 1,389,595 |
| COG-100-2892 | 100 | 2892 | 1,565,976 |
Fig. 1 Homology between the training and test sets of the COG dataset. The bars indicate the fraction of test data having identity less than or equal to the value indicated on the x-axis. For each fold of the dataset, the homology is calculated for each test sequence against the training sequences and the seed sequences used to build the pHMM feature models. For each identity percentage, the three bars show the average over the 3 folds for each of the three subsets of the COG dataset
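The quantity plotted in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each test sequence has already been reduced to a single percent-identity value (for example, its best hit against the training and seed sequences), and the function name and toy numbers are hypothetical.

```python
def fraction_at_or_below(identities, cutoffs):
    """For each cutoff, return the fraction of test sequences whose
    percent identity is less than or equal to that cutoff."""
    n = len(identities)
    return {c: sum(1 for v in identities if v <= c) / n for c in cutoffs}


# Hypothetical percent identities for four test sequences.
toy_identities = [22.0, 35.5, 48.0, 71.0]
fractions = fraction_at_or_below(toy_identities, cutoffs=(30, 40, 50))
print(fractions)
```

Averaging these fractions over the folds of one subset would give one bar per identity cutoff, under the same assumptions.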
Performance comparison of different methods on the twilight zone sequences, i.e. sequences having less than identity is shown in this table
| Dataset | Method | predCount = 1 | predCount = 2 | predCount = 3 | predCount = 4 | predCount = 5 | predCount |
|---|---|---|---|---|---|---|---|
| COG-500-1074 | EnsembleFam | | | | | | |
| | pHMM | 69.54 | 73.75 | 55.51 | 70.62 | 70.85 | 73.55 |
| | DeepFam | 57.14 | 54.52 | 49.90 | 46.92 | 43.64 | 35.94 |
| COG-250-1796 | EnsembleFam | 72.84 | | | | | |
| | pHMM | 73.82 | 73.84 | 71.02 | 67.44 | 72.43 | |
| | DeepFam | 32.44 | 32.54 | 30.24 | 29.53 | 30.02 | 28.68 |
| COG-100-2892 | EnsembleFam | | | | | | |
| | pHMM | 63.44 | 59.69 | 53.45 | 48.16 | 47.42 | 57.57 |
| | DeepFam | 27.30 | 26.13 | 25.54 | 27.62 | 24.83 | 25.36 |
| COG-500-1074 | EnsembleFam | | | | | | |
| | pHMM | 62.22 | 61.20 | 88.95 | 87.38 | 85.19 | 85.85 |
| | DeepFam | 58.45 | 58.32 | 59.39 | 58.41 | 58.37 | 54.81 |
| COG-250-1796 | EnsembleFam | | | | | | |
| | pHMM | 63.05 | 89.41 | 89.05 | 87.74 | 84.82 | 83.69 |
| | DeepFam | 47.09 | 48.38 | 50.12 | 51.09 | 50.73 | 48.78 |
| COG-100-2892 | EnsembleFam | | | | | | |
| | pHMM | 87.07 | 87.78 | 86.08 | 84.04 | 80.16 | 81.69 |
| | DeepFam | 38.73 | 42.62 | 46.07 | 48.33 | 49.30 | 45.32 |
The best results are highlighted in bold font. The dataset is divided into six subgroups based on the number of predictions made by EnsembleFam. Using the column “predCount = 5” as an example, the accuracy in this table is computed as follows. For a protein, if EnsembleFam makes 5 function predictions for it and one of these is correct, the protein is counted as correct in the column “predCount = 5”; if all 5 function predictions are incorrect, the protein is counted as a wrong prediction. For the same protein, regardless of how many function predictions pHMM makes, as long as one of them is correct, the protein is counted as correct in the column “predCount = 5”; otherwise, it is counted as incorrect in that column. As for DeepFam, which makes exactly one prediction for each protein, the same protein is counted as correct in the column “predCount = 5” if and only if the sole DeepFam prediction for it is correct. All accuracy values shown in the table are averages over 3-fold cross-validation.
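The counting rule described in this note can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the record layout, the function name `accuracy_by_pred_count`, and the toy labels are hypothetical, and proteins are grouped by the exact number of EnsembleFam predictions without reproducing the table's six-way binning.

```python
from collections import defaultdict

METHODS = ("EnsembleFam", "pHMM", "DeepFam")


def accuracy_by_pred_count(proteins, methods=METHODS):
    """Group proteins by how many predictions EnsembleFam made, then score
    every method within each group: a protein counts as correct for a method
    if its true label is among that method's predictions."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: {m: 0 for m in methods})
    for protein in proteins:
        k = len(protein["preds"]["EnsembleFam"])  # group key (predCount)
        totals[k] += 1
        for m in methods:
            if protein["true"] in protein["preds"][m]:
                hits[k][m] += 1
    return {
        k: {m: 100.0 * hits[k][m] / totals[k] for m in methods}
        for k in totals
    }


# Hypothetical toy data: three proteins with made-up labels.
toy = [
    {"true": "A", "preds": {"EnsembleFam": {"A", "B"},
                            "pHMM": {"A", "C", "D"},
                            "DeepFam": {"B"}}},
    {"true": "E", "preds": {"EnsembleFam": {"F", "G"},
                            "pHMM": {"E"},
                            "DeepFam": {"E"}}},
    {"true": "H", "preds": {"EnsembleFam": {"H"},
                            "pHMM": set(),
                            "DeepFam": {"H"}}},
]
result = accuracy_by_pred_count(toy)
print(result)
```

Note that the grouping depends only on EnsembleFam's prediction count, so pHMM and DeepFam are scored on exactly the same proteins within each column.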
Fig. 2 ROC curves for a few COG families from the COG-500-1074 dataset. In each chart, EnsembleFam, DeepFam and pHMM are shown in different colors. EnsembleFam performs better than the other methods on these families
ROC AUC score comparison of four families from COG-500-1074 dataset shown in Fig. 2
| COG family | EnsembleFam | pHMM | DeepFam |
|---|---|---|---|
| COG 344 | 0.993374 | 0.997536 | |
| COG 508 | 0.983626 | 0.996149 | |
| COG 539 | 0.992682 | 0.994334 | |
| COG 796 | 0.994031 | 0.994851 |
The best results are highlighted in bold font
Fig. 3 ROC AUC score comparison between EnsembleFam, DeepFam and pHMM on the three COG datasets. The x-axis shows the AUC score and the y-axis shows the number of families in the respective dataset having AUC scores greater than or equal to the respective x value
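The curve described in this caption is a cumulative count, which the following Python sketch illustrates. It assumes one ROC AUC value per family has already been computed; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
def families_at_or_above(auc_scores, thresholds):
    """For each threshold x, count the families whose AUC is >= x."""
    return [sum(1 for s in auc_scores if s >= x) for x in thresholds]


# Hypothetical per-family AUC scores for one method.
toy_auc = [0.99, 0.95, 0.80, 0.60]
counts = families_at_or_above(toy_auc, thresholds=[0.5, 0.9, 0.99])
print(counts)
```

A method whose curve stays higher as the threshold approaches 1.0 has more families with near-perfect AUC.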
Fig. 4 Test results of EnsembleFam and pHMM on new (unknown) families not used in training. For this, we used 1818 different families from COG-100-2892 to test the models of COG-500-1074, and similarly 1096 different families for COG-250-1796. In the figure, the x-axis indicates the ROC AUC score and the y-axis indicates the number of families above that AUC score
Number of predictions made by pHMM and EnsembleFam for twilight zone proteins where
| Method | seq1 | seq2 | seq3 | seq4 | seq5 | seq6 | seq7 | seq8 | seq9 | seq10 | seq11 | seq12 | seq13 | seq14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pHMM | 57 | 54 | 58 | 17 | 13 | 7 | 7 | 56 | 50 | 50 | 38 | 1 | 1 | 54 |
| EnsembleFam | 1 | 0 | 0 | 2 | 0 | 0 | 3 | 1 | 1 | 1 | 1 | 2 | 2 | 1 |
For each test sequence, the maximum number of predictions is 86, i.e., the sequence is predicted to belong to all sub-subfamilies, and the minimum number of predictions is 0, i.e., the sequence is not predicted to belong to any sub-subfamily. DeepFam was not included in this comparison, as DeepFam always predicts a single label irrespective of the number of families
Prediction accuracy comparison of different methods on the twilight zone proteins
| Method | Sub-subfamily | Sub-family | Family |
|---|---|---|---|
| pHMM | 5.51 | 11.76 | 39.80 |
| DeepFam | 5.53 | 16.88 | 61.44 |
| EnsembleFam | |||
| pHMM | 14.74 | 21.72 | |
| DeepFam | 22.38 | 37.18 | 73.40 |
| EnsembleFam | 65.46 | ||
Best results are highlighted in bold font. For pHMM and EnsembleFam, we removed the predictions where the number of predictions is and considered them as wrong predictions. For others, where the number of predictions is and the true label is included among the predicted ones, we consider the prediction correct. For DeepFam, as it only predicts one label, we consider the prediction correct if the predicted label is the same as the true label. EnsembleFam outperforms the other two methods in almost all cases