Pooja Jain, Jonathan D Hirst.
Abstract
BACKGROUND: Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs.
Year: 2010 PMID: 20594334 PMCID: PMC2916923 DOI: 10.1186/1471-2105-11-364
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Training and test sets used in the four evaluation strategies
| Strategy | Training set | Test set | Training (Test) set size |
|---|---|---|---|
| 1 | DS1.69 | DS1.73 | 6929 (6606) |
| 2 | DS1.69 | DS1.73 | 4071 (4114) |
| 3 | DS1.69 | DS1.69 | 4071 (2858) |
| 4 | DS1.69 | DS1.73 | 4071 (4653) |
DS1.69: set of domain pairs from SCOP version 1.69. DS1.73: set of domain pairs exclusive to SCOP version 1.73. DS1.69 and DS1.73 are the respective DS1.69 and DS1.73 sets but without NA-pairs. DS1.69 and DS1.73 are sets of only NA-pairs from DS1.69 and DS1.73, respectively.
SCOP inter-version blind prediction using random forest
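| Shared SCOP Level | Strategy 1 Pre | Strategy 1 Rec | Strategy 1 MCC | Strategy 2 Pre | Strategy 2 Rec | Strategy 2 MCC |
|---|---|---|---|---|---|---|
| Class | 0.89 | 0.84 | 0.73 | 0.94 | 0.99 | 0.81 |
| Fold | 0.86 | 0.45 | 0.61 | 0.93 | 0.41 | 0.61 |
| Super-family | 0.80 | 0.55 | 0.69 | 0.75 | 0.55 | 0.63 |
| Family | 0.82 | 0.87 | 0.83 | 0.84 | 0.85 | 0.83 |
| None | 0.81 | 0.91 | 0.76 | n/a | n/a | n/a |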
The prediction accuracies and the class-wise performance of the random forest in strategies 1 and 2. Percent accuracy is the percentage of correctly classified domain pairs across all of the shared SCOP levels. Pre = Precision, Rec = Recall and MCC = Matthews correlation coefficient.
Confusion matrices for strategies 1 and 2
| Strategy 1: actual (rows) vs predicted (columns) | CL | FO | SF | FA | NA |
|---|---|---|---|---|---|
| CL | 2813 | 10 | 0 | 12 | 511 |
| FO | 92 | 125 | 1 | 30 | 30 |
| SF | 21 | 2 | 56 | 23 | 0 |
| FA | 30 | 5 | 13 | 336 | 4 |
| NA | 216 | 3 | 0 | 9 | 2264 |
| Strategy 2: actual (rows) vs predicted (columns) | CL | FO | SF | FA | NA |
| CL | 3327 | 4 | 1 | 14 | - |
| FO | 134 | 115 | 6 | 23 | - |
| SF | 21 | 1 | 56 | 24 | - |
| FA | 44 | 3 | 12 | 329 | - |
| - | - | - | - | - | - |
The confusion matrices used for classification of domain pairs according to their shared structural level following strategies 1 and 2. CL = shared Class, FO = shared Fold, SF = shared Super-family, FA = shared Family and NA = no shared structural level.
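The per-class precision, recall and MCC values can be recomputed from a confusion matrix by treating each shared level one-vs-rest. The following is a minimal Python sketch, assuming that rows hold the actual shared level and columns the predicted one, in the order CL, FO, SF, FA, NA. The names `per_class_metrics` and `accuracy` are illustrative and do not come from the original work.

```python
import math

LABELS = ["CL", "FO", "SF", "FA", "NA"]

# Strategy 1 confusion matrix (assumed layout: rows = actual, columns = predicted).
STRATEGY_1 = [
    [2813, 10, 0, 12, 511],
    [92, 125, 1, 30, 30],
    [21, 2, 56, 23, 0],
    [30, 5, 13, 336, 4],
    [216, 3, 0, 9, 2264],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, mcc)} from one-vs-rest counts."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # actual k, predicted other
        fp = sum(matrix[r][k] for r in range(n)) - tp   # predicted k, actual other
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        metrics[label] = (precision, recall, mcc)
    return metrics


def accuracy(matrix):
    """Fraction of pairs on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


for label, (pre, rec, mcc) in per_class_metrics(STRATEGY_1, LABELS).items():
    print(f"{label}: Pre={pre:.2f} Rec={rec:.2f} MCC={mcc:.2f}")
print(f"Accuracy: {accuracy(STRATEGY_1):.3f}")
```

Under these assumptions the sketch gives precision 0.89, recall 0.84 and MCC 0.73 for the Class level of strategy 1, in line with the reported values, and an overall accuracy of about 84.7% (5594 of 6606 pairs).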
Classification of NA-pairs in strategies 3 and 4 according to the probability estimates
| Predicted Shared Level | Total | p < 0.5 | p = 0.5 | 0.5 < p < 0.9 | p ≥ 0.9 |
|---|---|---|---|---|---|
| Strategy 3 | |||||
| Class | 2806 | 14 | 130 | 1235 | 1427 |
| Fold | 29 | 0 | 6 | 9 | 14 |
| Super-family | 2 | 2 | 0 | 0 | 0 |
| Family | 21 | 0 | 8 | 12 | 1 |
| Strategy 4 | |||||
| Class | 4558 | 37 | 162 | 1917 | 2442 |
| Fold | 55 | 0 | 5 | 37 | 13 |
| Super-family | 3 | 3 | 0 | 0 | 0 |
| Family | 37 | 0 | 9 | 20 | 8 |
Following strategies 3 and 4, NA-pairs can be classified as sharing one of the top four SCOP structural levels. In both strategies, a large fraction of NA-pairs is classified as sharing the same Class. The distribution of these pairs across probability ranges reflects the classification confidence: a higher probability indicates more confidence in the predicted classification.
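The grouping of predictions by probability estimate can be illustrated with a short Python sketch. It assumes that the four probability columns of the table correspond to p < 0.5, p = 0.5, 0.5 < p < 0.9 and p ≥ 0.9, and that the second block of rows belongs to strategy 4. The helper names `confidence_bin`, `bin_counts` and `high_confidence_fraction` are hypothetical.

```python
from collections import Counter


def confidence_bin(p):
    """Map a probability estimate to one of the four assumed table bins."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p < 0.5:
        return "p < 0.5"
    if p == 0.5:
        return "p = 0.5"
    if p < 0.9:
        return "0.5 < p < 0.9"
    return "p >= 0.9"


def bin_counts(probabilities):
    """Count how many probability estimates fall into each bin."""
    return Counter(confidence_bin(p) for p in probabilities)


# Counts for NA-pairs predicted to share the Class level, taken from the table
# (assumed bin order: p < 0.5, p = 0.5, 0.5 < p < 0.9, p >= 0.9).
CLASS_ROW = {
    "Strategy 3": [14, 130, 1235, 1427],
    "Strategy 4": [37, 162, 1917, 2442],
}


def high_confidence_fraction(strategy):
    """Fraction of Class-level predictions in the highest probability bin."""
    counts = CLASS_ROW[strategy]
    return counts[-1] / sum(counts)


for name in CLASS_ROW:
    print(f"{name}: {high_confidence_fraction(name):.1%} of Class predictions in the top bin")
```

With the table counts, roughly half of the NA-pairs predicted to share the same Class fall into the highest probability bin in both strategies (about 50.9% and 53.6%).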
Classification performance of the random forest on domains consisting of four, five and six SSEs in ten-fold cross-validation.
| Shared SCOP Level | 4 SSEs | | | 5 SSEs | | | 6 SSEs | | |
|---|---|---|---|---|---|---|---|---|---|
| | Accuracy = 98% | | | Accuracy = 98% | | | Accuracy = 97% | | |
| | Pre | Rec | MCC | Pre | Rec | MCC | Pre | Rec | MCC |
| Class | 0.99 | 0.99 | 0.92 | 0.98 | 1.00 | 0.89 | 0.97 | 1.00 | 0.85 |
| Fold | 0.96 | 0.83 | 0.89 | 1.00 | 0.69 | 0.82 | 0.95 | 0.51 | 0.70 |
| Super-family | 0.88 | 0.69 | 0.78 | 0.98 | 0.65 | 0.79 | 0.95 | 0.57 | 0.74 |
| Family | 0.98 | 0.92 | 0.95 | 0.98 | 0.92 | 0.94 | 0.98 | 0.84 | 0.90 |
Pre = Precision, Rec = Recall and MCC = Matthews correlation coefficient.
Classification for the selected unclassified target domains
| Target PDB Id | PDB Title | Predicted Classification (sunid) | Template |
|---|---|---|---|
| 2JZ6 | 50S ribosomal protein L28 | scop_cf DNA/RNA-binding 3-helical bundle (46688) | 1SAN, 1HOM |
| | | scop_cf Spectrin repeat-like (46965) | 1CUN, 1U4Q |
| 2K2D | C-terminal domain of human pirh2 | scop_cf Cupredoxin-like (49502) | 1V54, 1OCR, 2DYS, 1OCC |
| | | scop_cf Rubredoxin-like (57769) | 2EIM, 2DYS, 1OCZ |
| | | scop_cf Glucocorticoid receptor-like (DNA-binding domain) alpha+beta metal(zinc)-bound fold (57715) | 1B8T |
| 2K5J | Protein yiiF Uncharacterised protein | scop_cf DNA/RNA-binding 3-helical bundle (46688) | 1FJL, 1MBJ, 2DS5 |
| | | scop_sf "Winged helix" DNA binding domain (46785) | 2GZW |
| | | scop_cf Albumin binding domain like (46996) | 1GJS(T), 1J78, 1MA9 |
| 2RPJ | Fn 14 Cysteine Rich Domain (CRD) | scop_sf t-snare proteins (47661) | 1S94, 1EZ3, 1BR0 |
| | | scop_cf Spectrin repeat-like (46965) | 1E2A, 2E2A |
| 2ZM6 | 30S ribosomal subunit | Different scop_fa covering 22 lineages of various Ribosomal protein S(2-20) families* | 1HNW, 2UU9, 1IBL |
| 3BPJ | Human translation initiation factor 3 | scop_cf Long alpha hairpin fold (46556)** | 2OTJ, 1YHQ, 1VQK |
| | | scop_cf Tetracyclin repressor-like (48497) | 1ZK8 |
| 3H3M | Flagellar protein FliT | scop_fa Voltage-gated potassium channels (81323) | 2HVK, 1JVM, 1R3J, 1K4D |
| | | scop_cf Spectrin repeat-like (46965) | 1G73, 1FEW |
| | | scop_fa MIT domain (116847) | 1YXR |
| 3ERM | Conserved protein with unknown function | scop_fa Myb/SANT domain (46739) | 1IDY, 1MSE, 1MBJ |
| | | scop_cf alpha-alpha superhelix (48370) | 1HF8, 1HG5, 1HFA |
| | | scop_cf Spectrin repeat-like (46965) | 1E2A, 2E2A |
| 3GI7 | Secreted protein of unknown function | scop_cf DNA/RNA-binding 3-helical bundle (46688) | 1P7I, 2HDD, 1DU0, 1FTT |
Predicted classification for the selected unclassified target domains based on the classified domains (Template). scop_cf = SCOP fold, scop_sf = SCOP super-family and scop_fa = SCOP family. * Additional file 2 (2ZM6), ** Additional file 3 (3BPJ).
Figure 1. Structural overlap of domains predicted to share the same SCOP Family. Structural overlap of some of the selected target-template pairs may confirm the correctness of the predicted shared Family level. Rows one, two and three show the overlap of domains consisting of four, five and six SSEs, respectively. The PDB identifier for the target is on the left-hand side of the sub-figure caption and for the template domain(s) it is on the right-hand side.
Figure 2. Structural overlap of domains predicted to share the same SCOP Super-family. Structural similarity of some of the selected target-template pairs. The target domains on the left were predicted to share the same super-family as the respective template domains on the right-hand side.
Figure 3. Structural overlap of domains predicted to share the same SCOP Fold. Structural similarity of some of the selected target-template pairs. The target domains on the left were predicted to share the same fold as the respective template domains on the right-hand side.
Figure 4. Structural similarity. The domain d1lj2a (a) is predicted to share the same fold as d1hcia3 (b) (Spectrin repeat-like, sunid = 46965). The other domains d1fewa (c), d1g73a (d) and d1s35a (e) are also from the same fold, and the overlapping domains in Figure 5c (d1ibmt (cyan), d1pnxt (magenta) and d1pnst (yellow)) are also predicted to share the FO level with them.
Figure 5. Structural overlap. (a) The domain d1ibkr (magenta) is currently classified such that it does not share any structural level with domains d1j5er (cyan) and d1n32r (green). However, we predict them to be FA-pairs. (b) Similarly, domains d1ibmt (cyan), d1pnxt (magenta) and d1pnst (yellow) do not currently share any structural level with the domain d2uuct1 (green, a SCOP 1.73 only domain). However, following strategy 4, the pairs of d1ibmt, d1pnxt and d1pnst with d2uuct1 are predicted to be FA-pairs sharing the Ribosomal protein S20 (sunid = 46993) family. (c) The domain d1d5qa (yellow) is a member of the singleton family Mini-protein reproducing the core of the CD4 surface (sunid = 58922) under Designed proteins, which is not a true SCOP class. It and d2ptaa (magenta) are predicted to be a FA-pair belonging to the Short-chain scorpion toxins (sunid = 57116) family.
Figure 6. Structural similarity of the two domains d3h3ma (green) and d1yxr (cyan), predicted to share the same family, MIT domain (sunid = 116847).
Figure 7. Venn diagram showing the number of different SCOP levels represented by the datasets used. The overlap of Class (a), Fold (b), Super-family (c) or Family (d) levels represented by the datasets of domains consisting of three, four, five and six SSEs.
Datasets of domains composed of four, five and six SSEs
| Domain Pair Type | Four SSEs | Five SSEs | Six SSEs |
|---|---|---|---|
| CL-Pairs | 5154 | 7494 | 8845 |
| FO-Pairs | 221 | 232 | 135 |
| SF-Pairs | 66 | 120 | 154 |
| FA-Pairs | 290 | 462 | 456 |
Number of different pairs in the datasets of domains composed of four, five and six SSEs.