Zhen Yao, Juan Xiao, Anthony K H Tung, Wing Kin Sung.
Abstract
Finding the common substructures shared by two proteins is considered one of the central issues in computational biology because of its usefulness in understanding the structure-function relationship and its application in drug and vaccine design. In this paper, we propose a novel algorithm called FAMCS (Finding All Maximal Common Substructures) for the common substructure identification problem. Our method works initially at the protein secondary structural element (SSE) level and starts with the identification of all structurally similar SSE pairs. These SSE pairs are then merged into sets using a modified Apriori algorithm, which tests the similarity of various sets of SSE pairs incrementally until all the maximal sets of SSE pairs that are deemed to be similar are found. The maximal common substructures of the two proteins are then formed from these maximal sets. A refinement algorithm is also proposed to fine-tune the alignment from the SSE level to the residue level. Comparison of FAMCS with other methods on various proteins shows that FAMCS can address all four requirements and can lead to interesting biological discoveries.
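The level-wise merging described in the abstract can be illustrated with a minimal Python sketch. It is not the FAMCS implementation: the function name `maximal_similar_sets`, the predicate `is_similar`, and the toy one-to-one similarity test are all assumptions introduced here, and the sketch further assumes that similarity is anti-monotone (every subset of a similar set is itself similar), which is what makes Apriori-style pruning valid.

```python
from itertools import combinations

def maximal_similar_sets(sse_pairs, is_similar):
    """Apriori-style search for all maximal sets of SSE pairs accepted by
    `is_similar` (a hypothetical, anti-monotone similarity test)."""
    level = {frozenset([p]) for p in sse_pairs if is_similar(frozenset([p]))}
    maximal = []
    while level:
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) != len(a) + 1:
                continue  # join only sets that differ in exactly one element
            # Apriori pruning: every subset one element smaller must have survived.
            if all(union - {x} in level for x in union):
                candidates.add(union)
        next_level = {c for c in candidates if is_similar(c)}
        # A set is maximal if no similar set one level up contains it.
        covered = {c - {x} for c in next_level for x in c}
        maximal.extend(s for s in level if s not in covered)
        level = next_level
    return maximal

def one_to_one(pair_set):
    """Toy similarity test (assumption): no SSE of either protein is used twice."""
    firsts = [i for i, _ in pair_set]
    seconds = [j for _, j in pair_set]
    return len(set(firsts)) == len(firsts) and len(set(seconds)) == len(seconds)

result = maximal_similar_sets([(1, 1), (2, 2), (3, 3), (1, 3)], one_to_one)
```

In this toy run the pair (1, 3) conflicts with (1, 1) and (3, 3), so two maximal sets are returned. A real similarity test would compare geometric quantities such as the distances and angles between SSEs.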
Year: 2005 PMID: 16393147 PMCID: PMC5172543 DOI: 10.1016/s1672-0229(05)03015-9
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Fig. 1 Simplified 3D structures of proteins P and Q. An α-helix is represented by an ellipse, while a β-sheet is represented by a rectangle.
Fig. 2 The 3D structures of the backbone of the immunoglobulin Fab fragment (1MCP, chain L) and the murine T-cell antigen receptor (1TCR, chain B). They have two domains in common, the constant (C) and the variable (V) domain, but their relative positions are different.
Parameter Tuning for T, T, and T
| Parameters vs. protein pair | 1MCPl / 1TCRb | 1GGGa / 1WDNa | 1F4N / 256Ba | 1B0U / 1AM1 | 1MJP / 1ECR | 2CRO / 2WRPr | 1A1Ea / 2ABL | 3HSC / 2YHX | 1HYWa / 3UBPa | 1GMI / 1HYWa |
|---|---|---|---|---|---|---|---|---|---|---|
| 7, 30, 2 | | | | | | | | | | |
| 7, 30, 3 | | | | | | | | | | |
| 7, 30, 4 | | | | | | | | | | |
| 7, 45, 2 | | | | | | | | | | |
| 7, 45, 4 | | | | | | | | | | |
| 7, 60, 2 | | | | | | | | | | |
| 7, 60, 4 | | | | | | | | | | |
The one that produces the best result. (3, 7, 45) and (3, 7, 60) generated the optimal results.
Parameter Tuning for W and W
| Parameters vs. protein pair | 1MCPl / 1TCRb | 1GGGa / 1WDNa | 1F4N / 256Ba | 1B0U / 1AM1 | 1MJP / 1ECR | 2CRO / 2WRPr | 1A1Ea / 2ABL | 3HSC / 2YHX | 1HYWa / 3UBPa | 1GMI / 1HYWa |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.3, 0.7 | | | | | | | | | | |
| 0.5, 0.5 | | | | | | | | | | |
| 0.7, 0.3 | | | | | | | | | | |
The one that produces the best result.
Comparison of FAMCS with DALI, VAST, and Chew’s Work*
| Protein | Method | SSE alignment | RMSD | Residue No. |
|---|---|---|---|---|
| 1MCPl (all | FAMCS | (I) (V domain) | 1.3 | 71 |
| 1TCRb (all | (II) (C domain) 9:11, 10:13, | 2.6 | 86 | |
| DALI | 7.3 | 149 | ||
| VAST | 1.9 | 95 | ||
| 1GGGa ( | FAMCS | (I) (Middle) | 0.5 | 74 |
| 1WDNa ( | (II) (Head+tail) | 0.5 | 100 | |
| DALI | 1:1, | 4.2 | 174 | |
| VAST | 1:1, | 3.4 | 172 | |
| 1F4N (all | FAMCS | 1:3, | 9.1 | 90 |
| 256Ba (all | DALI | 1:1, | 14.4 | 91 |
| 1B0U (α/β) | FAMCS | 5:10, 11:12, 15:3, 19:8, 20:9 | 4.0 | 45 |
| 1AM1 (α/ | DALI | No similarity detected | N/A | N/A |
| 1MJP (all | FAMCS | 1:11, 5:14 (1 and 5 are | 3.2 | 31 |
| 1ECRa ( | DALI | No similarity detected | N/A | N/A |
| 2CRO (all | FAMCS | 0.8 | 24 | |
| 2WRPr (all | DALI | 1(tail):3(tail), | 4.7 | 38 |
| Chew’s | 3.9 | 24 | ||
| 1A1Ea ( | FAMCS | 0.82 | 70 | |
| 2ABL (all | DALI | 1.8 | 95 | |
| VAST | 1.06 | 88 | ||
| Chew’s | NS:NS, | 1.29 | 60 | |
| 3HSC ( | FAMCS | 2.7 | 73 | |
| 2YHX ( | DALI | 5.7 | 265 | |
| Chew’s | 16–18:4–6 | 3.9 | 28 | |
| 1HYWa ( | FAMCS | 1:2, 4:1 | 2.7 | 34 |
| 3UBPa ( | DALI | 1:1, 4:2, NS:5 | 3.1 | 39 |
| 1GMI (all | FAMCS | 5:4, 8:3 | 6.43 | 17 |
| 1HYWa ( | DALI | No similarity detected | N/A | N/A |
Only the SSE alignments are shown. An aligned segment is presented in the form “i:j” (the ith element of the first protein is aligned with the jth element of the second protein) or “i–j:k–l” (the ith to the jth elements of the first protein are aligned with the kth to the lth elements of the second protein). Common alignment patterns found by different methods are highlighted. “NS” represents a structure that is neither an α-helix nor a β-strand. “i+NS” means the ith SSE followed by a non-SSE part. “i(head)” or “i(tail)” means that only the very head or tail portion of the ith SSE participates in the alignment. For FAMCS, the SSE alignment in one row is one MCS. Only the top co-present MCSs from FAMCS are shown, except for 1MCPl/1TCRb and 1GGGa/1WDNa, where the top two are displayed. The results of DALI and VAST are from their web servers (VAST only provides alignments for structural neighbors). The results of Chew’s work are taken from its paper; hence, much of the data is unavailable. Wherever data is unavailable, “N/A” is entered in the table.
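To make the “i:j” and “i–j:k–l” notation concrete, the following Python sketch expands an alignment string into explicit index pairs. The function name `parse_alignment` is hypothetical, and the sketch makes two assumptions not stated in the table note: a range is expanded element by element in order, and annotated forms such as “NS”, “i+NS”, “i(head)”, or “i(tail)” are rejected rather than interpreted.

```python
import re

# One element or a range of elements; both the hyphen and the en dash are accepted.
_RANGE = re.compile(r"^(\d+)(?:[–-](\d+))?$")

def _expand(side):
    """Turn '4' into [4] and '16–18' into [16, 17, 18]."""
    match = _RANGE.match(side.strip())
    if match is None:
        raise ValueError(f"unsupported element: {side!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return list(range(start, end + 1))

def parse_alignment(text):
    """Expand a comma-separated SSE alignment into (first, second) index pairs."""
    pairs = []
    for segment in text.split(","):
        segment = segment.strip()
        if not segment:
            continue  # tolerate a trailing comma
        left, right = segment.split(":")
        firsts, seconds = _expand(left), _expand(right)
        if len(firsts) != len(seconds):
            raise ValueError(f"ranges differ in length: {segment!r}")
        pairs.extend(zip(firsts, seconds))
    return pairs

single = parse_alignment("5:10, 11:12, 15:3, 19:8, 20:9")
ranged = parse_alignment("16–18:4–6")
```

Here `ranged` pairs element 16 with 4, 17 with 5, and 18 with 6, following the element-wise reading assumed above.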
Result Sizes and Execution Time vs. Protein Sizes*
| Protein pair | Size (SSE) | Size (residue) | L2 size | MCS No. (all) | MCS No. (co-present) | Total time (s) | Step 1 time (s) | Step 2 time (s) |
|---|---|---|---|---|---|---|---|---|
| 1MCPl | 15 | 220 | 2,030 | 2,545 | 2 | 2,078 | 1 | 2,077 |
| 1TCRb | 19 | 247 | ||||||
| 1GGGa | 18 | 220 | 915 | 709 | 2 | 53 | 0 | 53 |
| 1WDNa | 18 | 223 | ||||||
| 1F4N | 4 | 60 | 8 | 8 | 1 | 0 | 0 | 0 |
| 256Ba | 4 | 106 | ||||||
| 1B0U | 22 | 258 | 558 | 485 | 3 | 13 | 0 | 13 |
| 1AM1 | 12 | 213 | ||||||
| 1MJP | 4(a)+4(b) | 208 | 32 | 37 | 2 | 0 | 0 | 0 |
| 1ECRa | 19 | 305 | ||||||
| 2CRO | 5 | 64 | 5 | 7 | 1 | 0 | 0 | 0 |
| 2WRPr | 6 | 104 | ||||||
| 1A1Ea | 5 | 104 | 56 | 29 | 1 | 0 | 0 | 0 |
| 2ABL | 10 | 163 | ||||||
| 3HSC | 26 | 382 | 1,072 | 1,021 | 5 | 104 | 1 | 103 |
| 2YHX | 22 | 457 | ||||||
| 1HYWa | 4 | 58 | 1 | 3 | 1 | 0 | 0 | 0 |
| 3UBPa | 5 | 100 | ||||||
| 1GMI | 10 | 136 | 3 | 3 | 1 | 0 | 0 | 0 |
| 1HYWa | 4 | 58 | ||||||
The notation L2 is taken from our algorithm (see Methods). The L2 size is essentially the total number of similar SSE pairs. Step 1 time is the time to find all similar SSE pairs; Step 2 time is the time to merge common substructures level by level to obtain all MCSs; total time includes Step 1 time, Step 2 time, and the time to select co-present MCSs.
Fig. 3 Each SSE is represented by a vector with length and direction obtained from the N- and C-terminal C atoms. Let A and B be two vectors corresponding to two SSEs, and let d be the closest distance between A and B. A’ and B’ are the projections of A and B onto the plane that is normal to d. The dihedral angle (Ω) is the angle between A’ and B’ measured in that plane.
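The geometry in Fig. 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: each SSE is treated as a straight segment of nonzero length between two 3D end points, d is taken as the vector joining the closest points of the two segments, and the sign of Ω (counter-clockwise when viewed against d) is a convention chosen here. The function names are hypothetical.

```python
import math

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def point_on(p, direction, s):
    return tuple(p[k] + s * direction[k] for k in range(3))

def clamp01(x):
    return max(0.0, min(1.0, x))

def closest_points(p1, q1, p2, q2):
    """Closest points between segments p1-q1 and p2-q2 (nonzero lengths assumed)."""
    d1, d2, r = sub(q1, p1), sub(q2, p2), sub(p1, p2)
    a, e = dot(d1, d1), dot(d2, d2)
    b, c, f = dot(d1, d2), dot(d1, r), dot(d2, r)
    denom = a * e - b * b                      # zero when the segments are parallel
    s = clamp01((b * f - c * e) / denom) if denom > 1e-12 else 0.0
    t = (b * s + f) / e
    if t < 0.0:                                # clamp t, then recompute s
        t, s = 0.0, clamp01(-c / a)
    elif t > 1.0:
        t, s = 1.0, clamp01((b - c) / a)
    return point_on(p1, d1, s), point_on(p2, d2, t)

def distance_and_dihedral(p1, q1, p2, q2):
    """Return (closest distance, signed dihedral angle in degrees) for two SSE vectors."""
    c1, c2 = closest_points(p1, q1, p2, q2)
    d = sub(c2, c1)
    dist = math.sqrt(dot(d, d))
    if dist < 1e-9:
        raise ValueError("segments touch; the projection plane is undefined")
    n = (d[0] / dist, d[1] / dist, d[2] / dist)

    def project(v):                            # drop the component along d
        k = dot(v, n)
        return (v[0] - k * n[0], v[1] - k * n[1], v[2] - k * n[2])

    a_proj, b_proj = project(sub(q1, p1)), project(sub(q2, p2))
    omega = math.atan2(dot(cross(a_proj, b_proj), n), dot(a_proj, b_proj))
    return dist, math.degrees(omega)

dist, omega = distance_and_dihedral((0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1))
```

For these two perpendicular unit segments separated by one unit along z, the closest distance is 1 and the angle is 90 degrees. Parallel segments give an angle of 0.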
Fig. 4 Algorithm to generate all MCSs from similar SSE pairs.
Fig. 5 Algorithm to select significant co-present MCSs.