Abstract
BACKGROUND: Structural similarities among proteins can provide valuable insight into their functional mechanisms and relationships. As the number of available three-dimensional (3D) protein structures increases, a greater variety of studies can be conducted with increasing efficiency, among which is the design of protein structural alphabets. Structural alphabets allow us to characterize local structures of proteins and describe the global folding structure of a protein using a one-dimensional (1D) sequence. Thus, 1D sequences can be used to identify structural similarities among proteins using standard sequence alignment tools such as BLAST or FASTA.
Year: 2008 PMID: 18721472 PMCID: PMC2529324 DOI: 10.1186/1471-2105-9-349
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Relationship between the number of clusters found and the number of SOM map units used
| SOM map size | Number of clusters | SOM map size | Number of clusters |
| 10 × 10 | 6 | 110 × 110 | 24 |
| 20 × 20 | 9 | 120 × 120 | 19 |
| 30 × 30 | 10 | 130 × 130 | 21 |
| 40 × 40 | 12 | 140 × 140 | 22 |
| 50 × 50 | 15 | 150 × 150 | 18 |
| 60 × 60 | 13 | 160 × 160 | 15 |
| 70 × 70 | 14 | 170 × 170 | 21 |
| 80 × 80 | 18 | 180 × 180 | 18 |
| 90 × 90 | 18 | 190 × 190 | 18 |
| 100 × 100 | 20 | 200 × 200 | 18 |
Our analysis determined that among the number of clusters that maximized the BIC, 18 clusters occurred most frequently. Thus, we assigned 18 letters to our alphabet.
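The selection of the alphabet size can be illustrated with a short sketch. It takes the cluster counts from the table above and returns the most frequent one; the dictionary and function names are hypothetical, and the BIC maximization that produced each count is not reproduced here.

```python
from collections import Counter

# Cluster counts copied from the table above, keyed by SOM side length
# (10 means a 10 x 10 map). Names are illustrative, not from the paper.
clusters_by_map_size = {
    10: 6, 20: 9, 30: 10, 40: 12, 50: 15,
    60: 13, 70: 14, 80: 18, 90: 18, 100: 20,
    110: 24, 120: 19, 130: 21, 140: 22, 150: 18,
    160: 15, 170: 21, 180: 18, 190: 18, 200: 18,
}


def most_frequent_cluster_count(counts):
    """Return (cluster_count, occurrences) for the most common value."""
    tally = Counter(counts.values())
    return tally.most_common(1)[0]


alphabet_size, occurrences = most_frequent_cluster_count(clusters_by_map_size)
print(alphabet_size, occurrences)  # 18 6
```

In this table, 18 clusters appear for six of the twenty map sizes, more often than any other count.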
The average overlap between all pairs of SOMs that produced 18 clusters of fragments
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | |
| 99.8 | 98.4 | 96.7 | 97.4 | 97.4 | 94.3 | 99.1 | 95.0 | 97.8 | 94.6 | 99.8 | 95.6 | 96.7 | 95.3 | 95.7 | 98.2 | 96.3 | 95.5 |
Summary of the within-cluster Euclidean distance and the center-to-center Euclidean distance for 18 protein fragment clusters found by our alphabet design pipeline
| Cluster | Within-cluster distance | | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| 1 | 116.6 | 37.2 | 252.3 | 300.4 | 330.1 | 242.8 | 181.7 | 182.1 | 317.6 | 327.7 | 415.4 | 266.3 | 329.0 | 181.7 | 242.5 | 262.2 | 273.6 | 253.4 | 193.2 | 0 |
| 2 | 238.7 | 38.5 | 315.8 | 226.6 | 272.7 | 197.4 | 243.3 | 227.2 | 285.3 | 270.5 | 346.1 | 283.9 | 285.4 | 261.3 | 189.5 | 182.3 | 215.0 | 296.0 | 0 | |
| 3 | 264.7 | 29.8 | 219.7 | 279.8 | 193.6 | 220.6 | 190.4 | 284.1 | 251.1 | 292.9 | 413.2 | 195.1 | 237.6 | 181.4 | 324.4 | 234.1 | 285.9 | 0 | ||
| 4 | 319.3 | 41.5 | 297.8 | 297.0 | 270.7 | 285.5 | 311.5 | 288.6 | 286.9 | 317.1 | 352.2 | 302.9 | 184.3 | 250.7 | 256.2 | 193.3 | 0 | |||
| 5 | 250.4 | 39.7 | 248.6 | 268.9 | 190.2 | 238.1 | 302.2 | 280.1 | 258.5 | 287.2 | 406.6 | 267.2 | 258.8 | 192.8 | 229.0 | 0 | ||||
| 6 | 257.5 | 28.0 | 220.4 | 174.2 | 242.3 | 180.4 | 262.8 | 266.2 | 264.4 | 229.1 | 310.3 | 322.3 | 270.9 | 308.6 | 0 | |||||
| 7 | 72.2 | 20.4 | 220.8 | 356.7 | 289.2 | 297.5 | 266.1 | 244.8 | 307.2 | 361.1 | 478.3 | 248.9 | 316.8 | 0 | ||||||
| 8 | 282.2 | 31.0 | 275.3 | 214.2 | 186.1 | 218.9 | 259.1 | 335.6 | 258.2 | 253.8 | 286.9 | 273.9 | 0 | |||||||
| 9 | 320.9 | 27.9 | 275.8 | 287.6 | 250.7 | 244.5 | 222.6 | 292.2 | 286.7 | 307.3 | 354.3 | 0 | ||||||||
| 10 | 148.8 | 26.1 | 406.3 | 243.1 | 334.4 | 286.3 | 333.5 | 361.8 | 293.2 | 240.8 | 0 | |||||||||
| 11 | 97.1 | 43.4 | 290.4 | 169.5 | 214.8 | 178.9 | 248.7 | 238.3 | 270.4 | 0 | ||||||||||
| 12 | 272.0 | 32.7 | 259.7 | 226.6 | 200.7 | 218.7 | 269.1 | 325.6 | 0 | |||||||||||
| 13 | 133.6 | 33.2 | 291.2 | 309.3 | 334.3 | 267.6 | 230.5 | 0 | ||||||||||||
| 14 | 272.8 | 31.4 | 255.5 | 206.2 | 258.7 | 145.3 | 0 | |||||||||||||
| 15 | 106.2 | 32.3 | 241.1 | 76.8 | 162.1 | 0 | ||||||||||||||
| 16 | 109.0 | 39.1 | 221.8 | 172.6 | 0 | |||||||||||||||
| 17 | 33.2 | 23.2 | 272.9 | 0 | ||||||||||||||||
| 18 | 146.2 | 38.2 | 0 | |||||||||||||||||
Summary of the within-cluster Euclidean distance and the center-to-center Euclidean distance for 18 protein fragment clusters found by the SOM alone
| Cluster | Within-cluster distance | | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| 1 | 129.1 | 38.9 | 220.9 | 270.9 | 302.4 | 202.7 | 175.7 | 161.9 | 277.2 | 295.9 | 381.4 | 233.8 | 307.6 | 181.6 | 234.8 | 223.0 | 263.6 | 250.5 | 164.4 | 0 |
| 2 | 242.7 | 39.4 | 304.8 | 198.1 | 265.0 | 192.0 | 201.2 | 217.5 | 241.3 | 237.1 | 309.9 | 244.0 | 247.9 | 259.6 | 165.5 | 169.5 | 189.0 | 273.5 | 0 | |
| 3 | 265.8 | 29.8 | 180.3 | 276.0 | 168.3 | 177.9 | 169.5 | 266.4 | 237.3 | 256.2 | 397.2 | 185.0 | 218.6 | 156.4 | 298.5 | 224.5 | 280.4 | 0 | ||
| 4 | 327.1 | 41.4 | 261.7 | 275.8 | 265.8 | 241.5 | 298.8 | 273.8 | 250.9 | 297.5 | 321.2 | 266.9 | 182.9 | 215.2 | 250.4 | 156.8 | 0 | |||
| 5 | 251.9 | 39.6 | 206.8 | 223.7 | 150.2 | 207.1 | 300.9 | 258.4 | 217.8 | 274.8 | 400.7 | 227.2 | 243.6 | 167.1 | 227.6 | 0 | ||||
| 6 | 260.7 | 29.2 | 202.0 | 158.5 | 235.8 | 137.3 | 248.4 | 225.3 | 258.8 | 205.8 | 304.7 | 300.23 | 235.0 | 297.0 | 0 | |||||
| 7 | 75.7 | 20.5 | 191.9 | 323.9 | 243.6 | 280.5 | 238.4 | 199.0 | 292.7 | 346.6 | 463.2 | 247.1 | 291.3 | 0 | ||||||
| 8 | 291.4 | 30.8 | 250.4 | 196.2 | 144.8 | 203.1 | 245.8 | 322.9 | 245.4 | 226.8 | 265.6 | 272.7 | 0 | |||||||
| 9 | 329.1 | 27.9 | 275.3 | 251.6 | 219.3 | 200.6 | 197.1 | 263.8 | 278.1 | 272.5 | 342.3 | 0 | ||||||||
| 10 | 157.9 | 27.4 | 364.8 | 240.4 | 310.3 | 262.0 | 292.9 | 329.8 | 266.4 | 234.3 | 0 | |||||||||
| 11 | 113.8 | 45.8 | 244.9 | 156.7 | 190.2 | 167.5 | 224.8 | 213.6 | 254.9 | 0 | ||||||||||
| 12 | 283.0 | 32.4 | 215.7 | 205.4 | 197.6 | 191.5 | 239.0 | 299.2 | 0 | |||||||||||
| 13 | 170.3 | 29.5 | 277.6 | 272.6 | 322.4 | 252.2 | 188.8 | 0 | ||||||||||||
| 14 | 277.8 | 32.6 | 238.5 | 179.7 | 239.8 | 99.5 | 0 | |||||||||||||
| 15 | 111.2 | 33.1 | 210.6 | 59.5 | 161.6 | 0 | ||||||||||||||
| 16 | 114.05 | 38.4 | 219.4 | 146.8 | 0 | |||||||||||||||
| 17 | 36.2 | 24.8 | 228.6 | 0 | ||||||||||||||||
| 18 | 158.5 | 37.4 | 0 | |||||||||||||||||
The average Phi/Psi angles (i.e. the Phi/Psi angles of the centroid) for the 18 clusters found by our alphabet design pipeline
| -97.99 | -70.43 | -104.52 | -79.77 | 132.99 | 118.98 | 132.37 | -44.26 | |
| -67.81 | -67.48 | -92.52 | -69.17 | -52.78 | 134.75 | 96.12 | -35.69 | |
| -98.66 | -99.17 | -83.46 | -104.16 | 132.56 | 75.64 | -36.97 | 134.01 | |
| 90.39 | -63.35 | -93.54 | -84.31 | -5.43 | 97.71 | 115.22 | 94.64 | |
| -88.09 | -102.50 | -93.58 | -97.49 | -51.56 | 88.66 | 106.12 | 133.27 | |
| -65.87 | -69.19 | -85.50 | -59.89 | -35.12 | -50.41 | 129.98 | -37.57 | |
| -107.28 | -96.08 | -107.66 | -105.96 | 132.71 | 130.92 | 133.88 | 133.06 | |
| 89.16 | -93.43 | -62.92 | -90.25 | 20.65 | 0.22 | -32.50 | 85.94 | |
| -91.05 | -90.16 | 91.91 | -91.53 | 100.48 | 103.36 | 5.40 | 75.56 | |
| 58.59 | 56.79 | 55.50 | 54.75 | -42.38 | -38.76 | -47.77 | -48.46 | |
| -71.08 | -84.21 | -65.92 | 87.57 | -21.11 | -29.95 | -31.80 | 20.00 | |
| -83.07 | 95.78 | -69.02 | -91.34 | 9.50 | -9.18 | -5.50 | 100.52 | |
| -88.72 | -64.82 | -95.72 | 91.27 | 100.65 | 113.69 | 107.43 | 0.70 | |
| -87.36 | -71.63 | -75.80 | -68.31 | 134.69 | 58.97 | -35.87 | -49.72 | |
| -96.95 | -78.84 | -75.71 | -78.03 | 4.07 | 2.17 | -33.25 | -25.92 | |
| -83.07 | -95.71 | -63.62 | -97.87 | -28.27 | -28.59 | -38.35 | 126.57 | |
| -63.55 | -65.43 | -62.97 | -68.03 | -42.53 | -41.88 | -42.16 | -38.34 | |
| -105.06 | -91.96 | -78.47 | -94.14 | 122.89 | -83.40 | 109.64 | 99.64 |
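As a rough illustration of how cluster centroids like these could be used to encode a fragment, the sketch below assigns an 8-angle fragment to its nearest centroid. Several things here are assumptions rather than the authors' method: only three rows of the table are used, they are labeled by their row position in the table (the letter mapping is not given here), and the distance is a Euclidean norm over wrapped angular differences.

```python
import math

# Three centroid rows copied from the table above (degrees), keyed by table row.
# Row numbers as labels are an assumption; the cluster-to-letter mapping is
# not stated in this section.
CENTROIDS = {
    7: (-107.28, -96.08, -107.66, -105.96, 132.71, 130.92, 133.88, 133.06),
    10: (58.59, 56.79, 55.50, 54.75, -42.38, -38.76, -47.77, -48.46),
    17: (-63.55, -65.43, -62.97, -68.03, -42.53, -41.88, -42.16, -38.34),
}


def angular_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def distance(u, v):
    """Euclidean norm of per-angle wrapped differences (an assumed metric)."""
    return math.sqrt(sum(angular_diff(a, b) ** 2 for a, b in zip(u, v)))


def nearest_centroid(fragment, centroids=CENTROIDS):
    """Return the key of the centroid closest to the fragment."""
    return min(centroids, key=lambda k: distance(fragment, centroids[k]))


# A made-up fragment with angles in the same column order as the table.
fragment = (-60.0, -65.0, -60.0, -70.0, -45.0, -40.0, -40.0, -40.0)
print(nearest_centroid(fragment))  # 17
```

Wrapping the differences matters because angles such as 179 and -179 degrees are only 2 degrees apart, which a plain subtraction would miss.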
Figure 1. The 3D conformation of the representative segment for each alphabet letter.
Figure 2. Superimposition of protein segments in the 18 clusters.
Frequency of occurrence of the protein segments defined by our alphabet in four main SCOP classes
| All alpha | All beta | alpha/beta | alpha+beta | |||||
| Letter | Count | Percentage | Count | Percentage | Count | Percentage | Count | Percentage |
| A | 54859 | 2.95 | 255473 | 8.83 | 278238 | 5.46 | 161041 | 6.07 |
| R | 91363 | 4.92 | 148361 | 5.13 | 270345 | 5.31 | 145619 | 5.49 |
| N | 76176 | 4.10 | 309834 | 10.71 | 340682 | 6.69 | 202555 | 7.64 |
| D | 21055 | 1.13 | 127159 | 4.39 | 112078 | 2.20 | 66959 | 2.53 |
| C | 34856 | 1.88 | 172334 | 5.96 | 193952 | 3.81 | 102632 | 3.87 |
| Q | 102444 | 5.51 | 111333 | 3.85 | 271893 | 5.34 | 138081 | 5.21 |
| E | 58672 | 3.16 | 782607 | 27.06 | 620717 | 12.18 | 427778 | 16.14 |
| G | 42350 | 2.28 | 72105 | 2.49 | 147390 | 2.89 | 76968 | 2.90 |
| H | 39017 | 2.10 | 115542 | 3.99 | 163319 | 3.21 | 89203 | 3.37 |
| I | 3547 | 0.19 | 6607 | 0.23 | 9449 | 0.19 | 5739 | 0.22 |
| L | 49312 | 2.65 | 40909 | 1.41 | 141605 | 2.78 | 65856 | 2.48 |
| K | 43582 | 2.35 | 58687 | 2.04 | 146869 | 2.88 | 70549 | 2.66 |
| M | 16727 | 0.90 | 127070 | 4.39 | 110318 | 2.17 | 67912 | 2.56 |
| F | 70718 | 3.81 | 89366 | 3.09 | 179145 | 3.52 | 91702 | 3.46 |
| P | 104364 | 5.62 | 54939 | 1.91 | 192654 | 3.78 | 87149 | 3.29 |
| S | 76080 | 4.10 | 83725 | 2.89 | 173935 | 3.41 | 91160 | 3.44 |
| T | 937938 | 50.49 | 149259 | 5.17 | 1551585 | 30.46 | 651525 | 24.58 |
| W | 34533 | 1.86 | 186476 | 6.46 | 190001 | 3.72 | 108460 | 4.09 |
Figure 3. Example learning curve of matrix training. The average positive hit rate converged at 0.9153 with the learning rate set to 0.5.
Figure 4. The substitution matrix TRISUM-169.
SA-FAST versus 3D-BLAST and PSI-BLAST in SCOP structural function assignment accuracy for the SCOP-894 protein dataset
| Class | 894 proteins | Accuracy (a), 894 proteins | | | Accuracy, sequence identity <25% | | |
| | Number of queries | SA-FAST | 3D-BLAST | PSI-BLAST | SA-FAST | 3D-BLAST | PSI-BLAST |
| All alpha | 161 | 99.27 | 94.41 | 94.41 | 95.83 | 75.00 | 66.67 |
| All beta | 199 | 95.12 | 94.47 | 93.97 | 87.32 | 77.55 | 73.33 |
| alpha/beta | 292 | 97.58 | 97.26 | 91.44 | 95.68 | 87.88 | 65.77 |
| alpha+beta | 242 | 95.13 | 94.63 | 88.84 | 93.81 | 83.33 | 60.87 |
(a) The top-ranked family in the hit list of a query was used as the predicted family. Accuracy is the percentage of times that the family was correctly predicted.
Comparison between SA-FAST, 3D-BLAST, PSI-BLAST, YAKUSA, MAMMOTH, and CE on 50 proteins selected from SCOP95-1.69
| Search tool | Average time required for a query (sec) | Relative to SA-FAST | Accuracy (a) (%) | Average precision (b) (%) |
| SA-FAST | 1.15 | 1.00 | 96 | 90.80 |
| 3D-BLAST | 1.30 | 1.13 | 94 | 85.20 |
| PSI-BLAST | 0.48 | 0.42 | 84 | 68.16 |
| YAKUSA | 8.88 | 7.72 | 90 | 74.86 |
| MAMMOTH | 1834.18 | 1594.94 | 100 | 94.01 |
| CE | 22053.32 | 19176.80 | 98 | 90.78 |
(a) The top-ranked family in the hit list of a query was used as the predicted family. Accuracy is the percentage of times that the family was correctly predicted.
(b) The precision is defined as T/H, where T is the number of true hit structures in the hit list, and H is the total number of structures in the hit list.
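The two measures defined in the footnotes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluation code behind the table: the function names, structure identifiers, and family labels are hypothetical.

```python
def precision(hit_list, true_structures):
    """Precision T/H as a percentage: true hits over all hits in the list."""
    if not hit_list:
        return 0.0
    true_hits = sum(1 for hit in hit_list if hit in true_structures)
    return 100.0 * true_hits / len(hit_list)


def top_rank_accuracy(predictions):
    """Percentage of queries whose top-ranked family equals the true family.

    predictions is a list of (top_ranked_family, true_family) pairs.
    """
    correct = sum(1 for predicted, truth in predictions if predicted == truth)
    return 100.0 * correct / len(predictions)


# Hypothetical hit list: 4 of the 5 returned structures are true hits.
hits = ["s1", "s2", "s3", "s4", "s5"]
truth = {"s1", "s2", "s3", "s4"}
print(precision(hits, truth))  # 80.0

# Hypothetical 50-query run with 48 correct top-ranked families.
run = [("fam_a", "fam_a")] * 48 + [("fam_b", "fam_a")] * 2
print(top_rank_accuracy(run))  # 96.0
```

With 50 queries, an accuracy of 96% corresponds to 48 correctly predicted families. The average precision column would then be the mean of the per-query precision values.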
Results of ten difficult cases of pairwise alignment
| 1fxia | 1ubq | 48(2.10) | 60(2.60) | 64(3.80) | 63(3.01) | 59(2.76) | 76(2.89) | 58(2.64) |
| 1ten | 3hhrb | 78(1.60) | 86(1.90) | 87(1.90) | 87(1.90) | 57(2.57) | 73(2.31) | 90(2.24) |
| 3hlab | 2rhe_ | - | 63(2.50) | 85(3.50) | 79(2.81) | 54(2.65) | 78(3.01) | 79(2.87) |
| 2azaa | 1paz_ | 74(2.20) | 81(2.50) | 85(2.90) | 87(3.01) | 70(2.34) | 57(2.23) | 87(2.40) |
| 1cewi | 1mola | 71(1.9) | 81(2.30) | 69(1.90) | 83(2.44) | 52(2.37) | 53(2.35) | 61(1.83) |
| 1cid_ | 2rhe_ | 85(2.20) | 95(3.30) | 94(2.70) | 100(3.11) | 54(2.75) | 53(2.49) | 55(2.08) |
| 1crl_ | 1ede | - | 211(3.40) | 187(3.20) | 269(3.55) | 167(3.35) | 120(3.47) | 187(3.25) |
| 2sim_ | 1nsba | 284(3.80) | 286(3.80) | 264(3.00) | 286(3.07) | 121(2.75) | 121(2.96) | 137(3.2) |
| 1bgea | 2gmfa | 74(2.50) | 98(3.50) | 94(4.10) | 100(3.19) | 27(3.34) | 77(2.8) | 78(2.72) |
| 1tie_ | 4fgf_ | 82(1.70) | 108(2.00) | 116(2.90) | 117(3.05) | 91(3.15) | 62(3.45) | 115(3.05) |
The number of residues aligned and the RMSD (in parentheses) are shown. The last row displays the average RMSD per aligned residue. Except for PBE-align, 3D-BLAST, and SA-FAST, the results of the methods were adopted from [36].
Figure 5. Superimposition examples based on alignments identified by SA-FAST. (a) 1fxiA & 1ubq_ (b) 2azaA & 1paz_ (c) 1cewI & 1molA (d) 1cid_ & 2rhe.
Comparison between our structural alphabet (used in SA-FAST) and those of Yang & Tung (used in 3D-BLAST) and de Brevern et al. (converted by PBE-T, a facility associated with PBE-align) for describing motifs found by MEME within the EGF family
| 24 | 23 | 95.8 | 22 | 91.7 | 23 | 95.8 | 11 | 45.8 | 21 | 87.5 | 19 | 79.2 | 18 | 75.0 | 14 | 58.3 | 18 | 75.0 | |
| 74 | 73 | 98.6 | 71 | 95.9 | 74 | 100.0 | 62 | 83.8 | 73 | 98.6 | 60 | 81.1 | 68 | 91.9 | 62 | 83.8 | 70 | 94.6 | |
| 117 | 116 | 99.1 | 106 | 90.6 | 61 | 52.1 | 54 | 46.2 | 102 | 87.2 | 25 | 21.4 | 109 | 93.2 | 112 | 95.7 | 48 | 41.0 | |
| 12 | 12 | 100.0 | 11 | 91.7 | 11 | 91.7 | 9 | 75.0 | 11 | 91.7 | 9 | 75.0 | 12 | 100.0 | 11 | 91.7 | 9 | 75.0 | |
| 227 | 224 | 98.6 | 210 | 92.5 | 169 | 74.4 | 136 | 59.9 | 207 | 91.2 | 113 | 49.8 | 207 | 91.2 | 199 | 87.7 | 145 | 63.9 | |
(a) The number of EGF proteins of a specific type.
(b) A hit for a sub-domain occurred when more than half of the sub-domain residues were contained in a given motif. We present the number of hits of different types.
(c) Cov (coverage) was defined as the ratio of the number of hits to the number of EGF proteins, e.g., if No. = 24 and Hits = 22, then Cov = 22/24 = 91.7%.
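The coverage definition in the footnote amounts to a single ratio. The sketch below reproduces the footnote's worked example; the function name is hypothetical.

```python
def coverage(hits, total):
    """Coverage as a percentage: number of hits over number of proteins."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


# The worked example from the footnote: No. = 24 and Hits = 22.
print(f"{coverage(22, 24):.1f}%")  # 91.7%
```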
Statistical analysis of EGF(EGF-like) proteins whose sub-domains were detected by MEME
| 151 | 66.52 | 79 | 34.80 | 104 | 45.81 | |
| 74 | 32.60 | 78 | 34.36 | 116 | 51.10 | |
| 2 | 0.88 | 63 | 27.75 | 7 | 3.08 | |
| 0 | 0.00 | 7 | 3.08 | 0 | 0.00 | |
| 227 | 100.00 | 227 | 100.00 | 227 | 100.00 | |
(a) EGF (EGF-like) proteins in which all three sub-domains (A, B and C) were found by MEME.
(b) EGF (EGF-like) proteins in which two of the three sub-domains were found by MEME.
(c) EGF (EGF-like) proteins in which only one sub-domain was found by MEME.
(d) EGF (EGF-like) proteins in which MEME failed to identify any sub-domain.
Figure 6. Examples of structural motifs corresponding to EGF sub-domains. We colored the sub-domains A, B, and C in blue, green, and red, respectively. The motifs that corresponded to EGF sub-domains, using our structural alphabet and those of Yang & Tung and de Brevern et al., were also highlighted in blue, green, and red. The overlapping region between motifs was colored purple. In the sequence view, the first three sequences are the EGF protein represented by our structural alphabet, the alphabet of Yang & Tung, and the alphabet of de Brevern et al., respectively. The fourth is the amino acid sequence with the cysteines highlighted in orange. The sub-domains are marked at the bottom.
Summary of properties of structural alphabets and alphabet designs
| 23 | 16 | 18 | |
| Prespecified | Iterative shrinking | BIC | |
| k-means | SOM+HMM | SOM+k-means | |
| Preprocessed (Pair Database) | Preprocessed (PBE-SELECT) | No preprocess (nrPDB) | |
| BLOSUM-like | BLOSUM-like | Self-Training | |
| Yes | Yes | No | |
| Limited | Limited | Modular design; more flexible |
Figure 7. System architecture of the matrix training framework.