Hsin-Nan Lin, Cédric Notredame, Jia-Ming Chang, Ting-Yi Sung, Wen-Lian Hsu.
Abstract
Most sequence alignment tools can successfully align protein sequences with high levels of sequence identity. The accuracy of the corresponding structure alignment, however, decreases rapidly for distantly related sequences (<20% identity). In this range of identity, alignments optimized to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Currently available methods differ essentially in how they measure similarity between aligned residues: substitution matrices, Fourier transform, sophisticated profile-profile functions or, more recently, consistency-based approaches.

In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words that are found across a sizeable fraction of the considered dataset and supported by evolutionary analysis. These words are then used to define a position-specific substitution matrix that better reflects the biological significance of local similarity. The experimental results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non-similar proteins.

We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity.
SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.
Year: 2011 PMID: 22163274 PMCID: PMC3229492 DOI: 10.1371/journal.pone.0027872
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Connecting two counterpart regions by shared synonyms of two protein sequences.
The words YIAKQRQ in protein S and VKALPDA in protein T share two synonyms which are extracted from their similar sequences.
Figure 2. The algorithm of SymAlign. We use PSI-BLAST to collect a group of similar sequences for the targets from which we define synonyms.
Similarity scores are estimated based on the shared synonyms. A library of all alignable residue pairs is made and fed into T-Coffee for generating a sequence alignment.
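The shared-synonym idea described in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch. It assumes the synonym sets for each word have already been collected (the PSI-BLAST search and the evolutionary filtering are not reproduced here), and it simply counts shared synonyms for every residue pair covered by two words. The function name, the data layout, and the example synonyms are hypothetical, and the way SymAlign converts such counts into similarity scores is not specified in this summary.

```python
from typing import Dict, List, Set


def shared_synonym_matrix(
    syn_s: Dict[int, Set[str]],
    syn_t: Dict[int, Set[str]],
    len_s: int,
    len_t: int,
    word_len: int,
) -> List[List[int]]:
    """Count shared synonyms for every residue pair (i in S, j in T).

    syn_s / syn_t map the start position of a word in S / T to the set of
    synonyms of that word (assumed to be precomputed elsewhere).
    """
    matrix = [[0] * len_t for _ in range(len_s)]
    for start_s, words_s in syn_s.items():
        for start_t, words_t in syn_t.items():
            shared = len(words_s & words_t)
            if shared == 0:
                continue
            # Credit each residue pair covered by the two words, offset by offset.
            for k in range(word_len):
                i, j = start_s + k, start_t + k
                if i < len_s and j < len_t:
                    matrix[i][j] += shared
    return matrix


# Toy example using the two words from Figure 1; the synonyms are invented.
seq_s = "YIAKQRQ"
seq_t = "VKALPDA"
synonyms_s = {0: {"YIAKQRQ", "YVAKLRQ", "VIAKLPQ"}}
synonyms_t = {0: {"VKALPDA", "YVAKLRQ", "VIAKLPQ"}}

dots = shared_synonym_matrix(synonyms_s, synonyms_t, len(seq_s), len(seq_t), 7)
for row in dots:
    print(" ".join(str(v) for v in row))
```

In this toy case the two words share two synonyms, so every cell on the main diagonal holds 2 and all other cells hold 0. A matrix of this kind is also the sort of data a dot-matrix view could display.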
Comparison with existing methods on pairwise alignments.
| Methods | BAliBASE's RV11: Q-score (%) | BAliBASE's RV11: iRMSD (Å) | BAliBASE's RV11: RMSD (Å) | PREFAB: Q-score (%) | PREFAB: iRMSD (Å) | PREFAB: RMSD (Å) |
| SymAlign | 45.78 | 1.31 | 11.10 | 22.56 | 1.35 | 11.57 |
| Probalign | 40.42 | 1.38 | 11.84 | 21.10 | 1.40 | 11.89 |
| MTRAP | 39.71 | 1.41 | 12.97 | 21.80 | 1.44 | 12.91 |
| T-Coffee | 39.58 | 1.38 | 13.10 | 21.45 | 1.43 | 13.30 |
| ProbCons | 38.76 | 1.38 | 13.15 | 21.03 | 1.40 | 13.23 |
| MUSCLE | 37.44 | 1.42 | 13.23 | 21.18 | 1.47 | 13.29 |
| MAFFT | 35.35 | 1.41 | 13.63 | 19.13 | 1.45 | 13.78 |
| ClustalW | 34.21 | 1.48 | 13.69 | 19.14 | 1.50 | 13.29 |
| B_DHIP | 27.44 | 1.48 | 13.75 | 13.81 | 1.43 | 12.93 |
| Dialign | 29.78 | 1.42 | 14.91 | 15.71 | 1.42 | 14.00 |
| BASIC | 14.91 | 1.41 | 16.73 | 8.57 | 1.48 | 14.86 |
Every pair of proteins contained in each test set was aligned with each aligner and subsequently evaluated with the three metrics: Q-score, iRMSD and RMSD. SymAlign achieves the best ranking on the two test sets and the three quality measures.
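For readers unfamiliar with the Q-score, it is commonly defined as the percentage of residue pairs in the reference alignment that are also aligned in the test alignment. The following is a minimal sketch for a pairwise alignment under that definition; the function names are hypothetical, and benchmark tools may restrict the count to annotated core regions, which this sketch does not do.

```python
from typing import Set, Tuple


def aligned_pairs(row_a: str, row_b: str, gap: str = "-") -> Set[Tuple[int, int]]:
    """Return the set of (index in A, index in B) residue pairs placed in one column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # positions in the ungapped sequences
    for ca, cb in zip(row_a, row_b):
        if ca != gap and cb != gap:
            pairs.add((i, j))
        if ca != gap:
            i += 1
        if cb != gap:
            j += 1
    return pairs


def q_score(test: Tuple[str, str], reference: Tuple[str, str]) -> float:
    """Percentage of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        return 0.0
    test_pairs = aligned_pairs(*test)
    return 100.0 * len(ref_pairs & test_pairs) / len(ref_pairs)


# Toy example: the reference aligns 3 residue pairs; the test recovers 1 of them.
reference = ("ACDE-", "A-DEF")
test = ("ACDE", "ADEF")
print(round(q_score(test, reference), 2))  # 33.33
```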
Comparison with existing methods on multiple alignments and outliers.
| Methods | 1. BAliBASE's RV11: RMSD (Å) | 2. BAliBASE's RV11' (with outliers): RMSD (Å) |
| SymAlign | 9.20 | 9.40 |
| Probalign | 8.70 | 10.20 |
| T-Coffee | 10.31 | 10.80 |
| ProbCons | 10.31 | 11.09 |
| MUSCLE | 11.75 | 13.39 |
| Dialign | 11.90 | 11.64 |
| MAFFT | 12.21 | 13.89 |
| ClustalW | 12.44 | 13.39 |
| MTRAP | 16.38 | 16.53 |
We estimated the alignment accuracy on the original RV11 test sets and on the same sets with outliers added. The experimental results show that SymAlign is more robust to outliers than any other aligner tested here.
The comparison results of identifying structural similarity on RV11.
| Method | Sequence Identity >10%: Precision (%) | Sequence Identity >10%: Recall (%) | Sequence Identity >15%: Precision (%) | Sequence Identity >15%: Recall (%) | Sequence Identity >20%: Precision (%) | Sequence Identity >20%: Recall (%) |
| TM-align | 93.40 | 23.74 | 100.00 | 10.61 | 100.00 | 6.42 |
| SymAlign | 40.57 | 23.46 | 81.13 | 12.01 | 100.00 | 7.26 |
| Dialign | 6.69 | 28.21 | 28.77 | 11.17 | 90.00 | 7.54 |
| ClustalW | 1.65 | 65.36 | 18.46 | 20.11 | 85.71 | 8.37 |
| MTRAP | 1.63 | 65.08 | 17.41 | 21.78 | 88.89 | 8.94 |
| Probalign | 1.61 | 63.13 | 6.62 | 26.53 | 85.41 | 11.45 |
| T-Coffee | 1.55 | 74.02 | 6.44 | 28.49 | 77.77 | 9.77 |
| ProbCons | 1.52 | 74.86 | 5.96 | 32.12 | 74.00 | 10.33 |
| MUSCLE | 1.47 | 83.80 | 3.39 | 38.26 | 72.73 | 11.17 |
| MAFFT | 1.32 | 88.83 | 1.66 | 60.05 | 19.87 | 17.32 |
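Precision and recall at a sequence identity threshold, as tabulated above, can be computed along the following lines. This is a sketch under stated assumptions: each protein pair carries the sequence identity measured on an aligner's output plus a boolean label saying whether the pair is structurally similar, and a pair is predicted similar when its identity exceeds the threshold. How the structural label was assigned in the study is not stated in this summary, so the labels and the toy values below are hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall(
    pairs: Iterable[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (precision %, recall %) for the rule 'identity > threshold'.

    Each item in pairs is (sequence identity in %, structurally similar?).
    """
    tp = fp = fn = 0
    for identity, similar in pairs:
        predicted = identity > threshold
        if predicted and similar:
            tp += 1
        elif predicted:
            fp += 1
        elif similar:
            fn += 1
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: (identity %, structurally similar?)
toy_pairs = [(25.0, True), (18.0, True), (12.0, False), (8.0, True), (22.0, False)]

for cutoff in (10.0, 15.0, 20.0):
    p, r = precision_recall(toy_pairs, cutoff)
    print(f"identity > {cutoff:.0f}%: precision {p:.2f}, recall {r:.2f}")
```

Raising the threshold leaves fewer predicted positives, which is why recall generally falls as the threshold rises across the columns of the tables.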
The comparison results of identifying structural similarity on PREFAB.
| Method | Sequence Identity >10%: Precision (%) | Sequence Identity >10%: Recall (%) | Sequence Identity >15%: Precision (%) | Sequence Identity >15%: Recall (%) | Sequence Identity >20%: Precision (%) | Sequence Identity >20%: Recall (%) |
| TM-align | 94.53 | 14.55 | 98.42 | 7.59 | 98.75 | 4.22 |
| SymAlign | 14.25 | 15.68 | 70.41 | 8.70 | 95.89 | 4.78 |
| Dialign | 3.03 | 24.77 | 22.65 | 9.55 | 87.14 | 4.88 |
| MTRAP | 1.42 | 50.71 | 10.68 | 14.38 | 87.02 | 5.74 |
| Probalign | 1.41 | 51.29 | 5.07 | 19.15 | 70.44 | 6.79 |
| ClustalW | 1.39 | 52.46 | 9.26 | 14.56 | 81.08 | 5.79 |
| T-Coffee | 1.27 | 58.22 | 4.34 | 20.52 | 59.45 | 6.91 |
| ProbCons | 1.25 | 60.52 | 3.84 | 21.59 | 55.05 | 7.18 |
| MUSCLE | 1.17 | 67.25 | 2.82 | 25.87 | 53.63 | 7.16 |
| MAFFT | 1.10 | 79.29 | 1.48 | 45.02 | 13.69 | 10.92 |
The proportion of positive cases identified by both TM-align and SymAlign to those identified only by TM-align, at different sequence identity thresholds.
| Dataset | Sequence Identity >10% | Sequence Identity >15% | Sequence Identity >20% |
| BAliBASE's RV11 | 91.76% | 89.47% | 95.65% |
| PREFAB | 86.39% | 90.95% | 92.73% |
The experiment shows very strong agreement between SymAlign and TM-align on both the RV11 and PREFAB datasets.
Figure 3. The dot matrix generated by SymAlign for proteins BB11002.1bb9 and BB11002.1ov3_A in RV11.
The gray level of a dot represents the number of shared synonyms for the corresponding residue pair. A dot is drawn on a red scale instead if the residue pair is annotated as an equivalent pair in the reference alignment. The left side of the matrix shows an alternative alignment with a pattern very similar to the reference alignment.