Abstract
Most recent improvements in multiple sequence alignment accuracy come from better use of vertical information, including the incorporation of consistency-based pairwise alignments and the use of profile alignments. We observe that accuracy can be improved further by taking into account the alignment of neighboring residues when aligning two residues, thus making better use of horizontal information. By modifying existing multiple alignment algorithms to use horizontal information, we show that this strategy consistently improves over existing algorithms on several sets of benchmark alignments that are commonly used to measure alignment accuracy, with average improvements in accuracy of as much as 1-3% on protein sequence alignment and 5-10% on DNA/RNA sequence alignment. Unlike previous algorithms, consistent average improvements can be obtained across all identity levels.
Year: 2008 PMID: 19056820 PMCID: PMC2632924 DOI: 10.1093/nar/gkn945
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Illustration of the beginning portion of the alignment of sequences 1smvA and 4sbvC from PREFAB (3) by different algorithms. (a) Alignment by MUSCLE (3). (b) Alignment by our algorithm NRAlign, a modification of MUSCLE; this alignment agrees with the reference structural alignment in PREFAB. SS is the secondary structure assignment from DSSP (21), with L denoting loop and E denoting extended strand.
Figure 2. Illustration of the window on two sequences s and s′ with ω = 2. (a) The offsets in N_ω(x,y) = {−2,−1,1,2} are included. (b) Since y + 1 is the last position in s′, only one position is used to the right of (x,y) and the offsets in N_ω(x,y) = {−2,−1,1} are included.
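The boundary behavior described in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `window_offsets`, the 1-based position convention, and the symmetric handling of the left boundary are assumptions made here. Only the two example offset sets for ω = 2 come from the caption.

```python
from typing import List


def window_offsets(x: int, y: int, len_s: int, len_s_prime: int, omega: int) -> List[int]:
    """Return the nonzero offsets d in [-omega, omega] for which both
    x + d and y + d are valid 1-based positions in s and s'.

    Assumption: offsets that fall outside either sequence are dropped,
    which reproduces the truncated window of Figure 2(b).
    """
    offsets = []
    for d in range(-omega, omega + 1):
        if d == 0:
            continue  # the pair (x, y) itself is not one of its neighbors
        if 1 <= x + d <= len_s and 1 <= y + d <= len_s_prime:
            offsets.append(d)
    return offsets


if __name__ == "__main__":
    # Case (a): both positions are far from the sequence ends.
    print(window_offsets(5, 5, 10, 10, omega=2))
    # Case (b): y + 1 is the last position in s', so offset 2 is dropped.
    print(window_offsets(5, 9, 12, 10, omega=2))
```

With ω = 2, the first call yields the full window of case (a) and the second yields the truncated window of case (b).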
Parameter settings for the modified version of each algorithm that uses horizontal information
| Parameter | TCoffee (protein) | MUSCLE (protein) | ProbCons (protein) | MUMMALS (protein) | TCoffee (DNA/RNA) | MUSCLE (DNA/RNA) | ProbConsRNA (DNA/RNA) |
|---|---|---|---|---|---|---|---|
| ω | 3 | 2 | 5 | 1 | 9 | 6 | 15 |
| β | 0.7 | 1.0 | 1.0 | 0.8 | 0.7 | 1.0 | 1.0 |
Average SPS and CS scores (in %) on full length protein sequences in BAliBASE 3.0
| | TCoffee | P-value | MUSCLE | P-value | ProbCons | P-value | MUMMALS | P-value |
|---|---|---|---|---|---|---|---|---|
| SPS | | | | | | | | |
| 1V1 {38} | 53.81 | | 56.21 | | 64.46 | | 64.23 | |
| 1V2 {44} | 91.55 | | 90.62 | | 93.50 | | 93.53 | |
| 1 (V1–V2) {82} | 74.06 | 0.02 | 74.67 | 0.003 | 80.05 | – | 80.03 | – |
| 2 {41} | 88.82 | | 88.08 | | 89.93 | | 89.18 | |
| 3 {30} | 71.09 | | 75.01 | | 78.30 | | 80.76 | |
| 4 {49} | 82.21 | | 84.83 | | 87.25 | | 83.69 | |
| 5 {16} | 80.98 | | 82.69 | | 87.69 | | 86.33 | |
| All (1–5) {218} | 78.88 | 0.04 | 80.11 | 0.006 | 83.89 | – | 83.14 | – |
| CS | | | | | | | | |
| 1V1 {38} | 31.34 | | 33.95 | | 40.45 | | 41.39 | |
| 1V2 {44} | 81.64 | | 80.75 | | 85.52 | | 83.98 | |
| 1 (V1–V2) {82} | 58.33 | 1 × 10^−4 | 59.84 | 0.01 | 64.63 | 0.02 | 64.34 | – |
| 2 {41} | 37.85 | | 35.27 | | 40.49 | | 42.83 | |
| 3 {30} | 36.00 | | 40.57 | | 54.37 | | 49.40 | |
| 4 {49} | 48.20 | | 47.37 | | 53.14 | | 48.55 | |
| 5 {16} | 49.31 | | 44.94 | | 57.31 | | 52.88 | |
| All (1–5) {218} | 48.56 | 7 × 10^−9 | 48.89 | 0.002 | 55.71 | 0.04 | 53.85 | 0.001 |
Reference 1 contains alignments of sequences that are subdivided into two subsets 1V1 (<20% identity) and 1V2 (20–40% identity). Reference 2 contains alignments that include orphan sequences. Reference 3 contains alignments of clusters of sequences from different families. Reference 4 contains alignments of sequences with large terminal extensions, while reference 5 contains alignments of sequences with internal insertions. The number in braces denotes the number of alignments in each subset. For each algorithm, the first number shows the accuracy of the original algorithm (TCoffee, MUSCLE, ProbCons, MUMMALS) that does not use horizontal information. The second number shows the accuracy of the modified algorithm NRAlign that makes use of horizontal information, with the higher accuracy value in bold. The third number shows the P-value, with – indicating insignificant differences. Since many of the subsets are small, P-values are computed only for reference 1 and for the entire set.
Average SPS and CS scores (in %) on HOMSTRAD
| | TCoffee | P-value | MUSCLE | P-value | ProbCons | P-value | MUMMALS | P-value |
|---|---|---|---|---|---|---|---|---|
| SPS | | | | | | | | |
| 0–20% {156} | 46.68 | 0.005 | 48.08 | 4 × 10^−4 | 49.67 | 6 × 10^−8 | 54.39 | – |
| 20–40% {459} | 79.19 | 4 × 10^−13 | 78.86 | 1 × 10^−10 | 80.55 | 3 × 10^−22 | 82.67 | 4 × 10^−4 |
| 40–70% {348} | 94.48 | 2 × 10^−11 | 94.45 | 1 × 10^−4 | 94.75 | 7 × 10^−12 | 95.04 | 9 × 10^−4 |
| 70–100% {69} | 99.10 | – | 99.02 | – | 99.08 | – | 98.94 | 0.005 |
| All {1032} | 80.76 | 2 × 10^−22 | 80.82 | 6 × 10^−16 | 81.91 | 6 × 10^−38 | 83.65 | 5 × 10^−8 |
| CS | | | | | | | | |
| 0–20% {156} | 39.97 | 2 × 10^−4 | 41.77 | 0.003 | 43.12 | 3 × 10^−7 | 47.94 | – |
| 20–40% {459} | 72.97 | 9 × 10^−17 | 73.01 | 2 × 10^−11 | 74.67 | 5 × 10^−24 | 77.31 | 0.001 |
| 40–70% {348} | 91.79 | 2 × 10^−13 | 91.90 | 8 × 10^−5 | 92.20 | 6 × 10^−13 | 92.61 | 3 × 10^−5 |
| 70–100% {69} | 99.03 | – | 98.98 | – | 99.02 | – | 98.87 | 0.007 |
| All {1032} | 76.07 | 1 × 10^−30 | 76.39 | 8 × 10^−16 | 77.44 | 6 × 10^−40 | 79.47 | 4 × 10^−8 |
Each subset includes all protein sequence alignments with average pairwise identity within the specified range. For each algorithm, the higher accuracy value is in bold.
Average Q scores (in %) on PREFAB 4.0
| | TCoffee | P-value | MUSCLE | P-value | ProbCons | P-value | MUMMALS | P-value |
|---|---|---|---|---|---|---|---|---|
| Q(2) | | | | | | | | |
| 0–20% {887} | 37.92 | 1 × 10^−5 | 38.22 | 8 × 10^−8 | 38.95 | 7 × 10^−31 | 43.59 | 0.005 |
| 20–40% {588} | 82.60 | 4 × 10^−8 | 81.75 | 1 × 10^−29 | 82.84 | 4 × 10^−39 | 85.39 | 2 × 10^−4 |
| 40–70% {112} | 96.37 | 0.005 | 96.24 | 0.01 | 96.41 | 5 × 10^−6 | 96.59 | 5 × 10^−4 |
| 70–100% {95} | 97.94 | – | 97.91 | – | 97.76 | 3 × 10^−4 | 97.75 | 0.03 |
| All {1682} | 60.82 | 1 × 10^−12 | 60.68 | 7 × 10^−29 | 61.44 | 3 × 10^−71 | 64.79 | 5 × 10^−8 |
| Q(50) | | | | | | | | |
| 0–20% {887} | 49.67 | 6 × 10^−6 | 50.71 | – | 55.63 | 0.02 | 57.68 | 0.02 |
| 20–40% {588} | 83.94 | 8 × 10^−7 | 85.09 | – | 87.24 | 3 × 10^−7 | 87.24 | 0.02 |
| 40–70% {112} | 95.55 | 0.02* | 94.72 | – | 95.39 | 0.004 | 95.34 | – |
| 70–100% {95} | 97.97 | – | 97.50 | – | 97.26 | 0.001 | 96.68 | 0.005 |
| All {1682} | 67.46 | 2 × 10^−9 | 68.30 | – | 71.68 | 1 × 10^−7 | 72.73 | 5 × 10^−4 |
Each subset includes all structure pairs with protein sequence identity within the specified range. A * next to a P-value indicates worse accuracy. The Q(2) scores are obtained by aligning only the original input protein sequence pair, while the Q(50) scores are obtained by aligning the full set of protein sequences (at most 50), which also includes random hits from database search, with evaluation made on the original input sequence pair. For each algorithm, the higher accuracy value is in bold.
Average f_D and f_M scores and average normalized Dali Z-score, GDT_TS score, and ContactA and ContactB scores (in %) on the Twilight and Superfamily subsets of SABmark 1.65
| | TCoffee | P-value | MUSCLE | P-value | ProbCons | P-value | MUMMALS | P-value |
|---|---|---|---|---|---|---|---|---|
| Twilight {205} | | | | | | | | |
| f_D | 23.99 | – | 24.07 | 0.008 | 29.26 | 0.01 | 31.57 | 0.04 |
| f_M | | – | 16.47 | – | 21.00 | – | 22.87 | 0.009 |
| Dali | 11.10 | 0.02 | 13.14 | 0.02 | 13.88 | 3 × 10^−5 | 15.32 | 0.03 |
| GDT_TS | 10.67 | 0.007 | 12.45 | 0.03 | 13.38 | 5 × 10^−4 | 14.52 | – |
| ContactA | 6.72 | – | 7.62 | 0.03 | 8.67 | 0.01 | 9.41 | – |
| ContactB | 8.98 | – | 10.06 | – | 12.10 | 0.01 | 12.59 | – |
| Superfamily {422} | | | | | | | | |
| f_D | 52.91 | 2 × 10^−5 | 53.12 | 0.008 | 57.06 | 8 × 10^−8 | 59.50 | 0.004 |
| f_M | 41.30 | 5 × 10^−4 | 39.87 | 0.04 | 43.57 | 0.03 | 45.15 | 0.01 |
| Dali | 33.09 | 0.04 | 35.34 | 0.002 | 35.84 | 9 × 10^−21 | 37.79 | 0.001 |
| GDT_TS | 31.07 | 0.01 | 32.98 | 5 × 10^−4 | 33.67 | 1 × 10^−17 | 35.05 | 0.01 |
| ContactA | 23.07 | – | 24.23 | 0.001 | 25.29 | 2 × 10^−9 | 26.41 | – |
| ContactB | 28.91 | – | 30.30 | 0.007 | 32.10 | 1 × 10^−4 | 33.11 | 0.04 |
The Twilight subset contains protein sequence alignments that represent a SCOP fold (⩽25% identity), while the Superfamily subset contains protein sequence alignments that represent a SCOP superfamily (⩽50% identity). Four cases are omitted in the Twilight subset and three cases are omitted in the Superfamily subset since no high quality reference alignments are available. For each algorithm, the higher accuracy value is in bold.
Average SPS, CS and SCI scores (in %) on Data-set 1 of BRAliBase II
| | TCoffee | P-value | MUSCLE | P-value | ProbConsRNA | P-value |
|---|---|---|---|---|---|---|
| SPS | | | | | | |
| 0–55% {96} | 57.87 | 2 × 10^−11 | 65.10 | 0.01 | 73.20 | 1 × 10^−5 |
| 55–75% {218} | 80.07 | 5 × 10^−22 | 83.62 | – | 86.08 | 4 × 10^−8 |
| 75–100% {167} | 95.01 | – | | – | 96.05 | – |
| All {481} | 80.83 | 9 × 10^−32 | 83.97 | 0.004 | 86.97 | 1 × 10^−12 |
| CS | | | | | | |
| 0–55% {96} | 36.42 | 2 × 10^−7 | 45.83 | 0.02 | 56.32 | 0.005 |
| 55–75% {218} | 65.29 | 7 × 10^−23 | 71.03 | 0.02 | 74.57 | 2 × 10^−6 |
| 75–100% {167} | 89.90 | – | 90.73 | – | 91.94 | 0.03 |
| All {481} | 68.07 | 5 × 10^−28 | 72.84 | 0.002 | 76.96 | 2 × 10^−8 |
| SCI | | | | | | |
| 0–55% {96} | 31.84 | 3 × 10^−14 | 50.80 | 2 × 10^−4 | 57.33 | 3 × 10^−5 |
| 55–75% {218} | 54.17 | 2 × 10^−27 | 66.26 | 3 × 10^−4 | 67.07 | 2 × 10^−17 |
| 75–100% {167} | 87.58 | 0.03 | 89.30 | 0.03 | 89.23 | 1 × 10^−4 |
| All {481} | 61.31 | 2 × 10^−39 | 71.17 | 2 × 10^−7 | 72.82 | 8 × 10^−23 |
Each subset includes all alignments of five RNA sequences with average pairwise identity within the specified range. For each algorithm, the higher accuracy value is in bold.
Average Q scores (in %) on the mdsa_all set of DNA PREFAB
| | TCoffee | P-value | MUSCLE | P-value | ProbConsRNA | P-value |
|---|---|---|---|---|---|---|
| 0–20% {123} | 2.75 | 0.002 | 3.85 | 4 × 10^−4 | 2.90 | 4 × 10^−4 |
| 20–40% {1030} | 12.80 | 1 × 10^−77 | 15.93 | 3 × 10^−65 | 16.13 | 2 × 10^−94 |
| 40–70% {436} | 51.78 | 6 × 10^−69 | 60.17 | 1 × 10^−64 | 59.38 | 2 × 10^−69 |
| 70–100% {87} | 96.74 | 2 × 10^−5 | 97.03 | – | 96.74 | 2 × 10^−6 |
| All {1676} | 26.56 | 7 × 10^−153 | 30.76 | 1 × 10^−131 | 30.60 | 8 × 10^−171 |
Each subset includes all pairs with DNA sequence identity within the specified range. For each algorithm, the higher accuracy value is in bold.
Average f_D and f_M scores and average normalized Dali Z-score, GDT_TS score, and ContactA and ContactB scores (in %) on the Twilight and Superfamily subsets of SABmark 1.65 when pairwise alignments are performed over all protein sequence pairs instead of obtaining a single multiple alignment
| | TCoffee | P-value | MUSCLE | P-value | ProbCons | P-value | MUMMALS | P-value |
|---|---|---|---|---|---|---|---|---|
| Twilight {205} | | | | | | | | |
| f_D | 24.88 | 0.005 | 25.30 | 4 × 10^−5 | 26.23 | 4 × 10^−4 | 29.13 | 0.02 |
| f_M | 16.78 | – | 17.05 | 3 × 10^−4 | 17.92 | 0.04 | 19.64 | – |
| Dali | 13.41 | 1 × 10^−4 | 13.83 | 3 × 10^−5 | 13.46 | 4 × 10^−10 | 15.06 | 0.03 |
| GDT_TS | 12.74 | 7 × 10^−8 | 13.16 | 2 × 10^−7 | 12.88 | 5 × 10^−11 | 14.24 | – |
| ContactA | 7.70 | 0.002 | 8.01 | 4 × 10^−4 | 8.09 | 5 × 10^−4 | 8.93 | – |
| ContactB | 10.17 | 0.01 | 10.75 | – | 10.94 | – | 11.90 | – |
| Superfamily {422} | | | | | | | | |
| f_D | 50.73 | 1 × 10^−13 | 50.79 | 3 × 10^−16 | 51.60 | 1 × 10^−28 | 54.79 | 3 × 10^−6 |
| f_M | 38.09 | 5 × 10^−9 | 38.16 | 3 × 10^−11 | 39.10 | 7 × 10^−19 | 41.06 | 5 × 10^−5 |
| Dali | 33.82 | 3 × 10^−11 | 33.80 | 2 × 10^−19 | 33.56 | 7 × 10^−45 | 35.64 | 1 × 10^−5 |
| GDT_TS | 31.81 | 2 × 10^−13 | 31.84 | 2 × 10^−22 | 31.72 | 3 × 10^−39 | 33.34 | 1 × 10^−5 |
| ContactA | 23.11 | 2 × 10^−6 | 23.21 | 9 × 10^−20 | 23.29 | 3 × 10^−25 | 24.64 | 0.01 |
| ContactB | 28.85 | 0.003 | 29.10 | 3 × 10^−9 | 29.28 | 4 × 10^−9 | 30.73 | – |
For each algorithm, the higher accuracy value is in bold.
Average SPS and CS scores (in %) on HOMSTRAD
| | TCoffee | P-value | MUSCLE | P-value | ProbCons | P-value | MUMMALS | P-value |
|---|---|---|---|---|---|---|---|---|
| SPS | | | | | | | | |
| 2 seqs {630} | 80.88 | 1 × 10^−6 | 80.40 | 1 × 10^−11 | 81.65 | 1 × 10^−20 | 83.50 | 6 × 10^−6 |
| 3 seqs {169} | 80.52 | 2 × 10^−9 | 81.26 | 1 × 10^−5 | 81.50 | 2 × 10^−8 | 83.33 | 0.002 |
| 4–5 seqs {122} | 79.78 | 1 × 10^−9 | 80.97 | – | 82.26 | 2 × 10^−9 | 83.53 | 0.04 |
| ⩾6 seqs {111} | 81.55 | 4 × 10^−4 | 82.34 | 0.04 | 83.64 | 1 × 10^−7 | 85.15 | – |
| CS | | | | | | | | |
| 2 seqs {630} | 80.88 | 1 × 10^−6 | 80.40 | 1 × 10^−11 | 81.65 | 1 × 10^−20 | 83.50 | 6 × 10^−6 |
| 3 seqs {169} | 74.51 | 1 × 10^−9 | 75.41 | 6 × 10^−6 | 75.54 | 1 × 10^−6 | 77.92 | 0.007 |
| 4–5 seqs {122} | 68.38 | 1 × 10^−10 | 70.17 | – | 71.69 | 2 × 10^−10 | 73.58 | 0.02 |
| ⩾6 seqs {111} | 59.59 | 8 × 10^−10 | 62.03 | 0.04 | 62.77 | 3 × 10^−8 | 65.47 | 0.04 |
Each subset includes all alignments with number of protein sequences within the specified range. For each algorithm, the higher accuracy value is in bold.
Average SPS and CS scores (in %) on HOMSTRAD and average SPS, CS and SCI scores (in %) on Data-set 1 of BRAliBase II by varying the parameter ω that specifies the maximum number of horizontal positions that are included to the left and to the right, and the parameter β that specifies the weight of the neighboring scores
| MUMMALS on HOMSTRAD | MUSCLE on BRAliBase | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| ω =3 | ω =5 | ω =7 | ω =9 | ω =3 | ω =9 | ω =12 | ω =15 | ||||
| SPS | SPS | ||||||||||
| β =0.2 | 83.72 | 83.72 | 83.64 | 83.49 | 83.18 | β =0.2 | 84.64 | 84.63 | 84.30 | 83.64 | |
| β =0.4 | 83.70 | 83.70 | 83.60 | 83.34 | 82.94 | β =0.4 | 84.63 | 84.65 | 84.58 | 84.19 | 83.15 |
| β =0.6 | 83.72 | 83.72 | 83.54 | 83.27 | 82.84 | β =0.6 | 84.54 | 84.64 | 84.42 | 84.12 | 83.04 |
| 83.72 | 83.72 | 83.52 | 83.24 | 82.78 | β =0.8 | 84.69 | 84.71 | 84.36 | 84.00 | 82.95 | |
| β =1.0 | 83.70 | 83.50 | 83.21 | 82.75 | 84.69 | 84.81 | 84.41 | 83.98 | 82.91 | ||
| CS | CS | ||||||||||
| β =0.2 | 79.60 | 79.61 | 79.56 | 79.39 | 79.03 | β =0.2 | 73.84 | 73.92 | 73.42 | 72.52 | |
| β =0.4 | 79.57 | 79.61 | 79.52 | 79.22 | 78.74 | β =0.4 | 73.92 | 73.93 | 73.85 | 73.31 | 71.81 |
| β =0.6 | 79.59 | 79.44 | 79.13 | 78.62 | β =0.6 | 73.67 | 73.92 | 73.54 | 73.16 | 71.48 | |
| 79.60 | 79.63 | 79.41 | 79.08 | 78.55 | β =0.8 | 74.03 | 74.03 | 73.47 | 73.03 | 71.27 | |
| β =1.0 | 79.60 | 79.61 | 79.39 | 79.04 | 78.51 | 74.03 | 74.00 | 73.57 | 72.90 | 71.22 | |
| SCI | |||||||||||
| β =0.2 | 73.20 | 74.19 | 74.19 | 73.26 | |||||||
| β =0.4 | 73.34 | 73.96 | 74.02 | 74.11 | 72.83 | ||||||
| β =0.6 | 73.19 | 73.94 | 73.92 | 74.12 | 72.69 | ||||||
| β =0.8 | 73.28 | 73.79 | 73.64 | 73.97 | 72.41 | ||||||
| 73.11 | 73.93 | 73.84 | 74.11 | 72.15 | |||||||
For each modified algorithm and each score measure, the highest accuracy value and the values of ω and β that correspond to our chosen parameter setting that is the same across different benchmarks are in bold.
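As a rough illustration of how ω and β could enter a scoring function, the sketch below adds to the score of a residue pair the β-weighted average of the scores of its diagonal neighbors within the window. The combination rule is an assumption made for illustration only: the text here does not give the formula used by NRAlign. The name `neighbor_adjusted_score`, the 0-based indexing, and the toy score matrix are likewise hypothetical.

```python
from typing import Sequence


def neighbor_adjusted_score(
    scores: Sequence[Sequence[float]], x: int, y: int, omega: int, beta: float
) -> float:
    """Combine the score of pair (x, y) with the scores of its neighbors.

    scores[i][j] is the pairwise score of position i in s and j in s'
    (0-based). Neighbors are the pairs (x + d, y + d) for nonzero d in
    [-omega, omega] that fall inside both sequences.

    Assumed rule (not from the source): score + beta * mean(neighbor scores).
    """
    n, m = len(scores), len(scores[0])
    neighbors = [
        scores[x + d][y + d]
        for d in range(-omega, omega + 1)
        if d != 0 and 0 <= x + d < n and 0 <= y + d < m
    ]
    if not neighbors:
        return scores[x][y]  # no horizontal information is available
    return scores[x][y] + beta * sum(neighbors) / len(neighbors)


if __name__ == "__main__":
    toy = [
        [1.0, 0.0, 0.0],
        [0.0, 2.0, 0.0],
        [0.0, 0.0, 3.0],
    ]
    # The center pair is supported by both of its diagonal neighbors.
    print(neighbor_adjusted_score(toy, 1, 1, omega=1, beta=1.0))
```

Under this assumed rule, β = 0 recovers the original score, and a larger ω averages over more neighboring positions, which is one way to read the trade-off explored in the table.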
Computation time on HOMSTRAD and on Data-set 1 of BRAliBase II represented as a pair of the form average,maximum in seconds
| HOMSTRAD | TCoffee (original) | TCoffee (NRAlign) | MUSCLE (original) | MUSCLE (NRAlign) | ProbCons (original) | ProbCons (NRAlign) | MUMMALS (original) | MUMMALS (NRAlign) |
|---|---|---|---|---|---|---|---|---|
| 2 seqs {630} | 0.19,1.25 | 0.27,1.33 | 0.03,0.20 | 0.07,0.57 | 0.39,4.67 | 0.42,4.99 | 0.38,4.57 | 0.40,3.92 |
| 3 seqs {169} | 0.38,2.12 | 0.64,4.37 | 0.07,0.52 | 0.21,2.00 | 0.62,5.84 | 0.67,6.38 | 0.67,9.62 | 0.70,13.14 |
| 4–5 seqs {122} | 0.88,3.51 | 1.79,7.71 | 0.14,1.08 | 0.48,2.74 | 1.28,11.35 | 1.40,12.84 | 1.88,11.60 | 1.93,11.89 |
| ⩾6 seqs {111} | 10.44,129.86 | 26.73,348.34 | 0.45,4.66 | 1.57,20.73 | 7.06,147.96 | 8.39,174.65 | 10.77,205.44 | 10.51,209.66 |
| BRAliBase | TCoffee (original) | TCoffee (NRAlign) | MUSCLE (original) | MUSCLE (NRAlign) | ProbConsRNA (original) | ProbConsRNA (NRAlign) | | |
| All {481} | 2.57,9.54 | 12.69,37.63 | 0.05,0.21 | 0.20,1.15 | 0.43,2.42 | 0.62,3.12 | | |
For each algorithm, the first pair shows the running time of the original algorithm and the second pair shows the running time of the modified algorithm NRAlign.
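The relative overhead of a modified algorithm can be read off the table by dividing average running times. The short sketch below does this for cells in the "average,maximum" format; the helper names `parse_pair` and `average_slowdown` are hypothetical, and the input strings are cells copied from the table.

```python
from typing import Tuple


def parse_pair(cell: str) -> Tuple[float, float]:
    """Split an 'average,maximum' cell into two floats (seconds)."""
    average, maximum = cell.split(",")
    return float(average), float(maximum)


def average_slowdown(original: str, modified: str) -> float:
    """Ratio of the modified average running time to the original one."""
    original_avg, _ = parse_pair(original)
    modified_avg, _ = parse_pair(modified)
    return modified_avg / original_avg


if __name__ == "__main__":
    # TCoffee on HOMSTRAD alignments with at least 6 sequences.
    print(round(average_slowdown("10.44,129.86", "26.73,348.34"), 2))
    # MUSCLE on Data-set 1 of BRAliBase II.
    print(round(average_slowdown("0.05,0.21", "0.20,1.15"), 2))
```

For these two cells the modified algorithm takes roughly 2.6 and 4 times the original average running time, respectively.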