Frank Keul, Martin Hess, Michael Goesele, Kay Hamacher.
Abstract
BACKGROUND: Detecting homologous protein sequences and computing multiple sequence alignments (MSA) are fundamental tasks in molecular bioinformatics. These tasks usually require a substitution matrix that models evolutionary substitution events and is derived from a set of aligned sequences. In recent years, the known sequence space has grown drastically, and several publications have demonstrated that this growth can yield significantly better-performing matrices. Nevertheless, matrices based on dated sequence datasets remain the de facto standard for both tasks, even though their data basis may limit their capabilities. We address this by presenting a new substitution matrix series called PFASUM. These matrices are derived from Pfam seed MSAs using a novel algorithm and thus build on expert-curated ground-truth data covering a large and diverse sequence space.
Keywords: Homologous sequence search; PFASUM; Sequence alignment; Substitution matrix
Year: 2017 PMID: 28583067 PMCID: PMC5460430 DOI: 10.1186/s12859-017-1703-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Ambiguous amino acids and their designated canonic amino acids, shown as one-letter codes
| Ambiguous amino acid | B | Z | J | X |
|---|---|---|---|---|
| Canonic amino acid | N, D | E, Q | I, L | all |
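As an illustration, the mapping from ambiguous to canonic residues can be held in a small lookup table. The sketch below assumes the standard IUPAC assignments (B = N or D, Z = E or Q, J = I or L, X = any residue). The names `CANONICAL`, `AMBIGUOUS`, and `expand` are hypothetical and are not taken from the PFASUM implementation.

```python
# The 20 canonic amino acids as one-letter codes.
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"

# Assumed IUPAC ambiguity codes and the canonic residues they stand for.
AMBIGUOUS = {
    "B": ("N", "D"),
    "Z": ("E", "Q"),
    "J": ("I", "L"),
    "X": tuple(CANONICAL),
}


def expand(residue):
    """Return the canonic residues that a one-letter code may represent."""
    residue = residue.upper()
    if residue in AMBIGUOUS:
        return AMBIGUOUS[residue]
    if residue in CANONICAL:
        return (residue,)
    raise ValueError(f"unknown residue code: {residue!r}")
```

A canonic residue expands to itself, so callers can treat every alignment symbol uniformly.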
The matrix test sets assessed in the homology search performance evaluation
| Test set | Algorithm | Matrix numbers |
|---|---|---|
| | PFASUM | [11,100] |
| | BLOSUM | 50,62,80 |
| | MD | 10,20,40 |
| | Optima | 5 |
| | PAM | 120,250 |
| | VTML | 10,20,40,80,120,160,200 |
The matrix test sets assessed in the MSA construction evaluation
| Test set | Algorithm | Matrix numbers |
|---|---|---|
| | PFASUM | 31,43,60 |
| | BLOSUM | 50,62 |
| | PAM | 250 |
| | VTML | 160,200 |
Fig. 1 Performance comparison of Standard Search Matrices with PFASUM Search Matrices of similar entropies on all three ASTRAL datasets. The highest coverage achieved at 0.01 errors per query for any combination of gap opening and extension penalties is shown. The best-performing gap parameters for each matrix and database combination can be found in Additional file 6: Table S2. Note that the displayed coverage range is reduced to emphasize the differences between the matrices
Fig. 2 Comparison of the performance of all Standard Search Matrices with the novel PFASUM Search Matrices on three different ASTRAL datasets. The highest coverage achieved at 0.01 errors per query for any combination of gap opening and extension penalties is shown (parameters are listed in Additional file 8: Table S4). With the exception of two performance differences, all shown coverage values are significantly different according to our Z-score analysis shown in Additional file 9: Table S5
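The captions above report coverage at 0.01 errors per query without spelling out the computation. A minimal sketch follows, assuming the common coverage-versus-errors-per-query definition: all hits from all queries are pooled and ranked by score, and coverage is the fraction of true homologous relations found before the number of false positives exceeds the allowed errors per query times the number of queries. The function name `coverage_at_epq` and its input format are hypothetical, and tied scores are not given any special treatment.

```python
def coverage_at_epq(hits, n_queries, n_true_pairs, max_epq=0.01):
    """Coverage reached before exceeding `max_epq` errors per query.

    hits: iterable of (score, is_true_homolog) pooled over all queries,
          where a higher score means a more confident hit.
    n_true_pairs: total number of true relations in the database.
    """
    allowed_errors = max_epq * n_queries
    true_found = 0
    false_found = 0
    # Walk down the pooled hit list from best to worst score.
    for _score, is_true in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_true:
            true_found += 1
        else:
            false_found += 1
            if false_found > allowed_errors:
                break  # error budget exhausted
    return true_found / n_true_pairs
```

For example, with 100 queries the budget at 0.01 errors per query is a single false positive, so the walk stops at the second one.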
Best performing substitution matrices of the PFASUM Search Matrices and Standard Search Matrices sets for the three test scenarios
| Database | Matrix | Gap parameters | Coverage |
|---|---|---|---|
| ASTRAL20 | VTML200 | -14/-1 | 0.1598 |
| | PFASUM60 | -16/-1 | 0.1706 |
| ASTRAL40 | VTML200 | -14/-1 | 0.4392 |
| | PFASUM43 | -13/-1 | 0.4448 |
| ASTRAL70 | VTML200 | -9/-2 | 0.5459 |
| | PFASUM31 | -13/-2 | 0.5508 |
Fig. 3 General comparison of MSA matrix performance based on the average q-score per benchmark database. PFASUM MSA Matrices outperform the tested Standard MSA Matrices on all three benchmarks. PFASUM31 achieved the highest average q-score for BAliBASE 3.0 and SABmark 1.65, while VTML200 leads all matrices on the OXBench dataset. The red dotted line indicates the maximum for each benchmark separately
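The q-score itself is not defined in this excerpt. The sketch below assumes the usual definition: the number of residue pairs aligned identically in the test and reference alignments, divided by the number of aligned residue pairs in the reference. The helper names `aligned_pairs` and `q_score` are hypothetical, alignments are given as equal-length gapped strings with `-` as the gap symbol, and any restriction to core columns that a benchmark may apply is ignored.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index).
    """
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("alignment rows must be non-empty and equally long")
    next_residue = [0] * len(alignment)  # per-sequence residue counter
    pairs = set()
    for col in range(lengths.pop()):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_idx, next_residue[seq_idx]))
                next_residue[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def q_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test MSA."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)
```

Averaging `q_score` over all reference alignments of a benchmark would give one value per matrix and benchmark, which is the kind of quantity the figure compares.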
Fraction of times (in percent) that a specific matrix in the PFASUM MSA Matrices set produced an MSA of at least as good (≥) quality as a specific matrix out of the Standard MSA Matrices set
| Benchmark | Standard matrix | PFASUM31 | PFASUM43 | PFASUM60 |
|---|---|---|---|---|
| BAliBASE 3.0 | BLOSUM50 | 67.36 (59.07) | 62.69 (54.92) | 62.44 (51.81) |
| | BLOSUM62 | 71.50 (65.03) | 69.43 (62.44) | 67.88 (58.03) |
| | PAM250 | 75.39 (70.21) | 71.76 (63.73) | 70.73 (66.84) |
| | VTML160 | 63.47 (54.92) | 61.14 (50.52) | 61.66 (48.45) |
| | VTML200 | 68.13 (57.77) | 63.21 (51.81) | 61.92 (50.00) |
| OXBench | BLOSUM50 | 81.52 (25.06) | 83.29 (26.84) | 84.30 (23.80) |
| | BLOSUM62 | 79.24 (23.04) | 81.01 (22.03) | 80.76 (21.52) |
| | PAM250 | 80.00 (32.41) | 82.03 (31.65) | 79.49 (33.16) |
| | VTML160 | 79.24 (24.81) | 83.80 (28.61) | 82.53 (25.32) |
| | VTML200 | 76.96 (20.76) | 82.03 (22.53) | 80.00 (20.51) |
| SABmark 1.65 | BLOSUM50 | 67.85 (47.52) | 63.36 (44.21) | 65.96 (43.74) |
| | BLOSUM62 | 63.59 (48.23) | 64.30 (47.28) | 64.30 (45.86) |
| | PAM250 | 72.58 (59.57) | 70.21 (56.97) | 72.81 (60.05) |
| | VTML160 | 64.78 (45.39) | 60.76 (43.50) | 65.25 (42.08) |
| | VTML200 | 66.67 (47.99) | 63.83 (44.21) | 68.56 (44.21) |
Values for the strictly-better-than relation (>) are shown in parentheses. Values are given for all comparisons of PFASUM MSA Matrices with Standard MSA Matrices on the three benchmark datasets
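The two percentages in each table cell can be reproduced from paired per-alignment quality scores. The sketch below assumes one quality score per benchmark alignment for each of the two matrices being compared. The function name `win_fractions` and the list-based input are hypothetical.

```python
def win_fractions(scores_a, scores_b):
    """Percentage of cases where A is at least as good as, and better than, B.

    scores_a, scores_b: quality scores of the MSAs built with matrix A and
    matrix B, paired by benchmark alignment (same order, same length).
    Returns (percent_at_least_as_good, percent_strictly_better).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equally long")
    n = len(scores_a)
    at_least = sum(a >= b for a, b in zip(scores_a, scores_b))
    better = sum(a > b for a, b in zip(scores_a, scores_b))
    return 100.0 * at_least / n, 100.0 * better / n
```

A large gap between the two values, as in the OXBench rows, indicates that many benchmark alignments ended in ties.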