Cuong Cao Dang, Vinh Sy Le, Olivier Gascuel, Bart Hazes, Quang Si Le.
Abstract
BACKGROUND: Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled.
Year: 2014 PMID: 25344302 PMCID: PMC4287512 DOI: 10.1186/1471-2105-15-341
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The Pearson correlations between the HSSP matrix and other matrices estimated by the FastMG procedure
| Matrices | Frequencies | Exchangeability matrix |
|---|---|---|
| | 0.992/0.986 | 0.989/0.991 |
| | 0.996/0.993 | 0.992/0.996 |
| | 0.998/0.995 | 0.994/0.997 |
| | 0.998/0.997 | 0.995/0.998 |
| | 0.999/0.999 | 0.997/0.999 |
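A correlation of this kind could be computed as in the minimal Python sketch below. It assumes each exchangeability matrix is a symmetric 20 x 20 array and that only the 190 entries above the diagonal are compared; the source does not state which entries were used. The function names and the toy matrices are hypothetical and are not the paper's data.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def upper_triangle(matrix):
    """Entries strictly above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def exchangeability_correlation(a, b):
    """Assumption: compare only the off-diagonal upper-triangle entries."""
    return pearson(upper_triangle(a), upper_triangle(b))

def toy_matrix(size, rng):
    """Hypothetical symmetric matrix with random positive entries."""
    m = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            m[i][j] = m[j][i] = rng.uniform(0.1, 5.0)
    return m

rng = random.Random(0)
full = toy_matrix(20, rng)
# Perturb every entry by up to 5% to mimic a slightly different estimate.
# Only the upper triangle is read afterwards, so symmetry is not restored.
approx = [[v * rng.uniform(0.95, 1.05) for v in row] for row in full]

print(round(exchangeability_correlation(full, approx), 3))
```

The 20 amino acid frequencies can be compared the same way by passing the two frequency vectors directly to `pearson`.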
Pairwise comparisons between HSSP and other matrices estimated from the FastMG procedure
| M1 | M2 | LogLK (M1-M2) | M1 > M2 (#TP) | M1 < M2 (#TP) | #M1 > M2, p < 0.05 (#TP) | #M1 < M2, p < 0.05 (#TP) |
|---|---|---|---|---|---|---|
| HSSPS | | 0.02 | 201 (130) | 99 (65) | 66 (27) | 12 (4) |
| HSSPS | | 0.01 | 208 (126) | 92 (64) | 69 (26) | 10 (4) |
| HSSPS | | 0.01 | 203 (114) | 97 (72) | 72 (26) | 9 (5) |
| HSSPS | | 0.01 | 206 (119) | 94 (59) | 69 (24) | 11 (4) |
| HSSPS | | 0.01 | 200 (101) | 100 (63) | 73 (16) | 11 (4) |
| HSSPS | | 0.01 | 191 (124) | 109 (75) | 43 (19) | 33 (15) |
| HSSPS | | 0.00 | 152 (95) | 148 (81) | 26 (5) | 34 (9) |
| HSSPS | | 0.00 | 142 (88) | 158 (89) | 19 (4) | 36 (11) |
| HSSPS | | 0.00 | 132 (78) | 168 (87) | 18 (3) | 32 (6) |
| HSSPS | | 0.00 | 131 (72) | 169 (80) | 15 (4) | 40 (4) |
LogLK: the log likelihood difference per site between trees inferred using M1 and M2; a positive (negative) value means M1 is better (worse) than M2. M1 > M2: the number of alignments where M1 results in better likelihood value than M2; #TP: the number of alignments where tree topologies inferred using M1 and M2 are different. #M1 > M2 (p <0.05): the number of alignments where the Kishino-Hasegawa test indicates that M1 is significantly better than M2. #M1 < M2 (p <0.05): the same as #M1 > M2, but now M2 is significantly better than M1.
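The quantities defined in this note could be tallied as in the Python sketch below. It makes several assumptions that the source does not confirm: the per-site difference is computed per alignment and then averaged, ties are counted for neither matrix, and the `AlignmentResult` record and its field names are hypothetical. The Kishino-Hasegawa test is not implemented here, so the significance columns are left out.

```python
from dataclasses import dataclass

@dataclass
class AlignmentResult:
    """Hypothetical per-alignment record for one M1 vs M2 comparison."""
    n_sites: int         # alignment length
    loglk_m1: float      # log likelihood of the tree inferred with M1
    loglk_m2: float      # log likelihood of the tree inferred with M2
    same_topology: bool  # True if both matrices gave the same topology

def summarize(results):
    """Tally LogLK per site, M1 > M2, M1 < M2 and the matching #TP counts."""
    # Assumption: average the per-site difference over alignments.
    diffs = [(r.loglk_m1 - r.loglk_m2) / r.n_sites for r in results]
    m1_better = [r for r in results if r.loglk_m1 > r.loglk_m2]
    m2_better = [r for r in results if r.loglk_m1 < r.loglk_m2]
    return {
        "loglk_per_site": sum(diffs) / len(diffs),
        "m1_gt_m2": len(m1_better),
        "m1_gt_m2_tp": sum(not r.same_topology for r in m1_better),
        "m1_lt_m2": len(m2_better),
        "m1_lt_m2_tp": sum(not r.same_topology for r in m2_better),
    }

# Toy values, invented for illustration only.
toy = [
    AlignmentResult(100, -1000.0, -1010.0, same_topology=False),
    AlignmentResult(200, -2000.0, -1990.0, same_topology=True),
    AlignmentResult(100, -500.0, -505.0, same_topology=True),
]
summary = summarize(toy)
print(summary)
```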
The running time of the standard estimation procedure and the FastMG procedure with different splitting algorithms and k values for the HSSP data set
| Matrices | Building trees (hours) | Estimating parameters (hours) | Total time (hours) |
|---|---|---|---|
| | 10.7/7.4 | 6.3/3.6 | 17.0/11.0 |
| | 22.9/19.0 | 6.0/3.9 | 28.9/22.9 |
| | 32.6/29.9 | 5.7/3.9 | 38.3/33.8 |
| | 42.3/40.0 | 5.5/3.9 | 47.8/43.9 |
| | 73.7/71.7 | 5.2/4.1 | 78.9/75.8 |
| HSSPS | 319.5 | 4.2 | 323.7 |
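Each total in this table is the sum of the tree-building and parameter-estimation times. The Python sketch below checks that arithmetic and derives a speed-up relative to the standard procedure (the HSSPS row). The speed-up ratios are computed here from the rounded table values and are not figures reported in the source; the function names are hypothetical.

```python
def total_hours(build, estimate):
    """Total running time as the sum of the two phases, to one decimal."""
    return round(build + estimate, 1)

def speed_up(baseline_total, total):
    """How many times faster a run is than the baseline run."""
    return baseline_total / total

# Standard procedure (HSSPS row of the table).
standard = total_hours(319.5, 4.2)

# First FastMG row of the table; each cell holds two values,
# one per splitting algorithm.
fast_first = total_hours(10.7, 6.3)
fast_second = total_hours(7.4, 3.6)

print(standard, fast_first, fast_second)
print(round(speed_up(standard, fast_first), 1),
      round(speed_up(standard, fast_second), 1))
```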
The Pearson correlations between the Pfam matrix and other matrices estimated from the FastMG procedure
| Matrices | Frequencies | Exchangeability matrix |
|---|---|---|
| | 0.994/0.990 | 0.993/0.993 |
| | 0.997/0.996 | 0.995/0.997 |
| | 0.998/0.999 | 0.995/0.999 |
| | 0.999/0.999 | 0.998/0.999 |
| | 1.000/1.000 | 0.999/1.000 |
Pairwise comparisons between the Pfam matrix and other matrices estimated by the FastMG procedure
| M1 | M2 | LogLK (M1-M2) | M1 > M2 (#TP) | M1 < M2 (#TP) | #M1 > M2, p < 0.05 (#TP) | #M1 < M2, p < 0.05 (#TP) |
|---|---|---|---|---|---|---|
| PfamS | | 0.01 | 299 (67) | 181 (41) | 119 (8) | 55 (8) |
| PfamS | | 0.01 | 294 (54) | 186 (35) | 132 (6) | 55 (3) |
| PfamS | | 0.01 | 309 (57) | 171 (38) | 142 (10) | 51 (3) |
| PfamS | | 0.00 | 275 (40) | 205 (35) | 116 (6) | 65 (2) |
| PfamS | | 0.00 | 279 (38) | 201 (28) | 117 (4) | 64 (3) |
| PfamS | | 0.00 | 218 (54) | 262 (64) | 51 (3) | 80 (8) |
| PfamS | | 0.00 | 190 (33) | 290 (51) | 41 (2) | 104 (6) |
| PfamS | | 0.00 | 212 (33) | 268 (39) | 50 (2) | 82 (1) |
| PfamS | | 0.00 | 233 (36) | 247 (36) | 58 (1) | 72 (0) |
| PfamS | | 0.00 | 166 (21) | 314 (31) | 27 (0) | 91 (2) |
LogLK: the log likelihood difference per site between trees inferred using M1 and M2; a positive (negative) value means M1 is better (worse) than M2. M1 > M2: the number of alignments where M1 results in a better likelihood value than M2; #TP: the number of alignments where tree topologies inferred using M1 and M2 are different. #M1 > M2 (p < 0.05): the number of alignments where the Kishino-Hasegawa test indicates that M1 is significantly better than M2. #M1 < M2 (p < 0.05): the same as #M1 > M2, but now M2 is significantly better than M1.
The running time of the standard estimation procedure and the FastMG procedure with different splitting algorithms and k values for the Pfam data set
| Matrices | Building trees (hours) | Estimating parameters (hours) | Total time (hours) |
|---|---|---|---|
| | 2.8/2.4 | 5.5/3.1 | 8.3/5.5 |
| | 6.6/6.2 | 4.7/3.0 | 11.3/9.3 |
| | 9.9/9.4 | 4.2/2.9 | 14.1/12.3 |
| | 12.4/12.6 | 4.0/2.9 | 16.4/15.5 |
| | 20.8/22.4 | 3.4/2.8 | 24.2/25.2 |
| PfamS | 35.8 | 2.9 | 38.7 |
The log likelihood per site comparisons between the Mam matrix and matrices estimated by the FastMG procedure
| | | | | | |
|---|---|---|---|---|---|
| LogLK per site | 0.72/-0.04 | 0.58/-0.06 | 0.49/-0.05 | 0.42/-0.05 | 0.26/-0.02 |
| # Significantly better | 10/0 | 10/0 | 10/0 | 10/0 | 10/0 |
| # Significantly worse | 0/1 | 0/4 | 0/4 | 0/5 | 0/3 |
LogLK per site: average log-likelihood per site difference between MamS and the other matrices, positive/negative values indicate that the MamS matrix was better/worse than the other matrices. #Significantly better: number of tests where MamS is significantly better than the MamR/MamT matrix (based on Kishino-Hasegawa test). #Significantly worse: number of tests where MamS is significantly worse than the MamR/MamT matrix (based on Kishino-Hasegawa test).
The log likelihood per site comparisons between the original MtMam matrix and matrices estimated by the FastMG procedure
| | | | | | |
|---|---|---|---|---|---|
| LogLK per site | 0.37/-0.39 | 0.23/-0.40 | 0.14/-0.40 | 0.07/-0.40 | -0.09/-0.37 |
| # Significantly better | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 |
| # Significantly worse | 0/10 | 0/10 | 1/10 | 4/10 | 7/10 |
LogLK per site: average log-likelihood per site difference between the original MtMam and the other matrix, positive/negative values indicate that the original MtMam matrix was better/worse than the other matrix. #Significantly better: number of tests where the original MtMam is significantly better than the MamR/MamT matrix (based on Kishino-Hasegawa test). #Significantly worse: number of tests where the original MtMam is significantly worse than the MamR/MamT matrix (based on Kishino-Hasegawa test).
The running time (hours) of different estimation procedures
| | MamS | | | | | |
|---|---|---|---|---|---|---|
| Avg. time | 22.2 | 0.5/0.4 | 1.5/0.9 | 2.2/1.4 | 2.7/1.9 | 4.3/3.6 |
| Speed up | | 42/61.4 | 14.8/24.5 | 10.2/16 | 8.2/11.6 | 5.1/6.1 |
Figure 1. Tree-based splitting example. The tree-based splitting algorithm would divide this hypothetical tree for a 9-sequence alignment into two sub-alignments (s1, s2, s3, s8) and (s4, s5, s6, s7, s9), corresponding to the left and right sub-trees, respectively.
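The split described in this caption can be sketched in Python under assumptions that this excerpt does not state: the tree is rooted and binary, it is represented as nested tuples with sequence names as leaves, and splitting recurses until every group holds at most k sequences (one possible reading of the k parameter named in the running-time tables). The topology below is invented solely so that the left and right sub-trees carry the two leaf sets from the caption; it is not the tree drawn in the figure.

```python
def leaves(tree):
    """Leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def split_by_tree(tree, k):
    """Recursively split into left/right sub-trees until each has <= k leaves."""
    names = leaves(tree)
    if isinstance(tree, str) or len(names) <= k:
        return [names]
    left, right = tree
    return split_by_tree(left, k) + split_by_tree(right, k)

def sub_alignment(alignment, names):
    """Restrict an alignment (dict: name -> aligned sequence) to some names."""
    return {name: alignment[name] for name in names}

# Hypothetical topology chosen to match the caption's leaf sets.
tree = ((("s1", "s2"), ("s3", "s8")),
        (("s4", "s5"), (("s6", "s7"), "s9")))

groups = split_by_tree(tree, k=5)
print(groups)

# Toy alignment with placeholder sequences, for illustration only.
alignment = {name: "ACDEFGHIK" for name in leaves(tree)}
parts = [sub_alignment(alignment, g) for g in groups]
print([sorted(p) for p in parts])
```

With a smaller k the same function keeps descending, so each sub-tree is divided again until every group is small enough.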