Thomas M Keane, Christopher J Creevey, Melissa M Pentony, Thomas J Naughton, James O McInerney.
Abstract
BACKGROUND: In recent years, model-based approaches such as maximum likelihood have become the methods of choice for constructing phylogenies. A number of authors have shown the importance of using adequate substitution models in order to produce accurate phylogenies. In the past, many empirical models of amino acid substitution have been derived using a variety of different methods and protein datasets. These matrices are normally used as surrogates, rather than deriving the maximum likelihood model from the dataset being examined. With few exceptions, selection between alternative matrices has been carried out in an ad hoc manner.
Year: 2006 PMID: 16563161 PMCID: PMC1435933 DOI: 10.1186/1471-2148-6-29
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Figure 1. Alternative Trees. Two different trees (with bootstrap support values based on 100 replicates) constructed from a single gene family [34] with different protein models using Phyml v2.4.4 [53]. Tree (a) was produced using the MtREV matrix [15] and Tree (b) was produced using the WAG matrix [18].
Figure 2. Base Tree. The true tree used to generate all of the simulated alignments.
Base Tree Simulations. Results of simulated datasets when a random tree, the NJ-JTT tree, and the true tree are each used as the base tree for the model selection procedure and the sequence length is 500 characters. Each entry is the number of times out of 100 replicates the correct model was selected by each measure.
| Base tree | Random | Random | Random | NJ-JTT | NJ-JTT | NJ-JTT | True | True | True |
| Model | AIC1 | AIC2 | BIC | AIC1 | AIC2 | BIC | AIC1 | AIC2 | BIC |
| Blosum | 0 | 0 | 0 | 91 | 98 | 99 | 84 | 96 | 96 |
| Blosum+I | 0 | 0 | 0 | 94 | 99 | 100 | 100 | 100 | 100 |
| Blosum+G | 58 | 65 | 67 | 75 | 83 | 87 | 75 | 84 | 87 |
| Blosum+I+G | 89 | 86 | 85 | 90 | 88 | 87 | 89 | 85 | 85 |
| CPREV | 0 | 0 | 0 | 92 | 99 | 100 | 93 | 98 | 99 |
| CPREV+I | 0 | 0 | 0 | 98 | 99 | 99 | 97 | 99 | 100 |
| CPREV+G | 80 | 83 | 85 | 89 | 89 | 90 | 89 | 90 | 90 |
| CPREV+I+G | 94 | 91 | 91 | 80 | 75 | 73 | 80 | 75 | 73 |
| Dayhoff | 0 | 0 | 0 | 95 | 100 | 100 | 94 | 98 | 99 |
| Dayhoff+I | 0 | 0 | 0 | 98 | 100 | 100 | 96 | 99 | 100 |
| Dayhoff+G | 68 | 72 | 74 | 77 | 86 | 90 | 79 | 88 | 91 |
| Dayhoff+I+G | 94 | 93 | 93 | 82 | 74 | 74 | 84 | 74 | 72 |
| JTT | 0 | 0 | 0 | 94 | 99 | 100 | 97 | 99 | 99 |
| JTT+I | 0 | 0 | 0 | 96 | 100 | 100 | 97 | 100 | 100 |
| JTT+G | 54 | 59 | 62 | 78 | 85 | 86 | 81 | 87 | 89 |
| JTT+I+G | 94 | 94 | 93 | 89 | 85 | 82 | 92 | 87 | 84 |
| MtREV | 0 | 0 | 0 | 85 | 96 | 97 | 94 | 99 | 99 |
| MtREV+I | 0 | 0 | 0 | 92 | 99 | 100 | 97 | 100 | 100 |
| MtREV+G | 80 | 87 | 87 | 92 | 94 | 95 | 93 | 94 | 94 |
| MtREV+I+G | 86 | 85 | 84 | 68 | 65 | 61 | 70 | 63 | 63 |
| WAG | 0 | 0 | 0 | 88 | 97 | 99 | 95 | 99 | 100 |
| WAG+I | 0 | 0 | 0 | 98 | 100 | 100 | 96 | 100 | 100 |
| WAG+G | 74 | 79 | 79 | 83 | 89 | 89 | 83 | 88 | 89 |
| WAG+I+G | 90 | 89 | 87 | 79 | 73 | 69 | 78 | 73 | 71 |
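The three measures in the table can be illustrated with a minimal sketch. The table does not define AIC1 and AIC2, so this assumes AIC1 is the standard Akaike Information Criterion and AIC2 is its second-order, small-sample correction (often written AICc). It also assumes the sample size n is the alignment length. The log-likelihoods, parameter counts, and the `best_model` helper are hypothetical and chosen only for illustration.

```python
import math

def aic(log_lik, k):
    """Standard AIC (assumed here to correspond to AIC1)."""
    return -2.0 * log_lik + 2.0 * k

def aicc(log_lik, k, n):
    """Second-order AIC (assumed here to correspond to AIC2); needs n > k + 1."""
    return aic(log_lik, k) + (2.0 * k * (k + 1)) / (n - k - 1)

def bic(log_lik, k, n):
    """Bayesian Information Criterion."""
    return -2.0 * log_lik + k * math.log(n)

def best_model(candidates, n, criterion):
    """Return the candidate with the lowest score.

    candidates maps a model name to (log-likelihood, number of free parameters).
    """
    def score(item):
        log_lik, k = item[1]
        if criterion == "aic":
            return aic(log_lik, k)
        if criterion == "aicc":
            return aicc(log_lik, k, n)
        return bic(log_lik, k, n)
    return min(candidates.items(), key=score)[0]

# Hypothetical values: 37 branch lengths, plus one parameter each for +G and +I.
candidates = {
    "WAG": (-5230.4, 37),
    "WAG+G": (-5190.1, 38),
    "WAG+I+G": (-5188.5, 39),
}
for name in ("aic", "aicc", "bic"):
    print(name, best_model(candidates, n=500, criterion=name))
```

In this made-up case the small likelihood gain from adding +I is enough for AIC but not for BIC, whose penalty per parameter grows with log(n).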
Full ML Comparison. A comparison of the models selected from the likelihood values obtained from a full ML tree search using all models and the likelihood values using the default NJ-JTT base tree. The column 'Identical' indicates the number of times (out of 100 alignments) both procedures selected the same model. The column titled 'Rate' indicates cases when the same amino acid matrix but a different among-site rate variation (ASRV) setting was selected. The column titled 'Matrix' indicates cases when a different amino acid matrix was selected.
| Criterion | AIC1 | AIC1 | AIC1 | AIC2 | AIC2 | AIC2 | BIC | BIC | BIC |
| Dataset | Identical | Rate | Matrix | Identical | Rate | Matrix | Identical | Rate | Matrix |
| Proteobacteria | 95 | 4 | 1 | 93 | 6 | 1 | 94 | 2 | 4 |
| Archaea | 99 | 1 | 0 | 96 | 2 | 2 | 95 | 2 | 3 |
| Vertebrate | 91 | 7 | 2 | 94 | 5 | 1 | 97 | 1 | 2 |
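The three outcome categories can be expressed as a small classifier over model names. This is a minimal sketch under stated assumptions: model names follow the `Matrix+I+G+F` pattern used in the tables, and any difference in the suffix with the same matrix is counted as 'Rate'. The function names and the example pairs are hypothetical.

```python
from collections import Counter

def split_model(name):
    """Split 'WAG+I+G' into the matrix name and the set of suffix flags."""
    matrix, *flags = name.split("+")
    return matrix, frozenset(flags)

def classify(full_ml_choice, base_tree_choice):
    """Label a pair of selected models as 'Identical', 'Rate', or 'Matrix'."""
    matrix_a, flags_a = split_model(full_ml_choice)
    matrix_b, flags_b = split_model(base_tree_choice)
    if matrix_a != matrix_b:
        return "Matrix"
    if flags_a != flags_b:
        # Assumption: same matrix with different suffix flags counts as 'Rate'.
        return "Rate"
    return "Identical"

# Hypothetical pairs: (model from full ML search, model from NJ-JTT base tree).
pairs = [
    ("WAG+I+G", "WAG+I+G"),
    ("WAG+I+G", "WAG+G"),
    ("JTT+G", "WAG+G"),
]
tally = Counter(classify(a, b) for a, b in pairs)
print(tally)
```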
Alignment Length Simulations. Results of the simulated datasets for alignments of 100, 500, and 1000 characters in length. Each entry is the number of times out of 100 replicates the correct model was selected by each measure (using the default NJ-JTT base tree).
| Alignment length | 100 | 100 | 100 | 500 | 500 | 500 | 1000 | 1000 | 1000 |
| Model | AIC1 | AIC2 | BIC | AIC1 | AIC2 | BIC | AIC1 | AIC2 | BIC |
| Blosum | 86 | 96 | 95 | 91 | 98 | 99 | 94 | 99 | 100 |
| Blosum+I | 95 | 100 | 95 | 94 | 99 | 100 | 98 | 100 | 100 |
| Blosum+G | 89 | 95 | 95 | 75 | 83 | 87 | 79 | 85 | 88 |
| Blosum+I+G | 44 | 30 | 30 | 90 | 88 | 87 | 95 | 95 | 94 |
| CPREV | 92 | 99 | 99 | 92 | 99 | 100 | 95 | 100 | 100 |
| CPREV+I | 94 | 100 | 100 | 98 | 99 | 99 | 99 | 99 | 100 |
| CPREV+G | 87 | 99 | 98 | 89 | 89 | 90 | 91 | 96 | 97 |
| CPREV+I+G | 51 | 37 | 37 | 80 | 75 | 73 | 95 | 94 | 94 |
| Dayhoff | 92 | 99 | 99 | 95 | 100 | 100 | 93 | 99 | 100 |
| Dayhoff+I | 94 | 100 | 100 | 98 | 100 | 100 | 96 | 100 | 100 |
| Dayhoff+G | 83 | 93 | 93 | 77 | 86 | 90 | 94 | 94 | 95 |
| Dayhoff+I+G | 54 | 35 | 38 | 82 | 74 | 74 | 95 | 92 | 91 |
| JTT | 95 | 98 | 98 | 94 | 99 | 100 | 93 | 98 | 100 |
| JTT+I | 95 | 99 | 98 | 96 | 100 | 100 | 96 | 100 | 100 |
| JTT+G | 87 | 94 | 94 | 78 | 85 | 86 | 91 | 91 | 93 |
| JTT+I+G | 48 | 36 | 40 | 89 | 85 | 82 | 96 | 95 | 94 |
| MtREV | 95 | 98 | 98 | 85 | 96 | 97 | 91 | 97 | 97 |
| MtREV+I | 97 | 100 | 100 | 92 | 99 | 100 | 97 | 100 | 100 |
| MtREV+G | 86 | 97 | 97 | 92 | 94 | 95 | 92 | 95 | 96 |
| MtREV+I+G | 29 | 17 | 17 | 68 | 65 | 61 | 87 | 85 | 83 |
| WAG | 91 | 97 | 96 | 88 | 97 | 99 | 97 | 98 | 100 |
| WAG+I | 94 | 100 | 99 | 98 | 100 | 100 | 97 | 99 | 100 |
| WAG+G | 85 | 95 | 93 | 83 | 89 | 89 | 86 | 95 | 95 |
| WAG+I+G | 50 | 34 | 36 | 79 | 73 | 69 | 97 | 96 | 94 |
Gamma Distribution Simulations. Results of simulations when the α parameter of the gamma distribution was set to 0.5, 1.0, or 2.0. The sequence length was kept constant at 500 characters and the proportion of invariable sites was 0.2. Each entry is the number of times out of 100 replicates that the correct model was selected.
| α | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
| Model | AIC1 | AIC2 | BIC | AIC1 | AIC2 | BIC | AIC1 | AIC2 | BIC |
| BLOSUM62+G | 75 | 83 | 87 | 32 | 62 | 69 | 36 | 68 | 74 |
| BLOSUM62+I+G | 90 | 88 | 87 | 95 | 93 | 92 | 100 | 100 | 100 |
| CPREV+G | 89 | 89 | 90 | 39 | 72 | 77 | 39 | 65 | 79 |
| CPREV+I+G | 80 | 75 | 73 | 93 | 89 | 89 | 100 | 100 | 100 |
| Dayhoff+G | 77 | 86 | 90 | 33 | 36 | 74 | 38 | 60 | 66 |
| Dayhoff+I+G | 82 | 74 | 74 | 98 | 95 | 92 | 100 | 100 | 100 |
| JTT+G | 78 | 85 | 86 | 43 | 71 | 76 | 25 | 54 | 63 |
| JTT+I+G | 89 | 85 | 82 | 98 | 96 | 94 | 100 | 100 | 100 |
| MtREV+G | 92 | 94 | 95 | 46 | 72 | 76 | 51 | 75 | 84 |
| MtREV+I+G | 68 | 65 | 61 | 90 | 85 | 83 | 100 | 100 | 100 |
| WAG+G | 83 | 89 | 89 | 35 | 70 | 76 | 32 | 70 | 79 |
| WAG+I+G | 79 | 73 | 69 | 97 | 91 | 90 | 100 | 100 | 100 |
Amino Acid Frequency Simulations. Results of the simulated datasets where the original amino acid frequencies are randomly perturbed by up to 10% from the original values and the alignment length is 500 characters. Each entry indicates the number of times out of 100 replicates the correct model was selected by each measure.
| Model | AIC1 | AIC2 | BIC | Model | AIC1 | AIC2 | BIC |
| Blosum+F | 94 | 100 | 100 | JTT+F | 93 | 100 | 100 |
| Blosum+I+F | 71 | 91 | 95 | JTT+I+F | 67 | 89 | 94 |
| Blosum+G+F | 86 | 93 | 96 | JTT+G+F | 75 | 89 | 92 |
| Blosum+I+G+F | 99 | 97 | 96 | JTT+I+G+F | 98 | 96 | 96 |
| CPREV+F | 92 | 100 | 100 | MtREV+F | 93 | 99 | 99 |
| CPREV+I+F | 87 | 98 | 99 | MtREV+I+F | 86 | 96 | 99 |
| CPREV+G+F | 93 | 96 | 97 | MtREV+G+F | 86 | 93 | 95 |
| CPREV+I+G+F | 89 | 87 | 84 | MtREV+I+G+F | 85 | 82 | 80 |
| Dayhoff+F | 93 | 99 | 99 | WAG+F | 95 | 100 | 100 |
| Dayhoff+I+F | 91 | 98 | 99 | WAG+I+F | 82 | 96 | 97 |
| Dayhoff+G+F | 86 | 93 | 96 | WAG+G+F | 88 | 95 | 96 |
| Dayhoff+I+G+F | 99 | 97 | 96 | WAG+I+G+F | 90 | 89 | 89 |
Real Dataset Analysis. Results of the model selection on the specialised datasets (see the references for full descriptions of the individual datasets). Amino acid matrix expectations are based on previously published information about the sequences ([19, 54, 55] and LANL [56]).
| Dataset | Source | Expected | AIC1 | AIC2 | BIC |
| mtCDNApri | Yang [54] | MtMam | MtMam+I+G | MtMam+G | MtMam+G |
| mtCDNAape | Yang [54] | MtMam | MtMam+F | MtMam+F | MtMam+F |
| 70pep_nogap | Reyes | MtMam | MtMam+I+G+F | MtMam+I+G | MtMam+I+G |
| BETA | Dimmic | RtREV | RtREV+G+F | RtREV+G | RtREV+G |
| ENDO | Dimmic | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
| GAGGAM | Dimmic | JTT | JTT+G+F | JTT+G+F | JTT+G+F |
| GAGHIV | Dimmic | JTT | JTT+G+F | JTT+G+F | JTT+G+F |
| GAMMA | Dimmic | RtREV | CPREV+G+F | RtREV+G | RtREV+G |
| LENTI | Dimmic | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
| SPUMA | Dimmic | RtREV | RtREV+G | RtREV+G | RtREV+G |
| NONLTR | Dimmic | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
| SIVPOLPRO | LANL | RtREV | RtREV+G+F | RtREV+G+F | RtREV+G |
Figure 3. Proteobacteria Dataset. A break-down of the set of best-fit protein models for the proteobacteria dataset.
Figure 4. Vertebrate Dataset. A break-down of the set of best-fit protein models for the vertebrate dataset.
Figure 5. Archaea Dataset. A break-down of the set of best-fit protein models for the archaea dataset.
Tree Accuracy Simulations. Results of the simulated tree accuracy test where alignments were generated with a particular model and then phylogenies were built using all of the other available models. Each entry is the average scaled Robinson-Foulds (RF) distance [40] over the trees inferred using the alternative models. This test was repeated 10 times for each model and the values in brackets are the RF distances from the true tree when phylogenies were inferred using the model that generated the alignment. Phyml [53] was used to build all trees.
| Model | RF Distance | Model | RF Distance |
| Blosum | 0.03 (0.03) | JTT | 0.05 (0.05) |
| Blosum+I | 0.02 (0.02) | JTT+I | 0.05 (0.04) |
| Blosum+G | 0.08 (0.06) | JTT+G | 0.04 (0.03) |
| Blosum+I+G | 0.05 (0.05) | JTT+I+G | 0.12 (0.11) |
| CPREV | 0.05 (0.04) | MtREV | 0.06 (0.05) |
| CPREV+I | 0.09 (0.04) | MtREV+I | 0.08 (0.08) |
| CPREV+G | 0.06 (0.05) | MtREV+G | 0.07 (0.06) |
| CPREV+I+G | 0.07 (0.06) | MtREV+I+G | 0.12 (0.1) |
| Dayhoff | 0.07 (0.07) | WAG | 0.02 (0.02) |
| Dayhoff+I | 0.06 (0.05) | WAG+I | 0.04 (0.04) |
| Dayhoff+G | 0.06 (0.06) | WAG+G | 0.1 (0.1) |
| Dayhoff+I+G | 0.05 (0.04) | WAG+I+G | 0.04 (0.04) |
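For readers unfamiliar with the measure, the following is a minimal sketch of a scaled Robinson-Foulds distance between two trees on the same taxa. It assumes that "scaled" means dividing the number of differing splits by the maximum for unrooted binary trees, 2(n - 3). Trees are represented as nested tuples of leaf names, which is a convenience of this sketch and not the format used by Phyml.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Collect the non-trivial bipartitions of the tree, treated as unrooted."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # each split is stored as the side without this leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found

def scaled_rf(tree_a, tree_b):
    """Symmetric-difference distance divided by 2(n - 3) (assumed scaling)."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must share the same taxa")
    if len(leaves) < 4:
        raise ValueError("need at least four taxa")
    differing = splits(tree_a) ^ splits(tree_b)
    return len(differing) / (2 * (len(leaves) - 3))

# Hypothetical five-taxon trees.
t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
print(scaled_rf(t1, t2))
```

A value of 0 means the two topologies are identical and 1 means they share no internal splits.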
Proteobacteria Tree Accuracy Analysis. The scaled Robinson-Foulds (RF) distances [40] between the trees produced from the Proteobacteria dataset when a single fixed model was used to build the tree for every alignment. The values reported are the median and mean distances computed by comparing every tree against every other tree. When the optimal set of models was used, the median was 0.22 and the mean was 0.34. Phyml [53] was used to build all trees.
| Model | Median RF | Mean RF | Model | Median RF | Mean RF |
| Blosum | 0.23 | 0.35 | JTT | 0.23 | 0.34 |
| Blosum+I | 0.25 | 0.35 | JTT+I | 0.25 | 0.35 |
| Blosum+G | 0.25 | 0.35 | JTT+G | 0.25 | 0.35 |
| Blosum+I+G | 0.25 | 0.35 | JTT+I+G | 0.25 | 0.35 |
| CPREV | 0.24 | 0.35 | MtREV | 0.25 | 0.35 |
| CPREV+I | 0.25 | 0.35 | MtREV+I | 0.25 | 0.35 |
| CPREV+G | 0.25 | 0.35 | MtREV+G | 0.25 | 0.35 |
| CPREV+I+G | 0.25 | 0.35 | MtREV+I+G | 0.25 | 0.35 |
| Dayhoff | 0.2 | 0.34 | WAG | 0.21 | 0.34 |
| Dayhoff+I | 0.21 | 0.34 | WAG+I | 0.23 | 0.35 |
| Dayhoff+G | 0.22 | 0.34 | WAG+G | 0.25 | 0.35 |
| Dayhoff+I+G | 0.22 | 0.34 | WAG+I+G | 0.25 | 0.35 |
Figure 6. Pseudo Code. The algorithm used to generate the simulated +F alignments can be described in pseudocode as follows. The function random returns a random number greater than the first argument and less than the second argument.
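The pseudocode of the figure is not reproduced in this text. As a hedged illustration only, the perturbation described for the +F simulations (each amino acid frequency changed by up to 10% of its original value) might be sketched as below. The renormalisation step, the function name `perturb_frequencies`, and the uniform starting frequencies are assumptions of this sketch and may differ from the authors' procedure.

```python
import random

def perturb_frequencies(freqs, max_change=0.10, rng=None):
    """Scale each frequency by a random factor in [1 - max_change, 1 + max_change].

    The result is renormalised to sum to 1 (an assumption of this sketch), so the
    final change relative to the original value can slightly exceed max_change.
    """
    rng = rng or random.Random()
    # random.uniform may return the endpoints, unlike the strict bounds in the caption.
    scaled = [f * (1.0 + rng.uniform(-max_change, max_change)) for f in freqs]
    total = sum(scaled)
    return [f / total for f in scaled]

# Hypothetical starting point: 20 equal amino acid frequencies.
base = [0.05] * 20
new = perturb_frequencies(base, rng=random.Random(1))
print(round(sum(new), 6), min(new), max(new))
```

Passing a seeded `random.Random` instance keeps a simulated dataset reproducible.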