Benoit Morel, Pierre Barbera, Lucas Czech, Ben Bettisworth, Lukas Hübner, Sarah Lutteropp, Dora Serdari, Evangelia-Georgia Kostaki, Ioannis Mamais, Alexey M Kozlov, Pavlos Pavlidis, Dimitrios Paraskevis, Alexandros Stamatakis.
Abstract
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into subclasses using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.
Keywords: SARS-CoV-2; outgroups; phylogenetic inference; phylogeny rooting; strain classification
Year: 2021 PMID: 33316067 PMCID: PMC7798910 DOI: 10.1093/molbev/msaa314
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Metrics for Assessing the Quality of the Tree Inference Conducted on the Four Distinct MSA Versions (FMSA, FMSAO, SMSA, SMSAO).
| Metric | FMSA | FMSAO | SMSA | SMSAO |
|---|---|---|---|---|
| Taxa | 4,869 | 4,871 | 2,888 | 2,904 |
| ML trees RF | 0.783 | 0.783 | 0.775 | 0.783 |
| Search RF | 0.112 | 0.112 | 0.128 | 0.119 |
| Plausible trees | 76 | 75 | 74 | 76 |
| MR resolution | 0.129 | 0.131 | 0.155 | 0.147 |
| MRE resolution | 0.706 | 0.701 | 0.699 | 0.680 |
Note.—ML trees RF is the average relative RF distance between all 100 inferred ML trees. Search RF is the average relative RF distance between the parsimony starting trees and the final ML trees of the respective tree searches on these starting trees. Plausible trees represents the number of trees (out of 100) in the plausible trees set. MR and MRE resolutions are the resolution ratios (see definition in the text) of the MR and MRE trees computed on the plausible tree sets.
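The relative RF distance used in the note above can be illustrated with a small sketch. It assumes each unrooted tree is already available as a set of non-trivial bipartitions and that the distance is normalized by 2(n - 3), the maximum for two fully resolved trees on n taxa. The helper names (`canonical_split`, `relative_rf`, `average_relative_rf`) and the five-taxon toy trees are hypothetical and are not taken from the paper or its tools.

```python
from itertools import combinations


def canonical_split(side, all_taxa):
    """Return the side of a bipartition that contains the smallest taxon name,
    so the same branch is always encoded the same way."""
    side = frozenset(side)
    reference = min(all_taxa)
    return side if reference in side else frozenset(all_taxa) - side


def relative_rf(splits_a, splits_b, n_taxa):
    """Relative Robinson-Foulds distance between two unrooted trees, each given
    as a set of canonical non-trivial bipartitions."""
    # Bipartitions that occur in exactly one of the two trees.
    rf = len(splits_a ^ splits_b)
    # Assumption: normalize by the maximum for fully resolved trees, 2 * (n - 3).
    return rf / (2 * (n_taxa - 3))


def average_relative_rf(trees, n_taxa):
    """Average relative RF distance over all unordered pairs of trees."""
    pairs = list(combinations(trees, 2))
    return sum(relative_rf(a, b, n_taxa) for a, b in pairs) / len(pairs)


# Toy example with five taxa (hypothetical data).
taxa = {"A", "B", "C", "D", "E"}
tree1 = {canonical_split(s, taxa) for s in ({"A", "B"}, {"D", "E"})}  # ((A,B),C,(D,E))
tree2 = {canonical_split(s, taxa) for s in ({"A", "C"}, {"D", "E"})}  # ((A,C),B,(D,E))
tree3 = {canonical_split(s, taxa) for s in ({"A", "D"}, {"C", "E"})}  # ((A,D),B,(C,E))

print(relative_rf(tree1, tree2, len(taxa)))
print(average_relative_rf([tree1, tree2, tree3], len(taxa)))
```

In practice the bipartitions would be extracted from Newick files by a phylogenetics library; the sketch only shows the arithmetic behind the table entries.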
Fig. 1. Log-likelihood scores of the best-scoring ML tree topology after model parameter (GTR, ML base frequencies, and Γ rate heterogeneity) and branch length optimization with the following (default) settings: blmax: 100, fast branch length optimization, lh_epsilon: 0.1, and varying the indicated blmin (vertical line: default value of 10^-6).
Fig. 2. Spearman rank correlation of RAxML-NG tree search and RAxML-NG evaluation mode log-likelihood scores under the free rates model on a set of 100 ML tree topologies on the FMSA data set.
Fig. 3. Spearman rank correlation of IQ-TREE tree search and IQ-TREE evaluation mode log-likelihood scores under the free rates model on a set of 100 ML tree topologies on the FMSA data set.
Fig. 4. Extended majority rule consensus tree (FMSAO-CE) of the plausible tree set of the FMSAO alignment. We colored the tree by the region of origin of each sequence.
Fig. 5. Pangolin tool lineages displayed on the extended majority rule consensus tree (FMSAO-CE) of the plausible tree set for the FMSAO data set. We color the tree by Pangolin tool lineages. The numbers in parentheses next to the pangolin tool lineage labels indicate the number of taxa in the tree per lineage.
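The Spearman rank correlation named in the captions for Figs. 2 and 3 compares how two scoring modes order the same set of trees. A minimal standard-library sketch follows, assuming tied scores receive average ranks. The function names and the four log-likelihood values are hypothetical and are not data from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n in ascending order, giving tied values the mean of
    the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical log-likelihoods of the same four topologies under two modes.
search_scores = [-10.0, -12.5, -11.0, -15.0]
evaluation_scores = [-9.8, -12.0, -11.5, -14.0]
print(spearman(search_scores, evaluation_scores))
```

A value near 1 means both modes rank the topologies in nearly the same order, and a value near 0 means the rankings are essentially unrelated.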
Metrics for the Thinned Alignment Versions.
| Metric | F-SST | F-MET | F-RAND | S-SST | S-MET | S-RAND |
|---|---|---|---|---|---|---|
| Taxa | 912 | 912 | 912 | 434 | 434 | 434 |
| ML trees RF | 0.67 | 0.66 | 0.77 | 0.68 | 0.63 | 0.79 |
| Search RF | 0.19 | 0.20 | 0.15 | 0.21 | 0.21 | 0.18 |
| Plausible trees | 39 | 45 | 73 | 31 | 47 | 59 |
| MR resolution | 0.166 | 0.218 | 0.144 | 0.164 | 0.245 | 0.141 |
| MRE resolution | 0.918 | 0.842 | 0.72 | 0.912 | 0.85 | 0.72 |
Note.—Taxa is the number of taxa in the alignment. ML trees RF is the average relative RF distance between all 100 inferred ML trees. Search RF is the average relative RF distance between the parsimony starting trees and the final ML trees of the respective tree searches on these starting trees. Plausible trees represents the number of trees (out of 100) in the plausible tree sets. MR and MRE resolutions are the resolution ratios (see definition in the text) of the MR and MRE trees computed on the plausible tree sets.
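The exact definition of the resolution ratio is given in the main text, which is not reproduced here. The sketch below therefore assumes a common reading: the number of internal branches in the consensus tree divided by n - 3, the number in a fully resolved unrooted tree on n taxa. It builds a strict majority rule (MR) consensus only; the extended (MRE) variant, which greedily adds further compatible bipartitions, is omitted. All names and the toy trees are hypothetical.

```python
from collections import Counter


def majority_rule_splits(trees, threshold=0.5):
    """Bipartitions that occur in more than `threshold` of the input trees.

    Each tree is a set of non-trivial bipartitions, encoded as the frozenset of
    taxa on one (consistently chosen) side of an internal branch."""
    counts = Counter(split for tree in trees for split in tree)
    return {split for split, count in counts.items() if count > threshold * len(trees)}


def resolution_ratio(consensus_splits, n_taxa):
    """Assumed definition: internal branches present / internal branches possible."""
    return len(consensus_splits) / (n_taxa - 3)


# Three hypothetical five-taxon trees; every split is encoded by the side containing "A".
ab, abc = frozenset("AB"), frozenset("ABC")
ac, ad, abd = frozenset("AC"), frozenset("AD"), frozenset("ABD")
trees = [{ab, abc}, {ac, abc}, {ad, abd}]

consensus = majority_rule_splits(trees)
print(consensus)
print(resolution_ratio(consensus, n_taxa=5))
```

Under this reading, the low MR resolution values in both tables mean that only a small fraction of the possible internal branches is supported by more than half of the plausible trees.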
EPA-NG Root Placement Probability and Entropy Statistics for the Pangolin Outgroup Sequence over All Trees in the Respective Plausible Tree Sets for Distinct MSA Versions.
| Alignment | Max LWR (Mean) | Max LWR (SD) | LWR Entropy (Mean) | LWR Entropy (SD) |
|---|---|---|---|---|
| FMSAO | 0.033 | 0.001 | 5.332 | 0.010 |
| FMSAO-HMMER | 0.034 | 0.000 | 5.325 | 0.010 |
| SMSAO | | 0.010 | | 0.046 |
| SMSAO-HMMER | 0.001 | 0.000 | 5.634 | 0.000 |
Note.—Highlighted in italics is the highest confidence signal, which is the only one among all tested data sets to reach a mean LWR >0.04.
EPA-NG Root Placement Probability and Entropy Statistics for the Bat Outgroup Sequence over All Trees in the Respective Plausible Tree Sets for Distinct MSA Versions.
| Alignment | Max LWR (Mean) | Max LWR (SD) | LWR Entropy (Mean) | LWR Entropy (SD) |
|---|---|---|---|---|
| FMSAO | 0.037 | 0.001 | 5.437 | 0.009 |
| FMSAO-HMMER | 0.037 | 0.001 | 5.438 | 0.008 |
| SMSAO | 0.025 | 0.001 | 5.378 | 0.006 |
| SMSAO-HMMER | 0.004 | 0.000 | 5.546 | 0.013 |
Results of RootDigger Analysis for Different MSA Versions.
| Alignment | Max LWR (Mean) | Max LWR (SD) | LWR Entropy (Mean) | LWR Entropy (SD) |
|---|---|---|---|---|
| FMSA | 0.240 | 0.006 | 8.053 | 0.059 |
| FMSA-SS | 0.041 | 0.001 | 7.872 | 0.036 |
| SMSA | | 0.038 | | 0.474 |
| SMSA-SS | 0.101 | 0.002 | 7.327 | 0.023 |
Note.—Because of excessive runtimes, for every data set, we only analyzed the 5% of trees with the highest likelihood with RootDigger in exhaustive mode. To further summarize the results, we also compute the entropy of the LWR distributions for each resulting tree and report the average for each data set. The results are averages over the included plausible trees. Highlighted in italics is the highest confidence signal.
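The summary statistics described in the note (maximum LWR and entropy of the LWR distribution per tree, then averaged per data set) can be sketched as follows. The logarithm base and the choice between population and sample standard deviation are not stated here, so the natural logarithm and the population SD used below are assumptions. The function names and the toy LWR vectors are hypothetical.

```python
import math
import statistics


def lwr_entropy(lwrs, base=math.e):
    """Shannon entropy of a likelihood weight ratio (LWR) distribution over the
    branches of one tree. The weights are renormalized to sum to 1."""
    total = sum(lwrs)
    probs = [w / total for w in lwrs if w > 0]
    return -sum(p * math.log(p, base) for p in probs)


def summarize(per_tree_lwrs, base=math.e):
    """Mean and SD of the maximum LWR and of the LWR entropy across trees."""
    max_lwrs = [max(lwrs) for lwrs in per_tree_lwrs]
    entropies = [lwr_entropy(lwrs, base) for lwrs in per_tree_lwrs]
    return {
        "max_lwr_mean": statistics.mean(max_lwrs),
        "max_lwr_sd": statistics.pstdev(max_lwrs),  # assumption: population SD
        "entropy_mean": statistics.mean(entropies),
        "entropy_sd": statistics.pstdev(entropies),
    }


# Hypothetical LWR vectors for two trees with four candidate root branches each.
uniform_tree = [0.25, 0.25, 0.25, 0.25]  # no preferred root position
peaked_tree = [0.97, 0.01, 0.01, 0.01]   # one strongly preferred position
print(lwr_entropy(uniform_tree, base=2))
print(summarize([uniform_tree, peaked_tree]))
```

A low maximum LWR combined with a high entropy, as in most rows of the tables, indicates that the placement mass is spread over many branches, so no single root position stands out.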
Fig. 6. Rooted SMSA Maximum Likelihood tree number 2. We color the tree by geographic regions and root it via RootDigger using a nonreversible model of nucleotide substitution. The tree inference randomly resolved multifurcations by introducing branches of length zero. For visualization purposes, we collapsed these branches, hence yielding a multifurcating tree again.
Fig. 7. Median number of delimited species over all possible rootings per plausible tree in SMSA-P.