Noa Ecker, Dana Azouri, Ben Bettisworth, Alexandros Stamatakis, Yishay Mansour, Itay Mayrose, Tal Pupko.
Abstract
MOTIVATION: In recent years, full-genome sequences have become increasingly available, and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree.
Year: 2022 PMID: 35758778 PMCID: PMC9236582 DOI: 10.1093/bioinformatics/btac252
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Performance of the Lasso approximation
(A) Training set

| Training random tree index | Log-likelihood with site sampling | Log-likelihood with all sites | Percentage of error (%) |
|---|---|---|---|
| 1 | −2 521 335.5 | −2 521 360.5 | 0.001 |
| 2 | −2 492 529.8 | −2 492 400.3 | 0.005 |
| 3 | −2 682 862.6 | −2 683 107.8 | 0.009 |
| 4 | −2 491 174.8 | −2 491 191.2 | 0.001 |
| 5 | −2 463 169.9 | −2 463 143.8 | 0.001 |
Note: Performance of the Lasso approximation on five trees selected from the training set (A) and five trees selected from the test set (B). The training and test sets included 4000 and 100 trees, respectively. The Lasso methodology selected 4048 sites from a total of 80 000 sites (i.e. around 5%). The percentage of error is calculated as the absolute value of the difference between the true and approximated log-likelihoods, divided by the true log-likelihood. The mean and standard deviation of the percentage of error across the 4000 trees used as a training set are 0.0035 and 7.6e-06, respectively. The mean and standard deviation of the percentage of error across the 100 trees used as a test set are 0.17 and 0.0019, respectively.
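The error measure described in the note can be checked against the panel A rows with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `percentage_error` and the list `PANEL_A` are introduced here for illustration, and the sketch assumes the division uses the magnitude of the true log-likelihood so that the reported percentage is positive.

```python
# Log-likelihood pairs copied from panel A:
# (log-likelihood with site sampling, log-likelihood with all sites).
PANEL_A = [
    (-2521335.5, -2521360.5),
    (-2492529.8, -2492400.3),
    (-2682862.6, -2683107.8),
    (-2491174.8, -2491191.2),
    (-2463169.9, -2463143.8),
]


def percentage_error(approx_ll, true_ll):
    """Absolute difference relative to the true log-likelihood, in percent.

    Assumption: the magnitude of the true log-likelihood is used as the
    denominator, since log-likelihoods are negative.
    """
    return abs(true_ll - approx_ll) / abs(true_ll) * 100.0


# Round to three decimals, the precision used in the table.
errors = [round(percentage_error(approx, true), 3) for approx, true in PANEL_A]
print(errors)

# Share of alignment sites retained by the Lasso selection.
sampled_pct = 4048 / 80000 * 100.0
print(f"{sampled_pct:.2f}% of sites retained")
```

Rounded to three decimals, the computed values agree with the last column of panel A, and 4048 of 80 000 sites corresponds to roughly 5.06%.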
Fig. 1. Scatter plot of predicted versus exact log-likelihoods (LL). Each dot represents one random tree. The blue line is the linear regression line. (A) Results on training data; (B) results on test data.
Fig. 2. The error in log-likelihood estimation as a function of training size and percentage of sampled positions. The y axis quantifies the error as the percentage of unexplained variance ($1 - r^2$) obtained on a test set of 100 random trees ($r^2$ denotes the squared Pearson correlation coefficient). Shown are results for four values of training size and four values of sampling percentage. The analyzed alignment is the NagyA1 dataset with 30 sequences and 80 000 sites.
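As an illustration of the y-axis quantity in Fig. 2, the sketch below computes the percentage of unexplained variance as one minus the squared Pearson correlation between exact and predicted log-likelihoods. The helper names `pearson_r` and `unexplained_variance_pct` are hypothetical, and the input values are made up for demonstration; they are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def unexplained_variance_pct(exact, predicted):
    """Percentage of unexplained variance: (1 - r^2) * 100."""
    r = pearson_r(exact, predicted)
    return (1.0 - r * r) * 100.0


# Made-up log-likelihoods for four hypothetical test trees.
exact_ll = [-100.0, -120.0, -90.0, -150.0]
predicted_ll = [-101.0, -118.5, -92.0, -149.0]
print(f"{unexplained_variance_pct(exact_ll, predicted_ll):.3f}% unexplained")
```

A value near zero means the predicted log-likelihoods rank and scale the test trees almost exactly as the exact ones do, up to a linear transformation.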
Fig. 3. Distribution of evolutionary rates for the entire alignment against that of the sampled alignment. Shown are results on the NagyA1 dataset with 30 sequences and 80 000 positions, using ζ = 5% of the positions and 4000 trees for training. The overlap between the two distributions is shown in dark green.
Performance of different search strategies
| | Standard | Lasso-only | Two-phase |
|---|---|---|---|
| Log-likelihood of final tree | −1 925 986.1 | −1 926 169.6 | −1 925 950.8 |
| Number of SPR moves | 167 | 153 | 157 |
| Total CPU time of the search | 133 818 | 2847 | 6208 |
| Training CPU time | 0 | 16 502 | 16 502 |
Note: Performance of the Standard search, Lasso-only search and Two-phase search on the NagyA1 MSA with 30 sequences and 80 000 positions. The Lasso-only search and the Two-phase search are based on the Lasso approximation, which was generated using ζ = 5% of the positions and 4000 trees for training. All final log-likelihood scores are computed using all alignment sites.
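To compare the three strategies on overall cost, the search and training CPU times from the table can be added and expressed relative to the Standard search. The sketch below assumes that the training CPU time is not already included in the total CPU time of the search, and it leaves the time unit unspecified because the table does not state one. The dictionary names are introduced here for illustration.

```python
# CPU times copied from the table (unit not stated in the source).
SEARCH_CPU = {"Standard": 133818, "Lasso-only": 2847, "Two-phase": 6208}
TRAINING_CPU = {"Standard": 0, "Lasso-only": 16502, "Two-phase": 16502}

# Assumption: training time is reported separately from search time,
# so the overall cost is their sum.
total_cpu = {name: SEARCH_CPU[name] + TRAINING_CPU[name] for name in SEARCH_CPU}

# Speedup of each strategy relative to the Standard search.
speedup = {name: total_cpu["Standard"] / total_cpu[name] for name in total_cpu}

for name in total_cpu:
    print(f"{name:<11} total={total_cpu[name]:>7}  speedup={speedup[name]:.2f}x")
```

Under that assumption, the Lasso-only and Two-phase searches cost 19 349 and 22 710 CPU time units in total, roughly 6.9 and 5.9 times less than the Standard search, while the Two-phase search reaches the highest final log-likelihood of the three.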