Diogo Pinheiro, Sergio Santander-Jimenéz, Aleksandar Ilic.
Abstract
BACKGROUND: In the pursuit of a better understanding of biodiversity, evolutionary biologists rely on the study of phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted in the shape of phylogenetic trees, not only help to understand evolutionary history but also have a wide range of additional applications in science. One of the most challenging problems that arise when building phylogenetic trees is the presence of missing biological data. More specifically, the possibility of inferring wrong phylogenetic trees increases proportionally to the amount of missing values in the input data. Although methods have been proposed to deal with this issue, their applicability and accuracy are often restricted by different constraints.
Keywords: Machine learning; Missing data imputation; Phylogenetic tree; Random forest
Year: 2022 PMID: 35585494 PMCID: PMC9116704 DOI: 10.1186/s12864-022-08540-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547
Fig. 1 Missing matrix inputs obtained from incomplete sequences
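One way such missing entries can arise is sketched below: a pairwise p-distance (proportion of differing sites) is undefined when two aligned sequences share no sequenced positions, so the corresponding matrix cell stays empty. The toy sequences, the gap symbol `-`, and the helper names `p_distance` and `p_distance_matrix` are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Optional

GAP = "-"  # assumed symbol for positions that were not sequenced


def p_distance(a: str, b: str) -> Optional[float]:
    """Proportion of differing sites over the positions present in both sequences.

    Returns None (a missing matrix entry) when the sequences share no sites.
    """
    shared = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def p_distance_matrix(seqs: List[str]) -> List[List[Optional[float]]]:
    """Symmetric distance matrix; None marks pairs with no overlapping sites."""
    n = len(seqs)
    matrix: List[List[Optional[float]]] = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = p_distance(seqs[i], seqs[j])
    return matrix


# Hypothetical incomplete sequences: the third one covers a different region.
seqs = ["ACGT----", "ACGA----", "----TTGA"]
matrix = p_distance_matrix(seqs)
```

In this toy input the first two sequences overlap on four sites, while the third overlaps with neither, so two of the three off-diagonal pairs are missing and would have to be imputed before building a tree.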
Fig. 2 Results obtained with the 9×9 dataset. Box plots with the mean, min (best), max (worst) and standard deviation of NRF
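Assuming NRF here stands for the normalized Robinson-Foulds distance between the tree built from the imputed matrix and the tree built from the complete matrix, the following minimal sketch shows how such a score could be computed. Trees are written as nested tuples of leaf names, which is a representation chosen for this example only; the function names are likewise hypothetical. The distance is normalized by the total number of non-trivial splits in both trees, which for binary unrooted trees with n leaves equals 2(n-3).

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]  # a leaf name or a tuple of subtrees


def leaves(tree: Tree) -> FrozenSet[str]:
    """Set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[FrozenSet[str]]:
    """Non-trivial bipartitions of the leaf set, viewing the tree as unrooted.

    Each split is stored as the side that does not contain a fixed reference leaf.
    """
    all_leaves = leaves(tree)
    reference = min(all_leaves)
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(t1: Tree, t2: Tree) -> float:
    """Symmetric difference of split sets divided by the total number of splits."""
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must share the same leaf set")
    s1, s2 = splits(t1), splits(t2)
    total = len(s1) + len(s2)
    return len(s1 ^ s2) / total if total else 0.0


# Hypothetical five-taxon trees: the estimate swaps taxa B and C.
reference_tree = ((("A", "B"), "C"), ("D", "E"))
estimated_tree = ((("A", "C"), "B"), ("D", "E"))
nrf = normalized_rf(reference_tree, estimated_tree)
```

A value of 0 means both trees have identical topologies; multiplying by 100 gives a percentage scale, with lower values denoting better quality.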
Statistical testing of NRF results for the 9x9 dataset, with regard to autoencoder (AE) and matrix factorization (MF). Statistically significant improvements achieved by PhyloMissForest (under any of the considered configuration profiles) are denoted as , while non-significant differences are marked with ×
| Dataset | %Missing | Non-bootstrap vs. AE | Non-bootstrap vs. MF | Bootstrap vs. AE | Bootstrap vs. MF | PhyloMissForest diff. vs. AE | PhyloMissForest diff. vs. MF |
|---|---|---|---|---|---|---|---|
| 9x9 | 5% | 0.25 | |||||
| 10% | |||||||
| 15% | |||||||
| 20% | 0.11 | 0.14 | |||||
| 25% | 0.25 | 0.28 | 0.22 | 0.12 | × | × | |
| 30% | 0.14 | 0.12 | × | ||||
| 35% | 0.58 | 0.17 | 0.97 | 0.44 | × | × | |
| 40% | 0.31 | 0.19 | 0.35 | × | |||
| 45% | 0.31 | 0.44 | |||||
| 50% | 0.14 | 0.17 | 0.53 | × | |||
| 55% | 0.11 | 0.44 | |||||
| 60% | 0.91 | 0.35 | × | ||||
Bold values refer to p-values denoting statistically significant improvements
Fig. 3 Results obtained with the 37×37 dataset. Box plots with the mean, min (best), max (worst) and standard deviation of NRF
Statistical testing of NRF results for the 37x37 dataset, with regard to autoencoder (AE) and matrix factorization (MF). Statistically significant improvements achieved by PhyloMissForest (under any of the considered configuration profiles) are denoted as , while non-significant differences are marked with ×
| Dataset | %Missing | Non-bootstrap vs. AE | Non-bootstrap vs. MF | Bootstrap vs. AE | Bootstrap vs. MF | PhyloMissForest diff. vs. AE | PhyloMissForest diff. vs. MF |
|---|---|---|---|---|---|---|---|
| 37x37 | 5% | 0.63 | 0.04 | 0.22 | 0.17 | × | × |
| 10% | 0.97 | 0.80 | 0.35 | 0.53 | × | × | |
| 15% | 0.28 | 0.63 | 0.31 | 0.44 | × | × | |
| 20% | 0.17 | 0.63 | 0.58 | × | |||
| 25% | 0.11 | 0.17 | × | ||||
| 30% | 0.74 | 0.91 | × | ||||
| 35% | 0.85 | 0.68 | × | ||||
| 40% | 0.74 | 0.68 | × | ||||
| 45% | 0.63 | 0.91 | × | ||||
| 50% | 0.91 | 0.68 | × | ||||
| 55% | 1.00 | 0.39 | × | ||||
| 60% | 0.48 | 0.28 | × | ||||
Bold values refer to p-values denoting statistically significant improvements
Fig. 4 Results obtained with the 55×55 dataset. Box plots with the mean, min (best), max (worst) and standard deviation of NRF
Statistical testing of NRF results for the 55x55 dataset, with regard to autoencoder (AE) and matrix factorization (MF). Statistically significant improvements achieved by PhyloMissForest (under any of the considered configuration profiles) are denoted as , while non-significant differences are marked with ×
| Dataset | %Missing | Non-bootstrap vs. AE | Non-bootstrap vs. MF | Bootstrap vs. AE | Bootstrap vs. MF | PhyloMissForest diff. vs. AE | PhyloMissForest diff. vs. MF |
|---|---|---|---|---|---|---|---|
| 55x55 | 5% | 0.17 | |||||
| 10% | 0.58 | 0.25 | × | ||||
| 15% | |||||||
| 20% | 0.12 | ||||||
| 25% | 0.48 | 0.25 | × | ||||
| 30% | |||||||
| 35% | |||||||
| 40% | |||||||
| 45% | |||||||
| 50% | |||||||
| 55% | |||||||
| 60% | |||||||
Bold values refer to p-values denoting statistically significant improvements
Fig. 5 Results obtained with the 40×40 dataset. Box plots with the mean, min (best), max (worst) and standard deviation of NRF
Mean NRF results (%) and standard deviations for the 201x201 dataset. N/A refers to situations where matrix factorization did not finish execution in an experimental time window of 48 hours
| Dataset | %Missing | PhyloMissForest (non-bootstrap) | PhyloMissForest (bootstrap) | AutoEncoder | Matrix Factorization |
|---|---|---|---|---|---|
| 201x201 | 10% | 15.20 ±2 | 18.81 ±3 | N/A | |
| 15% | 19.02 ±1 | 23.64 ±2 | N/A | ||
| 20% | 21.26 ±1 | 25.08 ±1 | N/A | ||
Bold values refer to the best NRF results in the comparison
Statistical testing of NRF results for the 40x40 and 201x201 datasets, with regard to autoencoder (AE) and matrix factorization (MF). Statistically significant improvements achieved by PhyloMissForest (under any of the considered configuration profiles) are denoted as , while non-significant differences are marked with ×. N/A refers to situations where matrix factorization did not finish execution in an experimental time window of 48 hours
| Dataset | %Missing | Non-bootstrap vs. AE | Non-bootstrap vs. MF | Bootstrap vs. AE | Bootstrap vs. MF | PhyloMissForest diff. vs. AE | PhyloMissForest diff. vs. MF |
|---|---|---|---|---|---|---|---|
| 40x40 | 5% | 0.69 | 0.69 | × | |||
| 10% | 0.31 | 0.69 | 0.42 | 0.42 | × | × | |
| 15% | 0.69 | 1.00 | × | ||||
| 20% | 0.31 | 0.22 | × | ||||
| 25% | 0.69 | 0.55 | × | ||||
| 30% | 1.00 | 0.31 | × | ||||
| 35% | 0.84 | 0.15 | 0.31 | × | |||
| 40% | 0.05 | 0.01 | |||||
| 45% | 0.55 | ||||||
| 50% | 0.15 | 0.22 | |||||
| 55% | 0.84 | 0.84 | 0.42 | × | |||
| 60% | 0.84 | ||||||
| 201x201 | 10% | N/A | N/A | N/A | |||
| 15% | N/A | N/A | N/A | ||||
| 20% | N/A | N/A | N/A | ||||
Bold values refer to p-values denoting statistically significant improvements
Comparisons with LASSO and DAMBE on real-world datasets: mean NRF values and p-values obtained in the statistical testing of PhyloMissForest samples over the alternative approaches. Lower NRF values denote better quality. N/A denotes scenarios where DAMBE was not able to find any suitable solution
| Dataset | %Missing | NRF: PhyloMissForest | NRF: LASSO | NRF: DAMBE | p-value vs. LASSO | p-value vs. DAMBE |
|---|---|---|---|---|---|---|
| 9x9 | 5% | 9.17 | 23.61 | |||
| 10% | 8.33 | 29.17 | ||||
| 15% | 14.17 | 41.67 | 0.25 | |||
| 20% | 13.33 | 38.54 | 0.35 | |||
| 25% | 16.67 | 39.58 | 0.53 | |||
| 30% | 17.50 | 36.11 | 0.48 | |||
| 35% | 17.83 | N/A | 0.44 | N/A | ||
| 40% | 21.67 | N/A | 0.58 | N/A | ||
| 45% | 24.33 | N/A | 0.35 | N/A | ||
| 50% | 25.00 | N/A | 0.63 | N/A | ||
| 55% | 29.17 | N/A | 0.35 | N/A | ||
| 60% | 33.17 | N/A | 0.68 | N/A | ||
| 37x37 | 5% | 3.94 | 7.35 | 0.79 | ||
| 10% | 7.50 | 6.72 | 0.54 | |||
| 15% | 10.15 | 10.54 | 0.91 | 0.96 | ||
| 20% | 10.88 | N/A | 0.53 | N/A | ||
| 25% | 15.29 | N/A | 0.11 | N/A | ||
| 30% | 17.56 | N/A | 0.19 | N/A | ||
| 35% | 18.56 | N/A | 0.25 | N/A | ||
| 40% | 20.24 | N/A | 0.48 | N/A | ||
| 45% | 25.59 | N/A | 0.00 | N/A | ||
| 50% | 24.74 | N/A | 0.14 | N/A | ||
| 55% | 27.47 | N/A | 0.63 | N/A | ||
| 60% | 31.53 | N/A | 0.53 | N/A | ||
| 55x55 | 5% | 20.58 | N/A | N/A | ||
| 10% | 21.63 | N/A | N/A | |||
| 15% | 22.79 | N/A | N/A | |||
| 20% | 22.31 | N/A | N/A | |||
| 25% | 24.90 | N/A | N/A | |||
| 30% | 26.73 | N/A | N/A | |||
| 35% | 27.21 | N/A | N/A | |||
| 40% | 30.38 | N/A | N/A | |||
| 45% | 28.94 | N/A | N/A | |||
| 50% | 33.27 | N/A | N/A | |||
| 55% | 35.10 | N/A | N/A | |||
| 60% | 35.10 | N/A | N/A | |||
Bold values in the “NRF scores” columns denote the best NRF scores in the comparison, while in the p-values columns they refer to p-values denoting statistically significant improvements
Comparisons with LASSO and DAMBE on simulated datasets: mean NRF values and p-values obtained in the statistical testing of PhyloMissForest samples over the alternative approaches. Lower NRF values denote better quality. N/A denotes scenarios where DAMBE was not able to find any suitable solution
| Dataset | %Missing | NRF: PhyloMissForest | NRF: LASSO | NRF: DAMBE | p-value vs. LASSO | p-value vs. DAMBE |
|---|---|---|---|---|---|---|
| 40x40 | 5% | 20.54 | N/A | N/A | ||
| 10% | 21.89 | N/A | N/A | |||
| 15% | 20.00 | N/A | N/A | |||
| 20% | 22.70 | N/A | N/A | |||
| 25% | 23.24 | N/A | N/A | |||
| 30% | 21.35 | N/A | N/A | |||
| 35% | 28.38 | N/A | N/A | |||
| 40% | 24.59 | N/A | N/A | |||
| 45% | 31.89 | N/A | N/A | |||
| 50% | 29.46 | N/A | N/A | |||
| 55% | 32.97 | N/A | N/A | |||
| 60% | 31.89 | N/A | N/A | |||
| 201x201 | 10% | 33.84 | N/A | N/A | ||
| 15% | 35.56 | N/A | N/A | |||
| 20% | 34.85 | N/A | N/A | |||
Bold values in the “NRF scores” columns denote the best NRF scores in the comparison, while in the “p-values” columns they refer to p-values denoting statistically significant improvements
Fig. 6 Phylogenetic trees estimated with the full distance matrix on the upper left, in comparison with the trees obtained with PhyloMissForest (bottom left), matrix factorization (upper right) and autoencoder (bottom right) in the 9×9 dataset with 5% of missing data
Comparisons of NRF values between PhyloMissForest and the baseline algorithm MissForest [37]. “Split-LS-Rand” refers to the configuration where LS is incorporated for guidance purposes in the different steps of PhyloMissForest, while “Best Observed” represents the best results reported with any possible configuration of search strategies. Lower values denote better quality
| Dataset | %Missing | PhyloMissForest (Split-LS-Rand) | PhyloMissForest (Best Observed) | Mixed-type MissForest |
|---|---|---|---|---|
| 9x9 | 5% | 4.6 | 5.0 | |
| 10% | 7.5 | 11.7 | ||
| 15% | 12.1 | |||
| 20% | 12.5 | 13.8 | ||
| 25% | 19.2 | |||
| 30% | 17.9 | |||
| 37x37 | 5% | 2.0 | ||
| 10% | 4.6 | |||
| 15% | 7.0 | |||
| 20% | 6.8 | 9.4 | ||
| 25% | 14.0 | |||
| 30% | 16.2 | |||
| 55x55 | 5% | 3.4 | 6.3 | |
| 10% | 5.9 | 8.7 | ||
| 15% | 11.0 | 14.7 | ||
| 20% | 16.9 | |||
| 25% | 20.5 | 22.0 | ||
| 30% | 25.0 | |||
Bold values refer to the best NRF values in the comparison
Final parameter settings for PhyloMissForest under non-bootstrap and bootstrap profiles
| Parameters | Non-Bootstrap | Bootstrap |
|---|---|---|
| Bootstrap | 0 | 1 |
| Size of the bootstrap | - | 1 |
| Number of trees | 30 | 50 |
| Max Features | 0.25 | 1 |
| Max Depth | 1 | 1 |
| Min Leaf | 0.01 | 0.13 |
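For readers who want to reproduce these profiles with a generic random forest library, the two columns can be written down as keyword dictionaries. The sketch below uses scikit-learn-style parameter names (`n_estimators`, `max_features`, `max_depth`, `min_samples_leaf`, `bootstrap`, `max_samples`); that naming is an assumption, since the table does not name a library. Reading the value 1 for "Size of the bootstrap" and "Max Features" as the fraction 1.0 is also an assumption, suggested only by the fractional 0.25 in the other profile.

```python
# Hypothetical encoding of the two profiles; keyword names follow
# scikit-learn conventions, which the source table does not confirm.
PROFILES = {
    "non_bootstrap": {
        "bootstrap": False,        # Bootstrap = 0
        "max_samples": None,       # Size of the bootstrap: "-" (not applicable)
        "n_estimators": 30,        # Number of trees
        "max_features": 0.25,      # Max Features
        "max_depth": 1,            # Max Depth
        "min_samples_leaf": 0.01,  # Min Leaf
    },
    "bootstrap": {
        "bootstrap": True,         # Bootstrap = 1
        "max_samples": 1.0,        # Size of the bootstrap = 1, assumed to be a fraction
        "n_estimators": 50,        # Number of trees
        "max_features": 1.0,       # Max Features = 1, assumed to be a fraction
        "max_depth": 1,            # Max Depth
        "min_samples_leaf": 0.13,  # Min Leaf
    },
}
```

The float-versus-integer distinction matters in libraries that treat an integer 1 as "one sample" or "one feature" and a float 1.0 as "all of them", so the intended reading should be checked against the original implementation.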
Mean execution times obtained with autoencoder, matrix factorization, and PhyloMissForest (non-bootstrap and bootstrap profiles)
| Dataset | Autoencoder | Matrix Factorization | Non-Bootstrap | Bootstrap |
|---|---|---|---|---|
| 9×9 | 25s | 18s | 1s | 20s |
| 37×37 | 37s | 17min | 25s | 7min |
| 40×40 | 9min | 35min | 1.7min | 20min |
| 55×55 | 2min | 53min | 2min | 25min |
| 201×201 | 1.5h | >48h | 6h | 34h |
Fig. 7 Random forest scheme with bootstrap and aggregation steps
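The bootstrap and aggregation steps named in this caption can be illustrated with a small, generic bagging sketch. It is not the PhyloMissForest implementation: the single-feature regression stumps, the toy data, and the function names `fit_stump` and `bagged_forest` are assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]  # (feature value, target value)


def fit_stump(data: Sequence[Sample]) -> Callable[[float], float]:
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    overall = mean([y for _, y in data])
    xs = sorted({x for x, _ in data})
    best = (float("inf"), None, overall, overall)
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        l_mean, r_mean = mean(left), mean(right)
        sse = sum((y - l_mean) ** 2 for y in left)
        sse += sum((y - r_mean) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    _, threshold, l_mean, r_mean = best
    if threshold is None:  # all feature values identical: predict the mean
        return lambda x: overall
    return lambda x: l_mean if x <= threshold else r_mean


def bagged_forest(
    data: List[Sample], n_trees: int, rng: random.Random
) -> Callable[[float], float]:
    """Train each tree on a bootstrap sample and average the predictions."""
    trees = []
    for _ in range(n_trees):
        sample = rng.choices(data, k=len(data))  # bootstrap: draw with replacement
        trees.append(fit_stump(sample))
    return lambda x: mean([tree(x) for tree in trees])  # aggregation step


# Hypothetical one-dimensional training data with a jump between x=2 and x=3.
data = [(0.0, 1.0), (1.0, 1.2), (2.0, 0.9), (3.0, 3.1), (4.0, 2.9), (5.0, 3.0)]
predict = bagged_forest(data, n_trees=25, rng=random.Random(0))
```

Because every tree sees a different resample of the training data, individual predictions vary, and averaging them reduces that variance. Skipping the resampling and training every tree on the full data would correspond to a non-bootstrap configuration.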
Fig. 8 Flowchart of the phases of PhyloMissForest