Ananya Bhattacharjee, Md Shamsuzzoha Bayzid.
Abstract
BACKGROUND: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large-scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean), require complete distance matrices without any missing data.
Keywords: Autoencoder; Deep learning; Gene trees; Imputation; Matrix factorization; Missing data; Phylogenetic trees; Species trees
Year: 2020 PMID: 32689946 PMCID: PMC7370488 DOI: 10.1186/s12864-020-06892-5
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 An overview of the experimental pipeline of this study. The input is either a set of sequences or a complete distance matrix. We generate an incomplete distance matrix from the input sequences or the input complete distance matrix by using various missingness mechanisms. Next, we apply various imputation techniques to impute the missing entries in the incomplete distance matrix, thereby generating (complete) imputed distance matrices. We then estimate phylogenetic trees from the imputed distance matrices using FastME. Finally, we compare the estimated trees with the model tree to evaluate the performance of the various imputation techniques
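The first two stages of this pipeline (creating an incomplete matrix, then imputing it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: entries are removed uniformly at random, which is just one possible missingness mechanism, and the imputation shown is a naive global-mean fill that stands in for the DAMBE, LASSO, MF, and AE methods evaluated in the study. The function names `mask_entries` and `impute_mean` are hypothetical, and the tree estimation (FastME) and tree comparison steps are not covered.

```python
import random


def mask_entries(dist, k, seed=0):
    """Return a copy of the symmetric matrix `dist` in which k randomly
    chosen off-diagonal pairs are set to None at both (i, j) and (j, i)."""
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if k > len(pairs):
        raise ValueError("cannot remove more entries than the matrix holds")
    rng = random.Random(seed)
    masked = [row[:] for row in dist]  # leave the input untouched
    for i, j in rng.sample(pairs, k):
        masked[i][j] = masked[j][i] = None
    return masked


def impute_mean(masked):
    """Naive stand-in imputation (NOT a method from the study): replace every
    missing entry with the mean of the observed off-diagonal distances."""
    n = len(masked)
    observed = [masked[i][j] for i in range(n) for j in range(i + 1, n)
                if masked[i][j] is not None]
    if not observed:
        raise ValueError("no observed distances to impute from")
    fill = sum(observed) / len(observed)
    return [[fill if v is None else v for v in row] for row in masked]


if __name__ == "__main__":
    # Toy 4-taxon distance matrix (made-up numbers, for illustration only).
    full = [[0, 1, 2, 3],
            [1, 0, 4, 5],
            [2, 4, 0, 6],
            [3, 5, 6, 0]]
    incomplete = mask_entries(full, k=2, seed=1)
    completed = impute_mean(incomplete)
    for row in completed:
        print(row)
```

Masking both `(i, j)` and `(j, i)` keeps the matrix symmetric, so each removed pair counts as one missing entry, as in the "#Missing Entries" columns of the tables.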
RF rates of different methods on the 24 taxa dataset with varying numbers of missing entries. The best RF rates for various model conditions are shown in boldface
| #Taxa | #Entries | #Missing | RF Rate | |||
|---|---|---|---|---|---|---|
| Entries | DAMBE | LASSO | MF | AE | ||
| 10 | 0.0476 | 0.2857 | 0.0952 | |||
| 20 | 0.3333 | 0.1905 | 0.2381 | |||
| 30 | 0.2381 | 0.3333 | 0.2381 | |||
| 40 | 0.3333 | 0.3333 | ||||
| 50 | 0.3333 | 0.1905 | 0.4286 | 0.3333 | ||
| 60 | 0.2857 | 0.3333 | 0.381 | |||
| 70 | 0.4286 | 0.5714 | 0.381 | |||
| 24 | 276 | 80 | 0.4762 | 0.6667 | ||
| 90 | 0.5714 | |||||
| 100 | 0.5714 | 0.7143 | 0.6190 | |||
| 110 | 0.7143 | 0.8571 | ||||
| 120 | 0.8095 | 0.7619 | 0.8571 | |||
| 130 | 0.8571 | 0.8095 | ||||
| 140 | N/A | N/A | ||||
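For readers who want to see how an RF rate of this kind can be computed, the sketch below compares the non-trivial bipartitions of two trees and divides the number of differing bipartitions by 2n − 6, the maximum for two binary unrooted trees on n taxa. This normalization is an assumption, although it is consistent with the 24-taxon values above, which are multiples of 1/21 (for example, 0.0476 ≈ 1/21 = 2/42). Trees are written as nested tuples of leaf names, and `bipartitions` and `rf_rate` are hypothetical helper names, not code from the study.

```python
def _leaf_sets(tree, out):
    """Append the leaf set under every internal node to `out` and
    return the leaf set of `tree` itself."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(_leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Return (non-trivial bipartitions, all leaves) of a nested-tuple tree.
    Each bipartition is stored as the side that excludes a reference leaf,
    so the rooting of the tuple representation does not matter."""
    clades = []
    all_leaves = _leaf_sets(tree, clades)
    ref = min(all_leaves)
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits, all_leaves


def rf_rate(tree_a, tree_b):
    """Normalized Robinson-Foulds distance, assuming the divisor 2n - 6."""
    splits_a, leaves_a = bipartitions(tree_a)
    splits_b, leaves_b = bipartitions(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must be on the same set of taxa")
    n = len(leaves_a)
    if n < 4:
        raise ValueError("need at least four taxa")
    return len(splits_a ^ splits_b) / (2 * (n - 3))


if __name__ == "__main__":
    model = ((("A", "B"), "C"), ("D", "E"))
    estimate = ((("A", "C"), "B"), ("D", "E"))
    print(rf_rate(model, estimate))
```

An RF rate of 0 means the estimated tree has the same topology as the reference tree, and 1 means the two trees share no non-trivial bipartition.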
Fig. 2 Phylogenetic trees estimated on the full and incomplete dataset (30 missing entries) with 24 OTUs from 10 Hawaiian katydid species. a Tree estimated from the full data (complete distance matrix). b-e Trees reconstructed from the incomplete distance matrix by DAMBE, LASSO, MF, and AE, respectively. Red rectangles highlight the inconsistencies with the tree on the full dataset
Average RF rates (± standard error) of different methods on the 37-taxon dataset for varying numbers of missing entries and two different sequence evolution models. For each model condition, we show the average RF rate and standard error over 10 replicates. The best RF rates for various model conditions are shown in boldface
| #Taxa | #Entries | Scaling | Model | #Missing | Average RF Rate | |||
|---|---|---|---|---|---|---|---|---|
| Entries | DAMBE | LASSO | MF | AE | ||||
| 36 | 0.41 ±0.02 | 0.72 ±0.03 | 0.41 ±0.02 | |||||
| 100 | 0.48 ±0.02 | 0.72 ±0.03 | 0.46 ±0.02 | |||||
| TN93 | 225 | 0.72 ±0.03 | 0.78 ±0.03 | 0.70 ±0.02 | ||||
| 37 | 666 | 1X | 342 | N/A | N/A | 0.99 ±0.02 | ||
| 36 | 0.41 ±0.02 | 0.71 ±0.02 | 0.4 ±0.02 | |||||
| 100 | 0.49 ±0.02 | 0.72 ±0.02 | 0.5 ±0.03 | |||||
| LogDet | 225 | 0.72 ±0.02 | 0.76 ±0.02 | 0.72 ±0.02 | ||||
| 342 | N/A | N/A | 1 ±0 | |||||
| 36 | 0.45 ±0.02 | 0.69 ±0.02 | 0.43 ±0.02 | |||||
| 100 | 0.72 ±0.02 | 0.5 ±0.02 | 0.54 ±0.03 | |||||
| TN93 | 225 | 0.66 ±0.02 | 0.76 ±0.02 | 0.71 ±0.02 | ||||
| 37 | 666 | 0.5X | 342 | N/A | N/A | 1 ±0 | ||
| 36 | 0.45 ±0.02 | 0.68 ±0.02 | 0.42 ±0.02 | |||||
| 100 | 0.71 ±0.02 | 0.52 ±0.02 | 0.51 ±0.02 | |||||
| LogDet | 225 | 0.76 ±0.01 | 0.66 ±0.02 | 0.7 ±0.02 | ||||
| 342 | N/A | N/A | 0.99 ±0.02 | |||||
| 36 | 0.43 ±0.02 | 0.68 ±0.01 | 0.42 ±0.02 | |||||
| 100 | 0.69 ±0.02 | 0.52 ±0.02 | ||||||
| TN93 | 225 | 0.73 ±0.02 | 0.71 ±0.02 | 0.69 ±0.02 | ||||
| 37 | 666 | 2X | 342 | N/A | N/A | 0.99 ±0.01 | ||
| 36 | 0.44 ±0.02 | 0.63 ±0.02 | 0.4 ±0.01 | |||||
| 100 | 0.66 ±0.02 | 0.54 ±0.02 | 0.52 ±0.02 | |||||
| LogDet | 225 | 0.73 ±0.01 | 0.7 ±0.02 | 0.69 ±0.02 | ||||
| 342 | N/A | N/A | 0.99 ±0.01 | |||||
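The "#Entries" column in these tables matches the number of unordered taxon pairs, that is, the number of off-diagonal entries in the upper triangle of an n × n distance matrix:

$$
\#\text{Entries} = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad \text{e.g. } \frac{24 \cdot 23}{2} = 276, \quad \frac{37 \cdot 36}{2} = 666, \quad \frac{201 \cdot 200}{2} = 20100 .
$$

Under this reading, the largest missingness level in the 37-taxon table removes 342 of 666 entries, or roughly 51% of the matrix.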
Average RF rates (± standard error) of different methods on the 201-taxon dataset. The best RF rates for various model conditions are shown in boldface
| #Taxa | #Entries | Model | #Missing | Average RF Rate | ||
|---|---|---|---|---|---|---|
| Entries | LASSO | MF | AE | |||
| 400 | 0.6 ±0.02 | |||||
| 1024 | 0.61 ±0.02 | 0.4 ±0.05 | ||||
| TN93 | 2500 | 0.62 ±0.02 | 0.41 ±0.03 | |||
| 5625 | 0.63 ±0.03 | 0.44 ±0.03 | ||||
| 201 | 20100 | 10100 | N/A | 0.59 ±0.02 | ||
| 400 | 0.59 ±0.02 | 0.38 ±0.02 | ||||
| 1024 | 0.62 ±0.01 | 0.4 ±0.03 | ||||
| LogDet | 2500 | 0.61 ±0.02 | 0.41 ±0.02 | |||
| 5625 | 0.62 ±0.02 | 0.46 ±0.03 | ||||
| 10100 | N/A | 0.58 ±0.03 | ||||
Average RF rates (± standard error) of different methods on the Carnivores dataset. The best RF rates for various model conditions are shown in boldface
| #Taxa | #Entries | #Missing | Average RF Rate | |||
|---|---|---|---|---|---|---|
| Entries | DAMBE | LASSO | MF | AE | ||
| 5 | 0.29 ±0.06 | 0.37 ±0.1 | 0.23 ±0.07 | |||
| 10 | 0.6 ±0.03 | 0.71 ±0.06 | ||||
| 10 | 45 | 15 | 0.63 ±0.07 | 0.83 ±0.09 | 0.57 ±0.04 | |
| 20 | 0.77 ±0.03 | 0.94 ±0.07 | 0.63 ±0.05 | |||
| 25 | N/A | N/A | 0.94 ±0.05 | |||
Average RF rates (± standard error) of different methods on the Baculovirus dataset. The best RF rates for various model conditions are shown in boldface
| #Taxa | #Entries | #Missing | Average RF Rate | |||
|---|---|---|---|---|---|---|
| Entries | DAMBE | LASSO | MF | AE | ||
| 4 | 0.27 ±0.08 | 0.29 ±0.03 | 0.39 ±0.04 | |||
| 8 | 0.5 ±0.11 | 0.5 ±0.1 | 0.39 ±0.08 | |||
| 9 | 36 | 12 | 0.7 ±0.07 | 0.49 ±0.05 | 0.5 ±0 | |
| 16 | 0.7 ±0.06 | 0.67 ±0.05 | 0.57 ±0.08 | |||
| 20 | N/A | N/A | ||||
Average RF rates (± standard error) of different methods on the mtDNAPri3F84SE dataset. The best RF rates for various model conditions are shown in boldface
| #Taxa | #Entries | #Missing | Average RF Rate | |||
|---|---|---|---|---|---|---|
| Entries | DAMBE | LASSO | MF | AE | ||
| 2 | 0.1 ±0.04 | 0.4 ±0.15 | 0.15 ±0.09 | |||
| 5 | 0.55 ±0.08 | 0.5 ±0.1 | ||||
| 7 | 21 | 7 | 0.4 ±0.11 | 0.75 ±0.07 | 0.8 ±0.19 | |
| 10 | 0.65 ±0.17 | 0.8 ±0.04 | 0.7 ±0.04 | |||
| 12 | N/A | N/A | 0.9 ±0.05 | |||
Fig. 3 a General overview of an autoencoder. b A schematic of our proposed autoencoder model. The X’s in the dropout layers symbolically denote that their weights will be set to zero
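To make the role of the dropout layers concrete, here is a minimal, dependency-free sketch of an encoder-decoder forward pass with dropout applied before each dense layer. It is not the architecture from the figure: the layer sizes, the ReLU activation, the placement of the dropout layers, and the "inverted dropout" rescaling of surviving units are all assumptions, and the names `dropout`, `dense`, and `autoencode` are hypothetical. The sketch zeroes unit outputs, which has the same effect on the next layer as zeroing the weights leaving those units.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the survivors by
    1 / (1 - rate) so the expected sum is unchanged (assumed convention)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def autoencode(x, enc_w, enc_b, dec_w, dec_b, rate=0.0, rng=None):
    """Encode `x` to a hidden code and decode it back, applying dropout
    to the input and to the hidden code."""
    rng = rng or random.Random(0)
    hidden = relu(dense(dropout(x, rate, rng), enc_w, enc_b))
    return dense(dropout(hidden, rate, rng), dec_w, dec_b)


if __name__ == "__main__":
    # Toy weights (identity maps) chosen only so the example runs.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [0.0, 0.0]
    print(autoencode([0.5, 2.0], identity, zeros, identity, zeros, rate=0.5))
```

With `rate=0.0` no units are dropped, which corresponds to using the trained network at prediction time.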