Kaustubh R Patil, Alice C McHardy.
Abstract
Understanding the evolutionary relationships between organisms is vital for their in-depth study. Gene-based methods are often used to infer such relationships, but these methods are not without drawbacks. Because of the ever-increasing number of genomes available, one can now attempt to use genome-scale information. This opportunity also presents a challenge in terms of computational efficiency. Two fundamentally different kinds of methods are often employed for sequence comparisons, namely alignment-based and alignment-free methods. Alignment-free methods rely on the genome signature concept and provide a computationally efficient approach that is also applicable to nonhomologous sequences. The genome signature carries an evolutionary signal, as it is more similar for closely related organisms than for distantly related ones. We used genome-scale sequence information to infer taxonomic distances between organisms without additional information such as gene annotations. We propose a method to improve genome tree inference by learning specific distance metrics over the genome signature for groups of organisms with similar phylogenetic, genomic, or ecological properties. Specifically, our method learns a Mahalanobis metric for a set of genomes, using a reference taxonomy to guide the learning process. By applying this method to more than a thousand prokaryotic genomes, we showed that better distance metrics could indeed be learned for most of the 18 groups of organisms tested here. Once a group-specific metric is available, it can be used to estimate the taxonomic distances for other sequenced organisms from the group. This study also presents a large-scale comparison of 10 methods: 9 alignment-free and 1 alignment-based.
Keywords: alignment; alignment-free; distance metric learning; genome signature; genome tree; sequence comparison; taxonomy
Year: 2013 PMID: 23843191 PMCID: PMC3762195 DOI: 10.1093/gbe/evt105
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
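As a rough illustration of the two ingredients named in the abstract, the sketch below computes a k-mer frequency vector as a stand-in for a genome signature and evaluates a Mahalanobis distance between two such vectors. The function names, the choice of k = 4, and the plain relative-frequency normalization are assumptions made for illustration. The record does not specify the authors' signature definition, and the matrix `M` here is supplied by the caller rather than learned.

```python
from itertools import product
from math import sqrt


def kmer_signature(seq, k=4):
    """Relative frequencies of all 4**k DNA k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a square matrix M given as a list of rows."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(m * dj for m, dj in zip(row, d)) for row in M]
    return sqrt(sum(di * v for di, v in zip(d, Md)))


if __name__ == "__main__":
    # Two toy sequences (hypothetical, far shorter than any real genome).
    a = kmer_signature("ACGTACGTTTGACCA", k=2)
    b = kmer_signature("TTGACCATTGACGGA", k=2)
    identity = [[1.0 if i == j else 0.0 for j in range(len(a))]
                for i in range(len(a))]
    print(mahalanobis(a, b, identity))
```

With `M` set to the identity matrix the distance reduces to the Euclidean distance. Metric learning, as described in the abstract, would replace `M` with a positive semidefinite matrix fitted with the help of a reference taxonomy.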
Figure: Performance on the phylogenetic groups. Each bar shows a performance measure along with error bars showing SD.
Figure: Performance on the GC-content groups. Each bar shows a performance measure along with error bars showing SD.
Figure: Performance on the ecological groups from three attributes. The bars show the performance measures and the error bars indicate SD.
P Values from the One-Sided Wilcoxon Rank Test, Testing the Specificity of the Learned Metrics for the Respective Groups
| Attribute | Group | CPCC | QD |
|---|---|---|---|
| Phylum | Proteobacteria | | |
| | Firmicutes | | |
| | Actinobacteria | | |
| | Euryarchaeota | | |
| GC-content | ≤30% | | |
| | >30–≤50% | | |
| | >50–≤70% | | |
| Habitat | Aquatic | 0.5957 | 0.3762 |
| | Terrestrial | | |
| | Multiple | | |
| | Host-associated | | |
| | Specialized | | |
| Temperature range | Hyperthermophilic | | |
| | Thermophilic | | |
| | Mesophilic | 0.8850 | 0.6349 |
| Oxygen requirement | Aerobic | 0.1150 | |
| | Anaerobic | | |
| | Facultative | | |
Note.—While for the CPCC, the alternative hypothesis was that the group-specific metrics produce higher CPCC values than randomly learned metrics, for the QD, the alternative hypothesis was that the group-specific metrics result in lower quartet distances than the randomly learned metrics. Significant results (<0.05) are shown in bold.
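The one-sided comparison described in the note can be sketched with an exact Wilcoxon rank-sum test that enumerates every way of splitting the pooled values into two groups. This is a minimal sketch under stated assumptions, not the authors' procedure: the function names and the sample values are hypothetical, and the full enumeration is only practical for small samples.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def rank_sum_greater(x, y):
    """Exact one-sided p-value for H1: values in x tend to exceed those in y."""
    ranks = midranks(list(x) + list(y))
    observed = sum(ranks[:len(x)])
    hits = total = 0
    for subset in combinations(range(len(ranks)), len(x)):
        total += 1
        # Small tolerance guards against float noise in sums of midranks.
        if sum(ranks[i] for i in subset) >= observed - 1e-9:
            hits += 1
    return hits / total


if __name__ == "__main__":
    # Hypothetical CPCC values: group-specific metric vs. randomly learned metric.
    specific = [0.52, 0.55, 0.49, 0.58]
    random_metric = [0.41, 0.50, 0.44, 0.39]
    print(rank_sum_greater(specific, random_metric))
```

For a measure where lower is better, such as the quartet distance, the arguments would be swapped so that the alternative hypothesis matches the direction given in the note.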
The CPCC and Quartet Distance before (CPCC and QD) and after (CPCC_PCA and QD_PCA) PCA based on the l4n1 Signature
| Attribute | Group | CPCC | CPCC_PCA | QD | QD_PCA | Dimension | Variance (%) |
|---|---|---|---|---|---|---|---|
| Phylum | Proteobacteria | 0.42 | 0.43 | 0.45 | 0.43 | 21 | 94.46 |
| | Firmicutes | 0.57 | 0.54 | 0.32 | 0.29 | 20 | 96.25 |
| | Actinobacteria | 0.39 | 0.44 | 0.55 | 0.50 | 19 | 96.32 |
| | Euryarchaeota | 0.46 | 0.45 | 0.47 | 0.43 | 20 | 97.20 |
| GC-content | ≤30% | 0.30 | 0.34 | 0.43 | 0.40 | 19 | 96.73 |
| | >30–≤50% | 0.36 | 0.34 | 0.51 | 0.51 | 25 | 94.27 |
| | >50–≤70% | 0.44 | 0.48 | 0.48 | 0.43 | 22 | 94.49 |
| Habitat | Aquatic | 0.39 | 0.38 | 0.51 | 0.51 | 24 | 94.78 |
| | Terrestrial | 0.39 | 0.45 | 0.39 | 0.38 | 18 | 96.43 |
| | Multiple | 0.37 | 0.36 | 0.46 | 0.45 | 21 | 95.17 |
| | Host-associated | 0.17 | 0.18 | 0.51 | 0.51 | 21 | 94.65 |
| | Specialized | 0.20 | 0.19 | 0.57 | 0.57 | 23 | 95.28 |
| Temperature range | Hyperthermophilic | 0.46 | 0.41 | 0.43 | 0.46 | 18 | 97.93 |
| | Thermophilic | 0.19 | 0.24 | 0.59 | 0.58 | 22 | 96.03 |
| | Mesophilic | 0.25 | 0.24 | 0.51 | 0.52 | 22 | 93.49 |
| Oxygen requirement | Aerobic | 0.34 | 0.34 | 0.56 | 0.56 | 22 | 94.65 |
| | Anaerobic | 0.19 | 0.20 | 0.58 | 0.55 | 24 | 94.71 |
| | Facultative | 0.46 | 0.47 | 0.30 | 0.35 | 23 | 95.32 |
| Average | | 0.35 | 0.36 | 0.48 | 0.47 | 21.33 | 95.45 |
Note.—The dimension and variance columns show the number of dimensions and variance retained, respectively. No significant improvement was observed after applying PCA either for the CPCC or the QD (P > 0.3, one-sided Wilcoxon rank sum test, see text for details).
Correlation of the Mean Change in the CPCC with Different Statistics across the Groups
| Correlation | Value | No. of Genomes | No. of Species | Genome Size (Mean) | Genome Size (SD) | GC-Content (Mean) | GC-Content (SD) | NRI | NTI |
|---|---|---|---|---|---|---|---|---|---|
| Pearson’s | R | 0.03 | 0.02 | ||||||
| 0.52 | 0.19 | 0.22 | 0.92 | 0.95 | 0.19 | ||||
| Spearman’s | ρ | 0.06 | 0.03 | ||||||
| | | 0.07 | 0.63 | 0.09 | 0.09 | 0.81 | 0.93 | 0.12 | 0.32 |
Note.—The Actinobacteria and Euryarchaeota groups were removed for this analysis, as they behaved like outliers (see text for details). Significant results (P < 0.05) are shown in bold.
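The two correlation measures named in the table can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the input numbers are made up rather than taken from the study, and no significance test is included.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(v):
    """1-based ranks; ties receive the mean of the positions they occupy."""
    s = sorted(v)
    return [s.index(a) + (s.count(a) + 1) / 2 for a in v]


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up per-group values: mean change in CPCC vs. a group statistic.
    cpcc_change = [0.02, 0.05, 0.01, 0.07, 0.03]
    group_stat = [120, 340, 90, 410, 200]
    print(pearson(cpcc_change, group_stat))
    print(spearman(cpcc_change, group_stat))
```

Spearman's coefficient depends only on the ordering of the values, so it is less sensitive than Pearson's to groups that behave like outliers.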