Kathrin P Aßhauer, Heiner Klingenberg, Thomas Lingner, Peter Meinicke.
Abstract
The variety of metagenomes in current databases provides a rapidly growing source of information for comparative studies. However, the quantity and quality of supplementary metadata are still lagging behind. It is therefore important to be able to identify related metagenomes by means of the available sequence data alone. We have studied efficient sequence-based methods for large-scale identification of similar metagenomes within a database retrieval context. In a broad comparison of different profiling methods we found that vector-based distance measures are well suited for the detection of metagenomic neighbors. Our evaluation on more than 1700 publicly available metagenomes indicates that for a query metagenome from a particular habitat, on average nine out of ten nearest neighbors represent the same habitat category, independent of the utilized profiling method or distance measure. While a neighborhood accuracy of 100% can be achieved for well-defined labels, in general the neighbor detection is severely affected by a natural overlap of manually annotated categories. In addition, we present results of a novel visualization method that is able to reflect the similarity of metagenomes in a 2D scatter plot. The visualization method shows a similarly high accuracy in the reduced space as compared with the high-dimensional profile space. Our study suggests that for inspection of metagenome neighborhoods the profiling methods and distance measures can be chosen to provide a convenient interpretation of results in terms of the underlying features. Furthermore, supplementary metadata of metagenome samples will in the future need to comply with readily available ontologies for fine-grained and standardized annotation. To make profile-based k-nearest-neighbor search and the 2D visualization of the metagenome universe available to the research community, we included the proposed methods in our CoMet-Universe server for comparative metagenome analysis.
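The neighborhood evaluation described in the abstract can be illustrated with a minimal sketch. It assumes each metagenome is represented by a normalized feature profile (one row per sample) and takes neighborhood accuracy to be the mean fraction of a query's k nearest neighbors, under the City block metric, that share the query's habitat label. The function names, the toy profiles, and the labels are hypothetical; the exact averaging scheme used in the study may differ.

```python
import numpy as np

def cityblock_distances(profiles):
    """Pairwise City block (L1) distances between the rows of `profiles`."""
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.abs(diff).sum(axis=2)

def neighborhood_accuracy(profiles, labels, k=10):
    """Mean fraction of the k nearest neighbors that share the query's label."""
    labels = np.asarray(labels)
    dist = cityblock_distances(np.asarray(profiles, dtype=float))
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    same_label = labels[neighbors] == labels[:, None]
    return same_label.mean()

# Hypothetical toy data: four samples, three features, two habitats.
profiles = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])
labels = ["gut", "gut", "soil", "soil"]

acc_k1 = neighborhood_accuracy(profiles, labels, k=1)
acc_k3 = neighborhood_accuracy(profiles, labels, k=3)
print(acc_k1, acc_k3)
```

With k=1 every toy sample finds its same-habitat partner first; with k=3 all other samples count as neighbors, so only one in three shares the label. Swapping in another vector-based distance only requires replacing `cityblock_distances`.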
Year: 2014 PMID: 25026170 PMCID: PMC4139848 DOI: 10.3390/ijms150712364
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Neighborhood accuracy on Human Microbiome Project (HMP) data for different profiling methods, metrics and body sites. (A) Accuracy of profiling methods with average/minimum/maximum over six different metrics; (B) Accuracy of distance metrics with average/minimum/maximum over all nine profiling methods; (C) Body site-specific accuracy for City block metric averaged over nine profiling methods; (D) Confusion matrix of neighborhood evaluation for different body sites according to UProC protein domain profiles and City block metric. Values represent rounded percentages and entries lower than 0.5 are omitted.
Figure 2. Neighborhood accuracy on metagenome universe collection for different methods and habitats. (A) Accuracy of profiling methods with average/minimum/maximum over six different metrics; (B) Accuracy of distance metrics with average/minimum/maximum over all seven profiling methods; (C) Habitat-specific accuracy for City block metric averaged over seven profiling methods; (D) Heatmap of confusion matrix for different habitats according to UProC protein domain profiles and City block metric. Habitat labels on y-axis abbreviated to three letters.
Figure 3. 2D representation of metagenome universe for different dimension reduction methods using UProC protein domain profile space. Markers represent metagenome datasets with colors corresponding to habitat labels as provided in the legend in subfigure (A). (A) Principal component analysis (PCA) using Euclidean metric with dimension-specific variance in parentheses; (B) Multidimensional scaling (MDS) using City block metric with dimension-specific variance in parentheses; (C) Sammon mapping using City block metric; (D) Unsupervised kernel regression (UKR) using City block metric.
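As a rough illustration of the kind of projection shown in panel (B), the following sketch applies classical (metric) MDS to a City block distance matrix and reports the share of variance carried by each of the two output dimensions. Whether the study used this exact MDS variant is an assumption; the function name and the toy profiles are hypothetical. Because City block distances are not guaranteed to be Euclidean, negative eigenvalues are clipped to zero here.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical MDS: embed a distance matrix via a double-centered Gram matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # largest first
    top_vals = np.clip(eigvals[order], 0.0, None)
    coords = eigvecs[:, order] * np.sqrt(top_vals)
    # Variance share per dimension, relative to all non-negative eigenvalues.
    explained = top_vals / np.clip(eigvals, 0.0, None).sum()
    return coords, explained

# Hypothetical toy profiles: two "gut"-like and two "soil"-like samples.
profiles = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])
dist = np.abs(profiles[:, None, :] - profiles[None, :, :]).sum(axis=2)

coords, explained = classical_mds(dist)
print(coords.shape, explained)
```

The resulting `coords` array can be passed to any scatter-plot routine, with marker colors taken from habitat labels as in the figure.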