Will Pm Rowe1, Anna Paola Carrieri2, Cristina Alcon-Giner3, Shabhonam Caim3, Alex Shaw4, Kathleen Sim4, J Simon Kroll4, Lindsay J Hall5, Edward O Pyzer-Knapp2, Martyn D Winn6.
Abstract
BACKGROUND: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.
Year: 2019 PMID: 30878035 PMCID: PMC6420756 DOI: 10.1186/s40168-019-0653-2
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Summary of technical terms
| Term | Definition |
|---|---|
| Consistent weighted sampling | An efficient method of sub-sampling histogram data that takes into account the frequency of each bin |
| De novo | Analyses based solely on the collected sequence data |
| Dimensionality reduction | Representing the sequence data in a metagenome by a relatively small number of collective quantities |
| Dissimilarity measure | A measure of how dissimilar two metagenomes are, typically used to identify significant changes in microbiome composition |
| Feature vectors | A set of key quantities of a dataset that can be used as input to a machine learning algorithm |
| Histosketch | A small approximate representation of histogram data, such as a k-mer spectrum |
| Jaccard similarity | A measure of the similarity of two datasets based on the proportion of shared members |
| K-mer | A short sub-sequence extracted from a read or genome |
| K-mer spectrum | The set of all observed k-mers, together with their abundances in the sequence dataset |
| Locality-sensitive hashing | A method of dimensionality reduction which hashes sequence data in such a way that similar sequences are kept together |
| Reference-based | Making use of existing reference genomes to align and classify new sequencing data |
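
Several of the terms above (k-mer, k-mer spectrum, Jaccard similarity) can be illustrated with a few lines of Python. The following is a minimal sketch for intuition only, not the authors' implementation: the function names are hypothetical, the sequences are toy strings, and the weighted (min/max) form of the Jaccard similarity is an assumption about how abundances would be taken into account.

```python
from collections import Counter


def kmer_spectrum(sequence: str, k: int) -> Counter:
    """Count every k-mer (length-k substring) observed in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def jaccard(a: Counter, b: Counter) -> float:
    """Plain Jaccard similarity: shared k-mers over all distinct k-mers."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 1.0


def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Weighted Jaccard similarity: sum of bin minima over sum of bin maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a[key], b[key]) for key in keys)
    denominator = sum(max(a[key], b[key]) for key in keys)
    return numerator / denominator if denominator else 1.0


# Toy sequences (hypothetical, for illustration only).
spec_a = kmer_spectrum("ACGTACGT", 3)
spec_b = kmer_spectrum("ACGTTTTT", 3)
plain = jaccard(spec_a, spec_b)
weighted = weighted_jaccard(spec_a, spec_b)
```

The plain form ignores how often each k-mer occurs, whereas the weighted form penalises differences in abundance, which is why the two values differ for the same pair of spectra.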
Fig. 1 Overview of our method to histosketch microbiome samples from sequence data streams. a During counting, sequence reads are collected from the data stream by n counting processes. Reads are decomposed to canonical k-mers, encoded to uint64 values and used to increment local count-min sketches. Once X reads have been received from the data stream, approximate k-mer counts from the counting processes are transmitted as histogram elements to the single sketching process. b To update the histosketch, the incoming histogram element is hashed and compared against each hash value (W) of the previous histosketch (S), updating S and W if a new minimum is encountered. To hash the incoming vector, uniform scaling is applied and a cumulative frequency estimate is made using a count-min sketch; we then utilise consistent weighted sampling (CWS) to generate a hash value for the updated histogram bin
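
The counting stage described in Fig. 1a can be sketched as follows. This is an illustrative approximation under stated assumptions, not the HULK code: the 2-bit base encoding, the BLAKE2b row hashes, the sketch dimensions and all names are hypothetical choices, reads are assumed to contain only A, C, G and T, and the consistent weighted sampling step of Fig. 1b is not shown.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def canonical_kmers(read: str, k: int):
    """Yield the lexicographically smaller of each k-mer and its reverse complement."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        reverse_complement = kmer.translate(COMPLEMENT)[::-1]
        yield min(kmer, reverse_complement)


def encode(kmer: str) -> int:
    """Pack a k-mer (k <= 32) into an integer that fits in 64 bits, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


class CountMinSketch:
    """Approximate counter: estimates never fall below the true count."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: int):
        # One salted hash per row selects the column to update or read.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.to_bytes(8, "little"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: int, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item: int) -> int:
        return min(self.table[row][col] for row, col in self._cells(item))


# Toy read (hypothetical): count its canonical 3-mers.
sketch = CountMinSketch()
for kmer in canonical_kmers("ACGTACGT", 3):
    sketch.add(encode(kmer))
```

In this toy read, ACG and CGT are reverse complements of each other and so share one counter, as do GTA and TAC. Because hash collisions can only inflate a cell, `sketch.estimate(encode("ACG"))` is at least the true count of 4.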
Fig. 2 Hierarchical clustering of CAMI short read microbiome samples [32]. Heatmaps show the pairwise Jaccard similarity between microbiome samples (ranging from 0% (blue) to 100% (red)); colormap ranges are computed using robust quantiles and dendrogram clades are coloured by body site. a HULK histosketches (k-mer size = 21, histosketch size = 512) for 48 microbiome samples were sketched in 1 min 30 s (12 cores per histosketch). b sourmash MinHash sketches (k-mer size = 21, sketch size = 512, track abundance = true) for 48 samples were sketched in 25 min 17 s. c Simka k-mer spectra (k-mer size = 21) for 48 microbiome samples were computed in 24 min 1 s (12 cores per spectrum)
Fig. 3 Hierarchical clustering of dog microbiome samples [34]. a, b and c correspond to clustered histosketches from 0.005%, 0.05% and 0.5% of sample reads, respectively. Heatmaps show the pairwise Jaccard similarity between microbiome samples (ranging from 0% (blue) to 100% (red)); colormap ranges are computed using robust quantiles and dendrogram clades are coloured by diet. The majority of microbiome samples from the dogs on the baseline diet clustered together (green); however, the samples taken after these dogs were put on to an altered diet (pink/blue) did not show any distinct clustering pattern
Fig. 4 Principal component analysis of histosketches from CAMI short read microbiomes, with the 48 samples coloured by body site [32]. Circular data points indicate the histosketches used to build the LSH Forest index and star data points indicate histosketches used as search queries. Red rings enclose the returned LSH Forest search results for each search query (Jaccard similarity threshold > 90%)
Average random forest classification runtimes for predicting antibiotic vs. no-antibiotic treated neonatal microbiomes using read sampling intervals and concept drift (probability threshold = 0.9, k = 7, s = 42, decay ratio = 0.02, p = 8)
| Sampling interval (no. reads) | Runtime to reach initial classification (seconds) | Initial classification probability | Runtime to reach ≥ 0.9 classification probability (seconds) | Sampling intervals to reach ≥ 0.9 classification probability |
|---|---|---|---|---|
| No interval | 28.38 | 0.99 | 28.38 | na |
| 1,000,000 | 9.16 | 0.96 | 9.16 | 1 |
| 100,000 | 2.08 | 0.87 | 2.09 | 2 |
| 10,000 | 1.91 | 0.82 | 2.17 | 4 |
na: not applicable
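
The last two columns of the table report when the classification probability first reaches the 0.9 threshold. A minimal sketch of that stopping rule is given below, assuming the classifier is queried once per sampling interval; the function name and the probability values are hypothetical and are not taken from the study.

```python
from typing import Iterable, Optional, Tuple


def intervals_to_threshold(
    probabilities: Iterable[float], threshold: float = 0.9
) -> Optional[Tuple[int, float]]:
    """Return (interval number, probability) for the first interval that
    meets the threshold, or None if the threshold is never reached."""
    for interval, probability in enumerate(probabilities, start=1):
        if probability >= threshold:
            return interval, probability
    return None


# Hypothetical per-interval probabilities, for illustration only.
result = intervals_to_threshold([0.5, 0.7, 0.95])
```

Because the input can be a generator, classification could in principle stop consuming the stream as soon as the threshold is met.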