James Kaminski, Molly K Gibson, Eric A Franzosa, Nicola Segata, Gautam Dantas, Curtis Huttenhower.
Abstract
Profiling microbial community function from metagenomic sequencing data remains a computationally challenging problem. Mapping millions of DNA reads from such samples to reference protein databases requires long run-times, and short read lengths can result in spurious hits to unrelated proteins (loss of specificity). We developed ShortBRED (Short, Better Representative Extract Dataset) to address these challenges, facilitating fast, accurate functional profiling of metagenomic samples. ShortBRED consists of two components: (i) a method that reduces reference proteins of interest to short, highly representative amino acid sequences ("markers") and (ii) a search step that maps reads to these markers to quantify the relative abundance of their associated proteins. After evaluating ShortBRED on synthetic data, we applied it to profile antibiotic resistance protein families in the gut microbiomes of individuals from the United States, China, Malawi, and Venezuela. Our results support antibiotic resistance as a core function in the human gut microbiome, with tetracycline-resistant ribosomal protection proteins and Class A beta-lactamases being the most widely distributed resistance mechanisms worldwide. ShortBRED markers are applicable to other homology-based search tasks, which we demonstrate here by identifying phylogenetic signatures of antibiotic resistance across more than 3,000 microbial isolate genomes. ShortBRED can be applied to profile a wide variety of protein families of interest; the software, source code, and documentation are available for download at http://huttenhower.sph.harvard.edu/shortbred.
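The second component described in the abstract, turning read hits against markers into a per-family abundance, can be illustrated with a minimal sketch. The normalization below is a generic RPKM-style formula (reads per kilobase of marker sequence per million reads) and is an assumption made for illustration; the abstract does not state the exact formula the software uses. The function name `family_abundance` and all of its parameters are hypothetical.

```python
from collections import defaultdict


def family_abundance(hits, marker_len_aa, marker_family, total_reads):
    """Illustrative RPKM-style abundance per protein family.

    hits:          dict marker_id -> number of reads mapped to that marker
    marker_len_aa: dict marker_id -> marker length in amino acids
    marker_family: dict marker_id -> protein family name
    total_reads:   total number of reads in the sample

    Assumption: abundance = reads per kilobase of marker nucleotides
    per million sample reads. This is a generic normalization, not
    necessarily the one ShortBRED applies.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")

    reads = defaultdict(int)
    length_nt = defaultdict(int)
    for marker, family in marker_family.items():
        reads[family] += hits.get(marker, 0)
        # Markers are amino acid sequences; 3 nucleotides encode 1 residue.
        length_nt[family] += 3 * marker_len_aa[marker]

    return {
        family: reads[family] * 1e9 / (length_nt[family] * total_reads)
        for family in reads
    }


# Toy data, invented for this example only.
example = family_abundance(
    hits={"m1": 10, "m2": 5},
    marker_len_aa={"m1": 100, "m2": 100, "m3": 50},
    marker_family={"m1": "famA", "m2": "famA", "m3": "famB"},
    total_reads=1_000_000,
)
```

Pooling all markers of a family before normalizing keeps families with many short markers comparable to families with one long marker.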
Year: 2015 PMID: 26682918 PMCID: PMC4684307 DOI: 10.1371/journal.pcbi.1004557
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 2. Accuracy of ShortBRED and centroid-based profiling within synthetic metagenomes.
(A, B) ROC curves report the sensitivity and specificity (in terms of TPR and FPR) of the two methods for correctly identifying the presence and absence of protein families of interest in six synthetic metagenomes, spiked with 5%, 10%, and 25% of their material from the ARDB (panel A) and VFDB (panel B). (C, D) Scatterplots of protein family “predicted from mapping”, the abundance values calculated by ShortBRED and the centroids, versus “expected from gold standard”, the abundance values of the protein families in the 10% synthetic metagenome.
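For readers unfamiliar with how ROC curves such as those in panels A and B are built, the sketch below computes (FPR, TPR) points by thresholding predicted abundances against a presence/absence gold standard. It is a generic illustration, not the authors' evaluation code; the function name `roc_points` and the toy inputs are hypothetical.

```python
def roc_points(scores, truth):
    """Return (FPR, TPR) pairs obtained by thresholding abundance scores.

    scores: dict family -> predicted abundance
    truth:  dict family -> True if the family is really present
    A family is called "present" when its score is >= the threshold.
    """
    pos = sum(1 for family in scores if truth[family])
    neg = len(scores) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one present and one absent family")

    points = [(0.0, 0.0)]  # threshold above every score: nothing called
    for threshold in sorted(set(scores.values()), reverse=True):
        tp = sum(1 for f, s in scores.items() if s >= threshold and truth[f])
        fp = sum(1 for f, s in scores.items() if s >= threshold and not truth[f])
        points.append((fp / neg, tp / pos))
    return points


# Toy data, invented for this example only.
toy_scores = {"a": 5.0, "b": 3.0, "c": 1.0, "d": 0.0}
toy_truth = {"a": True, "b": False, "c": True, "d": False}
toy_curve = roc_points(toy_scores, toy_truth)
```

Each distinct score is used once as a threshold, so the curve moves from (0, 0) to (1, 1) as the threshold is lowered.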
Characteristics of ShortBRED markers used to profile synthetic metagenomes.
| | ARDB | VFDB |
|---|---|---|
| Families (total) | 618 | 2,089 |
| Families with true markers | 594 | 2,041 |
| Families without true markers | 24 | 48 |
| Markers (total) | 2,886 | 7,869 |
| True markers | 2,845 | 7,730 |
| Junction markers | 37 | 139 |
| Quasi markers | 4 | 0 |
Legend: This table lists the number of protein families and marker types present in the ARDB and VFDB markers created by ShortBRED-Identify for profiling synthetic metagenomes.
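The counts in the table above are internally consistent: the rows reading 618 / 2,089 and 2,886 / 7,869 equal the sums of the family and marker rows listed beneath them. A short check, with the values transcribed from the table and hypothetical dictionary key names, makes the arithmetic explicit.

```python
# Values transcribed from the table above; key names are illustrative.
synthetic_markers = {
    "ARDB": {
        "families": 618, "with_true": 594, "without_true": 24,
        "markers": 2886, "true": 2845, "junction": 37, "quasi": 4,
    },
    "VFDB": {
        "families": 2089, "with_true": 2041, "without_true": 48,
        "markers": 7869, "true": 7730, "junction": 139, "quasi": 0,
    },
}


def totals_consistent(row):
    """Check that family and marker totals equal the sum of their parts."""
    families_ok = row["families"] == row["with_true"] + row["without_true"]
    markers_ok = row["markers"] == row["true"] + row["junction"] + row["quasi"]
    return families_ok and markers_ok


consistency = {db: totals_consistent(row) for db, row in synthetic_markers.items()}
```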
Characteristics of ShortBRED markers used to profile metagenomes and bacterial genomes.
| | HMP | T2D | T2D_Short | Yatsunenko | Bacterial Genomes |
|---|---|---|---|---|---|
| Clustering ID | 8 | 8 | 8 | 8 | 8 |
| Minimum marker length | 85% | 85% | 85% | 85% | 85% |
| Average read BP | 101 | 90 | 75 | 450 | 100 |
| Min trusted BP | 90 | 81 | 68 | 200 | 30 |
| QM length | 30 | 27 | 22 | 66 | 33 |
| Families after initial clustering | 849 | 849 | 849 | 849 | 849 |
| Families with true markers | 820 | 820 | 820 | ** | 820 |
| Families without true markers | 29 | 29 | 29 | ** | 29 |
| Markers (total) | 4132 | 4135 | 4142 | ** | 4132 |
| True markers | 4078 | 4078 | 4078 | ** | 4078 |
| Junction markers | 48 | 50 | 61 | ** | 48 |
| Quasi markers | 6 | 7 | 3 | ** | 6 |
| Samples profiled | 82 | 272 | 91 | 107 | 3305 |
Legend: This table lists characteristics of the markers used to profile the metagenomes and bacterial genomes. Each metagenome from the Chinese cohort was profiled with one of two sets of markers (T2D and T2D_Short) corresponding to the two different read sizes used in the dataset (90 and 75 bp). None of the families without True Markers were combined in the second round of clustering.
** Centroids were used to profile the Yatsunenko dataset.