Qiong Wang, Jordan A Fish, Mariah Gilman, Yanni Sun, C Titus Brown, James M Tiedje, James R Cole.
Abstract
BACKGROUND: Metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. Current methods often assemble only fragmented partial genes.
Keywords: Assembly; Biofuel crop; Functional gene; HMM; Metagenomics; Nitrogen cycle; nifH; nirK
Year: 2015 PMID: 26246894 PMCID: PMC4526283 DOI: 10.1186/s40168-015-0093-6
Source DB: PubMed Journal: Microbiome ISSN: 2049-2618 Impact factor: 14.650
Fig. 1 The Xander combined weighted assembly graph structure. M, I, and D represent HMM match, insert, and delete states, respectively. Numbers represent state position on the HMM. For simplicity, a kmer length of 6 is used and weights of the edges are not shown. The vertices shown in bold on the de Bruijn graph and profile hidden Markov model are combined to form the bold vertex in the combined graph. The green solid arrows represent all possible outgoing edges from these vertices. Boxes with ellipses indicate additional omitted graph structure. The delete HMM state is combined with the de Bruijn graph vertex from the last match; this carries forward the state information necessary to correctly form subsequent vertices in the combined graph. During path search, if this combined vertex becomes the best scoring vertex in the open set, it is removed from the open set and the adjacent combined vertices are instantiated and added to the open set.
Fig. 2 Xander gene assembly workflow. Two types of input sequences are required: one or more metagenomic read files used to build the de Bruijn graph and one set of reference sequences for each targeted gene, for building specialized profile HMMs using a modified version of HMMER 3.0 (see the “Implementation” section). During the search phase, Xander uses a combined weighted assembly graph to assemble genes (contigs). After assembly, several filters are applied at the quality filter step: chimeric genes, or genes below length cutoff or HMM score cutoff are discarded, and genes are clustered at 99 % aa identity and the longest one from each cluster is chosen as the representative. The quality-filtered genes are further processed to provide coverage and abundance information.
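
The quality filter step in the Fig. 2 caption can be sketched as follows. This is an illustrative outline only: the record fields `seq` and `hmm_score`, the cutoff parameters, and the function names are hypothetical, the chimera check is omitted, and `identity` is a crude ungapped comparison standing in for a proper alignment-based identity.

```python
def identity(a, b):
    """Crude ungapped identity over the shorter sequence (an assumption;
    a real pipeline would use an alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def quality_filter(contigs, min_length, min_score, cluster_identity=0.99):
    """Drop short or low-scoring contigs, then keep the longest contig
    of each greedy cluster as the representative."""
    kept = [
        c for c in contigs
        if len(c["seq"]) >= min_length and c["hmm_score"] >= min_score
    ]
    # Longest first, so the first member of every cluster is its representative.
    kept.sort(key=lambda c: len(c["seq"]), reverse=True)
    representatives = []
    for c in kept:
        if not any(
            identity(c["seq"], r["seq"]) >= cluster_identity
            for r in representatives
        ):
            representatives.append(c)
    return representatives
```

Sorting by length before clustering is one simple way to guarantee that the longest member represents each cluster, as the caption describes.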
Percent of vertices opened with pruning compared to no pruning with the corresponding length and count
| Prune cutoff | Kmer length 30, count 1 (a) | Kmer length 30, count 2 (b) | Kmer length 45, count 1 (a) | Kmer length 45, count 2 (b) |
|---|---|---|---|---|
| No pruning (# opened × 10^6) | 2325 | 55.8 | 17.7 | 9.2 |
| Prune 5 (% opened) | 1.55 | 16.53 | NA | NA |
| Prune 10 (% opened) | 2.66 | 25.09 | NA | NA |
| Prune 15 (% opened) | 6.34 | 29.28 | NA | NA |
| Prune 20 (% opened) | 11.58 | 31.86 | 10.7 | 6.5 |
(a) Count 1: de Bruijn graph requiring kmers with minimum abundance of 1 in the reads
(b) Count 2: de Bruijn graph requiring kmers with minimum abundance of 2 in the reads
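
The difference between the count 1 and count 2 graphs can be illustrated with a small Python sketch that keeps only kmers reaching a minimum abundance in the reads. The function names are hypothetical, and the sketch ignores reverse complements and memory-efficient data structures, both of which a real assembler would need to consider.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def graph_kmers(reads, k, min_abundance=1):
    """Vertices of the de Bruijn graph: kmers seen at least min_abundance times."""
    return {
        kmer
        for kmer, n in count_kmers(reads, k).items()
        if n >= min_abundance
    }
```

Raising `min_abundance` from 1 to 2 removes kmers seen only once, which shrinks the graph and is consistent with the much smaller number of opened vertices reported for count 2 in the table above.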
Comparison between Xander and SAT assembly of ribosomal protein L2 (rplB) genes from HMP-defined community data
| Tool | SAT | Xander (a) |
|---|---|---|
| # contigs | 4 | 6 |
| # members covered | 3 | 4 |
| Median gene coverage (%) (b) | 75.7 | 94.6 |
| Max gene coverage (%) (b) | 79.9 | 100 |
| Median % nucleotide identity | 97.8 | 99.8 |
| Max % nucleotide identity | 99.8 | 100 |
(a) Xander: kmer length of 45, prune 20, count 1 graph
(b) Gene coverage: length of the contigs compared to the closest defined community members
Nitrite reductase (nirK) genes found in bulk assembly of pooled rhizosphere samples
| Sample | Corn | Miscanthus | Switchgrass |
|---|---|---|---|
| File size (GB) | 349 | 325 | 277 |
| Data size (Gbp) | 293 | 275 | 233 |
| # protein contig clusters (a) | 41 | 37 | 39 |
| # OTUs at 95 % aa identity | 38 | 33 | 34 |
| Median length (aa) | 131 | 115 | 130 |
| Max length (aa) | 234 | 252 | 301 |
| Median % aa identity (b) | 75.6 | 79.6 | 73.3 |
| Max % aa identity (b) | 95.1 | 94.3 | 92.1 |
| # reads covering kmers | 105 | 123 | 106 |
| Gene abundance | 0.25 | 0.25 | 0.3 |
(a) Number of protein contig clusters at 99 % aa identity
(b) Percent identity to nearest reference sequence
Xander assembly of pooled rhizosphere samples
| Gene | nirK | nirK | nirK | nifH | nifH | nifH | rplB | rplB | rplB |
|---|---|---|---|---|---|---|---|---|---|
| Crop | C | M | S | C | M | S | C | M | S |
| # chimeric clusters removed | 16 | 207 | 11 | 0 | 1 | 0 | 14 | 28 | 44 |
| # protein contig clusters (a) | 1993 | 1807 | 1581 | 39 | 57 | 41 | 19,287 | 20,463 | 17,334 |
| # OTUs at 95 % aa identity | 741 | 674 | 582 | 14 | 24 | 17 | 6100 | 6887 | 6004 |
| Median (aa) | 215 | 230 | 208 | 294 | 256 | 255 | 274 | 274 | 274 |
| Longest (aa) | 380 | 372 | 370 | 296 | 296 | 296 | 285 | 285 | 284 |
| Median % aa identity | 88.3 | 84.7 | 87.8 | 92.7 | 91.9 | 91.6 | 77.7 | 75.8 | 76.3 |
| Max % aa identity | 100 | 99.4 | 98.6 | 100 | 100 | 100 | 100 | 100 | 100 |
| # reads covering kmers | 27,404 | 19,815 | 16,661 | 411 | 534 | 461 | 225,985 | 179,867 | 149,661 |
| Relative abundance | 0.121 | 0.11 | 0.111 | 0.002 | 0.003 | 0.003 | 1 | 1 | 1 |
C: corn; M: Miscanthus; S: switchgrass; nirK: nitrite reductase gene; nifH: nitrogenase reductase gene; rplB: ribosomal protein L2 gene
(a) # protein contig clusters: number of protein contigs clustered at 99 % aa identity
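
The "Relative abundance" row appears to be consistent with dividing each gene's "# reads covering kmers" by the value for the last gene group in the same crop (the group whose relative abundance is 1, taken here to be rplB). This reading is inferred from the numbers in the table, not from a definition given in this record. A short Python check, with hypothetical names, reproduces the reported values:

```python
# '# reads covering kmers' per crop (C, M, S), column groups in table order.
READS_GROUP_1 = (27404, 19815, 16661)
READS_GROUP_2 = (411, 534, 461)
READS_GROUP_3 = (225985, 179867, 149661)  # group with relative abundance 1


def relative_abundance(gene_reads, reference_reads):
    """Ratio of reads for a gene to reads for the reference gene, per crop."""
    return tuple(round(g / r, 3) for g, r in zip(gene_reads, reference_reads))


print(relative_abundance(READS_GROUP_1, READS_GROUP_3))
print(relative_abundance(READS_GROUP_2, READS_GROUP_3))
```

Normalizing to a single-copy gene in this way lets abundances be compared across samples of different sequencing depth.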
Fig. 3 Kmer abundance of nitrite reductase gene (nirK) representative contigs assembled by Xander from the pooled rhizosphere samples. The representative contigs were chosen from clusters at 99 % aa identity. X-axis indicates the number of times (abundance) a kmer in the contigs occurred in the reads. Y-axis represents the fraction of total unique kmers with this abundance.
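
The quantity plotted in Fig. 3 can be sketched in Python as a normalized histogram of read abundances over the unique kmers of the contigs. The function name is hypothetical, and the sketch assumes nucleotide contigs and exact kmer matching without reverse complements.

```python
from collections import Counter


def kmer_abundance_distribution(contigs, reads, k):
    """Map each abundance value to the fraction of unique contig kmers
    that occur that many times in the reads."""
    read_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    contig_kmers = {
        c[i:i + k] for c in contigs for i in range(len(c) - k + 1)
    }
    histogram = Counter(read_counts[kmer] for kmer in contig_kmers)
    total = len(contig_kmers)
    return {abundance: n / total for abundance, n in sorted(histogram.items())}
```

The keys of the returned dictionary correspond to the X-axis of the figure and the values to the Y-axis.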
Fig. 4 Principal component analysis of rhizosphere samples (n = 7 per crop). The OTU abundances at 95 % aa identity were corrected using the mean kmer coverage of each contig. The OTU data were then standardized using the Wisconsin square root normalization as implemented in R. Ellipses represent 1 standard deviation of the points from the centroid. C corn, M Miscanthus, S switchgrass. Left: nitrite reductase (nirK). Right: ribosomal protein L2 (rplB).
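
The standardization named in the Fig. 4 caption can be sketched in plain Python, assuming that "Wisconsin square root normalization" refers to a Wisconsin double standardization applied to square-root-transformed abundances: each OTU is divided by its maximum across samples, then each sample is divided by its total. That interpretation, the function name, and the omission of the kmer coverage correction and of the principal component analysis itself are assumptions of this sketch.

```python
import math


def wisconsin_sqrt(table):
    """table: list of samples (rows), each a list of OTU abundances (columns).
    Returns the square-root-transformed, double-standardized table."""
    rooted = [[math.sqrt(v) for v in row] for row in table]
    # Step 1: divide each OTU (column) by its maximum across samples.
    col_max = [max(col) for col in zip(*rooted)]
    by_max = [
        [v / m if m > 0 else 0.0 for v, m in zip(row, col_max)]
        for row in rooted
    ]
    # Step 2: divide each sample (row) by its total.
    result = []
    for row in by_max:
        total = sum(row)
        result.append([v / total if total > 0 else 0.0 for v in row])
    return result
```

After this transformation every sample sums to 1, so highly abundant OTUs no longer dominate the ordination.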
Xander processing statistics with kmer length of 45 and count 1
| Sample name | HMP (a) | C1 | Corn |
|---|---|---|---|
| File size (GB) | 1.7 | 46 | 349 |
| Build memory (GB) | 1 | 60 | 200 |
| Build time (h) | 0.3 | 6.4 | 41 |
| Find starting kmers (h) (b) | 0.1 | 3.6 | 27.0 |
| Search (c) | 0.3 (min) | 1 (min) | 6 (min) |
| Search (c) | NA | 48 (min) | 36.7 (h) |
| Search (c) | 1.1 (min) | 228 (min) | 49.4 (h) |
(a) HMP-defined community data
(b) For single thread. Can be multi-threaded or run in parallel
(c) For single thread. Can be run in parallel