Yuzhen Ye, Haixu Tang.
Abstract
MOTIVATION: Metagenomics research has accelerated the study of microbial organisms, providing insights into the composition and potential functionality of various microbial communities. Metatranscriptomics (the study of transcripts from a mixture of microbial species) and other meta-omics approaches hold even greater promise for revealing the functional and regulatory characteristics of these communities. Current metatranscriptomics projects are often carried out without matched metagenomic datasets (from the same microbial communities). Even when a project produces both metatranscriptomic and metagenomic datasets, their analyses are often not integrated. Metagenome assemblies are far from perfect, which partially explains why they are not used for the analysis of metatranscriptomic datasets.
Year: 2015 PMID: 26319390 PMCID: PMC4896364 DOI: 10.1093/bioinformatics/btv510
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A schematic illustration of the algorithm for mapping reads onto de Bruijn graphs. (a) A toy example showing four reads spanning junction k-mers in the graph (shown as the vertices). (b) Using a hash table of junction k-mers, candidate reads that span multiple edges can be retrieved by table lookup. (c) For each candidate, a matched k-mer determines a unique putative location of the read in the graph (i.e. a seed match). The seed match is then used to constrain the alignment between the read and the graph, which is computed by a dynamic programming algorithm
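The lookup step in panels (b) and (c) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_junction_index` and `find_seeds`, the use of integer vertex ids, and the toy sequences are all assumptions made for illustration. Reverse complements and the subsequent dynamic programming alignment are deliberately left out.

```python
def build_junction_index(junction_kmers):
    """Hash table from each junction k-mer to a vertex id.

    The vertex ids are simply list positions here (an assumption made
    for illustration; a real graph would carry its own identifiers).
    """
    return {kmer: vid for vid, kmer in enumerate(junction_kmers)}


def find_seeds(read, index, k):
    """Scan every k-mer of the read and report junction hits.

    Returns a list of (offset_in_read, vertex_id) pairs. Each pair is a
    seed match: it pins the read to one putative location in the graph,
    which a later alignment step could use as a constraint.
    """
    seeds = []
    for i in range(len(read) - k + 1):
        vid = index.get(read[i:i + k])
        if vid is not None:
            seeds.append((i, vid))
    return seeds


# Toy example with k = 4 and two hypothetical junction k-mers.
index = build_junction_index(["ACGT", "GTTA"])
seeds = find_seeds("TTACGTTAGG", index, k=4)
```

A read with no junction k-mer returns an empty seed list, which in this sketch would mean it is a candidate for lying entirely within a single edge.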
Fig. 2. A schematic example illustrating the induced transcript graph derived from four reads (A–D) mapped to a de Bruijn graph of metagenome assembly
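One way to picture an induced graph of this kind is sketched below in Python. The caption does not spell out the construction, so this is an assumption: each mapped read is represented as an ordered list of hypothetical edge ids, and the induced graph keeps only the edges that reads touch, plus the connections between consecutive edges, with a count of supporting reads. The name `induce_transcript_graph` and the toy paths are illustrative only.

```python
from collections import Counter


def induce_transcript_graph(read_paths):
    """Collect the part of the assembly graph that mapped reads support.

    read_paths: iterable of lists of edge ids, one list per mapped read,
    in the order the read traverses the edges (assumed input format).
    Returns (edges_used, links), where links counts how many reads
    traverse each consecutive pair of edges.
    """
    edges_used = set()
    links = Counter()
    for path in read_paths:
        edges_used.update(path)
        for a, b in zip(path, path[1:]):
            links[(a, b)] += 1
    return edges_used, links


# Four hypothetical reads A-D; read D lies within a single edge.
paths = [["e1", "e2"], ["e2", "e3"], ["e1", "e2", "e3"], ["e4"]]
edges_used, links = induce_transcript_graph(paths)
```

Under this representation, connections in the assembly graph that no read traverses never appear in `links`, which is how read evidence would prune a tangled graph.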
Performance comparison of TAG and other assemblers on the mock dataset
| | Oases | Trinity | TAG |
|---|---|---|---|
| No. of transcripts | 12598 | 24804 | 9428 |
| Perfectly aligned transcripts (percentage) | 5483 (43.5%) | 12392 (50.0%) | 9412 (99.8%) |
| Transcripts with minor problems (percentage) | 2724 (21.6%) | 2725 (11.0%) | 14 (0.15%) |
| Problematic transcripts (percentage) | 4391 (34.9%) | 9687 (39.1%) | 2 (0.02%) |
| Total length of the transcripts | 6860841 bp | 7428187 bp | 7020975 bp |
| Total length of perfectly aligned transcripts | 2265224 bp | 3858486 bp | 7002290 bp |
| Total length of good transcripts | 4076481 bp | 5025072 bp | 7020484 bp |
a. Only transcripts of at least 100 bp were considered for all programs.
b. Trinity has many more transcripts, but their total length is comparable to the other methods.
c. A transcript that aligns perfectly to one of the reference genomes (with an alignment covering the entire transcript at 100% sequence identity) is considered correctly assembled. We consider a transcript's problem ‘minor’ if its longest alignment with the reference genomes is less than 10 nt shorter than the transcript and has 95% sequence identity or better. Transcripts that meet neither criterion are considered problematic.
d. A large fraction of the problematic transcripts for Oases and Trinity are likely caused by contaminant sequences or other artifacts and so should not be considered mis-assemblies. For example, 3494 (out of 4391) Oases transcripts have no significant alignments with the reference genomes (E-value better than 1e-4) and are therefore unlikely to be transcripts from the reference genomes.
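The three-way classification described in the table footnotes can be written as a small Python function. This is a sketch under stated assumptions, not the authors' evaluation script: the function name and argument names are hypothetical, identity is taken as a percentage, and the 10 nt rule is read as "the longest alignment is less than 10 nt shorter than the transcript", which is one interpretation of the original wording.

```python
def classify_transcript(transcript_len, longest_aln_len, identity_pct):
    """Classify an assembled transcript by its best reference alignment.

    transcript_len:  transcript length in nt
    longest_aln_len: length of its longest alignment to the references
    identity_pct:    sequence identity of that alignment, in percent
    """
    # Perfect: alignment covers the whole transcript at 100% identity.
    if longest_aln_len == transcript_len and identity_pct == 100.0:
        return "perfect"
    # Minor: alignment falls short by fewer than 10 nt (assumed boundary)
    # and identity is 95% or better.
    if transcript_len - longest_aln_len < 10 and identity_pct >= 95.0:
        return "minor"
    # Everything else is treated as problematic.
    return "problematic"


labels = [
    classify_transcript(500, 500, 100.0),
    classify_transcript(500, 495, 97.0),
    classify_transcript(500, 400, 99.0),
]
```

Note that a full-length alignment at less than 100% identity falls into the minor category under this reading, provided the identity is at least 95%.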
Fig. 3. The impact of k-mer size on the performance of TAG. When the k-mer size increases from 25 to 31 in SOAPdenovo2 assembly, the performance of TAG remains the same: a substantial fraction of multi-edge transcripts can be assembled by TAG. However, when further increasing the k-mer size to 35, most transcripts assembled by TAG are single-edge transcripts, indicating the TAG algorithm is not effective when a large k-mer is used. This is probably because, in this case, the metagenome assembly is fragmented rather than tangled, and as a result the total length of the transcript also decreases. Therefore, in the experiments of this article, we choose k = 31 in SOAPdenovo2 assembly, which seems to yield the best results here
Fig. 4. The path length distribution for multi-edge-spanning reads that span two or more edges when mapped to the de Bruijn graph by TAG. The X-axis represents the length of multi-edge-spanning read paths (i.e. the number of edges that the multi-edge-spanning reads span) and the Y-axis represents the total number of multi-edge-spanning reads spanning the paths of certain lengths. Paths of length 1 represent the cases when the seed extension in one direction resulted in an alignment of at most 7 bp, and thus were considered insignificant and discarded
Some statistics of TAG assembly on the human stool metatranscriptomics dataset
| Statistic | Value |
|---|---|
| Total number of reads | 27962127 × 2 (paired) |
| Number of reads mapped to contigs | 19233474 × 2 + 7645742 (single) |
| Number of multi-edge-spanning reads | 1893157 |
| Number of | 112527 (32216351 bp) |
| Number of | 2573 (340276 bp) |
| Total number of | 115100 (32556627 bp) |
| Number of | 20903 (4596622 bp) |
| Number of | 552 (110063 bp) |
| Total number of | 21455 (4706685 bp) |
| Total number of transcripts (length) | 177463 (40456052 bp) |
| Proportion of multi-edge transcripts (in length) | 15.7% (11.6%) |
Only transcripts of at least 100 bp were considered in this summary.
a. Partial transcripts: the transcripts that are not fully resolved by TAG (i.e. the edge sequences); Resolved transcripts: the transcripts that are resolved by TAG and therefore likely represent full-length transcripts.
b. Single-edge transcripts: the transcripts reported by TAG that are fully contained within edges (contigs) in the de Bruijn graph of the metagenome assembly (they can be considered the results of a baseline reference-based metatranscriptome assembly approach that uses the contigs as the reference); Multi-edge transcripts: the transcripts reported by TAG that span multiple edges in the de Bruijn graph.