Yu-Wei Wu, Mina Rho, Thomas G Doak, Yuzhen Ye.
Abstract
MOTIVATION: One of the difficulties in metagenomic assembly is that homologous genes from evolutionarily closely related species may behave like repeats and confuse assemblers. As a result, an assembler may report small contigs, each representing a short gene fragment, instead of complete genes. This further complicates annotation of metagenomic datasets, as annotation tools (such as gene predictors or similarity search tools) typically perform poorly on contigs encoding short gene fragments.
Year: 2012 PMID: 22962453 PMCID: PMC3436815 DOI: 10.1093/bioinformatics/bts388
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Alignment between a de Bruijn graph and a reference sequence. Blocks in the de Bruijn graph represent nodes, and black arrowheads represent the directed edges that connect nodes with overlapping (k − 1)-mers. Typically, a de Bruijn graph-based assembler will output each of the nodes as a contig. Red arrowheads mark the optimal path of nodes, derived by the network matching algorithm, that aligns with the reference sequence.
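As a rough illustration of the graph structure described in Fig. 1, the following Python sketch builds a simple de Bruijn graph in which each node is a k-mer and a directed edge joins two k-mers that overlap by k − 1 bases. The function name `build_de_bruijn`, the toy reads and the choice of k = 4 are assumptions made for illustration only. Real assemblers also handle reverse complements, sequencing errors and node compaction, none of which is attempted here.

```python
def build_de_bruijn(reads, k):
    """Return {kmer: set of successor kmers} for k-mers adjacent in any read.

    Hypothetical helper: a minimal, uncompacted de Bruijn graph.
    """
    graph = {}
    for read in reads:
        # All k-mers of the read, in order of appearance.
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())  # register node, even without out-edges
        # Consecutive k-mers in a read overlap by k - 1 bases: add a directed edge.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


# Toy reads (made up); the second read extends the first through "GGCG".
reads = ["ATGGCG", "GGCGTA"]
graph = build_de_bruijn(reads, 4)

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy case the two reads share the k-mer `GGCG`, so the graph forms a single chain that spans both reads.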
The list of species contained in the mock dataset and corresponding species used as references in GeneStitch
| Species in mock dataset | Reference species |
|---|---|
| NC_002662 | NC_012984 |
| NC_008527 | NC_014724 |
| NC_008525 | NC_008529 |
| NC_010999 | NC_014106 |
| NC_008497 | NC_009513 |
| NC_008700 | NC_014012 |
| NC_008095 | NC_011891 |
| NC_008578 | NC_014666 |
| NC_002607 | NC_013967 |
a. The taxonomic ranks in the parentheses indicate the lowest common taxonomic level shared between the reference species and the species in the mock dataset.
A summary of the GeneStitch results for E. coli K-12 at 6×, 13× and 20× sequencing depths
| Sequencing depth | Reference | Genes/fragments | Gene coverage | Complete genes | Complete gene ratio | Misassembly rate |
|---|---|---|---|---|---|---|
| 6× | — | 13 947 (14 149) | 26% (28%) | 572 | 14% | — |
| 6× | | 5365 | 62% | +318 | 21% | 0.3% |
| 6× | | 4489 | 62% | +269 | 20% | 0.5% |
| 6× | | 3916 | 62% | +227 | 19% | 0.2% |
| 13× | — | 6642 (9158) | 33% (50%) | 2320 | 56% | — |
| 13× | | 4375 | 75% | +1070 | 82% | 0.2% |
| 13× | | 3664 | 75% | +932 | 78% | 0.3% |
| 13× | | 3212 | 75% | +824 | 76% | 0.2% |
| 20× | — | 1904 (3491) | 45% (81%) | 3264 | 79% | — |
| 20× | | 1960 | 75% | +461 | 90% | 0.6% |
| 20× | | 1484 | 76% | +418 | 89% | 0.2% |
| 20× | | 1261 | 75% | +349 | 87% | 0.2% |
Genes/fragments: the number of gene fragments in assembled contigs (the first row of each section) or the number of genes assembled by GeneStitch.
Gene coverage reflects the completeness of assembled genes; a small value indicates that assembled genes are highly fragmented.
Complete genes: the assembled genes, or the genes in contigs (the first row of each section), that are complete or almost complete (at least 90% of the entire length) compared with the real genes. Additional complete genes assembled by GeneStitch are marked with a ‘+’ sign.
Complete gene ratio: the ratio of completely assembled genes to all annotated genes in the E. coli K-12 genome.
Rows with ‘—’ in the Reference column list the assembly results before applying GeneStitch.
Where two numbers are given, they indicate the statistics for fragmented genes and for all genes (in parentheses) in contigs. See text for details.
The ratio is calculated over all complete genes, including the ones assembled by SOAPdenovo and GeneStitch.
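The 90% completeness criterion and the complete gene ratio described in these notes can be sketched in a few lines of Python. This is only an illustrative reading of the definitions: the function names, the gene identifiers and the lengths are hypothetical, and the real evaluation compares assembled sequences with annotated genes rather than bare lengths.

```python
def is_complete(assembled_len, true_len, threshold=0.9):
    """True when the assembled gene covers at least `threshold` of the real gene."""
    return assembled_len >= threshold * true_len


def complete_gene_ratio(assembled, annotated, threshold=0.9):
    """Return (number of complete genes, ratio over all annotated genes).

    `assembled` and `annotated` map gene id -> length in bases (toy inputs).
    Genes missing from `assembled` count as length 0.
    """
    complete = sum(
        1
        for gene, true_len in annotated.items()
        if is_complete(assembled.get(gene, 0), true_len, threshold)
    )
    return complete, complete / len(annotated)


# Made-up lengths: geneA is 95% covered, geneB 44%, geneC is not assembled.
annotated = {"geneA": 1000, "geneB": 900, "geneC": 1200}
assembled = {"geneA": 950, "geneB": 400}

n_complete, ratio = complete_gene_ratio(assembled, annotated)
print(n_complete, f"{ratio:.0%}")
```

The denominator is the full set of annotated genes, so genes that were never assembled lower the ratio just as fragmented ones do.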
Fig. 2. Improvement of gene assembly by GeneStitch for the simulated and real community datasets, as evaluated by gene coverage (A) and the number of complete genes (B).
Fig. 3. An example demonstrating the inference of a gene path from a connected component in the de Bruijn graph. The reference gene recruited by BLAST in this example is YP_812362. (A) In total, 17 nodes are present in this connected component. (B) The path found by GeneStitch using the reference gene. (C) The gene path.
Fig. 4. An example demonstrating the construction of a gene graph by merging gene paths. (A) Only 19 nodes are shown in this figure for clarity (the actual component is larger). (B) Two paths are found by GeneStitch, using YP_003601430 and YP_004031707 as the reference genes. (C) The two paths are merged into a gene graph.
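One simple way to picture the merge shown in Fig. 4 is as a union of the edges of each path. The Python sketch below takes that view. It assumes each gene path is an ordered list of node identifiers and that merging means collecting every node and edge seen in any path; the caption does not spell out the actual merging procedure, and the node names here are invented.

```python
def merge_paths(paths):
    """Merge ordered node paths into one directed graph {node: set of successors}.

    Hypothetical helper: treats merging as a plain union of path edges.
    """
    gene_graph = {}
    for path in paths:
        for node in path:
            gene_graph.setdefault(node, set())
        # Each consecutive pair of nodes in a path contributes one directed edge.
        for u, v in zip(path, path[1:]):
            gene_graph[u].add(v)
    return gene_graph


# Two made-up paths that share their start and end but differ in the middle,
# as might happen when two reference genes select different variants.
path_a = ["n1", "n2", "n3", "n5"]
path_b = ["n1", "n2", "n4", "n5"]

gene_graph = merge_paths([path_a, path_b])
for node in sorted(gene_graph):
    print(node, "->", sorted(gene_graph[node]))
```

Shared nodes appear once in the merged graph, and the point where the two paths diverge shows up as a node with two successors.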