Jay S Ghurye, Victoria Cepeda-Espinoza, Mihai Pop.
Abstract
Advances in sequencing technologies have led to the increased use of high throughput sequencing in characterizing the microbial communities associated with our bodies and our environment. Critical to the analysis of the resulting data are sequence assembly algorithms able to reconstruct genes and organisms from complex mixtures. Metagenomic assembly involves new computational challenges due to the specific characteristics of the metagenomic data. In this survey, we focus on major algorithmic approaches for genome and metagenome assembly, and discuss the new challenges and opportunities afforded by this new field. We also review several applications of metagenome assembly in addressing interesting biological problems.
Keywords: Assembly; Metagenomics; Microbiome
Year: 2016 PMID: 27698619 PMCID: PMC5045144
Source DB: PubMed Journal: Yale J Biol Med ISSN: 0044-0086
Overview of current sequencing technologies.
| Sequencing technology | Read length | Accuracy | Time per run | Bases per run |
|---|---|---|---|---|
| Single Molecule Real-Time Sequencing (Pacific Biosciences) | 10 kbp to 15 kbp | 87% (Low) | 30 minutes to 4 hours | 5 – 10 Gb |
| Oxford Nanopore MinION Sequencing | 5 kbp to 10 kbp | 70% to 90% (Low) | 1 to 2 days | 500 Mb |
| Ion Semiconductor (Ion Torrent sequencing) | Up to 400 bp | 98% (Medium) | 2 hours | 10 Gb |
| Sequencing by synthesis (Illumina) | 50 – 300 bp | 99.9% (High) | 1 to 11 days | 300 Gb |
| Sequencing by ligation (SOLiD sequencing) | 75 bp | 99.9% (High) | 1 to 2 weeks | 3 Gb |
| Pyrosequencing (454) | 700 bp | 98% (Medium) | 24 hours | 400 Mb |
| Chain termination sequencing (Sanger sequencing) | 400 to 900 bp | 99.9% (High) | 20 minutes to 3 hours | 50 – 100 Kb |
Figure 1. Overview of different de novo assembly paradigms. Schematic representation of the three main paradigms for genome assembly: Greedy, Overlap-Layout-Consensus, and de Bruijn. In a Greedy assembler, the reads with the largest overlaps are iteratively merged into contigs. In the Overlap-Layout-Consensus approach, a graph is constructed by finding overlaps between all pairs of reads. This graph is then simplified, and contigs are constructed by finding branch-less paths in the graph and taking the consensus sequence of the overlapping reads implied by those paths. Contigs are further organized and extended using mate pair information. In de Bruijn graph assemblers, reads are chopped into short overlapping segments (k-mers), which are organized in a de Bruijn graph structure based on their co-occurrence across reads. The graph is simplified to remove artifacts due to sequencing errors, and branch-less paths are reported as contigs.
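The de Bruijn paradigm described in the caption can be illustrated with a minimal Python sketch. It is not taken from any of the assemblers discussed here: the function names `build_de_bruijn` and `branchless_contigs` are hypothetical, nodes are assumed to be (k-1)-mers, and the error-removal step, reverse complements, and isolated cycles are all ignored for brevity.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer
    found in a read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branchless_contigs(graph):
    """Report maximal branch-less paths as contigs.

    Simplification (assumption of this sketch): isolated cycles, in which
    every node has exactly one predecessor and one successor, are skipped.
    """
    indeg = defaultdict(int)
    for src, dsts in graph.items():
        for dst in dsts:
            indeg[dst] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(node):
        # A node lies inside a branch-less path if it has exactly one
        # incoming edge and exactly one outgoing edge.
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths start only at branching or terminal nodes
        for nxt in sorted(graph.get(node, ())):
            contig = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(graph[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return contigs
```

For example, with the three overlapping toy reads `["ATGGCGT", "GGCGTGC", "GTGCAAT"]` and `k = 4`, the graph is a single unbranched path and `branchless_contigs` returns the one contig `ATGGCGTGCAAT`. Real assemblers add the graph-simplification step mentioned in the caption before reporting contigs.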
Comparison of different assembly paradigms. The columns in the table denote various assembly methods. The rows denote the parameters that are compared across these assembly methods. Prototypical assemblers are highlighted in each category. Assemblers marked with a * are not specifically designed for metagenomic applications.
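| Parameter | Greedy | Overlap-Layout-Consensus | de Bruijn graph |
|---|---|---|---|
|  | ✓ | ✓ | ✓ |
|  | ✓ | ✓ | ✗ |
|  | ✗ | ✗ | ✓ |
|  | ✓ | ✗ | ✗ |
| Prototypical assemblers | VCAKE*, phrap*, TIGR* | Celera Assembler*, Omega, SGA* | MetaVelvet, Meta-IDBA, Megahit, Meta-Ray, Meta-Spades |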
Figure 2. Metagenomic assembly pipeline. Multiple bacterial genomes within a community are represented as circles of different colors, with circles of the same color indicating multiple individuals from the same organism. Note the different levels of sequencing coverage for the individual organisms' genomes, due to the different abundance of the organisms in the original sample. After sequencing, redundant reads can be removed through digital normalization, reducing the computational needs for assembly. The filtered reads are then assembled into contigs, which are classified using k-mer and coverage statistics. Contigs in each group are then binned to form draft genome sequences for organisms within the population.
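The digital normalization step in the pipeline can be sketched as follows. The caption does not specify the procedure, so this sketch assumes the commonly described median k-mer abundance rule: a read is kept only if the median count of its k-mers, among the reads kept so far, is below a cutoff. The function name `digital_normalization` and the default values of `k` and `cutoff` are hypothetical. Exact counting with a `Counter` stands in for the memory-efficient approximate counters that practical tools tend to use, and reverse complements are ignored.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=20, cutoff=20):
    """Discard reads whose k-mers are already well covered.

    Assumed rule for this sketch: keep a read only if the median
    abundance of its k-mers (counted over previously kept reads)
    is below `cutoff`.
    """
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read is shorter than k, nothing to estimate
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)  # only kept reads contribute to coverage
    return kept
```

With `k=4` and `cutoff=2`, five copies of the toy read `ACGTTGCAAG` followed by one copy of `TTGACCATGA` are reduced to two copies of the first read plus the second read. This shows how highly abundant organisms lose redundant reads while low-coverage organisms are left untouched.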