Narjes S Movahedi, Mallory Embree, Harish Nagarajan, Karsten Zengler, Hamidreza Chitsaz.
Abstract
As the vast majority of microbes are unculturable, single-cell sequencing has become an important method for gaining insight into microbial physiology. Single-cell sequencing methods, currently powered by multiple displacement amplification (MDA) of the genome, have passed important milestones such as finishing and closing the genome of a prokaryote. However, the quality and reliability of genome assemblies from single cells are still unsatisfactory because MDA bias causes uneven coverage depth and leaves scattered chunks of the genome absent from the final collection of reads. In this work, our new algorithm, Hybrid De novo Assembler (HyDA), demonstrates the power of coassembling multiple single-cell genomic data sets: it significantly improves assembly quality in terms of predicted functional elements and length statistics. Coassemblies contain significantly more base pairs and protein-coding genes, cover more subsystems, and consist of longer contigs than individual assemblies by the same algorithm and by the state-of-the-art single-cell assemblers SPAdes and IDBA-UD. HyDA also avoids chimeric assemblies by detecting and separating the shared and exclusive pieces of sequence for the input data sets. By replacing one deep single-cell sequencing experiment with a few single-cell sequencing experiments of lower depth, the coassembly method can hedge against the risk of failure and loss of the sample without significantly increasing sequencing cost. Application of the single-cell coassembler HyDA to the study of three uncultured members of an alkane-degrading methanogenic community validated the usefulness of the coassembly concept. HyDA is open source and publicly available at http://chitsazlab.org/software.html, and the raw reads are available at http://chitsazlab.org/research.html.
Keywords: colored de Bruijn graph; genome assembly; genome coassembly; single-cell genomics; uncultivable bacteria
Year: 2016 PMID: 27243002 PMCID: PMC4876485 DOI: 10.3389/fbioe.2016.00042
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Two sample colored de Bruijn graphs with colors red and blue. Nodes are k-mers and edges represent (k + 1)-mers. A colored bar shows the multiplicity of the k-mer in the corresponding colored input data set. Each box is an output contig, and the color of a box shows non-zero colored average coverage, which is shown on the right-hand side of the contig in (A). Our coassembly algorithm (A) rescues a poorly covered region of the genome in one color when it is well covered in the other, and (B) allows pairwise comparison of colored assemblies by revealing all of their shared and exclusive pieces of sequence.
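The structure described in the caption can be illustrated with a minimal Python sketch. It is not HyDA's implementation: the function names `colored_kmer_counts` and `colored_edges` and the toy reads are assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors, and contig construction. It only shows how each k-mer node can carry one multiplicity per color and how (k + 1)-mers connect nodes.

```python
from collections import defaultdict


def colored_kmer_counts(datasets, k):
    """Map each k-mer to a list holding its multiplicity in every color.

    `datasets` is a list of read lists; the index of a read list is its color.
    """
    counts = defaultdict(lambda: [0] * len(datasets))
    for color, reads in enumerate(datasets):
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]][color] += 1
    return dict(counts)


def colored_edges(datasets, k):
    """Collect edges: every (k + 1)-mer links its k-mer prefix to its k-mer suffix."""
    edges = set()
    for reads in datasets:
        for read in reads:
            for i in range(len(read) - k):
                kp1 = read[i:i + k + 1]
                edges.add((kp1[:-1], kp1[1:]))
    return edges


# Toy input (hypothetical): one red read and one blue read that overlap in "GTA".
red = ["ACGTA"]
blue = ["GTACC"]
counts = colored_kmer_counts([red, blue], k=3)
edges = colored_edges([red, blue], k=3)

for kmer in sorted(counts):
    print(kmer, counts[kmer])
```

In this toy input, the k-mer `GTA` has non-zero multiplicity in both colors, so it would belong to a shared piece of sequence, while the remaining k-mers are exclusive to one color.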
The GAGE (Salzberg et al., .
| | Lane 1 | Lane 6 | Identical cells | Identical cells | Non-identical cells | Non-identical cells |
|---|---|---|---|---|---|---|
| Assembly size | 4,532,221 | 4,642,640 | 5,262,077 | 5,204,061 | 8,273,488 | 5,212,674 |
| Missing | 314,009 (6.77%) | 123,687 (2.67%) | 1,555 (0.03%) | 2,023 (0.04%) | 1,289 (0.03%) | 2,136 (0.05%) |
| Extra bases (%) | 280,998 (6.20%) | 198,072 (4.27%) | 653,307 (12.42%) | 584,534 (11.23%) | 3,661,052 (44.25%) | 597,088 (11.45%) |
| SNPs | 60 | 19 | 11 | 3 | 5 | 5 |
| Indels < 5 bp | 6 | 4 | 10 | 6 | 8 | 6 |
| Indels ≥ 5 bp | 13 | 14 | 6 | 5 | 4 | 4 |
| Inversions | 0 | 0 | 0 | 0 | 0 | 0 |
| Relocations | 12 | 11 | 2 | 3 | 2 | 3 |
| NG50 | 42,257 | 54,422 | 41,964 | 34,752 | 54,505 | 37,794 |
| Corrected NG50 | 39,975 | 44,872 | 39,334 | 32,876 | 39,334 | 36,868 |
GAGE (Salzberg et al., .
Pairwise relationships between three coassembled data sets, .
| | Pair 1 | Pair 1 | Pair 2 | Pair 2 | Pair 3 | Pair 3 |
|---|---|---|---|---|---|---|
| Total (bps) | 5,228,480 | 5,240,302 | 5,240,302 | 3,366,622 | 3,366,622 | 5,228,480 |
| Shared (bps) | 5,210,548 | 5,210,548 | 335,648 | 335,648 | 336,184 | 336,184 |
| Exclusive (bps) | 17,932 | 29,754 | 4,904,654 | 3,030,974 | 3,030,438 | 4,892,296 |
| Exclusivity ratio | 0.003 | 0.005 | 0.9359 | 0.9003 | 0.9001 | 0.9357 |
Total is the total size of those contigs that have non-zero coverage in the corresponding color. Shared is the size of those contigs that have non-zero coverage in both colors. Exclusive is the size of those contigs that have non-zero coverage in the corresponding color and zero coverage in the other color in the pair.
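These definitions can be expressed as a short Python sketch. The function `pairwise_stats` and the representation of a contig as a `(length, coverage_by_color)` pair are assumptions for illustration, not HyDA's output format. The exclusivity ratio is taken here to be Exclusive divided by Total, which is consistent with the values in the table. The three toy contigs collapse many real contigs into one per category, with lengths chosen so that the totals match the first two totals in the table (5,228,480 and 5,240,302 bp, with 5,210,548 bp shared).

```python
def pairwise_stats(contigs, a, b):
    """Compute Total, Shared, Exclusive, and exclusivity ratio for colors a and b.

    `contigs` is a list of (length_bp, coverage_by_color) pairs, where
    coverage_by_color maps a color to the contig's average coverage in it.
    """
    total = {a: 0, b: 0}
    shared = 0
    for length, cov in contigs:
        in_a = cov.get(a, 0) > 0
        in_b = cov.get(b, 0) > 0
        if in_a:
            total[a] += length
        if in_b:
            total[b] += length
        if in_a and in_b:
            shared += length
    exclusive = {c: total[c] - shared for c in (a, b)}
    ratio = {c: (exclusive[c] / total[c] if total[c] else 0.0) for c in (a, b)}
    return {"total": total, "shared": shared, "exclusive": exclusive, "ratio": ratio}


# Hypothetical contigs: one shared, one exclusive to color 0, one exclusive to color 1.
contigs = [
    (5_210_548, {0: 30.0, 1: 25.0}),
    (17_932, {0: 12.0, 1: 0.0}),
    (29_754, {0: 0.0, 1: 9.0}),
]
stats = pairwise_stats(contigs, 0, 1)
print(stats["total"], stats["shared"], stats["exclusive"])
print({color: round(r, 4) for color, r in stats["ratio"].items()})
```

A low ratio in both directions indicates two data sets that share nearly all of their sequence, while ratios close to 1 indicate largely distinct genomes.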
Figure 2. Genome coverage in (A) single-cell . Both have an average coverage of ~600×.
Evaluation results obtained from GAGE (Salzberg et al., .
| Tool | Missing ref. bases (%), Lane 1 | Missing ref. bases (%), Lane 6 |
|---|---|---|
| E + V-SC | 281,060 (6.06%) | 109,994 (2.37%) |
| SPAdes | 128,600 (2.77%) | 15,831 (0.34%) |
| IDBA-UD | 145,536 (3.14%) | 28,583 (0.62%) |
Quast (Gurevich et al., .
| | | A17 | F02 | F16 | K04 | K19 | MEB10 | MEK03 | MEL13 | C04 | K05 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HyDA | Total | 54,237 | 1,278,742 | 604,769 | 449,148 | 371,311 | 1,182,622 | 1,666,233 | 1,150,681 | 252,402 | 502,469 |
| N50 | 8,461 | 5,416 | 5,718 | 6,167 | 7,315 | 4,963 | |||||
| HyDA-Cl | Total | 1,352,341 | 1,945,701 | ||||||||
| N50 | 850 | 8,201 | 6,088 | 5,239 | 5,887 | 5,952 | 6,977 | 1,928 | 3,782 | ||
| SPAdes | Total | 169,413 | 982,263 | 618,500 | 653,866 | 1,514,813 | 1,415,399 | 390,923 | 869,586 | ||
| N50 | 1,187 | 5,944 | 5,366 | 9,332 | 3,834 | 4,234 | 3,128 | ||||
| IDBA-UD | Total | 144,512 | 1,441,353 | 927,009 | 566,327 | 613,399 | 1,327,742 | 1,746,656 | 1,351,465 | 318,914 | 804,313 |
| N50 | 2,894 | 3,163 | 5,751 | 6,851 | 8,209 | 10,253 | 4,706 | ||||
All statistics are based on contigs of size ≥100 bp. Only those HyDA contigs that have a coverage of at least 1 in the corresponding color are considered. The coverage cutoff was set to 24 for all HyDA assemblies (−c = 24). Total is the total assembly size, and N50 is the contig length such that contigs of that length or longer cover at least half of the total assembly size. Best result is in bold face.
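As a worked illustration of the two statistics, the following Python sketch computes the total size and N50 from a list of contig lengths, applying the 100 bp minimum length stated above. The function names and the toy lengths are assumptions for illustration. This follows the common N50 convention and is not taken from the evaluation tool used in the paper.

```python
def total_size(lengths, min_len=100):
    """Sum the lengths of contigs that meet the minimum length."""
    return sum(length for length in lengths if length >= min_len)


def n50(lengths, min_len=100):
    """Return the length L such that contigs of length >= L cover at least
    half of the total assembly size; 0 if no contig passes the filter."""
    kept = sorted((length for length in lengths if length >= min_len), reverse=True)
    half = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half:
            return length
    return 0


# Toy contig lengths (hypothetical); the 50 bp contig is filtered out.
lengths = [50, 100, 200, 300, 400]
print(total_size(lengths))  # 1000
print(n50(lengths))         # 300
```

In the toy example, the longest contig (400 bp) covers less than half of the 1,000 bp total, and adding the 300 bp contig brings the running sum to 700 bp, so N50 is 300.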
The exclusivity ratio (%) of row with respect to column for the 10 cells from .
| | A17 | F02 | F16 | K04 | K19 | MEB10 | MEK03 | MEL13 | C04 | K05 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A17 | 0 | 24 | 87 | 95 | 96 | 80 | 82 | 86 | 22 | 19 | |
| F02 | 77 | 0 | 96 | 98 | 99 | 71 | 68 | 72 | |||
| F16 | 96 | 96 | 0 | 73 | 73 | 37 | 22 | 38 | 96 | 55 | |
| K04 | 97 | 97 | 49 | 0 | 67 | 42 | 25 | 45 | 97 | 73 | |
| K19 | 98 | 98 | 54 | 68 | 0 | 35 | 32 | 32 | 98 | 55 | |
| MEB10 | 96 | 96 | 74 | 48 | 69 | 0 | 24 | 39 | 95 | 57 | |
| MEK03 | 97 | 97 | 73 | 54 | 74 | 38 | 0 | 37 | 96 | 58 | |
| MEL13 | 97 | 97 | 76 | 51 | 68 | 39 | 22 | 0 | 97 | 59 | |
| C04 | 44 | 39 | 89 | 96 | 97 | 85 | 86 | 90 | 0 | 64 | |
| K05 | 77 | 75 | 54 | 76 | 75 | 45 | 41 | 49 | 73 | 0 | |
Only the contigs of coverage at least 1 in the corresponding color are considered. Coverage cutoff was chosen to be 24 for all HyDA assemblies (−c = 24).
Summary of coding sequences and subsystems predicted by the RAST server (Aziz et al., .
| HyDA-colored | SPAdes | IDBA-UD | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Coding sequence | Subsystem | Coding sequence | Subsystem | Coding sequence | Subsystem | ||||||
| A17 | 8 | 146 | 132 | 7 | |||||||
| F02 | 1,283 | 122 | 1,375 | 121 | |||||||
| F16 | 899 | 91 | 866 | 89 | |||||||
| K04 | 559 | 75 | 508 | 66 | |||||||
| K19 | 581 | 54 | 572 | 57 | |||||||
| MEB10 | 1,491 | 151 | 1,297 | 138 | |||||||
| MEK03 | 1,856 | 180 | 1,178 | 170 | |||||||
| MEL13 | 1,435 | 154 | 1,384 | 148 | |||||||
| C04 | 48 | 375 | 320 | 36 | |||||||
| K05 | 873 | 68 | 854 | 77 | |||||||
Best result is in bold face.