Hamidreza Chitsaz, Joyclyn L Yee-Greenbaum, Glenn Tesler, Mary-Jane Lombardo, Christopher L Dupont, Jonathan H Badger, Mark Novotny, Douglas B Rusch, Louise J Fraser, Niall A Gormley, Ole Schulz-Trieglaff, Geoffrey P Smith, Dirk J Evers, Pavel A Pevzner, Roger S Lasken.
Abstract
Whole genome amplification by the multiple displacement amplification (MDA) method allows sequencing of DNA from single cells of bacteria that cannot be cultured. Assembling a genome is challenging, however, because MDA generates highly nonuniform coverage of the genome. Here we describe an algorithm tailored for short-read data from single cells that improves assembly through the use of a progressively increasing coverage cutoff. Assembly of reads from single Escherichia coli and Staphylococcus aureus cells captures >91% of genes within contigs, approaching the 95% captured from an assembly based on many E. coli cells. We apply this method to assemble a genome from a single cell of an uncultivated SAR324 clade of Deltaproteobacteria, a cosmopolitan bacterial lineage in the global ocean. Metabolic reconstruction suggests that SAR324 is aerobic, motile and chemotaxic. Our approach enables acquisition of genome assemblies for individual uncultivated bacteria using only short reads, providing cell-specific genetic information absent from metagenomic studies.
Year: 2011 PMID: 21926975 PMCID: PMC3558281 DOI: 10.1038/nbt.1966
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1. Assembling single cell reads using Velvet-SC. (a) Coverage varies widely along the genome, between 1 and 12 in this cartoon example. Reads (short lines) and potential contigs (thick lines) are positioned along the genome, with a box around the reads supporting each contig. There are two potential contigs to choose from in the middle, differing by a single nucleotide (C vs. T): a green contig with coverage 6.4, and a blue contig with coverage 1. With a fixed coverage threshold of 4, Velvet would delete the low coverage blue and purple contigs, and then merge the high coverage red and green contigs into a contig much shorter than the full genome. Velvet-SC instead starts by eliminating sequences of average coverage 1, which only removes the blue contig. The other contigs are combined into a single contig (b) of average coverage 9. The purple region is salvaged by Velvet-SC because it was absorbed into a higher coverage region before the coverage threshold increased. Velvet-SC repeats this process with a gradually increasing low coverage threshold. (c) A portion of the de Bruijn graph for the contigs described in (a). The black circles are the “vertices” and represent 5-mer strings derived from the reads, which are indicated by colored lines alongside the chains of vertices, including a blue read with an erroneous T. The lines between the vertices are termed “edges” and represent the overlaps between the 5-mers. The edges are directed from left to right in this example. The read with the C/T mismatch results in two alternative paths for assembly, both with 5 intermediate vertices. The lower of the two paths arises from the erroneous blue read and has coverage 1; it is the only part of the graph eliminated by Velvet-SC, leaving a single chain of vertices that gives a single contig for the entire genome. See Supplementary Figure S3 for an example of the condensing of contigs.
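The graph in panel (c) can be illustrated with a minimal Python sketch. It is not the Velvet-SC implementation: the function name `build_de_bruijn`, the two toy sequences, and the choice to ignore reverse complements are all assumptions made for illustration. Vertices are 5-mers, edges join consecutive 5-mers within a read, and coverage counts how many reads contain each vertex.

```python
from collections import Counter


def build_de_bruijn(reads, k=5):
    """Return (vertex coverage, directed edges) for a toy de Bruijn graph.

    Hypothetical helper: vertices are k-mers, and an edge links two k-mers
    that are adjacent in some read. Reverse complements are ignored.
    """
    coverage = Counter()
    edges = set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        coverage.update(kmers)               # one count per occurrence
        edges.update(zip(kmers, kmers[1:]))  # consecutive k-mers overlap by k-1
    return coverage, edges


# Invented toy data: three correct reads and one read with a C -> T error.
correct = "TTGACCGATAGCT"
erroneous = "TTGACTGATAGCT"
coverage, edges = build_de_bruijn([correct] * 3 + [erroneous])

# The single substitution creates k = 5 vertices seen only in the bad read.
low = sorted(kmer for kmer, count in coverage.items() if count == 1)
print(low)
```

With k = 5, one mismatched base changes exactly five 5-mers, which is why both alternative paths in the figure have 5 intermediate vertices.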
An example of Velvet-SC handling of a chimeric read is presented in Supplementary Figure S4.
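The contrast between a fixed and a progressively increasing coverage cutoff, as described for Figure 1, can be sketched on a four-contig toy graph. This is a simplified illustration under stated assumptions, not the Velvet-SC code: the helper names (`prune`, `condense`, `assemble`), the contig lengths, and the purple contig's coverage of 2 are invented, and a contig is removed when its average coverage is at or below the cutoff.

```python
def prune(nodes, edges, cutoff):
    """Drop contigs whose average coverage is at or below the cutoff."""
    kept = {name: lc for name, lc in nodes.items() if lc[1] > cutoff}
    kept_edges = {(a, b) for a, b in edges if a in kept and b in kept}
    return kept, kept_edges


def condense(nodes, edges):
    """Merge each unbranched link u -> v into one contig whose coverage
    is the length-weighted average of the two parts."""
    nodes, edges = dict(nodes), set(edges)
    merged = True
    while merged:
        merged = False
        for u, v in sorted(edges):
            out_u = sum(1 for a, _ in edges if a == u)
            in_v = sum(1 for _, b in edges if b == v)
            if u != v and out_u == 1 and in_v == 1:
                (len_u, cov_u), (len_v, cov_v) = nodes.pop(u), nodes.pop(v)
                name = u + "+" + v
                length = len_u + len_v
                nodes[name] = (length, (len_u * cov_u + len_v * cov_v) / length)
                rename = lambda n: name if n in (u, v) else n
                edges = {(rename(a), rename(b))
                         for a, b in edges if (a, b) != (u, v)}
                merged = True
                break  # edge set changed, so restart the scan
    return nodes, edges


def assemble(nodes, edges, cutoffs):
    """Apply each cutoff in turn, condensing the graph after every pruning."""
    nodes, edges = condense(nodes, edges)
    for cutoff in cutoffs:
        nodes, edges = prune(nodes, edges, cutoff)
        nodes, edges = condense(nodes, edges)
    return nodes


# Toy contigs as name -> (length, average coverage); values are illustrative.
nodes = {"red": (100, 12.0), "green": (50, 6.4),
         "blue": (50, 1.0), "purple": (30, 2.0)}
edges = {("red", "green"), ("red", "blue"),
         ("green", "purple"), ("blue", "purple")}

fixed = assemble(nodes, edges, [3])              # one high cutoff
progressive = assemble(nodes, edges, [1, 2, 3])  # cutoff raised step by step
print(fixed)
print(progressive)
```

In this toy run the single high cutoff discards the purple contig, while the stepwise schedule merges it into its higher-coverage neighbours at cutoff 1, so it survives the later cutoffs.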
Comparison of assemblies of known genomes (for contigs > 110 bp): number of contigs, genome N50, the length of the largest contig, total nucleotides in the assembly, substitution error rate in the assembled contigs (per 100 kbp), number of genes completely or partially present in the assembly, and number of operons completely or partially present in the assembly. Partial means that a gene and a contig (or an operon and a contig) have an overlap of at least 100 nucleotides. The best value for each criterion is indicated in bold. EULER-SR 2.0.1, Velvet 0.7.60, Velvet-SC, and EULER + Velvet-SC were run with k-mer size equal to 55. Edena 2.1.1 [38] was run with a minimum overlap of 55. SOAPdenovo 1.0.4 [39] was run with k=27–31. E+V-SC stands for EULER + Velvet-SC. Gene annotations were from http://www.ecogene.org/ (E. coli) and http://cmr.jcvi.org/cgi-bin/CMR/GenomePage.cgi?org=ntsa10 (S. aureus). Operon annotations (E. coli) were from http://csbl1.bmb.uga.edu/OperonDB/displayNC.php?id=215 [40]. Some of the contigs in the single cell assemblies represent contaminants.
| Dataset | Assembler | # contigs | N50 (bp) | Largest (bp) | Total (bp) | Subs. (per 100 kbp) | Known genes | Genes complete | Genes partial | Predicted operons | Operons complete | Operons partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | EULER-SR | 1344 | 26662 | | 4369634 | 16.1 | 4324 | 3178 | 627 | 884 | 553 | 248 |
| | EULER-SR | 1820 | 29551 | 170385 | 4469152 | 16.6 | | 3339 | 734 | | 561 | 283 |
| | EULER-SR | 295 | | | 4598020 | 3.5 | | 4119 | 115 | | 788 | 80 |
| | EULER-SR | 4398 | 7247 | 66549 | | 53.1 | 2622 | 1958 | 640 | - | nd | nd |
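The "complete" versus "partial" criterion in the table caption can be expressed as a small interval check. The sketch below is only an illustration: the function name `classify_feature` is hypothetical, and it assumes that genes and contigs have already been mapped to reference coordinates as half-open `(start, end)` intervals.

```python
def classify_feature(feature, contigs, min_overlap=100):
    """Classify a gene or operon as 'complete', 'partial' or 'absent'.

    feature: (start, end) on the reference, half-open.
    contigs: iterable of (start, end) intervals for mapped contigs.
    min_overlap: the 100-nucleotide threshold stated in the caption.
    """
    f_start, f_end = feature
    status = "absent"
    for c_start, c_end in contigs:
        if c_start <= f_start and f_end <= c_end:
            return "complete"  # feature lies entirely inside one contig
        overlap = min(f_end, c_end) - max(f_start, c_start)
        if overlap >= min_overlap:
            status = "partial"  # keep scanning in case another contig covers it
    return status


print(classify_feature((100, 400), [(0, 500)]))
print(classify_feature((100, 400), [(250, 600)]))
print(classify_feature((100, 400), [(350, 600)]))
```

Under this reading, a feature shorter than 100 nucleotides can only be counted as complete or absent.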
Figure 2. Comparison of contigs generated by Velvet vs. EULER+Velvet-SC for single cell E. coli lane 1. (a,b,c) Contigs are those presented in Table 1 and are ordered from largest to smallest number of bases. The y-axis shows (a) the cumulative length, (b) the cumulative number of genes, and (c) the cumulative number of operons in the contigs. EULER+Velvet-SC improves upon Velvet in all three plots. (d) Average read coverage over a 1000 bp window (top, log scale), Velvet contigs (middle) and EULER+Velvet-SC contigs (bottom), mapped along the E. coli reference genome, with vertical staggering to help visualize small contigs. Contigs in blue or green match between the assemblies. Contigs in red or orange differ between the assemblies: they either have substantially different lengths, are broken into a different number of contigs, or are present in one assembly but missing in the other.
Comparison of Velvet-based assembler results (k=55) on the SAR324_MDA assembly: total number of contigs; assembly N50 (for contigs > 110 bp); length of the largest contig (for contigs > 110 bp); total nucleotides in the assembly (for contigs > 110 bp); number of ORFs >20 bp predicted by MetaGene [22]; number of ORFs with phylogenetic assignments by APIS (see Methods); number of ORFs with COGs identified via BLAST (see Methods); and number of the 111 conserved single-copy genes present [30]. N50 is defined as the contig length such that contigs of that length or longer together make up half of the total assembly length.
| Assembler | # of contigs | N50 (bp) | Largest contig (bp) | Total (bp) | # ORFs (MetaGene) | # ORFs (APIS) | # COGs | # Conserved single-copy genes |
|---|---|---|---|---|---|---|---|---|
| Velvet | 1856 | 11531 | 100589 | 3921396 | 4575 | 2462 | 2160 | 55/111 (46%) |
| Velvet-SC | 933 | 23230 | 113282 | 4284882 | 4234 | 2627 | 2307 | 75/111 (67%) |
| E+V-SC | 823 | 30293 | 113282 | 4282110 | 4154 | 2604 | 2281 | 75/111 (67%) |
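The N50 definition in the caption can be computed directly from a list of contig lengths. The following is a minimal sketch: the function name `n50` and its optional `total` argument are assumptions, and passing a reference genome length as `total` gives a genome-based N50 instead of the assembly-based value.

```python
def n50(lengths, total=None):
    """Return the contig length L such that contigs of length >= L
    together cover at least half of `total`.

    total defaults to the summed contig lengths (assembly N50); passing
    a reference genome length is an assumed variant for a genome N50.
    Returns None if the contigs never reach half of `total`.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None


# Invented lengths; contigs of 110 bp or shorter are filtered out first,
# mirroring the "contigs > 110 bp" restriction in the caption.
contigs = [10000, 8000, 6000, 4000, 2000, 90]
kept = [length for length in contigs if length > 110]
print(n50(kept))
```

Sorting from longest to shortest and stopping at the first contig that brings the running sum to half of the total follows the stated definition directly.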
Figure 3. A 16S maximum likelihood tree of Deltaproteobacterial 16S sequences including SAR324_MDA (red). Sequences with species identification are from representative Deltaproteobacterial reference genomes in GenBank. The environmental 16S sequences (designated uncultured SAR324 or uncultured deltaproteobacteria) were retrieved from GenBank based on their accession numbers (see Fig. S3 of [27]). The sequences were aligned using MOTHUR [36]. The tree was inferred using the nucleotide maximum likelihood feature of PAUP* 4.0b10 [37]. Branches drawn in thick lines are clades with bootstrap support of 75% or greater. Sequences present on fosmids with extensive nucleotide similarity to the SAR324_MDA assembly are indicated (red star), as is a SAR324 fosmid (yellow star) encoding CoxL homologs also present in the SAR324_MDA assembly (see Supplementary Fig. S13).
Features of the SAR324_MDA single cell assembly (EULER + Velvet-SC). 3811 genes are those > 180 bp in length.
| 4.3 Mb | |
| 4.9-6.4 Mb | |
| 43% | |
| 20 types | |
| 17 of 21 types | |
| 1 each of 5S, 16S, 23S | |
| 3811 | |
| 75/111 (67%) | |
| 58/66 (87%) |