
SmashCell: a software framework for the analysis of single-cell amplified genome sequences.

Eoghan D Harrington, Manimozhiyan Arumugam, Jeroen Raes, Peer Bork, David A Relman.

Abstract

SUMMARY: Recent advances in single-cell manipulation technology, whole genome amplification and high-throughput sequencing have now made it possible to sequence the genome of an individual cell. The bioinformatic analysis of these genomes, however, is far more complicated than the analysis of those generated using traditional, culture-based methods. In order to simplify this analysis, we have developed SmashCell (Simple Metagenomics Analysis SHell, for sequences from single Cells). It is designed to automate the main steps in microbial genome analysis (assembly, gene prediction, functional annotation) in a way that allows parameter and algorithm exploration at each step in the process. It also manages the data created by these analyses and provides visualization methods for rapid analysis of the results.

AVAILABILITY: The SmashCell source code and a comprehensive manual are available at http://asiago.stanford.edu/SmashCell

CONTACT: eoghanh@stanford.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2010; PMID: 20966005; PMCID: PMC2982155; DOI: 10.1093/bioinformatics/btq564

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

The rapid evolution of DNA sequencing platforms has had a dramatic, beneficial impact on the study of microbial ecology and population genetics. So far, these benefits have mostly come from shotgun community metagenomics, which provides a high-level overview of the taxonomic and functional composition of microbial communities [see Arumugam et al. for details]. However, this approach is limited in its ability to yield complete genome sequences as well as the fine-scale genetic variation that defines population substructures within these communities. One possible solution uses a combination of single-cell manipulation technologies, multiple-displacement amplification (MDA) and high-throughput sequencing to generate single-cell amplified genomes (SAGs). This approach has already been used to characterize the genomes of uncultivated microbes (Marcy et al.; Woyke et al.), and as the throughput of the associated technologies increases, it should become possible to obtain high-resolution profiles of populations or communities. However, it is more difficult to produce a high-quality assembly and functional annotation from a SAG than from the output of traditional methods, due to limitations inherent in sample preparation and sequencing (detailed below). Overcoming these limitations requires an iterative, exploratory approach that transforms the traditional linear process of genome assembly, gene prediction and functional annotation into a tree-like structure, in which each branch is defined by a different choice of algorithm or parameters and one branch is chosen as the final version (Fig. 1A). This approach is not easily achieved using existing tools, which take an assembled genome as their input and do not allow parameterization of subsequent steps (for a comparison with existing tools see Supplementary Table). In order to automate the process and deal with the resulting combinatorial increase in data, we have created SmashCell. While developed for use on SAGs, many of its analyses are equally applicable to traditional microbial genome sequences and low-complexity metagenomes.

Fig. 1.

(A) The data model used in SmashCell is designed to reduce redundancy and facilitate the comparison of results using different parameters and/or algorithms [MC: metagenome collection, MG: metagenome (equivalent to a SAG), AS: assembly, GP: gene prediction, FUNC: functional annotation]. (B) K-mer frequency statistics supplement sequence similarity information to identify potential contaminants. This shows a self-organizing map (SOM) trained on the tetramer frequencies of an assembly. The left panel shows a series of pie charts highlighting the taxonomic identity (determined by best hit in GenBank, those with no hits are uncoloured) of the contigs assigned to each neuron. The right panel shows the U-matrix of the SOM. (C) The abundance of single-copy COGs can be used to assess genome completeness, the presence of contamination and the quality of the assembly. (D) SmashCell uses different graphs to aid in parameter and algorithm selection. Here the results from two different gene prediction algorithms are presented, along with GC-content, quality scores and read depth.
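The tetramer frequencies mentioned in panel B can be illustrated with a minimal Python sketch. This is not SmashCell's own code: the function name `tetramer_frequencies` and the example contigs are hypothetical, reverse-complement tetramers are not merged here, and the SOM training step itself is not shown.

```python
from collections import Counter
from itertools import product

# All 256 possible tetramers in a fixed order, so that every contig
# is mapped onto a vector with the same layout.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_frequencies(sequence):
    """Return normalized tetramer frequencies for one contig.

    Hypothetical helper for illustration; not part of SmashCell's API.
    """
    seq = sequence.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only windows made of A, C, G and T contribute; windows containing
    # ambiguous bases such as N are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


# Toy contigs, invented for this example.
contigs = {"contig_1": "ACGTACGTACGT", "contig_2": "GGGGGGGGCCCC"}
profiles = {name: tetramer_frequencies(seq) for name, seq in contigs.items()}
```

Each contig becomes a 256-dimensional vector; vectors of this kind are the sort of input on which a SOM or another clustering method could be trained to separate contigs with unusual composition.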

2 FEATURES

SmashCell automates the steps common to most genome analyses (assembly, gene prediction and functional annotation) and addresses some of the challenges posed by SAGs. For instance, it is difficult to isolate a single cell for sequencing without including some environmental DNA, in effect creating a metagenome. As a result, SmashCell includes both sequence similarity and k-mer based tools to identify potential contaminants, the latter being especially useful when the target genome and/or contaminants are not closely related to existing genome sequences (see Fig. 1B for details). Another challenge with SAGs is the orders-of-magnitude variance in MDA product abundance along the genome, which creates several obstacles to obtaining high-quality annotation. First, it hampers genome assembly, as most algorithms are designed for lower and more evenly distributed read depth. Second, it vastly increases the amount of sequencing required to obtain a complete genome sequence. To address the first challenge, SmashCell includes scripts to downsample overrepresented regions of the SAG. To address the second, SmashCell uses the STRING database (Jensen et al.) to obtain counts of single-copy orthologous groups (Fig. 1C), which can then be used to estimate genome completeness. In addition to these SAG-specific features, SmashCell contains genome visualization and other tools (Fig. 1D) that are generally applicable to genomic and metagenomic data. SmashCell uses the same basic data model as SmashCommunity [designed for shotgun community sequencing; Arumugam et al.]. As a result, several of the analyses available in SmashCell can be run on data generated by SmashCommunity and vice versa. Documentation for these and many more features is available on the SmashCell website.
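The use of single-copy orthologous group counts to estimate completeness can be sketched in Python as follows. This is an illustration under stated assumptions, not SmashCell's implementation: the function `summarize_single_copy_markers`, the marker identifiers and the counts are all hypothetical placeholders, and the real marker set would come from the STRING-derived orthologous groups.

```python
def summarize_single_copy_markers(observed_counts, marker_set):
    """Summarize how many expected single-copy markers were found.

    Hypothetical helper for illustration; not part of SmashCell's API.
    observed_counts maps a marker identifier to the number of predicted
    genes assigned to it; marker_set lists the markers expected once.
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")
    present = [m for m in marker_set if observed_counts.get(m, 0) >= 1]
    duplicated = [m for m in marker_set if observed_counts.get(m, 0) > 1]
    n = len(marker_set)
    return {
        # Fraction of expected markers seen at least once.
        "completeness": len(present) / n,
        # Markers seen more than once may point to contamination
        # or to a fragmented or redundant assembly.
        "duplicated_fraction": len(duplicated) / n,
        "missing": sorted(set(marker_set) - set(present)),
    }


# Placeholder identifiers and counts, invented for this example.
markers = ["COG0012", "COG0016", "COG0018", "COG0048"]
counts = {"COG0012": 1, "COG0016": 2, "COG0048": 1}
summary = summarize_single_copy_markers(counts, markers)
```

In this toy case three of four markers are present, so the estimated completeness is 0.75, and one marker appears twice. Comparing such summaries across different assemblies of the same SAG is one way the counts could inform parameter choice.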

3 DESIGN AND IMPLEMENTATION

SmashCell is a framework written in Python that provides a variety of analysis tools that can be used either from the command line or from within other Python scripts. The main function of SmashCell is to automate the common steps in genome analysis in a way that facilitates parameter and algorithm exploration. Using the data model shown in Figure 1A, SmashCell manages the files and data associated with each of these steps, reducing redundancy and providing a layer of abstraction that simplifies access to these data. SmashCell also uses generic databases to provide a common format for assembly and gene prediction information, allowing it to work with a variety of third-party assemblers and gene prediction algorithms. In order to facilitate the exploration of genomic data, SmashCell automatically generates many different types of graphs (e.g. see Fig. 1B–D) and provides wrappers for exploratory statistical techniques.
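The tree-shaped data model of Figure 1A can be pictured with a small Python sketch. The class `Node`, its methods and every name and parameter below are hypothetical and chosen only for illustration; they are not SmashCell's actual classes or storage layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """One result in the analysis tree (hypothetical, for illustration)."""
    level: str  # "MC", "MG", "AS", "GP" or "FUNC", as in Fig. 1A
    name: str
    params: Dict[str, object] = field(default_factory=dict)
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, level, name, **params):
        """Attach a child result produced with the given parameters."""
        child = Node(level, name, dict(params), parent=self)
        self.children.append(child)
        return child

    def lineage(self):
        """Return the path from the root collection down to this node."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return "/".join(reversed(path))


# Placeholder tool names and parameters, invented for this example.
mc = Node("MC", "collection")
mg = mc.add("MG", "sag_01")
as_a = mg.add("AS", "assembly_a", assembler="assembler_a", kmer=31)
as_b = mg.add("AS", "assembly_b", assembler="assembler_b")
gp = as_a.add("GP", "genes_1", predictor="predictor_x")
func = gp.add("FUNC", "annotation_1", database="db_y")
```

Because each assembly, gene prediction and annotation hangs off the result that produced it, two assemblies of the same SAG share one parent rather than duplicating the read data, and any final annotation can be traced back through the exact choices that led to it.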

References (4 in total)

1. SmashCommunity: a metagenomic annotation and analysis tool.

Authors: Manimozhiyan Arumugam; Eoghan D Harrington; Konrad U Foerstner; Jeroen Raes; Peer Bork
Journal: Bioinformatics Date: 2010-10-19 Impact factor: 6.937

2. Dissecting biological "dark matter" with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth.

Authors: Yann Marcy; Cleber Ouverney; Elisabeth M Bik; Tina Lösekann; Natalia Ivanova; Hector Garcia Martin; Ernest Szeto; Darren Platt; Philip Hugenholtz; David A Relman; Stephen R Quake
Journal: Proc Natl Acad Sci U S A Date: 2007-07-09 Impact factor: 11.205

3. STRING 8--a global view on proteins and their functional interactions in 630 organisms.

Authors: Lars J Jensen; Michael Kuhn; Manuel Stark; Samuel Chaffron; Chris Creevey; Jean Muller; Tobias Doerks; Philippe Julien; Alexander Roth; Milan Simonovic; Peer Bork; Christian von Mering
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

4. Assembling the marine metagenome, one cell at a time.

Authors: Tanja Woyke; Gary Xie; Alex Copeland; José M González; Cliff Han; Hajnalka Kiss; Jimmy H Saw; Pavel Senin; Chi Yang; Sourav Chatterji; Jan-Fang Cheng; Jonathan A Eisen; Michael E Sieracki; Ramunas Stepanauskas
Journal: PLoS One Date: 2009-04-23 Impact factor: 3.240


