
SmashCell: a software framework for the analysis of single-cell amplified genome sequences.

Eoghan D Harrington, Manimozhiyan Arumugam, Jeroen Raes, Peer Bork, David A Relman.

Abstract

SUMMARY: Recent advances in single-cell manipulation technology, whole genome amplification and high-throughput sequencing have now made it possible to sequence the genome of an individual cell. The bioinformatic analysis of these genomes, however, is far more complicated than the analysis of those generated using traditional, culture-based methods. In order to simplify this analysis, we have developed SmashCell (Simple Metagenomics Analysis SHell for sequences from single Cells). It is designed to automate the main steps in microbial genome analysis (assembly, gene prediction, functional annotation) in a way that allows parameter and algorithm exploration at each step in the process. It also manages the data created by these analyses and provides visualization methods for rapid analysis of the results.

AVAILABILITY: The SmashCell source code and a comprehensive manual are available at http://asiago.stanford.edu/SmashCell

CONTACT: eoghanh@stanford.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2010   PMID: 20966005   PMCID: PMC2982155   DOI: 10.1093/bioinformatics/btq564

Source DB: PubMed   Journal: Bioinformatics   ISSN: 1367-4803   Impact factor: 6.937


1 INTRODUCTION

The rapid evolution of DNA sequencing platforms has had a dramatic, beneficial impact on the study of microbial ecology and population genetics. So far, these benefits have mostly come from shotgun community metagenomics, which provides a high-level overview of the taxonomic and functional composition of microbial communities [see Arumugam et al. for details]. However, this approach is limited in its ability to yield complete genome sequences and to resolve the fine-scale genetic variation that defines population substructures within these communities. One possible solution combines single-cell manipulation technologies, multiple-displacement amplification (MDA) and high-throughput sequencing to generate single-cell amplified genomes (SAGs). This approach has already been used to characterize the genomes of uncultivated microbes (Marcy et al.; Woyke et al.), and as the throughput of the associated technologies increases it should become possible to obtain high-resolution profiles of populations or communities. However, it is more difficult to produce a high-quality assembly and functional annotation from a SAG than from the output of traditional methods, owing to limitations inherent in sample preparation and sequencing (detailed below). Overcoming these limitations requires an iterative, exploratory approach that transforms the traditional linear process of genome assembly, gene prediction and functional annotation into a tree-like structure, in which each branch is defined by a different choice of algorithm or parameters and one branch is chosen as the final version (Fig. 1A). This approach is not easily achieved using existing tools, which take an assembled genome as their input and do not allow parameterization of subsequent steps (for a comparison with existing tools see Supplementary Table). To automate the process and deal with the resulting combinatorial increase in data, we have created SmashCell. Although it was developed for use on SAGs, many of its analyses are equally applicable to traditional microbial genome sequences and low-complexity metagenomes.
Fig. 1.

(A) The data model used in SmashCell is designed to reduce redundancy and facilitate the comparison of results using different parameters and/or algorithms [MC: metagenome collection, MG: metagenome (equivalent to a SAG), AS: assembly, GP: gene prediction, FUNC: functional annotation]. (B) K-mer frequency statistics supplement sequence similarity information to identify potential contaminants. This shows a self-organizing map (SOM) trained on the tetramer frequencies of an assembly. The left panel shows a series of pie charts highlighting the taxonomic identity (determined by best hit in GenBank, those with no hits are uncoloured) of the contigs assigned to each neuron. The right panel shows the U-matrix of the SOM. (C) The abundance of single-copy COGs can be used to assess genome completeness, the presence of contamination and the quality of the assembly. (D) SmashCell uses different graphs to aid in parameter and algorithm selection. Here the results from two different gene prediction algorithms are presented, along with GC-content, quality scores and read depth.
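The tetramer frequencies mentioned for panel B can be illustrated with a minimal sketch. This is not SmashCell's own code: the function name `tetramer_frequencies` and the example contigs are hypothetical, reverse complements are not merged, and the SOM training step is omitted. It shows only how each contig might be turned into a 256-dimensional frequency vector that a SOM or another clustering method could take as input.

```python
from collections import Counter
from itertools import product

# All 256 possible tetramers over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_frequencies(sequence):
    """Return the normalized tetramer frequency vector of one contig.

    Windows containing characters other than A, C, G or T (for example N)
    are ignored. Hypothetical helper, not part of SmashCell.
    """
    seq = sequence.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Keep only windows that are valid tetramers.
    valid = [counts.get(k, 0) for k in TETRAMERS]
    total = sum(valid)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [c / total for c in valid]


# Made-up contigs, for illustration only.
contigs = {
    "contig_1": "ACGTACGTACGTACGT",
    "contig_2": "GGGGCCCCGGGGCCCCNNNN",
}
profiles = {name: tetramer_frequencies(seq) for name, seq in contigs.items()}
```

Contigs whose vectors lie far from the bulk of the assembly would be candidates for closer inspection; how such outliers are defined depends on the clustering method and is left open here.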

2 FEATURES

SmashCell automates the steps common to most genome analyses (assembly, gene prediction and functional annotation) and addresses some of the challenges posed by SAGs. For instance, it is difficult to isolate a single cell for sequencing without including some environmental DNA, in effect creating a metagenome. SmashCell therefore includes both sequence similarity and k-mer based tools to identify potential contaminants, the latter being especially useful when the target genome and/or contaminants are not closely related to existing genome sequences (see Fig. 1B for details). Another challenge with SAGs is that the abundance of MDA product varies by orders of magnitude along the genome, which creates several obstacles to obtaining high-quality annotation. First, it hampers genome assembly, as most algorithms are designed for lower and more evenly distributed read depth. Second, it vastly increases the amount of sequencing required to obtain a complete genome sequence. To address the first challenge, SmashCell includes scripts to downsample overrepresented regions of the SAG. To address the second, SmashCell uses the STRING database (Jensen et al.) to obtain counts of single-copy orthologous groups (Fig. 1C), which can then be used to estimate genome completeness. In addition to these SAG-specific features, SmashCell contains genome visualization and other tools (Fig. 1D) that are generally applicable to genomic and metagenomic data. SmashCell uses the same basic data model as SmashCommunity [designed for shotgun community sequencing; Arumugam et al.]. As a result, several of the analyses available in SmashCell can be run on data generated by SmashCommunity and vice versa. Documentation for these and many more features is available on the SmashCell website.
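As an illustration of how counts of single-copy orthologous groups can indicate completeness and possible contamination, the following is a minimal sketch under stated assumptions. The function `summarize_single_copy_cogs`, the marker identifiers and the simple fractions it reports are hypothetical and are not taken from SmashCell or STRING. The idea is that a marker expected once per genome but absent suggests an incomplete assembly, while a marker seen more than once may point to contamination or misassembly.

```python
from collections import Counter


def summarize_single_copy_cogs(observed_cogs, marker_cogs):
    """Summarize marker counts for one assembly (hypothetical helper).

    observed_cogs: iterable of orthologous-group identifiers, one per gene.
    marker_cogs: set of groups assumed to occur exactly once per genome.
    """
    if not marker_cogs:
        raise ValueError("marker_cogs must not be empty")
    # Count only genes assigned to one of the marker groups.
    counts = Counter(c for c in observed_cogs if c in marker_cogs)
    present = {c for c in marker_cogs if counts[c] >= 1}
    duplicated = {c for c in marker_cogs if counts[c] > 1}
    n = len(marker_cogs)
    return {
        # Fraction of markers seen at least once.
        "completeness": len(present) / n,
        # Fraction of markers seen more than once.
        "duplicated_fraction": len(duplicated) / n,
        "missing": sorted(set(marker_cogs) - present),
    }


# Placeholder identifiers, for illustration only.
markers = {"COG_A", "COG_B", "COG_C", "COG_D"}
observed = ["COG_A", "COG_B", "COG_B", "COG_X"]
summary = summarize_single_copy_cogs(observed, markers)
```

In this toy input two of the four markers are found, one of them twice, so the summary reports half the markers as present and a quarter as duplicated. Real marker sets are much larger, and any threshold for calling a genome complete or contaminated is a separate choice.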

3 DESIGN AND IMPLEMENTATION

SmashCell is a framework written in Python that provides a variety of analysis tools that can be used either from the command line or from within other Python scripts. The main function of SmashCell is to automate the common steps in genome analysis in a way that facilitates parameter and algorithm exploration. Using the data model shown in Figure 1A, SmashCell manages the files and data associated with each of these steps, reducing redundancy and providing a layer of abstraction that simplifies access to these data. SmashCell also uses generic databases to provide a common format for assembly and gene prediction information, allowing it to work with a variety of third-party assemblers and gene prediction algorithms. In order to facilitate the exploration of genomic data, SmashCell automatically generates many different types of graphs (e.g. see Fig. 1B–D) and provides wrappers for exploratory statistical techniques.
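The hierarchy in Figure 1A can be pictured as a tree in which each node records the parameters that produced it. The sketch below is an assumption-laden illustration, not SmashCell's actual classes or storage layer: the `Node` class, its methods and all tool and parameter names are hypothetical. Only the level abbreviations (MC, MG, AS, GP, FUNC) come from the figure legend.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Levels of the hierarchy named in Fig. 1A, from root to leaf.
LEVELS = ["MC", "MG", "AS", "GP", "FUNC"]


@dataclass
class Node:
    """One result in the analysis tree (hypothetical class)."""
    level: str
    name: str
    params: Dict[str, object] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def add(self, name, **params):
        """Attach a child one level below this node and return it."""
        idx = LEVELS.index(self.level)
        if idx + 1 >= len(LEVELS):
            raise ValueError(f"{self.level} is the last level")
        child = Node(LEVELS[idx + 1], name, params)
        self.children.append(child)
        return child

    def branches(self, prefix=()):
        """Yield every root-to-leaf path as a tuple of node names."""
        path = prefix + (self.name,)
        if not self.children:
            yield path
        for child in self.children:
            yield from child.branches(path)


# Made-up names and parameters, for illustration only.
mc = Node("MC", "collection")
mg = mc.add("sag_01")
as1 = mg.add("assembly_a", assembler="assembler_x", kmer=31)
as2 = mg.add("assembly_b", assembler="assembler_x", kmer=51)
gp1 = as1.add("genes_1", predictor="predictor_p")
gp2 = as1.add("genes_2", predictor="predictor_q")
gp1.add("annotation_1", database="db_d")
paths = list(mc.branches())
```

Because both gene predictions hang off the same assembly node, the assembly is stored once and shared, which is one way to read the reduced redundancy that the data model aims for.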
REFERENCES

1.  SmashCommunity: a metagenomic annotation and analysis tool.

Authors:  Manimozhiyan Arumugam; Eoghan D Harrington; Konrad U Foerstner; Jeroen Raes; Peer Bork
Journal:  Bioinformatics       Date:  2010-10-19       Impact factor: 6.937

2.  Dissecting biological "dark matter" with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth.

Authors:  Yann Marcy; Cleber Ouverney; Elisabeth M Bik; Tina Lösekann; Natalia Ivanova; Hector Garcia Martin; Ernest Szeto; Darren Platt; Philip Hugenholtz; David A Relman; Stephen R Quake
Journal:  Proc Natl Acad Sci U S A       Date:  2007-07-09       Impact factor: 11.205

3.  STRING 8--a global view on proteins and their functional interactions in 630 organisms.

Authors:  Lars J Jensen; Michael Kuhn; Manuel Stark; Samuel Chaffron; Chris Creevey; Jean Muller; Tobias Doerks; Philippe Julien; Alexander Roth; Milan Simonovic; Peer Bork; Christian von Mering
Journal:  Nucleic Acids Res       Date:  2008-10-21       Impact factor: 16.971

4.  Assembling the marine metagenome, one cell at a time.

Authors:  Tanja Woyke; Gary Xie; Alex Copeland; José M González; Cliff Han; Hajnalka Kiss; Jimmy H Saw; Pavel Senin; Chi Yang; Sourav Chatterji; Jan-Fang Cheng; Jonathan A Eisen; Michael E Sieracki; Ramunas Stepanauskas
Journal:  PLoS One       Date:  2009-04-23       Impact factor: 3.240
