Sujai Kumar, Martin Jones, Georgios Koutsovoulos, Michael Clarke, Mark Blaxter.
Abstract
Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment - whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species' genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.
Keywords: assembly; commensals; contaminants; metagenomics; next-generation sequencing; parasites; symbionts
Year: 2013 PMID: 24348509 PMCID: PMC3843372 DOI: 10.3389/fgene.2013.00237
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
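The abstract describes placing each draft contig on a plot by its GC proportion and read coverage, labelled with the taxon of its best database match, and then selecting subsets of contigs. The following is a minimal Python sketch of that idea, not the authors' pipeline (which is in the blobology repository). The `Contig` class, the function names, the example contigs, and the GC and coverage thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Contig:
    """One draft-assembly contig (hypothetical container for this sketch)."""
    name: str
    sequence: str
    mean_coverage: float          # mean read depth after mapping reads back
    taxon: Optional[str] = None   # taxon of the best database match, if any


def gc_fraction(sequence: str) -> float:
    """Proportion of G and C among unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def tagc_points(contigs: List[Contig]) -> List[Tuple[str, float, float, str]]:
    """One (name, GC, coverage, taxon label) tuple per contig."""
    return [
        (c.name, gc_fraction(c.sequence), c.mean_coverage,
         c.taxon or "Not annotated")
        for c in contigs
    ]


def select_bin(points, gc_range, min_coverage):
    """Names of contigs inside a GC window and above a coverage floor."""
    low, high = gc_range
    return [name for name, gc, cov, _ in points
            if low <= gc <= high and cov >= min_coverage]


# Hypothetical toy data: two low-GC contigs and one high-GC, high-coverage one.
example_contigs = [
    Contig("c1", "ATATATGC", 50.0, "Nematoda"),
    Contig("c2", "GCGCGCGA", 400.0, "Proteobacteria"),
    Contig("c3", "ATGCATAT", 45.0),
]
points = tagc_points(example_contigs)
host_bin = select_bin(points, gc_range=(0.2, 0.5), min_coverage=10.0)
print(host_bin)
```

In this toy example the unannotated contig `c3` falls into the same bin as `c1` because it shares its GC proportion and coverage, which is the kind of grouping the numerical indicators are meant to support when no database match is available.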
Software and databases used in this work.
| Software or database | Version | Source | Parameters | Notes |
| --- | --- | --- | --- | --- |
| fastq-mcf | 1.04.636 | | | |
| ABySS | 1.3.6 | | k-mer of 61 | Users may wish to change the k-mer value depending on the quality and length of their read data; it is not necessary to optimize this value. The program can also be run treating any paired (mate or paired-end) data as single-end. |
| Bowtie 2 | 2.1.0 | | -k 1 --very-fast-local | The settings used are designed to map reads uniquely and quickly |
| BLAST+ | 2.2.28 | | -task megablast -evalue 1e-5 -max_target_seqs 1 -outfmt '6 qseqid staxids' | |
| NCBI nt | March 1, 2013 | See | | |
| gc_cov_annotate.pl | 1.0 | This work | | |
| makeblobplot.R | 1.0 | This work | 0.01 taxlevel_order | 0.01 is the threshold for displaying annotated contigs, and taxlevel_order sets the taxon level to display |
| ggplot2 | | | | |
| NCBI taxonomy hierarchy files | March 2013 | | | |
| JQuery | 1.8.2 | | | Additional JQuery plugins used: jquery-ui, dropkick, tagsinput, placeholder, chardin.js |
| Raphael | 2.1.0 | | | Additional Raphael plugins used: raphael.export |
| | WS230 | See | | |
| | NEMBASE4 | See | | |
| | 1.0 | Unpublished data from the | | |
| CEGMA | 2.4 | | | |
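To make the Bowtie 2 and BLAST+ settings in the table concrete, the sketch below assembles the two command lines as argument lists in Python. It is an illustration, not the authors' script. The flags `-k 1 --very-fast-local` and `-task megablast -evalue 1e-5 -max_target_seqs 1 -outfmt '6 qseqid staxids'` come from the table; the input and output options (`-x`, `-1`, `-2`, `-S`, `-query`, `-db`, `-out`) are the standard ones for these tools but are not listed in the table, and every file name is a hypothetical placeholder.

```python
import shlex
from typing import List


def bowtie2_command(index: str, reads_1: str, reads_2: str,
                    sam_out: str) -> List[str]:
    """Bowtie 2 call using the table's mapping settings."""
    return [
        "bowtie2",
        "-k", "1", "--very-fast-local",   # settings from the table
        "-x", index,                       # index prefix (assumed name)
        "-1", reads_1, "-2", reads_2,      # paired-end FASTQ files (assumed)
        "-S", sam_out,                     # SAM output path (assumed)
    ]


def blastn_command(query_fasta: str, database: str,
                   tsv_out: str) -> List[str]:
    """blastn call using the table's megablast settings."""
    return [
        "blastn",
        "-task", "megablast",
        "-query", query_fasta,             # contig FASTA (assumed name)
        "-db", database,                   # e.g. a local copy of NCBI nt
        "-evalue", "1e-5",
        "-max_target_seqs", "1",
        "-outfmt", "6 qseqid staxids",     # contig ID and taxon IDs per hit
        "-out", tsv_out,
    ]


# Hypothetical file names, printed as shell-quoted command lines.
map_cmd = bowtie2_command("contigs_index", "lib_1.fq", "lib_2.fq", "lib.sam")
blast_cmd = blastn_command("contigs.fa", "nt", "contigs.nt.tsv")
print(shlex.join(map_cmd))
print(shlex.join(blast_cmd))
```

Building the commands as lists keeps the two-word `-outfmt` value as a single argument, so it can be passed to `subprocess.run` without shell quoting problems.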
Sequence data for Caenorhabditis sp. 5.
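| Strain | Insert size (bp) | Platform and read type | Raw reads | Raw data (Gb) | Reads after cleaning | Data after cleaning (Gb) | Accession |
| --- | --- | --- | --- | --- | --- | --- | --- |
| JU800 | 300 | HiSeq2000 101 b PE | 88.6 M pairs | 17.9 | 86.9 M pairs | 17.3 | ERR138445 |
| JU800 | 600 | HiSeq2000 101 b PE | 52.4 M pairs | 10.6 | 49.4 M pairs | 9.6 | ERR138446 |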
Gb, gigabases; PE, paired end; M, million.
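The read counts and gigabase figures in the table can be cross-checked with simple arithmetic: each pair contributes two reads of 101 bases. The Python sketch below does this, assuming that the 88.6 M and 52.4 M figures are counts of read pairs (the arithmetic agrees with the reported gigabases under that assumption). The function names are hypothetical, and because the table values are rounded, the results are approximate.

```python
READ_LENGTH = 101  # bases per read, from "101 b PE" in the table


def raw_yield_gb(million_pairs: float, read_length: int = READ_LENGTH) -> float:
    """Gigabases produced by a number of read pairs (in millions)."""
    return million_pairs * 1e6 * 2 * read_length / 1e9


def mean_read_length(gigabases: float, million_pairs: float) -> float:
    """Mean bases per read implied by a data volume and a pair count."""
    return gigabases * 1e9 / (million_pairs * 1e6 * 2)


# Raw data: pairs x 2 reads x 101 b reproduces the reported gigabases.
print(round(raw_yield_gb(88.6), 1))   # 17.9
print(round(raw_yield_gb(52.4), 1))   # 10.6

# Second pair of columns: fewer bases per read than 101, consistent with trimming.
print(round(mean_read_length(17.3, 86.9), 1))   # 99.5
print(round(mean_read_length(9.6, 49.4), 1))    # 97.2
```

The second pair of figures implies mean read lengths slightly below 101 b, which is what one would expect if the later columns describe reads after quality and adapter trimming.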
Assembly statistics for Caenorhabditis sp. 5.
| Span (bp) | 160,970,414 | 25,566,044 | 135,507,189 |
| Number of contigs | 12,264 | 2,148 | 10,120 |
| N50 of contigs (bp) | 32,806 | 44,901 | 31,396 |
| CEGMA completeness | 97.58% | – | 96.37% |
| Representation of | 98.1% | – | 98.1% |
| Representation of | 97.41% | – | 97.42% |
| Matches to | 79.04% | – | 79.04% |
The contigs counted here may be scaffolds, as they may contain “N” base calls.
The Caenorhabditis sp. 5 expressed sequence tag dataset includes 2,265 unigene sequences.
The Caenorhabditis sp. 5 RNA-Seq transcriptome assembly contains 30,756 unigene sequences.
Caenorhabditis briggsae is the closest fully sequenced Caenorhabditis species to Caenorhabditis sp. 5. Its proteome contains 21,961 entries.
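The assembly statistics above report the N50 of contigs. As a reminder of how that figure is defined, here is a minimal Python sketch: N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembly span. The function name and the toy lengths are hypothetical; the contig lengths behind the table are not given in this text.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half the span."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 needs at least one contig length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:   # reached half of the total span
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: span is 30, and 10 + 8 = 18 is the first total >= 15.
print(n50([10, 8, 6, 4, 2]))   # 8
```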