Rei Kajitani, Kouta Toshimoto, Hideki Noguchi, Atsushi Toyoda, Yoshitoshi Ogura, Miki Okuno, Mitsuru Yabana, Masayuki Harada, Eiji Nagayasu, Haruhiko Maruyama, Yuji Kohara, Asao Fujiyama, Tetsuya Hayashi, Takehiko Itoh.
Abstract
Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.
Year: 2014 PMID: 24755901 PMCID: PMC4120091 DOI: 10.1101/gr.170720.113
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Schematic overview of the Platanus algorithm. (A) In Contig-assembly, a de Bruijn graph is constructed from the read set. Short branches caused by errors are removed by “tip removal.” Short repeats are resolved by k-mer extension, in which previous graphs and reads are mapped to nearby k-mers at the junctions. Finally, bubble structures caused by heterozygosity or errors are removed. Subgraphs without any junctions represent contigs. (B) In Scaffolding, links between contigs are detected using paired reads. The relationship between contigs is represented by the graph. Bubbles removed in Contig-assembly are remapped on contigs and utilized for mapping of paired-end reads and detection of heterozygous contigs. Heterozygous regions are removed as bubble or branch structures on the graph by the “bubble removal” or “branch cut” step. These simplification steps are characteristic of Platanus and especially effective for assembling complex heterozygous regions. (C) In Gap-close, paired reads are mapped on scaffolds, and reads mapped at nearby gaps are collected for each gap. If a contig is expected to cover the gap and is constructed from collected reads, the gap is closed by the contig.
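The first two Contig-assembly steps in the caption (graph construction and tip removal) can be illustrated with a small sketch. This is a generic textbook version, not the Platanus implementation: the function names `build_de_bruijn` and `remove_tips` and the `max_tip_len` parameter are hypothetical, reverse complements are ignored, and k-mer coverage, which a real assembler would likely take into account, is not used.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def remove_tips(graph, max_tip_len):
    """Drop dead-end branches of at most max_tip_len nodes at junctions."""
    # Copy successors and make sure every node, including sinks, has an entry.
    succ = {u: set(vs) for u, vs in graph.items()}
    for vs in list(succ.values()):
        for v in vs:
            succ.setdefault(v, set())
    # Build the predecessor sets from the successor sets.
    pred = {u: set() for u in succ}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)

    for junction in [u for u in succ if len(succ[u]) > 1]:
        for start in sorted(succ[junction]):
            if len(succ[junction]) < 2:
                break  # always keep one way out of the junction
            path, node = [], start
            # Walk along the unbranched path that leaves the junction.
            while (len(pred[node]) == 1 and len(succ[node]) <= 1
                   and len(path) < max_tip_len):
                path.append(node)
                if not succ[node]:
                    break
                node = next(iter(succ[node]))
            # A tip is a short path whose last node is a dead end.
            if path and not succ[path[-1]]:
                succ[junction].discard(start)
                for n in path:
                    del succ[n]
                    del pred[n]
    return succ


# Toy data (assumed): one correct read and one read ending in an error.
graph = build_de_bruijn(["ATGGCGTGCA", "GGCGA"], k=4)
cleaned = remove_tips(graph, max_tip_len=2)
```

In the toy data, the erroneous read adds a one-node dead end (`CGA`) at node `GCG`; the sketch removes it and leaves the longer correct path intact because that path exceeds `max_tip_len`.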
Summary of the assemblies
Figure 2. Distribution of the number of 17-mer occurrences. (A) Schematic model of the distribution of k-mer occurrences. This distribution is related to that shown in Table 1. (B) Simulated heterozygous data from C. elegans. (C) Distributions of normalized 17-mer occurrences for all species.
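A distribution of k-mer occurrences like the one in Figure 2 can be computed, in principle, by counting every k-mer in the reads and then tallying how many distinct k-mers share each count. The sketch below is a minimal in-memory illustration with hypothetical names (`canonical`, `kmer_occurrence_histogram`); it assumes reads contain only A, C, G, and T, merges each k-mer with its reverse complement (a common convention, assumed here rather than taken from the paper), and would not scale to real sequencing data, for which dedicated k-mer counters are used.

```python
from collections import Counter

# Translation table for complementing a DNA string (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def kmer_occurrence_histogram(reads, k=17):
    """Return {occurrence count: number of distinct canonical k-mers}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    # Tally how many distinct k-mers were seen once, twice, and so on.
    return Counter(counts.values())


# Toy reads (assumed) with k=3 so the example stays readable.
hist = kmer_occurrence_histogram(["ACGTA", "ACGTA", "CGTAC"], k=3)
```

For heterozygous samples, such a histogram typically shows a second peak at roughly half the occurrence count of the main peak, which is the kind of pattern the schematic in panel A appears to model.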
Mismatches, small indels, and the ‘N’ rate in C. elegans (heterozygosity 0.0%) assembly
Figure 3. Results of the benchmarks of heterozygosity simulations (C. elegans). (A) Corrected scaffold-NG50 calculated by GAGE. (B) Corrected contig-NG50. (C) Number of errors reported by GAGE. Errors are defined as inversion, relocation, or translocation.
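For readers unfamiliar with the NG50 metric used in these benchmarks, the sketch below shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the genome size. The function name `ng50` is hypothetical, and the sketch computes the plain statistic only; it does not reproduce the error-correction step that GAGE applies before reporting "corrected" values.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of contig or scaffold lengths for a given genome size.

    Returns 0 if the sequences cover less than half of the genome.
    """
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        # Stop at the first sequence that brings coverage to 50% or more.
        if total * 2 >= genome_size:
            return length
    return 0


# Toy lengths (assumed). With a 200-unit genome, 50 + 40 + 30 = 120 >= 100.
lengths = [50, 40, 30, 20, 10]
value = ng50(lengths, genome_size=200)

# N50 is the same computation with the assembly's own total length.
n50 = ng50(lengths, genome_size=sum(lengths))
```

Using the genome size rather than the assembly size as the denominator makes values comparable across assemblers that produce different total assembly lengths.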
Statistics and validations of S. venezuelensis assemblies
Figure 4. Example of a heterozygous region resolved by “bubble removal” and “branch cut.” (A) Schematic model of “bubble removal” in Platanus scaffolding. (B) Alignment dot plot between two fosmids. Green lines and red dots indicate alignments and mismatches, respectively. Red and blue boxes indicate the regions corresponding to the bubbles. (C) Schematic model of “branch cut” in Platanus scaffolding. (D) Alignment dot plot between two fosmids. Green lines and red dots indicate alignments and mismatches, respectively. The blue arrow indicates the position corresponding to the root of the branch.
Statistics and validations of the oyster assemblies using BAC and RNA-contigs
Run time and peak memory usage