Literature DB >> 26072512

Misassembly detection using paired-end sequence reads and optical mapping data.

Martin D Muggli¹, Simon J Puglisi¹, Roy Ronen¹, Christina Boucher¹.

Abstract

MOTIVATION: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used stimulated optical mapping data for loblolly pine and F.tularensis and used real optical mapping data for rice and budgerigar.
RESULTS: Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F.tularensis and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar.
AVAILABILITY AND IMPLEMENTATION: misSEQuel can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/.

Entities: Chemical Disease Species

Mesh：

Year: 2015 PMID： 26072512 PMCID： PMC4542784 DOI： 10.1093/bioinformatics/btv262

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Comparing genetic variation between and within a species is a fundamental activity in biological research. For example, there is currently a major effort to sequence entire genomes of agriculturally important plant species to identify parts of the genome variable in a given breeding program and, ultimately, create superior plant varieties. Robust genome assembly methods are imperative to these large sequencing initiatives and other scientific projects (Haussler ; Ossowski ; Robinson ; Turnbaugh ) because scientific analyses frequently use those genomes to determine genetic variation and associated biological traits. At present, the majority of assembly programs are based on the Eulerian assembly paradigm (Idury and Waterman 1995; Pevzner ), where a de Bruijn graph is constructed with a vertex v for every -mer present in a set of reads and an edge for every observed k-mer in the reads with -mer prefix v and -mer suffix . A contig corresponds to a non-branching path through this graph. We refer the reader to Compeau for a more thorough explanation of de Bruijn graphs and their use in assembly. SPAdes (Bankevich ), IDBA (Peng ), Euler-SR (Chaisson and Pevzner 2008), Velvet (Zerbino and Birney 2008), SOAPdenovo (Li ), ABySS (Simpson ) and ALLPATHS (Butler ) all use this paradigm and follow the same general outline: extract k-mers from the reads, construct a de Bruijn graph from the set k-mers, simplify the graph and construct contigs. One crucial problem that persists in Eulerian assembly (and genome assembly, in general) is the discovery and correction of misassembly errors in draft genomes. We define a misassembly error as an assembled region that contains a significantly large insertion, deletion, inversion, or rearrangement that is the result of decisions made by the assembly program. Identification of misassembly errors is important because true biological variations manifest in similar ways, and thus, these errors can be easily misconstrued as true genetic variation (Salzberg 2005). This can mislead a range of genomic analyses. We note that the exact definition of a misassembly error can vary and adopt the standard definition used by QUAST (Gurevich ) and other tools. See Section 3.1 for this exact definition. Once the existence and location of a misassembly are identified, it can be removed by segmenting the contig at that location. We present a computational method for identifying misassembly errors using a combination of short reads and optical mapping data. Optical mapping is a system developed in 1993 (Schwartz ) that can construct ordered, genome-wide, high-resolution restriction maps. The system works as follows (Aston and Schwartz 2006; Dimalanta ): an ensemble of DNA molecules adhered to a charged glass plate is elongated by fluid flow. An enzyme is then used to cleave them into fragments at loci where the corresponding recognition sequence occurs. Next, the fragments are highlighted with fluorescent dye and imaged under a microscope. Finally, these images are analyzed to estimate the fragment sizes, producing a molecular map. Since the fragments stay relatively stationary during the aforementioned process, the images capture their relative order and size (Neely ). Multiple copies of the genome undergo this process, and a consensus map is formed that consists of an ordered sequence of fragment sizes, each indicating the approximate number of bases between occurrences of the recognition sequence in the genome (Anantharaman and Mishra 2001). Although optical mapping data have been used for discerning structural variation in the human genome (Teague ) and for scaffolding and validating contigs for several large sequencing projects—including those for various prokaryote species (Reslewic ; Zhou , 2004), rice (Zhou ), maize (Zhou ), mouse (Church ), goat (Dong ), parrot (Howard ) and Amborella trichopoda (Chamala )—there are no publicly available tools for using this data for misassembly detection using short read and optical mapping data. In 2014, Mendelowitz and Pop (2014) further this point stating that ‘There is, thus, a critical need for the continued development and public release of software tools for processing optical mapping data, mirroring the tremendous advances made in analytical methods for second- and third-generation sequencing data’. Our tool, which we call sc>/sc>, predicts which contigs are misassembled and the approximate locations of the errors in the contigs. It takes as input the paired-end sequence read data, contigs, an ensemble of optical maps and the restriction enzymes used to construct the optical maps. misSEQuel first uses the paired-end read data to divide the contigs into two sets: those that are predicted to be correctly assembled and those that are not. Then the set of contigs that are candidates for containing misassembly errors are further divided into misassembled contigs and correctly assembled contigs using optical mapping data. Fundamental to the first step is the concept of a red–black positional de Bruijn graph, which encapsulates recurring artifacts in the alignment of the sequence read data to the contigs and their position in the contig. The red vertices in this graph indicate if a contig is likely to be misassembled and also flag the location where the misassembly error occurs. These locations are called misassembly breakpoints. In the second stage of misSEQuel where optical mapping data are used, the contigs conjectured to be misassembled are in silico digested with the set of input restriction enzymes and aligned to the optical map using Twin (Muggli ). Based on the presence or absence of alignment, a prediction of misassembly is made. The in silico digestion process computationally mimics how each restriction enzyme would cleave the segment of DNA defined by the contig, returning ‘mini-optical maps’ that can be aligned to the optical map for the whole genome. An important aspect of our work is that it highlights the need to use another source of information, which is independent of the sequence data but representative of the same genome, to identify misassembly errors. We show that optical mapping data can be used as this information source. We give results for the Francisella tularensis, loblolly pine, rice and budgerigar genomes. Each genome was assembled using various de Bruijn graph assemblers, and then misassembly errors were predicted. We present results for both real and simulated optical mapping data; simulated data were generated for the F.tularensis and loblolly pine genomes, and real optical mapping data for the rice and budgerigar genomes. Our results on F.tularensis show that misSEQuel correctly identifies (on average) 86% and 80% of locally and extensively misassembled contigs, respectively. This is a considerable improvement on existing methods, which identified (on average) 26% and 16% of locally and extensively misassembled contigs, respectively, in the same assemblies. The results on the loblolly pine genome assemblies show similar improvement. Out of the 499 extensively and 3 locally misassembled contigs in the SOAPdenovo assembly of rice, misSEQuel correctly identified 374 (75%) and 3 (100%) of them, respectively. Competing methods identified between 25% and 30% of these extensively misassembled contigs, and none of these locally misassembled contigs. Lastly, we downloaded the latest Illumina-454 hybrid assembly of budgerigar that was released by Ganapathy , and predicted misassembly errors using the accompanied Illumina paired-end data and optical mapping data. misSEQuel correctly identified 10 777 of the 13 996 extensively misassembled contigs (77%) and 2350 (out of 2937) locally misassembled contigs (80%). Hence, we tested our method across four different genomes, which all vary in size and GC content. Our conclusions, based on these experimental results, are that the specificity of misSEQuel significantly increases by incorporating optical mapping data into the prediction of misassembly errors, and the sensitivity of misSEQuel is substantially better in comparison to competing methods that just use paired-end data. Therefore, we show evidence that optical mapping data can be a powerful tool for misassembly identification.

1.1 Related work

Amosvalidate (Phillippy ), REAPR (Hunt ) and Pilon (Walker ) are capable of identifying and correcting misassembly errors. REAPR is designed to use both short insert and long insert paired-end sequencing libraries; however, it can operate with only one of these types of sequencing data. Amosvalidate, which is included as part of the AMOS assembly package (Treangen ), was developed specifically for first generation sequencing libraries (Phillippy ). iMetAMOS (Koren ) is an automated assembly pipeline that provides error correction and validation of the assembly. It packages several open-source tools and provides annotated assemblies that result from an ensemble of tools and assemblers. Currently, it uses REAPR for misassembly error correction. Pilon (Walker ) detects a variety (including misassembly) of errors in draft genomes and variant detection. Similar to REAPR, Pilon is specifically designed to use short insert and long insert libraries but unlike REAPR and amosvalidate, it is specifically designed for microbial genomes. Many optical mapping tools exist and deserve mentioning, including AGORA (Lin ), SOMA (Nagarajan ) and Twin (Muggli ). AGORA (Lin ) uses the optical map information to constrain de Bruijn graph construction with the aim of improving the resulting assembly. SOMA (Nagarajan ) uses dynamic programming to align in silico-digested contigs to an optical map. Twin (Muggli ) is an index-based method for aligning contigs to an optical map. Because of its use of an index data structure, it is capable of aligning in silico-digested contigs orders of magnitude faster than competing methods. Xavier demonstrated misassembly errors in bacterial genomes can be detected using proprietary software. Lastly, there are special purpose tools that have some relation to misSEQuel in their algorithmic approach. Numerous assembly tools use a finishing process after assembly, including Hapsembler (Donmez and Brudno 2011), LOCAS (Klein ), Meraculous (Chapman ) and the ‘assisted assembly’ algorithm (Gnerre ). Hapsembler (Donmez and Brudno 2011) is a haplotype-specific genome assembly toolkit that is designed for genomes that are highly polymorphic. RACA (Kim ) and SCARPA (Donmez and Brudno 2013) are two scaffolding algorithms that perform paired-end alignment to the contigs as an initial step and, thus, are similar to our algorithm in that respect.

2 Methods

misSEQuel can be broken down into four main steps: recruitment of reads to contigs, construction of the red–black positional de Bruijn graph, misassembly error prediction and misassembly verification using optical mapping data. We explain each of these steps in detail in the following subsections.

2.1 Recruitment of reads and threshold calculation

misSEQuel first aligns reads to contigs to identify regions that contain abnormal read alignments. Collapsed or expanded repeats will present as the read coverage being greater or lower than the expected genome coverage in the region that has been misassembled. Similarly, inversion and rearrangement errors will present as the alignment of the mate-pairs being rearranged. Figure 1 illustrates these concordant and discordant read alignments. More specifically, this step consists of aligning all the (paired-end) reads to all the contigs and then calculating three thresholds, Δ, Δ and Γ. The range defines the acceptable read depth, and Γ defines the maximum allowable number of reads whose mate-pair aligns in an inverted orientation. To calculate these thresholds, we consider all alignments of each read as opposed to just the best alignment of each read since misassembly errors frequently occur within repetitive regions where the reads will align to multiple locations. misSEQuel performs this step using BWA (version 0.5.9) in paired-end mode with default parameters (Li and Durbin 2009). Subsequently, after alignment, each contig is treated as a series of consecutive 200-bp regions. These are sampled uniformly at random times, and the mean (µ) and the standard deviation (σ) of the read depth and the mean (µ) and the standard deviation (σ) of the number of alignments where a discordant mate-pair orientation is witnessed are calculated from these sampled regions. Δ is set to the maximum of , Δ is set to and Γ is set to . The default for is th of the contig length, and this parameter can be changed in the input to misSEQuel.

Fig. 1.

An illustration about the systematic alterations that occur with rearrangements, inversions, collapsed repeats and expanded repeats. (a) Proper read alignment where mate-pair reads have the correct orientation and distance from each other. A rearrangement or inversion will present itself by the orientation of the reads being incorrect and/or the distance of the mate-pairs being significantly smaller or significantly larger than the expected insert size. This is shown in (b) and (c), respectively. (d) The proper read depth, which is uniform across the genome. (e) A collapsed repeat, which results in the read depth being greater than expected. (f) A expanded repeat, which results in the read depth being lower than expected An example illustrating the red black positional de Bruijn graph (), the positional de Bruijn graph and the de Bruijn graph on a set of aligned reads, with their corresponding sets of k-mers and positional k-mers. There exists a region in the genome that extremely high coverage, which would suggest a possible misassembly error. Namely, the positional k-mers (GCCA, 111), (CCAT, 112) and (CATT, 113) have multiplicity 10, whereas all other positional k-mers have multiplicity 5. In the de Bruijn graph where the position is not taken into account, all k-mers have multiplicity of 10 and there is no evidence of a misassembled region. We note that in this example no vertex gluing operations occur but in more complex instances, vertex gluing will occur when equal k-mers align at adjacent positions

2.2 Construction of the red–black positional de Bruijn graph

After threshold calculation, the red–black positional de Bruijn graph is constructed. For clarity, we begin by describing the positional de Bruijn graph, given by Ronen and then define the red–black positional de Bruijn graph. Although the edges in the traditional de Bruijn graph correspond to k-mers, the edges in the positional de Bruijn graph correspond to k-mers and their inferred positions on the contigs (positional k-mers). Hence, the positional de Bruijn graph is defined for a multiset of positional k-mers and parameter and is constructed in a similar manner to the traditional de Bruijn graph using an A-Bruijn graph framework from (Pevzner ). Given a k-mer s, let be the first nucleotides of s, and be the last nucleotides of s. Each positional k-mer in the input multiset corresponds to a directed edge in the graph between two positional -mers, and . After all edges are formed, the graph undergoes a gluing operation. A pair of positional -mers, and , are glued together into a single vertex if and . Two positional -mers are glued together if their sequences are the same and their positions are within from each other. We refer to the multiplicity of a positional -mer as the number of occurrences where clustered at position p. misSEQuel constructs the red–black positional de Bruijn graph from the alignment of the reads to the contigs. The red–black positional de Bruijn graph contains positional k-mers and is constructed in an identical way as the positional de Bruijn graph with the addition that each vertex (-mer) has an associated red or black color attributed to it that is defined using Δ, Δ and Γ. In addition to the multiplicity of each positional -mer, the number of positional -mers that originated from a read whose mate-pair did not align in the conventional direction is stored at each vertex. When the multiplicity is less than Δ or greater than Δ or if the observed frequency of discordant mate-pair orientation is greater than Γ, then the vertex is red; otherwise it is black.

2.3 Misassembly conjecture and breakpoint estimation

A red–black positional de Bruijn graph is constructed for each contig, and misassembly errors in each contig are detected by searching for consecutive red vertices in the corresponding graph. Depth-first search is used for the graph traversal. If there are greater than 50 consecutive red vertices, then the contig is conjectured to be misassembled. The breakpoint in the contig can be determined by recovering the position of the corresponding red vertices (e.g. the positional -mers). The number of consecutive red vertices needed to consider it misassembled can be changed via a command line parameter in misSEQuel. Our experiments were performed with the default (e.g. 50), which corresponds to a region in the contig that has length 50 bp. After this stage of the algorithm, we take contigs having regions exceeding that threshold as a set of contigs that are conjectured to be misassembled and their transitions in and out of those regions as breakpoints.

2.4 Misassembly verification

Lastly, we use optical mapping data to verify whether a contig that is conjectured to be misassembled indeed is. Verification is based on the expectation that, after in silico digestion, a correctly assembled contig has a sequence of fragment sizes that is similar to that in the optical map at the corresponding locus in the genome. In other words, an in silico-digested contig should align to some region of the optical map since both are derived from the same region in the genome. Conversely, as misassembled contigs are not faithful reconstructions of any part of the genome, when in silico digested, their sequence of fragments will likewise not have a corresponding locus in the optical map to which it aligns. Optical maps contain measurement error at each fragment size, so some criteria is needed to decide whether variation in fragment size of an in silico-digested contig and that of an optical map at a particular locus is due to variation in the size of the physical fragments or a consequence of optical measurement error. Because of this ambiguity, and the necessary tolerances to ensure correctly assembled contigs align to the locus in the optical map, misassembled contigs may also align to loci in the optical map, which by coincidence have a fragment sequence similar to the contig within the threshold margin of error. Although there are various sophisticated approaches to determining statistical significance of an alignment, such as by Sarkar , we use a model discussed by Nagarajan and take the cumulative density function 0.85 as evidence of alignment, which we found to work well empirically. In addition, a misassembled contig only fails to align to the optical map if the enzyme recognition sequence, and thus the cleavage sites, exist in the contig in a manner that disrupts a good alignment (e.g. a misassembled contig with an inverted segment may still align if cleavage sites flank the inverted segment). This implies that (i) some enzymes produce optical maps that have greater performance in identifying misassembly errors and (ii) alignment to the optical map is not as strong evidence for correct assembly as non-alignment to the optical map is for misassembly. This leads to the conclusion that an ensemble of optical maps (each made with a different enzyme) has a greater chance at revealing misassembly errors than a single optical map. As acquiring three optical maps for one genome is reasonably accessible for many sequencing projects, the process of in silico digestion and alignment is repeated for three enzymes. A contig is deemed to be misassembled if it fails to align to any one of the three optical maps. The alignment is performed using Twin (Muggli ) (with default parameters) and then these results are filtered according to the model mentioned above. For our experiments, optical maps were simulated by in silico digesting reference genomes, adding normally distributed noise with a 150 bp standard deviation and discarding fragments smaller than 700 bp.

3 Analyses

3.1 Datasets using simulated optical mapping data

Our first dataset consisted of approximately 6.9 million paired-end 101 bp reads from the prokaryote genome F.tularensis, generated by Illumina Genome Analyzer (GA) IIx platform. It was obtained from the NCBI Short Read Archive [accession number (SRA:SRR063416)]. The reference genome was also downloaded from the NCBI website [Reference genome (RefSeq:NC_006570.2)]. The F.tularensis genome is 1 892 775 bp in length with a GC content of 32%. As a measure of quality assurance, we aligned the reads to the F.tularensis genome using BWA (version 0.5.9) (Li and Durbin 2009) with default parameters. We call a read mapped if BWA outputs an alignment for it and unmapped otherwise. Analysis of the alignments revealed that 97% of the reads mapped to the reference genome representing an average depth of approximately . Our second dataset consisted of approximately 31.3 million paired-end 100 bp reads from the loblolly pine (Pinus taeda) genome (Neale ), which has GC contact of 38%. We downloaded the reference genome from the pine genome website (http://pinegenome.org/pinerefseq) and simulated reads from the largest five hundred scaffolds from the reference using ART (Huang ) (‘art illumina’). ART was ran with parameters that simulated 100 bp paired end reads with 200 bp insert size and 50x coverage. The data for this experiment are available on the misSEQuel website. We simulated an optical map using the reference genome for F.tularensis and loblolly pine since there is no publicly available one for these genomes. We assembled both sets of reads with a wide variety of state-of-the-art assemblers. The versions used were those that were publicly available before or on September 1, 2014: SPAdes (version 3.1) (Bankevich ); Velvet (version 1.2.10) (Zerbino and Birney 2008); SOAPdenovo (version 2.04) (Li ); ABySS (version 1.5.2) (Simpson ) and IDBA-UD (version 1.1.1) (Peng ). SPAdes outputs two assemblies: before repeat resolution and after repeat resolution—we report both. Some of the assemblers emitted both contigs and scaffolds. We considered contigs only but note that all scaffolds had a greater number of misassembly errors. We emphasize that our purpose here is not to compare the various assemblers, but instead it is to demonstrate that all assemblers produce misassembly errors, which are in need of consideration and correction. We used QUAST (Gurevich ) in default mode to evaluate the assemblies. Hence, our experiments use the published reference genomes as being ground truth and use the published references to identify misassembly errors in the other assemblies through QUAST. We note that this is imperfect since the reference genomes are likely not error-free. QUAST defines misassembly error as being extensive or local. An extensively misassembled contig is defined as one that satisfies one following conditions: (i) the left flanking sequence aligns over 1 kb away from the right flanking sequence on the reference; (ii) flanking sequences overlap on more than 1 kb and (iii) flanking sequences align to different strands or different chromosomes, whereas a local misassembled contig is one that satisfies the following conditions: (i) two or more distinct alignments cover the breakpoint; (ii) the gap between left and right flanking sequences is less than 1 kb and the left and right flanking sequences both are on the same strand of the same chromosome of the reference genome. We made a minor alteration to QUAST to output which contigs contain local misassembly errors. A contig can contain both extensive and local misassembly errors. Any correctly assembled contig is one that does not contain either type of error.

3.1.1 Detection of misassembly errors in F.tularensis

Table 1 gives the assembly statistics corresponding to this experiment. Comparable assembly results on this data were reported by Ilie , though in some cases we used more recent software releases (e.g. for SPAdes). Note that the number of locally misassembled contigs and the number of extensively misassembled contigs is not disjoint. A contig can be locally and extensively misassembled. Thus, Table 1 gives the number of contigs having at least one extensive misassembly error and the number of contigs having at least one local misassembly error.

Table 1.

Assembler	No. contigs (no. unaligned)	N50	Largest (bp)	Total (bp)	MA	local MA	MA (bp)	GF (%)
Velvet	358 (3 + 35 part)	7377	39 381	1 762 202	11	36	84 965	92.09
SOAPdenovo	307 (3 + 31 part)	8767	39 989	2 018 158	10	35	96 258	92.05
ABySS	96 (1 part)	27 975	88 275	1 875 628	64	32	1 330 684	95.87
SPAdes (−rr)	102 (2 + 11 part)	25 148	87 449	1 788 634	11	30	258 309	92.81
SPAdes (+rr)	100 (2 + 17 part)	26 876	87 891	1 797 197	23	31	497 356	93.75
IDBA	109 (1 + 10 part)	23 223	87 437	1 768 958	10	31	221 087	92.64

All statistics are based on contigs no shorter than 500 bp. N50 is defined as the length for which the collection of all contigs of that length or longer contains at least half of the sum of the lengths of all contigs and for which the collection of all contigs of that length or shorter also contains at least half of the sum of the lengths of all contigs. The no. unaligned is the number of contigs that did not align to the reference genome, or they were only partially aligned (part). Total is sum of the length of all contigs. MA is the number of (extensively) misassembled contigs. Local MA is the total number of contigs that had local misassemblies. MA (bp) is the total length of the MA contigs. GF is the genome fraction percentage, which is the fraction of genome bases that are covered by the assembly. −rr and ++rr denotes before and after repeat resolution, respectively.

The performance comparison between major assembly tools on the F.tularensis dataset, which has a genome length of 1 892 775 bp and 6 907 220 number of 101 bp reads, using QUAST in default mode (Gurevich ) All statistics are based on contigs no shorter than 500 bp. N50 is defined as the length for which the collection of all contigs of that length or longer contains at least half of the sum of the lengths of all contigs and for which the collection of all contigs of that length or shorter also contains at least half of the sum of the lengths of all contigs. The no. unaligned is the number of contigs that did not align to the reference genome, or they were only partially aligned (part). Total is sum of the length of all contigs. MA is the number of (extensively) misassembled contigs. Local MA is the total number of contigs that had local misassemblies. MA (bp) is the total length of the MA contigs. GF is the genome fraction percentage, which is the fraction of genome bases that are covered by the assembly. −rr and ++rr denotes before and after repeat resolution, respectively. The performance comparison of our method on the F.tularensis dataset The TPR in this context is a contig that is misassembled and is predicted to be so. The FPR is a correctly assembled contig that was predicted to be misassembled. The TPR and FPR are given as percentages with the raw values given in brackets. Bold values emphasize the benefit of using both data sources. The performance comparison between major assembly tools on Loblolly pine genome dataset (62 647 324 bp, 31.3 million reads, 100 bp) using QUAST in default mode (Gurevich ) Table 2 shows the results for (i) misSEQuel with paired-end data only; (ii) misSEQuel with optical mapping data only and (iii) misSEQuel with both optical mapping and paired-end data to demonstrate the benefit of combining both types of data. As demonstrated by these results, using short paired-end data alone produced a high false-positive rate (FPR) due to ambiguous read mapping in locations that contain repetitive regions. This is an inherent shortcoming of short paired-end data and demonstrates that to decrease the FPR, another source of information must be used in combination. Optical mapping data have a much lower FPR and when used in combination with paired-end data, produces optimal results. The lowest FPR was witnessed when both optical mapping and paired-end data were used. In some cases, the reduction in the FPR was dramatic: from 87% (ABySS, paired-end data) to 13% (ABySS, paired-end and optical mapping data). The true-positive rate (TPR) of locally misassembled contigs was between 77% and 100% when both paired-end and optical mapping data were used. Lastly, TPR of extensively misassembled contigs was between 55% and 100% when both paired-end and optical mapping data were used.

Table 2.

The performance comparison of our method on the F.tularensis dataset

Correction method	Assembler	MA TPR	local MA TPR	FPR
misSEQuel (paired-end data only)	Velvet	100% (11/11)	100% (36/36)	58% (180/312)
	SOAPdenovo	100% (10/10)	100% (35/35)	63% (165/263)
	ABySS	100% (64/64)	100% (32/32)	87% (20/23)
	SPAdes (−rr)	100% (11/11)	100% (30/30)	83% (52/63)
	SPAdes (++rr)	100% (23/23)	100% (31/31)	86% (49/57)
	IDBA	100% (10/10)	100% (31/31)	38% (57/149)
misSEQuel (optical mapping data only)	Velvet	55% (6/11)	69% (25/36)	24% (76/312)
	SOAPdenovo	80% (8/10)	63% (22/35)	29% (77/263)
	ABySS	69% (44/64)	88% (28/32)	13% (3/23)
	SPAdes (−rr)	91% (10/11)	87% (26/30)	21% (13/63)
	SPAdes (++rr)	87% (20/23)	81% (25/31)	16% (9/57)
	IDBA	90% (9/10)	77% (24/31)	10% (15/149)
misSEQuel (paired-end and optical mapping data)	Velvet	55% (6/11)	100% (36/36)	22% (68/312)
	SOAPdenovo	80% (8/10)	84% (21/35)	20% (53/263)
	ABySS	69% (44/64)	88% (28/32)	13% (3/23)
	SPAdes (−rr)	91% (10/11)	87% (26/30)	19% (12/63)
	SPAdes (++rr)	97% (20/23)	81% (25/31)	16% (9/57)
	IDBA	90% (9/10)	77% (24/31)	9% (14/149)
REAPR	Velvet	55% (6/11)	11% (4/36)	< 1% (2/312)
	SOAPdenovo	20% (2/10)	14% (5/35)	2% (6/263)
	ABySS	13% (8/64)	13% (4/32)	4% (1/23)
	SPAdes (−rr)	27% (3/11)	27% (8/30)	5% (3/63)
	SPAdes (++rr)	0% (0/23)	19% (6/31)	11% (6/57)
	IDBA	40% (4/10)	13% (4/31)	4% (6/149)
Pilon	Velvet	27% (3/11)	3% (1/36)	< 1% (3/312)
	SOAPdenovo	10% (1/10)	9% (3/35)	2% (5/263)
	ABySS	3% (2/64)	6% (2/32)	4% (3/23)
	SPAdes (−rr)	0% (0/11)	3% (1/30)	5% (5/63)
	SPAdes (++rr)	0% (0/23)	10% (3/31)	12% (7/57)
	IDBA	0% (0/10)	10% (3/31)	4% (5/149)

The TPR in this context is a contig that is misassembled and is predicted to be so. The FPR is a correctly assembled contig that was predicted to be misassembled. The TPR and FPR are given as percentages with the raw values given in brackets. Bold values emphasize the benefit of using both data sources.

In our experiments, we iterate through combinations of three enzymes from the REBASE enzyme database (Roberts ) and use the set of enzymes that performed best. Our results demonstrate that with a good enzyme choice over half of all extensively misassembled contigs and over 75% of locally misassembled contigs can be identified with only a 9–22% false discovery rate.

3.1.2 Detection of misassembly errors in loblolly pine

Table 3 gives the assembly statistics corresponding to this experiment. The results for the loblolly pine are listed in Table 4. Both Velvet and SOAPdenovo produced zero misassembled contigs on this dataset, so we do not include them in Table 4. misSEQuel correctly identifies between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs. The FPR was between 0.6% and 43%. Although REAPR has a lower FPR (between 3% and 11%), it is only capable of identifying a small number of extensively misassembled contigs (between 2% and 14%) and a small number of locally misassembled contigs (between 2% and 27%). Similar to the results of F.tularensis, Pilon had a lower FPR but also lower TPRs than REAPR. This is unsurprising since ‘it is optimized to use both fragment (or small) and long (or large) insert libraries’ and was created for microbial genomes (Walker ).

Table 3.

The performance comparison between major assembly tools on Loblolly pine genome dataset (62 647 324 bp, 31.3 million reads, 100 bp) using QUAST in default mode (Gurevich )

Assembler	No. contigs (no. unaligned)	N50	Largest (bp)	Total (bp)	MA	local MA	MA (bp)	GF (%)
Velvet	13 327 (0)	1740	10 823	51 851 131	0	0	0	62.21
SOAPdenovo	16 126 (0 + 1 part)	7950	63 004	57 205 817	0	0	0	90.01
ABySS	4586 (16 + 89 part)	37 089	201 382	63 349 408	127	715	1 391 565	98.17
SPAdes (−rr)	20 671 (4 + 10 part)	4809	44 993	45 079 764	7	11	65 079	81.30
SPAdes (+rr)	8607 (7 + 102 part)	16 957	108 442	59 730 939	299	57	3 734 609	94.57
IDBA	22 409 (3 + 31 part)	3990	40 213	49 765 854	61	200	292 769	79.03

Table 4.

The performance comparison of our method on the loblolly pine dataset

Correction method	Assembler	MA TPR	local MA TPR	FPR
misSEQuel	ABySS	31% (40/127)	57% (405/715)	43% (1604/3754)
	SPAdes (−rr)	100% (7/7)	73% (8/11)	<1% (135/20 653)
	SPAdes (+rr)	67% (199/299)	67% (38/57)	38% (3117/8254)
	IDBA	52% (32/61)	73% (145/200)	19% (4258/22 150)
REAPR	ABySS	7% (9/127)	2% (12/715)	3% (112/3754)
	SPAdes (−rr)	14% (1/7)	27% (3/11)	6% (1323/20 653)
	SPAdes (+rr)	7% (21/299)	5% (3/57)	5% (424/8254)
	IDBA	2% (1/61)	6% (12/200)	11% (2354/22 150)
Pilon	ABySS	7% (8/127)	2% (11/715)	2% (70/3754)
	SPAdes (−rr)	14% (1/7)	18% (2/11)	4% (923/20 653)
	SPAdes (+rr)	5% (16/299)	5% (3/57)	5% (388/8254)
	IDBA	2% (1/61)	5% (12/200)	8% (1823/22 150)

Again, a true positive in this context is a contig that is misassembled and is predicted to be so. A false positive is a correctly assembled contig that was predicted to be misassembled. Bold values highlight MISSEQUEL results.

The performance comparison of our method on the loblolly pine dataset Again, a true positive in this context is a contig that is misassembled and is predicted to be so. A false positive is a correctly assembled contig that was predicted to be misassembled. Bold values highlight MISSEQUEL results. Again, the restriction enzymes used in our experiments were chosen to be optimal by considering the set of all possible enzymes in the aforementioned database. Nonetheless, we note that if the enzyme combination was chosen at random, then the expected FPR and TPR would decrease by a small fraction for majority of the assemblies considered. The Supplementary Material shows prototypical ROC curves and heat-maps illustrating the density of enzyme combinations at various detection rates.

3.2 Datasets using real optical mapping data

We evaluated the performance of misSEQuel on rice and budgerigar. These genomes were chosen because they have available sequence and optical mapping data, are diverse in size and have undergone a significant level of verification. Rice and budgerigar have genome sizes of 430 Mb and 1.58 Gbp, respectively (Kawahara ; Tiersch and Wachtel 1991). The size of budgerigar is only predicted (Tiersch and Wachtel 1991). The validated genomic regions will allow us to use QUAST to determine FPR and TPR, as in Subsection 3.1.

3.2.1 Performance on rice genome

The sequence dataset consists of approximately 134 million 76-bp paired-end reads for rice from the japonica cultivar Nipponbare, generated by Illumina, Inc. on the Genome Analyzer (GA) IIx platform (Kawahara ). These reads were obtained from the NCBI Short Read Archive (accession SRX032913). The optical map for this same cultivar of rice was constructed by Zhou using SwaI as the restriction enzyme. This optical map was assembled from single molecule restriction maps into 14 optical map contigs, labeled as 12 chromosomes, with chromosome labels 6 and 11 both containing two optical map contigs. Both the sequence and optical mapping data were generated as part of a larger project that produced a ‘revised, error-corrected, and validated assembly of the Nipponbare cultivar of rice’ (Kawahara ). This reference genome, termed Os-Nipponbare-Reference-IRGSP-1.0 is publicly available on the Rice Annotation Project (http://rice.plantbiology.msu.edu/annotation_pseudo_current.shtml) The paired-end data were assembled using SOAPdenovo using default parameters. This assembly consists of 11 440 contigs larger than 500 bp, cove 81.3% of the reference genome (22 317 126 in size, 43.7% GC content). It has an N50 statistic of 1 680 499 extensively misassembled contigs and 3 locally misassembled contigs. We ran misSEQuel with default parameters and the SOAPdenovo assembly, optical map and paired-end data as input. Similarly, we ran REAPR with default parameters, and the SOAPdenovo assembly, and paired-end data as input. Out of the 499 extensively misassembled contigs, misSEQuel identified 374 of them (75%), whereas REAPR identified 30 (6%) and Pilon identified 25 (5%). Out of the three locally misassembled contigs, misSEQuel identified all three, but REAPR and Pilon identified none. Lastly, misSEQuel deemed that 821 of the correctly assembled contigs were misassembled (<1% FPR), whereas REAPR and Pilon deemed that 800 and 522 were deemed misassembled (<1% FPR), respectively. We further note that both misSEQuel and REAPR agreed on 472 of these correctly assembled contigs; i.e. both REAPR and misSEQuel predicted that 472 correctly assembled contigs are misassembled. This could suggest that a broadened definition of misassembly by QUAST would also deem these contigs to be misassembled.

3.2.2 Misassembly errors in the draft genome of budgerigar

A concerted effort has been in understanding the biodiversity of many bird species (Jarvis )—including the budgerigar genome—and thus, a significant amount of data have been generated for budgerigar. Pacific Biosciences (PacBio) data, short read Illumina data, 454 and optical mapping data have been generated and used for the assembly of this genome. The sequence and optical map data for the budgerigar genome were generated for the Assemblathon 2 project of Bradnam . Budgerigar has a GC content of approximately 43.8% (Jarvis ) and contains GC-rich ( 75%) regions (Howard ). Sequence data consist of a combination of Roche 454, Illumina and PacBio reads, providing 16×, 285× and 10× coverage, respectively, of the genome. All sequence reads are available at the NCBI Short Read Archive (accession ERP002324). For our analysis, we consider the assembly generated using CABOG (Miller ), which was completed by the CBCB team (Koren and Phillippy) as part of Assemblathon 2 (Bradnam ). The optical mapping data were created by Zhou, Goldstein, Place, Schwartz and Bechner using the SwaI restriction enzyme and consists of 92 separate pieces. Ganapathy released three hybrid assemblies; namely, (i) budgerigar 454-Illumina hybrid v6.3 using the CABOG (Miller ) assembler; (ii) budgerigar PacBio corrected reads (PBcR) hybrid using the CABOG assembler and (iii) budgerigar Illumina-454 hybrid using the SOAPdenovo (version 2.04) (Li ) assembler. We downloaded the PBcR assembly (ii), the Illumina-454 hybrid assembly (iii), in addition to the optical mapping data and pair-end data. The Illumina-454 assembly has 212 203 contigs (54 829 contigs 500 bp), N50 of 51 034 and largest contig of 500 974 (Howard ). The PBcR assembly has 77 556 contigs, average, N50 of 102 885 and largest contig of 849 044 (Howard ). We ran QUAST to evaluate the Illumina-454 hybrid assembly (all contigs 500 bp), using the PBcR assembly as the reference genome. It reported 13 996 extensively misassembled contigs and 2937 locally misassembled contigs. Thus, there are 39 394 contigs that contain no misassembly error. We ran misSEQuel on the Illumina-454 hybrid assembly on this using the paired-end and optical mapping. misSEQuel correctly identified 10 777 (out of 13 996) extensively misassembled contigs (77% MA TPR) and 2350 (out of 2937) locally misassembled contigs (80% local MA TPR); however, it incorrectly identified 4023 (out of the 39 394) as being misassembled (10%).

3.3 Practical considerations: memory and time

We evaluated the memory and time requirements of misSEQuel. Since misSEQuel is a multi-threaded application, its wall-clock time depends on the computing resources available to the user. misSEQuel required a maximum of 8 threads, 16 GB and 1.5 h on all assemblies of F.tularensis and a maximum of 20 GB and 2.5 h to complete on all assemblies of loblolly pine. Most genome assemblers require an incomparably greater amount of time and memory and thus, from a practical perspective, the requirements of misSEQuel are not a significant increase. The difference in the resource requirements of misSEQuel in comparison to modern assemblers is since it operates contig-wise rather than genome-wise and therefore, only deals with a significantly smaller portion of the data at a single time. We conclude by mentioning that misSEQuel is not optimized for memory and time and both could be further reduced but reimplementing the red–black positional de Bruijn graph using memory- and time-succinct data structures.

4 Discussion and conclusions

This article describes the first non-proprietary computational method for identifying misassembly errors using short read sequence data and optical mapping data. Our results demonstrate (i) a substantial number of misassembly errors can be identified in draft genomes of prokaryote and eukaryote species; (ii) our method works on genomes that vary by GC-content and size; (iii) it can be used in combination with any assembler and thus, making it a viable post-processing step for any assembly and (iv) addresses the need for methods to analyze optical mapping data. One of our main contributions is the demonstration that optical mapping can have significant benefit for misassembly error detection. A high FPR will result using paired-end data alone because of ambiguous read mapping. Furthermore, superior results were always witnessed using paired-end data and optical mapping data. In some cases, the improvement of using both datasets over a single dataset was substantial. For example, the Velvet assembly of F.tularensis had 312 correctly assembled contigs; 76 of these 312 were deemed to be misassembled when misSEQuel was ran with optical mapping data alone, whereas this improved to 68 out of 312 when both paired-end and optical mapping data were used. Similarly, when misSEQuel was ran on this same assembly, 69% (25/36) of locally misassembled contigs were identified, whereas this improved to 100% (36/36) when both datasets were used. Lastly, we point out two areas that warrant future work. One area that merits investigation is to develop methods that will distinguish between structural variation heterozygosity and paralogous variation and misassembly errors. misSEQuel is not able to detect the difference between structural variation and misassembly errors—and in fact, the high FPR might be due to this type of variation—however, methods that do so could be very valuable for finishing draft genomes. Lastly, we conclude by suggesting that efficient algorithmic selection of enzymes that will yield such informative optical maps in a de novo scenario is an area for interesting and important future work.

56 in total

1. De novo repeat classification and fragment assembly.

Authors: Pavel A Pevzner; Paul A Pevzner; Haixu Tang; Glenn Tesler
Journal: Genome Res Date: 2004-09 Impact factor: 9.043

2. Beware of mis-assembled genomes.

Authors: Steven L Salzberg; James A Yorke
Journal: Bioinformatics Date: 2005-12-15 Impact factor: 6.937

3. Short read fragment assembly of bacterial genomes.

Authors: Mark J Chaisson; Pavel A Pevzner
Journal: Genome Res Date: 2007-12-14 Impact factor: 9.043

4. ALLPATHS: de novo assembly of whole-genome shotgun microreads.

Authors: Jonathan Butler; Iain MacCallum; Michael Kleber; Ilya A Shlyakhter; Matthew K Belmonte; Eric S Lander; Chad Nusbaum; David B Jaffe
Journal: Genome Res Date: 2008-03-13 Impact factor: 9.043

5. On the evolution of genome size of birds.

Authors: T R Tiersch; S S Wachtel
Journal: J Hered Date: 1991 Sep-Oct Impact factor: 2.645

6. A single molecule scaffold for the maize genome.

Authors: Shiguo Zhou; Fusheng Wei; John Nguyen; Mike Bechner; Konstantinos Potamousis; Steve Goldstein; Louise Pape; Michael R Mehan; Chris Churas; Shiran Pasternak; Dan K Forrest; Roger Wise; Doreen Ware; Rod A Wing; Michael S Waterman; Miron Livny; David C Schwartz
Journal: PLoS Genet Date: 2009-11-20 Impact factor: 5.917

7. Aggressive assembly of pyrosequencing reads with mates.

Authors: Jason R Miller; Arthur L Delcher; Sergey Koren; Eli Venter; Brian P Walenz; Anushka Brownley; Justin Johnson; Kelvin Li; Clark Mobarry; Granger Sutton
Journal: Bioinformatics Date: 2008-10-24 Impact factor: 6.937

8. Validation of rice genome sequence by optical mapping.

Authors: Shiguo Zhou; Michael C Bechner; Michael Place; Chris P Churas; Louise Pape; Sally A Leong; Rod Runnheim; Dan K Forrest; Steve Goldstein; Miron Livny; David C Schwartz
Journal: BMC Genomics Date: 2007-08-15 Impact factor: 3.969

9. Scaffolding and validation of bacterial genome assemblies using optical restriction maps.

Authors: Niranjan Nagarajan; Timothy D Read; Mihai Pop
Journal: Bioinformatics Date: 2008-03-20 Impact factor: 6.937

10. REAPR: a universal tool for genome assembly evaluation.

Authors: Martin Hunt; Taisei Kikuchi; Mandy Sanders; Chris Newbold; Matthew Berriman; Thomas D Otto
Journal: Genome Biol Date: 2013-05-27 Impact factor: 13.583

16 in total

1. Error correcting optical mapping data.

Authors: Kingshuk Mukherjee; Darshan Washimkar; Martin D Muggli; Leena Salmela; Christina Boucher
Journal: Gigascience Date: 2018-06-01 Impact factor: 6.524

Review 2. Genome sequence assembly algorithms and misassembly identification methods.

Authors: Yue Meng; Yu Lei; Jianlong Gao; Yuxuan Liu; Enze Ma; Yunhong Ding; Yixin Bian; Hongquan Zu; Yucui Dong; Xiao Zhu
Journal: Mol Biol Rep Date: 2022-09-23 Impact factor: 2.742

3. Complete genome sequence of the biocontrol agent Serratia marcescens strain N4-5 uncovers an assembly artefact.

Authors: Larissa Carvalho Ferreira; Jude E Maul; Marcus Vinicius Canário Viana; Thiago Jesus de Sousa; Vasco Ariston de Carvalho Azevedo; Daniel P Roberts; Jorge Teodoro de Souza
Journal: Braz J Microbiol Date: 2020-09-23 Impact factor: 2.476

Review 4. Analysis of single nucleic acid molecules in micro- and nano-fluidics.

Authors: Sarah M Friedrich; Helena C Zec; Tza-Huei Wang
Journal: Lab Chip Date: 2016-03-07 Impact factor: 6.799

5. Succinct colored de Bruijn graphs.

Authors: Martin D Muggli; Alexander Bowe; Noelle R Noyes; Paul S Morley; Keith E Belk; Robert Raymond; Travis Gagie; Simon J Puglisi; Christina Boucher
Journal: Bioinformatics Date: 2017-10-15 Impact factor: 6.937

Review 6. Next Generation Sequencing of Actinobacteria for the Discovery of Novel Natural Products.

Authors: Juan Pablo Gomez-Escribano; Silke Alt; Mervyn J Bibb
Journal: Mar Drugs Date: 2016-04-13 Impact factor: 5.118

7. Improvement of the banana "Musa acuminata" reference sequence using NGS data and semi-automated bioinformatics methods.

Authors: Guillaume Martin; Franc-Christophe Baurens; Gaëtan Droc; Mathieu Rouard; Alberto Cenci; Andrzej Kilian; Alex Hastie; Jaroslav Doležel; Jean-Marc Aury; Adriana Alberti; Françoise Carreel; Angélique D'Hont
Journal: BMC Genomics Date: 2016-03-16 Impact factor: 3.969

8. Revealing misassembled segments in the bovine reference genome by high resolution linkage disequilibrium scan.

Authors: Adam T H Utsunomiya; Daniel J A Santos; Solomon A Boison; Yuri T Utsunomiya; Marco Milanesi; Derek M Bickhart; Paolo Ajmone-Marsan; Johann Sölkner; José F Garcia; Ricardo da Fonseca; Marcos V G B da Silva
Journal: BMC Genomics Date: 2016-09-05 Impact factor: 3.969

9. misMM: An Integrated Pipeline for Misassembly Detection Using Genotyping-by-Sequencing and Its Validation with BAC End Library Sequences and Gene Synteny.

Authors: Young-Joon Ko; Jung Sun Kim; Sangsoo Kim
Journal: Genomics Inform Date: 2017-12-29

10. Disentangling the Causes for Faster-X Evolution in Aphids.

Authors: Julie Jaquiéry; Jean Peccoud; Tiphaine Ouisse; Fabrice Legeai; Nathalie Prunier-Leterme; Anais Gouin; Pierre Nouhaud; Jennifer A Brisson; Ryan Bickel; Swapna Purandare; Julie Poulain; Christophe Battail; Claire Lemaitre; Lucie Mieuzet; Gael Le Trionnaire; Jean-Christophe Simon; Claude Rispe
Journal: Genome Biol Evol Date: 2018-02-01 Impact factor: 3.416