A graph-based approach to diploid genome assembly.

Shilpa Garg^1,2,3, Mikko Rautiainen^1,2,3, Adam M Novak^4, Erik Garrison^5,6, Richard Durbin^5,6, Tobias Marschall^1,2.

Abstract

Motivation: Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community.
Results: We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants.
Availability and implementation: https://github.com/whatshap/whatshap.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2018 PMID: 29949989 PMCID: PMC6022571 DOI: 10.1093/bioinformatics/bty279

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

There are two homologous copies of every chromosome, one from each parent, in human and other diploid eukaryotic genomes. Determining the two genome sequences of those organisms per chromosome is important in order to correctly understand allele-specific expression and compound heterozygosity, and in order to carry out many analyses in the genetics of common diseases and in population genetics (Glusman ; Tewhey ). Furthermore, separate determination of the two haplotype sequences can in principle avoid genotyping errors in complex regions of the genome caused by simplistic models that treat variants at nearby sites as being independent. The process of assembling two distinct genome sequences from sequencing reads in a haplotype-aware manner is known as diploid or haplotype-aware genome assembly and the assembled sequences are known as ‘haplotigs’. However, next generation sequencing (NGS) reads are generally of short length and contain errors; therefore, solving the diploid genome assembly problem is fundamentally challenging. Additional challenges inherent in the genome assembly problem include dealing with short and long genomic repeats, handling general rearrangements present in the genome, and scaling efficiently with input size, genome size and hardware availability. Over the last decade, the development of various NGS technologies has impacted the assembly problem. In theory, the problem of de novo assembly—computing the consensus of two or more sequences—is NP-hard, when the problem is modeled either with string graphs or with de Bruijn graphs (Medvedev ). In the past decades, a multitude of heuristic approaches to haploid de novo assembly have been proposed (Idury and Waterman, 1995; Myers, 1995, 2005; Nagarajan and Pop, 2009, 2013; Pevzner ; Sović ). 
However, even with Sanger (reads of the order of 800–1000 base pairs) and Illumina sequencing, which deliver short reads with low error rates, assembly of heterozygous diploid genomes has been a difficult problem (Levy ; Vinson ). In practice, there are several short-read assemblers based on Illumina data for heterozygous genomes (Bankevich ; Li, 2015b; Kajitani ; Pryszcz and Gabaldón, 2016; Simpson and Durbin, 2012). The assemblies that they produce are accurate, but contain gaps and are composed of relatively short contigs and scaffolds. Third generation sequencing technologies such as methods available from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) deliver much longer reads, but with high error rates. There are now several long-read assemblers (Berlin ; Chin ; Hunt ; Koren ; Lin ; Vaser ; Xiao ) that use these long-read data for de novo assembly. The assemblies that are delivered from these assemblers are more contiguous, with longer contigs and scaffolds. Finally, there are hybrid assemblers that take advantage of long-read data (with its high error rate) and short-read data (with its low error rate) (Antipov ; Bashir ; Zimin ) and attempt to combine the best aspects of both. These hybrid assemblers have the potential to deliver highly accurate, repeat-resolved assemblies. However, across the short, long and hybrid categories, most assemblers require collapsing the two genome sequences of a diploid sample into a single haploid ‘consensus’ sequence (or primary contig). The consensus sequence is obtained by merging the distinct alleles at regions of heterozygosity into a single allele, and therefore losing a lot of information. The resulting haploid de novo assembly does not represent the true characteristics of the diploid input genome. 
Current approaches to reconstruct diploid genomes usually rely on collapsing assembly graphs to haploid contigs in intermediate steps (contig-based assembly) (Chin et al.; Mostovoy et al.; Pendleton et al.; Seo et al.), or on using a reference genome to partition the reads by haplotype (reference-guided assembly) (Chaisson et al.; Glusman et al.; Martin et al.). In both types of approaches, the reads are first aligned (either to the reference genome or to the contigs). Second, variants such as SNVs are called based on the aligned reads. Finally, the detected variants are phased using long reads from either the same or a different sequencing technology. Because these methods represent the genome with haploid sequences in some processing steps, we refer to them as linear approaches. For both reference-guided and contig-based assembly, the third step, solving the phasing problem, has been formulated as the minimum error correction (MEC) optimization problem (Cilibrasi et al.; Lippert et al.). The reviews by Rhee et al. and by Klau and Marschall (2017) provide introductions to this formulation.

Reference-guided assembly has several disadvantages. For example, the reads are initially aligned to the reference genome, so the process is subject to reference bias. This approach can also fail to detect sequences or large structural variants (SVs) that are unique to the genome being assembled. The set of sequences/contigs produced by contig-based assembly is not ideal either, for several reasons. First, the contigs produced by haploid assemblers ignore the heterozygous variants in complex regions, opting instead to break contiguity to express even moderate complexity. Second, the contigs do not capture end-to-end information in the genome; the ordering or relationships between contigs are critical in order to generate end-to-end chromosomal-length assemblies.

One example of a newer diploid assembly method is that of Weisenfeld et al., where 10x Genomics linked-read data is used to determine the actual diploid genome sequence. Their approach is based on de Bruijn graphs and applies a series of graph simplifications, where simple bubbles are detected and phased by using (short) reads that stem from the same (long) input molecule, which is determined through barcoding. There is also a recent study by Chin et al., who follow a linear phasing approach to generate diploid assemblies (haplotigs) for diploid genomes by using PacBio reads.

1.1 Contributions

We propose a graph-based approach for generating haplotype-aware assemblies of single individuals. Our contribution is two-fold. First, we propose a hybrid approach, integrating accurate Illumina and long PacBio reads in order to generate diploid assemblies. The Illumina reads are used to generate an assembly graph that serves as a backbone for subsequent PacBio-based steps. Second, we generalize the diploid assembly problem to encompass constructing the diploid assembly directly from the underlying assembly graph and thereby avoid ‘flattening’ the assemblies to linear sequences at any time. The two haplotype sequences can be seen as two paths over the regions of heterozygosity in the assembly graph. Phasing using an assembly graph has several advantages over linear approaches. In particular, it allows us to represent and phase heterozygous SVs, which are represented by bubbles in the assembly graph. A bubble is defined as a set of disjoint paths that share the same start and end nodes. Figure 1 illustrates how such bubbles can represent both small variants (which we define as SNVs and indels up to 50 base pairs in length) and larger SVs. Handling small variants and SVs in a unified way facilitates phasing larger blocks because haplotype reconstruction is not disrupted by SVs. Figure 2 illustrates this conceptual advantage. The figure shows four SNVs separated by two large SVs, and six reads spanning these variants. Out of those reads, the two reads r3 and r4 span the two SVs, but do not cover any of the two SNVs. Conversely, the reads which cover the SNVs on either side do not cover the SVs. In this case, Falcon Unzip generates a primary contig that spans from one end to the other, but generates incomplete and fragmented haplotigs (phased primary contigs in the language of Falcon Unzip) covering only the SNVs. In contrast, our graph-based approach attempts to phase across all types of variation, including SVs.

Fig. 1.

Based on reads (middle) from the two sequences (top), the bubbles in the graph (bottom) show three different heterozygous variants; the first one is an SNV, the second one is an SV, and the third one is an indel

Fig. 2.

Input: an assembly graph (top) (consisting of four SNVs and two SVs) and the PacBio reads (gray). Output: the phased reads (colored in blue and red) and haplotigs (bottom) using Falcon Unzip and our approach. Our graph-based approach also phases the central region. Contrarily, Falcon Unzip does not phase it, and so the region does not contribute to the total haplotig size

We demonstrate the feasibility of our approach by performing a haplotype-aware de novo assembly of a whole pseudo-diploid yeast (SK1 + Y12) genome. We show that we generate more accurate and more contiguous phased diploid genomes compared to Falcon Unzip. Through empirical testing with different input coverage levels, we demonstrate that we require only 50× short-read coverage and as little as 10× long-read coverage data to generate diploid assemblies. This illustrates that our hybrid strategy is a cost-effective way of generating haplotype-resolved assemblies. Finally, we show that we successfully detect and phase large SVs.

2 Diploid assembly pipeline

Our assembly workflow uses short read (e.g. Illumina) and long read (e.g. PacBio) data in combination, as illustrated in Figure 3. We describe the details of this process below.

Fig. 3.

Overview of the diploid assembly pipeline

2.1 Sequence graph

Our first step is to construct a sequence graph using short read data with a low error rate, as provided by the Illumina platform.

Definition 1 (Sequence graph). We define a sequence graph $G = (N, E)$ as a bidirected graph, consisting of a set of nodes $N$ and a set of edges $E$. The nodes $n_i \in N$ are sequences over an alphabet $\Sigma = \{A, C, G, T\}$. For each node $n_i$, its reverse-complement is denoted by $\bar{n}_i$. An edge $(n_i, n_j) \in E$ connects the node $n_i$ to the node $n_j$. Nodes may be traversed in either the forward or reverse direction, with the sequence being reverse-complemented in the reverse direction.

In words, edges represent adjacencies between the sequences of the nodes they connect. Thus, the graph implicitly encodes longer sequences as the concatenated sequences of the nodes along walks through the graph. To illustrate this, we consider an example sequence graph G in Figure 4. It consists of a node set $N$ and an edge set $E$.
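To make Definition 1 concrete, the following is a minimal Python sketch of a bidirected sequence graph. It is not the VG or SPAdes data structure: the class name `SequenceGraph`, the `(name, is_forward)` encoding of oriented nodes, and the toy node sequences in `build_toy_graph` are all assumptions made for illustration.

```python
# Illustrative sketch only; names and toy sequences are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]


class SequenceGraph:
    """Nodes carry sequences; edges join oriented nodes (name, is_forward)."""

    def __init__(self):
        self.sequences = {}
        self.edges = set()

    def add_node(self, name, sequence):
        self.sequences[name] = sequence

    def add_edge(self, src, dst):
        # Store the adjacency and its mirror image, so that the same edge
        # can be followed when the graph is traversed in reverse.
        self.edges.add((src, dst))
        self.edges.add(((dst[0], not dst[1]), (src[0], not src[1])))

    def spell(self, walk):
        """Concatenate node sequences along a walk of oriented nodes."""
        for a, b in zip(walk, walk[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge between {a} and {b}")
        parts = []
        for name, is_forward in walk:
            seq = self.sequences[name]
            parts.append(seq if is_forward else reverse_complement(seq))
        return "".join(parts)


def build_toy_graph():
    """A single bubble: n1 -> (n2 | n3) -> n4 (hypothetical sequences)."""
    g = SequenceGraph()
    for name, seq in [("n1", "ACG"), ("n2", "T"), ("n3", "G"), ("n4", "CA")]:
        g.add_node(name, seq)
    for src, dst in [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")]:
        g.add_edge((src, True), (dst, True))
    return g
```

Spelling a walk in the reverse direction yields the reverse complement of the forward walk, which is the property the definition relies on.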

Fig. 4.

For a subgraph of G, the example shows two bubbles l1 and l2, and their corresponding alleles. Reads traverse these bubbles

To generate the sequence graph G, we first employ SPAdes (Bankevich et al.), which constructs and simplifies a de Bruijn graph, and we subsequently remove the overlaps between the nodes in the resulting graph in a process we call bluntification, explained in the Supplementary Material.

2.2 Bubble detection in sequence graphs

To account for heterozygosity in a diploid genome, we perform bubble detection. The notion of bubble we use is closely based on the ultrabubble concept as defined by Paten et al. Briefly, bubbles have the following properties:

Two-node-connectivity. A bubble is bounded by fixed start and end nodes. Removing both the start and end nodes disconnects the bubble from the rest of the graph. Note that a bubble can be viewed in either orientation. If the graph is traversed in one direction, and a bubble is encountered that starts at a node $n_i$ and ends at a node $n_j$, then that bubble can also be described as the bubble with start node $\bar{n}_j$ and end node $\bar{n}_i$ (where $\bar{n}$ denotes the reverse complement of node $n$), as it would be encountered when traversing the graph in the opposite direction.

Directed acyclicity. A bubble is directed and acyclic.

Directionality. All paths through the bubble flow from start to end.

Minimality. No vertex in the bubble other than the start node $n_i$ (with proper orientation) forms a pair with the end node $n_j$ (with proper orientation) that satisfies the above properties. Similarly, no vertex in the bubble other than $n_j$ forms such a pair with $n_i$.

A bubble can represent a potential sequencing error or genetic variation within a set of homologous molecules. We represent bubbles as collections of alternative paths.

Definition 2 (Path). We define a path $a$ as a linear ordering of nodes, $a = n_1 n_2 \dots n_k$.

A bubble is a collection of paths with the same start and end node and can be defined as follows:

Definition 3 (Bubble). Formally, a bubble $l$ is represented as a collection of allele paths $l = \{a_0, a_1, \dots\}$, where each $a_i$ is a path from the start node of the bubble to its end node. For example, Figure 4 shows a set of two bubbles, $L = \{l_1, l_2\}$, each with its own set of allele paths.
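Once a bubble's start and end nodes are known, its allele paths can be listed by a simple traversal. The sketch below is an illustration under stated assumptions, not the snarl decomposition used in the pipeline: it assumes the bubble is already given as an acyclic successor map, and the function name `allele_paths` and the node labels are hypothetical.

```python
def allele_paths(successors, start, end):
    """Enumerate all start-to-end paths inside an acyclic bubble.

    `successors` maps a node to the list of nodes that follow it.
    Directed acyclicity of the bubble guarantees that the search terminates.
    """
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        for nxt in successors.get(node, []):
            stack.append((nxt, path + [nxt]))
    # Sort so that allele indices are reproducible between runs.
    return sorted(paths)
```

For a bubble with successor map `{"s": ["a", "b", "e"], "a": ["e"], "b": ["e"]}`, this returns three allele paths, one of which (`["s", "e"]`) skips the inner nodes and would correspond to a deletion allele.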

2.3 PacBio alignments

For phasing bubbles, we consider long reads from third generation sequence technologies such as PacBio. We align these long reads to the sequence graph G to generate paths through the graph. We perform graph alignment using a banded version of the algorithm described by Rautiainen and Marschall (2017), which is a generalization of semi-global alignment to sequence-to-graph alignment (https://github.com/maickrau/GraphAligner). There are several advantages of aligning PacBio reads to graphs instead of to a reference genome or contigs. SNPs often occur near larger variants such as insertions and deletions. SNPs are thus often missed in these regions when reads contain large mismatches with respect to the linear sequences they are aligned against. Graph alignment allows the alignment of reads to variants appropriate to each read’s phase, and to other types of complex events.

Definition 4 (Alignment). We define a set of read alignments $R$, where each read alignment $r \in R$ is given by a path of oriented nodes in graph $G$, written $r = n_1 n_2 \dots n_k$. For example, Figure 4 shows read alignment paths, such as r1, as sequences of nodes running through the bubbles.

2.4 Bubble ordering

The next stage of our algorithm is to obtain an ordering of the bubbles $L$, which we refer to as a bubble chain. For example, in Figure 4, (l1, l2) is a bubble chain. A general sequence graph G is cyclic, due to different types of repeats present in the genome that create both short and long cycles. Ordering bubbles in such a graph is closely related to resolving repeats, which is a challenging problem. In this study, we rely on the Canu algorithm (Koren et al.) to provide a bubble ordering by aligning Canu-generated contigs to our sequence graph. Furthermore, we detect repetitive bubbles (that is, bubbles that would need to be traversed more than once in a final assembly) based on the depth of coverage of aligned PacBio reads, and remove such bubbles. We deem a bubble repetitive if the number of PacBio reads aligned to its starting node is greater than a user-specified coverage threshold. For example, given a 30× dataset and a repeat that occurs 20 times in the genome, the average coverage at the bubble is 30 × 20 = 600×.
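The coverage-based repeat filter amounts to a single comparison per bubble. A minimal sketch follows, assuming the per-bubble read counts at the start node have already been collected into a dictionary; the function names and the example counts are hypothetical.

```python
def expected_repeat_coverage(dataset_coverage, copies):
    """Average read depth expected at a bubble that occurs `copies` times."""
    return dataset_coverage * copies


def filter_unique_bubbles(start_node_read_counts, threshold):
    """Keep bubbles whose start-node read count does not exceed the threshold.

    `start_node_read_counts` maps a bubble identifier to the number of
    PacBio reads aligned to the bubble's starting node.
    """
    return [
        bubble
        for bubble, count in start_node_read_counts.items()
        if count <= threshold
    ]
```

With a 30× dataset, a 20-copy repeat is expected at about `expected_repeat_coverage(30, 20)` = 600 reads, far above any threshold set near the average coverage, so such a bubble is dropped while bubbles at roughly 30 reads are kept.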

2.5 Graph-based phasing

Given a sequence graph G, ordered bubbles L, and PacBio alignments R, the goal is to reconstruct two haplotype sequences (h0, h1), called haplotigs, along each chain of bubbles.

Definition 5 (Haplotype path). Formally, a pair of haplotype paths (h0, h1) can be defined as two paths through a bubble chain in the sequence graph, both running from $n_s$ to $n_e$, where h0 and h1 may differ at the heterozygous regions defined by bubbles, and $n_s$ and $n_e$ are the start and end of the bubble chain.

The two genome sequences can be seen as two walks through the bubbles L in the sequence graph G that are consistent with the PacBio alignments R. In maximum likelihood terminology, the goal is to find the most likely haplotype paths given the alignment paths traversing through the bubbles. For example, in Figure 4, given bubbles (l1, l2) and PacBio alignments R, the goal is to find two maximum likelihood haplotype paths (h0, h1) such that each PacBio alignment is assigned to one of the haplotypes. For a linear chain of bubbles L, the task of finding these two haplotype paths is equivalent to picking one allele path per haplotype for each bubble. To this end, we note that an alignment path r for a given read can be viewed as a sequence of allele paths traversed in consecutive bubbles.

We represent this association of reads to allele paths in the form of a bubble matrix $\mathcal{F} \in \{0, 1, \dots, m-1, -\}^{|R| \times |L|}$, where $|R|$ is the number of reads, $|L|$ is the number of bubbles along a chromosome, and $m$ is the maximum number of paths (or alleles) in any bubble $l \in L$. The entry $\mathcal{F}(r, l)$ represents the allele path index in bubble l that read r is aligned to, where a value of ‘–’ indicates that the read does not cover the bubble. In Figure 4, note that the read alignment path r4 does not cover all the nodes in any of the allele paths in l2 and hence we set the corresponding value to ‘–’. As a result, this read covers only one bubble, which renders it uninformative for phasing, and we do not consider it further.
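The read-to-allele association just described can be sketched as follows. This is an illustrative reading of the text, not the WhatsHap implementation: paths are plain lists of node names, `None` stands in for the ‘–’ entry, a read is matched to an allele only if it contains the complete allele path, and all function names and node labels are assumptions.

```python
def contains_path(read_path, allele_path):
    """True if allele_path occurs as a contiguous run inside read_path."""
    n = len(allele_path)
    return any(
        read_path[i:i + n] == allele_path
        for i in range(len(read_path) - n + 1)
    )


def bubble_matrix(read_paths, bubbles):
    """Build the rows of the bubble matrix for phasing-informative reads.

    `bubbles` is an ordered list (the bubble chain); each bubble is a list
    of allele paths. An entry is the index of the allele the read traverses,
    or None when the read does not cover any complete allele path.
    """
    matrix = []
    for read in read_paths:
        row = []
        for alleles in bubbles:
            hits = [i for i, a in enumerate(alleles) if contains_path(read, a)]
            row.append(hits[0] if len(hits) == 1 else None)
        # A read covering fewer than two bubbles carries no phase information.
        if sum(entry is not None for entry in row) >= 2:
            matrix.append(row)
    return matrix
```

A read that stops partway through a bubble, like r4 in the example above, gets `None` for that bubble and is dropped when it then covers only a single bubble.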
The remaining phasing-informative reads in Figure 4 make up the bubble matrix $\mathcal{F}$, referred to below as (1). Corresponding to $\mathcal{F}$, we have a weight matrix $\mathcal{W}$. Each entry $\mathcal{W}(r, l)$ is a tuple storing a weight for each allele, which can for instance reflect ‘phred-scaled’ (i.e. $-10 \log_{10} p$) probabilities that the read supports a given allele. The weight of ‘0’ at the i-th entry in the tuple encodes that the read r is aligned to allele path index i in bubble l. The remaining non-zero values in the tuple $\mathcal{W}(r, l)$ store the confidence scores of switching the aligned read r to other alleles in bubble l. For example, in the weight matrix (2) corresponding to (1), the entry $\mathcal{W}(r_0, l_1) = (0, q_1, q_2)$ means that the read r0 is aligned to allele a0 at bubble l1. Additionally, the cost of flipping it to other alleles is q1 for a1 and q2 for a2.

We are now ready to present the problem formulation. The main insight is that solving phasing for bubble chains is similar to solving the phasing problem for multi-allelic SNVs in reference-based haplotype reconstruction. Therefore, we build on the previous formulation of the MEC problem (Lancia et al.) and its weighted version (wMEC) (Lippert et al.; Patterson et al.) and further adapt it to work on a subgraph consisting of a chain of bubbles, defining the Minimum Error Correction for graphs (gMEC) problem.

Problem 1 (wMEC for bubble chains (gMEC)). Assume we are given a bubble chain $L$ and a set $R$ of aligned reads r that pass through these bubbles, with $\mathcal{F}(r, l)$ indicating the index of the allele in bubble l that the alignment of read r passes through, or ‘–’ if it does not pass through l, and that $\mathcal{W}(r, l)[i]$ is the cost of flipping $\mathcal{F}(r, l)$ to new value i. We want to find two paths (h0, h1) through L, each of which consists of a sequence of allele indices specifying which allele the path takes in each bubble l, and then to flip entries of $\mathcal{F}$ such that each row is equal to one of the paths for all non-dash entries while the incurred costs are minimized.

Note that the wMEC problem constitutes a special case of gMEC, where the input graph is a chain of bi-allelic bubbles.
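For very small inputs, Problem 1 can be solved by exhaustive search, which is useful as a reference when checking a faster solver. The sketch below is an assumption-laden illustration: it represents the weight matrix as nested lists of tuples with `None` for ‘–’, restricts each bubble to pairs of distinct alleles, and tries every assignment of reads to the two haplotypes. It is exponential in the number of reads and is not how WhatsHap solves the problem.

```python
from itertools import permutations, product


def gmec_bruteforce(weights, n_alleles):
    """Exhaustively solve gMEC on a tiny instance.

    weights[r][l] is None (read r does not cover bubble l) or a tuple whose
    i-th value is the cost of assigning read r to allele i in bubble l.
    n_alleles[l] is the number of alleles (at least 2) in bubble l.
    Returns (cost, h0, h1, sides), where sides[r] is the haplotype of read r.
    """
    n_reads = len(weights)
    best = None
    for sides in product((0, 1), repeat=n_reads):
        total, h0, h1 = 0, [], []
        # For a fixed read partition, every bubble can be optimized alone.
        for l, m in enumerate(n_alleles):
            column_best = None
            for i, j in permutations(range(m), 2):
                cost = 0
                for r in range(n_reads):
                    entry = weights[r][l]
                    if entry is not None:
                        cost += entry[i] if sides[r] == 0 else entry[j]
                if column_best is None or cost < column_best[0]:
                    column_best = (cost, i, j)
            total += column_best[0]
            h0.append(column_best[1])
            h1.append(column_best[2])
        if best is None or total < best[0]:
            best = (total, h0, h1, sides)
    return best
```

Because swapping the two haplotypes gives the same cost, the returned pair is only defined up to that swap.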
Next, we describe how to solve gMEC via dynamic programming (DP). In the WhatsHap algorithm (Patterson et al.), wMEC is solved in an exact manner for bi-allelic variants using a dynamic programming approach. It runs in $O(2^{c} \cdot |L|)$ time, where $|L|$ is the number of variants to be phased and $c$ is the maximum physical coverage. The basic idea is to proceed column-wise from left to right over a set of active reads. Each read remains active from its first non-dash position to its last non-dash position in $\mathcal{F}$. In column k, we denote the set of active reads as $A(k)$; in particular, $|A(k)| \le c$. The algorithm now considers all bipartitions of $A(k)$, that is, all pairs of disjoint sets P and Q such that $P \cup Q = A(k)$. We fill the DP table column-wise: for each column k of $\mathcal{F}$, we fill a DP table column $C(k, \cdot)$ with one entry per bipartition of $A(k)$. Each entry $C(k, B)$ is equal to the cost of solving wMEC on the partial matrix consisting of columns 1 to k of $\mathcal{F}$ such that the bipartition of the full read set extends B according to the definition below.

Definition 6 (Bipartition extension). For a given set $A$ and a subset $A' \subseteq A$, a bipartition $B = (P, Q)$ of $A$ is said to extend a bipartition $B' = (P', Q')$ of $A'$ if $P' \subseteq P$ and $Q' \subseteq Q$.

Once all entries of the DP table have been computed, the minimum of the last column indicates the optimal wMEC cost and the optimal bipartition can be obtained by backtracing. We refer the reader to Patterson et al. for a more detailed explanation of this algorithm.

Solving gMEC for bubble chains

The basic idea is now to extend the dynamic program to consider all possible path-pairs through each bubble. In the bi-allelic case, we have only two paths in every bubble and, therefore, there is only one pair of distinct paths. In the multi-allelic case, we consider all possible path pairs in each bubble. The goal is to find an optimal pair of paths from the sequence graph G. Analogously to the WhatsHap algorithm for wMEC, we proceed from left to right using dynamic programming. To explain the dynamic programming algorithm that we use, we consider the toy example given by the weight matrix (2).

DP cell initialization

Along similar lines as Patterson et al., we first compute the local cost incurred by a bipartition $B = (P, Q)$ in column k, denoted $\Delta_C(k, B)$, and later combine it with the corresponding costs incurred in previous columns. The cost of flipping all entries in a read set $P$ to an allele index $i$ is given by

$$W^i_k(P) = \sum_{r \in P} \mathcal{W}(r, k)[i].$$

In the same manner, we can compute the costs $W^i_k(Q)$ for the read set $Q$ and an allele index $i$. To compute the cost incurred by a bipartition in a particular column k, we minimize over all possible pairs of alleles in the corresponding bubble. There are at most $\binom{m}{2}$ such pairs. So, given the corresponding column vectors $\mathcal{F}(\cdot, k)$ and $\mathcal{W}(\cdot, k)$ of the bubble matrix and of the weight matrix, respectively, and the bipartition $(P, Q)$ of the active reads $A(k)$, the cost is computed by minimizing over all pairs of alleles $(i, j)$ with $i < j$:

$$\Delta_C(k, (P, Q)) = \min_{i < j} \; \min \left\{ W^i_k(P) + W^j_k(Q), \; W^j_k(P) + W^i_k(Q) \right\}, \quad (4)$$

where the outer minimization considers all allele pairs and the inner minimization considers the two possibilities of assigning those two alleles to the two haplotypes.
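The local cost of one bipartition in one column can be written directly from this description. The following sketch assumes the weight column is a dictionary from read name to a per-allele cost tuple (reads with a ‘–’ entry are simply left out); the function names and the toy weights are hypothetical.

```python
from itertools import combinations


def flip_cost(column_weights, reads, allele):
    """Cost of assigning every read in `reads` to `allele` in this column."""
    return sum(column_weights[r][allele] for r in reads)


def local_cost(column_weights, P, Q, n_alleles):
    """Minimum cost of a bipartition (P, Q) over all pairs of distinct alleles.

    The outer loop runs over unordered allele pairs; the inner `min` tries
    both ways of assigning the two alleles to the two haplotypes.
    """
    best = None
    for i, j in combinations(range(n_alleles), 2):
        cost = min(
            flip_cost(column_weights, P, i) + flip_cost(column_weights, Q, j),
            flip_cost(column_weights, P, j) + flip_cost(column_weights, Q, i),
        )
        if best is None or cost < best:
            best = cost
    return best
```

For example, with column weights `{"r0": (0, 5, 5), "r1": (5, 0, 5), "r2": (0, 7, 7)}`, placing r0 and r2 together and r1 alone costs nothing, because the two groups already agree on alleles 0 and 1, respectively.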

DP column initialization

We initialize the first DP column by setting $C(1, B) = \Delta_C(1, B)$ for all possible bipartitions B. We enumerate all bipartitions in Gray code order, as done previously in Patterson et al. This ensures that only one read is moved from one set to another in each step, facilitating constant time updates of the per-allele flipping costs of the two read sets. For the bubble matrix (1) and its corresponding weight matrix (3), each cell $C(1, B)$ of the first DP column is obtained by plugging the values from (3) into Equation 4, one bipartition at a time.

Algorithm 1: DP COLUMN INITIALIZATION
Input: Set A(1) of reads covering bubble l1.
Output: $C(1, \cdot)$
forall bipartitions B of A(1) do
    Compute $\Delta_C(1, B)$ using Equation 4 and store in C(1, B).
end

Due to the use of the Gray code order, we can perform this operation for one DP column in $O(2^{c} \cdot m^2)$ time.
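The Gray code enumeration can be sketched as a generator that moves exactly one read per step and updates the per-allele cost sums of the two read sets incrementally. This is an illustration of the idea rather than the WhatsHap code; the function name `gray_code_costs` and the list-of-tuples weight layout are assumptions.

```python
def gray_code_costs(weights, n_alleles):
    """Enumerate all bipartitions of the reads in Gray code order.

    weights[r] is the per-allele cost tuple of read r in the current column.
    Yields (sides, cost_p, cost_q): sides[r] is 0 if read r is in P and 1 if
    it is in Q; cost_p[i] / cost_q[i] is the cost of assigning all of P / Q
    to allele i.
    """
    n = len(weights)
    sides = [0] * n  # start with every read in P
    cost_p = [sum(w[i] for w in weights) for i in range(n_alleles)]
    cost_q = [0] * n_alleles
    yield tuple(sides), tuple(cost_p), tuple(cost_q)
    for step in range(1, 2 ** n):
        # Consecutive Gray codes differ in the bit given by the lowest
        # set bit of the step counter.
        moved = (step & -step).bit_length() - 1
        sign = 1 if sides[moved] == 0 else -1  # P -> Q or Q -> P
        for i in range(n_alleles):
            cost_p[i] -= sign * weights[moved][i]
            cost_q[i] += sign * weights[moved][i]
        sides[moved] ^= 1
        yield tuple(sides), tuple(cost_p), tuple(cost_q)
```

Each step touches one read and updates only the sums for that read's weights, rather than recomputing them over all active reads.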

DP column recurrence

Note that C(k, B) is the cost of an optimal solution of Problem 1 for input matrices restricted to the first k columns under the additional constraint that the solution’s bipartition of the full read set extends B. Since column k lists all bipartitions, the optimal solution to the input matrix consisting of the first k columns would be given by the minimum in that column. To compute entries in column $k + 1$, we add up local costs $\Delta_C(k + 1, B)$ incurred in column k + 1 and costs from the previous column (see Algorithm 2). To adhere to the semantics of $C(k + 1, B)$ described above, only entries in column k whose bipartitions are compatible with B are to be considered as possible ‘predecessors’ of $C(k + 1, B)$.

Definition 7 (Bipartition compatibility). For bipartitions $B = (P, Q)$ of $A$ and $B' = (P', Q')$ of $A'$, B and B' are compatible if $P \cap A' = P' \cap A$ and $Q \cap A' = Q' \cap A$, denoted by $B \simeq B'$.

For example, consider the second column from (1) and (3). The entries $C(2, B)$ are computed for the different bipartitions B using the recurrence in Algorithm 2, and the remaining bipartitions of the DP column $C(2, \cdot)$ are filled analogously.

Algorithm 2: DP TABLE
Input: $C(1, B)$ for all bipartitions B of bubble 1.
Output: $C(k, \cdot)$ for all the columns k up to the last column
for all columns $k = 2, \dots, |L|$ do
    forall bipartitions $B \in \mathcal{B}(A(k))$ do
        Compute $\Delta_C(k, B)$ using Equation 4.
        Combine it with cost from column k – 1 to obtain cost for column k:
        $C(k, B) = \Delta_C(k, B) + \min_{B' \in \mathcal{B}(A(k-1)),\; B' \simeq B} C(k - 1, B')$
    end
end
where $\mathcal{B}(A(k))$ denotes the set of all bipartitions of A(k).
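Putting the local cost, the compatibility constraint and the column recurrence together gives a compact, unoptimized dynamic program. The sketch below is written under several assumptions: the weight matrix is a nested list with `None` for ‘–’, every bubble has at least two alleles, bipartitions are enumerated with `itertools.product` instead of Gray code order, and predecessor lookups use a dictionary keyed by the sides of the shared reads. The names `local_cost` and `solve_gmec` are hypothetical.

```python
from itertools import combinations, product


def local_cost(weights, k, reads, sides, n_alleles):
    """Best (cost, allele of h0, allele of h1) for one bipartition in column k."""
    best = None
    for i, j in combinations(range(n_alleles), 2):
        for a, b in ((i, j), (j, i)):
            cost = 0
            for r, s in zip(reads, sides):
                entry = weights[r][k]
                if entry is not None:
                    cost += entry[a] if s == 0 else entry[b]
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best


def solve_gmec(weights, n_alleles):
    """Column-wise DP over bipartitions of active reads; returns (cost, h0, h1)."""
    n_reads, n_cols = len(weights), len(n_alleles)
    active = []
    for k in range(n_cols):
        column_reads = []
        for r in range(n_reads):
            covered = [l for l in range(n_cols) if weights[r][l] is not None]
            # A read is active from its first to its last covered bubble.
            if covered and covered[0] <= k <= covered[-1]:
                column_reads.append(r)
        active.append(column_reads)

    tables = []  # per column: {sides: (cost, (a, b), predecessor sides)}
    for k in range(n_cols):
        reads = active[k]
        projection, shared = {}, set()
        if k > 0:
            shared = set(active[k - 1]) & set(reads)
            # Best previous cost for each assignment of the shared reads.
            for prev_sides, (cost, _, _) in tables[k - 1].items():
                key = tuple(s for r, s in zip(active[k - 1], prev_sides) if r in shared)
                if key not in projection or cost < projection[key][0]:
                    projection[key] = (cost, prev_sides)
        column = {}
        for sides in product((0, 1), repeat=len(reads)):
            cost, a, b = local_cost(weights, k, reads, sides, n_alleles[k])
            pred = None
            if k > 0:
                key = tuple(s for r, s in zip(reads, sides) if r in shared)
                prev_cost, pred = projection[key]
                cost += prev_cost
            column[sides] = (cost, (a, b), pred)
        tables.append(column)

    # Backtrace from the cheapest entry of the last column.
    sides = min(tables[-1], key=lambda s: tables[-1][s][0])
    total = tables[-1][sides][0]
    h0, h1 = [], []
    for k in range(n_cols - 1, -1, -1):
        _, (a, b), pred = tables[k][sides]
        h0.append(a)
        h1.append(b)
        sides = pred
    return total, h0[::-1], h1[::-1]
```

The work per column grows with the number of bipartitions of the active reads rather than with the total number of reads, which is why the coverage, not the read count, is the limiting parameter.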

Backtracing

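We can backtrace from the last column to compute an optimal bipartition $(P^*, Q^*)$ of all input reads. Given this bipartition, we obtain minimum-cost haplotypes as follows: Let $(P_k, Q_k)$ with $P_k = P^* \cap A(k)$ and $Q_k = Q^* \cap A(k)$ be the induced bipartition in column k. We then set h0 and h1 in bubble k to the allele paths $a_{i^*}$ and $a_{j^*}$, respectively, where $(i^*, j^*)$ is the pair of distinct allele indices that minimizes the total flipping cost $\sum_{r \in P_k} \mathcal{W}(r, k)[i] + \sum_{r \in Q_k} \mathcal{W}(r, k)[j]$, and where $a_{i^*}$ and $a_{j^*}$ refer to the corresponding allele paths of bubble k (see Definition 2).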

Time complexity

Computing one DP column takes $O(2^{c} \cdot m^2)$ time, and the total running time is $O(2^{c} \cdot m^2 \cdot |L|)$ for $|L|$ bubbles, where $c$ is the maximum physical coverage and $m$ is the maximum number of alleles in any bubble from L. Running time is independent of read length and, therefore, the algorithm is suitable for the increased read lengths available from upcoming sequencing technologies.

2.6 Generation of final assemblies

To generate final assemblies, for every connected component in the base sequence graph G, we traverse along the haplotype paths (h0, h1) running through that component. For the nodes in each path, we concatenate the nodes’ sequences from the base sequence graph G (in either their forward or reverse-complement orientations, as specified by the path) in order to generate the final haplotig sequences.

3 Datasets and experimental setup

To evaluate the performance of our method, we consider the real data available from two haploid yeast strains SK1 and Y12 (Yue et al.), which we combine to generate a pseudo-diploid yeast. Both the SK1 and Y12 yeast strains are deeply sequenced using Illumina and PacBio sequencing. The Illumina dataset is sequenced to an average coverage of 469× with 151 bp paired-end reads. We randomly downsample the dataset to a lower average coverage of 50×. The PacBio data is sequenced to an average coverage of 334× with an average read length of 4510 bp. For coverage analysis, we randomly downsample the PacBio reads to obtain datasets of 10×, 20× and 30× coverage, with average read lengths of 4482, 4501 and 4516 bp, respectively.

3.1 Pipeline implementation

Sequence graph

The first step in our pipeline is to perform error correction on the Illumina data by using BFC (Li, 2015a), which, in our experience, retains heterozygosities well for diploid genomes. BFC is used with default parameters and provided with a genome size of 12.16 Mbp. The second step is to generate a sequence graph that includes heterozygosity information. To construct such a graph, we first construct the assembly graph by using a modified version of SPAdes v3.10.1 (Bankevich et al.). We modify the original SPAdes to skip the bubble removal step and retain the heterozygosity information in the graph, and run it with default parameters plus the --only-assembler option. It uses the short Illumina reads to generate a de Bruijn-based assembly graph without any error correction. We then convert the assembly graph to a bluntified sequence graph using VG (Garrison et al.). After graph simplification, the resulting sequence graph has 158 567 nodes and 190 767 edges.

Bubble detection

In the next stage, we use VG’s snarl decomposition algorithm (Paten ) to detect the regions of heterozygosity, or snarls, in the sequence graph. This results in 29 071 bubbles.

PacBio alignments

After bubble detection, we align different coverage levels (10×, 20× and 30×) of long read PacBio data to the generated sequence graph using GraphAligner (https://github.com/maickrau/GraphAligner). This resulted in 21 868, 43 459 and 73 129 PacBio alignments for input coverages of 10×, 20× and 30×, respectively.

Bubble ordering

To obtain an ordering of bubbles, we perform de novo assembly using Canu v1.5 (Koren et al.) on each PacBio dataset. As suggested by Giordano et al., we use Canu v1.5 with the following parameter values: corMhapSensitivity=high, corMinCoverage=2, correctedErrorRate=0.10, minOverlapLength=499, corMaxEvidenceErate=0.3. Next, we align these Canu contigs to the sequence graph to obtain the bubble ordering, which we define as the sequence of bubbles encountered by each aligned contig. Note that we use Canu solely for bubble ordering. In this paper, we restrict ourselves to phasing bubbles only in unique, non-repetitive regions. We detect repetitive bubbles based on the coverage depth of the PacBio alignments and remove them from downstream analyses. The coverage depth threshold used is 1.67 times the average coverage. This results in 148, 80 and 71 bubble chains, and 26 576, 27 556 and 27 741 bubbles, at coverages of 10×, 20× and 30×, respectively.

Graph-based phasing

For each of the coverage conditions, we take as input the ordered bubbles, the long-read PacBio alignments and the sequence graph, and solve the gMEC problem by assuming constant weights in the weight matrix $\mathcal{W}$. The optimal bipartition is computed via backtracing and the final haplotigs are generated by concatenating the node labels of the two optimal paths. These steps have been implemented in our WhatsHap software as a subcommand phasegraph.

3.2 Running Falcon Unzip

The main goal of this study is to measure the performance of phasing using a graph-based approach, and, in particular, the quality of haplotypes at heterozygous sites achievable by using this method with low coverage PacBio data. Therefore, we compared our graph-based approach to the state-of-the-art contig-based phasing method Falcon Unzip, which also generates diploid assemblies. The Falcon Unzip (Chin et al.) algorithm first constructs a string graph composed of ‘haploid consensus’ contigs, with bubbles representing SV sites between homologous loci. Sequenced reads are then phased and separated for each haplotype on the basis of heterozygous positions. Phased reads are finally used to assemble the backbone sequence (primary contigs) and the alternative haplotype sequences (haplotigs). The combination of primary contigs and haplotigs constitutes the final diploid assembly, which includes phasing information dividing single-nucleotide polymorphisms and SVs between the two haplotypes. We ran Falcon Unzip using the parameters given in the official parameter guide (http://pb-falcon.readthedocs.io/en/latest/parameters.html). We tried to run Falcon Unzip for lower coverages of 10× and 20×, but it did not generate output in these cases (and we assume it is not designed for such low coverages). Therefore, we only ran Falcon Unzip for 30× PacBio coverage. Primary contigs and haplotigs were polished using the Quiver algorithm and corrected for SNPs and indels using Illumina data via Pilon, with the parameters ‘--diploid’ and ‘--fixall’ (Walker et al.).

3.3 Assembly performance assessment

To evaluate the accuracy of the predicted haplotypes, we align reference assemblies of the two yeast strains SK1 and Y12 (Yue ) to the sequence graph. We emphasize that these reference assemblies are only used for evaluation purposes and are not a part of our assembly pipeline. We use the following performance measures for the evaluation of diploid assemblies:

Phasing error rate

Over the yeast genome, we compare the different diploid assemblies with the ground truth haploid genomes of SK1 and Y12. As with the reference assemblies, we align the haplotigs produced by Falcon Unzip to our sequence graph. For each phased bubble chain, the predicted haplotype is expressed as a mosaic of the two true haplotypes, minimizing the number of switches. This minimum then gives the number of switch errors. The phasing error rate is defined as the number of switch errors divided by the number of phase connections between consecutive phased bubbles. Figure 5 illustrates this calculation for a toy example. The top panel shows the true references aligned to the sequence graph. At the bottom, predicted haplotypes (from Falcon Unzip or our graph-based approach) are aligned to the graph. Comparing the true and predicted haplotypes, we see one switch between SV1 and SV2, which means that the switch error count is one. Six bubbles have been phased, so the number of phase connections between consecutive bubbles is five and the resulting phasing error rate for this example is 1/5.

Fig. 5.

For a subgraph of G, this example shows the true (top) and predicted (bottom) versions of two haplotype alignments (red and blue) through a series of bubbles. When comparing the correspondingly-colored lines between the two versions, we see one switch between SV1 and SV2: the prediction contains one switch error. Six bubbles have been phased, for a total of five phase connections between consecutive bubbles. Therefore, the phasing error rate is 1/5.
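The switch error calculation can be sketched as a small dynamic program over one bubble chain. This is an illustrative sketch rather than the evaluation code used in the study: the function name, the encoding of each bubble as an allele index (0 or 1) and the toy data are assumptions. Bubbles at which the prediction matches neither true haplotype are skipped.

```python
def count_switch_errors(predicted, truth_a, truth_b):
    """Minimum number of switches needed to express `predicted` as a mosaic
    of `truth_a` and `truth_b`, plus the number of bubbles compared."""
    inf = float("inf")
    cost = {"A": 0, "B": 0}  # best cost so far when currently copying A or B
    compared = 0
    for p, a, b in zip(predicted, truth_a, truth_b):
        if p != a and p != b:
            continue  # allele explained by neither truth haplotype: skip
        compared += 1
        new_cost = {}
        for state, other, allele in (("A", "B", a), ("B", "A", b)):
            if p != allele:
                new_cost[state] = inf  # cannot copy from this haplotype here
            else:
                new_cost[state] = min(cost[state], cost[other] + 1)
        cost = new_cost
    return min(cost.values()), compared


# Toy chain with six phased bubbles and a single switch after the third one.
truth_a = [0, 1, 0, 0, 1, 0]
truth_b = [1, 0, 1, 1, 0, 1]
predicted = [0, 1, 0, 1, 0, 1]

switches, bubbles = count_switch_errors(predicted, truth_a, truth_b)
connections = bubbles - 1
error_rate = switches / connections
print(switches, connections, error_rate)
```

As in the toy example of Figure 5, six phased bubbles give five phase connections, and one switch yields a rate of 1/5.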

Average percent identity

We consider the best assignment of each haplotig to either of the two true references, obtained by aligning the haplotig to the references. For each whole diploid assembly, we compute the average of the best-alignment percent identities over all haplotigs.
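As a minimal sketch of this measure, assume that each haplotig has already been aligned to both references and that the two percent identities are available. The haplotig names and identity values below are hypothetical, and the unweighted mean over haplotigs is an assumption about how the average is taken.

```python
def average_best_identity(identities):
    """Average, over haplotigs, of the better of the two reference identities."""
    best = [max(pair) for pair in identities.values()]
    return sum(best) / len(best)


# Hypothetical identities (%) of each haplotig against (SK1, Y12).
identities = {
    "haplotig_1": (99.9, 97.0),
    "haplotig_2": (96.5, 99.5),
    "haplotig_3": (99.7, 98.0),
}

avg_identity = average_best_identity(identities)
print(round(avg_identity, 2))
```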

Assembly contiguity

We assess the contiguity of the assemblies by computing the N50 of haplotig size.
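For reference, the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length. A minimal sketch, with a hypothetical function name and toy lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy haplotig lengths: total is 150, so half of the total is 75.
toy_lengths = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Here the two longest sequences (50 + 40 = 90) are the first to reach half of the total, so the N50 is 40.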

Assembly completeness

We consider two assembly completeness statistics: first, the total length of haplotigs assembled by each method, and second, the total number of unphased contigs.

4 Results

In this section, we present the results of our analysis of the diploid assemblies generated by our method and by Falcon Unzip on the datasets described above.

4.1 Coverage analysis

To discover a cost-effective method for assembling a diploid genome, we consider PacBio datasets that vary in terms of coverage—specifically, 10×, 20× and 30× coverage are considered. One of the primary aims of our study is to compare two approaches—the graph-based approach we implemented and the contig-based phasing done by Falcon Unzip. In doing so, we quantify the agreement between the diploid assemblies generated by both methods and the true references. Table 1 shows the assembly performance statistics for both of these methods. In order to assess the accuracy of the competing diploid assemblies, we compute the phasing error rate and the average percent identity at different PacBio coverages. For the graph-based approach, we observe that as we increase the long read coverage from 10× to 30×, the average identity of haplotigs increases from 99.5% to 99.8% and the phasing error rate decreases from 2.5% to 0.7%. In contrast, Falcon Unzip produces haplotigs with an average identity of 99.4% and phasing error rate of 3.8% at 30× coverage. Overall, comparing the agreement between the graph-based approach (at 10× coverage) and Falcon Unzip (at 30× coverage) to the true references, our graph-based approach delivers better haplotigs with respect to all measures reported in Table 1. We believe that one reason for this is that we use an Illumina-based graph as a backbone. Furthermore, optimally solving the gMEC formulation of the phasing problem most likely contributes to generating accurate haplotigs. Overall, our analysis supports the conclusion that our approach delivers accurate haplotype sequences even at a long read coverage as low as 10×.

Table 1.

Comparison of two phasing methods, Falcon Unzip and our graph-based approach, at different PacBio coverage levels

Statistics	PacBio coverage	Graph-based approach	Falcon Unzip
Diploid assemblies quality
Average identity (%)	10×	99.50	—
	20×	99.61	—
	30×	99.80	99.4
Phasing error rate (%)	10×	2.5	—
	20×	1.5	—
	30×	0.7	3.8
Contiguity
N50 haplotig size (bp)	10×	40k	—
	20×	42k	—
	30×	43k	32k
Completeness
Haplotig size (Mbp)	10×	20.7	—
	20×	21.1	—
	30×	23.9	16.6
# Unphased contigs	10×	2	—
	20×	2	—
	30×	2	77

Note: For computing the ‘haplotig N50’, we only consider those portions of a contig for which two haplotypes are available, i.e. those regions where Falcon reports both a primary contig and an alternative haplotig. For ‘haplotig size’, we sum the length of contigs on both haplotypes (‘primary contigs’ plus ‘haplotigs’ in terms of Falcon’s output), so the target size is twice the genome size (24.3 Mbp in case of yeast).

To analyse the effect of Illumina short-read coverage on the quality of our haplotigs, we went back to the original, high coverage Illumina dataset (which we had downsampled to 50× coverage for the experiments above) and downsampled it to 100× coverage instead, i.e. twice the amount of reads used above. We observed that increasing the coverage did not have a drastic effect on the quality of haplotigs. The average identity rose to 99.81% and the total haplotig size was 23.9 Mbp, which is virtually identical to the results for 50× as reported in Table 1.

With an increase in average PacBio coverage from 10× to 30×, the haplotype contiguity (N50) achievable by using our approach improves from 40 kbp to 43 kbp. By way of comparison, Falcon Unzip delivers haplotigs with an N50 length of 32 kbp at 30× coverage. Our approach thus generates more contiguous haplotypes than Falcon Unzip.

In terms of haplotype completeness, our approach yields diploid assemblies of length 20.7, 21.1 and 23.9 Mbp at average PacBio coverages of 10×, 20× and 30×, respectively. At coverage 30×, Falcon Unzip delivers a total assembly size of 16.6 Mbp, while the total length of both haplotypes of the pseudo-diploid yeast genome is 24.3 Mbp. Our approach therefore delivers more complete haplotypes at a long-read coverage of 10× than Falcon Unzip does at a coverage of 30×.
There are 2 haplotigs that are not phased by our approach; this is due to the lack of heterozygosity over those regions. In comparison there are 77 (out of 123) contigs that are not phased by Falcon Unzip. In summary, our graph-based approach delivers complete and contiguous haplotype sequences even at a relatively low coverage of 10×.
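The completeness figures can be put in relation to the 24.3 Mbp target with a few lines of arithmetic. The sketch below only re-uses the numbers reported in Table 1 and in the text; the variable names are illustrative.

```python
TARGET_MBP = 24.3  # total length of both haplotypes of the pseudo-diploid yeast

# Total haplotig sizes (Mbp) as reported in Table 1.
haplotig_size_mbp = {
    "graph-based, 10x": 20.7,
    "graph-based, 20x": 21.1,
    "graph-based, 30x": 23.9,
    "Falcon Unzip, 30x": 16.6,
}

# Percentage of the diploid target covered by each assembly.
completeness_pct = {
    name: round(100 * size / TARGET_MBP, 1)
    for name, size in haplotig_size_mbp.items()
}

# Share of Falcon Unzip contigs that remain unphased (77 out of 123).
falcon_unphased_pct = round(100 * 77 / 123, 1)

for name, pct in completeness_pct.items():
    print(f"{name}: {pct}% of target")
print(f"Falcon Unzip unphased contigs: {falcon_unphased_pct}%")
```

Under these numbers, the graph-based assembly at 10× covers roughly 85% of the target, compared with roughly 68% for Falcon Unzip at 30×.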

4.2 Bubble characterization

We attempted to characterize the nature of the heterozygous genomic variation encoded in the phased bubbles. There are 25 033 bi-allelic bubbles phased by our approach when using 30× coverage PacBio data. Of these bubbles, there are 15 293 for which both allele sequences have a length of at most 1 bp, out of which 15 258 are single base pair substitutions (SNVs) and 35 are 1 bp indels. The remaining 9 740 bubbles encode either two or more small variants or more complex differences. To differentiate these cases, we computed an alignment between the two allele paths and refer to those bubbles for which the alignment contains only substitutions but no indels as ‘pure substitutions’. Figure 6a shows the joint distribution of length and (Hamming) distance for these pure substitution bubbles. This analysis reveals, on the one hand, that many longer pure substitutions have a low distance and hence encode multiple SNVs and, on the other hand, that there also exists a population of more complex substitutions. For the 1 489 bubbles not classified as pure substitutions, which we refer to as ‘mixed bubbles’, Figure 6b shows the absolute length difference between the two alleles. While this difference is small for most bubbles, there are 93 bubbles with a length difference of 21 bp or more. To further elucidate the nature of the sequence differences, Figure 6c presents the joint distribution of the length of the longer allele and the substitution rate, which is defined as the fraction of substitutions among all edit operations needed to align the two sequences. That is, a pure insertion or deletion has a substitution rate of 0.

Fig. 6.

Structural variation analysis of phased bubbles from our graph-based approach. (a) Joint distribution of allele length and Hamming distance, for pure substitutions. (b) Distribution of size difference between the two alleles, for mixed bubbles and indels. Pure substitutions always have a size difference of 0, and are not included in the figure. (c) Joint distribution of the length of the longer allele and the substitution rate, for mixed bubbles. With a higher substitution rate, the bubble has more substitutions, and with a lower rate more indels.
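The bubble categories and the substitution rate can be sketched as follows. This is an illustrative sketch, not the analysis code of the study: it assumes a unit-cost edit-distance alignment, breaks ties between equally good alignments in favour of fewer indels, and assumes the two alleles of a bubble differ. All function names are hypothetical.

```python
def edit_operations(a, b):
    """Return (substitutions, indels) of an optimal unit-cost alignment,
    preferring alignments with fewer indels among equally good ones."""
    n, m = len(a), len(b)
    # dp[i][j] = (edit distance, indels, substitutions) for a[:i] vs b[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, i, 0)
    for j in range(1, m + 1):
        dp[0][j] = (j, j, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(a[i - 1] != b[j - 1])
            d, g, s = dp[i - 1][j - 1]
            diagonal = (d + mismatch, g, s + mismatch)
            d, g, s = dp[i - 1][j]
            deletion = (d + 1, g + 1, s)
            d, g, s = dp[i][j - 1]
            insertion = (d + 1, g + 1, s)
            dp[i][j] = min(diagonal, deletion, insertion)
    _, indels, substitutions = dp[n][m]
    return substitutions, indels


def classify_bubble(allele1, allele2):
    """Assign a bi-allelic bubble to one of the categories used in the text."""
    if max(len(allele1), len(allele2)) <= 1:
        both_one_bp = len(allele1) == 1 and len(allele2) == 1
        return "SNV" if both_one_bp else "1 bp indel"
    _, indels = edit_operations(allele1, allele2)
    return "pure substitution" if indels == 0 else "mixed"


def substitution_rate(allele1, allele2):
    """Fraction of substitutions among all edit operations of the alignment."""
    substitutions, indels = edit_operations(allele1, allele2)
    return substitutions / (substitutions + indels)


print(classify_bubble("A", "G"), classify_bubble("", "T"))
print(classify_bubble("ACGT", "AGGA"), classify_bubble("AAAA", "TTT"))
print(substitution_rate("ACGTT", "ACG"), substitution_rate("AAAA", "TTT"))
```

Because several optimal alignments can exist for the same pair of alleles, the tie-breaking rule affects which bubbles count as pure substitutions; a different rule could shift borderline cases into the mixed category.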

5 Discussion

The Falcon Unzip method (Chin ) is based purely on PacBio reads, which exhibit a high error rate; it is therefore not suitable for lower coverages. By using (costly) high coverage PacBio data, Falcon Unzip can generate good quality assemblies with an average haplotig identity of up to 99.99% (Chin ). However, it follows a conservative approach for phasing genomic variants. As sketched in Figure 2, Falcon Unzip generates long primary contigs, but tends to phase them only partially.

To address the above problems, we have created a novel graph-based approach to diploid genome assembly that combines different sequencing technologies. By using one technology producing shorter, more accurate reads, and a second technology delivering long reads, we produce accurate, complete and contiguous haplotypes. Our method provides a cost-effective way of generating high quality diploid assemblies. By performing phasing directly in the space of sequence graphs (without flattening them into contigs in intermediate steps), we can phase large SVs, which is not possible using linear approaches. We have tested our approach using real data, in the form of a pseudo-diploid yeast genome, and we have shown that we deliver accurate and complete haplotigs. Furthermore, we have shown that we can detect and phase SVs.

In this study, our main focus was on phasing unique regions of the genome. As a next step, we plan to develop techniques for phasing repetitive regions as well. Resolving repeats and polyploid phasing are closely related problems, as pointed out by Chaisson ). Therefore, we will aim to resolve heterozygous variants and repeats in a joint phasing framework, in order to obtain even more contiguous diploid genome assemblies that include both types of features. That would also remove the need to run an external assembler (Canu) for bubble ordering.
Finally, our framework allows, in principle, for incorporating additional data from other sequencing technologies, such as chromatin conformation capture (Burton ), linked read sequencing (Weisenfeld ), and single-cell template strand sequencing (Strand-seq; Porubský ). In previous studies on reference-based haplotyping, we have shown such integrative approaches to be very powerful for reconstructing chromosome-scale haplotypes (Chaisson ; Porubsky ); we believe similar results can be obtained for de novo diploid genome assemblies.

38 in total

1. The fragment assembly string graph.

Authors: Eugene W Myers
Journal: Bioinformatics Date: 2005-09-01 Impact factor: 6.937

2. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.

Authors: Konstantin Berlin; Sergey Koren; Chen-Shan Chin; James P Drake; Jane M Landolin; Adam M Phillippy
Journal: Nat Biotechnol Date: 2015-05-25 Impact factor: 54.908

3. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.

Authors: Chen-Shan Chin; David H Alexander; Patrick Marks; Aaron A Klammer; James Drake; Cheryl Heiner; Alicia Clum; Alex Copeland; John Huddleston; Evan E Eichler; Stephen W Turner; Jonas Korlach
Journal: Nat Methods Date: 2013-05-05 Impact factor: 28.547

4. Assembly of long error-prone reads using de Bruijn graphs.

Authors: Yu Lin; Jeffrey Yuan; Mikhail Kolmogorov; Max W Shen; Mark Chaisson; Pavel A Pevzner
Journal: Proc Natl Acad Sci U S A Date: 2016-12-12 Impact factor: 11.205

Review 5. The importance of phase information for human genomics.

Authors: Ryan Tewhey; Vikas Bansal; Ali Torkamani; Eric J Topol; Nicholas J Schork
Journal: Nat Rev Genet Date: 2011-02-08 Impact factor: 53.242

6. A hybrid approach for the automated finishing of bacterial genomes.

Authors: Ali Bashir; Aaron Klammer; William P Robins; Chen-Shan Chin; Dale Webster; Ellen Paxinos; David Hsu; Meredith Ashby; Susana Wang; Paul Peluso; Robert Sebra; Jon Sorenson; James Bullard; Jackie Yen; Marie Valdovino; Emilia Mollova; Khai Luong; Steven Lin; Brianna LaMay; Amruta Joshi; Lori Rowe; Michael Frace; Cheryl L Tarr; Maryann Turnsek; Brigid M Davis; Andrew Kasarskis; John J Mekalanos; Matthew K Waldor; Eric E Schadt
Journal: Nat Biotechnol Date: 2012-07-01 Impact factor: 54.908

7. The diploid genome sequence of an individual human.

Authors: Samuel Levy; Granger Sutton; Pauline C Ng; Lars Feuk; Aaron L Halpern; Brian P Walenz; Nelson Axelrod; Jiaqi Huang; Ewen F Kirkness; Gennady Denisov; Yuan Lin; Jeffrey R MacDonald; Andy Wing Chun Pang; Mary Shago; Timothy B Stockwell; Alexia Tsiamouri; Vineet Bafna; Vikas Bansal; Saul A Kravitz; Dana A Busam; Karen Y Beeson; Tina C McIntosh; Karin A Remington; Josep F Abril; John Gill; Jon Borman; Yu-Hui Rogers; Marvin E Frazier; Stephen W Scherer; Robert L Strausberg; J Craig Venter
Journal: PLoS Biol Date: 2007-09-04 Impact factor: 8.029

8. Whole-genome haplotyping approaches and genomic medicine.

Authors: Gustavo Glusman; Hannah C Cox; Jared C Roach
Journal: Genome Med Date: 2014-09-25 Impact factor: 11.117

9. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement.

Authors: Bruce J Walker; Thomas Abeel; Terrance Shea; Margaret Priest; Amr Abouelliel; Sharadha Sakthikumar; Christina A Cuomo; Qiandong Zeng; Jennifer Wortman; Sarah K Young; Ashlee M Earl
Journal: PLoS One Date: 2014-11-19 Impact factor: 3.240

10. Fast and accurate de novo genome assembly from long uncorrected reads.

Authors: Robert Vaser; Ivan Sović; Niranjan Nagarajan; Mile Šikić
Journal: Genome Res Date: 2017-01-18 Impact factor: 9.043
