
Minimum error correction-based haplotype assembly: Considerations for long read data.

Sina Majidian¹, Mohammad Hossein Kahaei¹, Dick de Ridder².

Abstract

The single nucleotide polymorphism (SNP) is the most widely studied type of genetic variation. A haplotype is defined as the sequence of alleles at SNP sites on each haploid chromosome. Haplotype information is essential in unravelling the genome-phenotype association. Haplotype assembly is a well-known approach for reconstructing haplotypes, exploiting reads generated by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often used for reconstruction of haplotypes from reads. However, problems with the MEC metric have been reported. Here, we investigate the MEC approach to demonstrate that it may result in incorrectly reconstructed haplotypes for devices that produce error-prone long reads. Specifically, we evaluate this approach for devices developed by Illumina, Pacific BioSciences and Oxford Nanopore Technologies. We show that imprecise haplotypes may be reconstructed with a lower MEC than that of the exact haplotype. The performance of MEC is explored for different coverage levels and error rates of data. Our simulation results reveal that in order to avoid incorrect MEC-based haplotypes, a coverage of 25 is needed for reads generated by Pacific BioSciences RS systems.


Year: 2020; PMID: 32530974; PMCID: PMC7292361; DOI: 10.1371/journal.pone.0234470

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

Among the various types of genetic variation, single nucleotide polymorphisms (SNPs) are the most widely studied in genome-wide association studies (GWAS). The genome of a diploid organism such as a human consists of pairs of homologous chromosomes, one paternal and one maternal. A haplotype, the sequence of alleles at SNP sites on each homologous chromosome, can be measured through direct experiments or can be reconstructed by computational approaches [1, 2]. Due to the high cost of experimental methods, computational techniques have attracted more attention. These techniques can be categorized as phasing or assembly approaches. Phasing makes use of the genotypes of multiple individuals to infer the haplotype. In the haplotype assembly approach, sets of reads generated by DNA sequencing devices are exploited for haplotype reconstruction. While haplotype assembly can be performed for a single individual, phasing cannot. Moreover, phasing is difficult in the presence of low-frequency and de novo variants.

The history of DNA sequencing technologies consists of three generations. Firstly, the low-throughput Sanger sequencing machines were built in the late 1980s, thanks to the invention of the chain termination procedure. Subsequently, multiplexing strategies were used for the development of the so-called second-generation technologies of the early 2000s. Today, Illumina is the dominant platform of this second generation, providing massively high throughput, up to billions of reads, with a length of a few hundred bases and an error probability lower than 0.001 [3]. Such short reads have limitations: they preclude the assembly of repetitive regions and the detection of structural variants larger than the read length. The third generation of sequencing technology, namely single-molecule sequencing as provided by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), produces exceptionally long reads of up to a million bases.
The bottleneck of this third-generation technology is the low per-base accuracy in comparison to that of the second generation, such that the error probability may exceed 0.1 [4]. Both second and third generation sequencing technologies have been used for haplotype assembly.

Although the sequencing reads provided by all above-mentioned technologies do not keep track of the haplotypic origin of reads, a haplotype assembly algorithm aims to reconstruct the haplotypes using overlaps among reads. In the absence of sequencing errors, this is a trivial problem to solve. A simple bipartitioning scheme can be used to divide reads into two groups corresponding to two haplotypes, such that the reads in each group do not conflict. In real cases, however, the presence of errors makes the problem computationally hard to solve. Several criteria have been proposed in the literature, including minimum fragment removal (MFR), minimum SNP removal (MSR) and minimum error correction (MEC) [5]. The idea behind MFR is to find the minimum number of reads containing errors, which should then be removed. The heuristic algorithms for solving this model are time-consuming and not suitable for low coverage input data. In MSR-based algorithms, several SNP positions are removed to make haplotyping possible. Thus, the haplotypes contain some gaps, leading to a high rate of missing SNPs, which is undesirable. The dominant objective function utilized for the haplotype assembly problem is the MEC, also known as the minimum letter flip [2]. This function is also used in evaluating the performance of different haplotype reconstruction algorithms [6, 7]. Minimizing the MEC function can be rewritten as a MAXCUT problem, which is NP-hard, leading to a large number of heuristic algorithms [5].
Some examples include the HapCUT algorithm (which iteratively computes max-cuts of a read graph [8]), a branch-and-bound genetic algorithm approach [9], an integer linear programming approach [10] and a clustering approach [11], as well as multiple dynamic programming approaches [12-16]. Despite the existence of all these methods utilizing the MEC for haplotype reconstruction, it is crucial to note that this criterion may fail to identify the exact haplotype when there is a high error rate in the reads [9, 17]. In addition, a negative correlation between the haplotype accuracy and the MEC has already been reported in [18], as discussed in the Results section. While this issue has been mentioned briefly in previous studies, it has never been systematically investigated in an effort to understand the implications across different sequencing platforms. In this work, we provide insight into the MEC function to clarify the above ambiguities. The following section presents the fragment matrix model, defines MEC and introduces two theorems regarding MEC performance. The performance curve for MEC is introduced and discussed in the Results section. Furthermore, several DNA sequencing devices are evaluated based on their characteristics, including the error probability values. Finally, simulations of long and short reads are provided to explore practical consequences.

Methods

For diploids, haplotype assembly is the process of reconstructing two haplotypes from overlapping aligned reads. Throughout this paper, we only consider bi-allelic SNPs, that is, SNPs with only one alternative allele against the reference allele [8, 12]. Below we describe the construction of the fragment matrix. Prior to this construction, we remove those reads that cover fewer than two SNP sites, because these are not informative for haplotype assembly. Non-SNP bases of each read are also omitted.

Fragment matrix model

We assume that there are N reads obtained from both chromosomes. For a haplotype of length l, an N × l fragment matrix $R$ is constructed whose rows embed the reads and whose columns correspond to the heterozygous SNP sites [19, 20]. The SNP sites not covered by a read are coded with zero. Then, bases of reads are converted to −1 (alternative allele) or 1 (reference allele), assuming bi-allelic SNPs. As an example of an error-free case, consider the first exon of HLA-A, a gene on chromosome 6 with NCBI reference sequence number NG_029217.2. Its first 40 bases are presented in Fig 1a. It contains five bi-allelic SNP sites (refSNP): C/T (rs753601428), C/G (rs529070997), G/T (rs41560714), A/C (rs551138783) and A/G (rs778615037). The procedure of constructing the fragment matrix is depicted in Fig 1d. In this example, the exact haplotypes that should be reconstructed by the haplotype assembly algorithms are {CGTAG} and {TCGCA}.
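The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the example reads are hypothetical, and it assumes that the first allele of each "X/Y" pair is the reference allele.

```python
import numpy as np

# (reference, alternative) alleles at the five SNP sites of the HLA-A example.
# Assumption: the first allele of each "X/Y" pair is the reference allele.
SNP_ALLELES = [("C", "T"), ("C", "G"), ("G", "T"), ("A", "C"), ("A", "G")]


def encode_read(alleles):
    """Encode one read restricted to SNP sites; None marks an uncovered site."""
    row = []
    for base, (ref, alt) in zip(alleles, SNP_ALLELES):
        if base == ref:
            row.append(1)
        elif base == alt:
            row.append(-1)
        else:
            row.append(0)  # uncovered, or a base outside the two permitted alleles
    return row


def build_fragment_matrix(reads):
    """Stack encoded reads, dropping reads that cover fewer than two SNP sites."""
    rows = [encode_read(r) for r in reads]
    informative = [r for r in rows if sum(v != 0 for v in r) >= 2]
    return np.array(informative, dtype=int).reshape(-1, len(SNP_ALLELES))


# Hypothetical error-free reads drawn from the haplotypes CGTAG and TCGCA.
reads = [
    ["C", "G", "T", None, None],
    [None, "C", "G", "C", None],
    [None, None, "T", "A", "G"],
    ["T", None, None, None, None],  # covers one SNP only, so it is dropped
]
R = build_fragment_matrix(reads)
print(R)
```

Under this coding, the haplotype CGTAG becomes (1, −1, −1, 1, −1) and TCGCA is its sign-inverted complement, which is why a single ±1 vector is enough to describe both haplotypes of a diploid at heterozygous sites.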

Fig 1

An example of fragment matrix model for the first 40 bases of exon 1 of HLA-A gene.

This gene is located on chromosome 6 with NCBI reference sequence number NG_029217.2. It contains 5 bi-allelic SNP sites (refSNP): C/T (rs753601428), C/G (rs529070997), G/T (rs41560714), A/C (rs551138783) and A/G (rs778615037). a) An example of homologous chromosomes in which the SNP sites are indicated in bold, b) an example of aligned reads, c) the fragments after removing non-informative reads and non-SNP bases and d) the constructed fragment matrix.

The fragment matrix can be modeled using a matrix completion approach [19, 20]. In the error-free case, the fragment matrix $R$ is a partially observed matrix modelled as

$$R = P_\Omega(M),$$

where $M$ is the completed version of matrix $R$ (see section B of S1 Appendix for more details). $P_\Omega$ is the observation operator defined as

$$[P_\Omega(M)]_{ij} = \begin{cases} M_{ij}, & (i, j) \in \Omega, \\ 0, & \text{otherwise,} \end{cases}$$

in which $\Omega$ is the set of indices of known entries. In order to generalize the model to the more realistic case allowing erroneous entries, we use an additive measurement error model inspired by [11, 19, 20]:

$$R = P_\Omega(M) + E.$$

To define the error matrix $E$, we should first clarify what we mean by an error. A substitution error is the conversion of a DNA base to one of the other three possible bases during the sequencing procedure. As mentioned earlier, during fragment matrix construction, only two bases (reference and alternative alleles) for each SNP site are permitted and other possible bases are ignored; as a result, a substitution to the ignored bases does not affect the entries of the fragment matrix. Accordingly, we introduce the term bi-allelic substitution, or simply bi-substitution, to distinguish it from the generally defined substitution. A bi-substitution error occurs when a reference allele is converted to the alternative allele or vice versa. Consequently, an error in the entries of $P_\Omega(M)$ is simplified as a change from −1 to 1 or vice versa. This can be formulated as an addition of 2 (or −2) to each erroneous entry of $P_\Omega(M)$, which is represented in the error matrix $E$.

We assume that each non-zero entry of $P_\Omega(M)$ is erroneous with a probability of $p$, the bi-substitution error probability, independent of the other entries. This value equals one third of the substitution error probability of the sequencing device, $p_s$; that is, $p = p_s / 3$.
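The additive error model can be simulated directly. The sketch below is an illustration under the stated independence assumption; the function name, the seed and the small error-free matrix are hypothetical, and the device error probability is the PacBio value listed in Table 1.

```python
import numpy as np


def add_bisubstitution_errors(clean, p, rng):
    """Return (noisy matrix, error matrix E) for bi-substitution probability p."""
    clean = np.asarray(clean)
    observed = clean != 0
    # Each observed entry is flipped independently with probability p.
    flips = observed & (rng.random(clean.shape) < p)
    # Flipping -1 <-> 1 is the same as adding +2 or -2 to the erroneous entry.
    E = np.where(flips, -2 * clean, 0)
    return clean + E, E


p_s = 0.06                       # substitution error probability of the device
p = p_s / 3                      # bi-substitution error probability
rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
clean = np.array([[1, -1, -1, 0, 0],
                  [0, 1, 1, -1, 0],
                  [0, 0, -1, 1, -1]])
noisy, E = add_bisubstitution_errors(clean, p, rng)
```

Unobserved entries stay at zero because the error matrix is nonzero only where a read covers the SNP site.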

MEC definition

If the reads contain no errors, the corresponding rows of the fragment matrix are compatible with each other and the haplotypes are extracted using a simple clustering technique. In practice, however, sequencing devices may produce erroneous reads, due to which the compatibility of reads is lost. To cope with this problem, the MEC approach inverts the sign of some entries of the fragment matrix to make it compatible [9]: (i) find the minimum number of entries of $R$ that should be inverted to make the fragment matrix compatible; (ii) cluster the rows of the augmented fragment matrix and reconstruct the haplotype.

For a fragment matrix $R$ of dimension N × l and a candidate haplotype vector $h \in \{-1, 1\}^l$, the MEC function is calculated as

$$\mathrm{MEC}(R, h) = \sum_{i=1}^{N} \min\big(D(r_i, h),\, D(r_i, -h)\big),$$

in which $r_i$ is the i-th row of $R$ and the extended Hamming distance (EHD) is defined as [8, 10]

$$D(r_i, h) = \sum_{j=1}^{l} d(r_{ij}, h_j).$$

Furthermore, $d(\cdot, \cdot)$ is a mismatch indicator which penalizes its dissimilar arguments by one:

$$d(x, y) = \begin{cases} 1, & x \neq 0,\ y \neq 0 \text{ and } x \neq y, \\ 0, & \text{otherwise.} \end{cases}$$

Therefore, the EHD function represents the number of mismatches between two vectors over the covered sites. From this point of view, $\mathrm{MEC}(R, h)$ is the total number of mismatches between each row of $R$ and the closer of the vector $h$ and its complement $-h$. It is notable that the function $D(\cdot, \cdot)$ is not a distance from the mathematical point of view [21], though it is named as such (see sections A and C of S1 Appendix).
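To make the definition concrete, the following sketch evaluates the EHD and the MEC for a small fragment matrix. It assumes the common diploid form in which every read is compared with the candidate haplotype and with its sign-inverted complement; the function names and the four example reads are hypothetical.

```python
import numpy as np


def ehd(row, h):
    """Extended Hamming distance: mismatches counted only at covered sites."""
    row, h = np.asarray(row), np.asarray(h)
    return int(np.sum((row != 0) & (row != h)))


def mec(R, h):
    """Sum over reads of the distance to the closer of h and its complement -h."""
    h = np.asarray(h)
    return sum(min(ehd(row, h), ehd(row, -h)) for row in np.asarray(R))


h_true = [1, -1, -1, 1, -1]          # CGTAG under the 1 = reference coding
R = [[1, -1, -1, 0, 0],              # read from CGTAG
     [0, 1, 1, -1, 0],               # read from TCGCA
     [0, 0, -1, 1, -1],              # read from CGTAG
     [1, -1, 1, 0, 0]]               # read from CGTAG with one flipped entry
print(mec(R, h_true))                # 1
```

The three error-free reads contribute nothing, and the single flipped entry in the last read contributes exactly one correction. The candidate vector is assumed to contain no zeros.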

Analysis of MEC performance

Consider $\hat{h}$ as an optimal solution obtained by a given method that minimizes the MEC function. The question arises: does minimizing this function guarantee reaching the exact haplotypes (i.e., the true haplotypes of the individual)? In Theorem 1, we demonstrate not only that this solution offers no guarantee of finding the exact haplotype, but also that, under a condition on the errors, the minimum of the MEC function is not attained at the exact haplotype.

Theorem 1. There exists a vector $h'$ different from the exact haplotype $h$ with a lower MEC when the k-th column of the fragment matrix $R$ contains erroneous entries whose number $E(k)$ is greater than half of its coverage. In a mathematical expression:

$$E(k) > \frac{c(k)}{2} \;\Longrightarrow\; \exists\, h' \neq h:\ \mathrm{MEC}(R, h') < \mathrm{MEC}(R, h),$$

where $c(k)$ is the coverage (or the read depth) of the k-th SNP site. The coverage indicates the number of reads that cover the SNP and is equal to the number of known entries of the k-th column of $R$. We conclude that the ratio $E(k)/c(k)$, called the bi-substitution rate, plays a key role in the evaluation of a sequencing device. From a practical perspective, $E(k)$, the number of nonzero values of the k-th column of $E$, represents the number of bi-substitutions at the corresponding genomic position (see section Fragment matrix model).

The proof of Theorem 1 is presented in section B of S1 Appendix. The core idea of the proof is to consider $h'$ equal to $h$ except in its k-th entry, whose sign is inverted. This guarantees a lower MEC. Note that if the antecedent is not satisfied, the MEC approach works properly. In practice, whether the antecedent of Theorem 1 is fulfilled is a major point to be investigated further. To explore this point, Theorem 2 presents the probability of the antecedent not occurring.

Theorem 2. The probability of obtaining a minimum MEC value for the exactly correct haplotype, P{c-MEC}, is equal to

$$P\{\text{c-MEC}\} = \prod_{j=1}^{l} \sum_{i=0}^{\lceil c(j)/2 \rceil - 1} \binom{c(j)}{i} p^{i} (1 - p)^{c(j) - i},$$

in which $p$ is the bi-substitution error probability.

Proof: According to Theorem 1, the MEC approach works properly when the number of erroneous entries of each column is lower than half of its corresponding coverage. Based on the independence assumption above, the number of erroneous entries of each column of $R$ is independent of the other columns. Then, we have:

$$P\{\text{c-MEC}\} = \prod_{j=1}^{l} P\left\{E(j) < \frac{c(j)}{2}\right\}. \tag{8}$$

An erroneous entry gets the opposite sign due to the bi-allelic assumption. Each known entry is therefore erroneous according to a Bernoulli distribution with error probability $p$. Thus, the number of errors in the j-th column follows a Binomial distribution given by $E(j) \sim \mathrm{Binomial}(c(j), p)$. Therefore, we can write:

$$P\left\{E(j) < \frac{c(j)}{2}\right\} = \sum_{i=0}^{\lceil c(j)/2 \rceil - 1} \binom{c(j)}{i} p^{i} (1 - p)^{c(j) - i}. \tag{9}$$

Accordingly, using (8) and (9) the proof of Theorem 2 is complete.
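The probability in Theorem 2 is a product of binomial tail sums and is easy to evaluate numerically. The sketch below follows the wording of the proof, counting a column as safe when its number of errors is strictly lower than half of its coverage; that strictness, the function names and the illustrative value of p are assumptions, and coverage values are assumed to be at least 1.

```python
from math import comb


def prob_column_ok(c, p):
    """P{number of errors < c/2} for errors ~ Binomial(c, p)."""
    upper = (c + 1) // 2 - 1          # ceil(c/2) - 1, largest count below c/2
    return sum(comb(c, i) * p**i * (1 - p)**(c - i) for i in range(upper + 1))


def prob_correct_mec(coverages, p):
    """Product of the per-column probabilities over all SNP sites."""
    prob = 1.0
    for c in coverages:
        prob *= prob_column_ok(c, p)
    return prob


p = 0.02                                        # illustrative bi-substitution probability
for c in (2, 10, 30):
    print(c, prob_correct_mec([c] * 1000, p))   # constant coverage, l = 1000
```

Because every per-column factor is below one whenever p > 0, the product shrinks as the number of SNP sites grows, which is the length effect examined in the Results section.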

Results

Performance curves of MEC

The outcome of Theorem 2 is calculated for various scenarios with different probabilities of error and coverage levels. This is done by introducing performance curves for MEC. The y-axis indicates the probability of obtaining a correct MEC, P{c-MEC}, and the x-axis the bi-substitution error probability p. In practice, the average coverage of input data provided for haplotype assembly varies from very low to very high levels. Based on the existing literature on coverage distribution among different genomic positions [19, 22, 23], we consider two different distributions, Poisson and quasi-uniform (i.e., the analogue of the uniform distribution defined for a discrete random variable), as well as constant coverage levels. The error probability of various datasets may also differ dramatically due to the specifications of the DNA sequencer. In Fig 2a, the performance curve, P{c-MEC} versus p ∈ [0.0001, 0.5], is presented for different coverage values. In three cases, we consider constant coverage c(j) = 2, 10 and 100 for j = 1, …, l, respectively. Next, the c(j) values are drawn randomly from the quasi-uniform distribution over three different intervals [1, 2], [1, 10] and [1, 100]. In addition, MEC performance is investigated for coverage values of SNP sites following the Poisson distribution with mean λ = 2, 10 and 100. Furthermore, Fig 2b displays P{c-MEC} for different haplotype lengths l = {100, 10k, 1M} and coverage values c = {2, 10, 30}.

Fig 2

Performance curves of MEC approach.

a: Comparison of P{c-MEC} for different coverage levels (constant c = {2, 10, 100}, quasi-uniform over c = {[1, 2], [1, 10], [1, 100]} and Poisson distribution with mean λ = {2, 10, 100}). b: Comparison of P{c-MEC} for different haplotype lengths l = {100, 10k, 1M} and different coverage values c = {2, 10, 30}.

In Fig 2a, it is seen that P{c-MEC} decreases as the sequencing error probability p increases. Additionally, depending on the coverage distribution, each P{c-MEC} begins to drop after a particular threshold. For example, for the Poisson distribution with mean λ = 10 and l = 1k, this threshold is p = 2%. In this case, the MEC approach is unable to reconstruct the exact haplotype for p > 2%. This problem arises when the number of errors in a column is more than half of its coverage, as expressed in Theorem 1. The existence of such a column is more likely as the error probability increases. Fig 2b presents our investigation of the effect of the haplotype length on P{c-MEC}. It demonstrates that a greater haplotype length l leads to incorrect haplotypes at a lower bi-substitution error probability p.
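The failure mode described here can be reproduced on a toy example. The sketch below is a hypothetical construction, not data from the paper: three reads come from the same haplotype, and two of the three entries in one column are erroneous, so the number of errors exceeds half of the coverage. The MEC is computed in the assumed diploid form, comparing each read with the candidate vector and its complement.

```python
import numpy as np


def mec(R, h):
    """MEC with each read compared against h and its complement -h."""
    R, h = np.asarray(R), np.asarray(h)
    covered = R != 0
    d_h = np.sum(covered & (R != h), axis=1)
    d_c = np.sum(covered & (R != -h), axis=1)
    return int(np.minimum(d_h, d_c).sum())


h_exact = np.array([1, 1, 1, 1])
# Three reads from h_exact, all covering SNP site k = 2 (0-based);
# two of the three entries in that column were flipped by sequencing errors.
R = np.array([[1, 1, -1, 1],
              [1, 1, -1, 1],
              [1, 1, 1, 1]])

h_flipped = h_exact.copy()
h_flipped[2] *= -1                    # invert the sign of entry k only

print(mec(R, h_exact), mec(R, h_flipped))   # 2 1
```

The vector with the inverted entry needs one correction while the exact haplotype needs two, so a MEC-minimizing method would prefer the incorrect vector.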

Evaluation of sequencing technologies: Theory

Here, we analyze the MEC for different DNA sequencing devices based on the theory above. Table 1 presents the results of the evaluation of different devices launched by Illumina, PacBio and ONT. For each device, the evaluation employs the typical number of reads per run, the read length and the error probability as reported in the literature [4, 24–26]. In order to provide a fair comparison, we set the coverage value at c = 10. To calculate the number of runs needed (denoted by n) for such coverage, we used the average coverage formula, the Lander-Waterman equation, as follows:

$$c = \frac{n\, N_t\, l_r}{G},$$

where $l_r$, $N_t$ and $G$ denote the read length, the total number of reads per run and the human genome length, respectively.
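Solving the coverage formula for the number of runs gives a one-line calculation. The sketch below is illustrative only: the genome length, the choice to round up to a whole run and the example device are assumptions, so its output need not coincide with the rounded values listed in Table 1.

```python
import math


def runs_needed(coverage, reads_per_run, read_length, genome_length):
    """Smallest whole number of runs n with n * N * l / G >= coverage."""
    return math.ceil(coverage * genome_length / (reads_per_run * read_length))


G = 3.1e9   # assumed human genome length in bases; the text does not state the value used
# Hypothetical device: 1 million reads of 10 kb per run, i.e. 10 Gb per run.
print(runs_needed(10, 1_000_000, 10_000, G))   # 4
```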

Table 1

Comparison of MEC applicability of different sequencing devices, for the substitution error probability p_s, the total number of reads per run N_t in millions, the read length l_r and the number of runs n needed for a coverage of 10.

For Illumina technology, the read length corresponds to the paired-end setting.

Device	p_s	N_t (millions)	l_r (bases)	n	P{c-MEC}	MEC applicability
Illumina MiSeq V3	0.001	50	300	2	0.97	Yes
Illumina HiSeq 4000	0.001	2500	150	1	0.97	Yes
Illumina HiSeq X	0.001	2600	150	1	0.97	Yes
Pacific BioSciences RS II	0.06	0.055	20k	30	0.23	No
Pacific BioSciences Sequel	0.06	0.35	12k	10	0.23	No
Oxford Nanopore MinION	0.02	0.1	200k	2	0.42	No

The applicability of the MEC approach for data generated by each device is reported in the last column of Table 1, based on the value of P{c-MEC}. This shows that the MEC criterion works well for short reads produced by Illumina devices, but not for long reads produced by PacBio or ONT. A larger value of n corresponds to a higher sequencing cost for each device. It should be noted that for each run, long-read devices are far more expensive than short-read devices.

Evaluation of sequencing technologies: Simulations

We run various simulations to provide a deeper understanding of MEC-based haplotype assembly. First, using DNA sequencing data, we estimate how often MEC failures can occur based on Theorem 1. The accuracy of the reconstructed haplotype is also investigated in terms of switch error rate and haplotype block length.

On the satisfaction of Theorem 1

Here, we inspect the effect of short and long sequencing reads, along with their corresponding error profiles, on the satisfaction of the antecedent of Theorem 1. To do so, we use the bi-substitution rate defined in the Methods section. We briefly present the details of our simulations. We consider the 21st chromosome of the human genome (GRCh38) [27] as the reference DNA sequence. Bi-allelic SNPs are introduced at a rate of one in a thousand bases [28] across this reference using haplo-generator, part of the haplosim package [29]. For generating PacBio long reads, we use the PBSIM package [26] with the PacBio error profile. Then, we align the reads using minimap2 [30]. We run the ART package [31] for generating short paired-end reads and the Burrows-Wheeler Aligner (BWA) [32] for aligning them. We sort the aligned reads using the samtools package [33]. Afterwards, using the mpileup subprogram of samtools [33], alleles for each position are extracted from the sorted aligned reads. Then, the required statistics for all introduced SNPs are calculated. For both Illumina reads and PacBio long reads, the number of SNPs with a bi-substitution rate greater than or equal to 0.5 is depicted in Fig 3.

Fig 3

Number of SNPs with bi-substitution rate of greater than or equal to 0.5 (high bi-substitution) for Illumina reads and PacBio long reads at different coverage levels.

In S1 and S2 Figs, we depict the histogram of bi-substitution rates of SNP sites. For coverage values up to 25 for PacBio data, there are some positions in which the bi-substitution rate is greater than 0.5. This leads to the satisfaction of the antecedent of Theorem 1 and thus MEC failure. When we set the coverage greater than or equal to c = 30, no SNP site with high bi-substitution rate remains.

Haplotype reconstruction accuracy

We now examine the direct effect of coverage on the accuracy of the reconstructed haplotype. We utilize the well-known HapCUT algorithm as a MEC-based haplotype assembly method. The output of HapCUT consists of haplotype blocks, whose continuity can be evaluated by calculating the average block length. Larger haplotype blocks, indicating that haplotypes are reconstructed more continuously, are of interest. A change in the parental origin of an allele compared to the previous allele is called a switch error. To evaluate the accuracy of the reconstructed haplotype, we calculate the switch error rate by dividing the number of switch errors by the haplotype length. The switch error rate and average block length of the haplotype reconstructed by HapCUT are depicted for different coverage values from c = 10 to 45 in Fig 4a and 4b, respectively. The results are provided for 20 independently generated datasets. As seen in both figures, the accuracy and continuity of the reconstructed haplotype increase with coverage. For a dataset with low coverage, specifically lower than 25 per haploid, not only are there many switches but the reconstructed haplotype is highly fragmented as well. This corroborates the findings in Fig 3.
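The switch error rate as defined above can be computed as follows. This is a minimal sketch for a single haplotype block with ±1-coded, heterozygous, bi-allelic sites; the function name and the example vectors are hypothetical, and evaluation tools may normalize differently.

```python
def switch_error_rate(h_true, h_est):
    """Number of switches in parental origin divided by the haplotype length."""
    # Origin of each reconstructed allele: 0 if it matches h_true, 1 if it
    # matches the complementary haplotype (heterozygous, bi-allelic sites).
    origin = [0 if t == e else 1 for t, e in zip(h_true, h_est)]
    switches = sum(a != b for a, b in zip(origin, origin[1:]))
    return switches / len(h_true)


h_true = [1, -1, -1, 1, -1, 1, 1, -1]
h_est = [1, -1, -1, -1, 1, -1, 1, -1]    # sites 3-5 phased onto the other haplotype
print(switch_error_rate(h_true, h_est))  # 0.25
```

A reconstruction that equals the complement of the true haplotype everywhere yields a rate of zero, since the two haplotypes of a diploid are interchangeable.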

Fig 4

Accuracy of reconstructed haplotypes using HapCUT in terms of average haplotype block length and switch error rate.

Discussion

The issue addressed in this paper has been recognized previously by Duitama et al. [18], who note that a candidate haplotype with lower MEC is associated with lower reconstruction accuracy. This result can be predicted from the model we described. It should be noted that, while we assume errors to be independent and identically distributed (iid), in reality this may not hold true, although this assumption has been widely used before [6, 34, 35]. Though PacBio reads have no systematic error, errors in alignment and variant calling may exist due to high numbers of insertions and deletions. Acquiring comprehensive error models for all sequencing technologies is a difficult task and exploiting them in our model would make the derivation infeasible. Therefore, we used an approach that is simplified yet close to reality. Although the focus of this paper is on diploids, we present the MEC formula for polyploids and show that MEC failure may also happen in a specific polyploid case in section D of S1 Appendix.

Conclusion

We investigated the reliability of the MEC approach for haplotype assembly. We demonstrated that in some practical circumstances, an imprecise haplotype may be reconstructed with a lower MEC than that of the exact haplotype. The theoretical MEC performance curves were obtained for different coverage values and error rates. Based on our analyses, we evaluated some DNA sequencing devices by the MEC criterion. It was found that this approach can generate misleading results for low-coverage, error-prone long reads generated by Pacific BioSciences and Oxford Nanopore Technologies platforms. In order to address this issue, one should use a high coverage for long reads. The results provided in this study suggest that, when MEC-based haplotype assembly methods are applied to available long reads, reconstruction of the true haplotypes is not feasible for coverage lower than 25 per haploid (i.e., 50 overall). An important future direction for this work is to investigate thoroughly the extent of the issues with MEC for polyploid genomes.

S1 Fig. Histogram of bi-substitution rates for Illumina reads.

a: coverage 10. b: coverage 15. c: coverage 20. (TIFF)

S2 Fig. Histogram of bi-substitution rates for PacBio reads.

a: coverage 10. b: coverage 15. c: coverage 20. d: coverage 25. e: coverage 30. f: coverage 45. The red bars indicate results for which the antecedent of Theorem 1 is satisfied. (TIFF)

S1 Appendix. A: Properties of extended Hamming distance. B: Proof of Theorem 1. C: Properties of MEC. D: Extension to polyploid genomes. (ZIP)

30 in total

1. Haplotype reconstruction from SNP fragments by minimum error correction.

Authors: Rui-Sheng Wang; Ling-Yun Wu; Zhen-Ping Li; Xiang-Sun Zhang
Journal: Bioinformatics Date: 2005-02-24 Impact factor: 6.937

2. On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes.

Authors: Paola Bonizzoni; Riccardo Dondi; Gunnar W Klau; Yuri Pirola; Nadia Pisanti; Simone Zaccaria
Journal: J Comput Biol Date: 2016-06-09 Impact factor: 1.479

Review 3. Coming of age: ten years of next-generation sequencing technologies.

Authors: Sara Goodwin; John D McPherson; W Richard McCombie
Journal: Nat Rev Genet Date: 2016-05-17 Impact factor: 53.242

4. Optimal algorithms for haplotype assembly from whole-genome sequence data.

Authors: Dan He; Arthur Choi; Knot Pipatsrisawat; Adnan Darwiche; Eleazar Eskin
Journal: Bioinformatics Date: 2010-06-15 Impact factor: 6.937

5. PBSIM: PacBio reads simulator--toward accurate genome assembly.

Authors: Yukiteru Ono; Kiyoshi Asai; Michiaki Hamada
Journal: Bioinformatics Date: 2012-11-04 Impact factor: 6.937

6. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming.

Authors: Shreepriya Das; Haris Vikalo
Journal: BMC Genomics Date: 2015-04-03 Impact factor: 3.969

7. HapTree: a novel Bayesian framework for single individual polyplotyping using NGS data.

Authors: Emily Berger; Deniz Yorukoglu; Jian Peng; Bonnie Berger
Journal: PLoS Comput Biol Date: 2014-03-27 Impact factor: 4.475

8. NGS based haplotype assembly using matrix completion.

Authors: Sina Majidian; Mohammad Hossein Kahaei
Journal: PLoS One Date: 2019-03-26 Impact factor: 3.240

9. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing.

Authors: Peter Edge; Vikas Bansal
Journal: Nat Commun Date: 2019-10-11 Impact factor: 14.919

10. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937
