Literature DB >> 30913270

NGS based haplotype assembly using matrix completion.

Sina Majidian1, Mohammad Hossein Kahaei1.   

Abstract

We apply matrix completion methods for haplotype assembly from NGS reads to develop the new HapSVT, HapNuc, and HapOPT algorithms. This is performed by applying a mathematical model to convert the reads to an incomplete matrix and estimating unknown components. This process is followed by quantizing and decoding the completed matrix in order to estimate haplotypes. These algorithms are compared to the state-of-the-art algorithms using simulated data as well as the real fosmid data. It is shown that the SNP missing rate and the haplotype block length of the proposed HapOPT are better than those of HapCUT2 with comparable accuracy in terms of reconstruction rate and switch error rate. A program implementing the proposed algorithms in MATLAB is freely available at https://github.com/smajidian/HapMC.

Entities:  

Mesh:

Year:  2019        PMID: 30913270      PMCID: PMC6435133          DOI: 10.1371/journal.pone.0214455

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The Single Nucleotide Polymorphism (SNP) is a kind of genetic variation with a frequency greater than 1% in population. In diploid organisms, genomes are organized into pairs of chromosomes, a paternal and a maternal copy. The sequence of SNPs on each copy of a pair of chromosomes is called a haplotype. A genotype is the conflation of two haplotypes on the homologous chromosomes. An SNP is called homozygous, if a pair of alleles at this locus is made up of two identical nucleotides, and is heterozygous, otherwise. From the evolutionary point of view, the SNP happens as a consequence of mutation. However, since the mutation rate is low, several mutations of a locus rarely occur. Thus, it is usual to assume that the majority of SNPs are bi-allelic, meaning that each SNP can be chosen from just two of the four possible nucleotides, i.e., A, T, C, and G [1]. Accordingly, in this work we similarly use this assumption. The haplotype is widely used in the Genome Wide Association Studies (GWAS), clinical genetics, linkage analysis, drug-design, and personalized medicine [2]. To extract a haplotype, one may use the following three approaches where the last two approaches are mathematical: Applying high-cost experimental and expensive methods for every single individual which is of course not desirable [2]. Haplotype phasing wherein the haplotypes are inferred from the genotypes of multiple individuals. As such, a method based on the maximum parsimony assumption [3] and statistical methods like SHAPEIT, developed based on the Hidden Markov Model [1, 4] may be mentioned. Note that using this approach, the haplotype of an individual can not be found separately and also is challenged by the low-frequency and also de novo variants [2]. Estimating haplotypes from Next Generation Sequencing (NGS) reads i.e. nucleotide sequence of fragments. Using this approach, known as the haplotype assembly, haplotyping of a single individual becomes feasible. In this regard, HapCUT2 [5], HapTree [6], and HapSAT [7] are three famous methods developed based on probabilistic models. These methods are sensitive to the selected model and thus fragile to the model error. A recent method for haplotype assembly is AltHap [8] which has shown accurate results compared to H-PoP [9], SCGD [10], and HapTree [6]. The H-PoP is a heuristic algorithm originated from the Balanced Optimal Partition (BOP) optimization model which benefits from the Minimum Error Correction (MEC) as well as the maximum fragments cut approaches [11]. The SDhaP [12] is also another heuristic method based on correlation clustering and non-convex optimization which does not guarantee reaching the global optimum. The innovation of this article is threefold. First, the haplotype assembly is mathematically formulated based on matrix completion methods. Secondly, three new algorithms called the Haplotype assembly based on Singular Value Thresholding (HapSVT), Haplotype assembly based on Nuclear norm minimization (HapNuc), and Haplotype assembly based on OPTSPACE (HapOPT) are proposed. Next, in the section of Results, these algorithms are compared to some benchmark methods in terms of the reconstruction rate and the switch error rate.

Model of haplotypes

To exploit the NGS reads as the raw data, a computational modeling is needed. For this purpose, similar to [10], we first convert the sequence of nucleotides which can be either reads or haplotypes into a sequence of numbers. The SNP nucleotides are converted to 1 and −1 for the wild and rare alleles, respectively. As an example, Table 1 depicts the alleles of the β2AR gene [3] for which the maternal and paternal haplotypes of an individual are shown by and , respectively. The corresponding codewords based on the above modeling are presented in the last column.
Table 1

Haplotypes of β2AR genes and their corresponding codewords.

NucleotidesCodewords
AllelesG/AC/AG/AC/GT/CT/CT/CG/AC/GG/A{1/-1,1/-1,…}
hmACGGCCCGGG{-1,1, 1,-1,-1,-1,-1, 1,-1,1}
hpGCACTTTACG{ 1,1,-1, 1, 1, 1, 1,-1, 1,1}
Next, assuming that each read has been aligned to the reference genome, the non-SNP sites of each read are omitted. Then, the reads are coded using the procedure described in Table 1, and are completed by adding zeros for the length of l as shown for 10 aligned reads in Table 2. As seen in this example, for the 1st row, we get {-1 1 1 0 0 0 0 0 0 0} with 3 sites of ±1 and 7 sites of zeros.
Table 2

Example of aligned reads for β2AR genes and the considered codewords.

ReadsNucleotidesCodewords
1ACG-1110000000
2GGCC001-1-1-10000
3GGGG001-10000-11
4GCACTT11-11110000
5ACTACG00-10011-111
6GCTT1100110000
7CCG-1100-100001
8ACCCC-1100-1-1-1000
9GCTAC1001001-110
10ACCG00-11000011
Without loss of generality, by representing the codewords of Table 2 by the vectors , i = 1, …, N, we form the read matrix , where N is the number of reads. In fact, is an incomplete matrix with the rank of 2 which consists of the maternal and paternal haplotypes in its rows. At this stage, we may utilize matrix completion methods to complete this low rank matrix. To do so, by estimating the zero entries of , we obtain the completed matrix which has the same dimension as , i.e., N × l where l is the haplotype length. According to Table 2, these matrices are given by (1) and (2). From , one can observe that only two of its rows are different and thus the desired haplotypes are given by These vectors can then be decoded to the sequence of nucleotides using the first row of Table 1. To the best of our knowledge, no algorithm has been reported to distinguish between the maternal and paternal haplotypes and therefore and may be interchanged with each other. It should be noted that the above example is an error-free case to clarify the procedure of data modeling which can be trivially solved. For the erroneous case; which is the subject of our work, is an incomplete version of + where shows the noise matrix [8].

Proposed methods

We present three new algorithms for haplotype assembly whose general block diagram is illustrated in Fig 1. The goal is to estimate and from the noisy reads. The first two blocks have been explained before. In the third block, we receive an incomplete matrix with a few known entries where the set of indices of known entries is given by Ω [10]. Then, we intend to estimate the unknown entries based on rank assumption. Mathematically, this is modeled by the following optimization problem: It is worth mentioning that here we have not only considered the case of all-heterozygous variants, but also included the case of both heterozygous and homozygous variants. This can be realized as a point of this work in comparison to some other methods that are restricted to heterozygous variants. In the all-heterozygous case, the two haplotypes will be the negative of each other, i.e., = − and thus the rank of will be one (See (5)).
Fig 1

Block diagram of the proposed algorithms.

To solve (5), the nuclear norm minimization, Singular Value Thresholding (SVT), and OPTSPACE methods have already been reported [13], based on which we introduce three new algorithms called the HapSVT, HapNuc, and HapOPT.

Haplotype assembly based on Singular Value Thresholding (HapSVT)

To explain the proposed HapSVT algorithm, we first introduce the SVT which is based on Singular Value Decomposition (SVD) [14] defined for the read matrix as where H denotes the hermitian operator, and and have orthonormal columns with the dimension of N × r and l × r, respectively. By applying the singular value shrinkage operator D(⋅) to , we obtain where It is worth noting that D() is the optimal value of the optimization problem where ‖⋅‖ is the Frobenius norm and ‖⋅‖* shows the nuclear norm as the summation of singular values. To perform the matrix completion part as shown in Fig 1, we recursively use the SVT in two steps. In the first step, starting with the initial matrix 0 = , the singular value shrinkage operator is used as Then, in the second step, the difference between the projected matrix and the initial matrix is compensated for the known entries using for k = 1, 2, …, where is an operator which keeps the entries of the matrix corresponding to Ω unchanged, and sets the other entries to zero. The iterations continue until the condition is satisfied and the last is reported as the completed matrix . To extract and , we compute the reduced row echelon form of and by using the first two pivot positions, two independent rows of are obtained. Then, in order to acquire the paternal and maternal haplotypes the entries are quantized to 1 and −1. The procedures of the HapSVT algorithm is depicted in Algorithm 1. Algorithm 1: Haplotype assembly using SVT (HapSVT). input: N aligned reads output: Haplotypes , /* Read Matrix Preparation */ 1 Convert the sequences of nucleotides (reads) to the sequences of numbers. 2 Add zeros to each read to construct s with the length of l. 3 Construct the read matrix (N × l). /* Matrix Completion (SVT) */ 4 Initialize Y0 = , k = 0, i = 1. 5 while do 6   k = k + 1 7    = D() 8 9 end 10 = /* Reduced Row Echelon Form (RREF) Calculation */ 11 [, ] = RREF() /* Haplotype Extraction */ 12 = 2 * ( > 0) − 1 13 = ((1),:) 14 = ((2),:) 15 Convert the entries of and to the nucleotides.

Haplotype assembly based on Nuclear norm minimization (HapNuc)

A popular method for matrix completion is based on relaxing the non-convex rank function to a convex function. Since the number of nonzero singular values determines the rank of a matrix, an approximation of the rank function is defined by the summation of singular values, known as the nuclear norm [15]. In this way, the optimization problem is cast as This problem can be solved easily using the CVX, a MATLAB based package [16]. It has been shown that the nuclear norm minimization has strong mathematical guarantees to achieve the optimal solution [15, 17, 18]. To develop the new HapNuc algorithm, we substitute the SVT part of Algorithm 1 by nuclear norm minimization.

Haplotype assembly based on OPTSPACE (HapOPT)

Another method for matrix completion is known as OPTSPACE [19] in which unlike the two previous methods, we assume that the rank of the desired matrix is known. The OPTSPACE consists of the following three steps: a) trimming, b) projection, and c) cleaning, as explained below. a) In the trimming step, those columns of with the degrees larger than 2|Ω|/l are set to zero where |⋅| shows the cardinality of a set and l is the haplotype length. The degree of a column (or a row) shows the number of its known entries. This step is also performed for the rows of with the degrees larger than 2|Ω|/N where N is the number of reads. b) The trimmed obtained from Step (a) is projected to the space of rank r matrices using where P(Σ) = diag(σ1, …σ) and and are given by (6). c) The cleaning step is performed by solving the following optimization problem, which contains two minimization parts. The inner part results in a function in terms of and . To solve the outer minimization part, we use a gradient based recursive method whose initial matrices are computed from Step (b), i.e., 0 = and 0 = . Then, this recursive method leads to the optimal solution . To finalize the third new HapOPT algorithm, we should substitute the SVT part of Algorithm 1 by the above three steps.

Results

Using extensive simulations, we compare the performance of the proposed HapSVT, HapNuc, and HapOPT algorithms with that of the three recent benchmark algorithms AltHap [8], HapCUT2 [5], and SDhaP [12]. It has already been shown that these algorithms outperform some other algorithms like RefHap [20], SCGD [10], HapTree [6], and H-PoP [9]. For comparison purposes, a well-known criterion is the reconstruction rate defined as [21] where and are the reconstructed haplotypes which are compared to the known maternal and paternal haplotypes, and . Moreover, is the augmented hamming distance between two vectors which counts the number of non-identical sites using where is defined as To consider another criterion for performance evaluation, we make use of the SWitch Error Rate (SWER), defined as the number of switches divided by the haplotype length [22]. A switch happens when the parental origin of an allele with respect to that of the previous allele differs from one parent to another. For example, by considering = [1, 1, 1, 1] and = [−1, −1, −1, −1] as the grand truth haplotypes and the estimated haplotypes as and , one switch has been occurred.

Simulated data

First, we use the simulated data [21] generated based on real human haplotypes in the HapMap project. This dataset; which contains different read matrices with various error rates and coverage values originated from different haplotype lengths, has vastly been used in previous studies [10, 23, 24]. We choose the longest available haplotype from the dataset with the length of l = 700. The coverage value of the NGS paired-end reads varies from c = 3 to its greatest value c = 10. The average number of reads are N = 561, 936, and 1873 for coverage values of c = 3, 5, and 10, respectively. The number of SNPs covered in each read is a constant value equal to 7.4. Also, 10% (and 20%) of the entries of the read matrix are contaminated by noise with uniform distribution. The results are averaged over 100 independent trials of the experiment. Table 3 shows the reconstruction rates for different coverage values and error rates. The corresponding SWERs are also depicted in Table 4. In this case, HapCUT2 is not examined, since it needs the Variant Call Format (VCF) file which is not available for this simulated dataset [21]. As seen in both Tables 3 and 4, the proposed HapOPT algorithm outperforms the others in terms of the reconstruction rate as well as the SWER. It is worth reminding that the SDhaP solves a non-convex optimization problem using a heuristic technique with the gradient descent algorithm which does not guarantee reaching the global optimum. Furthermore, as a consequence of increasing the coverage value, a better performance is achieved by a lower SWER and a higher reconstruction rate.
Table 3

Reconstruction rates for different algorithms on simulated data [21].

The best values are in boldface.

coverageerror rate (%)SDhaPAltHapHapOPT(Proposed)HapSVT(Proposed)HapNuc(Proposed)
31097.8799.0499.0798.3898.32
51099.1999.6699.7297.2198.82
101099.641199.5399.64
32096.6697.3297.3897.0097.31
52097.3698.2498.4397.4797.47
102097.0299.4599.2598.6698.6
Table 4

SWERs for different algorithms on simulated data [21].

The best values are in boldface.

coverageerror rate (%)SDhaPAltHapHapOPT (Proposed)HapSVT (Proposed)HapNuc (Proposed)
3100.0700.0380.0270.1110.120
5100.0190.00580.0040.2070.049
10100.0018000.0120.003
3200.2270.2470.2180.3500.345
5200.1360.1230.1010.2430.266
10200.0650.01780.0180.0800.121

Reconstruction rates for different algorithms on simulated data [21].

The best values are in boldface.

SWERs for different algorithms on simulated data [21].

The best values are in boldface.

Real fosmid data

We evaluate the proposed algorithms on the sequence data of the individual NA12878 fabricated based on a fosmid approach [20]. The coverage of this data set is c = 3 and the average read length is 40 kb, and hence, is a low-coverage and long-read dataset. For evaluation purposes, we consider the trio-phased haplotype from the GATK resource bundle, as the grand truth containing 1.3 million heterozygous variants in common with fosmid dataset [22, 25]. This dataset has already been used in several studies [5, 8, 22]. In the simulated dataset used in the last section, each read overlaps at least one another read, while for the real data these overlaps do not necessarily occur. In this situation, our algorithm incorporates the overlaps for haplotype estimation, and as a result, the output of each algorithm is some disjoint parts of the whole haplotype, called haplotype blocks. To evaluate a common length for these blocks, we consider their mean and also the AN50 defined as the median of blocks lengths in base pairs weighted by a proportion of correctly estimated alleles [6]. Also, we define the SNP Missing Rate (SMR) for each chromosome as the ratio of the number of missing SNPs in the estimates and the haplotype length [26]. The results on the real fosmid data are shown in Table 5. One can see that both HapOPT and AltHap algorithms achieve lower SNP missing rates in comparison to HapCUT2 and SDhaP. Moreover, HapOPT and AltHap have a better span in terms of AN50.
Table 5

Mean and AN50 of haplotype blocks lengths for different algorithms on real fosmid data.

SDhaPHapCUT2AltHapHapOPT (Proposed)
Chr.SMRMeanAN50 (kb)SMRMeanAN50 (kb)SMRMeanAN50 (kb)SMRMeanAN50(kb)
16.271.52546.771.12296.272.72346.272.7234
26.968.62418.368.32196.969.72236.969.7223
38.169.72188.669.31958.070.62048.070.6204
410.063.419210.463.11729.964.61779.964.6177
58.269.52198.869.02068.270.32108.270.3210
67.382.42437.981.92247.384.02367.384.0236
77.269.72227.669.52077.171.02127.171.0212
87.875.62298.375.22077.776.82207.776.8220
97.079.62497.579.22306.980.92356.980.9235
106.883.92387.383.42176.784.92206.784.9220
117.177.12347.576.82257.078.32287.078.3228
126.473.42627.373.02416.774.12496.774.1249
1310.269.120310.768.718610.170.319110.170.3191
146.577.52597.077.12386.378.42466.378.4246
156.073.72516.473.22285.974.12345.974.1234
163.896.63454.296.23173.797.93273.797.9327
173.970.83234.570.43053.971.53103.971.5310
187.175.32287.674.92167.076.02237.076.0223
193.190.83743.590.43453.093.83603.093.8360
204.392.43144.892.02974.293.73044.293.7304
216.681.12527.080.82426.482.42426.482.4242
222.7123.74453.2123.24252.6123.94262.6123.9426
To assess the accuracy of different algorithms, the corresponding reconstruction rates [5, 22] are presented in Fig 2. Moreover, we have considered both short and long SWERs [5, 22]. By a long switch, we mean that the parental origin does not change for at least two SNPs and if two switches occur one after each other, we consider it as a short switch. These two metrics are reported on real fosmid data in Figs 3 and 4.
Fig 2

Reconstruction rate of HapOPT, HapCUT2, AltHap, and SDhaP on real fosmid data.

Fig 3

Short SWER of HapOPT, HapCUT2, AltHap, and SDhaP on real fosmid data.

Fig 4

Long SWER of HapOPT, HapCUT2, AltHap, and SDhaP on real fosmid data.

From the above results, one can observe that HapOPT outperforms SDhaP and AltHap in terms of the reconstruction rate as well as long and short SWERs with a reasonable runtime as reported in Table 6. Note that although, HapCUT2 achieves the best accuracy, still its SNP missing rate is greater than that of HapOPT. These results on the whole show that HapOPT is a promising tool for haplotype assembly with the best SNP missing rate and a good accuracy in terms of reconstruction rate and SWER.
Table 6

Runtime of HapOPT, HapCUT2, AltHap, and SDhaP on real fosmid data.

SDhaPAltHapHapCUT2HapOPT (Proposed)
Runtime (Minutes)51018355

Conclusion

We have exploited matrix completion methods including SVT, nuclear norm minimization, and OPTSPACE for haplotype estimation. This was led to developing the new HapSVT, HapNuc, and HapOPT algorithms. Our experimental comparison on simulated data revealed that HapOPT is more accurate than SDhaP and AltHap in terms of reconstruction rate and switch error rate. Also, the results on real noisy fosmid data showed that the accuracy of HapOPT is better than that of SDhaP and AltHap and also is comparable to that of HapCUT2 in terms of the reconstruction rate and the short and long SWERs. Moreover, it was shown that HapOPT outperforms the recently addressed algorithms, HapCUT2 and SDhaP, in terms of the mean, SNP missing rate, and AN50 of the haplotype block length. Furthermore, the proposed algorithm is not restricted to the heterozygous assumption, as commonly considered in peer algorithms. On the whole, we can conclude that using the proposed HapOPT, the haplotype is reconstructed more completely and continuously with acceptable accuracy. Also, the proposed optimization problem is capable of estimating haplotypes for different ploidy levels. Our research direction for future is to work on polyploids.
  18 in total

1.  Haplotype inference by maximum parsimony.

Authors:  Lusheng Wang; Ying Xu
Journal:  Bioinformatics       Date:  2003-09-22       Impact factor: 6.937

2.  Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.

Authors:  Ehsan Motazedi; Richard Finkers; Chris Maliepaard; Dick de Ridder
Journal:  Brief Bioinform       Date:  2018-05-01       Impact factor: 11.622

Review 3.  A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem.

Authors:  Filippo Geraci
Journal:  Bioinformatics       Date:  2010-07-11       Impact factor: 6.937

4.  A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors:  Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal:  Nat Genet       Date:  2011-04-10       Impact factor: 38.330

5.  Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques.

Authors:  Jorge Duitama; Gayle K McEwen; Thomas Huebsch; Stefanie Palczewski; Sabrina Schulz; Kevin Verstrepen; Eun-Kyung Suk; Margret R Hoehe
Journal:  Nucleic Acids Res       Date:  2011-11-18       Impact factor: 16.971

6.  HapTree: a novel Bayesian framework for single individual polyplotyping using NGS data.

Authors:  Emily Berger; Deniz Yorukoglu; Jian Peng; Bonnie Berger
Journal:  PLoS Comput Biol       Date:  2014-03-27       Impact factor: 4.475

7.  Haplotype estimation for biobank-scale data sets.

Authors:  Jared O'Connell; Kevin Sharp; Nick Shrine; Louise Wain; Ian Hall; Martin Tobin; Jean-Francois Zagury; Olivier Delaneau; Jonathan Marchini
Journal:  Nat Genet       Date:  2016-06-06       Impact factor: 38.330

8.  HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies.

Authors:  Peter Edge; Vineet Bafna; Vikas Bansal
Journal:  Genome Res       Date:  2016-12-09       Impact factor: 9.043

9.  A fast and accurate algorithm for single individual haplotyping.

Authors:  Minzhu Xie; Jianxin Wang; Tao Jiang
Journal:  BMC Syst Biol       Date:  2012-12-12

10.  Sparse Tensor Decomposition for Haplotype Assembly of Diploids and Polyploids.

Authors:  Abolfazl Hashemi; Banghua Zhu; Haris Vikalo
Journal:  BMC Genomics       Date:  2018-03-21       Impact factor: 3.969

View more
  2 in total

1.  Hap10: reconstructing accurate and long polyploid haplotypes using linked reads.

Authors:  Sina Majidian; Mohammad Hossein Kahaei; Dick de Ridder
Journal:  BMC Bioinformatics       Date:  2020-06-18       Impact factor: 3.169

2.  Minimum error correction-based haplotype assembly: Considerations for long read data.

Authors:  Sina Majidian; Mohammad Hossein Kahaei; Dick de Ridder
Journal:  PLoS One       Date:  2020-06-12       Impact factor: 3.240

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.