Literature DB >> 19455219

Minimum conflict individual haplotyping from SNP fragments and related genotype.

Xiang-Sun Zhang¹, Rui-Sheng Wang, Ling-Yun Wu, Wei Zhang.

Abstract

The Minimum Error Correction (MEC) is an important model for haplotype reconstruction from SNP fragments. However, this model is effective only when the error rate of SNP fragments is low. In this paper, we propose a new computational model called Minimum Conflict Individual Haplotyping (MCIH) as an extension to MEC. In contrast to the conventional approaches, the new model employs SNP fragment information and also related genotype information, thereby a high accurate inference can be expected. We first prove the MCIH problem to be NP-hard. To evaluate the practicality of the new model we design an exact algorithm (a dynamic programming procedure) to implement MCIH on a special data structure. The numerical experience indicates that it is fairly effective to use MCIH at the cost of related genotype information, especially in the case of SNP fragments with a high error rate. Moreover, we present a feed-forward neural network algorithm to solve MCIH for general data structure and large size instances. Numerical results on real biological data and simulation data show that the algorithm works well and MCIH is a potential alternative in individual haplotyping.

Entities: Chemical Disease Gene Species

Keywords: individual haplotyping; NP-hard; dynamic programming; feed-forward neural network; minimum conflict individual haplotyping; reconstruction rate

Year: 2007 PMID： 19455219 PMCID： PMC2674671

Source DB: PubMed Journal: Evol Bioinform Online ISSN： 1176-9343 Impact factor: 1.625

Introduction

The availability of complete genome sequence for human beings [Vent et al. 2001] makes it possible to investigate genetic differences and to associate genetic variations with complex diseases [Hoehe et al. 2000]. Single nucleotide polymorphism (SNP)—a single DNA base varying from one individual to another, is believed to be the most frequent form to address genetic differences [Chakravarti, 1998; Li et al. 2005]. SNPs are found approximately every 1000 base pairs in the human genome and turn to be promising tools for doing disease association study. Many research works have been carried out for determining SNP sites or designing a detailed SNP map for human genome [Altshuler et al. 2000; Helmuth, 2001]. The nucleotides in a SNP position are called alleles. Almost all SNPs have two different alleles which we denote the wild type as 1 and the mutant type as −1. The SNP sequence information on each copy of a pair of chromosomes in a diploid genome is called a haplotype which is a string over {−1, 1}. A genotype is the conflated information of a pair of haplotypes on homologous chromosomes. For a genotype, if a pair of alleles at a SNP site is made of two identical values, this SNP site is called homozygous, otherwise it is called heterozygous. Haplotypes generally have more information content than that individual SNPs have in disease association studies [Stephens et al. 2001], but it is substantially more difficult to determine haplotypes than to determine genotypes or individual SNPs through experiments. Hence, computational methods that can reduce the cost of determining haplotypes become attractive alternatives. There are generally two classes of computational methods for determining haplotype. One class concerns with infering haplotypes from the genotype samples in a population. There are several models based on different assumptions on the biological system under consideration [Clark, 1990; Gusfield, 2002; Halperin and Eskin, 2004; Li et al. 2005; Wang and Xu, 2003]. The second class, called single individual haplotyping or haplotype assembly, is based on the data and methodology of shotgun sequence assembly [Lancia et al. 2001; Lippert et al. 2002]. The input data consists of aligned short genome fragments with SNPs coming from DNA shotgun sequencing or data generated by a resequencing effort for the purpose of large-scale haplotyping. When we focus on SNP positions, these short genome fragments are actually the aligned SNP fragments. Computationally, the individual haplotyping problem is to determine the “best” pair of haplotypes from data (SNP fragments) which is possibly inconsistent and contradictory. It focuses on partitioning SNP fragments into two sets according to SNP states, with each set determining a haplotype [Lancia et al. 2001; Lippert et al. 2002]. For such a problem, there are several models based on different error assumptions [Lancia et al. 2001; Li, L.M. et al. 2004; Lippert et al. 2002; Rizzi et al. 2002; Wang et al. 2005], of which the Minimum Error Correction model (MEC) is widely adopted. MEC assumes that the inconsistence of the data comes from realistic sequence errors and these errors can be corrected. However, MEC is only effective in the case that SNP fragments have a low error rate. When the error rate of SNP fragments is relatively high, we can not reconstruct original haplotypes with a high accuracy by error correction only [Wang et al. 2005]. To improve the haplotyping quality, we need to either reduce the errors in SNP fragments which will call for improvement of the shortgun experiment, or add extra information to the given SNP fragment set. Since genotype data can be much more easily (and also economically) obtained, a computational model combining both SNP fragments and genotype information will be a realistic strategy. The idea is motivated by the method in [Eskin et al. 2004]. In their paper they phased long genotypes using local haplotype information. In this paper, we propose a new computational model (Minimum Conflict Individual Haplotyping: MCIH) for individual haplotyping by using this strategy, which is also proved to be NP-hard. There are two ways to show that a new established model is practical or more effective than an existing model: one is to theoretically prove that the solution of the new model is superior to that of the existing model, the other is to numerically solve the problem and compare the solutions obtained by the two models. It is obvious that we can only do it in the second way. For this reason we try to design an exact algorithm for the MCIH problem. A special data structure suited for a dynamic programming algorithm is displayed and used to evaluate the model. Computational results by this exact algorithm confirm the significance of MCIH. Moreover, a feed-forward neural network (FNN) is designed for computing approximate solutions to the problem (MCIH) in general case and of large size. Extensive computational results show the effectiveness of the proposed FNN algorithm and MCIH. The paper is organized as follows: In Section I, we give the problem definition together with the complexity analysis of MCIH. The model evaluation is shown in Section II. In Section III, a feed-forward neural network algorithm is designed for implementing MCIH. Experiment results are given in Section IV. Section V concludes the paper.

I. Formulation and Problem

For a genotype = (g1, g2, ···, g), if the jth SNP site is wild type homozygous, g = 2; if it is mutant type homozygous, g = −2; when it is heterozygous, g = 0. A pair of haplotypes 1 = (h11, h12, ···, h1) and 2 = (h21, h22, ···, h2) is called compatible with a genotype if the following conditions hold: for each SNP site j where g = −2, h1 = h2 = −1; for each SNP site j where g = 2, h1 = h2 = 1; for each SNP site j where g = 0, h1 = −h2 = −1 or h1 = − h2 = 1. Suppose that there are m SNP fragments from a pair of chromosomes and the length of each corresponding haplotype is n. Define an m × n SNP matrix = (m), whose entry m has the value −1, 1 or 0 (for a missing or skipped base, we call it a hole). Each row corresponds to a SNP fragment f and each column corresponds to a SNP site. Since the given SNP fragments may have different lengths, but are generally less than n, we also assign value 0 to the uncovered elements in a row. Let x,y ∈ {−1, 1, 0} and define then the distance between two SNP fragments f = (m1, ···, m) and f = (m1, ···, m) is defined as HD(f, f) = ∑=1d(m, m). If HD(f, f) > 0, we say two fragments f and f are in conflict, otherwise we call them compatible. HD(f, f) > 0 indicates that either f, f are not from the same chromosome copy or there are errors in the data. HD(f, f) is similar to the Hamming distance, i.e., the number of mismatches (conflicts) between two fragments. The distance between a fragment and a haplotype is defined in the similar way. The MEC problem [Lippert et al. 2002; Wang et al. 2005] is defined as: Given a set of SNP fragments, correct a minimum number of SNP states (−1 into 1 and vice versa), such that the modified SNP fragments can be divided into two disjoint sets of pairwise compatible fragments, and each set determines a haplotype. In order to add the information of genotype into MEC, we propose the following combinatorial optimization model for individual haplotyping: MCIH (Minimum Conflict Individual Haplotyping): Given a set of SNP fragments (a SNP matrix ) from an individual’s DNA and the related genotype , reconstruct a pair of haplotypes compatible with g and involving a minimum number of conflicts with the given SNP fragments. MCIH is in fact to correct a minimum number of SNP states under the guidance of the genotype information so that the modified SNP fragments can be divided into two disjoint sets of pairwise compatible fragments which determine a pair of haplotypes compatible with the genotype. The computational complexity of the MCIH problem (similar spirit with [Eskin et al. 2004]) is discussed in Appendix A where we prove it to be NP-hard. It indicates that the MCIH problem may have no efficient algorithm for exact solutions.

II. Model Evaluation

The purpose of presenting the new model is to get a high-quality solution of the haplotype assembly problem. To show that MCIH is a potential alternative, we evaluate the model by studying its exact solutions. A special data structure with Markov property suited for a dynamic programming algorithm is considered as follows. Let l and r be the beginning and ending positions of the ith SNP fragments f on the SNP matrix respectively, 1 ≤ i ≤ m. For any two fragments f and f, we assume The rows of the SNP matrix is reordered according to the beginning positions and the ending positions of SNP fragments. For two rows m and m, i < j if and only if l < l. If l = l and r < r, the ith row will also be put before the jth row. From the proof of the computational complexity in Appendix A we know that the MCIH with this special structure remains NP-hard. To solve MCIH with this special structure, we give a dynamic programming (DP) algorithm. The input data of the DP algorithm are a genotype with length n and a SNP matrix of the type (2). The outputs of the DP algorithm are a partition of the SNP fragments and a pair of halotypes (1, 2) generated from the partition. In fact, the DP algorithm is used to reconstruct one haplotype, say 1. Clearly, 2 can be obtained immediately from and 1. Suppose that (j) = (x,···, x) is an assignment to the positions l, ···, r of 1, and (j) is the corresponding assignment to 2 at the same positions. If g ≠ 0, x =x̄ = g/2, otherwise x + x̄ = 0 for k = l, ···, r. Let f(j, (j)) denote the number of conflicts between the SNP fragment f and 1 with assignment (j), f(j, (j)) denote the number of conflicts between the SNP fragment f and 2 with assignment (j). Define f (j, (j)) = min{ f(j, (j)), f(j, (j))} which implies the haplotype that the jth fragment belongs to. The main steps of the dynamic programming algorithm are as follows:

Step 1

Initialization. Let j = 1. N(1, (1)) = f (1, (1)) = min{ f1(1, (1)), f2 (1, (1))} for all possible (1).

Step 2

Follow the recurrence formula. Define N(j, (j)) as the minimum associated number of conflicts between the first j fragments and haplotypes (1, 2) at positions 1, 2, ···, r, with the assignment (j). where b = g/2 for g ≠ 0, otherwise b ∈{1,−1}, k = l, ···, l+1 −1.

Step 3

Trace the solution to obtain a pair of haplotypes (1, 2) and a partition of the SNP fragments. When all N(j, (j)) for 1 ≤ j ≤ m are computed by the recurrence formula, we can find N(m, (m)) for all (m) = (x, ···, x). Thus, the solution of the MCIH problem can be obtained by tracing the solution forwardly which leads to a minimal value at each j from the following formula: For all j, f(j, (j)) = min{f1(j, (j)), f2(j, (j))} determines a partition of the SNP fragments. Assume that L is the maximum length of SNP fragments. It can be shown that the dynamic programming algorithm solving the special MCIH problem has the complexity O(22m). Hence, the algorithm is exponential with the maximum length of SNP fragments. However, when the maximum length of SNP fragments is fixed, this algorithm is linear to m, i.e., the number of fragments. In real applications, the value of L is generally between 3 and 8. As expected, numerical results (see Appendix B for details) of DP on the special data structure display that MCIH at the cost of genotype information improves the reconstruction rate greatly at various error rates of SNP fragments. An extreme case is considered where every SNP site is heterozygous. It means that the model has no homozygous site information available from the given genotype. The numerical results show that in this case MCIH still has a higher reconstruction rate than MEC. This indicates that MCIH is not trivial, i.e., a higher reconstruction rate does not only depend on homozygous site information. Heterozygous site information has much contribution to the accuracy of the reconstructed haplotypes.

III. A Feed-Forward Neural Network Algorithm

As discussed in Introduction, the MCIH problem is actually a classification problem. That is, given a set of SNP fragments, we want to classify it into two fragment subsets such that each subset determines a haplotype to solve the problem with minimum conflicts. It is well known that feed-forward neural network (FNN) is a powerful tool for classification. In general, an FNN maps a set of objects into classes C1, ···, C characterized by the attributes of the objects through repeatedly learning the objects and adjusting its parameters (neuron connection weights). In our problem, a set of SNP fragments are to be divided into two classes such that each class can be assembled as one of the haplotypes which will solve the problem with as few conflicts as possible. The proposed FNN (see Figure 1) consists of three layers. The m input neurons represent m SNP fragments. Each input neuron accepts an n-dimensional vector on {1, −1, 0}. Two hidden neurons represent two subsets of fragments corresponding to one pair of haplotypes. The input dimension of the hidden neuron is also n and the outputs of the two neurons are a pair of tentative haplotypes. There is only one neuron in the last layer, which simply conflates the tentative haplotypes and outputs a tentative genotype. Comparing this tentative genotype with the given genotype will provide us the information to adjust the neuron connection weights by the popular back-propagation algorithm (see Appendix B and related literatures for feed-forward neural networks [Rumelhart et al. 1986; Zhang, 2000]). The main characteristic of the designed neural network is trying to achieve two objectives, i.e., compatibility and minimum number of conflicts simultaneously.

Figure 1

A three layer forward neural network.

Forwarding process of FNN

The forwarding process of FNN can be stated as follows. The inputs and outputs of the first layer are m rows of the SNP matrix, 1, ···, , i.e., m SNP fragments f, i = 1, ···, m. In fact, the neurons in the first layer are trivial identity maps I. There are 2m parameters, i.e., the weights from the first layer to the second layer, The inputs to the second layer are 1 = (y11, ···, y1) and 2 = (y21, ···, y2), where Two neurons in the hidden layer have the same sigmoidal neuron function: The outputs of the second layer neurons are 1 = (h11, ···, h1) and 2 = (h21, ···, h2), where The inputs to the third layer are 1 and 2. The neuron function of the third layer is a linear summation, i.e., the output of the third layer is z = (z1, ···, z), where

Learning rules

The MCIH problem is to find a pair of haplotypes (1, 2) to Objective-1: minimize for all partitions (F1, F2) of f1, ···, f and all possible pairs of haplotypes (1, 2), and Objective-2: satisfy or minimize for all possible pairs of haplotypes (1, 2), where = (g1, g2, ···, g) is the given genotype.

1) Learning to satisfy objective-1

The distance between l and (i.e., f) is defined as where d is defined by the formula (1) and Classify all SNP fragments into two disjoint sets according to their distances to 1 and 2, i.e., For the neuron corresponding to 1 in the second layer, the network learns to minimize the following error function between 1 and the SNP fragments in F1: For the neuron corresponding to 2, the network learns to minimize the error function between 2 and the SNP fragments in F2:

2) Learning to satisfy objective-2

The objective that the third layer adjusts to is to minimize the difference between the tentative genotype and the original genotype :

A Back-Propagation-like procedure

The main steps of the algorithm is as follows:

Initialization

Set parameter values L1, L2, ρ, λ and ɛ. Randomly set the initial values of the weight matrix (0) with entries w ∈ [0, 1], i = 1, ···, m, l = 1, 2. t = 0.

Step 1—Feed-forwarding

Input a SNP matrix. Do computation of (3), (5) and (6) to get a pair of tentative haplotyes (1, 2).

Step 2—Learning

Substep 2.1: Obtain a fragment partition (F1, F2) using (1, 2) according to the formulae (10) and (11). Substep 2.2: Calculate the gradients of the error functions ∇1R11, ∇2R12, ∇1R2, ∇2R2 according to (17), (18), (19), and (20). Substep 2.3: Update the current weight matrix (t) by where ρ is the step length along the negative gradient. L1 and L2 are parameters to be selected. Substep 2.4: If || (t + 1) − (t)|| < ɛ, go to Step 3, otherwise return to Step 1.

Step 3—Output

Record a pair of haplotypes (1, 2) and a genotype as the output. The formulae to compute the derivatives in the algorithm are given in Appendix B. In contrast to DP algorithm that gives an exact solution, the algorithm based on the neural network can not ensure an exact optimal solution but is proved very efficient for large-scale problems by the numerical results in next section.

IV. Simulation and Results

In this section, we will use both real data and simulation data to test MCIH and the FNN algorithm for haplotype reconstruction. The computation is implemented on a 2.26G Hz Pentium 4 processor PC using Microsoft Visual C++ compiler 6. In our experiments, we use reconstruction rate (RR), the similarity degree between the original haplotypes and the reconstructed haplotypes, to measure performance of an algorithm or a model. Assume that = (1, 2) is the original pair of haplotypes, and = (1, 2) is the reconstructed pair of haplotypes. We define RR as: where r = HD (h, ĥ), i = 1, 2, j = 1, 2. We use compatibility rate (CR) to measure the similarity degree between the original genotype and the reconstructed genotype : For a partition P = (F1, F2) and a pair of haplotypes (1, 2), the corresponding conflict number is defined by (7). In this paper, we set step length ρ = 0.05, ɛ = 0.01 and λ = 0.1 (in fact, the algorithm is robust with these parameters) in the algorithm. In (15) and (16), we set L1 = 0.2 and L2 = 0.8.

Experiment on angiotensin-converting enzyme (ACE)

Angiotensin-converting enzyme catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II, which controls fluid-electrolyte balance and systemic blood pressure. Because it has a key function in the renin-angiotensin system, many association studies have been performed with DCP1 (encode angiotensin-converting enzyme). Literature [Rieder et al. 1999] completed the genomic sequencing of the DCP1 gene from 11 individuals and reported 78 SNP sites in 22 chromosomes. Out of the 78 varying sites, 52 are non-unique polymorphic sites. Among these 11 individuals, there are two identical genotypes. We omit one of them. In addition, we omit the genotypes with no more than one heterozygous site whose haplotypes can be infered immediately. Now each of the 8 pairs of haplotypes is used to generate 15 instances in which SNP fragments are randomly generated according to different parameter settings: the number of SNP fragments m = 20; other parameters, the hole rate of fragments hr: 0.25, 0.5, 0.75; the error rate of fragments e: 0.05, 0.1, 0.15, 0.2 and 0.25. The results of MCIH and MEC (solved by an exact algorithm in [Wang et al. 2005]) averaged on eight individuals is illustrated in Figure 2. All the results are obtained by running the algorithms only once. When the error rate of SNP fragments is low, the neural network is robust and efficient. When the error rate of SNP fragment is high, the network may get into a plight of local minima. The genotype compatibility rate returned by the algorithm for most instances is 100% or at least larger than 98%. Only several instances with error rate 0.25 and hole rate 0.75 have genotype compatibility rate between 94% and 96%. In addition, the neural network algorithm solves each of these instances in several seconds.

Figure 2

The results of MCIH and MEC on ACE. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

Figure 2 shows that MCIH has a much higher reconstruction rate at various parameter settings, which indicates that with additive genotype information MCIH is effective. Even if an approximate algorithm is employed, it is more effective for haplotype reconstruction than MEC.

Experiment on data from chromosome 5q31

Now we performed simulations using the data from public Daly set [Daly et al. 2001]. They reported a high-resolution analysis of a haplotype structure across 500kb on chromosome 5q31 using 103 SNPs in a European derived population which consists of 129 trios. The haplotypes of 129 children from the trios can be infered from the genotypes of their parents through pedigree information and the nontransmitted chromosomes as an extra 129 (pseudo) haplotype pairs. Markers for which both alleles could not be inferred are marked as missing. Among the resulting set of 258 haplotype pairs, the ones with more than 20% missing alleles are removed, which leaves us 147 haplotype pairs. Among these pairs, 18 genotypes with no more than one heterozygous site are omitted, then 129 pairs of haplotypes are left as the test set. Each of the 129 pairs of haplotypes is used to generate 15 instances in which SNP fragments are randomly generated according to different parameter settings: m = 30; other parameters, hr: 0.25, 0.5, 0.75; e: 0.05, 0.1, 0.15, 0.2 and 0.25. The results of MCIH and MEC averaged on 129 pairs of haplotypes are illustrated in Figure 3.

Figure 3

The results of MCIH and MEC on Daly set. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

The picture again shows that MCIH is quite effective and the designed back-propagation-like algorithm has a very good performance. The genotype compatibility rate returned by the algorithm for most instances is 100% or greater than 98%. Only a few instances with error rate 0.25 and hole rate 0.75 have genotype compatibility rate between 92% and 96%. In addition, the neural network algorithm solves instances with a low error rate and hole rate in several seconds. For a few instances with a high error rate and hole rate, however, it takes the algorithm several minutes to stop.

Experiments on Hudson’s simulation data

We use a well-known program ms [Hudson, 2002] that uses coalescent theory to generate a simulated population of haplotypes. ms has a parameter r as the recombination rate of population haplotypes. Firstly, let r = 0, then 20 haplotypes with 80 SNP sites are generated using ms and form a haplotype set. Then we randomly combine two halotypes in the haplotype set into a pair of individual haplotypes. 15 pairs of haplotypes are obtained by this way. Each of the 15 pairs of haplotypes is used to generate 15 instances according to the parameters as those in the last subsection. All of these instances form a dataset. The results of the two models averaged on 15 pairs of haplotypes are shown in Figure 4. To further evaluate MCIH, let r = 100 and other parameters be the same. The results of two models on this dataset are summarized in Figure 5. The genotype compatibility rate for almost all instances is 100% or larger than 98%.

Figure 4

The results of MCIH and MEC on Hudson’s data with r = 0. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

Figure 5

The results of MCIH and MEC on Hudson’s data with r = 100. From left to right, hr = 0.25, hr = 0.5, hr = 0.75.

From Figure 4 and Figure 5 we can see that MCIH is effective on haplotypes not only without recombination but also with a high recombination rate.

V. Conclusion

Individual haplotyping is an important problem in computational biology. In this paper, we proposed a new computational model for individual haplotyping—MCIH as an improvement of MEC and as an alternative way for biologists to solve the haplotyping problem more efficiently. The model is proved to be NP-hard. To evaluate the new model, we displayed a special SNP matrix structure for which a dynamic programming algorithm can be used to solve MCIH exactly. Comparing the exact solutions of MCIH and MEC, we argue that the proposed MCIH is worth further studying. Due to the computational intractability of the MCIH problem, a feed-forward neural network is designed and a back-propagation-like procedure is formed as an efficient approximate algorithm. Computational results on multiple data sets show that the designed algorithm performs well and MCIH at the cost of additive genotype information has a higher accuracy of haplotype reconstruction than MEC, especially in the case of SNP fragments with a high error rate. Since genotype information can be obtained much easily and economically, the new model is practical as a supporting tool for reconstruction of haplotypes.

Table 1

The results of two models on simulated data.

error rate	s=0.5		s=0.0

	MEC	MCIH	MEC	MCIH
0.05	0.941	1.000	0.965	0.996
0.1	0.904	0.969	0.950	0.984
0.15	0.863	0.969	0.890	0.946
0.2	0.786	0.908	0.834	0.922
0.25	0.763	0.863	0.766	0.830

16 in total

1. Haplotype variation and linkage disequilibrium in 313 human genes.

Authors: J C Stephens; J A Schneider; D A Tanguay; J Choi; T Acharya; S E Stanley; R Jiang; C J Messer; A Chew; J H Han; J Duan; J L Carr; M S Lee; B Koshy; A M Kumar; G Zhang; W R Newell; A Windemuth; C Xu; T S Kalbfleisch; S L Shaner; K Arnold; V Schulz; C M Drysdale; K Nandabalan; R S Judson; G Ruano; G F Vovis
Journal: Science Date: 2001-07-12 Impact factor: 47.728

2. Genome research: map of the human genome 3.0.

Authors: L Helmuth
Journal: Science Date: 2001-07-27 Impact factor: 47.728

3. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem.

Authors: Ross Lippert; Russell Schwartz; Giuseppe Lancia; Sorin Istrail
Journal: Brief Bioinform Date: 2002-03 Impact factor: 11.622

4. Haplotype inference by maximum parsimony.

Authors: Lusheng Wang; Ying Xu
Journal: Bioinformatics Date: 2003-09-22 Impact factor: 6.937

5. Haplotype reconstruction from SNP fragments by minimum error correction.

Authors: Rui-Sheng Wang; Ling-Yun Wu; Zhen-Ping Li; Xiang-Sun Zhang
Journal: Bioinformatics Date: 2005-02-24 Impact factor: 6.937

6. A parsimonious tree-grow method for haplotype inference.

Authors: Zhenping Li; Wenfeng Zhou; Xiang-Sun Zhang; Luonan Chen
Journal: Bioinformatics Date: 2005-07-07 Impact factor: 6.937

7. It's raining SNPs, hallelujah?

Authors: A Chakravarti
Journal: Nat Genet Date: 1998-07 Impact factor: 38.330

8. Sequence variation in the human angiotensin converting enzyme.

Authors: M J Rieder; S L Taylor; A G Clark; D A Nickerson
Journal: Nat Genet Date: 1999-05 Impact factor: 38.330

9. Haplotype reconstruction from genotype data using Imperfect Phylogeny.

Authors: Eran Halperin; Eleazar Eskin
Journal: Bioinformatics Date: 2004-02-26 Impact factor: 6.937

10. The sequence of the human genome.

Authors: J C Venter; M D Adams; E W Myers; P W Li; R J Mural; G G Sutton; H O Smith; M Yandell; C A Evans; R A Holt; J D Gocayne; P Amanatides; R M Ballew; D H Huson; J R Wortman; Q Zhang; C D Kodira; X H Zheng; L Chen; M Skupski; G Subramanian; P D Thomas; J Zhang; G L Gabor Miklos; C Nelson; S Broder; A G Clark; J Nadeau; V A McKusick; N Zinder; A J Levine; R J Roberts; M Simon; C Slayman; M Hunkapiller; R Bolanos; A Delcher; I Dew; D Fasulo; M Flanigan; L Florea; A Halpern; S Hannenhalli; S Kravitz; S Levy; C Mobarry; K Reinert; K Remington; J Abu-Threideh; E Beasley; K Biddick; V Bonazzi; R Brandon; M Cargill; I Chandramouliswaran; R Charlab; K Chaturvedi; Z Deng; V Di Francesco; P Dunn; K Eilbeck; C Evangelista; A E Gabrielian; W Gan; W Ge; F Gong; Z Gu; P Guan; T J Heiman; M E Higgins; R R Ji; Z Ke; K A Ketchum; Z Lai; Y Lei; Z Li; J Li; Y Liang; X Lin; F Lu; G V Merkulov; N Milshina; H M Moore; A K Naik; V A Narayan; B Neelam; D Nusskern; D B Rusch; S Salzberg; W Shao; B Shue; J Sun; Z Wang; A Wang; X Wang; J Wang; M Wei; R Wides; C Xiao; C Yan; A Yao; J Ye; M Zhan; W Zhang; H Zhang; Q Zhao; L Zheng; F Zhong; W Zhong; S Zhu; S Zhao; D Gilbert; S Baumhueter; G Spier; C Carter; A Cravchik; T Woodage; F Ali; H An; A Awe; D Baldwin; H Baden; M Barnstead; I Barrow; K Beeson; D Busam; A Carver; A Center; M L Cheng; L Curry; S Danaher; L Davenport; R Desilets; S Dietz; K Dodson; L Doup; S Ferriera; N Garg; A Gluecksmann; B Hart; J Haynes; C Haynes; C Heiner; S Hladun; D Hostin; J Houck; T Howland; C Ibegwam; J Johnson; F Kalush; L Kline; S Koduru; A Love; F Mann; D May; S McCawley; T McIntosh; I McMullen; M Moy; L Moy; B Murphy; K Nelson; C Pfannkoch; E Pratts; V Puri; H Qureshi; M Reardon; R Rodriguez; Y H Rogers; D Romblad; B Ruhfel; R Scott; C Sitter; M Smallwood; E Stewart; R Strong; E Suh; R Thomas; N N Tint; S Tse; C Vech; G Wang; J Wetter; S Williams; M Williams; S Windsor; E Winn-Deen; K Wolfe; J Zaveri; K Zaveri; J F Abril; R Guigó; M J Campbell; K V Sjolander; B Karlak; A Kejariwal; H Mi; B Lazareva; T Hatton; A Narechania; K Diemer; A Muruganujan; N Guo; S Sato; V Bafna; S Istrail; R Lippert; R Schwartz; B Walenz; S Yooseph; D Allen; A Basu; J Baxendale; L Blick; M Caminha; J Carnes-Stine; P Caulk; Y H Chiang; M Coyne; C Dahlke; A Deslattes Mays; M Dombroski; M Donnelly; D Ely; S Esparham; C Fosler; H Gire; S Glanowski; K Glasser; A Glodek; M Gorokhov; K Graham; B Gropman; M Harris; J Heil; S Henderson; J Hoover; D Jennings; C Jordan; J Jordan; J Kasha; L Kagan; C Kraft; A Levitsky; M Lewis; X Liu; J Lopez; D Ma; W Majoros; J McDaniel; S Murphy; M Newman; T Nguyen; N Nguyen; M Nodell; S Pan; J Peck; M Peterson; W Rowe; R Sanders; J Scott; M Simpson; T Smith; A Sprague; T Stockwell; R Turner; E Venter; M Wang; M Wen; D Wu; M Wu; A Xia; A Zandieh; X Zhu
Journal: Science Date: 2001-02-16 Impact factor: 47.728

2 in total

1. HMEC: A Heuristic Algorithm for Individual Haplotyping with Minimum Error Correction.

Authors: Md Shamsuzzoha Bayzid; Md Maksudul Alam; Abdullah Mueen; Md Saidur Rahman
Journal: ISRN Bioinform Date: 2013-01-28

2. Minimum error correction-based haplotype assembly: Considerations for long read data.

Authors: Sina Majidian; Mohammad Hossein Kahaei; Dick de Ridder
Journal: PLoS One Date: 2020-06-12 Impact factor: 3.240

2 in total