
Shape-IT: new rapid and accurate algorithm for haplotype inference.

Olivier Delaneau¹, Cédric Coulonges, Jean-François Zagury.

Abstract

BACKGROUND: We have developed a new computational algorithm, Shape-IT, to infer haplotypes under the genetic model of coalescence with recombination developed by Stephens et al in Phase v2.1. It runs much faster than Phase v2.1 while exhibiting the same accuracy. The major algorithmic improvements rely on the use of binary trees to represent the sets of candidate haplotypes for each individual. These binary tree representations: (1) speed up the computation of the posterior probabilities of the haplotypes by avoiding the redundant operations made in Phase v2.1, and (2) overcome the exponential aspect of the haplotype inference problem by the smart exploration of the most plausible pathways (i.e. haplotypes) in the binary trees.
RESULTS: Our results show that Shape-IT is several orders of magnitude faster than Phase v2.1 while being as accurate. For instance, Shape-IT runs 50 times faster than Phase v2.1 to compute the haplotypes of 200 subjects on 6,000 segments of 50 SNPs extracted from a standard Illumina 300 K chip (13 days instead of 630 days). We also compared Shape-IT with other widely used software, Gerbil, PL-EM, Fastphase, 2SNP, and Ishape in various tests: Shape-IT and Phase v2.1 were the most accurate in all cases, followed by Ishape and Fastphase. In terms of speed, Shape-IT was faster than Ishape and Fastphase for datasets smaller than 100 SNPs, but Fastphase became faster (though still less accurate) on larger SNP datasets.
CONCLUSION: Shape-IT deserves to be extensively used for regular haplotype inference but also in the context of the new high-throughput genotyping chips, since it makes it possible to fit the genetic model of Phase v2.1 on large datasets. This new algorithm based on tree representations could be used in other HMM-based haplotype inference software and may apply more broadly to other fields using HMMs.

Year: 2008 PMID: 19087329 PMCID: PMC2647951 DOI: 10.1186/1471-2105-9-540

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

The recent advent of genotyping chips, which can analyze up to 500,000 single nucleotide polymorphisms (SNPs) per individual, offers a powerful tool for large scale association studies in human diseases. The most common approach to find genes possibly implicated in a disease relies on the comparison, in patients and controls, of the distributions of SNP markers. One way to increase the power of such studies is to focus on more complex markers which implicitly capture the linkage disequilibrium (LD) between SNPs: the combinations of SNP alleles on the same chromosome, called haplotypes. Haplotypes are of great interest for the study of complex diseases since they are generally derived from chromosomal fragments which are transmitted from one generation to the next or which may have a biological meaning, such as the promoter or the exons of a gene [1]. Beyond the biomedical applications, the comparison of haplotype distributions between populations also provides new insights into the diversity, the history and the migrations of human populations. For instance, several studies [2-6] have recently highlighted that the genetic diversity of the human genome is organized in regions called haplotype blocks, in which SNPs exhibit a high degree of LD and few common haplotypes. These haplotype blocks are delimited by recombination hotspots, and chromosomes can thus be viewed as mosaics of common haplotypes. The recently developed HapMap project, dedicated to establishing a dense map of SNPs and LD in various human populations [7-9], has emphasized the interest of haplotypes for studying human diversity. Regular genotyping (based on PCR/sequencing or on chips) provides the genotype for each SNP but does not allow the determination of the haplotypes (i.e. the combination of SNP alleles on each chromosome), and current experimental solutions to this problem are still expensive and time-consuming [10,11].
Clark was the first to introduce a computational alternative [12]: the determination of haplotypes via a parsimony criterion which leads to a minimal set of haplotypes sufficient to explain the entire population. Since then, efficient statistical algorithms have been developed under the random mating assumption, where the observed genotypes are formed by sampling independently two unknown haplotypes. This assumption, coupled with a probabilistic model for the haplotypes, makes it possible to define the likelihood of the observed genotypes as a function of the model parameters. Thus, in order to infer haplotypes, the most likely parameter values are estimated via an Expectation Maximization algorithm (EM) or a Gibbs sampler algorithm (GS) on the observed genotypes. The first EM-based model estimated the most likely haplotype frequencies for observed genotypes without making any assumption on the mutation and recombination history of haplotypes [13]. Many programs were built on this simple model, the best known being PLEM [14]. Later on, two new models were developed based on the idea that the haplotypes arose through mutation and recombination events from a few founder haplotypes. In Gerbil [15], haplotype blocks are strictly defined by dynamic programming and, in each block, the haplotypes are derived through mutations from founder haplotypes. On the other hand, in Fastphase [16], in HIT [17], and in HINT [18], both mutation and recombination events on founder haplotypes are simultaneously modeled through a hidden Markov model (HMM). All these methods estimate founder haplotypes from observed genotypes via EM algorithms.
For the GS-based algorithms, the general case relies on sampling haplotypes for a genotype as a function of all the haplotypes currently assigned to the other genotypes. The model of Haplotyper [19] simply favors haplotypes which have already been assigned to many genotypes. In Phase v1.0 [20], the idea was to favor the sampling of haplotypes which likely coalesce with the already assigned ones. Finally, in Phase v2.1 [21,22], the sampled haplotypes are mosaics of the previously sampled ones modeled in an HMM. Recently, an alternative approach to the statistical algorithms was proposed in 2snp [23], which computes LD measures for all pairs of SNPs and then resolves genotypes by finding the maximum spanning trees. Several studies have suggested that the HMM-based methods are the most accurate to infer haplotypes [17,18,24], probably because of the flexible definition of the haplotype blocks, which generally depends on the physical distance between SNPs [16]. Among the HMM-based methods, Phase v2.1 is often considered the most accurate developed so far [24-30], which explains why it is widely used in genetic association studies [31-33] and why it was used to phase the genotype data of the HapMap project [8]. The strength of Phase v2.1 probably comes from two particularities. First, the HMM is built during the GS iterations with a number of haplotypes proportional to the number of genotypes, in contrast to other HMM-based methods which define a fixed number of founder haplotypes. Second, the haplotypes are inferred by summing over all the possible hidden state sequences of the HMM (Forward algorithm), whereas many other HMM-based methods infer haplotypes by sampling only the most probable hidden sequence in the HMM (Viterbi algorithm). However, the required running time increases dramatically with the number of SNPs since the search space grows exponentially. This prevents the easy use of Phase v2.1 on data from the current high-throughput chips.
This fact previously motivated us to develop Ishape [27], which matches Phase v2.1 accuracy while maintaining feasible running times. For that, we used a two-step strategy: (1) we defined a limited space of possible haplotypes with a rapid pre-processing algorithm based on bootstrapped EM haplotype estimations; (2) on this limited set of haplotypes, we then used an accurate Phase-like algorithm. The rapidity of the first step is made possible by an iterative implementation of the EM algorithm which avoids any exponential growth of the space of possible haplotypes and includes the SNPs one after the other during the computations. In practice, Ishape runs up to 15 times faster than Phase v2.1 (for up to 100 SNPs) with a similar accuracy in populations with high LD, such as Caucasian genomes. In this work, we present major improvements which greatly reduce the computational time of Phase v2.1. These improvements have been implemented in the software package Shape-IT and compared to the widely used competitor software.

Algorithm

Notations (Figure 1)

Let us assume we have a sample of $n$ genotypes $G = \{G_1, \dots, G_n\}$ describing the allelic content of $n$ diploid individuals over $s$ SNPs. A genotype is split into a haplotype pair by setting the phases between its $z$ heterozygous SNPs ($z \le s$). The number of distinct haplotype pairs consistent with a genotype is then $2^{z-1}$. Let $S = \{S_1, \dots, S_n\}$ denote the total haplotype space, where $S_i$ is the space of possible haplotype pairs associated with the $i$th genotype. Moreover, let us assume we have the recombination parameters $\rho = \{\rho_1, \dots, \rho_{s-1}\}$ in the $s-1$ intervals between the $s$ SNPs of the sample, as described by Stephens et al [22].
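The count of haplotype pairs per genotype can be checked by direct enumeration. The following is a minimal sketch, not part of Shape-IT: the function name `haplotype_pairs` and the 0/1/2 genotype coding (0 and 2 for the two homozygous states, 1 for heterozygous) are assumptions introduced here for illustration.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    Assumed coding: 0 = homozygous for allele 0, 2 = homozygous for allele 1,
    1 = heterozygous.
    """
    het = [j for j, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # alleles at homozygous sites
    if not het:
        return [(tuple(base), tuple(base))]
    pairs = []
    # The first heterozygous site is fixed to allele 0 on h, so that the
    # mirrored pairs (h, h') and (h', h) are counted only once.
    for phases in product((0, 1), repeat=len(het) - 1):
        h, h2 = list(base), list(base)
        for j, allele in zip(het, (0,) + phases):
            h[j], h2[j] = allele, 1 - allele
        pairs.append((tuple(h), tuple(h2)))
    return pairs

# A genotype with z = 3 heterozygous SNPs among s = 5 yields 2**(3-1) pairs.
print(len(haplotype_pairs([1, 0, 1, 2, 1])))  # 4
```

With three heterozygous SNPs this gives 4 haplotype pairs made of 8 distinct haplotypes, the same sizes as in the example of Figure 1.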

Gibbs sampler algorithm

The GS algorithm considers the haplotype reconstructions of the $n$ individuals as a set of $n$ random variables $H = \{H_1, \dots, H_n\}$ with sampling spaces in $S$, and it estimates the conditional joint distribution of $H$ given $G$ and some recombination parameters $\rho$: $\Pr(H \mid G, \rho)$. In simple words, it computes a conditional probability for each haplotype pair of $S$ in light of the observed genotypes $G$ and the recombination pattern between the SNPs. Given these probabilities, the haplotype frequencies and the most likely haplotype pair for each genotype are straightforward to obtain. In practice, $\Pr(H \mid G, \rho)$ is estimated by sampling from the stationary distribution of a Gibbs sampler (GS) $H^{(0)}, \dots, H^{(t)}, \dots$ where a state $H^{(t)}$ is a particular realization of the random variables of $H$: $n$ haplotype pairs from $S$ which resolve the $n$ genotypes of $G$. The GS starts with a random haplotype realization $H^{(0)}$, and goes from $H^{(t)}$ to $H^{(t+1)}$ by updating the haplotype pair of an individual $i$ in light of the $2n-2$ other haplotypes found in $H^{(t)}$, which we call $H_{-i}^{(t)}$. This "haplotypes update" step is done by sampling a new haplotype pair from the conditional distribution $\Pr(H_i \mid H_{-i}^{(t)}, \rho)$ proposed by Fearnhead and Donnelly [34] and Li and Stephens [35]. This conditional distribution, called FDLS distribution in the following, is computed with a hidden Markov model for haplotypes described in the next section. The important fact here is that the computation of $\Pr(H_i \mid H_{-i}^{(t)}, \rho)$ constitutes the most time-consuming part of the GS, since it has to be done on a space of possible haplotype pairs which grows exponentially with the number of heterozygous SNPs. An iteration of the GS algorithm consists of successively updating the haplotypes of the $n$ individuals of $G$ in a randomly initialized order of treatment.
Between iterations, according to the Metropolis-Hastings acceptance rates described by Stephens et al [22], we accept or reject (1) new values for the recombination parameters $\rho = \{\rho_1, \dots, \rho_{s-1}\}$ in the $s-1$ intervals between SNPs and (2) a new treatment order of the genotypes in the GS. To finally obtain $\Pr(H \mid G, \rho)$, we discard the first iterations of the GS as burn-in iterations (typically 100) and, for the $n$ genotypes of $G$, we average the distribution $\Pr(H_i \mid H_{-i}^{(t)}, \rho)$ over several main iterations (typically 100).
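The overall loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Shape-IT or Phase v2.1 implementation: `gibbs_sampler`, `toy_conditional` and the data layout are hypothetical names introduced here, the toy conditional simply favors haplotypes already present in the pool instead of computing the FDLS distribution, and the Metropolis-Hastings updates of ρ and of the treatment order are omitted.

```python
import random

def toy_conditional(space, others):
    """Stand-in for the FDLS distribution (assumption): weight each candidate
    pair by how often its two haplotypes already occur among the others."""
    weights = [(others.count(h) + 1) * (others.count(h2) + 1) for h, h2 in space]
    total = sum(weights)
    return [w / total for w in weights]

def gibbs_sampler(spaces, conditional, burn_in=100, main=100, seed=0):
    """spaces[i] lists the candidate haplotype pairs S_i of individual i."""
    rng = random.Random(seed)
    state = [rng.choice(space) for space in spaces]   # random start H(0)
    order = list(range(len(spaces)))
    rng.shuffle(order)                                # random treatment order
    averages = [dict.fromkeys(space, 0.0) for space in spaces]
    for iteration in range(burn_in + main):
        for i in order:
            # The 2n-2 haplotypes currently assigned to the other individuals.
            others = [h for k, pair in enumerate(state) if k != i for h in pair]
            probs = conditional(spaces[i], others)
            state[i] = rng.choices(spaces[i], weights=probs)[0]
            if iteration >= burn_in:                  # burn-in is discarded
                for pair, p in zip(spaces[i], probs):
                    averages[i][pair] += p / main
    return averages

# Two doubly heterozygous individuals and one homozygous individual.
spaces = [
    [((0, 0), (1, 1)), ((0, 1), (1, 0))],
    [((0, 0), (1, 1)), ((0, 1), (1, 0))],
    [((0, 0), (0, 0))],
]
posterior = gibbs_sampler(spaces, toy_conditional)
```

The returned dictionaries hold, for each individual, the conditional distribution averaged over the main iterations; the burn-in and main iteration counts default to the typical value of 100 quoted above.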

Computation of a haplotype pair probability in a HMM (Figure 2)

First of all, we assume that genotypes are produced by sampling independently two haplotypes according to their respective probabilities, which yields:

$$\Pr\big(H_i = (h, h') \mid H_{-i}^{(t)}, \rho\big) \propto (2 - \delta_{hh'})\, \pi(h \mid H_{-i}^{(t)}, \rho)\, \pi(h' \mid H_{-i}^{(t)}, \rho) \qquad (1)$$

where $\delta_{hh'} = 0$ if $h \ne h'$ and $\delta_{hh'} = 1$ if $h = h'$. The conditional probability $\pi$ of haplotype $h$ reflects how likely $h$ is to correspond to an "imperfect mosaic" of the other haplotypes $\{h_1, \dots, h_{2n-2}\}$ [22]. The underlying idea is that haplotype $h$ has probably been created through the generations as a recombined sequence of haplotypes from the pool $\{h_1, \dots, h_{2n-2}\}$, possibly altered by some mutations. One models this by computing the probability of observing the sequence $h = \{o_1, \dots, o_s\}$ in a hidden Markov model $\lambda$ designed to represent all possible mosaics of $\{h_1, \dots, h_{2n-2}\}$: $\pi(h \mid h_1, \dots, h_{2n-2}, \rho) = \Pr(o_1, \dots, o_s \mid \lambda)$. Such an HMM $\lambda$ can be viewed as a trellis of $s \times (2n-2)$ hidden states $q_j(k)$ with $1 \le j \le s$ and $1 \le k \le 2n-2$. A hidden state $q_j(k)$ of $\lambda$ corresponds to the allele of haplotype $h_k$ at SNP $j$, and it is linked to all the hidden states $q_{j+1}(l)$ ($1 \le l \le 2n-2$) at SNP $j+1$ in order to model all the possible recombination jumps of haplotypes between SNPs $j$ and $j+1$ (Figure 2). Then, a sequence of $s$ hidden states in $\lambda$ through the $s$ SNPs corresponds to a particular mosaic of $\{h_1, \dots, h_{2n-2}\}$. The probability of observing $h = \{o_1, \dots, o_s\}$ in $\lambda$ is computed from transition probabilities between hidden states, which mimic recombination, and from emission probabilities from hidden alleles to observed alleles, which mimic mutation. Similar hidden Markov models have been proposed, but they generally rely on a limited number of founder haplotypes where the most likely transition and emission probabilities are estimated from observed genotype data via an EM algorithm [17,18]. Here, the emission and transition probabilities are defined with prior distributions depending respectively on a constant mutation parameter and on the variable recombination parameters $\rho$.
The objective of this section is not to fully describe the probabilistic model of transitions and emissions, since this has already been done by Stephens and Scheet [22]. Instead, we focus on how the haplotype probability is computed in such an HMM $\lambda$ from transition and emission probabilities. We thus assume that the following quantities are known, as set up by Stephens and Scheet:

Figure 2

Representation of the execution trellis of the hidden Markov model used to compute the probability of a haplotype. The haplotypes $h_1, \dots, h_{2n-2}$ denote the previously sampled haplotypes which are used to compute the probability of the observed haplotype $h$. The sets $\{o_1, \dots, o_s\}$ and $\{q_1(k), \dots, q_s(k)\}$ correspond respectively to the observed state sequence of haplotype $h$ and to the hidden state sequence of haplotype $h_k$. The transition probability $a_j(k,l)$ corresponds to the probability of jumping from hidden state $q_j(k)$ of haplotype $h_k$ to hidden state $q_{j+1}(l)$ of haplotype $h_l$, and the emission probability $b_j(k)$ corresponds to the probability of observing $o_j$ given the hidden state $q_j(k)$. To compute the probability of observing the sequence $h = \{o_1, \dots, o_s\}$ in this HMM, one must sum up the probabilities of observing $h$ over all $(2n-2)^s$ possible sequences of $s$ hidden states, which is done efficiently by the forward algorithm.

• The transition probability $a_j(l,k)$ from the state $q_j(l)$ of haplotype $h_l$ for SNP $j$ to the state $q_{j+1}(k)$ of haplotype $h_k$ for SNP $j+1$. If $l \ne k$, then $a_j(l,k)$ is the probability for $h_l$ to be recombined with $h_k$ between SNP $j$ and SNP $j+1$ (large dashed arrows in Figure 2). Conversely, if $l = k$, then $a_j(l,l)$ is the probability for $h_l$ not to be recombined between the two SNPs (plain arrows in Figure 2).
• The emission probability $b_j(k)$ of the hidden allele of $q_j(k)$ in the observed allele $o_j$ of $h$ (small dashed arrows in Figure 2). If the hidden allele is different from the observed one, then $b_j(k)$ corresponds to the probability that the hidden allele of $q_j(k)$ has been altered into $o_j$ by a mutation event. Otherwise, $b_j(k)$ corresponds to the probability that no mutation has occurred.

In the HMM $\lambda$, the probability of a sequence of hidden states is given by the product of the corresponding transition probabilities, and the probability of observing $h = \{o_1, \dots, o_s\}$ given a particular sequence of hidden states is obtained as the product of the probabilities for the hidden alleles to be emitted as the observed ones. Finally, to compute the probability $\Pr(h \mid \lambda)$, one must sum up the probabilities of observing $h$ over all $(2n-2)^s$ possible sequences of $s$ hidden states. An alternative to this expensive computational approach is to define a forward probability $\alpha_j(k)$ as the probability for the incomplete observed sequence $\{o_1, \dots, o_j\}$ to be emitted by all the possible hidden sequences that end at state $q_j(k)$. Then, the partial posterior probability $\pi_j$ of $h$ until SNP $j$ can be written as follows:

$$\pi_j(h \mid h_1, \dots, h_{2n-2}, \rho) = \sum_{k=1}^{2n-2} \alpha_j(k) \qquad (2)$$

And the total probability of $h$ over the $s$ SNPs becomes:

$$\pi(h \mid h_1, \dots, h_{2n-2}, \rho) = \sum_{k=1}^{2n-2} \alpha_s(k) \qquad (3)$$

The computations of $\alpha_j(k)$ for $k = 1, \dots, 2n-2$ and $j = 1, \dots, s$ are efficiently done by a recursive algorithm for HMMs called the forward algorithm [36]. It starts from initial values:

$$\alpha_1(k) = \frac{1}{2n-2}\, b_1(k)$$

And recursively computes the $\alpha_{j+1}$ values from the $\alpha_j$ values as follows:

$$\alpha_{j+1}(k) = b_{j+1}(k) \sum_{l=1}^{2n-2} \alpha_j(l)\, a_j(l,k)$$

Computing all the $\alpha$ values for a haplotype now requires a running time in $O(sn^2)$ instead of $O(n^s)$.
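The equivalence between the forward recursion and the exhaustive sum over hidden sequences can be checked numerically on a small case. The sketch below is illustrative only: the function names are hypothetical, the initial state is assumed uniform over the reference haplotypes, and `toy_model` uses a simple constant recombination and mutation parameterization (`r`, `mu`) that is an assumption of this example, not the prior model of Stephens and Scheet.

```python
from itertools import product

def toy_model(ref_haps, observed, r=0.1, mu=0.01):
    """Toy transition and emission tables (assumed parameterization).

    a[j][l][k]: probability of moving from state l at SNP j to state k at SNP j+1.
    b[j][k]:    probability of observing observed[j] from state k at SNP j.
    """
    K, s = len(ref_haps), len(observed)
    a = [[[(1 - r) * (l == k) + r / K for k in range(K)] for l in range(K)]
         for _ in range(s - 1)]
    b = [[1 - mu if ref_haps[k][j] == observed[j] else mu for k in range(K)]
         for j in range(s)]
    return a, b, K

def forward_probability(a, b, K):
    """Forward algorithm: roughly s * K**2 operations."""
    alpha = [b[0][k] / K for k in range(K)]          # uniform initial state
    for j in range(len(b) - 1):
        alpha = [b[j + 1][k] * sum(alpha[l] * a[j][l][k] for l in range(K))
                 for k in range(K)]
    return sum(alpha)                                 # sum of alpha over states

def brute_force_probability(a, b, K):
    """Explicit sum over all K**s hidden state sequences."""
    total = 0.0
    for path in product(range(K), repeat=len(b)):
        p = b[0][path[0]] / K
        for j in range(len(b) - 1):
            p *= a[j][path[j]][path[j + 1]] * b[j + 1][path[j + 1]]
        total += p
    return total

ref_haps = [(0, 0, 1, 1), (1, 0, 0, 1), (0, 1, 1, 0)]
a, b, K = toy_model(ref_haps, observed=(0, 0, 0, 1))
fast = forward_probability(a, b, K)
slow = brute_force_probability(a, b, K)
```

Both functions return the same probability up to rounding; only the forward version remains usable when the number of SNPs grows, since the brute-force loop visits every one of the `K**s` hidden sequences.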

Figure 1

Schematic representation of a sample $G$ of genotypes. In this example, the space of possible haplotypes $S_i$ for individual $i$ contains 4 haplotype pairs with 8 distinct haplotypes. The possible phases between heterozygous markers are shown in bold.

Computation of the FDLS distribution from a haplotype list by Phase v2.1 (Figure 3A)

The Phase v2.1 algorithm considers the haplotype space $S_i$ as a list of the $2^{z_i-1}$ haplotype pairs compatible with the genotype $G_i$, where $z_i$ is the number of heterozygous SNPs, and it computes the FDLS distribution over this list with equations (3) and (1) on the HMM $\lambda$. This approach is computationally intensive for two reasons. First, it repeats the same computations of $\alpha$ values with the forward algorithm many times, since the haplotypes of $S_i$ are derived from the same genotype and thus share identical allelic segments. For instance, as shown in Figure 3A, several haplotypes of $S_i$ differ only in the last SNPs, while the computation of forward values $\alpha$ starts each time from the first SNP. Second, the list of haplotypes grows exponentially with the number of heterozygous SNPs, which prevents any application with a high number of SNPs. To partially overcome this problem, a "divide and conquer" solution called "partition-ligation" (PL) was first proposed by Niu et al [14,19,21]. It has been included in the Phase v2.1 algorithm as follows: it first divides the genotypes into segments of limited size (typically 5–8 SNPs), determines the most probable haplotypes on each segment with complete runs of the GS, and then progressively ligates haplotypes of the adjacent segments in several runs until completion. When two adjacent segments are ligated, the space $S$ of candidate haplotype pairs is initialized from all combinations of the most probable haplotypes previously found in each segment. However, the PL procedure remains computationally expensive because it implies $2s/p - 1$ complete runs of the algorithm (where $p$ is the size of the partitions), each time on a quadratic number of combinations of adjacent plausible haplotypes.

Figure 3

Different representations of the space of possible haplotype pairs $S_i$. The left panel (A) shows the list representation commonly used by haplotype software such as Phase v2.1. The lower right panel (C) shows the representation used by Shape-IT. White and black circles indicate the phases between the heterozygous SNPs. In this example we use the same genotype $G_i$ as described in Figure 1. For iterations as performed by Phase v2.1 (A), the list requires the exploration of 20 nodes (4 haplotype pairs × 5 SNPs). With the complete tree representation (B), 10 nodes need to be explored, and with the incomplete tree representation as performed by Shape-IT (C), only 7 nodes need to be explored. The difference observed between (B) and (C) results from the pruning strategy, which avoids the exploration of the nodes with probability ≤ 0.01.

Computation of the FDLS distribution from a complete binary tree by Shape-IT (Figure 3B)

To compute the FDLS distribution while avoiding any redundant calculation of $\alpha$ values, our algorithm uses a complete binary tree (called haplotype tree in the following) instead of an exhaustive list to represent the haplotype pair space $S_i$. It can be viewed as an extension of the forward algorithm which computes the probabilities of observing, in the HMM $\lambda$, several pairs of sequences organized in a binary tree rather than a unique sequence. Such a haplotype tree is easily derived from a partition of genotype $G_i$ into $m$ unambiguous segments $\{(g_1, g'_1), \dots, (g_m, g'_m)\}$: each one starts from a heterozygous SNP, includes all the following homozygous SNPs, and ends before the next heterozygous SNP. A node of the haplotype tree corresponds to a genotype segment $(g_j, g'_j)$, and its two children nodes correspond to the two possible switch orientations of the following segment, $(g_{j+1}, g'_{j+1})$ and $(g'_{j+1}, g_{j+1})$. Then, a single path from the root to a leaf corresponds to a single possible haplotype pair of $S_i$ (Figure 3B). To compute the FDLS distribution efficiently, Shape-IT explores the haplotype tree with a single recursive algorithm which combines the reconstruction of the haplotypes and the calculation of the associated forward values $\alpha$. In practice, it iterates over the nodes in level order (i.e. segment order), which avoids building the haplotype tree in memory beforehand. When visiting a node with the associated genotype segment $(g, g')$, the algorithm recursively builds a quadruplet $q = \{h, \alpha, h', \alpha'\}$ where $h$ and $h'$ are the two haplotypes, with respective forward values $\alpha$ and $\alpha'$, corresponding to the currently explored path in the haplotype tree. Once all the nodes have been visited, the haplotype pairs of $S_i$ and the FDLS distribution are given respectively by the haplotypes and the forward values of the quadruplets associated with the leaf nodes. This approach is implemented in algorithm 1 (Figure 4).
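The saving obtained by sharing prefixes can be illustrated by counting the forward steps needed under each representation. The sketch below is hypothetical: `count_nodes` is a name introduced here, nodes are counted per SNP, and the genotype pattern (heterozygous, homozygous, heterozygous, homozygous, heterozygous) is an assumption chosen because it reproduces the counts quoted for Figure 3; the actual genotype of the figure is not recoverable from the text.

```python
def count_nodes(genotype):
    """Return (list_nodes, tree_nodes) for a genotype coded 0/1/2, 1 = heterozygous.

    list_nodes: every haplotype pair is processed from the first SNP to the last.
    tree_nodes: pairs sharing a prefix share the corresponding nodes.
    """
    z = sum(1 for g in genotype if g == 1)
    n_pairs = 2 ** max(z - 1, 0)
    list_nodes = n_pairs * len(genotype)

    tree_nodes, width, seen_het = 0, 1, False
    for g in genotype:
        if g == 1:
            if seen_het:          # every heterozygous SNP after the first one
                width *= 2        # doubles the number of open paths
            seen_het = True
        tree_nodes += width
    return list_nodes, tree_nodes

print(count_nodes([1, 0, 1, 0, 1]))  # (20, 10)
```

Under these assumptions the list needs 20 node visits and the complete tree 10, in line with panels (A) and (B) of Figure 3.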

Figure 4

Algorithm 1 to compute the FDLS distribution on the complete haplotype tree.

This algorithm avoids all the unnecessary forward value computations made when using the representation by haplotype lists. However, the haplotype tree to be explored still grows exponentially with the number of heterozygous SNPs: the list L of quadruplets doubles in size at each level explored (Figure 4). As with the classical haplotype list approach, this algorithm can simply be embedded in a PL strategy: first, a haplotype tree is derived for each genotype segment, and then the most probable adjacent subtrees are determined and combined until completion. We have used an alternative strategy, described in the next paragraph.

Computation of the FDLS distribution from an incomplete binary tree by Shape-IT (Figure 3C)

In practice, the number of haplotype pairs sufficiently probable to be sampled in the FDLS distribution grows roughly linearly with the number of SNPs rather than exponentially. As an alternative to the classical and expensive PL strategy, we have thus modified our recursive algorithm to explore only the paths in the haplotype tree which correspond to the most plausible haplotype pairs. In other words, our algorithm aims at identifying an incomplete binary tree of limited size which best captures the informative part of the FDLS distribution (Figure 3C). For that, recursions are made only on nodes exhibiting a probability, as given by expressions (2) and (1), greater than a threshold f defined beforehand. In practice, this results in maintaining, for each level of the tree explored, a list L of quadruplets of limited size which no longer grows exponentially with the number of heterozygous SNPs. The corresponding modifications made in algorithm 1 are implemented in algorithm 2 (Figure 5). The value of the threshold f determines the number of quadruplets kept at each level of the haplotype tree and thus the number of haplotype pairs on which the FDLS distribution is computed; it therefore controls both the diversity of haplotypes captured and the computational effort needed. However, the strength of our algorithm lies in the greatly reduced complexity of the FDLS computation step with respect to the number of SNPs. Moreover, compared to the 2s/p - 1 complete runs of the GS required by the PL strategy, it treats all the SNPs in a single run.
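A level-order exploration with pruning can be sketched as follows. This is a simplified illustration, not algorithm 2 itself: the function names, the 0/1/2 genotype coding, the per-SNP (rather than per-segment) levels, the constant recombination and mutation parameters `r` and `mu`, the normalization of node probabilities within each level, and the fallback that keeps the best node when none passes the threshold are all assumptions of this sketch.

```python
def extend(alpha, j, allele, ref_haps, r, mu):
    """One forward step for a haplotype extended with `allele` at SNP j."""
    K = len(ref_haps)
    emit = [1 - mu if hap[j] == allele else mu for hap in ref_haps]
    if alpha is None:                        # first SNP: uniform initial state
        return [e / K for e in emit]
    total = sum(alpha)
    # Toy transition: stay on the same haplotype, or jump uniformly.
    return [emit[k] * ((1 - r) * alpha[k] + r * total / K) for k in range(K)]

def pair_probabilities(level):
    """Normalized (2 - delta) * pi(h) * pi(h') over the quadruplets of a level."""
    weights = [(2 - (h == h2)) * sum(a) * sum(a2) for h, a, h2, a2 in level]
    total = sum(weights)
    return [w / total for w in weights]

def explore(genotype, ref_haps, f=0.01, r=0.1, mu=0.01):
    """Explore the haplotype tree level by level, pruning nodes with probability <= f."""
    level = [((), None, (), None)]           # quadruplets (h, alpha, h', alpha')
    seen_het = False
    for j, g in enumerate(genotype):
        if g != 1:
            orientations = [(g // 2, g // 2)]
        elif seen_het:
            orientations = [(0, 1), (1, 0)]  # two possible switch orientations
        else:
            orientations, seen_het = [(0, 1)], True
        children = [
            (h + (x,), extend(a, j, x, ref_haps, r, mu),
             h2 + (y,), extend(a2, j, y, ref_haps, r, mu))
            for h, a, h2, a2 in level for x, y in orientations
        ]
        probs = pair_probabilities(children)
        kept = [c for c, p in zip(children, probs) if p > f]
        level = kept or [children[probs.index(max(probs))]]
    return [((h, h2), p) for (h, _, h2, _), p in zip(level, pair_probabilities(level))]

genotype = [1, 0, 1, 0, 1]
ref_haps = [(0, 0, 0, 0, 0), (1, 0, 1, 0, 1), (0, 0, 1, 0, 0), (1, 0, 0, 0, 1)]
complete = explore(genotype, ref_haps, f=0.0)   # no pruning: complete tree
pruned = explore(genotype, ref_haps, f=0.01)    # incomplete tree
```

Setting `f=0.0` keeps every node and so reproduces the complete tree, while a positive threshold bounds the number of quadruplets carried from one level to the next; each parent's forward values are computed once and reused by both of its children.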

Figure 5

Algorithm 2 to compute the FDLS distribution on the incomplete haplotype tree.

Methods

We have implemented our algorithm in the software package Shape-IT publicly available at . We have extensively compared Shape-IT with the widely used haplotype inference software 2snp [23], Gerbil [15], Fastphase [16], PL-EM [14], Ishape [27] and Phase v2.1 [21,22] on the 3 kinds of datasets described hereafter. All programs were run with default parameters on a standard 2 GHz computer with 1 GB of RAM. In the comparisons, we have tried to work as close as possible to real conditions: on the one hand, we have used tightly linked SNPs such as those used in single gene fine mapping and, on the other hand, we have used Tag SNPs with a low level of LD, which correspond to the worst conditions for haplotype inference. Finally, we have also estimated the running times required by the most accurate software to infer the haplotypes of a 300 K Illumina chip.

Single gene datasets

First, we have used genotypes for which the haplotypes have been completely determined experimentally: the GH1 [37] and ApoE [38] genes. The GH1 dataset contains 14 SNPs for 150 Caucasian individuals and the ApoE dataset contains 9 SNPs for 90 individuals of mixed ethnic origins. For each gene, we have additionally generated 100 replicates by randomly masking 5% of the alleles in order to simulate real experimental conditions (missing data). On these datasets, we have measured the IER (Individual Error Rate) and the MER (Missing data Error Rate), which correspond respectively to the percentage of individuals incorrectly inferred and to the percentage of missing data incorrectly inferred. Although of limited size, these two genes are very useful to compare precisely the haplotype frequency estimations made by the algorithms via the IF coefficient [25], since haplotype frequencies are commonly used by geneticists in genetic association studies.

HapMap trio datasets

Second, we have worked on trio genotypes (2 parents and 1 child) derived from the HapMap project [7,8]. We have collected five regions of 10 Mb on chromosomes 1, 2, 3, 4 and 5 in African (YRI) or European (CEU) populations. The 10 resulting chromosomal regions have been preprocessed with the Haploview software [39] to remove SNPs with Mendelian inconsistencies or with insufficient minor allele frequency (MAF). From these chromosomal regions, we have generated several HapMap datasets according to the choices of markers described in Table 1 [24,27]. On all these trio genotypes, the parental haplotypes can be partially obtained (about 80% of the phases between adjacent heterozygous SNPs are determined), and we have measured the running times of the various algorithms and the SER (Switch Error Rate) of the haplotypes they inferred. The SER corresponds to the percentage of known phases between adjacent heterozygous SNPs (obtained from the trio relationships) that are incorrectly inferred [22,27]. It is better suited than the IER to large numbers of SNPs, because the IER does not differentiate between one and several heterozygous SNPs incorrectly inferred.
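The switch error computation for one individual can be sketched as below. The function name `switch_error_rate` is introduced here for illustration, and the sketch assumes that every phase between adjacent heterozygous SNPs is known, whereas in the HapMap trios only about 80% of them are; it is not the evaluation script used for the benchmarks.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Percentage of adjacent heterozygous SNP pairs whose relative phase is wrong."""
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het = [j for j in range(len(t1)) if t1[j] != t2[j]]
    if len(het) < 2:
        return 0.0                         # no phase to get wrong
    # At each heterozygous SNP, which true haplotype does inferred h1 follow?
    origin = [0 if i1[j] == t1[j] else 1 for j in het]
    switches = sum(1 for x, y in zip(origin, origin[1:]) if x != y)
    return 100.0 * switches / (len(het) - 1)

truth = ((0, 0, 0, 0), (1, 1, 1, 1))
one_switch = ((0, 0, 1, 1), (1, 1, 0, 0))      # phase flips once, in the middle
ser = switch_error_rate(truth, one_switch)      # 1 switch out of 3 phases
```

A single misplaced switch in the middle of the haplotype counts as one error here, while the IER would simply flag the whole individual as incorrect, which is why the SER discriminates better on long SNP sets.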

Table 1

Hapmap trio datasets description

Datasets	Chromosome	#datasets	#SNP	#indiv	Details
CEU Size	1 to 5	250	10 to 160	60	50 datasets of 10, 20, 40, 80 and 160 adjacent SNPs with MAF above 5%
CEU Density	1 to 5	300	40	60	50 datasets with spanned distance between SNP above 0, 0.5, 1, 2, 4 and 8 kb (MAF 5%)
CEU MAF	1 to 5	150	40	60	50 datasets with MAF above 1%, 5% and 10%
YRI Size	1 to 5	250	10 to 160	60	50 datasets of 10, 20, 40, 80 and 160 adjacent SNPs with MAF above 5%
YRI Density	1 to 5	300	40	60	50 datasets with spanned distance between SNP above 0, 0.5, 1, 2, 4 and 8 kb (MAF 5%)
YRI MAF	1 to 5	150	40	60	50 datasets with MAF above 1%, 5% and 10%
CEU illumina 50	12	300	50	60	15,000 illumina SNPs grouped by dataset of 50 SNPs
CEU illumina 100	12	150	100	60	15,000 illumina SNPs grouped by dataset of 100 SNPs
CEU illumina 200	12	75	200	60	15,000 illumina SNPs grouped by dataset of 200 SNPs
GRIV	1	90	50 to 200	100 to 300	3,500 illumina SNPs grouped by dataset of 50, 100 and 200 SNPs

Description of the benchmarks derived from the HapMap trios datasets that we used to compare accuracy and runtimes of the various algorithms in Table 4. For each parameter (size, density, and MAF) 10 samples were chosen in each of the chromosomes 1 to 5, i.e. a total of 50 tests per parameter.

Table 4

Hapmap trio datasets results

Datasets	Shape-IT		Phase v2.1		Fastphase		Ishape		2snp		Gerbil		PLEM

	SER	Time	SER	Time	SER	Time	SER	Time	SER	Time	SER	Time	SER	Time
CEU Size	1.1		1.1		1.5		1.1		2.2		2.3		2.0
		53		832		113		93		< 1		50		10
YRI Size	1.7		1.7		2.3		1.8		4.5		3.9		4.2
		64		1,209		125		138		< 1		131		10
CEU Density	2.3		2.3		2.7		2.4		4.2		4.0		4.1
		26		214		64		43		< 1		5		6
YRI Density	3.7		3.7		4.9		3.9		8.5		7.5		8.8
		35		490		71		80		< 1		9		5
CEU MAF	1.1		1.1		1.2		1.2		2.0		2.1		1.7
		19		104		71		22		< 1		2		4
YRI MAF	1.5		1.5		2.0		1.5		4.5		3.8		3.2
		26		173		80		38		< 1		4		4
CEU 50 illumina SNP	6.3		6.3		7.2		6.6		10.7		9.2		12.2
		51		1,214		60		161		< 1		22		5
CEU 100 illumina SNP	6.7		6.8		7.7		9.2		11.3		9.7		N/A
		143		11,678		144		461		< 1		254		N/A
CEU 200 illumina SNP	7.2		N/A		8.0		N/A		11.5		9.9		N/A
		372		N/A		198		N/A		< 1		2,038		N/A

N/A: the software was unable to handle some of these datasets (errors or intractable running times). Results of the various tested software on the HapMap trio datasets described in Table 1. For each software tested, the table gives the mean percentage of heterozygous markers incorrectly inferred (SER) and the mean running time in seconds (Time).
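Table 4 reports SER as a percentage of heterozygous markers. The sketch below shows how such a score could be computed for one individual, assuming that SER is the conventional switch error rate (the share of consecutive heterozygous-site pairs whose relative phase is wrong). The caption does not spell this out, so that reading is an assumption, and the function name and toy haplotypes are hypothetical.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Return the switch error rate (in %) for one individual.

    Both arguments are (haplotype, haplotype) tuples of equal-length
    '0'/'1' strings that describe the same genotype.
    """
    t1, t2 = true_pair
    g1, _ = inferred_pair
    # Heterozygous sites are those where the two true haplotypes differ.
    het_sites = [i for i in range(len(t1)) if t1[i] != t2[i]]
    if len(het_sites) < 2:
        return 0.0  # phase is undefined with fewer than two het sites
    # Orientation: does the first inferred haplotype follow t1 at this site?
    follows_t1 = [g1[i] == t1[i] for i in het_sites]
    # A switch occurs whenever the orientation changes between neighbours.
    switches = sum(a != b for a, b in zip(follows_t1, follows_t1[1:]))
    return 100.0 * switches / (len(het_sites) - 1)


truth = ("0101", "1010")
inferred = ("0110", "1001")  # phase flips once, between the 2nd and 3rd sites
ser = switch_error_rate(truth, inferred)
print(round(ser, 1))
```

Averaging this value over all individuals and datasets of a benchmark would give one table entry under the stated assumption.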

To investigate the impact of low LD on haplotype inference, we also used a set of 15,000 adjacent Tag SNPs picked from the long arm of chromosome 12 and present on the 300 K Illumina chip.

GRIV cohort datasets

Third, we generated large SNP datasets from subjects of the GRIV (Genomics of Resistance to Immunodeficiency Virus) cohort genotyped with the 300 K Illumina chip. The GRIV cohort comprises about 400 Caucasian subjects collected for genomic studies in AIDS [1,40-43]. These datasets were used to estimate the running times required by the most accurate software to infer the haplotypes of a full 300 K Illumina chip. To that end, we generated 10 datasets from the GRIV cohort data for each combination of marker number (50, 100 and 200) and number of individuals (100, 200 and 300). The average running time over the 10 datasets of each combination was then used to extrapolate the running time required to infer the haplotypes over the 300,000 SNPs.

Figure 6. Accuracy of the different values tested for the threshold f. This comparison was done on 300 datasets of 50 Tag SNPs called CEU Illumina 50.
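The extrapolation described for the GRIV datasets is simple arithmetic: a 300,000-SNP chip cut into 50-SNP segments gives 6,000 segments, and the mean time per segment is scaled by that count. The sketch below illustrates this under stated assumptions: the ten per-segment timings are hypothetical (the text reports only the extrapolated totals), and the function name is ours.

```python
from statistics import mean

SECONDS_PER_DAY = 86_400


def extrapolate_days(segment_times_s, snps_per_segment, total_snps=300_000):
    """Scale the mean per-segment running time (seconds) to a whole chip (days)."""
    n_segments = total_snps / snps_per_segment
    return mean(segment_times_s) * n_segments / SECONDS_PER_DAY


# Hypothetical timings (seconds) for 10 sampled 50-SNP segments.
timings = [130, 150, 144, 152, 138, 146, 141, 149, 143, 147]
days = extrapolate_days(timings, snps_per_segment=50)
print(f"{days:.1f} days for the full chip")
```

The estimate is linear in the number of segments, so it ignores any fixed start-up cost per run.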

Results

Empirical determination of the threshold f (Figure 6)

As discussed in the Algorithm section, Shape-IT relies on a threshold f to discard some branches of the haplotype binary trees. We therefore tested several values for f: the accuracy is clearly stable for values below 0.01. Since the running time was optimal for f = 0.01, we used this value as the default in all the following comparisons.
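To give an intuition for how a threshold such as f limits the number of haplotypes explored, here is a toy sketch of threshold-based pruning in a binary tree. It is not the Shape-IT procedure: it assumes independent sites with made-up allele probabilities and an absolute cut-off on the branch probability, whereas the actual criterion is the one given in the Algorithm section. All names are hypothetical.

```python
def explore(site_probs, f=0.01):
    """Grow haplotypes site by site, dropping branches with probability < f.

    site_probs[i] is a toy probability of allele '1' at site i
    (sites are treated as independent, unlike in a real HMM).
    Returns a dict mapping each surviving haplotype to its probability.
    """
    branches = {"": 1.0}  # root of the binary tree
    for p1 in site_probs:
        extended = {}
        for hap, prob in branches.items():
            # Each node has two children: allele 0 and allele 1.
            for allele, p_allele in (("0", 1.0 - p1), ("1", p1)):
                child_prob = prob * p_allele
                if child_prob >= f:  # prune implausible branches early
                    extended[hap + allele] = child_prob
        branches = extended
    return branches


probs = [0.5, 0.995, 0.9]
kept = explore(probs, f=0.01)
full = explore(probs, f=0.0)
print(len(kept), "of", len(full), "haplotypes kept")
```

Because a pruned branch is never extended, the work saved grows with the number of sites below the cut.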

Comparisons on the single gene datasets (Table 2 and 3)

On these datasets, Shape-IT, Ishape and Phase v2.1 clearly give better haplotype reconstructions and frequency estimations than the other software. Ishape seems to be slightly more accurate than Shape-IT and Phase v2.1. For the completion of missing data, all the methods (except 2snp) perform similarly.

Comparisons on the HapMap trio datasets (Table 1 and 4)

In terms of accuracy, Shape-IT and Phase v2.1 outperform all the other methods. Ishape comes second, but its accuracy drops sharply with larger numbers of Tag SNPs. Fastphase comes third, but it seems to work relatively better when the datasets get bigger. 2snp, Gerbil, and PLEM do not match the accuracy of the other software. All the software show higher error rates when the number of Tag SNPs increases, which is probably the consequence of the increasing complexity of the LD pattern when dealing with limited numbers of individuals. In terms of speed, the fastest software is clearly 2snp. For relatively small numbers of SNPs, PLEM and Gerbil are also very fast, but they become very slow when the number of SNPs increases or when the LD pattern gets more complex to capture. Among the 4 most accurate software (Phase v2.1, Fastphase, Ishape, and Shape-IT), Phase v2.1 is the slowest, Shape-IT is the fastest for small and medium-sized SNP samples (< 100 SNPs), and Fastphase becomes faster for larger numbers of SNPs (see additional file 1).

Running time on the GRIV cohort datasets (Table 5)

On these datasets, Shape-IT runs between 15 and 150 times faster than Phase v2.1, depending on the segmentation strategy used (50, 100 or 200 SNPs) and the number of genotypes in the population (100, 200 or 300). Fastphase remains the fastest software but is closely followed by Shape-IT. The increase in SNP and genotype numbers strongly cripples Phase v2.1 and Ishape, while it is better handled by Shape-IT and Fastphase.

Discussion and conclusion

We have developed a new algorithm derived from the Phase v2.1 Gibbs sampler scheme. We improved the most time-consuming steps by using binary tree representations and by avoiding the PL procedure thanks to an incomplete exploration of the binary trees. The resulting software, Shape-IT, is as accurate as Phase v2.1 but may run up to 150 times faster, as shown in our tests. These results have an impact on the computation of haplotypes in genome scans, as shown in Table 5. For example, for the 300,000 SNPs of an Illumina genotyping chip, inferring haplotypes on 6,000 segments of 50 SNPs with a regular 2 GHz computer would take Shape-IT about 10 days for 100 individuals, 13 days for 200 individuals and 28 days for 300 individuals, whereas Phase v2.1 would take 151 days for 100 individuals (15 times more), 443 days for 200 individuals (34 times more) and 1,372 days for 300 individuals (49 times more). The time saved by Shape-IT is thus considerable and of practical value for exploiting datasets derived from large-scale genotyping chips.

Table 5

Comparison of the estimated running times of various software on 300 K Illumina genotyping chip datasets.

#SNPs	#genotypes	Fastphase	Ishape	Shape-IT	Phase v2.1
50	100	10	29	10	151
100	100	6	37	12	519
200	100	6	41	19	3,137
50	200	21	34	13	443
100	200	21	119	29	2,739
200	200	21	124	37	7,601
50	300	37	113	28	1,372
100	300	41	268	52	6,514
200	300	42	261	81	12,757

Estimated running times, in days, of the 4 most accurate software (Phase v2.1, Ishape, Fastphase and Shape-IT) to infer the haplotypes for 100, 200, or 300 genotypes derived from Illumina 300 K chips partitioned into segments of 50, 100, or 200 SNPs. For each combination of #SNPs and #genotypes, the running time estimations were extrapolated from the measures performed on 10 datasets extracted from the GRIV cohort 300 K Illumina chip genomic data.
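The speed-up of Shape-IT over Phase v2.1 can be recomputed directly from the Shape-IT and Phase v2.1 columns of Table 5. The short sketch below does this for every row; the dictionary layout and variable names are ours, and because the table entries are rounded to whole days, the ratios are approximate.

```python
# (#SNPs per segment, #genotypes) -> (Shape-IT days, Phase v2.1 days), from Table 5
TABLE5 = {
    (50, 100): (10, 151),
    (100, 100): (12, 519),
    (200, 100): (19, 3137),
    (50, 200): (13, 443),
    (100, 200): (29, 2739),
    (200, 200): (37, 7601),
    (50, 300): (28, 1372),
    (100, 300): (52, 6514),
    (200, 300): (81, 12757),
}

# Ratio of Phase v2.1 time to Shape-IT time for each combination.
speedup = {key: phase / shapeit for key, (shapeit, phase) in TABLE5.items()}

for (n_snps, n_genotypes), ratio in sorted(speedup.items()):
    print(f"{n_snps:>3} SNPs, {n_genotypes} genotypes: {ratio:.1f}x")
```

For the 50-SNP segmentation the ratios round to 15, 34 and 49, the figures quoted in the Discussion.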

An important aspect of this work is that other HMM-based haplotype inference software may benefit from implementing this new binary tree representation of the observed genotypes. Moreover, we have not found a description of this algorithm in the literature, although it might be useful in other fields using HMM.

Availability and requirements

Project name: Shape-IT v1.0
Project home page:
Operating systems: MacOS, Windows, Linux32bits and Linux64bits.
Programming language: C++

Do not forget to read the manual file, manual_ShapeITv1.0.pdf, to get the detailed information. The software remains confidential until publication of the work. It will be freely available to academics, and a licence will be needed for non-academics (patented for business and commercial applications).

Authors' contributions

OD and CC worked on developing the methods and programs used in this study under the direct supervision of JFZ who conceived the study. All the authors have read and approved the final manuscript.

Additional file 1

Detailed trio datasets results. Detailed results of the various software tested on the HapMap trios datasets described in Table 1. For each software tested, the mean percentage of heterozygous markers incorrectly inferred (SER) and the average running time in seconds are shown.

Table 2

Results obtained by various haplotyping software on the experimentally determined ApoE dataset.

ApoE	0%MD		5%MD
	IER	IF	IER	MER	IF

2snp	20.0	83.8	22.7	7.3	83.9
Fastphase	11.3	89.4	17.4	6.1	87.5
Gerbil	20.0	81.3	20.3	6.6	84.6
Ishape	5.6	94.1	10.2	5.9	92.5
Shape-IT	5.6	94.1	10.5	6.2	92.4
Phase v2.1	5.8	94.0	10.2	5.8	92.4
PLEM	12.5	89.8	16.0	6.5	88.7

For the various software tested, we measured the percentage of individuals incorrectly reconstructed (IER), the percentage of missing data incorrectly inferred (MER), and the distance between real and inferred haplotype frequencies (IF) on the ApoE with complete genotypes and 5% random missing genotypes.

Table 3

Results obtained by various haplotyping software on the experimentally determined GH1 dataset.

GH1	0%MD		5%MD
	IER	IF	IER	MER	IF

2snp	15.7	88.2	22.0	7.5	88.3
Fastphase	10.5	92.5	17.3	4.5	90.7
Gerbil	11.8	92.8	16.7	4.2	91.6
Ishape	10.1	93.8	15.0	4.5	92.6
Shape-IT	10.3	93.6	14.9	4.5	92.5
Phase v2.1	10.3	93.7	15.2	4.5	92.5
PLEM	12.4	90.3	17.2	4.8	89.4

For the various software tested, we measured the percentage of individuals incorrectly reconstructed (IER), the percentage of missing data incorrectly inferred (MER), and the distance between real and inferred haplotype frequencies (IF) on the GH1 with complete genotypes and 5% random missing genotypes.

41 in total

1. Estimating recombination rates from population genetic data.

Authors: P Fearnhead; P Donnelly
Journal: Genetics Date: 2001-11 Impact factor: 4.562

2. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms.

Authors: Tianhua Niu; Zhaohui S Qin; Xiping Xu; Jun S Liu
Journal: Am J Hum Genet Date: 2001-11-26 Impact factor: 11.025

3. Haplotype tagging for the identification of common disease genes.

Authors: G C Johnson; L Esposito; B J Barratt; A N Smith; J Heward; G Di Genova; H Ueda; H J Cordell; I A Eaves; F Dudbridge; R C Twells; F Payne; W Hughes; S Nutland; H Stevens; P Carr; E Tuomilehto-Wolf; J Tuomilehto; S C Gough; D G Clayton; J A Todd
Journal: Nat Genet Date: 2001-10 Impact factor: 38.330

4. The structure of haplotype blocks in the human genome.

Authors: Stacey B Gabriel; Stephen F Schaffner; Huy Nguyen; Jamie M Moore; Jessica Roy; Brendan Blumenstiel; John Higgins; Matthew DeFelice; Amy Lochner; Maura Faggart; Shau Neen Liu-Cordero; Charles Rotimi; Adebowale Adeyemo; Richard Cooper; Ryk Ward; Eric S Lander; Mark J Daly; David Altshuler
Journal: Science Date: 2002-05-23 Impact factor: 47.728

5. Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms.

Authors: Zhaohui S Qin; Tianhua Niu; Jun S Liu
Journal: Am J Hum Genet Date: 2002-11 Impact factor: 11.025

6. A first-generation linkage disequilibrium map of human chromosome 22.

Authors: Elisabeth Dawson; Gonçalo R Abecasis; Suzannah Bumpstead; Yuan Chen; Sarah Hunt; David M Beare; Jagjit Pabial; Thomas Dibling; Emma Tinsley; Susan Kirby; David Carter; Marianna Papaspyridonos; Simon Livingstone; Rocky Ganske; Elin Lõhmussaar; Jana Zernant; Neeme Tõnisson; Maido Remm; Reedik Mägi; Tarmo Puurand; Jaak Vilo; Ants Kurg; Kate Rice; Panos Deloukas; Richard Mott; Andres Metspalu; David R Bentley; Lon R Cardon; Ian Dunham
Journal: Nature Date: 2002-07-10 Impact factor: 49.962

7. Direct molecular haplotyping of long-range genomic DNA with M1-PCR.

Authors: Chunming Ding; Charles R Cantor
Journal: Proc Natl Acad Sci U S A Date: 2003-06-11 Impact factor: 11.205

8. A block-free hidden Markov model for genotypes and its application to disease association.

Authors: Gad Kimmel; Ron Shamir
Journal: J Comput Biol Date: 2005-12 Impact factor: 1.479

9. Genomic analysis of Th1-Th2 cytokine genes in an AIDS cohort: identification of IL4 and IL10 haplotypes associated with the disease progression.

Authors: A Vasilescu; S C Heath; R Ivanova; H Hendel; H Do; A Mazoyer; E Khadivpour; F X Goutalier; K Khalili; J Rappaport; G M Lathrop; F Matsuda; J-F Zagury
Journal: Genes Immun Date: 2003-09 Impact factor: 2.676

10. Human growth hormone 1 (GH1) gene expression: complex haplotype-dependent influence of polymorphic variation in the proximal promoter and locus control region.

Authors: Martin Horan; David S Millar; Jürgen Hedderich; Geraint Lewis; Vicky Newsway; Neil Mo; Linda Fryklund; Annie M Procter; Michael Krawczak; David N Cooper
Journal: Hum Mutat Date: 2003-04 Impact factor: 4.878
