Saber Qanbari1,2. 1. Leibniz Institute for Farm Animal Biology (FBN), Institute of Genetics and Biometry, Dummerstorf, Germany. 2. Animal Breeding and Genetics Group, Department of Animal Sciences, Center for Integrated Breeding Research, University of Göttingen, Göttingen, Germany.
Abstract
Given the importance of linkage disequilibrium (LD) in gene mapping and evolutionary inferences, I characterize in this review the pattern of LD and discuss the influence of human intervention during domestication, breed establishment, and subsequent genetic improvement on shaping the genome of livestock species. To this end, I summarize data on the profile of LD based on array genotypes vs. sequencing data in cattle and chicken, two major livestock species, and compare them to the human case. This comparison provides insights into the real dimension of pairwise allelic correlation and haplo-block structuring. The dependency of LD on allelic frequency is illustrated and a recently introduced metric for moderating it is outlined. In the context of the contact that farm animals have had with humans, the impact of genetic forces including admixture, mutation, recombination rate, selection, and effective population size on LD is discussed. The review further highlights the interplay of LD with runs of homozygosity and concludes with the operational implications of LD for the widely used association and selection mapping studies.
Linkage disequilibrium (LD) is the non-random assortment of alleles at different loci. The terms linkage and LD are often confused. As highlighted by Slatkin (2008), LD is one of those unfortunate terms that do not reveal their meaning. Indeed, LD simply means a correlation between alleles, and detecting LD does not ensure either linkage or a lack of equilibrium. This stems from the fact that mechanisms other than physical proximity on a chromosome (linkage), such as mutation, genetic drift, and epistatic combinations, might also cause (gametic phase) disequilibrium between unlinked markers. For example, admixing genetically distinct populations creates association between two loci with different allele frequencies even if they are unlinked. LD can also arise due to population stratification and cryptic relationships within a population that result in correlated allelic frequencies (reviewed in Hellwege et al., 2017).
The pattern of LD is a powerful indicator of the genetic forces shaping a population. For example, knowledge of LD helps to infer a population’s effective size (Ne) and past demography. Populations with smaller Ne experience more genetic drift than larger populations. This genetic drift causes LD between alleles at independently segregating loci, at a rate inversely proportional to Ne (Waples et al., 2016). In this way, an estimate of contemporary Ne can be derived from LD information (Sved, 1971; Hill, 1981). In contrast, past Ne is a function of LD between physically linked loci, provided that the inter-locus recombination fractions are available (Sved, 1971). Accordingly, closely linked loci reflect population sizes in the distant past, while loosely linked loci reflect Ne in the immediate past (Hill, 1981; Hayes et al., 2003).
Unlike in non-model species, these methods can be applied in populations of farm animals, for which high-resolution genetic maps are becoming available (Tortereau et al., 2012; Ma et al., 2015a; Petit et al., 2017).
LD between linked markers also determines the power and precision of association mapping studies, directly influencing our ability to localize genes and/or loci responsible for economic traits in agriculture or inherited diseases in human (reviewed in Goddard and Hayes, 2009). Given the economic impact of domestic animals, understanding the dimension of LD enables planning and performing successful genomic breeding programs when working towards global food security. This review aims to outline the definition of LD, summarize data on patterns of LD in the genome of farm animals, and discuss the various properties of LD and its implications for gene mapping and evolutionary studies of livestock species.
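The link between LD and Ne described in this section can be made concrete with a small sketch. It assumes the classical approximation attributed to Sved (1971), E[r²] ≈ 1/(1 + 4·Ne·c), where c is the recombination fraction in Morgans, together with the rule of thumb used by Hayes et al. (2003) that LD at distance c reflects Ne roughly 1/(2c) generations ago. The function names are hypothetical, and no correction for sample size or mutation is applied, so this is an illustration rather than a complete estimator.

```python
def ne_from_r2(mean_r2, c_morgans):
    """Invert E[r^2] ~ 1 / (1 + 4 * Ne * c) to obtain Ne.

    mean_r2   : average r^2 among marker pairs separated by c_morgans
    c_morgans : recombination fraction between the markers (in Morgans)
    Assumption: no adjustment for finite sample size or mutation.
    """
    if not 0.0 < mean_r2 < 1.0:
        raise ValueError("mean_r2 must lie strictly between 0 and 1")
    if c_morgans <= 0.0:
        raise ValueError("c_morgans must be positive")
    return (1.0 / mean_r2 - 1.0) / (4.0 * c_morgans)


def generations_ago(c_morgans):
    """Approximate age (in generations) of the Ne reflected at distance c."""
    if c_morgans <= 0.0:
        raise ValueError("c_morgans must be positive")
    return 1.0 / (2.0 * c_morgans)


if __name__ == "__main__":
    # Hypothetical input: mean r^2 of 0.2 between markers 0.1 cM apart.
    c = 0.001
    print(ne_from_r2(0.2, c), generations_ago(c))
```

Under these assumptions, tightly linked markers (small c) inform about Ne many generations back, whereas loosely linked markers inform about recent Ne, which mirrors the argument in the text.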
A Historical Glance
The concept of LD was first introduced in Jennings (1917), and its quantification (D) was developed by Lewontin and Kojima (1960). LD became a hot topic in the last two decades once the usefulness of LD for gene mapping became evident and genotyping of large numbers of linked single-nucleotide polymorphisms (SNPs) became feasible through high-throughput technologies.
The simple formulation of the commonly used LD measure D is the difference between the observed and the expected gametic haplotype frequencies comprising two loci A and B under linkage equilibrium ($D = P_{AB} - P_A P_B$). Besides D, several measures of LD (for example, D’, λ, δ, r², ρ², among others) have been suggested (Lewontin, 1964; Bengtsson and Thomson, 1981; Hill and Weir, 1994; Terwilliger, 1995; Zhao et al., 2005; Gianola et al., 2013). The merits, comparison, and methodologies of these metrics with the utilization of biallelic or multi-allelic loci have been extensively described in the literature (e.g., Jorde, 2000; Pritchard and Przeworski, 2001; Mueller, 2004; Sved, 2009). Choosing the appropriate LD measure depends on the objective of the study, and one may perform better than another in particular situations. The two widely used measures of LD are r² and D’. r² is indicative of the correlation that a marker might have with the gene of interest and is often preferred for association studies.
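The measures named above can be computed directly from haplotype and allele frequencies. The following minimal sketch uses the standard textbook definitions of D, D’ (D scaled by its maximum attainable value given the allele frequencies), and r² (D² divided by the product of the four allele frequencies). The function name and argument names are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return D, |D'| and r^2 for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    d = p_ab - p_a * p_b
    # Largest |D| attainable with these allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return {"D": d, "D_prime": d_prime, "r2": r2}


if __name__ == "__main__":
    # Equal allele frequencies, complete association.
    print(ld_measures(0.5, 0.5, 0.5))
    # Unequal allele frequencies: the rarer allele B always sits with A.
    print(ld_measures(0.1, 0.5, 0.1))
```

In the second call D’ equals 1 while r² is only about 0.11, which illustrates why the two measures answer different questions and why r² cannot reach high values when allele frequencies differ.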
LD-Based Mapping of Genes
Identifying the genetics underlying phenotypic variation is the ultimate goal of most mapping studies. In general, there are two different but, to some extent, complementary methodologies to localize genes controlling traits. Both methodologies, outlined below, exploit the properties of LD to accomplish the mapping task.
Association mapping: This is the most common approach to mapping quantitative trait loci (QTLs) and takes advantage of historic LD to connect phenotypes to genotypes. This approach detects inherited markers in the vicinity of the causative variants or loci controlling complex quantitative traits. It is often performed by scanning the entire genome for significant associations between a panel of SNPs and a particular phenotype (e.g., Hayes et al., 2010). Subsequent analyses are then required to verify the detected association independently, in order to confirm that the marker either directly controls the trait of interest or is linked to (in LD with) a QTL that contributes to the trait of interest.
Association analysis is based on the principle that an unknown causative variant is located on a haplotype, and a marker allele in LD with the causative variant should signify (by proxy) an association with the trait of interest. Because SNPs are in LD with one another, if a common SNP affects a trait, one can probably genotype a SNP in LD with it (a “marker” SNP), and that marker will be correlated with the trait of interest.
Quantifying the extent of LD is the essential first step in determining the number of markers required to cover the entire genome in an association study with sufficient power and precision. Theoretically, extensive LD reduces the number of markers required to localize an association between marker and trait, but at lower resolution.
In contrast, when LD decays rapidly within a short distance, many markers are needed to map a gene of interest.
Although LD-based association analysis is a powerful tool routinely applied for gene mapping, it has not been very successful for targeting genes of complex traits, especially where the causative variants are low in frequency. This is because commercial genotyping arrays largely under-represent infrequent alleles (reviewed in Lee et al., 2014). For a detailed discussion, refer to the article by Goddard and Hayes (2009) reviewing the pros and cons of association analysis in farm animals. Here I stress the importance of LD in exploring the genetic variability underlying the phenotype-genotype relationship. It is noteworthy that with the advancement of bioinformatics tools and high-throughput sequencing technologies that provide the full profile of an individual’s genetic variation, it is now possible to test for the effect of every single DNA polymorphism on phenotypic variation without requiring LD information. However, given the presence of confounding factors such as cryptic correlations in interpreting GWAS results, LD remains useful as evidence for validating a detected association (Bulik-Sullivan et al., 2015).
Mapping selection: Selection generates LD between distant loci through a “hitch-hiking” effect (Smith and Haigh, 1974), which happens when a haplotype carrying the favored allele rises in frequency so fast that it drags neighboring loci to higher frequencies. Scanning the genome for long unbroken haplotypes accompanied by extensive LD can reveal past selection responding to an adaptive quality (e.g., Sabeti et al., 2002). Domestic species have been intensively selected during the recent past through domestication, breed establishment, and genetic improvement and, as such, have achieved tremendous phenotypic changes.
Consequently, genomic regions controlling traits of economic importance are expected to exhibit footprints of selective breeding (reviewed in Qanbari and Simianer, 2014a).
Dependency on Allelic Frequency
The most widely used measure of LD in animal breeding and genome-wide association mapping is r². This metric is allele frequency-dependent (see Figure 1); as stated in Lewontin (1988), “there are generally no gene frequency independent measures of association between loci”. The dependence of r² on allele frequencies affects the outcomes and interpretations of population genetics studies in several ways. For example, there are population characteristics that are related to the estimated value of LD, such as effective population size and the pattern of recombination landscapes. This implies that estimates of effective size or recombination maps developed based on expected values of r² are frequency-dependent as well (e.g., Ober et al., 2013). Furthermore, in gene mapping studies, the power to detect a causative variant using SNP markers is a function of r² between the causative variant and the marker. Thus, if a SNP marker and a causative variant have different minor allele frequencies, then the power to detect an effect at the marker can be small, since high values of r² cannot be realized. This property of r² becomes especially significant in human studies, where most disease-causing variants are rare and genome-wide association studies should be adapted to target these variants.
Figure 1
Surface plot of the dependency of LD on allelic frequency of SNP pairs. The means of r² are plotted for 45 bins of 0.01 allele frequency each (from Qanbari et al., 2010a).
Even if a frequency-independent measure of LD may not exist, it would be desirable to develop one that is less affected by frequencies than r². In a recent study (Gianola et al., 2013), we developed a new estimator of an LD parameter (ρ) based on a metric proposed by Plackett (1965), which is a tetrachoric correlation (Pearson, 1901). Plackett (1965) introduced bivariate distributions indexed by a single parameter ψ that, in the case of the 2 x 2 table, takes the form [equation not preserved]. The relationship between the tetrachoric correlation and ψ is given by [equation not preserved], where ρ is easy to compute and much less dependent on allele frequency than r² (see Figure 2).
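To give a feel for how a ψ-based correlation behaves, the sketch below makes two explicit assumptions that are not stated in the text above: that ψ is the cross-product (odds) ratio of the four haplotype frequencies in the 2 x 2 table, and that the tetrachoric correlation is approximated by the classical cosine formula cos(π / (1 + √ψ)). The exact estimator of Gianola et al. (2013) may differ, and the function names are hypothetical.

```python
import math


def cross_product_ratio(p11, p12, p21, p22):
    """Cross-product ratio of a 2 x 2 haplotype frequency table.

    p11, p22 : frequencies of the two 'coupling' haplotypes (e.g. AB, ab)
    p12, p21 : frequencies of the two 'repulsion' haplotypes (e.g. Ab, aB)
    Assumption: psi is taken to be (p11 * p22) / (p12 * p21).
    """
    num = p11 * p22
    den = p12 * p21
    if num == 0 and den == 0:
        raise ValueError("cross-product ratio is undefined for this table")
    return math.inf if den == 0 else num / den


def rho_from_psi(psi):
    """Cosine approximation to the tetrachoric correlation (assumed form)."""
    return math.cos(math.pi / (1.0 + math.sqrt(psi)))


if __name__ == "__main__":
    # Independent loci give psi = 1 and a correlation close to zero.
    print(rho_from_psi(cross_product_ratio(0.25, 0.25, 0.25, 0.25)))
    # Excess of coupling haplotypes gives a positive correlation.
    print(rho_from_psi(cross_product_ratio(0.4, 0.1, 0.1, 0.4)))
```

Because ψ is a ratio of products of cell frequencies, it does not change when a row or column of the table is rescaled, which is one intuition for why a ψ-based measure would depend less on allele frequencies than r².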
Figure 2
The behavior of LD as a function of inter-marker distance (Mb) and MAF interval (dMAF). The estimates of r² (left panel) and ρ² (right panel) are depicted as surface plots for SNP loci on chromosome 3 of the Italian Tuscan population in HapMap III (from Gianola et al., 2013).
We argue that ρ is a useful metric with potential for further research and development of applications in population and quantitative genetics. For instance, ρ can facilitate comparison of levels of LD among populations with different allele frequency profiles, whereas such comparisons are distorted by the frequency-dependent nature of r². Likewise, in the quantitative genetics context, power analyses in association studies or genomic selection programs are formulated in terms of r². For example, the sample size in indirect association studies must be increased by roughly 1/r² relative to that needed for detecting the causal mutation directly (Kruglyak, 1999; Pritchard and Przeworski, 2001). Similarly, it is suggested that the level of LD (r²) required for genomic selection to achieve an accuracy of 0.85 for genomic breeding values is 0.2 (Meuwissen et al., 2001). Perhaps similar relationships can also be developed for ρ, which is a subject for future research.
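The 1/r² rule quoted above lends itself to a one-line calculation. The sketch below is a worked illustration of that arithmetic only; the function name and the example numbers are hypothetical, and it is not a full power analysis.

```python
import math


def indirect_sample_size(n_direct, r2):
    """Sample size needed when testing a marker instead of the causal variant.

    n_direct : sample size that would suffice if the causal variant
               itself were genotyped
    r2       : r^2 between the marker and the causal variant
    Uses the approximate 1 / r^2 inflation described in the text.
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in the interval (0, 1]")
    return math.ceil(n_direct / r2)


if __name__ == "__main__":
    # Hypothetical study: 1,000 samples suffice at the causal variant.
    for r2 in (1.0, 0.5, 0.25):
        print(r2, indirect_sample_size(1000, r2))
```

A marker in only moderate LD with the causal variant therefore multiplies the required sample size, which is why the extent of LD is central to choosing marker density.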
The Extent of LD: Genotype vs. Sequence Data
The strength of LD is of crucial importance for the genome-based analysis of evolutionary history and for fine-tuning applications like association mapping, genomic selection, and selection mapping. Most previous studies on LD in farm animals have used panels of ascertained genotypes of different densities from SNP genotyping arrays. The availability of population sequencing for livestock species has now provided the opportunity to characterize patterns of LD at unprecedented resolution. With advances in high-throughput sequencing technologies, read lengths are becoming longer, an ideal situation for estimating LD, as longer reads allow direct phasing of double heterozygotes (Maruki and Lynch, 2014).
The extent to which LD decays in the genome of farm animals has been extensively studied on the basis of genotypes from SNP arrays (Porto-Neto et al., 2014; Khanyile et al., 2015; Prieur et al., 2017; Marchiori et al., 2019; Mokhber et al., 2019; Muñoz et al., 2019, among others). While genotyping arrays show LD extending over several hundreds of kilobases, a denser catalog of SNPs generated from genome re-sequencing reveals LD decaying at much shorter distances (see Figure 3). This is attributed to the SNP profile used to measure LD. As shown in Figure 4, the distribution of allele frequency drawn from sequence data is a decreasing function that involves a sizable fraction of infrequent alleles. In contrast, the frequency distribution in genotyping arrays is rather an increasing function, as SNPs were mainly ascertained aiming at frequent alleles and coverage of the genome during the establishment of the array (also see Fu et al., 2015 and Makina et al., 2015). Given that LD, as measured by r², depends on allele frequencies, the difference between the studies is partially due to the biased SNP selection on the genotyping arrays. Other factors such as population sub-structuring in the sample composition or sequencing errors may also affect the allelic correlations. However, LD measures in this experiment were drawn from the identical set of samples for both array and sequence resolution, and the differences between the two marker sets are too large to be caused by sequencing errors. For further validation of this observation based on possible scenarios, I refer to the experiments described in Qanbari et al. (2014b).
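An LD decay profile of the kind discussed above can be sketched as a mean r² per inter-marker distance bin. The example below is a simplified illustration, not the pipeline used in the cited studies: it assumes unphased genotypes coded 0/1/2 and uses the squared genotype correlation as a stand-in for haplotype-based r². The function names, bin size, and maximum distance are hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2).

    Returns None if either SNP is monomorphic in the sample.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return None
    return sxy * sxy / (sxx * syy)


def ld_decay(positions, genotypes, bin_size=10_000, max_dist=100_000):
    """Mean r^2 per distance bin; keys are the lower bin edges in bp.

    positions : base-pair position of each SNP (integers)
    genotypes : genotypes[i] is the genotype vector of SNP i
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, j in combinations(range(len(positions)), 2):
        dist = abs(positions[j] - positions[i])
        if dist > max_dist:
            continue
        r2 = genotype_r2(genotypes[i], genotypes[j])
        if r2 is None:
            continue
        bin_index = dist // bin_size
        sums[bin_index] += r2
        counts[bin_index] += 1
    return {b * bin_size: sums[b] / counts[b] for b in sorted(counts)}
```

Running the same function on an array-like subset of common SNPs and on the full set of sequence variants from the same animals would be one way to reproduce the contrast described in the text, since the two marker sets differ mainly in their allele frequency spectra.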
Figure 3
A schematic representation of decay of LD in domestic chicken. r² values are plotted as a function of pair-wise inter-marker distances based on sequence (Seq) versus SNP50K (Array) data in a population of Lohmann brown layer line. The gray dots represent sequence-based r² plotted for each chromosome separately, whereas LD based on array data was simply averaged genome-wide due to the lack of enough LD estimates in shorter distance bins. The black dashed line is fitted as mean LD in each distance bin across chromosomes. The r² values representing sequence data are estimated for sub-samples of all pairwise estimates in macrochromosomes, but include all SNP by SNP relationships in microchromosomes.
Figure 4
Distribution of allelic frequency in domestic chicken. Histogram compares profile of minor allele frequency between 50K array and sequence data in a population of Lohmann brown layer.
LD Haplo-Blocks: Genotype vs. Sequence Data
A haplotype block is a set of closely linked markers on a chromosome that are in strong LD with each other and tend to be inherited together (Gabriel et al., 2002). Haplo-blocks could have been produced by the interplay of several possible mechanisms, including domestication, population subdivision, founding events, selection, and recombination hotspots. These structures, when discovered, were of great practical importance for gene mapping studies, as testing one SNP within each block for significant association with a trait might be sufficient to indicate association with every SNP in that block (Carlson et al., 2004). This could reduce the number of SNPs required to be tested in association studies.
Haplotype blocks have been studied in human and farm animals. Previous studies in farm animals based on array data have reported haplo-blocks extending to several hundreds of kilobase pairs (e.g., Qanbari et al., 2010a; Qanbari et al., 2010c; Al-Mamun et al., 2015, among others). The large LD blocks appearing in array-based analyses, however, break into a series of shorter tracts when LD is assessed by sequence data in the cattle genome (Figure 5). Consistent with the reduced LD profile presented in Figure 3, the break-up of large haplo-blocks at sequence resolution is a consequence of a shift in the allele frequency spectrum towards infrequent alleles that are under-represented in the ascertained array genotypes. As a result, a sizable number of pairwise LD estimates involving infrequent alleles become smaller, so that the reduced LD profile breaks up the long LD blocks formed in array-based experiments.
Figure 5
The LD-block structuring as a function of SNP density. (Panel A) displays a LD block of length 29 Kb based on estimates of pair-wise D’ among 13 SNPs located on BTA25 in Fleckvieh cattle. (Panel B) displays LD structure in the same region in sequencing resolution consisting of 115 markers. The LD blocks are obtained using “confidence intervals” algorithm (Gabriel et al., 2002) in Haploview (Barrett et al., 2005). LD analysis has been conducted with a constant number of individuals.
To What Extent is LD in Farm Animals Influenced by Humans?
Addressing this question requires speculating about the possible influence of domestication, breed establishment, and animal farming on the genetic factors affecting LD. Principally, LD is influenced by several factors, including drift, admixture, mutation and recombination rates, selection, finite population size, population bottlenecks, or other genetic events that a population experiences (reviewed in Slatkin, 2008). For example, population admixture creates sizable LD, depending on the similarity of the allele frequency profiles in the admixed populations. LD due to crossbreeding of inbred lines is significant, but it can be small when the crossed breeds have similar gene frequencies, and it erodes quickly and disappears after a limited number of generations. Mutation, due to its minor effect on changing gene frequencies, has a negligible impact on LD in the time frame of domestication. Selection is probably a significant cause of LD; however, its effect is likely localized around specific (major) genes and so has relatively little effect on the amount of LD averaged across the genome.
While the buildup of LD can be a result of several population genetic forces, recombination is the only primary mechanism to break it down. The absence of recombination between sites under selection can reduce the efficiency of selection in what is known as the ‘Hill-Robertson effect’ (Hill and Robertson, 1966). It has been suggested that high rates of recombination during domestication contributed to a strong selection response (reviewed in Ross-Ibarra, 2004), but this remains under debate since the evidence is ambiguous and inconclusive.
The most recent study found no difference in the number and distribution of recombination breakpoints between dogs and wolves, suggesting that both upper and lower bounds of crossover rates may be tightly regulated (Muñoz-Fuentes et al., 2015).
Finite population size is generally thought to be the leading cause of LD, as effective population size has been severely eroded for most domestic species. For example, our experiment based on sequence data suggests that chicken has experienced a drastic decline in Ne, evidencing a severe bottleneck most likely driven by domestication that started in the recent past (see Figure 6). As shown, chicken had the largest effective population size 10,000 years ago, which coincides with the generally accepted timing of chicken domestication (e.g., Xiang et al., 2014). The most recent Ne has dropped to a few hundred individuals, and the Red Jungle Fowl (RJF) appears to have a larger present-day population size than the commercial birds. A similar pattern of historical demography is observed in cattle (The Bovine HapMap Consortium, 2009). In human, the story is the opposite (The 1000 Genomes Project Consortium, 2015); improved agricultural productivity and industrialization have led to dramatic increases in population size. If LD were a result of the (current) finite population size, then the extent of LD should be many times greater in livestock, as these species have Ne orders of magnitude smaller (Leroy et al., 2013; Hall, 2016; Boitard et al., 2016) than the recent estimates reported for humans (Keinan and Clark, 2012; Browning and Browning, 2015). In reality, this is observed only for a portion of the marker pairs situated up to several hundreds of kilobases apart (Szyda et al., 2017). Instead, observations based on full re-sequencing data revealed that the average genome-wide LD in chicken (see Figure 3) and cattle (Qanbari et al., 2014b) extends less than 40 Kb, slightly greater than that in human populations. Since this is obtained from the full profile of polymorphisms, it represents the real strength of LD in these genomes, which is far less than the extent previously reported.
Figure 6
A schematic illustration of historical Ne in chicken. The ancestral demography is inferred in sequence resolution for RJF and white (WL) and brown (BL) layers employing the Pairwise Sequentially Markovian Coalescent [PSMC, Li and Durbin (2011)] framework. The scale on the x-axis is years in the past and the scale on the y-axis represents the historical effective population numbers. Orange (RJF), brown (BL), and cyan (WL) lines represent inferred demography for different populations with bootstraps in lighter colors. Note that inferences of bootstraps are depicted only for one sample of each population.
Indeed, the observation of nearly comparable strength of LD in human and livestock is a consequence of the sizable amount of polymorphism preserved in the genome of livestock. We observe millions of SNPs in the genome of cattle (e.g., Daetwyler et al., 2014) and chicken (Qanbari et al., 2019), in line with the latest updates of the genome sequencing projects in other livestock populations, including horse (Jagannathan et al., 2019), pig (Rubin et al., 2012), and sheep (Naval-Sanchez et al., 2018), that identified tens of millions of SNP variants. This is comparable to the polymorphism content found in the human genome on the basis of sequencing several hundreds of individuals (The 1000 Genomes Project Consortium, 2015).
Hypothetically, the observed level of nucleotide diversity is much larger than a small population with Ne as low as several tens or hundreds is expected to generate or carry. This implies that chicken and cattle must have experienced much larger Ne in their history, which is exactly what emerges from demographic inferences in these species. For example, analysis of sequence data suggests that chicken had a historical Ne of around 25,000 at 1 million years ago that persisted for several hundreds of thousands of years, before the chicken population expanded starting from 50,000 to 100,000 years ago (see Figure 6). A somewhat similar picture of ancestral demography was also reported for the bovine genome (The Bovine HapMap Consortium, 2009). Comparing the LD pattern across breeds of livestock species can reveal the influence of humans in shaping the genetic makeup. LD has been reported across breeds of cattle (Qanbari et al., 2011; Porto-Neto et al., 2014; Makina et al., 2015), sheep (Al-Mamun et al., 2015; Prieur et al., 2017), pig (Badke et al., 2012; Ai et al., 2013; Muñoz et al., 2019), buffalo (Deng et al., 2019; Mokhber et al., 2019), chicken (Khanyile et al., 2015; Hérault et al., 2018), and horse (Wade et al., 2009; McCue et al., 2012; Marchiori et al., 2019), among others. The general trend is that in local breeds or populations that experienced less intensive breeding programs, LD decays faster between distant markers than in commercial populations, in which LD extends over larger pairwise distances. For example, Holstein exhibits more extensive LD than other cattle breeds, despite having the largest contemporary population. In comparison, Indicine breeds have lower LD than Taurine breeds, suggestive of a larger ancestral population (e.g., Porto-Neto et al., 2014). The involvement of humans in shaping the genetic makeup of livestock is also evident in domestic chickens, where local breeds mostly exhibit a shorter extent of LD (Khanyile et al., 2015) and, among the commercial lines, broilers present a faster decay of LD than layer populations (Pengelly et al., 2016; Seo et al., 2018; Hérault et al., 2018). This is attributed to a more intensive selection scheme running over many generations during the past several decades in layers, resulting in lower population haplotype diversity and a smaller Ne.
Further to the comparable polymorphism content, a somewhat similar pattern of site frequency spectra (SFS) emerges in human and livestock genomes from sequence data (see Qanbari et al., 2014b and Qanbari et al., 2019).
The SFS in livestock follows a decreasing trend consistent with many other organisms, including human (e.g., Nielsen et al., 2012). The distinction in livestock is that the spectra are skewed towards a larger fraction of intermediate frequencies (Figure 4). This most likely stems from an extremely small effective population size in present-day livestock species and substantiates the significant under-representation of infrequent alleles in commercial breeds (e.g., see Muir et al., 2008 and Qanbari et al., 2019).
Genome-Wide Variation in LD
Across the genome, every chromosome behaves as a unique linkage group and may experience independent demography. This is similar to the inter-species or inter-population scenarios and generates a different profile of LD for each unit. LD levels are also higher for sex chromosomes than autosomes because recombination along most of the sex chromosome occurs only in the homogametic sex (females in mammals, males in birds). Previous studies measuring LD revealed substantial differences among chromosomes of farm animals (e.g., Sargolzaei et al., 2008). In human, evidence also exists for significant variation in LD across the genome, between sexes, and among populations (Vega et al., 2005; Baudat et al., 2010; Kong et al., 2010, among others). Besides the recombination landscape, which is the primary mechanism shaping genome-wide LD, other factors such as genetic drift, demographic forces, mutation rate, and selection play a role as well. This illustrates how challenging it can be to predict LD between two sets of polymorphisms based solely on physical distance. The design of LD mapping experiments and the placement of SNPs will, therefore, require a thorough understanding of the local interplay of these factors for precisely localizing a target locus.
The Decay of LD in Human and Livestock
LD persists for several hundred kilobases, at least for a portion of marker pairs, in contemporary populations of chicken and cattle (Szyda et al., 2017; Hérault et al., 2018), which results in a slightly higher genome-wide average LD than in humans. This stems primarily from “family-based LD,” which reflects large chromosome segments of founder animals segregating in the population. Consanguineous parents transmit these identical-by-descent segments to their progeny and create uninterrupted stretches of homozygous genotypes known as “runs of homozygosity” (ROH), the hallmark of autozygous segments inherited from a recent common ancestor (reviewed in Peripolli et al., 2017; Ceballos et al., 2018). The frequency, size, and distribution of ROH in the genome provide insights into inbreeding, past demography, and selection in livestock populations (e.g., Bosse et al., 2012; Purfield et al., 2012, among others). In general, the length of an ROH is a function of the number of generations to the common ancestor, so that longer ROH indicate recent inbreeding, whereas ROH of older origin are generally shorter. Livestock populations involve more recent inbreeding loops through assortative mating and are therefore expected to carry longer ROH than outbred populations such as humans, which have a much larger effective population size and greater diversity (Gibson et al., 2006). Although a direct comparison of ROH between species across previous studies is impractical because there is no gold standard for defining ROH islands, the proportion of the genome covered by ROH is expected to be higher in domestic animals than in their wild counterparts. The long unbroken homozygosity held in ROH islands therefore gives rise to more extended LD in livestock than in humans.
Unusually long ROH may also persist in outbred populations. These homozygosity islands may originate from locally low mutation or recombination rates, or result from positive selection for a favorable allele followed by hitch-hiking of the polymorphisms around the target locus (see section “Mapping selection”).
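As a rough illustration of the ROH concept described above, the following Python sketch scans one individual's genotypes for uninterrupted runs of homozygous calls. The function names, the 0/1/2 genotype coding, and the thresholds are assumptions made for this example only; production ROH callers (for instance the sliding-window approach in PLINK) additionally tolerate a limited number of heterozygous or missing calls, which this sketch does not. The helper `expected_roh_length_cm` uses the common approximation that an autozygous segment tracing back to a common ancestor g generations ago has an expected length of 1/(2g) Morgans.

```python
def find_roh(genotypes, positions, min_snps=3, min_length_bp=0):
    """Return (start_bp, end_bp, n_snps) for each run of consecutive
    homozygous genotypes (coded 0 or 2; 1 = heterozygous).

    Hypothetical helper for illustration; no heterozygous or missing
    calls are tolerated inside a run.
    """
    runs = []
    start = None
    n = len(genotypes)
    # Iterate one step past the end so a run reaching the last SNP is closed.
    for i in range(n + 1):
        is_hom = i < n and genotypes[i] in (0, 2)
        if is_hom:
            if start is None:
                start = i
        elif start is not None:
            end = i - 1
            n_snps = end - start + 1
            length_bp = positions[end] - positions[start]
            if n_snps >= min_snps and length_bp >= min_length_bp:
                runs.append((positions[start], positions[end], n_snps))
            start = None
    return runs


def expected_roh_length_cm(generations):
    """Approximate expected ROH length in centimorgans for a common
    ancestor `generations` back: 1/(2g) Morgans = 100/(2g) cM."""
    return 100.0 / (2.0 * generations)


# Toy example: ten SNPs spaced 100 bp apart (made-up data).
toy_genotypes = [0, 2, 2, 1, 0, 0, 0, 0, 1, 2]
toy_positions = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(find_roh(toy_genotypes, toy_positions))
print(expected_roh_length_cm(5), expected_roh_length_cm(50))
```

Under this approximation, a common ancestor 5 generations back gives an expected segment of about 10 cM, whereas one 50 generations back gives about 1 cM, which is consistent with longer ROH indicating more recent inbreeding.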
Implications for Gene Mapping Studies
LD at sequencing resolution decays more rapidly than previously reported using array data. This enables higher-resolution mapping of a trait of interest in outbred populations using either association or selection mapping strategies. It also implies that selection mapping with haplotype-based metrics requires denser SNP panels than medium-density arrays to efficiently reveal the patterns generated by unusually long haplotypes. The low reproducibility of the results reported in some of the first genome-wide selection studies in farm animal populations (e.g., Qanbari et al., 2010b) based on medium-density SNP arrays (~50 k SNPs) may be due to a lack of power caused by overestimating the extent of LD, as demonstrated here. This is supported by our recent study, in which extensive simulations were used to investigate the power of combining selection signatures detected with multiple methods under different scenarios of marker density, sample size, and selection intensity (Ma et al., 2015b). That study showed that reasonable power to detect selection signatures is achieved with high marker density (>1 SNP/kb). Ultimately, uncovering older selective sweeps, which carry shorter haplotypes, will require sequencing resolution.
The extent of LD varies across genomic regions and chromosomes, among populations, and between species. In other words, genome-wide averaged estimates of the extent of LD may not adequately reflect the LD patterns of specific regions or population groups. These observations have broad practical relevance for genomic studies of farm animals, because the optimal number of samples and marker density in genome-wide association or selection mapping studies may vary considerably owing to the highly heterogeneous pattern of LD within and among chromosomes.
Finally, confounding population characteristics such as cryptic allelic correlations or stratification may seriously affect the pattern and structure of LD in livestock populations and need to be taken into account to conduct unbiased genome-wide association mapping (reviewed in Hellwege et al., 2017; see also Ma et al., 2012 and Bulik-Sullivan et al., 2015).
LD Assessment Software Tools
Estimating LD coefficients is computationally simple and can be performed with in-house scripts when marker density is restricted to SNP array genotypes. The coefficient r is particularly straightforward to obtain with built-in commands, as it corresponds to the Pearson correlation between the alleles at a pair of SNPs. Moreover, standard population genetics programs, among them Haploview (Barrett et al., 2005) and Arlequin (Excoffier et al., 2005), along with several R packages, provide tools to estimate LD statistics. At sequence resolution, however, estimating LD coefficients can be computationally burdensome, particularly for very large reference panels such as those of the genome sequencing consortia for different livestock species. For example, a panel of 1,000 sequenced genomes of a mammalian species may include over 35M shared variants, which corresponds to over 4 × 10¹¹ pairwise LD coefficients within 1 Mbp windows genome-wide. A number of sophisticated programs to estimate LD statistics from sequencing data are freely available. PLINK is a widely used software toolkit for analyzing genetic data and is among the most computationally efficient tools for estimating LD (Purcell et al., 2007). VCFtools is another widely used toolkit for manipulating and analyzing genetic data that provides utilities to estimate LD from files in the Variant Call Format (VCF) (Danecek et al., 2011). VCFtools works with compressed VCF files (VCF.gz), which require far less storage space than PLINK BED files; however, it can be computationally demanding for large data sets. M3VCFtools (Das et al., 2016), an extension of VCFtools, uses a compact haplotype representation format called M3VCF to estimate LD statistics. M3VCF requires far less storage than genotype formats. The M3VCF toolkit provides more efficient querying and data processing and has an option to convert a VCF file into M3VCF format.
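To make the "in-house script" case concrete, the following is a minimal Python sketch, not a replacement for the dedicated tools named above. It estimates r² for a pair of SNPs as the squared Pearson correlation of unphased genotype dosages (0/1/2), which approximates the haplotype-based r² only under random mating. It also reproduces the order of magnitude of the pairwise-coefficient count quoted above. The function names are hypothetical, and the genome length of 2.7 Gbp and the assumption of uniformly spaced variants are illustrative choices, not values taken from the text.

```python
import numpy as np


def r_squared(dosage_a, dosage_b):
    """Squared Pearson correlation between two genotype dosage vectors.

    Missing genotypes are expected as NaN and are dropped pairwise.
    Returns NaN if either SNP is monomorphic in the remaining samples.
    """
    a = np.asarray(dosage_a, dtype=float)
    b = np.asarray(dosage_b, dtype=float)
    keep = ~(np.isnan(a) | np.isnan(b))
    a, b = a[keep], b[keep]
    if a.size < 2 or a.std() == 0 or b.std() == 0:
        return float("nan")
    r = np.corrcoef(a, b)[0, 1]
    return float(r * r)


def approx_pairs_within_window(n_variants, genome_bp, window_bp):
    """Approximate number of SNP pairs no further apart than `window_bp`,
    assuming variants are spread uniformly along the genome."""
    density = n_variants / genome_bp          # variants per base pair
    neighbours = density * window_bp          # downstream partners per variant
    return n_variants * neighbours


# Toy genotypes for eight animals at two SNPs (made-up data).
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 1, 1]
print(r_squared(snp_a, snp_b))

# 35M variants, an assumed 2.7 Gbp genome, 1 Mbp windows.
print(f"{approx_pairs_within_window(35e6, 2.7e9, 1e6):.2e}")
```

With these assumed inputs the pair count comes out at roughly 4.5 × 10¹¹, in line with the figure given in the text, which is why window-restricted and highly optimized implementations become necessary at sequence resolution.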
Author Contributions
The author confirms being the sole contributor of this work and has approved it for publication.
Funding
This research is financially supported by the grants from the German Research Foundation (DFG, project ChickenSeq ID. QA55/1-1) and the Federal Ministry of Education and Research (BMBF, project CLARITY, ID. 031L0166). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Conflict of Interest
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.