New technologies for rapidly assaying DNA sequences have revealed that the degree and nature of human genetic variation are far more complex than previously realized. These same technologies have also resulted in the identification of common genetic variants associated with more than 30 human diseases and traits.
Human genetic variation was named "breakthrough of the year" by Science in 2007, reflecting the marked advances in understanding the genetic basis of normal human phenotypic diversity and susceptibility to a wide range of diseases. The human genome is composed of 3 billion nucleotides with approximately 0.5% of these nucleotides differing among individuals [1]. This genetic variation, the nucleotides that differ from person to person, affects the majority of human phenotypic differences, from eye color and height to disease susceptibility and responses to drugs.
Classification of genetic variants
Phenotypic variation in humans is a direct consequence of genetic variation, which acts in conjunction with environmental and behavioral factors to produce phenotypic diversity. Genetic variants are classified by two basic criteria: their genetic composition and their frequency in the population. In terms of composition, polymorphisms can be classified as sequence variants or structural variants. Sequence variants range from single nucleotide differences between individuals to 1 kilobase (kb)-sized insertions or deletions (indels) of a segment of DNA (Figure 1) [2]. Larger insertions and deletions, as well as duplications, inversions and translocations, are collectively called structural variants. These variants can range in size from 1 kb to those spanning more than 5 megabases (Mb) of DNA [3].
Figure 1
Classification of genetic variants by composition. Schematic of sequence and structural variants compared to reference sequence. Sequence variation (indicated by red line) refers to single-nucleotide variants and small (less than 1 kb) indels. Structural variation includes inversions, translocations and copy-number variants, which result in the presence of a segment of DNA in variable numbers compared to the reference sequence, as in duplications, deletions or insertions. Adapted from [4].
Genetic variants are also classified in terms of their frequency within the population, with common variants defined as those in which the minor allele is present at a frequency of greater than 5% in the population, while for rare variants it is present at a frequency of less than 5%. The fundamental source of genetic variation is mutation, and the majority of common genetic variants arose once in human history and are shared by many individuals today through descent from common ancient ancestors. A polymorphism is, by convention, defined as a genetic variant that is present in at least 1% of the population and thereby excludes rare variants that may have arisen in relatively recent human history. Much of the study of genetic variation to date has focused on characterizing the 10 million estimated single nucleotide polymorphisms (SNPs), as they comprise approximately 78% of human variants, thus accounting for most genetic diversity. SNPs are located, on average, every 100 to 300 bases in the genome. Structural variants account for only an estimated 22% of all variants in the genome, but they comprise an estimated 74% of the nucleotides that differ between individuals [1]. As a result of technological advances that enable their detection, there has been a flurry of recent efforts to catalogue structural polymorphisms on a genomic scale [4-6].
The study of inheritance of genetic variation depends on two key concepts: genetic linkage and linkage disequilibrium (Figure 2).
Two loci are in genetic linkage if they are physically close enough to one another such that recombination occurs between them with a less than 50% probability in a single generation, resulting in their co-segregation more often than if they were independently inherited (Figure 2a,b). Recombination frequency is measured in units of centimorgans, with 1 centimorgan equal to a 1% chance that two loci will segregate independently due to recombination in a single generation. One centimorgan is, on average, equivalent to 1 million base pairs (bp) in the human genome.
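The relationship between map distance and recombination can be illustrated with a short calculation. The sketch below is only an illustration: it assumes Haldane's map function (which is not mentioned in the text and ignores crossover interference) to convert centimorgans into a recombination fraction, and it uses the genome-wide average of 1 centimorgan per million base pairs quoted above, although local recombination rates vary. The function names are hypothetical.

```python
import math

# Genome-wide average quoted in the text; local rates can differ substantially.
CM_PER_MB = 1.0


def haldane_recombination_fraction(distance_cm: float) -> float:
    """Recombination fraction for a map distance given in centimorgans.

    Assumes Haldane's map function, r = 0.5 * (1 - exp(-2d)) with d in
    Morgans. For short distances r is close to distance_cm / 100, and it
    approaches 0.5 (independent segregation) as the distance grows.
    """
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def approx_physical_distance_mb(distance_cm: float) -> float:
    """Rough physical distance in megabases using the average rate only."""
    return distance_cm / CM_PER_MB


for cm in (1, 10, 50):
    r = haldane_recombination_fraction(cm)
    mb = approx_physical_distance_mb(cm)
    print(f"{cm} cM: recombination fraction {r:.4f}, roughly {mb:.0f} Mb")
```

Under these assumptions, loci separated by a few centimorgans recombine with a probability close to the map distance expressed as a percentage, whereas loci far apart on the same chromosome approach the 50% expected for independently inherited loci.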
Figure 2
Identification of genetic variation underlying human disease using linkage analysis and genome-wide association studies. (a) Rare Mendelian traits, such as a monogenic disease with autosomal dominant inheritance, can be studied using linkage analysis in a family. The disease status is followed within a pedigree (seven affected individuals depicted in red). (b) The disease locus (red bar) co-segregates with the genetic marker (blue bar), located 10 centimorgans (cM) apart. Each of the seven individuals with the disease carries the blue genetic marker, both inherited from the affected 'parent' chromosome (yellow). (c) Genetic variants underlying common diseases can be statistically identified by using SNP-based linkage disequilibrium (LD) maps. The frequency of a causative variant (red diamond) will be higher (62%) among those with the disease when compared with a control population (50%). (d) An LD map of 11 variants that cluster into three blocks of correlation r² > 0.8 (red scale correlation matrix). The LD between polymorphisms needs to be empirically determined by genotyping a population and calculating the correlation.
Linkage disequilibrium is a measure of the co-occurrence in a population of a particular allele at one locus with a particular allele at a second locus at a higher frequency than would be predicted by random chance. Linkage disequilibrium is created when a new mutation occurs in a genomic interval that already contains a particular variant allele, and is eroded over the course of many generations by recombination. Various statistics have been used to measure the amount of linkage disequilibrium between two variant alleles, one of the most useful being the coefficient of correlation r². When r² = 1 the two variant alleles are in complete linkage disequilibrium, whereas values of r² < 1 indicate that the ancestral complete linkage disequilibrium has been eroded.
Thus, while genetic linkage results from recombination in the last two to three generations and measures co-segregation in a pedigree, linkage disequilibrium depends on the association of variant alleles within a population of unrelated individuals and reflects evolutionary history (Figure 2c,d).
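As an illustration of how the correlation between two variant alleles might be calculated, the following sketch computes r² for a pair of biallelic loci from a list of phased haplotypes. It assumes that phase is known (genotype data from unrelated individuals would first need to be phased or handled with an estimation method), and the function name and toy data are hypothetical rather than taken from any study.

```python
from typing import Sequence, Tuple


def ld_r_squared(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """Squared correlation (r^2) between alleles at two biallelic loci.

    Each haplotype is a pair (a, b), where 1 marks the variant allele and
    0 the other allele at locus A and locus B respectively.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")

    p_a = sum(a for a, _ in haplotypes) / n  # variant allele frequency, locus A
    p_b = sum(b for _, b in haplotypes) / n  # variant allele frequency, locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic in the sample")

    d = p_ab - p_a * p_b  # excess co-occurrence over that expected by chance
    return d * d / denominator


# Toy data: variant alleles always found together (complete LD).
complete = [(1, 1)] * 3 + [(0, 0)] * 7
# Toy data: recombinant haplotypes (1, 0) and (0, 1) are now present.
eroded = [(1, 1)] * 2 + [(1, 0)] + [(0, 1)] + [(0, 0)] * 6

print(f"complete LD: r^2 = {ld_r_squared(complete):.3f}")
print(f"eroded LD:   r^2 = {ld_r_squared(eroded):.3f}")
```

In the first toy sample the two variant alleles never occur apart, so r² is 1 (up to rounding); once recombinant haplotypes appear, as in the second sample, r² falls below 1.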
Advances in identification of genetic variants underlying human traits
The first disease traits to be ascribed to particular genes were Mendelian traits, which are controlled by a single gene and follow well defined models of inheritance, such as autosomal dominant, autosomal recessive, and X-linked (Figure 2a). Genetic variants underlying Mendelian diseases are highly penetrant by definition (that is, the variant is associated with a very high relative risk of having the disease) and, as a result of negative selection, they tend to be rare (Figure 3).
Figure 3
The allelic spectrum of disease is dependent on the number of genetic variants, their frequency in a population and on the size of their phenotypic effect. Family-based linkage studies have proved successful in identifying causative genetic variants in rare Mendelian disorders, which are, by definition, caused by highly penetrant variants that have a low frequency in the population. Complex diseases are caused by multiple genetic variants that confer incremental risk of disease. Genome-wide association studies have sufficient power to detect genetic variants with modest phenotypic effects, provided that they occur at a high frequency in the population. Adapted from [92].
In the 1980s and 1990s, the creation of genetic-linkage maps was based on sequence-dependent data such as restriction-fragment length polymorphisms [7,8] and microsatellite markers [9]. These techniques established genetic-linkage analysis as the traditional method for identifying genetic variation underlying monogenic genetic disorders. Linkage studies consisted of mapping broad genetic regions that segregate with a disease in families and then using positional cloning to narrow down the candidate region in order to isolate disease-causing genes or variants. Linkage analyses were successful in identifying genetic variants in genes responsible for many notable Mendelian diseases, including cystic fibrosis [10], for which the major disease variant results in the deletion of a single amino acid, Charcot-Marie-Tooth Disease Type 1A [11], for which the underlying genetic variant is a DNA duplication, and Huntington's disease [12], which is a trinucleotide repeat disorder. By 1995, genetic linkage mapping had been used to uncover variants underlying hundreds of human Mendelian traits and diseases.
Thus, almost a decade before the elucidation of the human genome sequence, it was fully appreciated that DNA variants of all classes, both common and rare as well as sequence and structural, play important roles in single-gene traits and rare Mendelian diseases.
The next, and more difficult, stage was to determine genes associated with the far more common complex (multigene) diseases such as diabetes, heart disease and cancer. The conceptual framework for statistical association studies to identify common genetic variants underlying common diseases was established by Risch and Merikangas in 1996 [13], and is now referred to as the common disease/common variant (CD/CV) hypothesis. This hypothesis states that common diseases are caused by multiple genetic variants that are present at a high frequency in the population and confer cumulative incremental effects on disease risk (Figure 3) [14,15]. It is thought that due to the low penetrance and modest risk associated with these common variant alleles, they do not undergo the same strong negative selection as highly penetrant rare variants underlying Mendelian diseases. In addition, environment and behavior are believed to contribute over 70% of the susceptibility to diseases such as cancer, coronary heart disease and type 2 diabetes [16]. On the basis of these assumptions in the CD/CV model, it was posited that to identify variants that occur at a high frequency in the population yet confer a small risk for disease, it would be feasible to use SNP-based linkage disequilibrium maps to survey the common genetic variation present in the entire genomes of a large number of individuals.
Several key technological advances laid the foundation for the eventual successful implementation of genome-wide association studies in identifying common genetic variants underlying complex traits.
The first was the completion of the 3 billion bp human genome sequence in 2001, which served as a reference sequence to which genotype or sequence information from individuals could be compared [17,18]. Then, large-scale efforts led to the discovery of a substantial fraction of the 10 million estimated SNPs in the human population. By genotyping millions of these SNPs in hundreds of individuals, the International HapMap Project created SNP linkage disequilibrium maps, reducing the vast majority of common genetic variation in the 3 billion bp human genome to around 500,000 tag SNPs that are proxies for other SNPs in high linkage disequilibrium [19]. This resource has driven a wave of critical technological advances in the design of genome-wide SNP arrays that allow the rapid and cost-effective genotyping of hundreds of thousands to millions of tag SNPs in each individual, thus allowing the examination of common genetic variation across the genome.
Genome-wide association studies using SNP-based arrays compare the frequency of SNP alleles in the genomes of a group of individuals with a complex trait (the cases) to a control group (Figure 2c). This approach allows the identification of common genetic variants that are either causative or in linkage disequilibrium with a causative allele. In reviewing the design of successful genome-wide association studies, three key features become clear. First, because of the moderate risk conferred by many common genetic variants, it is imperative to design an adequately powered study with large sample sizes that are carefully controlled to minimize bias [20-22]. Second, SNP selection and detection are critical, and there is an ongoing effort to catalog more SNPs across the genome and to create methods to assay SNP genotypes more densely. Finally, even statistically convincing associations require validation by replication in an independent cohort.
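The case-control comparison at a single SNP can be sketched as a test on a 2 x 2 table of allele counts. The example below is a minimal illustration, not the analysis used in any of the cited studies: it applies a Pearson chi-square test with one degree of freedom to allele counts, uses the 62% and 50% allele frequencies from Figure 2c, and assumes hypothetical cohorts of 1,000 cases and 1,000 controls (2,000 alleles each). A real genome scan would also need to correct for the very large number of SNPs tested and for population structure.

```python
import math
from typing import Tuple


def allelic_association(
    case_risk: int, case_other: int, control_risk: int, control_other: int
) -> Tuple[float, float, float]:
    """Allelic chi-square test for one SNP in a case-control sample.

    Arguments are allele counts (two alleles per individual).
    Returns (chi_square, p_value, odds_ratio).
    """
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    row_col_product = (a + b) * (c + d) * (a + c) * (b + d)
    if row_col_product == 0 or b * c == 0:
        raise ValueError("table has an empty margin or zero cell")

    # Pearson chi-square for a 2 x 2 table, 1 degree of freedom.
    chi_square = n * (a * d - b * c) ** 2 / row_col_product
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    odds_ratio = (a * d) / (b * c)
    return chi_square, p_value, odds_ratio


# Hypothetical cohorts: 1,000 cases and 1,000 controls, i.e. 2,000 alleles each.
# Allele frequencies follow Figure 2c: 62% in cases, 50% in controls.
chi_square, p_value, odds_ratio = allelic_association(1240, 760, 1000, 1000)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.2e}, OR = {odds_ratio:.2f}")
```

With smaller cohorts the same difference in allele frequency gives a much weaker signal, which is one way to see why adequately powered sample sizes and independent replication matter.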
During 2007, the first wave of genome-wide association studies using tag SNPs resulted in the identification of common genetic variants associated with a broad range of common diseases and traits, including cancer, metabolic diseases, immune-mediated diseases and neurodegenerative diseases (Table 1). The findings of these genome-wide scans can best be reviewed by discussing the results of studies investigating specific complex diseases and traits. Gout and its associated serum uric acid concentration have been studied in two genome-wide association studies [23,24], resulting in the identification of variants in the gene SLC2A9 (solute carrier family 2 member 9). SLC2A9 variants were associated with high concentrations of uric acid in the serum (between 1.7% and 5.3% increase) and the expression level of isoform 2 of SLC2A9 was correlated with serum uric acid concentration [24]. This isoform encodes the protein Glut9ΔN, a putative fructose transporter expressed in kidney. As fructose is upstream in the pathway generating uric acid, impaired expression of this protein possibly leads to the increased level of serum uric acid observed in gout [23,24].
Table 1
Genetic loci associated with disease and phenotypic variation
Multiple genome-wide association studies investigating coronary artery disease have independently identified a strong association with SNPs in a chromosomal region at 9p21. Individuals homozygous for the 9p21 risk allele have a 1.9-fold higher relative risk of suffering from coronary artery disease than individuals homozygous for the non-risk alleles [22,25-28]. Interestingly, this region does not harbor any known genes, and the underlying biological reason for the association is unknown. Beyond diseases, genome-wide scans have identified variants associated with human height: HMGA2 (a transcription factor) and GDF5-UQCC (a locus associated with osteoarthritis) [29,30]. In addition, variants in FTO (fat mass and obesity associated gene) have been associated with obesity: adults homozygous for the risk allele have an increased relative risk of 1.67 for being obese compared with the non-risk allele carriers [31].
In spite of the exciting successes of recent SNP-based genome scans, the results of studies investigating specific complex diseases indicate that the approach frequently identifies common variants that account for only a small fraction (less than 10%) of the heritable component of the disease [32]. Most of the associated SNPs typically result in an increased relative risk of around 1.2 for heterozygotes and for many diseases only a few SNPs have been identified. Thus, we are left asking where the remaining genetic variance underlying these heritable diseases lies. It is likely that some of this missing variation is accounted for by common variants with very small effects, which the current studies, despite the rather large cohorts used, are not powerful enough to capture. The additive or even multiplicative integrated effect of common SNPs may be important, as recently shown with five SNPs that increase susceptibility to prostate cancer [33].
Such gene-gene interactions are typically not accounted for in the analysis of genome scans. It is well established that SNP-based genome scans have limited power to capture the association of rare variants, which are likely to be important contributors to complex diseases. Structural variants have been demonstrated to underlie phenotypic diversity of complex traits [34,35] but have not generally been captured with current SNP-centric platforms for ultra-high throughput genotyping. Recent studies have shown that this class of variants is enriched in segmentally duplicated regions of the genome, in which there is a paucity of tag SNPs because of technical difficulties [36]. Thus, the missing variation in SNP-based genome scans indicates that systematically examining these other types of variants for their contribution to complex diseases is important.
Functional annotation of genetic variants
Although the discoveries of SNP-based genome-wide association studies are exciting, it is important to note that they are limited to the statistical association of DNA variants with common diseases and that the biological mechanisms underlying most of these findings are not yet known. For example, multiple studies have shown that three SNPs on chromosome 16p13 in the vicinity of KIAA0350 are unequivocally associated with type 1 diabetes, but it is unclear how the risk and non-risk alleles differ; is it in expression, alternative splicing patterns, or the function of the protein encoded by KIAA0350? [37] This uncertainty in the underlying biological cause of an association is especially pronounced when the variant lies in a chromosomal interval that does not contain a gene, such as the association of the 9p21 interval with coronary artery disease. Therefore, the findings of most association studies currently can only be used for crude predictions of the likelihood that an individual will develop a certain disease.
To translate the findings of SNP-based genome scans into clinical practice to improve human health, it is necessary to establish new, highly innovative approaches for assaying intervals containing associated variants for functional differences between the risk and non-risk alleles. This will require access to diverse and large patient populations to obtain biological samples. Each genomic interval has a different landscape of functional sequences, and this, together with the fact that each disease affects different biological processes, makes it impossible to develop a 'one-size-fits-all' strategy to annotate associated sequences for functional differences between risk and non-risk alleles.
Thus, it is also essential to make use of diverse experimental methods and technologies in all the various biological 'omics': genomics, proteomics, epigenomics, metabolomics, structural genomics and glycomics.
Several public and private initiatives are developing 'next generation' sequencing technologies based on pyrosequencing (Roche-454) [38], sequencing by synthesis (Illumina-Solexa) [39] or sequencing by ligation (ABI-SOLiD). These technologies, capable of the cost-effective generation of massive amounts of DNA sequence, are already being used to sequence targeted regions, and in the near future will be capable of sequencing whole genomes of individuals to simultaneously examine SNPs and other genetic variants for associations with specific diseases. The statistical analysis methods for assessing the relationship between rare genetic variants identified in sequence data and complex traits are beginning to be developed. Results of sequence-based studies conducted so far suggest that associated intervals will be identified on the basis that the frequency of rare genetic variants with functional consequences will be greater in individuals with the complex disease than in controls. Thus, next-generation sequencing technologies, by detecting many more SNPs and other types of variation associated with complex disease, will increase both the difficulty and the importance of functional annotation of genetic variants. At this point, it appears that we are just beginning to appreciate the extent of human genomic variation. Projects like the '1000 Genomes' and large-scale efforts to perform deep-coverage sequencing in both healthy individuals and those with complex traits will help propel this exciting field further.