Lucas D Ward, Manolis Kellis.
Abstract
Association studies provide genome-wide information about the genetic basis of complex disease, but medical research has focused primarily on protein-coding variants, owing to the difficulty of interpreting noncoding mutations. This picture has changed with advances in the systematic annotation of functional noncoding elements. Evolutionary conservation, functional genomics, chromatin state, sequence motifs and molecular quantitative trait loci all provide complementary information about the function of noncoding sequences. These functional maps can help with prioritizing variants on risk haplotypes, filtering mutations encountered in the clinic and performing systems-level analyses to reveal processes underlying disease associations. Advances in predictive modeling can enable data-set integration to reveal pathways shared across loci and alleles, and richer regulatory models can guide the search for epistatic interactions. Lastly, new massively parallel reporter experiments can systematically validate regulatory predictions. Ultimately, advances in regulatory and systems genomics can help unleash the value of whole-genome sequencing for personalized genomic risk assessment, diagnosis and treatment.
Year: 2012 PMID: 23138309 PMCID: PMC3703467 DOI: 10.1038/nbt.2422
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
The diversity of genetic architectures underlying human phenotypes.
| Architecture | Notes | Role of computational and regulatory genomics |
|---|---|---|
| Classic monogenic traits | The earliest human genes characterized were those leading to inborn errors in metabolism, which were shown by Garrod in the early 1900s to follow Mendelian inheritance[ | As the underlying mutations tend to alter protein structure, the computational challenge in predicting their effect lies in molecular modeling and structural studies. |
| Monogenic traits with multiple disease alleles | Even monogenic diseases differ greatly in the extent to which a single risk allele predominates among affected individuals (allelic heterogeneity). On one end of the spectrum, the F508del allele of | As noted above, for protein-coding mutations, the relevant problem is predicting the biochemical effect of the amino acid substitution. In cases of allelic heterogeneity, the observed substitutions may be too numerous to characterize experimentally, necessitating computational models ( |
| Multiple loci with independent contributions (“oligogenic”) | Many variants increase or decrease the risk of a disease, with the final phenotype relying on the genotype at many loci (locus heterogeneity). One example well studied through linkage analysis is Hirschsprung disease, a complex disorder with low sex-dependent penetrance for which at least ten genes are involved, including the tyrosine kinase receptor | Oligogenic traits, in which a handful of well-characterized loci contribute to the phenotype, may present the best opportunity to observe and quantify epistatic interactions. In cases where non-coding regions are implicated, these haplotypes can be functionally mapped to isolate the most likely causal variants ( |
| Large numbers of variants jointly contributing weakly to a complex trait | GWAS on complex traits are also discovering many weakly contributing loci. For example, a recent meta-analysis of several height studies found 180 loci reaching genome-wide significance[ | In contrast to the variants underlying monogenic traits, the variants involved in complex traits are overwhelmingly not associated with missense or nonsense coding mutations, suggesting that their mechanisms are primarily regulatory[ |
| Variants regulating a “molecular trait” with unknown effect on organismal phenotype or fitness | Variants are rapidly being discovered that directly affect molecular quantitative traits, such as gene expression or chromatin state, many of which may have no effect on organismal phenotype or fitness[ | QTL and allele-specific analyses are needed to characterize these variants ( |
| Variants causing no known molecular phenotype and no effect on organismal phenotype or fitness | The idea that the majority of mutations are neutral from an adaptive perspective was controversial when first proposed, and now is widely accepted[ | Although it is straightforward to calculate from the genetic code what fraction of protein-coding mutations will cause an amino acid change, an analogous estimate for other molecular phenotypes is far more challenging and requires comprehensive regulatory models at the nucleotide level. |
| Private and somatic variants | Somatic mutations within an organism are frequent driver mutations selected in cancer formation[ | The interpretation of private and somatic variations ( |
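The table's remark that the fraction of protein-coding mutations causing an amino acid change follows directly from the genetic code can be made concrete. The sketch below is illustrative only: it assumes the standard nuclear genetic code, treats every sense codon and every single-nucleotide substitution as equally likely (real codon usage and mutation spectra are not uniform), and uses function names that are not taken from the paper.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def substitution_classes():
    """Classify every single-nucleotide change starting from a sense codon.

    Assumption: all codons and all substitutions are weighted equally.
    """
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons as starting points
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


def nonsynonymous_fraction():
    """Fraction of substitutions that change the encoded protein."""
    counts = substitution_classes()
    total = sum(counts.values())
    return (counts["missense"] + counts["nonsense"]) / total


if __name__ == "__main__":
    print(substitution_classes())
    print(round(nonsynonymous_fraction(), 3))
```

No comparably simple lookup exists for regulatory sequence, which is the contrast the table draws: an analogous estimate there would need a nucleotide-level regulatory model in place of the codon table.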
Figure 1. Four types of next-generation association tests
(a) Genetic association with organismal traits is performed in genome-wide association studies (GWAS); at the locus shown, the G allele is associated with disease. The effect of GWAS-discovered variants is mediated through many layers of molecular processes, some of which can also be interrogated at a genome-wide scale. (b) Rather than organismal traits, molecular traits can be used, leading to the discovery of local regulatory variants such as expression quantitative trait loci (eQTLs). In this example a local molecular signal, such as a region of open chromatin, varies across the individuals, and is shown to co-vary with presence of the T allele; this allele may influence a cis-regulatory motif of chromatin. (c) Heterozygous sites in individual cells can be used to interrogate allele-specific effects; unlike molecular QTLs discovered across individuals, these studies control for variation in trans genetic background. In this example, the G allele is not only associated with the presence of a TF binding peak at that locus, but in heterozygous individuals is over-represented in ChIP-seq reads originating from that locus, suggesting that the TF binds specifically to the G allele. (d) Functional genomics data can be directly compared between cases and controls to discover biomarkers for disease, without necessarily attributing genetic causes to these molecular changes. Indeed, these biomarkers may be caused by trans genetic factors, environmental factors, or by the disease itself.
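The allele-specific test in panel (c) can be sketched as an exact binomial test on read counts at a heterozygous site. This is a minimal illustration, not the method of any tool named in this article: it assumes reads from the two alleles are independent draws with equal probability under the null, and it ignores reference-mapping bias, which dedicated pipelines correct for. The function name and arguments are hypothetical.

```python
from math import comb


def allelic_imbalance_pvalue(ref_reads, alt_reads, p_ref=0.5):
    """Two-sided exact binomial p-value for allelic imbalance.

    Assumption: under the null, each read comes from the reference
    allele with probability p_ref (0.5 for an unbiased heterozygote).
    Intended for modest read depths; very deep sites would need a
    log-space implementation.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical ChIP-seq read counts: 18 reads carry G, 4 carry the other allele.
    print(allelic_imbalance_pvalue(18, 4))
```

A small p-value would be consistent with the TF preferring one allele, but only after mapping bias and copy-number differences have been ruled out.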
Computational tools for association analyses.
| Class of analysis | Tool | Notes |
|---|---|---|
| Genome-wide association between genotype and phenotype (GWAS) | SNPTEST[ | Incorporates imputation |
| | Bim-Bam[ | Bayesian regression approach combining imputation and association probabilities |
| | EIGENSTRAT[ | Models ancestry differences between cases and controls using principal components analysis |
| | PLINK[ | Large package including tools to impute, control for population stratification, and hybrid methods such as family-based association and population-based linkage |
| Local association between genotype and molecular trait (e.g., eQTL) | eQTNMiner[ | Tests a Bayesian hierarchical model incorporating priors based on TSS distance |
| | Matrix eQTL[ | Fast association testing of continuous or categorical genotype values with expression |
| Allele-specific expression and binding | ChIP-SNP[ | For ChIP-chip data |
| | AlleleSeq[ | For ChIP-seq and RNA-seq data |
| Genome-wide association between molecular trait and phenotype (e.g., differential expression, EWAS) | limma[ | For expression microarray data |
| | edgeR[ | For RNA-seq data |
Note: analyses using genotype information require tools to call variants, such as BirdSeed[152] on array data or GATK[153] on sequencing data, and tools to impute genotypes, such as MaCH[154].
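As an illustration of the "local association between genotype and molecular trait" class in the table, the sketch below regresses expression on genotype dosage (0, 1 or 2 copies of an allele) for a single SNP-gene pair. It is a simplified stand-in rather than the algorithm of any listed tool: it assumes an additive model with no covariates, no correction for population structure, and no multiple-testing adjustment, and the function name is hypothetical.

```python
from math import copysign, inf, sqrt


def eqtl_regression(dosages, expression):
    """Ordinary least squares of expression on allele dosage.

    Returns (slope, t_statistic). Assumption: additive genetic model,
    one SNP and one gene, no covariates.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched vectors with at least 3 samples")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("genotype is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(dosages, expression))
    std_err = sqrt(rss / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else copysign(inf, slope)
    return slope, t_stat


if __name__ == "__main__":
    # Hypothetical data: expression rises by about one unit per allele copy.
    dosages = [0, 0, 1, 1, 2, 2]
    expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
    print(eqtl_regression(dosages, expression))
```

The t statistic would be compared against a t distribution with n - 2 degrees of freedom; genome-wide scans repeat this test across many SNP-gene pairs, which is why speed and multiple-testing control matter for the dedicated tools.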
Mechanisms through which non-coding variants influence human disease.
| Non-coding element disrupted | Molecular function and effect of mutations | Disease association |
|---|---|---|
| Splice-junction and splicing-enhancer | Splicing is constitutive for some transcripts and highly tissue-specific for others, relying on both canonical sequences at the exon-intron junction as well as weakly-specified sequence motifs distributed throughout the transcript. Mutations affecting constitutive splice sites can have an effect similar to nonsense or missense mutations, resulting in aberrantly included introns or skipped exons, sometimes resulting in nonsense-mediated decay (NMD). | Splicing regulatory variants are implicated in several diseases[ |
| | | A recent analysis suggests that the majority of disease-causing point mutations in OMIM may exert their effects through splicing[ |
| | | Alternative splice site variants in the |
| | | Skipping of exon 7 of the |
| Sequences regulating translation, stability, and localization | Sequences in the 5′-untranslated regions (UTRs) of mRNAs can influence translation regulation, such as upstream ORFs, premature AUG or AUC codons, and palindromic sequences that form inhibitory stem loops[ | Loss-of-function mutations in the 5′-UTR of |
| | | A rare mutation that creates a binding site for the miRNA hs-miR-189 in the transcript of the gene |
| Genes encoding trans-regulatory RNA | Non-coding RNAs participate in a panoply of regulatory functions, ranging from the well-understood transfer and ribosomal RNA to the recently-discovered long non-coding RNAs[ | Both rare and common mutations in the gene |
| | | Non-coding RNA mutations can cause many other diseases[ |
| Promoter | Promoter regions are an essential component of transcription initiation and the assembly of RNA polymerase and associated regulators. Mutations can affect binding of activators or repressors, chromatin state, nucleosome positioning, and also looping contacts of promoters with distal regulatory elements. Genes with coding disease mutations can also harbor independently-associated regulatory variants that correlate with expression, are bound by proteins in an allele-specific manner, and disrupt or create regulatory motifs[ | Mutations in the promoter of the HIV-1 progression-associated gene |
| | | Heme oxygenase-1 ( |
| Enhancer | Enhancers are distal regulatory elements that often lie 10,000 to 100,000 nucleotides from the start of their target gene. Mutations within them can disrupt sequence motifs for sequence-specific transcription factors, chromatin regulators, and nucleosome positioning signals. Structural variants including inversions and translocations can disrupt their regulatory activity by moving them away from their targets, disrupting local chromatin conformation, or creating interactions with insulators or repressors that can hinder their action. While it is thought that looping interactions with promoter regions play a role, the rules of enhancer-gene targeting are still poorly understood. | The role of distal enhancers in disease was suggested even before GWAS by many Mendelian disorders for which some patients had translocations or other structural variants far from the promoter[ |
| | | In one early study, point mutations were mapped in an unlinked locus in the intron of a neighboring gene, a million nucleotides away from the developmental gene |
| | | A number of GWAS hits have been validated as functional enhancers[ |
| Synonymous mutations within protein-coding sequences | All of the aforementioned regulatory elements can also be encoded within the protein-coding exons themselves. Thus, synonymous mutations within protein-coding regions may be associated with non-coding functions, acting pre-transcriptionally at the DNA level, or post-transcriptionally at the RNA level. | A synonymous variant in the dopamine receptor gene |
Figure 2. Dissecting haplotypes discovered through association tests
These three examples are ways to annotate loci containing several linked SNPs (in this case, three) to discover those most likely to be causal. (a) Functional genomics techniques are being developed to discover putative regulatory elements and link these elements to their target genes. Here, the middle SNP lies in an enhancer in Tissue 1 and Tissue 3, and regulates a gene to its left. (b) Regulatory genomics information leads to prediction of sequence motifs active in classes of enhancers, and this can be combined with the motif creation/disruption caused by variants. In this case, the middle SNP deletes a match to motif B, which is predicted to be active in enhancers found in both Tissue 1 and Tissue 3. (c) Comparative genomics identifies regions of evolutionary constraint in non-coding sequence. Here, sequence surrounding only the middle SNP is constrained across mammals.
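The motif creation/disruption idea in panel (b) can be sketched by scoring the reference and alternate alleles against a position weight matrix (PWM) and comparing the best match overlapping the SNP. Everything specific here is an assumption made for illustration: the four-base consensus, the match probability, the uniform background, and the function names are hypothetical, and real analyses would use curated motif libraries and scan both strands.

```python
from math import log2


def log_odds_matrix(consensus, match_prob=0.85, background=0.25):
    """Build a toy log-odds PWM from a consensus string (hypothetical motif)."""
    mismatch_prob = (1 - match_prob) / 3
    return [
        {b: log2((match_prob if b == c else mismatch_prob) / background) for b in "ACGT"}
        for c in consensus
    ]


def best_window_score(seq, pwm, snp_index):
    """Best PWM score over all windows that overlap the SNP position."""
    width = len(pwm)
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(seq) - width)
    if last < first:
        raise ValueError("sequence too short for this motif")
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(first, last + 1)
    )


def motif_score_change(ref_seq, snp_index, alt_base, pwm):
    """Alternate minus reference score; negative values suggest disruption."""
    alt_seq = ref_seq[:snp_index] + alt_base + ref_seq[snp_index + 1:]
    return best_window_score(alt_seq, pwm, snp_index) - best_window_score(
        ref_seq, pwm, snp_index
    )


if __name__ == "__main__":
    pwm = log_odds_matrix("TGAC")  # hypothetical "motif B"
    # The reference carries TGAC; the SNP at index 4 changes A to T.
    print(motif_score_change("AATGACAA", 4, "T", pwm))
```

A strongly negative change at a SNP that also lies in a tissue-relevant enhancer and in a constrained region would combine the three lines of evidence the figure describes.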
Comparison of recent tools to systematically annotate variants
Many such tools have been released as databases or software in the past decade; listed below is a sampling of the most recent.
| Tool | Type | Input method | Protein annotation | Regulatory annotation | Other |
|---|---|---|---|---|---|
| SeattleSeq[ | server | variants | deleteriousness scores | conservation scores | dbSNP clinical association data |
| ANNOVAR[ | software | variants, regions | User-defined: user downloads desired variation, conservation, coding and non-coding functional annotations | ||
| ENSEMBL VEP[ | server | variants, regions | deleteriousness scores | regulatory motif alteration scores | OMIM, GWAS data |
| VAAST[ | software | variants | deleteriousness scores | conservation scores | Aggregation to discover rare variants in case-control |
| HaploReg[ | server | variants, studies | dbSNP consequence data | chromatin state, protein binding, DNase, conservation, regulatory motif alteration scores | GWAS data, eQTL, LD calculation, enrichment analysis per study |
| RegulomeDB[ | server | variants, regions | | histone modification, protein binding, DNase, conservation, regulatory motif alteration scores | eQTL, reporter assays, combined score analysis per variant |
Figure 3. Systems-level analyses beyond isolated common haplotypes
(a) Gene-based enrichment analysis of genetic architecture. A typical analysis of GWAS results will compare the set of genes near associated loci with prior knowledge about those genes, leading to hypotheses about the pathways involved (in this example, process A but not process B). (b) Non-coding enrichment analysis of genetic architecture using regulatory annotations. High-resolution maps of diverse regulatory annotations can also be intersected with GWAS results. Examples are shown where tissue-associated enhancers, eQTLs, DNase peaks, or allele-specific polymerase binding are enriched among the results of a GWAS. In addition, regulatory annotations can be combined with gene-based annotations and linking information, in this case discovering an enrichment for enhancers linked to the genes involved in process A. (c) Interpreting linked loci exhibiting high allelic heterogeneity. In some cases only rare mutations at a locus contribute to its genetic mechanism, and these regions will only be discovered through classical linkage analysis. These regions can now be interrogated through WES/WGS, and an imbalanced burden of putatively deleterious alleles can be observed in cases (as in the left example). With regulatory annotations, these burden tests can now be extended to non-coding regions (as in the right example). (d) Interpreting causal variants in whole genomes. Personal genomes pose the challenge of exposing potentially causal variants that were too rare or low-penetrance to have been associated with a phenotype through association or linkage studies. For coding alleles, prior knowledge is currently used in several ways when analyzing personal genomes: knowledge of the genetic code (to filter on nonsynonymous variants), inference of negative selection from population panels (to filter out common variants), and models developed from biophysical principles (to focus on those amino acid substitutions most likely to alter protein structure and function). Similar pipelines will need to be developed for regulatory regions. We propose using both population-level and cross-species signals of selection (to filter out not only common variants, but those that are not constrained across mammals), and all of the regulatory models previously mentioned (predicted regulatory elements and the motifs active within them, molecular trait associations such as eQTLs, etc.). Such a pipeline will be crucial to interpreting the flood of sequencing data that will be collected in both clinical and research settings.
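The proposed regulatory filtering pipeline in panel (d) can be sketched as a sequence of filters over annotated variants. The record fields, the 1% frequency cutoff, and the evidence-count ranking below are all assumptions chosen for illustration; the caption describes the kinds of signals to combine but does not specify thresholds or a scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical annotated non-coding variant."""
    name: str
    allele_frequency: float          # from a population panel
    constrained: bool                # conserved across mammals
    in_regulatory_element: bool      # e.g. predicted enhancer or promoter
    disrupts_motif: bool = False     # alters a motif active in that element
    is_eqtl: bool = False            # associated with a molecular trait


def prioritize(variants, max_frequency=0.01):
    """Keep rare, constrained variants with regulatory evidence, best first.

    Assumption: each line of regulatory evidence counts equally.
    """
    kept = []
    for v in variants:
        if v.allele_frequency > max_frequency:
            continue  # population-level filter: drop common variants
        if not v.constrained:
            continue  # cross-species filter: drop unconstrained positions
        evidence = sum([v.in_regulatory_element, v.disrupts_motif, v.is_eqtl])
        if evidence == 0:
            continue  # no regulatory model supports a functional effect
        kept.append((evidence, v))
    kept.sort(key=lambda pair: (-pair[0], pair[1].allele_frequency))
    return [v for _, v in kept]


if __name__ == "__main__":
    candidates = [
        Variant("var_common", 0.20, True, True, True, True),
        Variant("var_unconstrained", 0.001, False, True, True),
        Variant("var_enhancer", 0.002, True, True),
        Variant("var_motif_eqtl", 0.004, True, True, True, True),
    ]
    print([v.name for v in prioritize(candidates)])
```

Hard filters like these discard information; a weighted score that combines the same signals is an equally plausible design and may suit low-penetrance variants better.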
Examples of regulatory enrichment analyses of genetic associations.
| Class of test | Finding | Computational tools used |
|---|---|---|
| Gene set enrichment near associated loci | Regulatory network of five proteins implicated in Kawasaki disease[ | Ingenuity Pathway Analysis (closed-source) |
| | Genes differentially expressed in adipose overlap with genetic associations with obesity[ | Microarray analysis of differential expression |
| | TGF-β pathway, Hedgehog signaling pathway are enriched among height GWAS loci[ | GSEA using MAGENTA[ |
| Concordance with eQTL results | eQTL prioritization during replication facilitated validation of two Crohn’s disease susceptibility loci[ | eQTL enrichment |
| | GWAS involving immune system show enrichment for lymphoblastoid eQTL[ | eQTL enrichment (RTC[ |
| Chromatin state enrichment | Many GWAS show enrichment for enhancers in biologically-relevant cell types[ | ChromHMM to define discrete chromatin states[ |
| TF binding site and DNase hypersensitivity enrichment | Many GWAS show enrichment for ENCODE-annotated DNase and ChIP sites[ | Enrichment analysis |
| | Many GWAS show enrichment for DNase in biologically-relevant cell types[ | Hotspot algorithm to define discrete hypersensitive sites[ |
| | FOXA1 and estrogen receptor binding sites are enriched among breast cancer GWAS loci[ | Variant Set Enrichment (VSE[ |
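The enrichment tests summarized in the table share a common core: asking whether associated variants overlap an annotation more often than expected by chance. A minimal sketch using a one-sided hypergeometric test is shown below. It is not the procedure of any tool named in the table: it assumes variants are independent draws from a fixed background set, which ignores linkage disequilibrium and differences in allele frequency or gene density that published methods control for with matched backgrounds. The function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(total, annotated, hits, hits_annotated):
    """Upper-tail hypergeometric p-value for annotation overlap.

    total          -- background variants tested
    annotated      -- background variants inside the annotation (e.g. enhancers)
    hits           -- associated variants
    hits_annotated -- associated variants inside the annotation

    Assumption: variants are exchangeable and independent.
    """
    if not (0 <= hits_annotated <= min(annotated, hits) and max(annotated, hits) <= total):
        raise ValueError("inconsistent counts")
    upper = min(annotated, hits)
    favourable = sum(
        comb(annotated, k) * comb(total - annotated, hits - k)
        for k in range(hits_annotated, upper + 1)
    )
    return favourable / comb(total, hits)


if __name__ == "__main__":
    # Hypothetical counts: 50 of 1,000 background SNPs lie in enhancers,
    # and 8 of 40 associated SNPs do.
    print(enrichment_pvalue(1000, 50, 40, 8))
```

Repeating the test across cell types, with a correction for the number of annotations tried, would point to the tissues in which the associated variants are most likely to act.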