Shirleny Romualdo Cardoso1, Andrea Gillespie1, Syed Haider1, Olivia Fletcher2.
Abstract
Genome-wide association studies coupled with large-scale replication and fine-scale mapping studies have identified more than 150 genomic regions that are associated with breast cancer risk. Here, we review efforts to translate these findings into a greater understanding of disease mechanism. Our review comes in the context of a recently published fine-scale mapping analysis of these regions, which reported 352 independent signals and a total of 13,367 credible causal variants. The vast majority of credible causal variants map to noncoding DNA, implicating regulation of gene expression as the mechanism by which functional variants influence risk. Accordingly, we review methods for defining candidate-regulatory sequences, methods for identifying putative target genes and methods for linking candidate-regulatory sequences to putative target genes. We provide a summary of available data resources and identify gaps in these resources. We conclude that while much work has been done, there is still much to do. There are, however, grounds for optimism; combining statistical data from fine-scale mapping with functional data that are more representative of the normal "at risk" breast, generated using new technologies, should lead to a greater understanding of the mechanisms that influence an individual woman's risk of breast cancer.
Year: 2021 PMID: 34741135 PMCID: PMC8980003 DOI: 10.1038/s41416-021-01612-6
Source DB: PubMed Journal: Br J Cancer ISSN: 0007-0920 Impact factor: 9.075
Definitions.
| SNP | Single-nucleotide polymorphism: variation at a single nucleotide in the DNA sequence; differs between individuals within a population. By definition, a polymorphism occurs at a frequency greater than 1% in the population. |
| Germline variation | Variants that are inherited from the parents and by definition, therefore, present in a reproductive cell (ovum or sperm) in one parent. |
| Somatic mutation | A variant that occurs de novo in somatic cells of an individual (all cells of the body except the gametes). |
| Copy-number variation | A type of structural variation; specifically, a duplication or deletion event that affects a considerable number of base pairs. |
| Cancer genes | Genes which, based on sequencing of matched “normal” (usually from blood) and tumour DNA, confer a growth advantage to the cancer cells due to somatic and/or germline mutations. |
| Linkage disequilibrium (LD) | The non-random association of alleles at different loci in a population; i.e., the correlation structure between individual variants that map proximal to each other and are, therefore, co-inherited. Linkage disequilibrium is population-specific. |
| GWAS | Genome-wide association study: a population-scale study in which variants that are specifically selected to capture common variation across the genome (through linkage disequilibrium) are genotyped in individuals with and without a phenotype of interest. |
| Fine-scale mapping | Fine-scale mapping refers to the process by which a GWAS association signal is refined. Specifically, at a given region, a dense panel of variants is selected to be genotyped or imputed and tested for association with outcome. |
| Credible causal variants (CCVs) | Originally defined in Udler et al. [ |
| Functional variant | A variant for which there is evidence (statistical and/or biological) of a causal association (rather than a correlative association, below) with outcome. |
| Correlated variant | A variant which is associated with outcome through correlation (by linkage disequilibrium) with a “functional” variant. |
| eQTL | Expression quantitative trait loci: genomic loci which harbour a variant/variants that show an association between genotype (AA/Aa/aa) and levels of expression of a gene (usually quantified as steady-state mRNA levels). |
| Intermediate phenotype | A quantitative biological trait reflecting the pathway to disease development. Sometimes used as a statistically efficient alternative to a disease outcome. |
| Cis-association | In the context of an eQTL, a cis-association is an association between genotype and levels of expression of a gene that maps proximal to the genetic variant. |
| Trans-association | In the context of an eQTL, a trans-association is an association between genotype and levels of expression of a gene that maps distal to, or on a different chromosome from, the genetic variant. |
| 3’ and 5’ UTR | Untranslated regions: UTRs map upstream of the first codon for translation (5’ UTR) and downstream of the last codon for translation (3’ UTR). The 5’ UTR is important for regulating translation and the 3’ UTR is important for post-transcriptional regulation of the gene. |
| Promoter | A DNA sequence that binds proteins (including RNA polymerase) that are required to initiate transcription; usually located at the 5’ end of the gene just upstream of the transcription start site. |
| Transcription start site (TSS) | The location at the 5’-end of a gene sequence at which transcription begins. |
| Splice donor and acceptor sites | Recognition sites for mRNA processing; donor-splice is the splicing site at the beginning of an intron (5’ end) and acceptor splice is the splicing site at the end of an intron (3’ end). |
| Enhancer | Regulatory DNA sequence that, when bound by transcription factors, increases gene transcription. Can act in an orientation-independent manner (i.e., an enhancer can be located upstream or downstream of the TSS) and can act over large distances (up to 1 Mb or possibly more). |
| Transcription factor (TF) | Sequence-specific DNA-binding proteins that regulate transcription of a gene by binding to enhancers or promoters. |
| eRNA | Enhancer-derived RNAs: non-coding RNA transcripts originating from genomic regions that carry active histone modifications (H3K27ac, H3K4me1, H3K4me3) indicative of an active enhancer element. eRNAs can be unidirectional or bidirectional. |
| Epigenetics | The study of changes in phenotypes caused by modification of gene expression rather than alteration of the genetic code itself. |
| Promoter hypermethylation | DNA methylation is an epigenetic modification of DNA in which methyl groups are added to the DNA. Methylation can change the activity of a gene without changing the sequence; in particular, hypermethylation of CpG islands that map 5’ to a gene promoter is associated with gene silencing. |
| Episomal | Autonomously replicating extrachromosomal DNA; in the context of the methods described in this review, the important point is that the DNA is not integrated into the genome. |
| Pluripotent stem cell | Cells that can self-renew and differentiate into any cell in the body. |
| Cell autonomous | Acting only within the cell in which the gene is expressed, as opposed to influencing the behaviour of surrounding cells. |
| Exome sequencing | Genomic sequencing of the exons in a genome. |
| 3C | Chromosome-conformation capture: a technique for analysing the spatial organisation of chromatin in the nucleus. 3C is a “one-by-one” technique testing for an excess of interactions between two pre-defined regions of interest. |
| Hi-C | Genome-wide version of 3C; the “all-by-all” technique for quantifying all possible pairs of interactions across the genome. |
| DNase-seq | A technique for identifying regions of open chromatin on the basis that nucleosome-depleted DNA at active regulatory regions (promoters and enhancers) is more sensitive to cleavage by DNase I, creating regions of DNase-I hypersensitivity. |
| FAIRE-seq | Formaldehyde-assisted isolation of regulatory elements: a technique for identifying regions of open chromatin on the basis that formaldehyde cross-linking is less efficient in active nucleosome-depleted DNA than in nucleosome-bound DNA. |
| ATAC-seq | Assay for transposase-accessible chromatin: a technique for identifying regions of open chromatin on the basis that a hyperactive transposase (Tn5) preferentially cleaves and tags (tagments) regions of open chromatin. |
| Active histone modifications | Histones can be post-translationally modified by methylation, phosphorylation, acetylation, ubiquitylation or sumoylation. Histone modifications are correlated with specific states of activity; acetylation of K27 and mono-methylation of K4 on histone H3 (H3K27ac and H3K4me1) are active enhancer marks, and tri-methylation of K4 on histone H3 (H3K4me3) is an active promoter mark. |
| CTCF | CCCTC-binding factor: a DNA-binding protein that performs a structural role in genome organisation. Depending on the context, CTCF can also recruit histone acetyltransferase-containing complexes or histone deacetylase-containing complexes and function as a transcriptional activator or repressor, respectively. |
| ESR1 | Oestrogen receptor 1: an oestrogen receptor and ligand-activated transcription factor. One of the transcription factors that define the transcriptome in oestrogen-receptor-positive breast cancer cells. |
| FOXA1 | Forkhead box A1: a pioneer factor that can directly bind condensed chromatin and recruit transcription factors (including ESR1 and GATA3) and histone-modification enzymes. One of three transcription factors that define the transcriptome in oestrogen-receptor-positive breast cancer cells. |
| GATA3 | GATA binding protein 3: a transcription factor originally identified in the regulation of T-cell development. One of three transcription factors that define the transcriptome in oestrogen-receptor-positive breast cancer cells. |
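Two of the definitions above, linkage disequilibrium and eQTL, can be illustrated numerically. The following is a minimal sketch, not a method taken from the review: it assumes genotypes are coded as allele dosages (0, 1 or 2 copies of the alternative allele), approximates LD as the squared Pearson correlation between dosages at two variants, and summarises an eQTL as the least-squares slope of expression on dosage. The function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a monomorphic variant has no defined correlation")
    return sxy / sqrt(sxx * syy)


def genotype_r2(dosage_a, dosage_b):
    """LD between two variants, approximated as r^2 of genotype dosages."""
    return pearson_r(dosage_a, dosage_b) ** 2


def eqtl_slope(dosage, expression):
    """Least-squares change in expression per copy of the alternative allele."""
    n = len(dosage)
    if n != len(expression) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_g, mean_e = sum(dosage) / n, sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosage)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosage, expression))
    return sxy / sxx


# Toy data (hypothetical): eight individuals, two nearby variants.
snp_a = [0, 0, 1, 1, 2, 2, 0, 1]
snp_b = [0, 0, 1, 1, 2, 2, 0, 2]
# Expression constructed to rise by 0.5 units per copy of the snp_a allele.
expression = [5.0 + 0.5 * g for g in snp_a]

print("r2(snp_a, snp_b):", round(genotype_r2(snp_a, snp_b), 3))
print("eQTL slope for snp_a:", round(eqtl_slope(snp_a, expression), 3))
```

Because snp_b is strongly correlated with snp_a, it would also show an association with expression in this toy sample, which is the situation the "correlated variant" definition describes.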
Resources.
| ENCODE | The Encyclopedia of DNA Elements (ENCODE) Consortium maintains a portal of publicly available epigenetic datasets from a wide range of assays for identification of functional and regulatory elements, including many variations of RNA-seq, ChIP-seq, DNase-seq and DNA methylation arrays. |
| Roadmap Epigenomics | The NIH Roadmap Epigenomics Mapping Consortium is a resource that comprises publicly available epigenomic data from primary cells generated using a number of methods, such as histone modification ChIP-seq, RNA-seq and DNA methylation assays. |
| Vierstra.org | Digital genomic footprinting providing a high-resolution genome-wide consensus transcription-factor footprint index in 243 human cell and tissue types. Accessible through the ENCODE portal and UCSC browser. |
| Descartes | Single-cell ATAC-seq and gene expression data generated in a broad range of human foetal tissues (53 samples representing 15 organs), to create an atlas of linked cell-type-specific enhancers and genes. |
| IHEC | The International Human Epigenome Consortium provides public access to high-resolution reference human epigenome maps via a data portal bringing together ENCODE, Roadmap Epigenomics, CEEHRC (Canadian Epigenetics, Environment and Health Research Consortium), and other data resources. It interfaces with UCSC, Ensembl and WashU browsers as well as Galaxy for data processing. |
| UCSC genome browser | This widely used browser has many tracks which are useful for annotation; multiple SNP and variant tracks as well as tracks for resources such as ENCODE-integrated regulation and GTEx gene expression. |
| Ensembl genome browser | An extensive resource of publicly available downloadable data along with a genome browser containing regulatory annotations, again including multiple ENCODE data tracks. |
| WashU Epigenome Browser | A browser specifically designed for epigenetic data; the usual SNPs, variation and ENCODE data are available, as well as additional epigenomic datasets from IHEC. |
| GTEx | The Genotype Tissue Expression project is a database of tissue-specific gene expression and regulation data with downloadable and browsable QTLs, levels of expression, H3K27ac ChIP-seq and DNA methylation data. |
| GEO | Gene Expression Omnibus is a public functional genomics data repository supporting Minimum Information About a Microarray Experiment (MIAME)-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles. |
| METABRIC | The Molecular Taxonomy of Breast Cancer International Consortium is a large dataset of breast tumours and matched normal tissue with clinical, gene expression, copy-number aberrations (CNA), and SNP data available via cBioPortal. |
| TCGA | The Cancer Genome Atlas is a conglomeration of over 20,000 primary tumours and matched normal tissue across 33 cancer types with datasets encompassing clinical, whole exome, whole genome, DNA methylation, gene expression, microRNA and proteomic profiles. |
| ICGC | International Cancer Genome Consortium is a collection of 86 cancer genome profiling projects, including datasets generated by the TCGA consortium. These datasets include clinical, whole exome, whole genome, DNA methylation, gene expression, microRNA and proteomic profiles. |
| PCAWG | The Pan-Cancer Analysis of Whole Genomes from ICGC and TCGA includes more than 2600 cancer whole genomes across 38 cancer types explored for somatic and germline variation with particular emphasis on non-coding RNAs, cis-regulatory sites and large structural alterations. The data portal contains somatic and germline mutations (controlled access), DNA methylation, gene expression and clinical data. |
| CCLE | The Cancer Cell Line Encyclopedia is a data portal including 1457 cancer cell lines encompassing gene and protein expression, DNA methylation, miRNA, mutation and CNA data. |
Methods for identifying putative target genes and functional variants.
| Method: summary | Advantages | Disadvantages |
|---|---|---|
| In silico alignment: Alignment of “local” genes and credible variants with markers of open chromatin, active histone marks and/or transcription factors. Reviewed in Klein and Hainer [ | High-throughput in silico analysis. Multiple data sources, widely available through, for example, ENCODE and the Roadmap Epigenomics Project (box 2). Primary cell data available through the Roadmap Epigenomics Project. Can be combined into an algorithm. | The relevant tissue and/or cell type is not necessarily known. Biased towards cell lines (MCF-7, MCF 10A and T-47D) and tissue (breast epithelium) rather than primary cells (Fig. Limited markers/TF in primary cells (Fig. By combining data sources, algorithms lose granularity; a weighting scheme can be used for different data types, but such schemes by definition require a series of assumptions about the hierarchy of data sources. |
| MPRA: Massively Parallel Reporter Assay [ Candidate-regulatory sequences (CRS) are placed upstream of a reporter gene driven by a minimal promoter and barcodes are inserted in the 3’UTR of the reporter gene. The activity of the CRS is measured by pairing its RNA expression to the transcribed barcodes. | High-throughput functional readout of CRS and variants within those sequences across the whole genome. | Limited to cells that can be easily transfected. The length of the sequences tested is restricted by the length of oligos that can be synthesised (~200 bp). Episomal assay. May be confounded by possible effects from promoter-binding proteins. |
| lenti-MPRA [ | Broadens the range of cells and tissue types that can be used, to include hard-to-transfect cell types. Barcodes cloned into the 5’ UTR to reduce the distance between the CRS and barcode and hence, the risk of CRS-barcode swapping. Integration of viral vector provides “in-genome” readout. Using on average >50 barcodes per CRS reduces the impact of binding of RNA-associated factors and RNA stability on the results. | The length of the sequences tested is restricted by the length of oligos that can be synthesised (~200 bp). May be confounded by possible effects from promoter-binding proteins. |
| STARR-seq [ CRS are cloned downstream of the reporter gene in the 3’UTR. The activity of the CRS is measured by comparing the amount of RNA produced relative to the amount of genomic DNA in the STARR-seq library. | The elimination of barcodes simplifies the library and allows screening of complex libraries. CRS are cloned rather than synthesised; CRS length is limited only by cloning efficiency, and a range of 150–1500 bp is possible. | Enhancer activity may be confounded by effects from the binding of RNA-associated factors and the stability of the assayed RNA sequence. Episomal assay. Limited applicability to mammalian genomes due to their size and complexity; has been applied to human cells using selected bacterial artificial chromosomes. |
| CapStarr-seq [ | Overcomes limited applicability to mammalian genomes by incorporating a sequence capture step to focus on regions of interest. | Enhancer activity may be confounded by effects from the binding of RNA-associated factors and the stability of the assayed RNA sequence. Episomal assay. |
| GRO-seq [ | Assesses transcriptional regulation and activity across the whole genome. Sensitive, with a resolution of 10 bp. Robust nascent transcriptome profiles, including short-lived enhancer RNAs. Capable of assessing RNAPI, RNAPII, and RNAPIII dynamics and processing properties. Generates precise quantification of promoter-proximal RNA polymerases. Low contamination of processed RNA. | Laborious assay. Requires a high input of cells (~1 × 10⁷). In vitro assay. Regulatory factors bound to the polymerase might be eliminated by the use of sarkosyl to prevent de novo initiation of transcription. |
| fastGRO-seq [ | More efficient in terms of assay time and cell input (0.5 × 10⁶ cells required). Can be used to analyse tissue and primary cells. Highly reproducible. Low contamination of processed RNA. | In vitro assay. |
| PRO-seq [ | High resolution (single nucleotide). Low contamination of processed RNA. | Laborious assay. Requires a high input of cells (~1 × 10⁷). In vitro assay. The RNA polymerase position at the beginning of transcription is mostly lost, so it may not generate a precise quantification of promoter-proximal RNA polymerases. |
| TTchem-seq [ | In vivo assay, based on metabolic labelling of RNA, which minimises any variability or cellular stress. 4SU labelling is relatively easy to perform and control, which is important when handling multiple samples. Highly reproducible. | Identification of regions of active transcription is limited to a resolution of 20–500 nucleotides, which is the RNA fragment size range obtained after fragmentation. High contamination of processed RNA. |
| eQTL [ | Direct test of genotype–phenotype association. Can test local (generally defined as ≤1 to 2 Mb) and distant (>1 to 2 Mb) genes. | The relevant tissue and/or cell type is not necessarily known. Limited availability of appropriate tissue and/or primary cell data, particularly large series of “normal” tissue/cells. Steady-state mRNA levels may not be the relevant phenotype. |
| Colocalization [ | Reduces false positives by comparing distributions of summary statistics (as opposed to individual variants). By using gene expression data from multiple tissues, can be informative regarding “causal tissues”. | Limited availability of appropriate tissue and/or primary cell data, particularly large series of “normal” tissue/cells. Steady-state mRNA levels may not be the relevant phenotype. |
| LDSC-SEG [ | Requires gene expression but not eQTL data (i.e., does not require genotypes to be associated with the gene expression). Can help to inform relevant tissue or cell type for in vitro experiments. | Assumes that driver genes will be relatively highly expressed in the most disease-relevant tissue types. LDSC-SEG additionally assumes that SNPs near such driver genes will be enriched for heritability. Limited by the availability of gene expression data in relevant tissues or cell types. Steady-state mRNA levels may not be the relevant phenotype. |
| Transcriptome-wide association studies (TWAS [ | Informative both for discovery (new risk loci) and for inferring target genes at “known” GWAS loci. Can help to inform relevant tissue or cell type for in vitro experiments. | Limited availability of appropriate tissue and/or primary cell data, particularly large series of “normal” tissue/cells. Steady-state mRNA levels may not be the relevant phenotype. |
| Comparison with somatically mutated cancer genes (boxes 1 and 2): in silico analysis of somatic variation in tumours using whole genome or exome sequences. | Provides robust evidence for a functional role in cancer either on an ad hoc basis or by comprehensively comparing genes that are local (generally within 1 Mb of a locus) with lists of somatically mutated genes. | Undermines the “discovery” aspect of GWAS; only provides confirmation that the concept of an unbiased GWAS approach is sound. |
| CHi-C [ | High throughput. Potentially two-sided (i.e., either GWAS loci or the promoters of putative target genes can be used as “baits”). Agnostic. | CHi-C interaction peaks will include interactions that are structural (e.g. driven by CTCF and/or cohesin) rather than regulatory. In situ CHi-C requires large numbers of cells (new Hi-C kits are reducing the numbers of cells required). Most data have been generated in cell lines, not primary cells, in part due to the requirement for large numbers of cells. Interaction peaks are defined by a viewpoint, i.e., linkage-disequilibrium blocks or promoters. |
| ChIA-PET [ | High throughput. Two-sided, but only when both ends of the interaction are captured (i.e., they both involve the TF or histone modification of choice). | ChIA-PET requires large numbers of cells; HiChIP less so, particularly with new HiChIP kits. Very little published data: ChIA-PET data generated in MCF-7 for ESR1, MCF-7, and POLR2A as part of ENCODE. Interaction peaks are defined by a viewpoint, namely the TF or histone modification used for the immunoprecipitation. |
| CRISPR-Cas9 | In-genome (as opposed to episomal) assay. The genome can be precisely manipulated by the CRISPR system’s ability to introduce specific changes. Relatively simple assay to design and perform. | Random modifications can occur in off-target sequences. It is not suitable for all cells; some do not use homology-directed repair (HDR) as their main repair pathway, and some cells are non-diploid due to genome instability. HDR efficiency is relatively low; for GWAS CCVs where a single base change is often required, base editing approaches may provide an alternative (reviewed in ref. [ |
| CRISPRi (CRISPR interference), CRISPRa (CRISPR activation [ | Highly specific assays; multiple target genes can be modulated simultaneously and the introduced genomic changes are potentially reversible. | Can be challenging to design sgRNAs proximal to the region of interest. It is important to design multiple sgRNAs for each target as they have variable efficiency. |
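The reporter assays in the table above (MPRA, lenti-MPRA, STARR-seq) share a readout that compares RNA output with DNA input. The sketch below illustrates that idea for a barcoded MPRA under several assumptions that are not taken from the review: barcode counts are summed per sequence, each library is scaled by its total count, a pseudocount of 1 is added, and activity is reported as a log2 RNA/DNA ratio. The function name, the record layout and the toy counts are hypothetical.

```python
from collections import defaultdict
from math import log2


def crs_activity(barcode_counts, pseudocount=1.0):
    """Per-sequence log2(RNA/DNA) activity from barcode-level counts.

    barcode_counts: iterable of (crs_id, dna_count, rna_count), one per barcode.
    Counts are summed across barcodes of the same sequence and scaled by
    library size (an illustrative normalisation, assumed here).
    """
    records = list(barcode_counts)
    total_dna = sum(dna for _, dna, _ in records)
    total_rna = sum(rna for _, _, rna in records)
    if total_dna == 0 or total_rna == 0:
        raise ValueError("both libraries need at least one read")

    dna_by_crs = defaultdict(float)
    rna_by_crs = defaultdict(float)
    for crs_id, dna, rna in records:
        dna_by_crs[crs_id] += dna
        rna_by_crs[crs_id] += rna

    activity = {}
    for crs_id in dna_by_crs:
        rna_fraction = (rna_by_crs[crs_id] + pseudocount) / total_rna
        dna_fraction = (dna_by_crs[crs_id] + pseudocount) / total_dna
        activity[crs_id] = log2(rna_fraction / dna_fraction)
    return activity


# Toy counts (hypothetical): the same sequence carrying either allele of a
# variant, each tagged with two barcodes.
counts = [
    ("ref", 100, 200),
    ("ref", 50, 100),
    ("alt", 100, 100),
    ("alt", 50, 50),
]
activity = crs_activity(counts)
allelic_effect = activity["ref"] - activity["alt"]
print({name: round(value, 3) for name, value in activity.items()})
print("ref minus alt (log2):", round(allelic_effect, 3))
```

Comparing the two alleles of a credible causal variant in this way is one route to the "functional readout of CRS and variants within those sequences" listed as an advantage of MPRA; using many barcodes per sequence, as in lenti-MPRA, makes the summed counts less sensitive to any single barcode.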
Fig. 1 Summary of data generated in breast-relevant cell lines, tissue and primary cells that are available through ENCODE and Roadmap Epigenomics.
Datasets that are available through (a) ENCODE and (b) Roadmap Epigenomics are summarised as bar plots. Different data types are colour-coded as indicated in the keys. The cell or tissue types in which the data were generated are shown on the x axis with the number of datasets available in each of these cell or tissue types on the y axis.