Emerging roles of macrosatellite repeats in genome organization and disease development.

Gabrijela Dumbovic¹, Sonia-V Forcales¹, Manuel Perucho¹,².

Abstract

Abundant repetitive DNA sequences are an enigmatic part of the human genome. Despite increasing evidence on the functionality of DNA repeats, their biologic role is still elusive and under frequent debate. Macrosatellites are the largest of the tandem DNA repeats, located on one or multiple chromosomes. The contribution of macrosatellites to genome regulation and human health was demonstrated for the D4Z4 macrosatellite repeat array on chromosome 4q35. Reduced copy number of D4Z4 repeats is associated with local euchromatinization and the onset of facioscapulohumeral muscular dystrophy. Although the role other macrosatellite families may play remains rather obscure, their diverse functionalities within the genome are being gradually revealed. In this review, we will outline structural and functional features of coding and noncoding macrosatellite repeats, and highlight recent findings that bring these sequences into the spotlight of genome organization and disease development.

Keywords: DNA repeats; epigenetics; genome organization; macrosatellite repeats; noncoding genome

Substances: DNA, Satellite

Year: 2017 PMID: 28426282 PMCID: PMC5687341 DOI: 10.1080/15592294.2017.1318235

Source DB: PubMed Journal: Epigenetics ISSN: 1559-2294 Impact factor: 4.528

Introduction

Recently, there has been substantial progress in understanding genome content and what is considered a functional DNA sequence, moving away from the classical dogma centered on protein coding genes. Even though repeats have been traditionally considered as junk DNA because their functionality was elusive, several repeat families have been recognized as important players in genome structure, evolution and diversity. Nevertheless, DNA repeats still remain one of the most puzzling components of the genome. As a constitutive part of the genome, these sequences are replicated and maintained through the individual's successive generations. They fulfill the concept of double helix “selfish” replicators, having their own survivability pressure during evolution. Leaving aside their own existence as individual “parasite” replicator entities, the scope of this review is to describe their structural and functional features in the context of their host human genome and their impact on disease development. The human genome contains a large portion of repetitive DNA. While the current version of RepeatMasker identifies around 56% of the human genome as repetitive, recent studies propose even higher numbers, with estimates of up to 69%. The vast majority of repeats are still poorly investigated due to extensive computational and experimental limitations. Repeat-rich regions are difficult to align and assemble, thus they are frequently absent from the reference genome or not placed in their correct genomic context. Furthermore, high-throughput genome-wide studies, such as ChIP-seq and RNA-seq, which became essential tools in molecular biology research, are limited in analyzing repeat-derived reads as they present ambiguities in alignment to the reference genome. For all these reasons, the study of the genomic implications of repeat alterations at both DNA and RNA level is a difficult task. 
Although these drawbacks have significantly hindered progress in understanding the role of repeats in genome stability and disease development, repeated DNA sequences are gaining attention as research on the noncoding genome steadily grows. Continuous work to illuminate the role of repeats in the genome has put forward new perspectives on the mechanisms by which they impact genome stability. Based on the pattern of their distribution, DNA repeats can be classified as interspersed repeats or tandem repeats (Fig. 1). Interspersed repeats are dispersed across the genome and include retro(pseudo)genes, tDNA, transposons, and local repeats. Tandem repeats are organized in a head-to-tail orientation and include ribosomal DNA (rDNA) and satellite repeats. Based on the size of each repeat unit, satellite repeats can be further divided into microsatellites, with units of 1 to 6–10 bp; minisatellites, with repeat units from 10 bp to a few hundred bp; and macrosatellites, with repeat units of several kb in length.
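The size-based classification of satellite repeats can be illustrated with a minimal sketch. The function name `classify_satellite` and the exact cut-offs are assumptions made for illustration only: the text gives fuzzy boundaries (1 to 6–10 bp, 10 bp to a few hundred bp, several kb), so the sketch arbitrarily treats units of up to 10 bp as microsatellites and units of 1 kb or more as macrosatellites.

```python
def classify_satellite(unit_bp: int) -> str:
    """Classify a tandem (satellite) repeat by the length of one repeat unit.

    The cut-offs are illustrative assumptions, not definitions from the review:
    <= 10 bp microsatellite, 11-999 bp minisatellite, >= 1000 bp macrosatellite.
    """
    if unit_bp < 1:
        raise ValueError("repeat unit length must be at least 1 bp")
    if unit_bp <= 10:
        return "microsatellite"
    if unit_bp < 1000:
        return "minisatellite"
    return "macrosatellite"


# Repeat unit lengths in bp: a CAG triplet, a hypothetical 60 bp unit,
# and the 3.3 kb D4Z4 unit listed in Table 1.
for name, unit in [("CAG", 3), ("hypothetical 60 bp unit", 60), ("D4Z4", 3300)]:
    print(name, "->", classify_satellite(unit))
```

Under these assumed cut-offs, every repeat family in Table 1 (unit lengths of 1.4 to 6.1 kb) falls into the macrosatellite class.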

Figure 1.

Repetitive DNA in the human genome. The diagram shows various classes of DNA repeats in the human genome, classified according to their pattern of occurrence.

It is estimated that satellite repeats cover around 3% of the human genome, with microsatellites being the most abundant. Both micro- and minisatellites display notable instability and dynamics. Microsatellites are altered with a relatively high frequency, both ontogenetically and especially phylogenetically, during DNA replication in mitosis due to slippage by strand misalignment. Minisatellites also exhibit extreme polymorphisms in the form of copy number, length, and sequence composition. Unlike microsatellites, minisatellites can undergo alterations during meiosis, which made them suitable for DNA fingerprinting and population studies. However, unlike the larger minisatellites, microsatellite-containing DNA fragments are usually small enough to be amplified by PCR, and hence microsatellites have almost completely replaced minisatellites as genetic markers. Changes in mini- and microsatellites correlate with various diseases, including cancer, and have been extensively studied. For instance, microsatellite instability (MSI) is the hallmark of hereditary non-polyposis colorectal cancer (HNPCC), and is also found in around 10% of non-hereditary (sporadic) colorectal cancers. Another example of the involvement of microsatellites in pathogenesis is the expansion of triplet repeat motifs, which is recognized as a cause of several neurologic and neuromuscular diseases. Triplet repeat expansion disorders include common inherited diseases, such as Huntington's disease, myotonic dystrophy, and fragile X syndrome. For example, expansion of the cytosine-adenine-guanine (CAG) repeat is the underlying cause of triplet repeat disorders collectively known as polyglutamine diseases, one of which is Huntington's disease. 
Extended CAG tracts are translated into a series of uninterrupted glutamine residues, which are prone to aggregation, thus causing cellular toxicity. Accumulating evidence also reveals interesting aspects of repeats in gene regulation. For instance, changes in the length of GA and CA microsatellite dinucleotide repeats in gene promoters were associated with differences in gene expression. Recently, this phenomenon was attributed to dinucleotide repeat motifs having an effect on enhancer activity. Dinucleotide repeat motifs are highly enriched in enhancers, particularly in those that are broadly active across different cell types. The importance of these motifs in enhancer function was demonstrated by inserting these repeat motifs in an inactive sequence that became a de novo active enhancer. Moreover, repeats were shown to have a role in the regulation of long noncoding RNA (lncRNA) expression, protein interactions, and subcellular location. For instance, the X-inactive specific transcript lncRNA (Xist), which inactivates the female inactive X chromosome, was demonstrated to recruit Polycomb repressive complex 2 (PRC2) through the use of its repeat motifs located at the 5′ end of Xist RNA, known as Repeat A region. Another repeat motif in the first exon of Xist RNA, known as Repeat C, was shown to be necessary for Xist RNA loading on the inactive X chromosome by binding YY1, which bridges the interaction of Xist RNA to DNA. The recently discovered long intergenic noncoding RNA Firre (Functional Intergenic Repeating RNA Element) is a strictly nuclear RNA that plays a role in adipogenesis by mediating trans-chromosomal interactions. A local repeating RNA domain (named RRD) in the lincRNA Firre was shown to act as a ribonucleic nuclear retention signal, without which the Firre RNA location shifts from nuclear to cytoplasmic. 
Other examples demonstrate that transposable elements can regulate the expression of lncRNAs, dividing them into cell type-specific classes and possibly regulating their evolution. Collectively, these data suggest an important and unforeseen role of distinct repeat classes as RNA and DNA regulatory elements. More recently, macrosatellite repeats (MSRs) have been emerging as unique structures in the human genome. Although MSRs are unrelated in sequence, they share some features (Table 1): they span in tandem over hundreds of kilobases, covering significant portions of the genome; they are rich in CpGs and thus often regulated by DNA methylation; and they frequently express noncoding and coding RNAs. Taken together, it is now well accepted that MSRs have a structural and regulatory role in the organization of the chromatin in the nucleus.

Table 1.

Main characteristics of some of the best-described macrosatellites in the human genome.

Name	Repeat length (kb)	CNV	Location (hg38)	GC content (%)	Methylation changes in disease	Associated disease	Encoded product	ncRNA	Refs.
D4Z4	3.3	1–150	4q35	71%	DNA hypomethylation	FSHD, ICF syndrome	DUX4	Long sense transcript (DBE-T), long sense and antisense transcripts originating within each repeat unit, siRNAs, miRNAs	^43-60,81
DXZ4	3	12–100	Xq23	62%				Long sense and antisense transcript and small antisense RNA	^61,65
NBL2	1.4	not determined	21p11.2	62%	DNA hypomethylation^*	Ovarian, colorectal, breast, gastrointestinal and hepatocellular cancer, neuroblastoma, ICF syndrome			^76-81,86
RS447	4.7	20–103	4p16.1	50%			USP17	Long antisense transcript	^89,91,92
RNU2	6.1	5–82	17q21-q22	65%				U2 snRNA	^36,98,99
TAF11-Like	3.4	10–98	5p15.1	50%		Possible role in schizophrenia	TAF11		^35,109
CT47	4.8	4–17	Xq24	48%			CT47		^110,111

* Besides their frequent hypomethylation in cancer, NBL2/SST1 repeats were reported in ovarian cancer and Wilms tumors to be more frequently hypermethylated at the HhaI site than hypomethylated at the NotI site.^77

Coding, noncoding, and architectural roles of macrosatellite repeats in genome organization and disease development

Epigenetic and/or genetic alterations of MSRs are associated with several human diseases, including cancer. The mechanistic contribution of MSRs to disease development has been analyzed in detail for some MSRs, such as D4Z4, while for others it is unclear to what extent they may contribute to disease development and genome stability. The number of macrosatellite repeats in a tandem array is known to vary between individuals, from several to hundreds of copies, contributing to significant copy number variation (CNV) that may be related to disease. The true copy number of many macrosatellites is probably underestimated. It has been shown that repetition of a sequence in tandem triggers automatic heterochromatinization in cis in a copy number dependent manner. In 1998 Garrick et al. demonstrated that a higher copy number of a transgene is associated with its hypermethylation and the adoption of a repressive local chromatin configuration, resulting in transcriptional silencing of the transgene. Conversely, reduction of the transgene number to just a few copies resulted in high transgene expression and a more accessible local chromatin structure. This repeat feature has been proposed to serve as protection against the consequences of parasitic sequence elements integrated in the genome in high copy number, namely viruses and transposons. It has also been proposed that during evolution this putative general feature of repeat elements was adapted to regulate expression of adjacent genes, mostly to induce silencing.

D4Z4 macrosatellite regulates local chromatin structure in a copy number dependent manner

In somatic cells, many MSRs display features of heterochromatin, with high DNA methylation and repressive histone marks, such as trimethylation of the lysine 9 residue of histone H3 (H3K9me3) and in some cases trimethylation of lysine 27 of histone H3 (H3K27me3). Accordingly, it has been shown that a high copy number of some MSRs can affect the chromatin structure in cis and in the immediately proximal regions, and thus contribute to genome stability by triggering gene silencing. D4Z4 is a 3.3 kb repeat located at the subtelomeric regions of chromosomes 4q35 and 10q26. It has been a major research focus due to the link of the D4Z4 array from chromosome 4q35 with the development of facioscapulohumeral muscular dystrophy (FSHD). FSHD is characterized by progressive wasting of muscles in the face, shoulders, and upper arms. The most common form of FSHD is FSHD1, accounting for 95% of cases. FSHD1 is an autosomal dominant disease, with the only detectable genetic defect being the reduction in the copy number of D4Z4 repeats to fewer than 11 units within the 4q35 subtelomeric repeat array, with the presence of at least 1 repeat unit necessary for FSHD1 development. Healthy individuals carry between 11 and 150 copies of D4Z4 that are characterized by highly methylated DNA and organized in a heterochromatic structure with H3K9me3 and H3K27me3. In FSHD1 patients, reduced copy number of D4Z4 repeats is accompanied by local loss of repressive marks and overexpression of surrounding genes. In rare FSHD cases, D4Z4 copy number is not altered and is within the normal range. However, mutations in the proteins SMCHD1 and DNMT3B, which are involved in heterochromatin formation at the D4Z4 locus, can cause the disease, further reinforcing the importance of heterochromatinization of the tandem repeat. This type of FSHD occurs in approximately 5% of patients and is referred to as FSHD2. 
FSHD2 is clinically identical to FSHD1 and both are characterized by a loss of heterochromatin at the D4Z4 locus and thus a de-repression of the region. D4Z4 repeats contain an open reading frame (ORF) coding for a double homeobox 4 gene (DUX4), usually silenced in normal tissues except for testis. The DUX4 protein has been shown to be pro-apoptotic, which could explain the muscle weakness observed in FSHD patients. Aberrant production of DUX4 protein requires an open chromatin conformation in addition to specific polymorphisms involved in RNA stabilization and processing. Expression of DUX4 in FSHD contributes to upregulation of germline genes, endogenous retrotransposons [long-terminal repeat (LTR) elements from MaLR class], RNA splicing and processing genes, atrophy-ubiquitin ligases, noncoding RNAs, and skeletal muscle suppressors of differentiation. Altogether, this DUX4-induced gene expression signature in FSHD is a major contributor to the disease's pathophysiology. There are several models for FSHD1 pathogenesis. D4Z4 is GC rich and displays features of a CpG island. Hence, it has been hypothesized that contraction of the array induces changes in chromatin structure leading to inappropriate transcriptional regulation of several FSHD candidate genes: either the DUX4 gene in the D4Z4 unit, genes adjacent to the D4Z4 tandem array, or genes that might be regulated by the D4Z4 region in trans. In this context, 3C analysis of this region revealed specific chromatin contacts between DUX4 and adjacent genes in normal conditions, whereas in FSHD other interactions take place, probably as a result of global reorganization of the 4q35 region upon D4Z4 contraction. Although there is disagreement on which of the FSHD candidate genes causes FSHD1, it is clear that the presence of D4Z4 contracted alleles is essential for disease development, indicating an important role of D4Z4. 
Recently, in efforts to explain the mechanism underlying the epigenetic changes at the contracted D4Z4 array in FSHD1, Cabianca et al. demonstrated that upon D4Z4 copy number reduction, the effect of Polycomb silencing is reduced, resulting in expression of long sense RNAs originating in a distal region of the D4Z4 array and extending through multiple repeats. Although D4Z4 repeats contain DUX4 ORFs, these long sense transcripts were nuclear and associated with the chromatin in cis, and thus they were considered noncoding and accordingly named D4Z4 binding element transcript (DBE-T). DBE-T recruits the Trithorax group protein Ash1L to D4Z4 repeats in cis, resulting in H3 lysine 36 dimethylation and long-range gene upregulation (Fig. 2). This study was the first to show how a macrosatellite repeat-derived long noncoding RNA can alter chromatin composition in cis, an important step in understanding macrosatellite biology and its causative role in disease. In addition, it demonstrated a clear association between repeat number reduction and production of a long noncoding RNA.
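The copy-number ranges given above for the 4q35 D4Z4 array (11 to 150 units in healthy individuals, 1 to 10 units in FSHD1) can be written as a small lookup. This is only an illustrative sketch: the function name `d4z4_array_status` and the returned labels are hypothetical, and a contracted array alone is not a diagnosis, since DUX4 production also depends on the permissive polymorphisms and chromatin state discussed in the text.

```python
def d4z4_array_status(units: int) -> str:
    """Label a 4q35 D4Z4 array by repeat-unit count, using the ranges in the text.

    Labels are hypothetical and illustrative, not a diagnostic rule.
    """
    if units < 0:
        raise ValueError("repeat-unit count cannot be negative")
    if units == 0:
        # The text states that at least one unit is required for FSHD1.
        return "no D4Z4 unit"
    if units <= 10:
        return "contracted (range reported in FSHD1)"
    if units <= 150:
        return "normal range"
    return "above the reported range"


for n in (0, 4, 10, 11, 150):
    print(n, "->", d4z4_array_status(n))
```

The sketch deliberately ignores FSHD2, in which the copy number lies within the normal range and the loss of heterochromatin has a different cause.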

Figure 2.

D4Z4 regulates local chromatin structure and expression of surrounding genes via long noncoding RNA. Healthy individuals carry between 11 and 150 copies of the D4Z4 macrosatellite in the subtelomeric region of chromosome 4 (4q35). D4Z4 repeats are highly methylated and enriched in H3K9me3. D4Z4 repeats are targets of Polycomb group proteins (PcG), with a resulting repressive chromatin structure and surrounding genes in a transcriptionally silent state. In patients with facioscapulohumeral dystrophy 1 (FSHD1) there is a reduction of D4Z4 copy number to between 1 and 10 copies. The contracted allele loses heterochromatin features (DNA methylation, H3K9me3, H3K27me3) and expresses a long noncoding RNA, DBE-T, from a promoter distal to the repeat array, that binds and recruits the Trithorax group protein Ash1L in cis, resulting in H3 lysine 36 dimethylation and long-range gene upregulation. In addition, transcription within each repeat unit was reported to occur bidirectionally, indicated by red (sense transcription) and blue (antisense transcription) arrows. Sense and antisense transcripts originate from promoters mapped upstream and downstream of the DUX4 ORF, respectively, and are transcribed through multiple D4Z4 repeat units. Those transcripts are suggested to give rise to small ncRNAs. Model design based on refs. 57, 59 and 60.

Upon D4Z4 epigenetic de-repression, long transcripts through multiple D4Z4 repeat monomers in both sense and antisense directions have also been detected. In contrast to DBE-T, these transcripts originate at each repeat unit either near the DUX4 promoter (for sense transcription) or at a distal region of the DUX4 ORF (for antisense transcripts). These sense and antisense transcriptions modulate DUX4 expression, and give rise to siRNA and miRNA which contribute to epigenetic silencing of the locus, opening new avenues to potential therapeutic approaches. 
Considering that many MSRs show significant CNV between individuals, the D4Z4 study encourages further research toward unveiling whether similar mechanisms may occur with other MSR types. Brahmachary et al. provided further evidence for a strong correlation between macrosatellite copy number and epigenetic modifications, and in several cases with nearby gene expression, supporting the hypothesis of repeat-induced gene silencing as a mechanism of gene regulation in humans. Their research included MSat10, a relatively unknown GC-rich 5.4 kb macrosatellite repeat, located several kb distal to the ZFP37 gene on chromosome 9q32. Similar to D4Z4, high copy number of MSat10 associates with high local DNA methylation and H3K9me3. On the other hand, reduction in copy number leads to the loss of heterochromatic features and de-repression of the adjacent ZFP37 gene, although in this case generation of a lncRNA was not reported.

DXZ4 macrosatellite regulates higher order nuclear architecture

DXZ4 is a 3 kb, CpG-rich macrosatellite present in 12 to 100 tandem copies on chromosome Xq23. Giacalone et al. discovered DXZ4 in 1992 as a novel X-linked variable number tandem repeat (VNTR) harboring different DNA methylation levels on the active and inactive X chromosome. In mammalian females, one of the two X chromosomes is subjected to a process known as X chromosome inactivation to ensure similar levels of expression of X-linked genes compared to males. Thus, females have one active X chromosome (Xa), and one inactive X chromosome (Xi). Xi is transcriptionally silenced, characterized by facultative heterochromatin and organized in a 3D configuration within the nucleus, known as the Barr body. Since its discovery, DXZ4 has drawn attention because it adopts alternative chromatin states on the Xa and Xi chromosomes, which differ from the surrounding chromatin. On the Xa in males and females, DXZ4 is organized in constitutive heterochromatin, characterized by the presence of H3K9me3 and DNA hypermethylation. In contrast, DXZ4 harbors the opposite chromatin structure on the Xi: DNA hypomethylation, H3K4me2 and H3K9Ac, hallmarks of euchromatin. Due to the lack of active histone marks on the rest of the Xi chromosome, DXZ4 can be visualized on the metaphase Xi chromosome by immunofluorescence as an intense H3K4me2 signal, surrounded by heterochromatin. On the Xi, but not on the Xa chromosome, DXZ4 is bound by CTCF, a highly conserved multifunctional DNA-binding protein implicated in multiple processes throughout the genome, including chromatin insulation and interchromosomal interactions. In 2008 Chadwick demonstrated that the CTCF-binding region of DXZ4 unidirectionally interferes with promoter-enhancer communication, supporting the hypothesis that the DXZ4 repeat array might act as an insulator. On a transcriptional level, the authors demonstrated that DXZ4 is expressed from a bi-directional promoter located within each repeat unit. 
Their analysis revealed long sense transcripts originating from Xa and Xi arrays, and a long antisense transcript specific to the Xi. Moreover, small antisense RNAs originate from four specific regions of DXZ4, and the site of origin of three of them overlaps precisely with the H3K9me3 and H3K4me2 peaks. This led them to speculate that the small RNAs are involved in heterochromatin formation and maintenance at the DXZ4 locus (Fig. 3).

Figure 3.

DXZ4 plays a role in genome organization and Xi chromosome higher order structure. DXZ4 harbors opposite chromatin structures on the Xa and Xi chromosomes, which differ from the surrounding chromatin. On Xi, DXZ4 displays features of euchromatin (hypomethylated CpGs and H3K4me2) and is bound by CTCF. Arrays on the Xa and Xi chromosomes are transcriptionally active; however, on Xa DXZ4 is transcribed into a long sense transcript and four small antisense RNAs, 3 of which overlap H3K9me3 peaks. On the Xi chromosome, DXZ4 is transcribed into long sense and antisense transcripts. DXZ4 on the Xi chromosome is necessary for regulating Xi higher order structures. Black dots represent methylated CpGs, white dots unmethylated CpGs. Model design based on ref. 65.

For a long time it has been hypothesized that DXZ4 might have a role in X chromosome inactivation and/or chromatin organization, especially considering it is bound by CTCF on the Xi chromosome. The first observation that DXZ4 does indeed participate in higher order structure organization on the Xi chromosome was made by Horakova et al. By applying DNA FISH and 3C analysis, the authors demonstrated Xi-specific long-range interactions between DXZ4 and two newly described tandem repeats, named X56 and X130, in a CTCF-dependent manner. Recent studies using chromosome conformation capture approaches confirmed those results and revealed that human and mouse Xi chromosomes are split into two large superdomains separated by a region containing DXZ4 repeats. Rao et al. also reported that the Xi chromosome forms very large chromatin loops called superloops, with some of them anchored at the DXZ4 macrosatellite. Two recent studies provide strong evidence of DXZ4 having an essential role in the regulation of Xi higher order structure and nuclear organization. 
By applying genome-wide chromosome conformation capture analysis, authors found that deletion of DXZ4 from Xi chromosome led to the loss of bipartite structure of Xi, disrupted superloops anchored at DXZ4, and induced changes in compartmentalization of the nucleus and in chromatin marks.

NBL2 macrosatellite is hypomethylated in many types of cancer

Some MSRs, such as NBL2, have been shown to be frequently hypomethylated in various types of cancer. NBL2 is a 1.4 kb macrosatellite repeat found mostly on the short arm of acrocentric chromosomes 13, 14, 15 and 21^76,77 (intriguingly, not 22), and belongs to a family of macrosatellite repeats known as SST1. NBL2 is CpG-rich and highly methylated in somatic cells. Thoraval et al. discovered NBL2 in 1996 during genome-wide screening for DNA methylation differences in neuroblastoma tumors compared with normal cells by 2D separations of human genomic restriction fragments. They digested DNA from neuroblastoma cells and peripheral blood lymphocytes with the methylation-sensitive NotI restriction enzyme and two additional cutters, and labeled NotI-derived 5′ ends with ³²P. Among fragments appearing hypomethylated at the NotI site they found a previously unreported repeat family, which they named NBL2 and whose sequence was submitted to the EMBL database under accession number U59100. By applying the same approach, in 1999 Nagai et al. independently found the same sequence hypomethylated in 75% of hepatocellular carcinomas, especially in those with hepatitis B virus infection, and named it NotI repeat (submitted to the EMBL database under accession number Y10751). The majority of NotI sites of the human genome lie within CpG islands. Because NBL2 is CpG rich and contains a NotI site, it was suitable for detection with 2D separations of human genomic fragments with the NotI restriction enzyme. Thus far, in addition to neuroblastoma and hepatocellular carcinoma, NBL2 was found to be hypomethylated at NotI sites in high risk gastrointestinal tumors, in bladder cancer, and in patients with immunodeficiency, centromeric instability, and facial abnormalities (ICF) syndrome, who carry mutations in the DNMT3B gene. Additionally, it was also found strongly hypomethylated in sperm. In ovarian cancer and Wilms tumors, NBL2 was reported to be more frequently hypermethylated at the HhaI site than hypomethylated at the NotI site. 
Previous work in our group also identified a prominent hypomethylated sequence in colon cancers by the methylation sensitive amplified fragment polymorphism (MS-AFLP) DNA fingerprinting technique. The sequence was later identified as SST1/NBL2.^86 In-depth analysis of NBL2 hypomethylation by bisulfite sequencing of an internal 317 bp region, containing a NotI site, showed that SST1/NBL2 was hypomethylated in 22% of colorectal cancers (CRCs), in 15% of gastric cancers, in 20% of ovarian cancers, and in 20% of breast cancers. Thus, alterations in NBL2 methylation are characteristic of many cancer types. Nevertheless, progress in understanding either the cause or the consequence of this hypomethylation has been slow, mostly because NBL2 is not assembled in the reference genome. With the release of genome version hg38, a group of NBL2 repeats was mapped to chromosome 21p11.2, although some other genomic NBL2 loci still remain unassembled. Consequently, the genomic context of NBL2, such as the distance to protein coding genes and other regulatory features, is not known. Despite all these challenges, we could determine that in CRC, somatic demethylation of NBL2 was associated with genomic damage assessed by arbitrarily primed PCR (AP-PCR), especially in tumors with wild type TP53. Furthermore, in CRC cell lines and primary tumor samples NBL2 hypomethylation is accompanied by local changes in chromatin structure (Fig. 4). In normal somatic cells, NBL2 displays features of constitutive heterochromatin, with high levels of DNA methylation and H3K9me3. However, upon hypomethylation, H3K9me3 levels are decreased, accompanied by a gain in the Polycomb repressive mark H3K27me3, typical of facultative heterochromatin. Both chromatin states observed at the NBL2 region are considered to form stable chromatin; however, there are important differences in the plasticity of the two states.

Figure 4.

NBL2 macrosatellites are frequently hypomethylated in colorectal cancer (CRC). In normal colon epithelium, NBL2/SST1 repeats display features of constitutive heterochromatin, with high levels of DNA methylation and H3K9me3. In CRC, NBL2/SST1 repeats undergo gradual hypomethylation during aging associated with wild type TP53. Some CRC patients harbor strongly hypomethylated NBL2/SST1 repeats, which implicates mechanisms other than aging and preferentially occurs in mutated TP53 tumors. Hypomethylation of NBL2 results in reprogramming of the NBL2 chromatin state from constitutive heterochromatin to facultative heterochromatin characterized by a gain in H3K27me3. NBL2 DNA hypomethylation is linked to increased genomic damage in cases with wild-type TP53. Black dots represent methylated CpGs, white dots unmethylated CpGs. Model design based on ref. 86.

Moreover, a detailed bisulfite analysis revealed two types of NBL2 hypomethylation: moderate hypomethylation, with 5–10% average hypomethylation in tumor compared with adjacent normal tissue, and severe hypomethylation, with 10% or more average hypomethylation in tumor compared with adjacent normal tissue. While moderate hypomethylation of NBL2 in CRC patients appeared age-dependent, the severe cases tended to occur in younger patients. Therefore, in those severe cases, NBL2 hypomethylation could be caused by mechanisms other than gradual, stochastic erasure of methylation patterns during aging. The precise mechanism that may causally link NBL2 somatic demethylation and chromosome instability remains to be established. In this context, a chromatin remodeler enzyme called helicase lymphoid specific (HELLS), which is known for its role as the “epigenetic guardian of repetitive elements”, was found to associate with methylated NBL2 in cell lines. Furthermore, downregulation of HELLS resulted in NBL2 hypomethylation. 
The mechanism by which altered HELLS function or impaired HELLS recruitment to NBL2 loci could contribute to the somatic demethylation of NBL2 before and/or during CRC development is under investigation.
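The two categories of NBL2 hypomethylation described above (moderate: 5–10% average hypomethylation in tumor relative to adjacent normal tissue; severe: 10% or more) amount to a simple threshold rule. The sketch below is an illustration under stated assumptions, not the authors' analysis pipeline: the function name is hypothetical, the inputs are assumed to be average methylation percentages of the analyzed region, the difference is assumed to be in percentage points, and the label "not hypomethylated" for differences below 5% is an assumption, since the text does not name that category.

```python
def nbl2_hypomethylation_class(normal_pct: float, tumor_pct: float) -> str:
    """Classify NBL2 hypomethylation from average methylation levels (in %).

    normal_pct: average methylation in adjacent normal tissue
    tumor_pct:  average methylation in the matched tumor
    Thresholds follow the text; the input format is an assumption.
    """
    for value in (normal_pct, tumor_pct):
        if not 0.0 <= value <= 100.0:
            raise ValueError("methylation must be a percentage between 0 and 100")
    loss = normal_pct - tumor_pct  # percentage points lost in the tumor
    if loss >= 10.0:
        return "severe"
    if loss >= 5.0:
        return "moderate"
    return "not hypomethylated"  # assumed label; not defined in the text


# Hypothetical normal/tumor pairs, for illustration only.
for normal, tumor in [(80.0, 78.0), (80.0, 73.0), (80.0, 60.0)]:
    print(normal, tumor, "->", nbl2_hypomethylation_class(normal, tumor))
```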

RS447 macrosatellite codes for ubiquitin-specific protease 17

In addition to their noncoding functions, some macrosatellites have been shown to encode functional proteins. RS447 is a 4.7 kb macrosatellite present on chromosome 4p15, with several copies also on chromosome 8p. RS447 repeat units display promoter activity and contain the USP17 (ubiquitin-specific protease 17) gene, which codes for a functional deubiquitinating enzyme. USP17 removes ubiquitin from target proteins, and it has been shown to play important roles in tumor pathology. Several reports have shown that USP17 acts as a critical regulator of cell proliferation, migration and survival by regulating the Ras pathway. In 2011 de Vega et al. demonstrated that USP17 depletion blocks chemokine-induced subcellular relocalization of Cdc42, Rac and RhoA, GTPases that are essential for cell motility, thus demonstrating that USP17 has a critical role in cell migration. Okada et al. performed a pedigree analysis of RS447 transmission, and found that RS447 copy number is highly variable, ranging from 20 to 103 copies. Furthermore, they showed a high frequency (8.3%) of meiotic instability and somatic mosaicism. Because USP17 forms a part of RS447 repeats, a difference in the copy number of RS447 could result in altered USP17 expression levels and thus possibly affect several cellular processes. However, using cosmid vectors containing different numbers of RS447 repeats, Saitoh et al. demonstrated that the level of RS447 sense transcripts and USP17 protein was independent of the integrated copy number of RS447, while the abundance of a high molecular weight RS447 antisense transcript was proportional to it. The process of antisense transcripts regulating the expression of their complementary sense transcripts on a transcriptional or post-transcriptional level has been recognized as a mechanism of antisense-mediated gene regulation. 
This opens the possibility that the copy number dependent large RS447 antisense transcript may act as a suppressor of the sense transcripts, thus buffering the difference in the repeat copy number. Okada et al. also reported that the RS447 allele can be partially methylated. It is probable that a combination of antisense transcripts and DNA methylation regulates the levels of USP17 protein in a copy number dependent manner.

RNU2 macrosatellite encodes a housekeeping small noncoding RNA

RNU2 is a 6.1 kb macrosatellite present in 5 to 82 tandem copies on chromosome 17q21-q22. Although RNU2 arrays differ in repeat copy number from individual to individual, the arrays are stably inherited. Each RNU2 unit encodes a housekeeping noncoding RNA, U2 small nuclear RNA (U2 snRNA). U2 snRNA is ubiquitously expressed and is an essential component of the RNA splicing machinery (the spliceosome). Every repeat unit contains snRNA transcriptional control elements (a TATA-less promoter/enhancer and a 3′ end formation signal), five Alu elements, one LTR retrotransposon and a polymorphic CT microsatellite tract, making it an example of a microsatellite repeat embedded within a macrosatellite. Repeat units within an individual RNU2 tandem array appear to be identical, except for the CT microsatellite, which exhibits minor length and sequence polymorphisms. Various roles have been proposed for the CT microsatellite: a requirement in DNA recombination for the concerted evolution of RNU2 repeats; establishment and/or maintenance of U2 tandem arrays; and maintenance of an open chromatin structure. Importantly, the nearest gene to the RNU2 tandem array is BRCA1, located 124 kb away. Both loci lie within the same linkage disequilibrium block, which allowed the RNU2 macrosatellite mutation rate to be calculated by tracing BRCA1 mutations in different families. This gave a maximum likelihood estimate of 5 × 10⁻³ mutations per generation, which is close to that of microsatellites. The RNU2 macrosatellite is evolutionarily conserved through speciation in baboon, orangutan, gorilla and chimpanzee. Mutations in one copy of the U2 snRNA genes have been shown to cause splicing alterations that lead to neurodegeneration in mice; however, whether mutations in human RNU2 genes or CNV in this array may contribute to disease is not known. Notably, a 3′ fragment of U2 small nuclear RNA, miR-U2-1, could be a marker for non-small cell lung carcinoma.
Because copy number differs between individuals, RNU2 genes must be subject to some form of dosage compensation, although the mechanism is still not clear. There is evidence for a bimodal pattern of RNU2 methylation: the 1.5 kb region covering the U2 transcriptional control elements, the U2 snRNA gene sequence and the CT microsatellite is completely unmethylated, whereas the rest of the U2 repeat (approximately 4.6 kb) is heavily methylated. The authors propose that this type of bimodal methylation may permit both efficient expression of U2 snRNA and stable maintenance of U2 tandem arrays in somatic cells.
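To put the RNU2 figures quoted above into perspective, the following minimal Python sketch converts the reported copy number range into array lengths and shows what a per-generation mutation rate of 5 × 10⁻³ would imply over several transmissions. The function names are illustrative, and the assumption that mutations arise independently at a constant rate in every generation is a simplification that is not taken from the original studies.

```python
RNU2_UNIT_KB = 6.1          # repeat unit length given in the text
RNU2_MUTATION_RATE = 5e-3   # mutations per generation given in the text


def array_size_kb(copies: int, unit_kb: float = RNU2_UNIT_KB) -> float:
    """Total array length in kb for a given number of tandem copies."""
    return copies * unit_kb


def prob_unchanged(generations: int, rate: float = RNU2_MUTATION_RATE) -> float:
    """Chance that an array passes unchanged through `generations` transmissions.

    Assumption (not from the text): mutations are independent events that
    occur at a constant per-generation rate.
    """
    return (1.0 - rate) ** generations


if __name__ == "__main__":
    for copies in (5, 82):
        print(f"{copies} copies -> {array_size_kb(copies):.1f} kb")
    for n in (1, 10, 100):
        print(f"{n} generations -> P(unchanged) = {prob_unchanged(n):.3f}")
```

Under these assumptions, the reported range of 5 to 82 copies corresponds to arrays of roughly 30 to 500 kb, and an array would still be transmitted unchanged through ten generations with a probability of about 0.95, which is in line with the description of the arrays as stably inherited.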

TAF11-Like macrosatellite codes for a TAF family factor

This tandem array is located on the short arm of human chromosome 5, at 5p15.1. Each repeat unit is approximately 3.4 kb in length and contains a short open reading frame of 594 bp encoding a predicted TATA binding associated factor 11-like protein, which gives the macrosatellite its name. The repeat unit also contains an LTR retrotransposon (MLT1E3), a disrupted DNA transposon (Charlie 2a) and a partial Alu repeat. Pulsed field gel electrophoresis (PFGE) and Southern blot analysis with a probe specific to the TAF11-Like macrosatellite revealed that the size of the TAF11-Like array is very polymorphic in the population, ranging from 34 to 335 kb, which indicates that the copy number of the TAF11-Like macrosatellite can range between 10 and 98 tandem copies. The closest genes to the TAF11-Like macrosatellite are brain abundant, membrane attached signal protein (BASP1) at 210 kb and cadherin 18 type 2 preproprotein (CDH18) at over 1.8 Mb. However, whether TAF11-Like macrosatellite length or its epigenetic status could influence the expression of these genes or contribute to disease is not clear. One study showed that alleles contracted to fewer than 21 tandem repeats are associated with schizophrenia, which could support a contributory role in the disease, perhaps in a regulatory manner similar to what has been shown for contracted D4Z4 macrosatellites and facioscapulohumeral dystrophy. However, other schizophrenia families did not show a contracted TAF11-Like macrosatellite array according to quantitative PCR (qPCR) analysis, and therefore the results on a possible contribution of TAF11-Like to schizophrenia were inconclusive. Nevertheless, the authors hypothesize that a low monomer number in one 5p15.1 allele may be masked by the uncontracted allele when analyzed by qPCR, since the average repeat number of both alleles may be higher than 21.
These results highlight that qPCR, which reports the combined repeat number of both alleles, may not be sensitive enough to measure differences in allele size of tandemly repeated DNA. Expression of TAF11-Like RNA has been detected in testes, brain and fetal tissues from brain, liver and prostate; however, the biologic significance of TAF11-Like expression remains elusive. Nevertheless, the TAF11-Like ORF sequence is conserved in primates, revealing a translated 198 amino acid sequence with 90.9 to 96% identity in great apes and 86.4% identity in macaque (199 amino acids). Furthermore, several peptides from the putative protein [accession A6NLC8 in the PRoteomics IDEntifications (PRIDE) database] can be detected by mass spectrometry in different analyses. These data point toward a functional TAF11-Like protein, but more research is required to fully understand its functionality.
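The copy number arithmetic and the qPCR masking argument above can be illustrated with a short Python sketch. It assumes a repeat unit of exactly 3,400 bp (the text gives "approximately 3.4 kb"), counts only whole repeat units, and models the qPCR readout as the mean repeat number of the two alleles. The function names and the example allele sizes are hypothetical and are not taken from the cited study.

```python
TAF11L_UNIT_BP = 3_400        # approximate repeat unit length from the text
CONTRACTION_THRESHOLD = 21    # alleles below this were linked to schizophrenia


def copies_from_size_bp(array_bp: int, unit_bp: int = TAF11L_UNIT_BP) -> int:
    """Number of whole repeat units that fit in an array of the given size."""
    return array_bp // unit_bp


def qpcr_mean_copies(allele_a: int, allele_b: int) -> float:
    """Simplified qPCR readout: mean repeat number across both alleles."""
    return (allele_a + allele_b) / 2


def contraction_masked(allele_a: int, allele_b: int,
                       threshold: int = CONTRACTION_THRESHOLD) -> bool:
    """True if one allele is contracted but the two-allele mean hides it."""
    has_contracted_allele = min(allele_a, allele_b) < threshold
    return has_contracted_allele and qpcr_mean_copies(allele_a, allele_b) >= threshold


if __name__ == "__main__":
    # PFGE size range reported in the text: 34 to 335 kb
    print(copies_from_size_bp(34_000), copies_from_size_bp(335_000))
    # Hypothetical genotype: one contracted allele (15) and one long allele (60)
    print(qpcr_mean_copies(15, 60), contraction_masked(15, 60))
```

In this hypothetical genotype, the contracted 15-copy allele is hidden behind a mean of 37.5 copies, whereas an allele-resolving method such as PFGE with Southern blotting would report the two allele sizes separately.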

CT47, a macrosatellite with testis-specific expression

The cancer/testis gene CT47 is located on Xq24, arranged in 4 to 17 tandem repeats of 4.8 kb each. CT47 RNA is putatively coding: it is 1,286 bp in length [excluding the poly(A) tail], with a coding region of 867 bp encoding a protein of 288 amino acids. Chimpanzee is the only species other than human with a gene homologous to CT47, which is also located on chromosome X. The predicted protein is approximately 80% identical in its carboxy terminal region between the two species. CT47 is highly expressed only in testis; low levels are detectable in placenta and brain, whereas it is silenced during early development and in the other normal somatic tissues tested. CT47 expression was detected in 14% of lung cancer, 15% of esophageal cancer and 11% of endometrial cancer specimens, but not in the colorectal, breast and bladder tumors tested. In normal somatic cells, CT47 is organized in heterochromatin, characterized by high levels of H3K9me3 and H3K27me3 and by methylated CpG sites around the CT47 promoter, and CT47 is silenced. Balog et al., encouraged by the clear correlation between D4Z4 copy number and local heterochromatin formation, studied whether reduced CT47 copy number would result in a loss of heterochromatin features and, consequently, in CT47 expression. Their results indicate that within the tested copy number range (4 to 17 CT47 copies), CT47 copy number does not correlate with local H3K9me3 and H3K27me3 levels, thus arguing against a direct link between repeat copy number and heterochromatinization. However, since the lowest CT47 copy number analyzed was four, the authors hypothesize that each MSR might have a minimal number of repeat units that would ensure proper heterochromatin formation. Small cell lung cancer (SCLC) cell lines that express CT47 (albeit at much lower levels than detected in testis) showed a decrease in H3K9me3 and demethylation of CpG sites near the transcription start site, which suggests that loss of heterochromatin features at the CT47 array may result in CT47 expression.
However, neither the causes of heterochromatin loosening at the CT47 array nor a possible contribution of CT47 RNA or protein to SCLC was studied.

Conclusion and future perspectives

Macrosatellite repeats, along with many other repeats, remain poorly investigated and thus are considered the “dark matter” of the human genome. They span relatively large stretches of the genome; however, technical limitations, together with the view that those sequences were functionally irrelevant (junk or garbage DNA), led to a considerable neglect of repeats as valuable components of human genetic material. With massive sequencing technologies, many noncoding regions of the genome have been discovered to be transcribed and to play distinct roles in cell biology. These discoveries have revolutionized the field, and the previous dogma of what is considered a functional genomic region has changed, leading to the acceptance of repeats as important architectural DNA building blocks and functional components of the transcriptome. Nevertheless, we are still largely unaware of many macrosatellite features such as their precise location, sequence composition, epigenetic regulation, copy number variation, transcription, and function. Substantial efforts are being made to develop strategies that would overcome the obstacles in aligning next generation sequencing data and in de novo genome assembly of these regions. Longer read lengths will reduce difficulties related to repeat alignment and thus fuel a more thorough analysis of those regions, which may lead to breakthrough discoveries. Increasing evidence suggests that macrosatellites are unique regulatory sequences, each of them with distinct functions. Macrosatellites encompass coding, noncoding and structural roles in the genome and seem to undergo frequent epigenetic and genetic alterations in disease. Their sequence complexity and long repeat nature may allow macrosatellites to play significant roles in genome architecture, organization and regulation, representing an additional layer to fine-tune complex and dynamic regulatory networks of the genome.
One mechanism by which they may accomplish those roles is by anchoring chromatin remodeling complexes and transcription factors to form loops that regulate higher order genome architecture, as was demonstrated with DXZ4 repeats. Other mechanisms include the generation of noncoding RNAs that modulate local transcription and chromatin formation, as was shown for the D4Z4 macrosatellite. Furthermore, some parallels exist between several macrosatellite families. Those commonalities are most notable between D4Z4 and DXZ4, since both are well studied, and include the presence of internal promoters, bidirectional transcription, and the generation of long and short ncRNAs. This raises the interesting question of whether these noncoding transcripts play similar roles in both macrosatellites, or even whether those similarities may extend to other macrosatellite families. What appears to be a general rule is that greater numbers of repeats are associated with heterochromatic features, and this is also the case for other repeats such as some microsatellites. For instance, the CTG triplet repeat expansion found in myotonic dystrophy 1 results in acquisition of heterochromatin. This “heterochromatinization” mechanism involves CTCF loss, bidirectional transcription and generation of siRNAs. Again, this highlights the fact that noncoding transcription may contribute to opposite chromatin statuses that can mediate either silencing (as in CTG microsatellite expansion) or activation (as in D4Z4 contraction). All these facts expose the complexity of DNA repeat regulation and transcription of noncoding regions. To date, D4Z4 remains one of the best described macrosatellite repeats and the only one reported to be transcribed into a regulatory, chromatin-associated, long noncoding RNA. Recently, lncRNAs have become a major research focus due to their versatile functions and key roles in cell physiology.
They can regulate chromatin structure in cis by targeting protein complexes to specific chromatin loci, or in trans by anchoring chromosomal interactions. They can also interact with mRNAs and regulate their metabolism, or interact with proteins and regulate protein complex assembly. Since many macrosatellite repeats remain poorly investigated, especially at the transcriptional level, it is plausible that they contain a hoard of regulatory ncRNAs or even coding RNAs that have yet to be discovered. Exciting years of research in the field of repetitive DNA are ahead of us, which will shed more light on these still veiled regions of the genome and determine their possible relevance in genome organization, cellular biology, and disease pathogenesis.
