Michel Planat, Marcelo M Amaral, Fang Fang, David Chester, Raymond Aschheim, Klee Irwin.
Abstract
Transcription factors (TFs) are proteins that recognize specific DNA fragments in order to decode the genome and ensure its optimal functioning. TFs work at the local and global scales by specifying cell type, cell growth and death, cell migration, organization and timely tasks. We investigate the structure of DNA-binding motifs with the theory of finitely generated groups. The DNA 'word' in the binding domain (the motif) may be seen as the generator of a finitely generated group Fdna on four letters, the bases A, T, G and C. It is shown that, most of the time, the DNA-binding motifs have subgroup structures close to free groups of rank three or less, a property that we call 'syntactical freedom'. This property is associated with the aperiodicity of the motif when it is seen as a substitution sequence. Examples are provided for the major families of TFs, such as leucine zipper factors, zinc finger factors, homeo-domain factors, etc. We also discuss the exceptions to the existence of such DNA syntactical rules and their functional roles. These include the TATA box in the promoter region of some genes, the single-nucleotide polymorphism (SNP) markers and the motifs of some genes with ubiquitous roles in transcription and regulation.
Keywords: DNA transcription factors; aperiodic order; finitely generated groups
Year: 2022 PMID: 35723353 PMCID: PMC9164029 DOI: 10.3390/cimb44040095
Source DB: PubMed Journal: Curr Issues Mol Biol ISSN: 1467-3037 Impact factor: 2.976
The number of conjugacy classes of subgroups of index d in the free group of rank r [9].
| r | d = 1 | d = 2 | d = 3 | d = 4 | d = 5 | d = 6 | d = 7 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 3 | 7 | 26 | 97 | 624 | 4163 |
| 3 | 1 | 7 | 41 | 604 | 13,753 | 504,243 | 24,824,785 |
| 4 | 1 | 15 | 235 | 14,120 | 1,712,845 | 371,515,454 | 127,635,996,839 |
| 5 | 1 | 31 | 1361 | 334,576 | 207,009,649 | 268,530,771,271 | 644,969,015,852,641 |
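The small entries of this table can be checked by brute force. The sketch below is not the authors' tooling; it relies on the standard correspondence between conjugacy classes of index-d subgroups of the free group of rank r and transitive r-tuples of permutations of d points, counted up to simultaneous conjugation. The function names (`is_transitive`, `conjugate`, `count_classes`) are hypothetical, and the cost grows like (d!)^r, so only the first few columns are practical.

```python
from itertools import permutations, product

def is_transitive(gens, d):
    """True if the permutations in gens carry point 0 to every point of range(d)."""
    seen, stack = {0}, [0]
    while stack:
        x = stack.pop()
        for g in gens:
            if g[x] not in seen:
                seen.add(g[x])
                stack.append(g[x])
    return len(seen) == d

def conjugate(g, s):
    """Return s*g*s^-1 as a tuple: it sends s[i] to s[g[i]]."""
    out = [0] * len(g)
    for i, gi in enumerate(g):
        out[s[i]] = s[gi]
    return tuple(out)

def count_classes(rank, index):
    """Count conjugacy classes of subgroups of the given index
    in the free group of the given rank (brute force)."""
    sym = list(permutations(range(index)))
    seen, classes = set(), 0
    for gens in product(sym, repeat=rank):
        # Skip tuples already covered by an earlier orbit, and non-transitive ones.
        if gens in seen or not is_transitive(gens, index):
            continue
        classes += 1
        # Mark the whole orbit under simultaneous conjugation as visited.
        for s in sym:
            seen.add(tuple(conjugate(g, s) for g in gens))
    return classes
```

For example, `[count_classes(2, d) for d in range(1, 5)]` evaluates to `[1, 3, 7, 26]`, which agrees with the r = 2 row above.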
Group structure of a TATA box. Column 1 is for the selected consensus sequence (rows 4 to 6 are for the TATA box in the core promoter of the UGT1A1 gene). Column 2 is for the cardinality sequence (card seq) of conjugacy classes (cc) of subgroups in the finitely generated group whose relation (rel) is the consensus sequence (cons seq). Column 3 identifies the Hecke group that is close to the group under consideration (based on its card seq of subgroups). Column 4 gives references in the literature. Bold digits feature the fit to a Hecke group.
| Rel: Cons Seq | Card Seq of cc of Subgroups | Group | Literature |
|---|---|---|---|
| TATAAAA | | | [ |
| TATAAAAA | | | [ |
| A(TA) | | | [ |
| A(TA) | | | . |
| A(TA) | | | . |
| A(TA) | | | . |
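The card seq of a motif-defined group can be approximated for small indices by direct enumeration. The sketch below is an illustration under stated assumptions, not the authors' method: it takes the generators to be the distinct letters occurring in the motif, imposes the single relation motif = 1, and counts transitive permutation representations on d points up to conjugation. The function name `card_seq` is hypothetical, and the running time grows like (d!)^k for k letters.

```python
from itertools import permutations, product

def card_seq(motif, max_index):
    """Number of conjugacy classes of index-d subgroups, d = 1..max_index,
    of the group <letters of motif | motif = 1> (brute force)."""
    letters = sorted(set(motif))  # assumption: one generator per distinct base
    result = []
    for d in range(1, max_index + 1):
        sym = list(permutations(range(d)))
        seen, classes = set(), 0
        for images in product(sym, repeat=len(letters)):
            if images in seen:
                continue
            rep = dict(zip(letters, images))
            # Evaluate the motif as a product of permutations; it must be trivial.
            p = tuple(range(d))
            for ch in motif:
                p = tuple(rep[ch][x] for x in p)
            if p != tuple(range(d)):
                continue
            # Transitivity check: the orbit of point 0 must be all d points.
            orbit, stack = {0}, [0]
            while stack:
                x = stack.pop()
                for g in images:
                    if g[x] not in orbit:
                        orbit.add(g[x])
                        stack.append(g[x])
            if len(orbit) != d:
                continue
            classes += 1
            # Mark every simultaneous conjugate s*g*s^-1 as already counted.
            for s in sym:
                conj = []
                for g in images:
                    out = [0] * d
                    for i in range(d):
                        out[s[i]] = s[g[i]]
                    conj.append(tuple(out))
                seen.add(tuple(conj))
        result.append(classes)
    return result
```

With this sketch, `card_seq("TATAAAA", 3)` returns `[1, 1, 2]`. Larger indices are better handled by a dedicated low-index subgroup routine in a computer algebra system.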
Group analysis of a few known and candidate SNP markers (taken from [15]). Column 1 is for the selected gene. Column 2 is for the SNP marker. Column 3 is for the card seq of the finitely generated group whose relation (rel) is the marker. Column 4 is for the reference paper, and the letter indicates the heuristic confidence level of the candidate SNP marker (in alphabetical order from the best (A) to the worst (E)). The computed closeness of the finitely generated group to the free group correlates, most of the time, with a lower risk of illness, as described in [15]. The symbol * corresponds to the only two-base SNP marker in the table. The card seq is the same as the sequence for the fundamental group of 3-manifold . The latter manifold is the smallest-volume closed 3-manifold and is non-orientable [17].
| Gene | Rel: Marker | Card Seq of cc of Subgroups | Literature |
|---|---|---|---|
| ESR2 | TTAAAAGGAA | | Table 1 in [ |
| HSD17B1 | AGCCCAGAGC | | ., A |
| . | CAAGCCCAGA | | ., A |
| PGR | AAAGGAGCCG | | ., A |
| GSTM3 | GGGTATAAAG | | ., E |
| . | CCCCTCCCGC | | ., C |
| . | CCCTCCCGCT | . | ., C |
| IL1B | AAAACAGCGA | | Table 2 in [ |
| CYP2A6 | AAAGGCAAC | | ., A |
| DHFR | GGGACGAGGG | | ., A |
| . | GGACGAGGGG | . | ., A |
| LEP | GGGGCGGGA | | Table 3 in [ |
| GCG | TGCGCCTTGG | | ., B |
| GH1 | TATAAAAAGG | | ., E |
| . | GTATAAAAAG | . | ., D |
| . | GGTATAAAAA | . | ., E |
| . | AGGGCCCACA | | ., A |
| . | AAAGGGCCCC | | ., A |
| . | AAAGGGCCA | . | ., A |
| NOS2 | TCTTGGCTGC | | Table 4 in [ |
| TPI1 | ATATAAGTGG | | ., B |
| GJA5 | TATTAAACAC | | ., E |
| HBD | AAAAGGCAGG | | Table 5 in [ |
| F2 | AACCCAGAGG | | ., A |
| F8 | GGAAGAGGGA | | Table 6 in [ |
| F3 | GCGCGGGGCA | | ., A |
| F11 | TTTTTAGTAA | . | ., D |
| . | TTTTTAGTAA | | ., A |
| . | AAGGAAATTT | | ., A |
| AR | GTGGAAGATT | | Table 7 in [ |
| . | CCACGACCCG | | ., D |
| MTHFR | TCCCTCCCA | | ., A |
| DNMT1 | TGTGTGGCCCG | . | ., A |
| . | GTGTGTGCCC | . | ., A |
| . | GACGAGCCCA | | ., A |
| NR5A1 | ACAAGAGAAA | | ., A |
| . | GGTGTGAGAG | | ., A |
Group structure of motifs for transcription factors of the immediate early genes Fos, EGR and Myc. Most of the time, the card seq of the group defined with the relation/motif is that of the free group $F_2$ (for a 3-letter motif) or $F_3$ (for a 4-letter motif). There are two exceptions for the EGR1 gene, depending on the selected motif, where the card seq corresponds to the modular group or the Baumslag–Solitar group $BS(1,-1)$, which is the fundamental group of the Klein bottle. The card seq for is in Table 2. The card seq for is .
| Gene | Rel: Motif | Card Seq | Literature |
|---|---|---|---|
| Fos | TGAGTCA | | [ |
| . | TGACTCA | | [ |
| EGR1 | GCGTGGGCG | | [ |
| . | CCGCCCCCG | | ., MA0162.2 |
| . | CCGCCCCCGC | | ., . |
| . | ACGCCCACGCA | | ., MA0162.3 |
| . | GGCCCACGC | . | ., MA0162.4 |
| EGR2 | CCGCCCACGC | . | ., MA0472.1 |
| . | ACGCCCACGCA | . | ., MA0472.2 |
| EGR3, EGR4 | ACGCCCACGCA | . | ., [MA0732.1, MA0733.1] |
| Myc | CACGTG | | [ |
| . | CGCACGTGGT | . | [ |
| . | CCCACGTGCTT | . | ., MA0147.2 |
| . | CCACGTGC | . | ., MA0147.3 |
| Mycn, Max::Myc, etc. | GACCACGTGGT, etc. | . | ., [MA0104.1, etc.] |
Figure 1. The DNA-binding domain of the immediate early gene Fos. Its entry in the Protein Data Bank is 1FOS.
Figure 2. (Left) Cartoon representation of the Cys2His2 zinc finger motif, consisting of an α-helix and an antiparallel β-sheet. The zinc ion (green) is coordinated by two histidine residues and two cysteine residues. (Right) Cartoon representation of the protein ZNF268 (blue) containing three zinc fingers in complex with DNA (orange). The coordinating amino acid residues and zinc ions (green) are highlighted. The entry for the DNA-binding domain in the Protein Data Bank is 4R2A.
Figure 3. (Up) Crystal structure of Myc and Max in complex with DNA. (Down) The link (which is supposed to control the binding domain Myc) is attached to the plane in the half-space . It is not splittable. This can be proven by checking that the fundamental group is not free [22] and ([23] p. 90). One gets , where (.,.) means the group-theoretical commutator. The cardinality sequence of cc of subgroups of is .
Group structure of motifs for some transcription factors that do not lead to free groups. The card seq for is ; for it is . The card seq for is already in Figure 3 as . The card seq for is ]; for , it is ; for , it is ; for , it is . The card seq for is . The index i in refers to the rank of the group under examination. The three sections are for motifs on 2, 3 and 4 letters, respectively.
| Gene | Rel: Motif | Card Seq | Literature |
|---|---|---|---|
| NKX6-2 | TAATTAA | | [ |
| HoxA1, HoxA2 | TAATTA | | [ |
| POU6F1, Vax | . | . | ., [MA0628.1, MA0722.1] |
| RUNX1 | TGTGGT | . | ., MA0511.1 |
| RUNX1 | TGTGGTT | | [ |
| EHF | CCTTCCTC | . | ., MA0598.1 |
| POU6F1 | TAATGAG | | [ |
| PITX2 | TAATCCC | . | ., [MA1547.1, MA1547.2] |
| ELK4 | CTTCCGG | . | ., MA0076.2 |
| OTX2, Dmbx1 | GGATTA | | [ |
| PitX1, PitX2, PitX3, OTX1 | TAATCC | . | ., [MA0682.1, MA0711.1] |
| N-box | TTCCGG | . | [ |
| p53 | CACATGTCCA | | [ |
| GZF1 | TGCGCGTCTATA | . | [ |
| NF-kappa-B | GGGAATTTCC | . | [ |
| STAT1 | TTTCCCGGAA | . | ., MA0137.2 |
| . | TTCCAGGAA | . | ., MA0137.3 |
| STAT4 | TTCCAGGAAA | . | ., MA0518.1 |
| FOSL1::Jun | ATGACGTCAT | | [ |
| USF2 | GTCATGTGACC | . | ., MA0626.1 |
| PAX1 | CGTCACGCATGA | . | ., MA0779.1 |
| STAT2 | TTCCAGGAAG | . | ., MA0144.1 |
| FOS | GATGACGTCATCA | | [ |
| MAFA, MAFF, MAFK | TGCTGAGTCAGCA | . | ., [MA1521.1, MA0495.2, MA0946.2] |
| CREB | TGACGTCA | | [ |
| USF2 | GGTCACGTGACC | . | ., MA0526.4 |
| SMAD3, SMAD5 | GTCTAGAC | . | ., [MA0795.1, MA1557.1], [ |
A short account of the function or dysfunction (through mutations or isoforms) of genes associated with transcription factors and sections in Table 5.
| Gene | Type | Function | Dysfunction |
|---|---|---|---|
| NKX6-2 | homeobox | central nervous system, pancreas | spastic ataxia |
| HoxA1 | homeobox | embryonic development of face and heart | autism |
| HoxA2 | . | . | cleft palate |
| Pou6F1 | . | neuroendocrine system | clear cell adenocarcinoma |
| Vax | . | forebrain development | craniofacial malformation |
| RunX1 | Runt-related | cell differentiation, pain neurons | myeloid leukemia |
| EHF | homeobox | epithelial expression | carcinogenesis, asthma |
| PitX2 | . | eye, tooth, abdominal organs | Axenfeld–Rieger syndrome |
| ELK4 | Ets-related | serum response for c-Fos | |
| OTX1, OTX2 | homeobox | brain and sensory organ development | medulloblastomas |
| Dmbx1 | . | . | farsightedness and strabismus |
| PitX1 | . | organ development, left–right asymmetry | autism, club foot |
| PitX3 | . | lens formation in eye | congenital cataracts |
| N-box | Ets-related | synaptic expression | drug sensitivity |
| p53 | p53 domain | ‘Guardian of the genome’ | cancers |
| GZF1 | Zinc fingers | protein coding | short stature, myopia |
| NF-kappa-B | . | DNA transcription, cytokines | apoptosis |
| STAT1 | Stat family | signal activator of transcription | immunodeficiency 31 |
| STAT4 | Stat family | signal activator of transcription | rheumatoid arthritis |
| FOSL1::Jun | leucine zipper | cellular proliferation | marker of cancer |
| USF2 | helix-loop-helix | transcription activator | |
| PAX1 | paired box | fetal development | Klippel–Feil syndrome |
| FOS | leucine zipper | cellular proliferation | cancers |
| Maf | . | pancreatic development | congenital cerulean cataract |
| CREB | bZIP | neuronal plasticity | Alzheimer’s disease |
| USF2 | helix-loop-helix | transcription activator | |
| SMAD | homeo domain | cell development and growth | Alzheimer’s disease |
Figure 4. Crystal structure of the p53 binding domain. Its entry in the Protein Data Bank is 4HJE.