Katherine A Bolton, Jason P Ross, Desma M Grice, Nikola A Bowden, Elizabeth G Holliday, Kelly A Avery-Kiejda, Rodney J Scott.
Abstract
BACKGROUND: Tandem repeats (TRs) are unstable regions commonly found within genomes that have consequences for evolution and disease. In humans, polymorphic TRs are known to cause neurodegenerative and neuromuscular disorders and are associated with complex diseases such as diabetes and cancer. If present in upstream regulatory regions, TRs can modify chromatin structure and affect transcription, resulting in altered gene expression and protein abundance. The most common TRs are short tandem repeats (STRs), or microsatellites. Promoter-located STRs are considerably more polymorphic than coding-region STRs and may therefore be a common driver of phenotypic variation. To study STRs located in regulatory regions, we performed a genome-wide analysis to identify all STRs present in the region spanning 2 kilobases upstream to 1 kilobase downstream of the transcription start sites of genes.
Year: 2013 PMID: 24228761 PMCID: PMC3840602 DOI: 10.1186/1471-2164-14-795
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Location of the regulatory region analysed in a representative human gene. The location of the 3 kilobase (kb) regulatory region (marked by a red box) in a representative human gene screened in the creation of STaRRRT. As the length of the 5'-UTR can differ markedly among human genes, the 1 kb region downstream of the TSS will encompass the entire 5'-UTR for some but not all human genes. This is demonstrated by the marking of two possible start codons in relation to the regulatory region screened.
Details provided in STaRRRT
| Column | Description | Example |
|---|---|---|
| Chrom | Chromosome number on which STR is located | chr1 |
| chromStart | Start position on chromosome of the gene | 28218048 |
| chromEnd | End position on chromosome of the gene | 28241236 |
| cdsStart | Coding sequence start | 28218673 |
| cdsEnd | Coding sequence end | 28240954 |
| Strand | Strand on which the gene occurs | − (negative) |
| knownGeneId | KnownGene database identifier | uc001bpe.1 |
| refSeqId¹ | RefSeq database identifier | NM_002946 |
| ensGeneId | Ensembl database identifier | ENST00000373912 |
| sourceAcc | GenBank transcript accession number | NM_002946.3 |
| hgncSymbol² | HGNC gene symbol | RPA2 |
| U133Id | Affymetrix GeneChip array identifier | U133A:201756_at; |
| U133Plus2Id | Affymetrix GeneChip Plus2.0 array identifier | 201756_at |
| Category | Type of gene (coding or noncoding) | coding |
| txPos³ | Position in relation to the TSS | −1910 |
| srStart⁴ | Start position on chromosome for the STR | 28243107 |
| srEnd | End position on chromosome for the STR | 28243146 |
| Period⁵ | Length of the repeat unit in the STR | 2 |
| numRepeats | Number of copies of the repeat unit | 19.5 |
| srLength | Total length of the STR | 39 |
| consensusSize | Number of bases in the consensus sequence | 2 |
| perMatch⁶ | % match of STR to consensus sequence; purity | 100 |
| perIndel | Percent insertions and/or deletions in the STR | 0 |
| Score | Alignment score (minimum = 50) | 78 |
| A | Percent of A's (adenine) in the repeat unit | 0 |
| C | Percent of C's (cytosine) in the repeat unit | 0 |
| G | Percent of G's (guanine) in the repeat unit | 48 |
| T | Percent of T's (thymine) in the repeat unit | 51 |
| Entropy | Entropy | 1 |
| Sequence | Consensus sequence of the repeat unit; motif | TG |
¹ An STR only appears in STaRRRT if the gene has a RefSeq database identifier; ² An STR only appears in STaRRRT if the gene has an HGNC gene symbol; ³ txPos was limited to −2000 to +1000 bp in the creation of STaRRRT; ⁴ sr = simple repeats, as it appears in the UCSC Genome Browser; ⁵ Period was limited to 1 to 9 bp; ⁶ perMatch was limited to ≥ 90%.
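The inclusion criteria stated in the column descriptions and footnotes can be sketched as a single predicate. This is a minimal illustration, not the authors' pipeline: the `StrRecord` class and `passes_starrrt_filters` function are hypothetical names, and treating the txPos and Period limits as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StrRecord:
    """Hypothetical container for the STaRRRT fields used as filters."""
    tx_pos: int       # txPos: position relative to the TSS, in bp
    period: int       # Period: length of the repeat unit, in bp
    per_match: int    # perMatch: percent match to the consensus (purity)
    score: int        # Score: alignment score
    has_refseq: bool  # gene has a RefSeq identifier
    has_hgnc: bool    # gene has an HGNC gene symbol


def passes_starrrt_filters(rec: StrRecord) -> bool:
    """Apply the limits listed in the table and footnotes.

    Boundaries are assumed to be inclusive.
    """
    return (
        -2000 <= rec.tx_pos <= 1000   # footnote 3
        and 1 <= rec.period <= 9      # footnote 5
        and rec.per_match >= 90       # footnote 6
        and rec.score >= 50           # Score row: minimum = 50
        and rec.has_refseq            # footnote 1
        and rec.has_hgnc              # footnote 2
    )


# Example values taken from the RPA2 row of the table above.
rpa2 = StrRecord(tx_pos=-1910, period=2, per_match=100, score=78,
                 has_refseq=True, has_hgnc=True)
print(passes_starrrt_filters(rpa2))  # True
```

An STR that lies outside the 3 kb window, or whose purity falls below 90%, would be rejected by the same predicate.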
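Sample of the resource STaRRRT
| Chrom | chromStart | chromEnd | Strand | refSeqId | hgncSymbol | Category | txPos | srStart | srEnd | Period | numRepeats | srLength | perMatch | A | C | G | T | Sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|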
| chr1 | 1102483 | 1102578 | + | NR_029639 | MIR200B | noncoding | −586 | 1101897 | 1101928 | 6 | 5.2 | 31 | 92 | 19 | 80 | 0 | 0 | CACCCC |
| chr1 | 1103242 | 1103332 | + | NR_029834 | MIR200A | noncoding | −1345 | 1101897 | 1101928 | 6 | 5.2 | 31 | 92 | 19 | 80 | 0 | 0 | CACCCC |
| chr1 | 1631377 | 1633247 | + | NR_002946 | MMP23A | coding | −340 | 1631037 | 1631077 | 9 | 4.4 | 40 | 93 | 2 | 10 | 62 | 25 | GTGTGCGGG |
| chr1 | 1950767 | 1962192 | + | NM_000815 | GABRD | coding | −994 | 1949773 | 1949836 | 5 | 12.8 | 63 | 98 | 61 | 17 | 0 | 20 | ATAAC |
| chr1 | 2487804 | 2495188 | + | NM_003820 | TNFRSF14 | coding | 183 | 2487987 | 2488012 | 6 | 4.2 | 25 | 100 | 0 | 32 | 0 | 68 | TTCTCT |
| chr1 | 2985741 | 3355185 | + | NM_022114 | PRDM16 | coding | −121 | 2985620 | 2985645 | 3 | 8.3 | 25 | 100 | 0 | 32 | 68 | 0 | GGC |
| chr1 | 3816967 | 3832011 | + | NR_024455 | LOC100133612 | noncoding | −1887 | 3815080 | 3815118 | 3 | 12.7 | 38 | 100 | 68 | 31 | 0 | 0 | AAC |
| chr1 | 6673755 | 6684093 | + | NM_153812 | PHF13 | coding | −494 | 6673261 | 6673286 | 7 | 3.6 | 25 | 100 | 16 | 16 | 68 | 0 | AGCGGGG |
| chr1 | 9352940 | 9429590 | + | NM_025106 | SPSB1 | coding | −1705 | 9351235 | 9351260 | 1 | 25 | 25 | 100 | 0 | 0 | 0 | 100 | T |
| chr1 | 9352940 | 9429590 | + | NM_025106 | SPSB1 | coding | −157 | 9352783 | 9352812 | 7 | 4.1 | 29 | 100 | 0 | 72 | 27 | 0 | CGCGCCC |
This simplified sample of STaRRRT shows the details for the first 10 STRs in STaRRRT. The number of columns has been reduced from 30 to the 19 shown here due to size limitations. STaRRRT can be viewed in its entirety at http://www.newcastleinnovationhealth.com.au/STaRRRT.
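The coordinate columns are mutually consistent, which can be checked with a short sketch. The relationships below are inferred from the rows shown rather than stated by the authors: the TSS is assumed to be chromStart on the plus strand and chromEnd on the minus strand, and srLength is assumed to be srEnd minus srStart. The function names are illustrative.

```python
def tx_pos(strand, chrom_start, chrom_end, sr_start, sr_end):
    """Offset of the STR from the TSS, in transcript orientation.

    Assumption: the TSS is chromStart for '+' genes and chromEnd for
    '-' genes; negative values are upstream of the TSS.
    """
    if strand == "+":
        return sr_start - chrom_start
    return chrom_end - sr_end


def sr_length(sr_start, sr_end):
    """Length of the STR, assuming srEnd - srStart."""
    return sr_end - sr_start


# MIR200B row of the sample (plus strand).
print(tx_pos("+", 1102483, 1102578, 1101897, 1101928))    # -586
print(sr_length(1101897, 1101928))                        # 31

# RPA2 example from the column description table (minus strand).
print(tx_pos("-", 28218048, 28241236, 28243107, 28243146))  # -1910
print(sr_length(28243107, 28243146))                        # 39
```

The same arithmetic reproduces the downstream case in the sample: for TNFRSF14, 2487987 minus 2487804 gives a txPos of 183.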
Figure 2. Comparison of STRs of different period lengths in the whole human genome, gene coding regions and STaRRRT STRs. This histogram shows the proportion of STaRRRT STRs with each period length (“STaRRRT”) compared to the proportions across the whole human genome (“All STRs”), in the 2 kb upstream region (−2000, −1; “Upstream”), in the 3 kb region analysed for all STRs (with no purity restriction; “Reg. region”), in the proximal promoter (−250, +250; “Prox. Promoter”), in exons (“Exon”), in 5'-UTRs (“5'-UTR”), and in introns (“Intron”).
Figure 3. Summary plots across the TSS. The distribution of STRs in the upstream regulatory region of the human genome shows distinct trends around the TSS and core promoter. All lines are smoothed by LOWESS (locally weighted scatterplot smoothing) regression. (A) The density of STaRRRT STRs across the 3 kb upstream regulatory region. This run chart shows the STR density of the 5,264 STRs from STaRRRT at each base position in the regulatory region with a regression line also fitted to the data. (B) STaRRRT STR density decomposed into periods. (C) The number of STR repeat units across the TSS. (D) The percentage of bases in each STR across the TSS.
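One plausible reading of the per-base density in panel (A) is the number of STRs covering each position of the −2000 to +1000 window. The sketch below illustrates that reading only; it is an assumption, not the authors' procedure. A centred moving average is swapped in for the LOWESS regression named in the caption to keep the example dependency-free, and all function names are hypothetical.

```python
def str_density(strs, lo=-2000, hi=1000):
    """Count how many STRs cover each position in [lo, hi].

    `strs` is a list of (txPos, srLength) pairs; each STR is assumed
    to span txPos .. txPos + srLength - 1 in transcript orientation.
    """
    counts = {pos: 0 for pos in range(lo, hi + 1)}
    for start, length in strs:
        for pos in range(start, start + length):
            if pos in counts:
                counts[pos] += 1
    return counts


def moving_average(values, window=5):
    """Centred moving average; a simple stand-in for LOWESS."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# (txPos, srLength) pairs from five rows of the STaRRRT sample table.
sample = [(-586, 31), (-340, 40), (-994, 63), (183, 25), (-121, 25)]
density = str_density(sample)
profile = [density[pos] for pos in range(-2000, 1001)]
smoothed = moving_average(profile, window=51)
```

With all 5,264 STRs as input, `profile` would correspond to the raw run chart and `smoothed` to a fitted trend line, under the assumptions above.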
KEGG pathway results from HEAT analysis grouped by pathway class
| Pathway | KEGG ID | Genes | STaRRRT Enrich | STaRRRT p | Exon Enrich | Exon p | Intron Enrich | Intron p |
|---|---|---|---|---|---|---|---|---|
| Purine metabolism | 230 | 126 | 1.94 | 0.020 | - | - | 2.44 | 0.006 |
| Glycine, serine and threonine metabolism | 260 | 53 | - | - | - | - | 3.33 | 0.006 |
| Glycosaminoglycan degradation | 531 | 12 | - | - | 7.14 | 0.006 | - | - |
| Inositol phosphate metabolism | 562 | 54 | 3.04 | 0.002 | - | - | 2.70 | 0.028 |
| Glycan structures - biosynthesis 1 | 1030 | 40 | 3.24 | 0.005 | - | - | - | - |
| Glycan structures - degradation | 1032 | 18 | - | - | 6.00 | 0.006 | - | - |
| Apoptosis | 4210 | 111 | 2.21 | 0.006 | 2.13 | 0.050 | 2.11 | 0.028 |
| Dorso-ventral axis formation | 4320 | 80 | 2.03 | 0.048 | - | - | 2.73 | 0.007 |
| Axon guidance | 4360 | 114 | 2.25 | 0.004 | 2.38 | 0.016 | 2.18 | 0.020 |
| Calcium signaling pathway | 4020 | 108 | 3.23 | 8.6E-07 | 2.50 | 0.012 | 2.30 | 0.015 |
| Phosphatidylinositol signaling system | 4070 | 64 | 3.09 | 0.001 | 2.57 | 0.050 | 2.50 | 0.028 |
| Wnt signaling pathway | 4310 | 126 | 2.32 | 0.002 | 2.14 | 0.029 | - | - |
| VEGF signaling pathway | 4370 | 155 | 2.33 | 0.001 | 2.09 | 0.020 | 1.89 | 0.032 |
| Focal adhesion | 4510 | 120 | 2.33 | 0.002 | 2.42 | 0.012 | 2.20 | 0.015 |
| Adherens junction | 4520 | 166 | 1.76 | 0.031 | 2.50 | 0.002 | 2.30 | 0.004 |
| Tight junction | 4530 | 101 | 1.95 | 0.038 | 2.68 | 0.007 | - | - |
| Gap junction | 4540 | 116 | 2.32 | 0.002 | 2.19 | 0.032 | 2.41 | 0.007 |
| Jak-STAT signaling pathway | 4630 | 140 | - | - | 2.73 | 0.002 | - | - |
| Regulation of actin cytoskeleton | 4810 | 98 | 2.26 | 0.007 | 2.41 | 0.024 | 2.24 | 0.026 |
| Hematopoietic cell lineage | 4640 | 19 | - | - | - | - | 5.39 | 0.006 |
| T cell receptor signaling pathway | 4660 | 167 | 1.89 | 0.011 | 2.50 | 0.002 | - | - |
| B cell receptor signaling pathway | 4662 | 160 | 1.75 | 0.037 | 2.61 | 0.002 | - | - |
| Leukocyte transendothelial migration | 4670 | 99 | 1.88 | 0.051 | 2.73 | 0.006 | 2.06 | 0.043 |
| Long-term potentiation | 4720 | 125 | 2.34 | 0.002 | 2.17 | 0.028 | - | - |
| Long-term depression | 4730 | 142 | 2.21 | 0.002 | 2.18 | 0.020 | 1.96 | 0.028 |
| Insulin signaling pathway | 4910 | 195 | 1.98 | 0.002 | 2.32 | 0.002 | 2.18 | 0.004 |
| Adipocytokine signaling pathway | 4920 | 150 | - | - | 3.01 | 1.3E-04 | 1.96 | 0.028 |
| Type II diabetes mellitus | 4930 | 22 | 3.16 | 0.051 | - | - | 5.33 | 0.004 |
| Epithelial cell sig. in | 5120 | 150 | 2.17 | 0.002 | 2.41 | 0.006 | 2.16 | 0.010 |
| Colorectal cancer | 5210 | 82 | 2.00 | 0.051 | - | - | 2.68 | 0.008 |
Results from gene set enrichment analysis of the set of transcripts with STaRRRT STRs are shown alongside results for transcripts with STRs located in exons and transcripts with the highest density of high purity STRs in the introns. Results shown are FDR-corrected p-values. A KEGG pathway is only presented in the table if at least one of the STaRRRT, exon or high density intron results has an FDR-corrected p-value of less than 0.01. Cells marked “-” denote sets that were not enriched (p > 0.05 before FDR correction). “Genes” is the number of entities in each set and “Enrich” is the ratio of the number of transcripts observed with STRs relative to that expected.
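The two quantities in the caption can be illustrated with a short sketch. The enrichment ratio follows the definition given (observed relative to expected); how HEAT derives the expected count is not stated here, so the background-rate formulation is an assumption. The caption also does not name the FDR method, so the Benjamini-Hochberg procedure is used purely as a common example. Function names and input values are hypothetical.

```python
def enrichment(observed, set_size, background_rate):
    """Observed / expected number of STR-containing transcripts.

    Assumption: expected = set_size * background_rate, where
    background_rate is the fraction of all transcripts with STRs.
    """
    expected = set_size * background_rate
    return observed / expected


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical set: 30 STR-containing transcripts observed in a set
# of 120, against a background rate of 12.5%.
print(enrichment(30, 120, 0.125))  # 2.0

# Hypothetical raw p-values for four gene sets.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Under these assumptions, a pathway would appear in the table when at least one of its three adjusted p-values falls below 0.01.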
Tissue-specific expression results from HEAT analysis
| Tissue | Genes | STaRRRT Enrich | STaRRRT p | Exon Enrich | Exon p | Intron Enrich | Intron p |
|---|---|---|---|---|---|---|---|
| Kidney/bladder | 139 | - | - | 2.08 | 0.014 | 2.21 | 0.003 |
| Muscle/heart | 168 | 2.01 | 0.001 | - | - | - | - |
| Neural | 393 | 2.23 | 4.0E-10 | 1.80 | 0.003 | 2.01 | 8.1E-06 |
| Placenta/testis/ovary | 198 | 1.88 | 0.001 | 1.93 | 0.014 | - | - |
A description of the columns is given in Table 3.
IPA results
| Top Bio Functions | n | p-value range |
|---|---|---|
| Neurological disease | 443 | 1.27E-04 - 4.94E-02 |
| Psychological disorders | 236 | 1.81E-04 - 4.94E-02 |
| Developmental disorder | 132 | 9.19E-04 - 4.24E-02 |
| Antimicrobial response | 29 | 1.38E-03 - 2.00E-02 |
| Infectious disease | 418 | 2.25E-03 - 4.17E-02 |
| Cellular movement | 325 | 3.39E-04 - 4.81E-02 |
| Cell death and survival | 501 | 6.18E-04 - 4.83E-02 |
| Cell-to-cell signaling and interaction | 119 | 1.07E-03 - 4.81E-02 |
| Cellular development | 290 | 1.17E-03 - 4.37E-02 |
| Cellular growth and proliferation | 192 | 1.47E-03 - 4.81E-02 |
| Cardiovascular system development and function | 167 | 7.56E-06 - 4.70E-02 |
| Organismal development | 146 | 3.20E-05 - 4.37E-02 |
| Humoral immune response | 12 | 1.38E-03 - 4.81E-02 |
| Reproductive system development and function | 31 | 1.47E-03 - 4.17E-02 |
| Hematological system development and function | 107 | 1.74E-03 - 4.81E-02 |
| Top Canonical Pathways | STR-containing genes / total genes (ratio) | p-value |
| NGF signaling | 34/111 (0.306) | 3.16E-03 |
| Pyridoxal 5'-phosphate salvage pathway | 22/62 (0.355) | 4.22E-03 |
| Reelin signaling in neurons | 26/82 (0.317) | 6.29E-03 |
| Neuropathic pain signaling in dorsal horn neurons | 31/102 (0.304) | 6.92E-03 |
| GNRH signaling | 38/135 (0.281) | 7.00E-03 |
| Cellular effects of sildenafil (Viagra) | 37/127 (0.291) | 9.28E-03 |
| Calcium signaling | 48/189 (0.254) | 1.01E-02 |
| Factors promoting cardiogenesis in vertebrates | 27/91 (0.297) | 1.27E-02 |
| Synaptic long-term depression | 39/142 (0.275) | 1.51E-02 |
| B cell receptor signaling | 43/162 (0.265) | 1.95E-02 |
| FGF signaling | 26/88 (0.295) | 2.01E-02 |
| mTOR signaling | 49/189 (0.259) | 2.06E-02 |
| Gαq signaling | 40/157 (0.255) | 2.33E-02 |
| Dopamine-DARPP32 feedback in cAMP signaling | 43/161 (0.267) | 2.40E-02 |
| D-myo-inositol (1,4,5)-triphosphate biosynthesis | 10/26 (0.385) | 2.66E-02 |
| PPARα/RXRα activation | 44/173 (0.254) | 2.86E-02 |
| NF-κB activation by viruses | 22/79 (0.278) | 3.18E-02 |
| Xenobiotic metabolism signaling | 66/268 (0.246) | 3.20E-02 |
| Antioxidant action of vitamin C | 27/98 (0.276) | 3.43E-02 |
| Maturity onset diabetes of young (MODY) signaling | 8/22 (0.364) | 3.64E-02 |
Results from comparison of the set of transcripts containing STaRRRT STRs with the reference set Ingenuity Knowledge Base are shown. For “Top Bio Functions”, the number of molecules (n) relates to genes containing STaRRRT STRs in each enriched functional group. For “Top Canonical Pathways”, the number of STR-containing genes, relative to the total number of genes for each canonical pathway, is shown as a fraction and as a ratio (in brackets). Results shown are limited to those with a p-value less than 0.05 for the “Top Bio Functions” and the 20 most significant results with a p-value less than 0.05 for the “Top Canonical Pathways”.
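The fraction-and-ratio cells in the canonical pathway rows can be checked mechanically. The sketch below parses one such cell and recomputes the ratio; the `parse_ratio_cell` helper is a hypothetical name introduced for illustration.

```python
import re


def parse_ratio_cell(cell):
    """Split a cell such as '34/111 (0.306)' into its three numbers."""
    match = re.fullmatch(r"\s*(\d+)/(\d+)\s*\(([\d.]+)\)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognised ratio cell: {cell!r}")
    hits = int(match.group(1))        # STR-containing genes in the pathway
    total = int(match.group(2))       # all genes in the pathway
    reported = float(match.group(3))  # ratio as printed in the table
    return hits, total, reported


# NGF signaling row.
hits, total, reported = parse_ratio_cell("34/111 (0.306)")
print(round(hits / total, 3))  # 0.306
```

Running the same check over the remaining rows is a quick way to catch transcription errors in the table.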