Limin Jiang, Mingrui Duan, Fei Guo, Jijun Tang, Olufunmilola Oybamiji, Hui Yu, Scott Ness, Ying-Yong Zhao, Peng Mao, Yan Guo.
Abstract
Binding motifs for transcription factors, RNA-binding proteins, microRNAs (miRNAs), etc. are vital for the proper regulation of gene transcription and translation. Sequence alteration mechanisms, including single nucleotide mutations, insertions, deletions, RNA editing and single nucleotide polymorphisms, can lead to gains and losses of binding motifs; we term these newly gained or lost binding motifs 'somatic motifs'. Somatic motifs have been studied sporadically but have never been curated into a comprehensive resource. By analyzing various types of sequence-altering data from large consortia, we identified millions of somatic motifs, including those for important transcription factors, RNA-binding proteins, miRNA seeds and miRNA-mRNA 3'-UTR target motifs. While a few of these somatic motifs have been well studied, our results contain many novel somatic motifs that occur at high frequency and are thus likely to have important biological repercussions. Genes targeted by these altered motifs are excellent candidates for further mechanistic studies. Here, we present the first database that hosts millions of somatic motifs ascribed to a variety of sequence alteration mechanisms.
Year: 2020 PMID: 33094288 PMCID: PMC7556404 DOI: 10.1093/narcan/zcaa030
Source DB: PubMed Journal: NAR Cancer ISSN: 2632-8674
Figure 1. Overview of our somatic motif detection algorithm. Top: A simple scenario to show how somatic sequences are generated based on mutations. Middle: The possible target types of somatic motifs. Bottom: Our somatic motif algorithm allows personalized motif search accounting for adjacent single mutations or insertions/deletions (INDELs).
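The gain/loss idea summarized in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes exact string matching against a consensus motif on the positive strand only, whereas a real search would likely score position weight matrices and handle both strands. The function names `apply_variant` and `classify_somatic_motif` and the short example sequences are hypothetical.

```python
def apply_variant(ref_seq, pos, ref, alt):
    """Return the altered ("somatic") sequence.

    pos is the 0-based index in ref_seq where the reference allele starts.
    """
    if ref_seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence at pos")
    return ref_seq[:pos] + alt + ref_seq[pos + len(ref):]


def classify_somatic_motif(ref_seq, pos, ref, alt, motif):
    """Classify a variant as a 'gain', 'loss', or 'none' for one motif.

    Assumption: the motif is an exact consensus string. Only windows that
    overlap the altered allele are inspected, and only presence/absence
    within those windows is compared.
    """
    alt_seq = apply_variant(ref_seq, pos, ref, alt)
    flank = len(motif) - 1  # a motif occurrence can start this far upstream
    lo = max(0, pos - flank)
    ref_window = ref_seq[lo:pos + len(ref) + flank]
    alt_window = alt_seq[lo:pos + len(alt) + flank]
    in_ref = motif in ref_window
    in_alt = motif in alt_window
    if in_alt and not in_ref:
        return "gain"
    if in_ref and not in_alt:
        return "loss"
    return "none"


# Synthetic sequences, chosen only to illustrate the logic.
print(classify_somatic_motif("GGTCCCGGAA", 3, "C", "T", "TTCCGG"))  # SNV
print(classify_somatic_motif("GCAAAGC", 2, "A", "ATTA", "ATTAAA"))  # insertion
```

Applying the variant before searching lets the same comparison cover substitutions, insertions and deletions, because the search windows are derived from the allele lengths rather than from a fixed offset.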
Figure 2. (A) The overall results from conducting somatic motif analysis from five data sources (TCGA, ICGC, GTEx, dbSNP, REDIportal) against multiple data sources. Each bar represents the log10 of the number of identified somatic motifs. (B) Genome-wide visualization of the TF somatic motifs in a circos plot. There are four layers in this circos plot. The outer layer represents the genome by chromosome; the second layer (gold) represents somatic mutations that cause gains of TF motifs; the third layer (dark slate gray) denotes somatic mutations that cause losses of TF motifs; and the fourth and inner layer (khaki) denotes the location of the binding motif gene. In the middle, each arrow's color matches its layer's color, and the arrow connects the sequence-altering mechanism on the second or third layer to its binding motif gene on the fourth layer. (C) Genome-wide visualization of miRNA–mRNA 3′-UTR binding somatic motifs. There are three layers in this circos plot. The outer layer represents the genome by chromosome; the second layer (chartreuse) represents the miRNA–mRNA 3′-UTR binding location; and the third and inner layer (khaki) represents miRNA locations. In the middle, each arrow's color matches its layer's color, and the arrow connects the sequence-altering mechanism on the second layer to its binding miRNA on the third layer.
ICGC somatic motif analysis against JASPAR transcript binding factors
| Project | Chr (a) | Location (b) | Gene (upstream distance) (c) | Mutation | Strand (d) | Type | Affected motif (TF) (e) | Frequency (f) |
|---|---|---|---|---|---|---|---|---|
| LMS-FR | 8 | 18390725 | NAT2 (dist = 557) | A>ATTA | + | Gain | [ATTA]AA (Arid3a) | 11.94% |
| LMS-FR | 8 | 18390725 | NAT2 (dist = 557) | A>ATTA | + | Gain | GTC[ATTA]A (HOXC4) | 11.94% |
| BTCA-SG | 9 | 93095871 | CARD19 (dist = 346) | AAGGAACCCCCCACCGGGCCCCGCCCCTTACTCG>G | − | Loss | [GCCCCGCCCC] (KLF5) | 18.31% |
| BTCA-SG | 7 | 101166514 | VGF (dist = 945) | C>CC | − | Loss | CCCA[C]CTGCGC (ZEB1) | 12.68% |
| BTCA-SG | 1 | 211259205 | RCOR3 (dist = 72) | C>CCCCCTCCCCCCTT | + | Gain | C[CCCCCTCCCCC] (ZNF148) | 18.31% |
| BTCA-SG | 1 | 211259205 | RCOR3 (dist = 72) | C>CCCCCTCCCCCCTT | + | Loss | [C]GCCCCTCCCC (MAZ) | 18.31% |
| MELA_AU | 8 | 56074582 | RPS20 (dist = 10) | C>T | + | Gain | T[T]CCGG (ETS1) | 14.75% |
| MELA_AU | 5 | 1295135 | TERT (dist = 67) | C>T | − | Gain | T[T]CCGG (ETS1) | 11.48% |
| BTCA-SG | 8 | 86514404 | RMDN1 (dist = 31) | CA>C | + | Gain | CGCCC[C]TCCCC (MAZ) | 12.68% |
| LMS-FR | 5 | 116050856 | ARL14EPL (dist = 610) | GTCTG>G | − | Gain | TGC[G]TG (ARNT) | 16.42% |
| LMS-FR | 19 | 43826512 | ZNF283 (dist = 809) | T>GTT | − | Loss | TGCG[T]G (ARNT) | 14.93% |
| LMS-FR | 19 | 14476342 | PTGER1 (dist = 988) | T>GTT | − | Loss | TGCG[T]G (ARNT) | 13.43% |
| BTCA-SG | 2 | 88056086 | KRCC1 (dist = 303) | T>TA | + | Loss | A[T]TAAA (Arid3a) | 14.08% |
| BTCA-SG | 12 | 18090325 | RERGL (dist = 132) | T>TA | + | Loss | T[T]TAAAAAAAAA (ZNF384) | 11.27% |
| BTCA-SG | 5 | 42812249 | SELENOP (dist = 173) | T>TA | + | Loss | T[T]TAAAAAAAAA (ZNF384) | 11.27% |
| BTCA-SG | 17 | 81398932 | BAHCC1 (dist = 442) | T>TA | + | Loss | A[T]TAAA (Arid3a) | 11.27% |
| BTCA-SG | 6 | 150599621 | PLEKHG1 (dist = 264) | T>TA | + | Loss | A[T]TAAA (Arid3a) | 11.27% |
| LMS-FR | 20 | 63499644 | EEF1A2 (dist = 561) | T>TTCCGGGT | − | Gain | [TTCCGG] (ETS1) | 11.94% |
| LMS-FR | 2 | 68953575 | GKN2 (dist = 682) | TGAA>T | + | Gain | AT[T]AAA (Arid3a) | 17.91% |
| LMS-FR | 11 | 4998233 | OR51L1 (dist = 750) | TGCGTGT>T | − | Loss | [TGCGTG] (ARNT) | 10.45% |
| LMS-FR | 11 | 4998231 | OR51L1 (dist = 752) | TGCGTGTGT>T | − | Loss | [TGCGTG] (ARNT) | 10.45% |
(a) Chromosome.
(b) Genomic location in GRCh38.
(c) The gene upstream of which the somatic mutation occurred; the upstream distance is shown in parentheses.
(d) Strand where the somatic motif is observed: +, positive strand; −, negative strand.
(e) The actual somatic motif sequence; the bracketed nucleotides mark the mutated position and its nucleotide(s).
(f) Mutation frequency = number of subjects with the mutation / total number of subjects.
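For readers who want to work with the table programmatically, the bracket notation and the frequency definition in the footnotes can be sketched as follows. The helper names `parse_motif` and `mutation_frequency` are hypothetical, and the subject counts in the example are illustrative values, not figures reported in this section.

```python
import re


def parse_motif(annotated):
    """Split an annotated motif such as 'GTC[ATTA]A'.

    Returns (motif, start, end), where motif is the plain sequence and
    [start, end) is the 0-based span of the bracketed (mutated) bases.
    """
    m = re.fullmatch(r"([ACGT]*)\[([ACGT]+)\]([ACGT]*)", annotated)
    if m is None:
        raise ValueError(f"unrecognized motif annotation: {annotated!r}")
    left, mutated, right = m.groups()
    return left + mutated + right, len(left), len(left) + len(mutated)


def mutation_frequency(n_mutated, n_total):
    """Percentage of subjects carrying the mutation."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_mutated / n_total


print(parse_motif("GTC[ATTA]A"))
# Hypothetical counts: 8 carriers among 67 subjects.
print(f"{mutation_frequency(8, 67):.2f}%")
```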