Xin Wang, Peijie Lin, Joshua W K Ho.
Abstract
BACKGROUND: It has been observed that many transcription factors (TFs) bind to different genomic loci depending on the cell type in which they are expressed, even though an individual TF usually binds to the same core motif in different cell types. How a TF can bind to the genome in such a highly cell-type-specific manner is a critical research question. One hypothesis is that a TF requires co-binding of different TFs in different cell types. If this is the case, it may be possible to observe different combinations of TF motifs (a motif grammar) located at the TF binding sites in different cell types. In this study, we develop a bioinformatics method to systematically identify DNA motifs in TF binding sites across multiple cell types based on published ChIP-seq data, and address two questions: (1) can we build a machine learning classifier to predict cell-type specificity based on motif combinations alone, and (2) can we extract meaningful cell-type-specific motif grammars from this classifier model?
Keywords: Cell-type specificity; Cis-regulatory element; DNA motif; Random Forest; Transcription factor
Year: 2018 PMID: 29363433 PMCID: PMC5780765 DOI: 10.1186/s12864-017-4340-z
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1. Our bioinformatics workflow for DNA motif annotation, Random Forest (RF) classifier training and motif grammar extraction. The workflow consists of four steps. (a) Step 1: extraction of the genomic sequences from the cell-type specific TF binding sites. (b) Step 2: annotation of these sequences using a large database of motifs. (c) Step 3: training of an RF classifier. (d) Step 4: motif rule (grammar) extraction from the RF classifier.
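Steps 1 and 2 of this workflow can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the ±120 bp window mirrors the region stated in the table footnotes, while the function names, the toy sequence, the placeholder motif names, and the use of exact consensus-string matching (in place of scanning with a motif database) are all assumptions made for the example.

```python
def extract_window(chrom_seq, peak_centre, flank=120):
    """Return the sequence within +/- flank bp of a 0-based peak centre."""
    start = max(0, peak_centre - flank)
    end = min(len(chrom_seq), peak_centre + flank + 1)
    return chrom_seq[start:end]


def count_motif(seq, consensus):
    """Count exact (possibly overlapping) occurrences of a consensus string."""
    seq, consensus = seq.upper(), consensus.upper()
    width = len(consensus)
    return sum(
        1 for i in range(len(seq) - width + 1) if seq[i:i + width] == consensus
    )


def annotate(seq, motifs):
    """Build a motif-count feature vector for one binding site."""
    return {name: count_motif(seq, cons) for name, cons in motifs.items()}


# Hypothetical toy inputs; a real analysis would scan with a motif database.
toy_motifs = {"MOTIF_A": "CACGTG", "MOTIF_B": "TGACTCA"}
toy_chrom = "A" * 200 + "CACGTG" + "TT" + "TGACTCA" + "A" * 200

window = extract_window(toy_chrom, peak_centre=205, flank=120)
features = annotate(window, toy_motifs)
print(len(window), features)
```

Each binding site then becomes one row of motif counts, which is the kind of feature vector a classifier in Step 3 could be trained on.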
Fig. 2. The area under the receiver operating characteristic curve (AUROC) of the RF classifiers, based on cross-validation. (a) The TCF7L2 dataset. (b) The MAX dataset. The error bars indicate standard deviations.
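For readers unfamiliar with the metric, the sketch below shows one way an AUROC and its standard deviation across cross-validation folds could be computed. It assumes a one-vs-rest setting (one cell type as the positive class); the fold labels and scores are invented toy values, not results from the study, and the function name `auroc` is hypothetical.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy held-out predictions for three folds: (true labels, classifier scores).
folds = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 0, 1, 0], [0.7, 0.6, 0.4, 0.2]),
    ([1, 0, 0, 1], [0.6, 0.6, 0.1, 0.9]),
]

fold_aurocs = [auroc(y, s) for y, s in folds]
mean_auroc = mean(fold_aurocs)   # bar height in a plot like Fig. 2
sd_auroc = stdev(fold_aurocs)    # error bar (sample standard deviation)
print(fold_aurocs, mean_auroc, sd_auroc)
```

The pairwise definition is quadratic in the number of examples; it is used here only because it makes the meaning of the metric explicit.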
Fig. 3. Motif importance, measured as mean decrease in accuracy (MDA), extracted from the trained RF classifiers. (a) The sorted MDA values extracted from the TCF7L2 RF. (b) The sorted MDA values extracted from the MAX RF. (c) Heat map showing the MDA values of the top motifs in the TCF7L2 RF. (d) Heat map showing the MDA values of the top motifs in the MAX RF.
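The idea behind a mean decrease in accuracy can be illustrated with a simple permutation test: shuffle one motif's counts across sites and measure how much prediction accuracy drops. The sketch below is an assumption-laden simplification. It uses a fixed toy predictor and a single dataset, whereas Random Forest implementations typically compute MDA per tree on out-of-bag samples. All names and data here are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def mean_decrease_accuracy(predict, rows, labels, feature, n_repeats=20, seed=0):
    """Average accuracy drop after permuting one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


def toy_predict(row):
    """Stand-in for a trained classifier: it only ever looks at MOTIF_A."""
    return "type_1" if row["MOTIF_A"] >= 2 else "type_2"


pairs = [(0, 5), (3, 1), (1, 4), (4, 0), (2, 2), (0, 3)]
rows = [{"MOTIF_A": a, "MOTIF_B": b} for a, b in pairs]
labels = [toy_predict(r) for r in rows]  # toy labels, so baseline accuracy is 1.0

mda = {
    f: mean_decrease_accuracy(toy_predict, rows, labels, f)
    for f in ("MOTIF_A", "MOTIF_B")
}
print(mda)
```

A motif the model never consults (here `MOTIF_B`) gets an MDA of exactly zero, while permuting an informative motif can only lower accuracy, which is why sorted MDA values can be read as a ranking of motif importance.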
DNA motif rules extracted from the RF classifier trained on the TCF7L2 dataset
| Rule | Prediction | Reference |
|---|---|---|
| NFE2 >= 3 | HCT-116 (colon cancer cells) | [ |
| GSX1 <= 35 & NFE2 >= 2 | HCT-116 (colon cancer cells) | NFE2 [ |
| BACH1 >= 2 & FOXO3 <= 2 & HOMEZ <= 5 | HCT-116 (colon cancer cells) | BACH1 [ |
| BACH1 <= 1 & E2F >= 2 & HOXB13 >= 3 | HEK293 (embryonic kidney cells) | E2F [ |
| CDX2 >= 5 & GATA >= 2 & MAF = 0 | HEK293 (embryonic kidney cells) | GATA [ |
| BACH1 <= 1 & HNF4 <= 6 & HOXA13 >= 3 & OTX1 >= 9 | HEK293 (embryonic kidney cells) | HOXB13 [ |
| JDP2 >= 2 & SOX21 >= 37 | HeLa-S3 (cervical carcinoma cells) | JDP2 [ |
| HNF4 >= 2 | HepG2 (liver cancer cells) | [ |
| HOXB13 <= 3 & JDP2 <= 1 & TCF7L2 >= 5 | HepG2 (liver cancer cells) | TCF7L2 [ |
| HNF4 >= 1 & HOXC10 <= 16 & SOX9 >= 5 | HepG2 (liver cancer cells) | HNF4 [ |
| GRHL1 >= 4 | MCF-7 (mammary gland adenocarcinoma cells) | GRHL1 [ |
| No rule identified | PANC-1 (pancreatic cancer cells) | |
The numbers in the rules represent the motif frequency detected in the ±120 bp region around the peak centre
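To make the rule notation concrete, the sketch below parses a rule written in a normalised form such as `BACH1 <= 1 & E2F >= 2 & HOXB13 >= 3` and checks it against the motif counts of one binding site. The parser, the function names, the example counts, and the choice to treat an unlisted motif as a count of zero are assumptions for illustration; the source does not describe how rules are applied programmatically.

```python
import operator

# Comparison operators that appear in the rule tables.
OPS = {">=": operator.ge, "<=": operator.le, "=": operator.eq}


def parse_rule(text):
    """Parse 'A >= 2 & B <= 5' into a list of (motif, op, threshold) tuples."""
    conditions = []
    for part in text.split("&"):
        motif, op, threshold = part.split()
        conditions.append((motif, op, int(threshold)))
    return conditions


def rule_matches(conditions, counts):
    """True if every condition holds; motifs absent from counts are taken as 0."""
    return all(
        OPS[op](counts.get(motif, 0), threshold)
        for motif, op, threshold in conditions
    )


# One of the HEK293 rules from the TCF7L2 table, in normalised notation.
rule = parse_rule("BACH1 <= 1 & E2F >= 2 & HOXB13 >= 3")

# Hypothetical motif counts for two binding sites.
site_a = {"BACH1": 0, "E2F": 4, "HOXB13": 3}
site_b = {"BACH1": 2, "E2F": 4, "HOXB13": 3}

print(rule_matches(rule, site_a))  # all three conditions hold
print(rule_matches(rule, site_b))  # BACH1 count exceeds the threshold
```

A site satisfying all conditions of a rule would be assigned the cell type in that rule's Prediction column.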
DNA motif rules extracted from the RF classifier trained on the MAX dataset
| Rule | Prediction | Reference |
|---|---|---|
| IRF >= 1 | GM12878 (lymphoblastoid cells) | [ |
| JDP2 >= 1 | HeLa-S3 (cervical carcinoma cells) | [ |
| AP1 >= 9 & HESX1 >= 2 & LMO2 >= 2 | HeLa-S3 (cervical carcinoma cells) | AP1 [ |
| EMX1 <= 11 & ETS <= 12 & HNF4 >= 1 | HepG2 (liver cancer cells) | HNF4 [ |
| HNF4 >= 1 & IRF4 <= 4 & RUNX2 <= 4 & TAL1 <= 4 | HepG2 (liver cancer cells) | HNF4 [ |
| ALX3 <= 26 & EVX1 >= 6 & GATA >= 2 & LMO2 >= 2 | K562 (immortalised myelogenous leukaemia cells) | GATA [ |
| GATA >= 4 & HNF4 = 0 & POU4F3 <= 4 | K562 (immortalised myelogenous leukaemia cells) | GATA [ |
| No rule identified | A549 (adenocarcinomic alveolar basal epithelial cells) | |
The numbers in the rules represent the motif frequency detected in the ±120 bp region around the peak centre