Ruipeng Lu, Peter K Rogan.
Abstract
Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets using machine learning (ML).
Keywords: Bray-Curtis similarity; Transcription factors; binding sites; chromatin; gene expression profiles; information theory; machine learning; mutation; position-specific scoring matrices
Year: 2018 PMID: 31001412 PMCID: PMC6464064 DOI: 10.12688/f1000research.17363.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Comparison between metrics in measurement of similarity between GTEx tissue-wide expression profiles of genes.
| Similarity metric | Property 1 | Property 2 | Property 3 |
|---|---|---|---|
| Bray-Curtis | √; [0,1] | √ | √ |
| Euclidean | √; (0,1] | √ | × |
| Cosine | √; [0,1] | × | √ |
| Pearson correlation | ×; [-1,1] | × | × |
| Spearman correlation | ×; [-1,1] | × | × |
†The symbols √ and ×, respectively, indicate that the similarity metric satisfies and does not satisfy the property. ‡The interval in each cell indicates the range in which the result computed by the similarity metric lies.
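As a rough illustration of how two of the metrics in the table behave, the sketch below computes the Bray-Curtis similarity in its standard form, 1 - Σ|x_i - y_i| / Σ(x_i + y_i), alongside cosine similarity. The expression vectors and function names are hypothetical and are not taken from the study; the example only shows that Bray-Curtis responds to differences in magnitude while cosine similarity depends on profile shape alone.

```python
from math import sqrt


def bray_curtis_similarity(x, y):
    """Standard Bray-Curtis similarity for non-negative vectors."""
    denom = sum(a + b for a, b in zip(x, y))
    if denom == 0:
        # Assumption: two all-zero profiles are treated as identical.
        return 1.0
    return 1.0 - sum(abs(a - b) for a, b in zip(x, y)) / denom


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


# Hypothetical expression values of three genes across four tissues.
gene_a = [10.0, 20.0, 5.0, 1.0]
gene_b = [12.0, 18.0, 6.0, 1.0]       # close to gene_a in shape and magnitude
gene_c = [100.0, 200.0, 50.0, 10.0]   # same shape as gene_a, 10x the magnitude

bc_ab = bray_curtis_similarity(gene_a, gene_b)
bc_ac = bray_curtis_similarity(gene_a, gene_c)
cos_ac = cosine_similarity(gene_a, gene_c)
```

Here `bc_ab` is high because the two profiles nearly coincide, `bc_ac` is low because the magnitudes differ tenfold, and `cos_ac` is (up to rounding) 1 because the two profiles are proportional.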
Figure 1. The general framework for predicting genes with similar tissue-wide expression profiles and TF targets.
Red and blue contents are respectively specific to prediction of genes with similar tissue-wide expression profiles and prediction of TF targets. (A) An overview of the ML framework. The steps enclosed in the dashed rectangle vary across prediction of genes with similar tissue-wide expression profiles and TF targets. The step with a dash-dotted border that intersects promoters with DHSs is a variant of the primary approach. In the IDBC algorithm (Additional file 1 [22]), the parameter I is the minimum threshold on the total information contents of TFBS clusters. In prediction of genes with similar tissue-wide expression profiles, the minimum value was 939, which was the sum of mean information contents (R values) of all 94 iPWMs; in prediction of direct targets, this value was the R value of the single iPWM used to detect TFBSs. The parameter d is the radius of initial clusters in base pairs, whose value, 25, was determined empirically. The seven ML features derived from TFBS clusters are described in the Methods section. The performance of seven different classifiers was evaluated with ROC curves and 10-fold cross validation (Additional file 1 [22]). (B) Obtaining the positives and negatives for identifying genes with similar tissue-wide expression profiles to a given gene (Additional file 2 [22]). (C) Obtaining the positives and negatives for predicting target genes of seven TFs using the CRISPR-generated perturbation data in K562 cells (Additional file 3 [22]). (D) Obtaining the positives and negatives for predicting target genes of 11 TFs using the siRNA-generated knockdown data in GM19238 cells (Additional file 4 [22]).
Figure 2. GTEx tissue-wide expression profiles of NR3C1, SLC25A32 and TANK.
Visualization of the expression values (in RPKM) of these genes across 53 tissues from GTEx. For each gene, the colored rectangle belonging to each tissue indicates the valid RPKM of all samples in the tissue, the black horizontal bar in the rectangle indicates the median RPKM, the hollow circles indicate the RPKM of the samples considered as outliers, and the grey vertical bar indicates the sampling error. A comparison of the panels shows that the overall expression patterns of the three genes across the 53 tissues resemble each other (e.g. all three genes exhibit the highest expression levels in lymphocytes and the lowest levels in brain tissues).
Figure 3. Comparison between the performance of different classifiers in prediction of genes with similar tissue-wide expression profiles to NR3C1.
(A) ROC curves and AUC of seven classifiers without intersecting promoters with DHSs. (B) ROC curves and AUC of seven classifiers after intersecting promoters with DHSs. The Decision Tree classifier exhibited the largest AUC under both scenarios, and inclusion of DHS information significantly improved the AUC of the other classifiers, except for Naïve Bayes.
The Decision Tree classifier performance for predicting TF targets using the CRISPR-generated knockdown data.
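| TF | Sensitivity (excluding DHS information) | Specificity (excluding DHS information) | Accuracy (excluding DHS information) | Sensitivity (including DHS information) | Specificity (including DHS information) | Accuracy (including DHS information) |
|---|---|---|---|---|---|---|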
| EGR1 | 0.58 | 0.62 | 0.60 | 0.78 | 0.81 | 0.80 |
| ELF1 | 0.59 | 0.65 | 0.62 | 0.83 | 0.87 | 0.85 |
| ELK1 | 0.59 | 0.59 | 0.59 | 0.80 | 0.81 | 0.81 |
| ETS1 | 0.59 | 0.6 | 0.59 | 0.81 | 0.81 | 0.81 |
| GABPA | 0.55 | 0.57 | 0.56 | 0.72 | 0.75 | 0.74 |
| IRF1 | 0.54 | 0.55 | 0.54 | 0.76 | 0.64 | 0.70 |
| YY1 | 0.50 | 0.51 | 0.51 | 0.45 | 0.69 | 0.57 |
†The average performance of 10 rounds of 10-fold cross validation when setting ε to 1.05 is indicated. The accuracy of each individual round is indicated in Additional file 5 [22]. The CRISPR-generated knockdown data were obtained from Dixit et al. [16].
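For readers unfamiliar with the three measures reported in the table, the sketch below computes them from confusion-matrix counts using their standard definitions. The counts and the function name `classification_metrics` are hypothetical and chosen only for illustration; they are not the study's data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # fraction of true targets recovered
    specificity = tn / (tn + fp)                 # fraction of non-targets rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # fraction of all genes classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts for a balanced set of 100 positives and 100 negatives.
sens, spec, acc = classification_metrics(tp=78, fn=22, tn=81, fp=19)
```

When the positive and negative sets are the same size, as in this hypothetical example, accuracy is the mean of sensitivity and specificity.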
The Decision Tree classifier performance for predicting TF targets using the siRNA-generated knockdown data.
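| TF | Sensitivity (excluding DHS information) | Specificity (excluding DHS information) | Accuracy (excluding DHS information) | Sensitivity (including DHS information) | Specificity (including DHS information) | Accuracy (including DHS information) |
|---|---|---|---|---|---|---|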
| BATF | 0.96 | 0.97 | 0.96 | 0.85 | 1 | 0.93 |
| JUND | 0.86 | 0.90 | 0.88 | 0.80 | 1 | 0.90 |
| NFE2L1 | 0.92 | 0.95 | 0.94 | 0.71 | 0.93 | 0.82 |
| PAX5 | 0.96 | 0.97 | 0.96 | 0.88 | 0.98 | 0.93 |
| POU2F2 | 0.97 | 0.97 | 0.97 | 0.89 | 0.99 | 0.94 |
| RELA | 0.95 | 0.96 | 0.96 | 0.83 | 0.97 | 0.90 |
| RXRA | 0.93 | 0.91 | 0.92 | 0.84 | 0.95 | 0.89 |
| SP1 | 0.98 | 0.98 | 0.98 | 0.89 | 0.99 | 0.94 |
| TCF12 | 0.98 | 0.98 | 0.98 | 0.86 | 0.99 | 0.93 |
| USF1 | 0.97 | 0.98 | 0.97 | 0.83 | 0.98 | 0.90 |
| YY1 | 1 | 1 | 1 | 0.55 | 0.99 | 0.77 |
†The average performance of 10 rounds of 10-fold cross validation is indicated. The accuracy of each individual round is shown in Additional file 5 [22]. The siRNA-generated knockdown data were obtained from Cusanovich et al. [13].
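The evaluation protocol named in the footnote, 10 rounds of 10-fold cross validation averaged into one figure, can be sketched roughly as follows. Everything here is an assumption made for illustration: the helper names, the toy one-feature data, and in particular the `threshold_stub` classifier, which is a simple stand-in and not the Decision Tree classifier used in the study.

```python
import random
from statistics import mean


def kfold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv_accuracy(X, y, fit_predict, rounds=10, k=10, seed=0):
    """Return one accuracy per round of k-fold cross validation."""
    rng = random.Random(seed)
    per_round = []
    for _ in range(rounds):
        correct = 0
        for fold in kfold_indices(len(X), k, rng):
            held_out = set(fold)
            train = [i for i in range(len(X)) if i not in held_out]
            preds = fit_predict([X[i] for i in train],
                                [y[i] for i in train],
                                [X[i] for i in fold])
            correct += sum(p == y[i] for p, i in zip(preds, fold))
        per_round.append(correct / len(X))
    return per_round


def threshold_stub(train_X, train_y, test_X):
    """Stand-in classifier (not a decision tree): cut at the midpoint of the class means."""
    pos = [x for x, label in zip(train_X, train_y) if label == 1]
    neg = [x for x, label in zip(train_X, train_y) if label == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return [1 if x > cut else 0 for x in test_X]


# Hypothetical toy data: one feature, two well-separated classes.
X = [float(v) for v in range(100)]
y = [0] * 50 + [1] * 50

round_accuracies = repeated_cv_accuracy(X, y, threshold_stub)
average_accuracy = mean(round_accuracies)
```

Reshuffling before each round means every round uses a different partition into folds, which is why the per-round accuracies can differ and an average over rounds is reported.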
Figure 4. Accuracy of the Decision Tree classifier when using three different values for ε.
Each accuracy value was averaged from 10 rounds of 10-fold cross validation. Three values were tested for ε, the minimum threshold on the average fold change in gene expression levels (across all guide RNAs) of the TF: 1.01, 1.05 and 1.1. The accuracy value of each individual round is indicated in Additional file 5 [22]. As ε increased, accuracy for all seven TFs monotonically increased.
Intersection of TF targets and 500 protein-coding genes with the most similar tissue-wide expression profiles.
| TF | Cell line | Number of targets | Size of intersection | Targets among the most similar 10 genes |
|---|---|---|---|---|
| EGR1 | K562 | 169 | 12 | None |
| ELF1 | K562 | 78 | 5 | None |
| ELK1 | K562 | 112 | 4 | |
| ETS1 | K562 | 267 | 15 | None |
| GABPA | K562 | 513 | 25 | TAF1 (1st) |
| IRF1 | K562 | 457 | 10 | None |
| YY1 | K562 | 1752 | 127 | |
| YY1 | GM19238 | 1040 | 61 | |
| BATF | GM19238 | 186 | 21 | None |
| JUND | GM19238 | 44 | 2 | None |
| NFE2L1 | GM19238 | 58 | 4 | None |
| RELA | GM19238 | 247 | 13 | |
| RXRA | GM19238 | 181 | 3 | None |
| SP1 | GM19238 | 1595 | 81 | None |
| TCF12 | GM19238 | 655 | 20 | None |
| USF1 | GM19238 | 301 | 21 | None |
| PAX5 | GM19238 | 918 | 86 | |
| POU2F2 | GM19238 | 532 | 26 | |
§The rank of each target in the list of similar genes in the descending order of Bray-Curtis similarity values is shown in the brackets immediately following the target.
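The intersection and ranks summarized in the table can be illustrated with a short sketch: rank genes by similarity in descending order, keep the top N, and report which of them are also TF targets together with their ranks. The gene names, similarity values, and the function name `targets_among_top_similar` are hypothetical placeholders, not data from the study.

```python
def targets_among_top_similar(similarity_by_gene, targets, top_n=500):
    """Return (gene, rank) pairs for targets found among the top_n most similar genes.

    Ranks are 1-based and follow descending similarity.
    """
    ranked = sorted(similarity_by_gene, key=similarity_by_gene.get, reverse=True)
    return [(gene, rank)
            for rank, gene in enumerate(ranked[:top_n], start=1)
            if gene in targets]


# Hypothetical similarity values of four genes to a given TF gene.
similarity = {"GENE_A": 0.95, "GENE_B": 0.90, "GENE_C": 0.85, "GENE_D": 0.40}
# Hypothetical target set of that TF.
tf_targets = {"GENE_B", "GENE_D", "GENE_X"}

hits = targets_among_top_similar(similarity, tf_targets, top_n=3)
```

With `top_n=3`, only `GENE_B` (rank 2) is reported: `GENE_D` is a target but falls outside the top three, and `GENE_X` has no similarity value at all.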
Figure 5. Mutation analyses on the target MCM7 of EGR1.
This figure depicts the effect of a mutation in each EGR1 binding site cluster of the MCM7 promoter on the expression level of MCM7, which is a target of the TF EGR1. The strongest binding site in each cluster was abolished by a single nucleotide variant. Upon loss of all three clusters, only weak binding sites remained and EGR1 was predicted to no longer be able to effectively regulate MCM7 expression. Multiple clusters in the promoters of TF targets confer robustness against mutations within individual binding sites that define these clusters.
Mutation analyses on promoters of TF targets.
| TF | Target | Normal cluster | Normal binding site
[ | SNP ID
[ | Variant binding site
[ | Variant
| Classifier
| ||
|---|---|---|---|---|---|---|---|---|---|
| Variant
[ | Wild-
| ||||||||
| EGR1
|
| Cluster 1
|
| rs538610162
|
| Abolished | √ | × | √ |
| rs759233998
| GA
| Abolished | √ | ||||||
| rs974735901
| GAGGGGGC
| Cluster 1
| √ | ||||||
| rs978230260
| GAGGGGGCA
| Abolished | √ | ||||||
| Cluster 2
|
| rs764734511
|
| Cluster 2
| √ | ||||
|
| Cluster 2
| √ | |||||||
| rs996639427
| GCGTGCGT
| Abolished | √ | ||||||
| GCGT
| GCGT
| ||||||||
| rs1027751538
| GCGTGGGC
| Abolished | √ | ||||||
|
|
| Cluster 2
| √ | ||||||
| ELF1
|
| Cluster 1
| GC
| rs760968937
|
| Cluster 1
| √ | √ | √ |
| GCGGAAG
| Cluster 1
| √ | × | ||||||
| rs1000196206
| GC
| Abolished | √ | ||||||
| rs144759258
| GCG
| Abolished | √ | ||||||
| rs966435996
| GCGG
| Abolished | √ | ||||||
| rs950986427
| GCGGAAGC
| Cluster 1
| √ | ||||||
| Cluster 2
|
| rs373649904
|
| Abolished | √ | ||||
| rs926919149
| CAG
| Abolished | √ | ||||||
| rs751263172
| CAGG
| Abolished | √ | ||||||
| rs369076253
| CAGGAGATGC
| Cluster 2
| √ | ||||||
|
|
| Cluster 2
| √ | √ | |||||
| ELK1
|
| Cluster 1
| C
|
|
| Cluster 1
| √ | √ | √ |
| rs887606802
| C
| Cluster 1
| √ | × | |||||
| rs1021034916
| CA
| Cluster 1
| √ | ||||||
| GAGGA
| rs941962117
| GAGGA
| Abolished | √ | |||||
| Cluster 2
|
| rs896117033
| CTGGAAGAG
| Cluster 2
| √ | ||||
| rs971962577
| CTGGAAGA
| Cluster 2
| √ | ||||||
| rs1011969709
|
| Abolished | √ | ||||||
| CCA
| CCA
| ||||||||
|
|
| Cluster 2
| √ | √ | |||||
| ETS1
|
| Cluster 1
| GCA
| rs1022234223
| GCA
| Abolished | × | × | √ |
|
|
| Cluster 1
| √ | √ | |||||
| GABPA
|
| Cluster 1
| A
| rs997328042
| A
| Abolished | × | × | √ |
| rs1020720126
| ACA
| Abolished | × | ||||||
| T
| rs185306857
| T
| Cluster 1
| √ | |||||
|
|
| Cluster 1
| √ | ||||||
|
|
| Cluster 1
| √ | ||||||
| IRF1
|
| Cluster 1
|
| rs950528541
|
| Cluster 1
| √ | × | √ |
| rs886259573
| G
| Cluster 1
| √ | ||||||
| rs982931728
| GAG
| Cluster 1
| √ | ||||||
| rs1020218811
| GAGAA
| Cluster 1
| √ | ||||||
| rs570723026
| GAGAATGAA
| Cluster 1
| √ | ||||||
| rs1004825794
| GAGAATGAAAGC
| Cluster 1
| √ | ||||||
| GAGAATGAAAGC
| Cluster 1
| √ | |||||||
| AA
| rs1030185383
| AAGACCAA
| Cluster 1
| √ | |||||
| rs5874306
| AAGACCAAAGCAG
| Cluster 1
| √ | ||||||
|
|
| Cluster 1
| √ | √ | |||||
| YY1
|
| Cluster 1 of 1 |
| rs865922947
|
| Cluster 1 | √ | × | √ |
| rs946037930
| GC
| Cluster 1 | √ | ||||||
| rs917218063
| GCG
| Abolished | × | ||||||
| rs928017336
| GCGGC
| Abolished | × | ||||||
| GCCGCCCCCGTC
| |||||||||
§All coordinates are based on the hg38 genome assembly. A bold italic letter in a binding site sequence indicates the base where a SNP occurs. For each normal and variant binding site sequence, the genome coordinate of its most 5’-end base and its R value are indicated. The negative R value of a variant binding site sequence implies this site is abolished. The SNPs strengthening binding sites and corresponding variant binding site sequences are underlined.
‡The impact on whether the occurrence of a single SNP resulted in the disappearance of the cluster containing it is shown; ‘Abolished’ indicates that the cluster is eliminated by the existence of the variant allele.
†After a single SNP occurred or multiple SNPs simultaneously occurred, the classifier produced a new prediction on whether the TF is still capable of significantly affecting gene expression via the variant promoter.