
Estimating the effects of transcription factors binding and histone modifications on gene expression levels in human cells.

Lu-Qiang Zhang1, Qian-Zhong Li1.   

Abstract

Transcription factors and histone modifications are vital for the regulation of gene expression. To estimate the effects of transcription factor binding and histone modifications on gene expression, we construct a statistical model from genome-wide binding data for 15 transcription factors, profiles of 10 histone modifications and DNase-I hypersensitivity data in three mammalian cell lines. Our results show that POLR2A and H3K36me3 predict gene expression strongly and consistently in all three cell lines, and that H3K4me3, H3K27me3 and H3K9ac are more reliable predictors than other histone modifications in human embryonic stem cells. Moreover, genome-wide statistical redundancies exist within and between transcription factors and histone modifications, and these redundancies may be caused by the regulation mechanism. We further find that, even though transcription factors and histone modifications offer similar effects on the expression levels of genes genome-wide, their effects on predictive ability differ for genes in independent biological processes.

Keywords:  Chromosome Section; DNase-I hypersensitivity; histone modifications; regulation mechanism; statistical redundancy; transcription factors

Year:  2017        PMID: 28454114      PMCID: PMC5522221          DOI: 10.18632/oncotarget.16988

Source DB:  PubMed          Journal:  Oncotarget        ISSN: 1949-2553


INTRODUCTION

Earlier studies [1-4] showed that transcription factor (TF) binding and histone modifications (HMs) are critical for gene expression, and that abnormalities in TF binding and HMs may affect cell fate, such as differentiation and apoptosis [5]. The ability to understand and predict their effects is vital for developing treatments for hundreds of human diseases, including leukemia [6], diabetes [7] and various cancers such as prostate cancer [8, 9], lung cancer [10] and breast cancer [11, 12]. The most significant regulation of mammalian gene expression is deemed to occur at the level of transcriptional initiation and elongation [13]. TFs can activate or block the initiation of gene transcription by binding to specific DNA sequences in enhancers or promoters [14, 15] or by recruiting chromatin-modifying enzymes that induce changes in chromatin structure [16]. HMs are recognized to activate or inhibit transcription either by modulating the local chromatin structure to control TF accessibility [17] or by directly recruiting related enzymes [18]. In previous studies analyzing the relations of HMs and TF binding to gene expression, Cheng et al. [19] found that HMs or TF binding at different positions show different predictive abilities, and suggested that HMs and TF binding may be redundant for predicting gene expression levels. Karlic et al. [20] noticed that different combinations of HMs are needed for predicting the expression levels of genes with different promoter CpG content. In this study, we investigate the relative contribution of each TF or HM, and of their combinations, to gene expression by constructing a support vector regression (SVR) model from genome-wide binding data for 15 TFs, profiles of 10 HMs and DNase-I hypersensitivity data in three mammalian cell lines (H1-hESc, Gm12878 and K562), and verify that the findings hold across these cell lines. We further explore how TFs, HMs and gene expression interact with each other. Finally, we study the effects of TFs and HMs on prediction for genes in independent biological processes.

RESULTS AND DISCUSSIONS

The “Optimal” TFs for predicting gene expression are cell-specific

TFs can bind to specific DNA elements and stimulate or suppress gene transcription. There are approximately 1700 to 1900 TFs in human, including 1391 manually curated sequence-specific TFs [5]. In this study, we download the 57, 87 and 96 available TFs for H1-hESc (human embryonic stem cells), Gm12878 (B-lymphoblastoid cells) and K562 (erythrocytic leukemia cells), respectively; these cell lines are immortal [21] and have the most complete data [22]. The top 15 TFs for predicting gene expression levels are then chosen by stepwise regression analysis (its usage is detailed in the Supplementary information) and regarded as the “optimal” TFs for each cell line (shown in Figure 1). We observe that different “optimal” TFs are needed for different cell lines, indicating that TF binding is a dynamic process that depends on the tissue or cell line. A likely explanation is the essential difference among the three cell lines, which necessitates the selection of alternative TFs [2].
Figure 1

List of the TFs involved in the current study for H1, Gm12878 and K562

TFs and HMs predict gene expression levels

The presence or absence of some TFs and HMs is correlated with gene expression levels [1, 16, 20, 23]. To better understand the relations between TFs (HMs) and gene expression levels, we construct a log-linear model and a non-linear SVR model for three immortalized human cell lines: H1-hESc, Gm12878 and K562. The predictive power (R2) of the two models in 10-fold cross-validation is shown in Table 1 and Supplementary file Table S1.
Table 1

Prediction accuracy of log-linear and SVR model

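| Cell line | Model | TFs | HMs+DNase | TFs+HMs+DNase |
|---|---|---|---|---|
| H1 | log-linear regression | 0.404 | 0.529 | 0.555 |
| H1 | SVR | 0.544 | 0.594 | 0.635 |
| Gm12878 | log-linear regression | 0.495 | 0.668 | 0.649 |
| Gm12878 | SVR | 0.617 | 0.719 | 0.730 |
| K562 | log-linear regression | 0.527 | 0.641 | 0.633 |
| K562 | SVR | 0.627 | 0.690 | 0.688 |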

The CV-R2 is the average R2 for the 10 fold cross-validation.

The results show that TFs, HMs and DNase correlate more strongly with gene expression levels in the SVR model than in the log-linear model, which may result from non-linear relationships between TFs (HMs) and gene expression [19, 24]. Therefore, the SVR model is applied in the remainder of this work, despite a marked increase in required CPU time.
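The CV-R2 reported in Table 1 is an average of held-out R2 values over 10 folds. The following is a minimal sketch of that averaging, assuming a single-predictor least-squares fit as a stand-in for the actual log-linear and SVR models; the function names and toy data are illustrative and not taken from the paper.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def cv_r2(x, y, k=10, seed=0):
    """Average R2 on the held-out fold over k folds (stand-in model: fit_line)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        pred = [slope * x[i] + intercept for i in fold]
        scores.append(r_squared([y[i] for i in fold], pred))
    return sum(scores) / k

# Toy data (hypothetical): one noisy linear predictor of expression.
rng = random.Random(1)
signal = [rng.uniform(0, 10) for _ in range(200)]
expression = [0.5 * s + 1.0 + rng.gauss(0, 0.5) for s in signal]
print(round(cv_r2(signal, expression), 3))
```

Swapping `fit_line` for another regressor leaves the fold logic unchanged, which is why the same CV-R2 can be compared across the two model types.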

Different HMs and TFs are required for predicting gene expression levels

To check whether all HMs (TFs) are equally important for predicting gene expression, we construct SVR models for all possible combinations of the 10 HMs and DNase or of the 15 TFs, which results in 2047 ($2^{11}-1$) HMs+DNase combination modes and 32767 ($2^{15}-1$) TFs combination modes. The detailed information and statistical results are given in Supplementary tables 1-6 and Figure 2. The distributions of the Pearson correlation coefficient (PCC) for the 2047 HMs+DNase combination modes in H1-hESc, Gm12878 and K562 are shown in Figure 2A, 2C and 2E, respectively, and those for the 32767 TFs combination modes in Figure 2B, 2D and 2F. The maximum PCC for combination modes with each number of HMs or TFs is connected by a black curve. The predictive power essentially reaches its plateau at the best combination of four HMs or four TFs. The combination modes with maximum prediction accuracy for four factors (i.e. four HMs or four TFs) are listed in Table 2. These results show that not all HMs+DNase or TFs are equally important, and that there are statistical redundancies within HMs (TFs).
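The mode counts follow from enumerating every non-empty subset of the factors. The sketch below checks those counts and shows how the best mode per subset size (the black curve in Figure 2) could be picked; the `score` callback and the toy weights are hypothetical placeholders for the PCC of a fitted SVR model.

```python
from itertools import combinations
from math import comb

def count_modes(n_factors):
    """Number of non-empty subsets of n_factors factors."""
    return sum(comb(n_factors, c) for c in range(1, n_factors + 1))

def best_mode_per_size(factors, score):
    """For each subset size, return the subset with the highest score."""
    best = {}
    for size in range(1, len(factors) + 1):
        best[size] = max(combinations(factors, size), key=score)
    return best

print(count_modes(11), count_modes(15))

# Toy score (made-up weights, NOT results from the paper): additive contributions.
toy_weights = {"H3K36me3": 0.5, "H3K4me3": 0.3, "H3K9me3": 0.1}
toy_best = best_mode_per_size(list(toy_weights), lambda m: sum(toy_weights[f] for f in m))
print(toy_best[2])
```

Exhaustive enumeration is feasible here because 15 factors give only 32767 subsets; the cost is dominated by fitting one model per subset.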
Figure 2

The PCC distributions for all combinations of 15 TFs or 10 HMs and DNase

A., B. H1, C., D. Gm12878 and E., F. K562 cell line. X-axis represents the combination of c kinds of HMs and DNase (choose c out of 10 HMs and DNase, c = 1,2,…,11) or d kinds of TFs (choose d out of 15 TFs, d = 1,2,…,15), and the black curves represent the maximum PCC for the combination mode of c HMs and DNase or the combination mode of d TFs.

Table 2

The combination modes of the maximum prediction accuracy for four factors

| Cell line | Factor | Components of the combination | PCC |
|---|---|---|---|
| H1 | TFs | POLR2A, SIX5, MAX, SUZ12 | 0.725 |
| H1 | HMs+DNase | H3K36me3, H3K27me3, H3K4me3, H3K9me3 | 0.763 |
| Gm12878 | TFs | GABPA, NFATC1, POLR2A, TCF3 | 0.789 |
| Gm12878 | HMs+DNase | H3K79me2, H3K36me3, H3K27me3, H3K4me3 | 0.845 |
| K562 | TFs | ELF1, PML, POLR2A, ZBTB7A | 0.791 |
| K562 | HMs+DNase | H3K36me3, H3K79me2, H3K9me3, H3K27me3 | 0.830 |

In addition, to further identify which HMs contribute more to predicting gene expression, we focus on the combination modes of four HMs. We study all four-HM modes whose PCC reaches at least 95% of that of the all-HMs mode (PCCall_H1 = 0.786, PCCall_Gm12878 = 0.852 and PCCall_K562 = 0.836). This leaves 58, 116 and 117 combination modes for H1-hESc, Gm12878 and K562, respectively, which is large enough for an over-representation analysis. By counting how often each HM appears in these combination modes, we find the following results (see Figure 3):
Figure 3

The appearance frequency of each HM in the studied modes

A. The frequency of each HM in the H1 cell line, where the integer is the number of occurrences in the studied modes. B. Venn diagram showing the co-occurrence counts of the four important HMs. C. and D. The frequency of each HM in Gm12878 and K562.

Firstly, H3K36me3 appears in all of these modes for the three cell lines, and it may be vital for gene expression. Good predictive results (PCCH3K36me3_H1 = 0.496, PCCH3K36me3_Gm12878 = 0.698 and PCCH3K36me3_K562 = 0.750) for gene expression levels are obtained by using H3K36me3 alone as the information parameter. Our results are consistent with previous work: Hahn et al. showed that H3K36me3 is an intragenic mark of active genes and is associated with two categories of genes [25], and Nanty et al. noticed that H3K36me3 is bimodal in the gene body, which would influence DNA methylation levels and help shape gene-body CpG density profiles [26].

Secondly, for H1-hESc, each of H3K9ac, H3K27me3 and H3K4me3 appears in nearly half of the 58 combination modes (53.45%, 43.10% and 43.10%, respectively), while the other HMs appear in at most 29.31% of the 58 modes (shown in Figure 3A). Thus, H3K9ac, H3K27me3 and H3K4me3 are more reliable information parameters than other HMs in H1-hESc, which is consistent with a previous study [23]. Furthermore, we count how often H3K9ac, H3K27me3, H3K4me3 and H3K36me3 appear together (shown in Figure 3B). H3K4me3 and H3K9ac appear together only seven times in the 58 modes; the information they carry is probably not needed simultaneously because it is redundant, which is supported by their high correlation (PCC = 0.905). H3K4me3 and H3K27me3 (H3K27me3 and H3K9ac) occur together eight times in the 58 modes, and the correlation between H3K4me3 and H3K27me3 (H3K27me3 and H3K9ac) is PCC = 0.507 (PCC = 0.502), suggesting that they are partially redundant. However, H3K36me3 combined with H3K4me3, H3K27me3 or H3K9ac appears 23, 25 and 31 times, respectively, showing that the information they provide may be non-redundant; indeed, the corresponding correlations are PCC = 0.097, PCC = 0.203 and PCC = 0.202.

Thirdly, for the Gm12878 and K562 cell lines, even though the HMs other than H3K36me3 appear at similar frequencies (about 30%, see Figure 3C and 3D), the combination of H3K36me3 and H3K79me2 effectively increases the predictive power: four-HM modes containing this combination reach at least 97.59% of the predictive accuracy of the all-HMs mode.

Similarly, we focus on the four-TF modes whose PCC reaches at least 95% of that of the all-TFs mode (PCCall_H1 = 0.753, PCCall_Gm12878 = 0.799 and PCCall_K562 = 0.802); 85, 172 and 345 modes remain for H1-hESc, Gm12878 and K562, respectively. POLR2A is present in all studied modes for the three cell lines and can faithfully model gene expression levels (PCCPOLR2A_H1 = 0.661, PCCPOLR2A_Gm12878 = 0.677 and PCCPOLR2A_K562 = 0.730). Previous research has shown the importance of this factor, which is linked to the synthesis of messenger RNA [27, 28]. For the K562 cell line, we also find that four-TF modes containing the combination of POLR2A and ZBTB7A reach at least 97.58% of the all-TFs mode. Finally, to verify whether the above inferences depend on the four-factor modes, we perform the same analysis for five-factor and six-factor modes and obtain analogous results.
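The over-representation analysis above amounts to two steps: keep the four-factor modes whose PCC is at least 95% of the all-factor PCC, then count how often each factor appears in the kept modes. A minimal sketch follows, assuming the modes are available as a mapping from factor tuples to PCC values; apart from the all-HMs PCC of 0.786 and the best H1 mode from Table 2, the toy values are made up for illustration.

```python
from collections import Counter

def near_optimal_modes(mode_pcc, pcc_all, fraction=0.95):
    """Keep the modes whose PCC is at least `fraction` of the all-factor PCC."""
    cutoff = fraction * pcc_all
    return [mode for mode, pcc in mode_pcc.items() if pcc >= cutoff]

def appearance_frequency(modes):
    """Fraction of the given modes in which each factor appears."""
    counts = Counter(factor for mode in modes for factor in mode)
    return {factor: n / len(modes) for factor, n in counts.items()}

# Toy input: first entry is from Table 2 (H1); the other two PCCs are hypothetical.
toy_modes = {
    ("H3K36me3", "H3K27me3", "H3K4me3", "H3K9me3"): 0.763,
    ("H3K36me3", "H3K9ac", "H3K4me1", "H3K27me3"): 0.750,
    ("H3K4me1", "H3K4me2", "H3K27ac", "H3K9ac"): 0.600,
}
kept = near_optimal_modes(toy_modes, pcc_all=0.786)
freq = appearance_frequency(kept)
print(len(kept), freq["H3K36me3"], freq["H3K4me3"])
```

A factor with frequency 1.0 among the kept modes, as reported for H3K36me3 and POLR2A, is one that no near-optimal mode can do without.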

TFs and HMs provide similar effect on predicting genome-wide gene expression

As shown in Table 1, the TFs model and the HMs model both obtain high predictive power, and the TF+HM+DNase model achieves only similar predictive accuracy, indicating that TF binding and HMs may offer similar effects on genome-wide gene expression. To quantify this, the PCC between the values predicted by the TFs model and those predicted by the HMs model is calculated for each of the three cell lines. The strong correlations (PCCH1 = 0.827, PCCK562 = 0.908 and PCCGm12878 = 0.895) support that TFs and HMs offer similar effects on genome-wide gene expression and show that statistical redundancies also exist between TFs and HMs. Although the TF+HM+DNase model does not obtain clearly improved predictive ability, it tends to be more stable than the TFs or HMs model (i.e. it has a smaller RMSE between R2 and CV-R2).
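The redundancy check compares two vectors of per-gene predictions with the Pearson correlation coefficient. As a small illustration, the sketch below computes the PCC from its definition; the two prediction lists are hypothetical and only show the call pattern.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical per-gene predictions from a TFs model and an HMs model.
pred_tf = [1.2, 3.4, 0.5, 2.8, 4.1]
pred_hm = [1.0, 3.1, 0.9, 2.5, 4.4]
print(round(pearson(pred_tf, pred_hm), 3))
```

A value close to 1 means the two models rank and scale genes almost identically, which is the sense in which the text calls TFs and HMs statistically redundant.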

Regulation mechanism leads to statistical redundancy

To investigate the fundamental source of the statistical redundancies among factors, the PCCs between and within TFs and HMs are calculated for the three cell lines (see Figure 4). High correlations among these factors indicate that the statistical redundancies may come from the regulation mechanism (i.e. two factors have similar regulatory functions). To verify this supposition, the target genes of TFs or HMs are predicted with the software BETA [29]. Then, for the H1-hESc cell line, the co-regulated and solo-regulated targets are counted for pairs of TFs (HMs) whose PCC > 0.85 within TFs (HMs) and for TF-HM pairs whose PCC > 0.70. The co-regulated genes far outnumber the solo-regulated genes for those factors (Figure 5 and Supplementary Figure S1; similar work is done for Gm12878 and K562, not shown), which supports our inference. It is worth noting that similar regulatory functions have been demonstrated for some factors: for instance, CEBPB and SP1, which are strongly correlated, can both activate the expression of the insulin receptor gene [30], and enrichment of H3K4me2 or H3K4me3 at the TSS is positively correlated with the extent of gene activity [31].
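Once target genes have been predicted for two factors, the co-regulated and solo-regulated counts are plain set operations. A minimal sketch, assuming each factor's targets are given as a collection of gene identifiers (the identifiers below are placeholders, not BETA output):

```python
def split_targets(targets_a, targets_b):
    """Partition the targets of two factors into co- and solo-regulated sets."""
    a, b = set(targets_a), set(targets_b)
    return {"co": a & b, "solo_a": a - b, "solo_b": b - a}

# Placeholder gene identifiers for two hypothetical, highly correlated factors.
parts = split_targets(["G1", "G2", "G3", "G4"], ["G2", "G3", "G4", "G5"])
print({name: len(genes) for name, genes in parts.items()})
```

The three set sizes correspond to the three regions of each Venn diagram in Figure 5.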
Figure 4

Heatmaps of PCC both within TFs (HMs) and between TFs and HMs for the three cell lines

A., B. and C. represent H1, Gm12878 and K562 cell lines, respectively.

Figure 5

Venn diagram shows the number of the co-regulated and solo-regulated genes within and between TFs and HMs

Blue depicts the co-regulated target genes; pink and purple represent the genes solo-regulated by the factors labeled on each chart.

Construction of TFs, HMs and gene expression interaction network

To further investigate how TFs and HMs interact with each other and how they affect gene expression, interaction networks among TFs, HMs and gene expression are constructed. The partial correlation coefficient is used to estimate the inherent relationship between each pair of factors, and these coefficients form the edges of the networks. The entire process is done with the R package ‘GeneNet_1.2.13’. Finally, the 60 most significant edges are selected for visualization (Figure 6 and Supplementary Figure S2).
Figure 6

The interaction network among TFs, HMs and gene expression for H1 cell line

In the network, nodes represent TFs, HMs and gene expression. Edges show the partial correlation coefficient between each pair of factors, where dashed lines represent negative correlations and solid lines represent positive correlations. The bolder the line, the stronger the correlation it represents.

For the three cell lines, we notice that H3K36me3 and POLR2A have direct correlations with gene expression levels and both promote the expression of genes, which may explain why H3K36me3 and POLR2A are important in section 2.3. Moreover, there is a high positive correlation between H3K4me1 and H3K4me2 (and between H3K4me2 and H3K4me3) in all three cell lines, whereas the high positive correlation between ATF2 and SP4 (and between USF1 and USF2) is cell-line specific. In addition, the interaction networks show that gene expression is affected not only by TFs and HMs individually but also by the interactions among factors (detailed in the legend of Figure 6). To check the robustness of the networks, we run 50 simulations, each randomly removing 200 genes, and the same networks are found.
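A partial correlation measures the association between two factors after removing what a third variable explains, which is why it is used for network edges instead of the plain PCC. The sketch below uses the textbook first-order formula for three variables purely as an illustration; it is not the GeneNet procedure, which estimates partial correlations for all factors jointly.

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# If x and y correlate only through z (r_xy = r_xz * r_yz), the edge vanishes.
print(partial_corr(0.25, 0.5, 0.5))
# If z is unrelated to both, the partial correlation equals the plain one.
print(partial_corr(0.5, 0.0, 0.0))
```

This is the sense in which an edge in Figure 6 represents a direct relationship: two factors that are correlated only through a shared third factor receive a partial correlation near zero.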

The effects of TFs and HMs on prediction are different for genes in independent biological processes

In section 2.4, we find that the TFs and HMs models offer similar predictive power for genome-wide gene expression. To further investigate the effects of TFs and HMs on prediction for genes in independent biological processes, we focus on the Gene Ontology biological processes [32, 33] of the highly expressed genes in the three cell lines (based on RPKM values, the top fifteen percent of all genes are selected as highly expressed genes [3, 23]). Firstly, biological processes containing fewer than 30 genes are discarded, leaving 1104, 1136 and 1070 gene sets for the H1-hESc, Gm12878 and K562 cell lines, respectively. To ensure statistical validity, only gene sets for which the Benjamini-Hochberg-corrected P-value [34] of the TFs or HMs model is less than 0.05 are retained, leaving 604, 741 and 398 gene sets for H1-hESc, Gm12878 and K562. To quantify the effects of TFs and HMs on prediction for genes in independent biological processes, the ratio of the PCC of the TFs model to the PCC of the HMs model is calculated for each of these biological processes (see Supplementary tables 7-9 and Table 3). Of the 604, 741 and 398 biological processes for the three cell lines, 21, 89 and 24 processes show that HMs predict better than TFs (the ratio ranges from 0.59 to 0.90); 254, 235 and 65 processes show that TFs predict better than HMs (the ratio ranges from 1.10 to 2.01); and TFs and HMs offer similar predictive effects in the other 329, 418 and 309 processes (the ratio ranges from 0.90 to 1.10). We also notice that the same biological process can behave differently in different cell lines (shown in Table 4). In conclusion, even though TFs and HMs offer similar effects on the expression levels of genes genome-wide, their effects on predictive ability differ for genes in some independent biological processes.
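Two computations in this paragraph can be sketched briefly: the Benjamini-Hochberg adjustment used to filter gene sets, and the TF/HM PCC ratio with its three ranges. The example below is a sketch under stated assumptions: how the boundary values 0.90 and 1.10 are assigned is not specified in the text, so treating them as "similar" is an assumption, and the P-values are made up. The PCC pairs are taken from the H1 rows of Table 3.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def classify_ratio(tf_pcc, hm_pcc, low=0.90, high=1.10):
    """Return (ratio, label) for one biological process."""
    ratio = tf_pcc / hm_pcc
    if ratio < low:
        label = "HMs better"
    elif ratio > high:
        label = "TFs better"
    else:
        label = "similar"  # assumption: boundary values count as similar
    return ratio, label

print(bh_adjust([0.01, 0.04, 0.03]))  # made-up P-values
for tf_pcc, hm_pcc in [(0.591, 0.897), (0.775, 0.834), (0.845, 0.429)]:
    ratio, label = classify_ratio(tf_pcc, hm_pcc)
    print(round(ratio, 3), label)
```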
Table 3

List of three random GO-ID for each ratio range in the three cell lines

| Cell line | GO-ID | GO term | TF_PCC | HM_PCC | Ratio |
|---|---|---|---|---|---|
| H1 | GO:0010212 | response to ionizing radiation | 0.591 | 0.897 | 0.659 |
| H1 | GO:0046777 | protein autophosphorylation | 0.690 | 0.897 | 0.770 |
| H1 | GO:0016569 | covalent chromatin modification | 0.612 | 0.750 | 0.816 |
| H1 | GO:0006323 | DNA packaging | 0.775 | 0.834 | 0.929 |
| H1 | GO:0023061 | signal release | 0.890 | 0.926 | 0.961 |
| H1 | GO:0007409 | axonogenesis | 0.869 | 0.838 | 1.037 |
| H1 | GO:0007010 | cytoskeleton organization | 0.818 | 0.659 | 1.240 |
| H1 | GO:0006508 | proteolysis | 0.716 | 0.568 | 1.260 |
| H1 | GO:0030163 | protein catabolic process | 0.845 | 0.429 | 1.970 |
| Gm12878 | GO:0009117 | nucleotide metabolic process | 0.442 | 0.702 | 0.630 |
| Gm12878 | GO:0040007 | growth | 0.630 | 0.869 | 0.725 |
| Gm12878 | GO:0006875 | cellular metal ion homeostasis | 0.666 | 0.879 | 0.757 |
| Gm12878 | GO:0065007 | biological regulation | 0.629 | 0.691 | 0.910 |
| Gm12878 | GO:0016192 | vesicle-mediated transport | 0.781 | 0.805 | 0.970 |
| Gm12878 | GO:0006325 | chromatin organization | 0.836 | 0.803 | 1.041 |
| Gm12878 | GO:0045786 | negative regulation of cell cycle | 0.898 | 0.725 | 1.238 |
| Gm12878 | GO:0006629 | lipid metabolic process | 0.853 | 0.654 | 1.304 |
| Gm12878 | GO:0043087 | regulation of GTPase activity | 0.962 | 0.636 | 1.513 |
| K562 | GO:0023061 | signal release | 0.749 | 0.906 | 0.828 |
| K562 | GO:0007009 | plasma membrane organization | 0.820 | 0.933 | 0.879 |
| K562 | GO:0006396 | RNA processing | 0.583 | 0.651 | 0.894 |
| K562 | GO:0007155 | cell adhesion | 0.780 | 0.853 | 0.915 |
| K562 | GO:0030097 | hemopoiesis | 0.844 | 0.866 | 0.975 |
| K562 | GO:0030162 | regulation of proteolysis | 0.821 | 0.796 | 1.032 |
| K562 | GO:0006952 | defense response | 0.746 | 0.669 | 1.114 |
| K562 | GO:0045087 | innate immune response | 0.813 | 0.709 | 1.147 |
| K562 | GO:0051049 | regulation of transport | 0.807 | 0.608 | 1.329 |

Table 4

List of five random GO-ID where TFs and HMs model show distinct PCC for the same biological process in the different cell lines

| GO-ID | GO term | H1-TFs | H1-HMs | Gm12878-TFs | Gm12878-HMs | K562-TFs | K562-HMs |
|---|---|---|---|---|---|---|---|
| GO:0042326 | negative regulation of phosphorylation | 0.922 | 0.806 | 0.569 | 0.933 | 0.846 | 0.796 |
| GO:0009968 | negative regulation of signal transduction | 0.767 | 0.689 | 0.651 | 0.860 | 0.738 | 0.723 |
| GO:0006873 | cellular ion homeostasis | 0.942 | 0.866 | 0.668 | 0.897 | 0.693 | 0.815 |
| GO:0030003 | cellular cation homeostasis | 0.939 | 0.874 | 0.668 | 0.897 | 0.692 | 0.830 |
| GO:0055080 | cation homeostasis | 0.903 | 0.877 | 0.658 | 0.903 | 0.704 | 0.822 |

DISCUSSION

Next-generation sequencing technology [35] provides large amounts of data that make more intensive research on the interactions among TFs, HMs and DNA possible. From our analyses, the following results can be put forward. (1) The selected TFs obtain better predictive accuracy than in previous studies. Budden et al. [2] investigated the relation between core TFs and gene expression in Gm12878 by a similar method, and their predictive accuracy was only CV-R2 = 0.390, whereas the predictive accuracy in our study is CV-R2 = 0.617. This indicates that the TFs studied in our paper may contain more information than those core TFs or can functionally substitute for some core TFs. The comparison is shown in Table 5. (2) Based on the SVR model, the relationships between HMs and gene expression are investigated in Gm12878, and better results are obtained. For instance, McLeay et al. [36] studied the effects of 7 HMs and DNase on gene expression with a log-linear regression model; their predictive accuracy is CV-R2 = 0.412, whereas the predictive power in our study is CV-R2 = 0.719, which further implies non-linear relations between HMs and gene expression. Dong et al. [24] constructed a two-step model to predict gene expression levels; they use only the chromatin feature density of the ‘bestbin’ as a predictor, which ignores the information in other bins. Compared with their accuracy of PCC = 0.82, we achieve PCC = 0.85. The comparison is shown in Table 5. (3) In sections 2.3 and 2.6, we and others observe that POLR2A, H3K4me3 and H3K27me3 can activate or inhibit gene expression [27, 28, 36–38], which not only supports the accuracy of the obtained conclusions but also indicates that our model and methods may be reasonable.
Table 5

Comparison of the predictive results with other studies

| Study | Cell line | Factors | CV-R2 | Method |
|---|---|---|---|---|
| Budden’s study | Gm12878 | c-FOS, CTCF, EGR1, NRF1, NRSF, POU2F2, SP1, SRF, STAT3, USF1, YY1 | 0.390 | SVR |
| Our study | Gm12878 | CTCF, GABPA, IKZF1, JUND, MXI1, NFYB, NFATC1, SIX5, SPT20, TCF3, USF1, ZNF274, POLR2A, USF2, | 0.617 | SVR |
| McLeay’s study | Gm12878 | H3K4me1, H3K4me2, H4K20me1, H3K4me3, H3K36me3, H3K9me3, Dnase | 0.412 | log-linear regression |
| Our study | Gm12878 | H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, Dnase | 0.719 | SVR |

| Study | Cell lines | Factors | PCC | Method |
|---|---|---|---|---|
| Dong’s study | H1 / Gm12878 / K562 | H2AZ, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, Dnase | 0.79 / 0.82 / 0.84 | two-step |
| Our study | H1 / Gm12878 / K562 | H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, Dnase | 0.79 / 0.85 / 0.84 | SVR |

The bold represents co-factors in the comparison.

Though improvements have been achieved, some limitations remain. In statistical prediction, the jackknife test is deemed the least arbitrary, as demonstrated by Eqs. (28-30) in [39], and it has therefore been widely used by researchers to test the quality of information parameters (see, e.g., [40-46]). However, to reduce computational time, 10-fold cross-validation is adopted in this paper, as done by many researchers who use SVM as the prediction engine. In future work, we will endeavor to adopt a more precise test method and to provide a publicly accessible and user-friendly web server, as presented in a series of recent publications [47-51], to enhance its impact [52]. Meanwhile, more precise and faster sequence analysis tools [53, 54] will be fully utilized in follow-on work.

MATERIALS AND METHODS

Available data and implementation

The RefSeq genes of the human genome (hg19), which include transcription start site (TSS) annotations, come from the UCSC database (http://genome.ucsc.edu/cgi-bin/hgTables). Genes whose accession starts with NM (i.e. mature messenger RNA) are selected. To avoid counting alternative transcripts of the same gene, only one of the genes sharing the same TSS is retained. This leaves a set of 19120 genes for the remaining analysis. All the TF binding data, HM profiles and DNase-I hypersensitivity data for the H1-hESc, K562 and Gm12878 cell lines are downloaded from the UCSC database (detailed in Figure 1, Supplementary file Table S2 and Supplementary file Table S3). Because the DNase-I hypersensitivity data for the three cell lines are in hg18 coordinates, the UCSC liftOver tool [55] is used to convert them into hg19. For visualization, the raw data are converted to bed format with the BEDtools software [56]. The expression data of H1-hESc, Gm12878 and K562 are measured by RNA-seq. The mapped RNA-seq reads used in this paper are available in the Gene Expression Omnibus database (GSM915329 (H1-hESc), GSM958730 (Gm12878) and GSM958731 (K562)). The expression level of each gene is calculated as reads per kilobase of exon model per million mapped reads (RPKM) [57].
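For reference, RPKM scales a gene's read count by its exon length (in kilobases) and by the sequencing depth (in millions of mapped reads). A minimal sketch of this standard definition, with illustrative argument names and made-up numbers:

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)

# Made-up example: 1000 reads on a 2 kb exon model, 10 million mapped reads.
print(rpkm(1000, 2000, 10_000_000))
```

The factor $10^{9}$ is the product of $10^{3}$ (bases to kilobases) and $10^{6}$ (reads to millions of reads).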

Transcription factors binding signal

The DNA regions flanking the TSS (-10~10kb) of all RefSeq genes are separated into 100 bins, each 200 bp in size. Based on our previous study [3], the signals of TF binding are normalized by Eq. (1),

$$N^{k}_{ij} = \frac{n^{k}_{ij} \times 10^{9}}{200 \times n^{k}_{\mathrm{tag}}} \tag{1}$$

in which $N^{k}_{ij}$ is the normalized signal, $n^{k}_{ij}$ is the number of tags of the k-th TF located in the j-th bin of the i-th gene, $10^{9}$ is used to remove the difference in magnitude with RPKM, 200 is the length of the j-th bin, and $n^{k}_{\mathrm{tag}}$ is the total number of tags of the k-th TF. For the k-th TF, this results in a 19120×100 matrix N with elements $N^{k}_{ij}$ (i = 1, 2,…,19120; j = 1,2,…,100; k = 1,2,…,15).
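The binning and normalization can be sketched as follows, assuming an RPKM-style normalization consistent with the description above (tag count × $10^{9}$ divided by bin length × total tags). The strand handling and the half-open window are assumptions made for this illustration; the text does not specify them.

```python
BIN_SIZE = 200   # bp per bin
FLANK = 10_000   # bp on each side of the TSS, giving 100 bins

def bin_index(position, tss, strand="+"):
    """0-based bin of a tag in [TSS - 10 kb, TSS + 10 kb), or None if outside.

    Assumption: for minus-strand genes the offset is mirrored so that
    bin 0 is always the most upstream bin.
    """
    offset = position - tss if strand == "+" else tss - position
    if not -FLANK <= offset < FLANK:
        return None
    return (offset + FLANK) // BIN_SIZE

def normalize(tag_count, total_tags):
    """Normalized bin signal: tags * 1e9 / (bin length * total tags)."""
    return tag_count * 1e9 / (BIN_SIZE * total_tags)

# Made-up example: a TSS at 11000 and a library of 5 million tags.
print(bin_index(11_000, 11_000), normalize(10, 5_000_000))
```

For the HMs and DNase data the same idea would apply with a ±2 kb window, i.e. 20 bins of 200 bp.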

HMs and DNase binding signal

Similarly, the DNA regions flanking the TSS (-2~2kb) of all RefSeq genes are divided into 20 bins of 200 bp each. Then, the signals of HM and DNase binding are normalized by Eq. (2),

$$H^{l}_{im} = \frac{h^{l}_{im} \times 10^{9}}{200 \times h^{l}_{\mathrm{tag}}} \tag{2}$$

where $H^{l}_{im}$ is the normalized signal, $h^{l}_{im}$ is the number of tags of the l-th HM or DNase located in the m-th bin of the i-th gene, and $h^{l}_{\mathrm{tag}}$ is the total number of tags of the l-th HM or DNase. For the l-th HM or DNase, this results in a 19120×20 matrix H with elements $H^{l}_{im}$ (m = 1,2,…,20; l = 1,2,…,11).

Calculation of TFs association strength (TFAS)

For the i-th gene and the k-th TF, the TFAS is calculated by Eq. (3), where $N^{k}_{ij}$ is computed by Eq. (1), $F_{k}$ is the normalized Gaussian kernel density function, whose bandwidth is calculated by the rule of thumb [58], $d_{j}$ is the relative distance between the midpoint of the j-th bin and the TSS of the corresponding gene, and $\sigma_{k}$ is a pseudocount (details are given in the Supplementary information). For 19120 genes and 15 TFs, the TFAS profiles are denoted by the 19120×15 matrix a.

Calculation of HMs or DNase association strength (HMAS)

For the i-th gene and the l-th HM or DNase, the HMAS is calculated by Eq. (4), where $H^{l}_{im}$ is computed by Eq. (2) and $\sigma_{i}$ is a pseudocount. The HMAS profiles are denoted by the 19120×11 matrix b.

Log-linear regression model and non-linear SVR model

Combining the TFASs and HMASs with multivariate linear regression, the log-linear regression model is given by Eq. (5),

$$\log(L_{i} + \sigma) = \upsilon + \sum_{k=1}^{15} \alpha_{k}\, a_{ik} + \sum_{l=1}^{11} \beta_{l}\, b_{il} \tag{5}$$

in which $L_{i}$ is the RPKM value of the i-th gene, $\sigma$ is a pseudocount, $\upsilon$ is the intercept, $\alpha_{k}$ and $\beta_{l}$ are the regression coefficients, and $a_{ik}$ and $b_{il}$ are the elements of the matrices a and b. Based on support vector machines, an SVR model is constructed by Eq. (6),

$$f(X) = \sum_{i} \gamma_{i}\, K(X_{i}, X) + \mu \tag{6}$$

in which $\mu$ is the intercept, $K(X_{i}, X)$ is the kernel function and $\gamma_{i}$ is the Lagrange multiplier. Matrix X is the matrix a (calculated by Eq. (3)) and/or the matrix b (calculated by Eq. (4)), and $X_{i}$ is the i-th row of matrix X. The entire process is done with the libSVM software [59].
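To make the SVR prediction step concrete, the sketch below evaluates a kernel expansion of the form "intercept plus a weighted sum of kernel values between the support vectors and a new sample". The radial basis function kernel and its `width` parameter are assumptions for this illustration, since the text does not name the kernel, and the support vectors and multipliers are made up rather than taken from a trained libSVM model.

```python
from math import exp

def rbf_kernel(u, v, width):
    """Radial basis function kernel exp(-width * ||u - v||^2) (assumed kernel)."""
    return exp(-width * sum((a - b) ** 2 for a, b in zip(u, v)))

def svr_predict(x, support_vectors, multipliers, intercept, width):
    """Intercept plus the multiplier-weighted kernel values against x."""
    return intercept + sum(
        g * rbf_kernel(sv, x, width)
        for sv, g in zip(support_vectors, multipliers)
    )

# Made-up model with two support vectors in a two-feature space.
svs = [[0.0, 0.0], [1.0, 1.0]]
gammas = [2.0, -1.0]
print(svr_predict([0.5, 0.5], svs, gammas, intercept=1.0, width=0.5))
```

Because the prediction depends on X only through kernel values, a non-linear kernel lets the model capture non-linear relations between association strengths and expression, which a log-linear fit cannot.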