Literature DB >> 30886626

Prediction of Gene Expression Patterns With Generalized Linear Regression Model.

Shuai Liu1,2, Mengye Lu2, Hanshuang Li3,4, Yongchun Zuo3,4.   

Abstract

Cell reprogramming has played important roles in medical science, such as tissue repair, organ reconstruction, disease treatment, new drug development, and new species breeding. Oct4, a core pluripotency factor, has especially played a key role in somatic cell reprogramming through transcriptional control and affects the expression level of genes by its combination intensity. However, the quantitative relationship between Oct4 combination intensity and target gene expression is still not clear. Therefore, firstly, a generalized linear regression method was constructed to predict gene expression values in promoter regions affected by Oct4 combination intensity. Training data, including Oct4 combination intensity and target gene expression, were from promoter regions of genes with different cell development stages. Additionally, the quantitative relationship between gene expression and Oct4 combination intensity was analyzed with the proposed model. Then, the quantitative relationship between gene expression and Oct4 combination intensity at each stage of cell development was classified into high and low levels. Experimental analysis showed that the combination height of Oct4-inhibited gene expression decremented by a temporal exponential value, whereas the combination width of Oct4-promoted gene expression incremented by a temporal logarithmic value. Experimental results showed that the proposed method can achieve goodness of fit with high confidence.

Entities:  

Keywords:  Oct4; cell reprogramming; combination intensity; gene expression pattern; generalized linear regression model; prediction; transcription factor binding site (TFBS)

Year:  2019        PMID: 30886626      PMCID: PMC6409355          DOI: 10.3389/fgene.2019.00120

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.599


Introduction

Somatic cells can be reverted to a pluripotent stem cell by cell reprogramming. Cell reprogramming has been significant in many domains of biological and medical science, including tissue repair, organ reconstruction, disease pathogenesis, and new drug development (Wernig et al., 2007; Park et al., 2008). Earlier, the nuclear transfer method was the main method to cultivate new individuals. However, this method was very controversial in terms of ethics (Gurdon, 1958; Campbell et al., 1996; McCreath et al., 2000; Polejaeva et al., 2000). Recently, study of cells induced to reprogram through specific transcription factors became a hotspot. This method solved the problem of immune rejection of allogeneic cells. In this way, the patient-specific stem cells were obtained without ethical controversy (Lv et al., 2018; Poli et al., 2018; Stadhouders et al., 2018). As an important regulatory element, transcription factor (TF) was involved in the regulation of transcription initiation, and binding sites of TFs in promoter regions affected gene expression (Duren et al., 2017). Oct4, a core transcription factor, played an important regulatory role in stem cell self-renewal and pluripotency maintenance. It controlled the development and differentiation of early embryos and was highly expressed in a variety of stem cells, including germ cells, embryonic stem cells (ESCs), embryonic germ cells (EGCs), and embryonic tumor cells. In an experiment of mice, Oct4 was observed to play a central role in the cellular pluripotency regulatory network, which reprogramed somatic cells into induced pluripotent stem cells (iPSCs) by expressing transcription factors Oct4, Sox2, Klf4, and c-Myc ectopically (Chen et al., 2016). Another study showed that pluripotent stem cells can be obtained by adding Oct3/4, Sox2, c-Myc, and Klf4 to the fiber cells of mice (Boyer et al., 2005). Regulation of these transcription factors on target genes was achieved mainly through the interaction of feedforward systems, self-regulatory networks and other signaling pathways (Boyer et al., 2005). Oct4-binding sites in promoter regions were closely related to gene expression (Chen et al., 2016). However, the relationship between Oct4 combination intensity in promoter regions and gene expression remained unclear. Therefore, in this paper, a generalized linear regression model was proposed to analyze the relationship between gene expression and Oct4 combination intensity in promoter regions. The rest of paper was organized as follows. section Related Work introduces related work on cell reprogramming and gene expression; section Materials and Methods provides materials and methods, including source of data, the proposed generalized linear regression model and evaluation criteria of model performance; section Results and Analysis contains detailed experimental results and analysis, including the solution result and performance analysis of our proposed model, analysis of factors affecting gene expression on every stage of cell development, and applications of our proposed model in gene classification; and section Conclusion summarizes the contents of this paper.

Related Work

Previous studies reported mechanisms and methods of cell reprogramming. Earlier, Gurdon et al. applied the nuclear transfer method to cell reprogramming of Xenopus laevis (Gurdon, 1958). Campbell, McCreath, and Polejaeva cultivated cloning animals using nuclear transfer technology (Campbell et al., 1996; McCreath et al., 2000; Polejaeva et al., 2000). Håkelien and Hochedlinger analyzed a cell recombination mechanism based on nuclear fusion and nuclear transfer technology (Håkelien et al., 2002; Hochedlinger and Jaenisch, 2002). Later, Stadtfeld and Zardo analyzed the effects of specific transcription factors and epigenetic plasticity of chromatin on cell reprogramming (Stadtfeld et al., 2008; Zardo et al., 2008). Studies by Hanna and Li showed that overexpression of transcription factor Oct4 had an effect on cell reprogramming (Hanna et al., 2009; Li et al., 2009). Doege et al. elaborated the effects of the interaction of Oct4, Sox2, Klf4, and c-Myc on cell reprogramming in the early stages of cell reprogramming (Doege et al., 2012). Apostolou and Chen found that the dynamic mechanisms of chromatin change and DNA methylation had important effects on cell reprogramming (Apostolou and Hochedlinger, 2013; Chen et al., 2013). Koqa et al. analyzed the role of transcription factor Foxd1 in cell reprogramming (Koga et al., 2014). Recently, Poli and Stadhouders elaborated the roles of specific transcription factors used as inducing factors in cell reprogramming (Poli et al., 2018; Stadhouders et al., 2018). The process of cell reprogramming was closely related to the regulation of gene expression. Moreover, regulation of gene expression is the molecular basis of many life activities, including cell differentiation, morphogenesis, and ontogeny (Chen et al., 2016). Earlier, Chen and Rimsky analyzed regulation effects of cis- and trans-regulatory elements on gene expression (Rimsky et al., 1989; Chen et al., 1990). Later, Ueda et al. analyzed effects of diurnal variation of transcription factors on gene expression (Ueda et al., 2002). Patricia et al. analyzed effects of the interaction of cis- and trans-regulatory elements on gene expression (Wittkopp et al., 2004). Sullivan CS et al. studied the regulation effect of microRNAs encoded by SV40 on gene expression (Sullivan et al., 2005). Jeffery et al. found factors related to gene expression using gene expression data and binding sites of transcription factor (Jeffery et al., 2007). Han et al. found that certain types of genomic organization by SATB1 had an effect on gene expression (Han et al., 2008). Afterward, Costa et al. predicted gene expression in T cell differentiation by using histone modification and binding affinity of transcription factor via a linear mixed model (Costa et al., 2011). Maienscheincline et al. searched for target genes regulated by transcription factors based on some information, including binding sites of transcription factors and target genes (Maienschein-Cline et al., 2012). MT and Holoch analyzed the effects of specific transcription factors and the regulation effect of RNA on gene expression, respectively (Lee et al., 2013; Holoch and Moazed, 2015). Recently, Engreitz and Singh clarified effects of lncRNA promoter, transcription factor, variable splicing, and histone modification on gene expression, respectively (Engreitz et al., 2016; Singh et al., 2016). Thomou and Wu analyzed effects of miRNAs and histone modifications on gene expression (Thomou et al., 2017; Wu et al., 2017). Additionally, Duren et al. predicted gene expression based on chromatin accessibility data, cis-acting and trans-acting element data by logistic regression models (Duren et al., 2017). Neumann and Stadhouders analyzed effects of LncRNA and the dynamic interaction of transcription factors with expression of target genes (Neumann et al., 2018; Stadhouders et al., 2018). Many methods were proposed for deciphering regulation mechanisms of cis-regulatory and trans-regulatory elements based on gene expression. Studies showed that gene expression was closely related to Oct4 combination intensity in promoter regions (Machado et al., 2011; Machado, 2017; Yan et al., 2017; Antão et al., 2018). However, the quantitative relationship between gene expression and Oct4 combination intensity was not considered. Therefore, firstly, a generalized linear regression model was proposed for quantifying the relationship of gene expression and Oct4 combination intensity based on eight gene datapoints. Then, testing data were applied to test the generalization ability of the model. On the one hand, experiments of 27 genes, as well as all genes, from GEO were applied to analyze the quantitative relationship between Oct4 combination intensity and target gene expression at each stage of cell development by our proposed model. On the other hand, 27 genes were divided into positive and negative samples by our proposed method.

Materials and Methods

Datasets

Experimental data came from mouse transcriptome data and ChIP-seq data, which were downloaded from GEO database with accession numbers GSE67462 and GSE67520, respectively. In this paper, gene promoter regions were defined as −1.5 kb to +0.5 kb of gene transcription start sites (TSSs). For quantifying the relationship between gene expression and Oct4 combination intensity, while testing the generalization ability of the proposed model, experimental data were divided into training data and test data. Training data were related to genes Btbd8, Cnbp, Cyb5r3, Dars2, Eef1a1, Hist1h2bf, Ptrh2, Zfp143, which were extracted based on the following steps. Step 1. All dynamic Oct4 combination intensity and gene expression data related to genes Btbd8, Cnbp, Cyb5r3, Dars2, Eef1a1, Hist1h2bf, Ptrh2, Zfp143 were extracted from transcriptome and ChIP-seq data (Chen et al., 2016). Oct4 combination intensities were expressed as a series of peaks that contained three characteristics, including height, distance and width, which were defined as the value of the highest point corresponding to the midpoint of the peak (height); distance between the midpoint of the peak and transcription start site (distance); and difference between the right and left boundaries of the peak (width). Step 2. Transcriptome and ChIP-seq data of the above genes from Day 0, Day 1, Day 3, Day 5, Day 7, Day 11, Day 15, and Day 18 were selected for studying the relation between time and gene expression (Chen et al., 2016). Step 3. Promoter regions with the strongest signal were extracted to avoid the influence of redundant data. Testing data were composed of two parts, including data of 27 genes and all genes. Firstly, 27 genes and all genes were applied to analyze quantitative relationship between Oct4 combination intensity and target gene expression at each stage of cell development by our proposed model. Then, 27 genes were divided into high and low expression groups to classify. In detail, 27 genes were obtained by searching for those data that appeared in all eight different cell development stages from GEO. These genes were Alyref2, Atn1, Btbd8, Btg2, Caprin1, Cnbp, Ctgf, Cyb5r3, Dars2, Ddx5, Eef1a1, Fosb, Hes1, Hist1h2bb, Hist1h2bf, Hist1h2bp, Hnrnpa2b1, Kmt2e, Lonp1, Nfe2l2, Pecr, Phldb2, Ptrh2, Setd5, Trappc6b, Tti2, and Zfp143. In the bi-classification experiment, expression values of 27 genes were sorted by descending order. The top 30% of the sorted data were defined as the high expression group, and the lowest 30% were defined as the low expression group. The value of the minimum high expression was the threshold for classification. The numbers of all genes at each stage of cell development are shown in Table 1.
Table 1

Number of genes at each cell development stage.

Cell development stageNumber of genes at eachstage
Day 086
Day 14,062
Day 34,577
Day 54,101
Day 76,261
Day 117,984
Day 158,181
Day 186,485
Number of genes at each cell development stage.

Generalized Linear Regression Model

In Figure 1, relations between height, distance, width, gene expression of Oct4 combination intensity, and time were provided, respectively.
Figure 1

Dynamic change trends for expressions of different genes and their combination intensity. Dynamic change trends of height, distance, width and expression are shown in (A–D), respectively. X axis represents time with 0, 1, 3, 5, 7, 11, 15, and 18 day(s). Y axis represents corresponding values of eight genes.

Dynamic change trends for expressions of different genes and their combination intensity. Dynamic change trends of height, distance, width and expression are shown in (A–D), respectively. X axis represents time with 0, 1, 3, 5, 7, 11, 15, and 18 day(s). Y axis represents corresponding values of eight genes. Figure 1 shows different change trends with time of Oct4 combination intensity in promoter regions and gene expression in the eight proposed genes. Figure 1A illustrates in detail that change trends of height with time were nearly identical in these genes. Similarly, Figure 1C demonstrates that change trends of width with time in these genes were also nearly identical. Figures 1B,D show that change trends of distance and gene expression with time were disorganized. For quantifying the relationship between gene expressions and Oct4 combination intensity, correlations between height, distance, width, time, and gene expressions were analyzed by using their correlation coefficients, which is defined as Equation (1) with two random variables, X and Y. In Equation (1), r(X, Y) represents the correlation coefficient between X and Y, cov(X, Y) represents covariance between X and Y, var(X), and var(Y) represent variance of X and Y, respectively. The correlation coefficients between gene expression and Oct4 combination intensity are shown in Table 2. In addition, correlation coefficients for Oct4 combination intensity and the gene expression, height, distance, width, and time of each gene are provided in Figure 2.
Table 2

Correlation coefficients between gene expression and Oct4 combination intensity for selected genes.

Gene nameA11A12A12A14
Btbd8−0.19831290.10484610.09069120.6349461
Cnbp0.85966104−0.619278240.30350817−0.05763255
Cyb5r3−0.5934541−0.3768864−0.45859910.1329110
Dars20.4509573−0.22170200.21761280.1436256
Eef1a10.30751340.44006680.59097140.4070609
Hist1h2bf0.093211870.254285580.301787390.56638550
Ptrh20.1068043−0.3149051−0.2933016−0.6536589
Zfp1430.5607069−0.30914570.17984870.9387029

A11, A12, A13, and A14 refer to correlation coefficients between gene expression and height, distance, width, and time, respectively. Bold text represents absolute values of correlation coefficients that are > 0.5.

Figure 2

Correlation coefficients for Oct4 combination intensity and the gene expression, height, distance, width, and time of each gene. (A–H) represents the correlation coefficients in genes Btbd8, Cnbp, Cyb5r3, Dars2, Eef1a1, Hist1h2bf, Ptrh2 , Zfp143, respectively.

Correlation coefficients between gene expression and Oct4 combination intensity for selected genes. A11, A12, A13, and A14 refer to correlation coefficients between gene expression and height, distance, width, and time, respectively. Bold text represents absolute values of correlation coefficients that are > 0.5. Correlation coefficients for Oct4 combination intensity and the gene expression, height, distance, width, and time of each gene. (A–H) represents the correlation coefficients in genes Btbd8, Cnbp, Cyb5r3, Dars2, Eef1a1, Hist1h2bf, Ptrh2 , Zfp143, respectively. Table 2 and Figure 2 indicate that the correlation coefficients for gene expression and time were the largest. Correlation coefficients for time and other variables were also strong. However, goodness of fit was low when the predicted model was constructed using height, distance, and width as explanatory variables, and gene expression as explained variable. Due to the strong relationship between time and Oct4 combination intensity, several time-dependent derived combination variables were used as explanatory variables of the proposed model. Firstly, new derived combination variables were obtained by multiplication operations between height, distance, width and a function of time t, including e, log10(t + 1) and t (k = 1,2,3). In this way, a set V = {H × t, H × t2, H × t3, H × e, H × 0.5, H × log10(t + 1), D × t, D × t2, D × t3, D × e, D × 0.5, D × log10(t + 1), W × t, W × t2, W × t3, W × e, W × 0.5, W × log10(t + 1)} was constructed as the set of explanation variables, where H denotes height, D denotes distance and W denotes width. Then, stepwise regression method was used to determine explanatory parameters of the proposed regression model. Finally, six explanatory variables were selected from V, including H × e, D × t, D × t2, D × t3, D × 0.5 and W × log10(t + 1). Therefore, a generalized linear regression model was constructed by using selected explanatory variables, in which gene expression was the explained variable. In this paper, four generalized linear regression models, Models 1–4, were constructed by Equations (2–5). In Equations (2–5), Exp represents the value of gene expression; β1, β2, β3 are regression coefficients, which are calculated by the Least Squares Method (LSM), and LSM is defined as the sum of squares of differences between predicted value and true value; a random disturbance ε is a normal distribution that was applied to represent other factors affecting gene expression except height, distance and width. H × e and W × log10(t + 1) were selected in the final model because they were common items in Equations (2–5). Therefore, a general model of gene expression patterns was obtained by Equation (6), and the correctness of the model will be verified in section Analysis of Factors Affecting Gene Expression at Every Stage of Cell Development. In Equation (6), f(t) represents a function of time t, which was selected from {t, t2, t3, 0.5}; β1, β2, and β3 are regression coefficients calculated by LSM.

Evaluation Criteria of Model Performance

F-test, t-test, and goodness of fit were used to evaluate the performance of linear regression model (Huang and Pan, 2003; Zhou et al., 2003; Xu et al., 2008; Wang and Lee, 2010; Wang et al., 2012). More precisely, F-test was used to test significance of the entire regression model and t-test was used to test significance of regression coefficients in the model. Goodness of fit was used to measure the approximation degree between fitted curve and original data. Meanwhile, , a generation from original coefficient of determination R2, was an adjusted coefficient of determination. It was eliminated the influence of coefficient of determination generated by number of explanatory variables. In this paper, F-test statistic, t-test statistic, adjusted coefficient of determination , original coefficient of determination R2, total sum of squares (TSS), explained sum of squares (ESS), and residuals sum of squares (RSS) are defined as Equations (7–13) (Huang and Pan, 2003; Zhou et al., 2003; Xu et al., 2008; Wang and Lee, 2010; Wang et al., 2012). In Equations (7–13), k is the number of variables; n is the number of samples; and are estimated value and standard deviation of estimated value of regression coefficient; and Y, Ŷ, Ȳ represent true, estimated and mean values of explained variable. Accuracy (Acc), Sensitivity (S), specificity (S), and Mathew correlation coefficient (Mcc) were used to measure the performance of the classification model (Xu et al., 2013; Guo et al., 2014; Awazu, 2016). Which were defined as Equations (14–17). In Equations (14–17), TP represents the number of positive samples that are correctly predicted as positive samples; TN represents the number of negative samples that are correctly predicted as negative samples; FP represents the number of negative samples that are incorrectly predicted as positive samples; and FN represents the number of positive samples that are incorrectly predicted as negative samples (Zhang et al., 2014, 2018; Wang et al., 2017, 2018a,b).

Results and Analysis

Solution Result of Our Proposed Model

Gene expression patterns of the eight selected genes were analyzed by using Models 1–4. More specifically, Model 1 was applied to describe the expression pattern of gene Zfp143, Model 2 was applied to describe the expression pattern of gene Hist1h2bf; Model 3 was applied to describe the expression patterns of genes Dars2 and Eef1a1, and Model4 was applied to describe the expression patterns of genes Btbd8, Cnbp, Cyb5r3, and Ptrh2. Both Model 2 and Model 3 were used to express the expression pattern of gene Eef1a1. Parameter values of the models are shown in Table 3. Parameter values of Model 2 and Model 3 for gene Eef1a1 were shown in Table 4.
Table 3

Parameter values of model for eight genes.

Gene nameRegression coefficient
β1β2β3
Btbd86.690e-108.420e-021.053e-02
Cnbp−3.104e-08−3.539e-01−6.857e-02
Cyb5r33.785e-088.065e-015.937e-01
Dars2−1.853e-073.554e-044.985e-02
Eef1a13.348e-08−1.443e-045.105e-01
Hist1h2bf−7.915e-086.322e-021.346e-01
Ptrh2−3.072e-09−5.625e-01−6.945e-02
Zfp1434.503e-09−1.123e-026.652e-02

Model of expression pattern for gene Eef1a1 is Model 3 in .

Table 4

Parameter values of model for gene Eef1a1.

Gene nameModelRegression coefficient
β1β2β3
Eef1a122.510e-08−2.282e-035.448e-01
Eef1a133.348e-08−1.443e-045.105e-01

The thickened data represent the largest absolute values of weight in regression coefficients of Model 2 and Model 3, respectively.

Parameter values of model for eight genes. Model of expression pattern for gene Eef1a1 is Model 3 in . Parameter values of model for gene Eef1a1. The thickened data represent the largest absolute values of weight in regression coefficients of Model 2 and Model 3, respectively. Table 3 showed that regression coefficients β2 and β3 were large, which indicated that both distance and width had important influences on gene expression. Furthermore, distance had an effect on gene expression in the form of exponential function of time, and width had an effect on gene expression in the form of logarithmic function of time without other factors. Additionally, Table 4 shows that the difference of the regression coefficients between Model 2 and Model 3 were small. In both Model 2 and Model 3, β3 has the largest absolute values in regression coefficients for gene Eef1a1, which indicated that width was a key factor affecting gene expression.

Performance Analysis of Our Proposed Model

Goodness of fit for proposed model was calculated to evaluate the performance of these models. In addition, performance of the models was tested by F-test and t-test. Results of goodness of fit, F-test and t-test are shown in Table 5. Results of goodness of fit, F-test and t-test of gene Eef1a1 are shown in Table 6.
Table 5

Goodness of fit, F-test and t-test for eight genes.

Gene nameR¯2F testT-test
p1p2p3
Btbd80.97740.00031069.65e-050.0193820.093871
Cnbp0.99174.244e-054.75e-052.60e-050.0595
Cyb5r30.89290.0068750.005900.004530.06944
Dars20.72980.042330.07710.08750.0154
Eef1a10.9580.001070.0169340.0046090.000275
Hist1h2bf0.84480.014310.008380.005180.00988
Ptrh20.88560.0078350.01280.01050.0238
Zfp1430.90460.0054660.0089960.0770090.027986

p.

Table 6

Goodness of fit, F-test and t-test for gene Eef1a1.

Gene nameModelF testT-test
p1p2p3
Eef1a120.920.0038580.076150.017370.00125
Eef1a130.9580.001070.0169340.0046090.000275

p.

Goodness of fit, F-test and t-test for eight genes. p. Goodness of fit, F-test and t-test for gene Eef1a1. p. Table 5 demonstrates that goodness of fit reached at least 80% for all genes except Dars2 by using our proposed method. In addition, the p-value of F-test and t-test were < 0.1, which meant that our proposed model was effective with 90% confidence. As shown in Table 6, from Model 3 was larger than Model 2, which means that distance had a greater influence on gene expression than time for gene Eef1a1. As shown in Tables 3–6, absolute values of regression coefficients β2and β3were large in all regression coefficients. Additionally, the absolute value of regression coefficients for β3 was the largest in all regression coefficients with Model 2 and Model 3 for gene Eef1a1. Therefore, width was considered to be the most important factor affecting gene expression, and width had an effect on gene expression in the form of a logarithmic function.

Analysis of Factors Affecting Gene Expression in Whole-Cell Developmental Stage

In this paper, the relationship between gene expression and Oct4 combination intensity in promoter regions at the whole-cell developmental stage was analyzed based on the generalized linear regression model. Experimental results showed that the proposed model was effective for gene expression pattern of all eight selected genes except for Eef1a1. For exploring the effects of each model on the different genes, expression data of selected eight genes and Oct4 combination intensity in promoter regions were substituted into the models. Experimental results are shown in Figure 3.
Figure 3

Effect of different models on gene expression patterns. Models 1–4 are represented from 1 to 4 on the x-axis, respectively. The y-axis represents corresponding goodness of fit.

Effect of different models on gene expression patterns. Models 1–4 are represented from 1 to 4 on the x-axis, respectively. The y-axis represents corresponding goodness of fit. Figure 3 demonstrated that differences in goodness of fit between different models for the same gene were large, which indicated that distance had strong effects on the gene expression of different genes with different levels. Strong correlation between gene expression and D × 0.5, W × log10(t + 1) was found in Table 3, which indicated that distance had an effect on gene expression in the form of an exponential function of time, and width had an effect on gene expression in the form of a logarithmic function of time without other factors. However, goodness of fit from D × 0.5 and W × log10(t + 1) was lower than for the selected six derived combination variables, which indicated that gene expression was promoted by the interaction of height, distance, width, and time.

Analysis of Factors Affecting Gene Expression at Every Stage of Cell Development

Oct4 combination intensity and time had different effects on gene expression in different cell development stages. The goodness of fit obtained by Model 4 was higher than that obtained by Models 1–3 in the prediction of gene expression. Therefore, differences were analyzed based on Model 4 with testing data including 27 genes and all genes. Experimental results are shown in Figures 4, 5.
Figure 4

Effects of Oct4 combination intensity and time on gene expression at different stages of cell development for 27 genes. X-axis represents different stages of cell development after 0, 1, 3, 5, 7, 11, 15, and 18 day(s). Y-axis represents parameter value. Curves in red, green, blue, and black represent values of parameters β1, β2, β3 and ε from the proposed generalized linear regression model, respectively.

Figure 5

Effects of Oct4 combination intensity and time on gene expression at different stages of cell development for all genes from GEO. X-axis represents different stages of cell development after 0, 1, 3, 5, 7, 11, 15, and 18 days. Y axis represents parameter value. Curve in red, green, blue, and black represents values of parameter β1, β2, β3 and ε of the proposed generalized linear regression model, respectively.

Effects of Oct4 combination intensity and time on gene expression at different stages of cell development for 27 genes. X-axis represents different stages of cell development after 0, 1, 3, 5, 7, 11, 15, and 18 day(s). Y-axis represents parameter value. Curves in red, green, blue, and black represent values of parameters β1, β2, β3 and ε from the proposed generalized linear regression model, respectively. Effects of Oct4 combination intensity and time on gene expression at different stages of cell development for all genes from GEO. X-axis represents different stages of cell development after 0, 1, 3, 5, 7, 11, 15, and 18 days. Y axis represents parameter value. Curve in red, green, blue, and black represents values of parameter β1, β2, β3 and ε of the proposed generalized linear regression model, respectively. Figures 4, 5 show that the absolute value of β3 was larger than that of β1, β2, and e. Absolute values of β1and β2 were close to zero except for a few points, which indicated that width influenced gene expression in the form of a logarithmic function of time. However, change trends of β1 and β2 were different for Figures 4, 5. More specifically, the absolute value of β1 obtained by 27 genes decreased with time, and the value of was negative when time was equal to 0; the absolute value of obtained by all genes decreased with time and the value of was positive when time was equal to 0; the value of obtained by 27 genes was positive while value of obtained by all genes was negative due to partially missing data, which was contradictory and indicated that time had an important impact on gene expression. Incorrect conclusions were obtained when data of some certain time were missing. Therefore, Figures 4, 5 showed that width and time had important effects on gene expression. Furthermore, width influenced gene expression in the form of a logarithmic function of time.

Application of Our Proposed Model in Gene Classification

Gene classification experiments were provided to test the generalization ability of the proposed model. Firstly, in order to avoid the influence of random disturbance on experimental results, the data of 27 genes, including height, distance, width and gene expression, were normalized. Then, Models 1–4 were applied to predict gene expression for 27 genes. Finally, the 27 genes were divided into two categories by comparing gene expression with a threshold; meanwhile, 10-fold cross-validation was used to test the model's performance. Comparison results of Models 1–4 showed Model 4 had a high goodness of fit. Therefore, 27 genes were classified by Model 4. Gene groups of high and low expression were defined in an artificial way; meanwhile, threshold setting was random in the classification process. A BP neural network was used to classify positive and negative samples in order to prove that the randomness had little effect on experimental results. In this paper, the hidden layer of the BP neural network was set to one layer, and the number of hidden layer neurons was set to 2. In 10-fold cross-validation, regression coefficients and random disturbance of Model 4 were shown in Table 7. The prediction performance obtained by Model 4 and the BP neural network are shown in Table 8.
Table 7

Parameter values of Model 4 in 10-fold cross-validation.

SerialnumberRegression coefficientRandomdisturbance
β1β2β3ε
1−0.190400.364151.948490.09097
2−0.147770.349971.627910.08391
3−0.192830.368951.341260.11151
4−0.224300.327511.303050.12252
5−0.172210.334491.457650.11442
6−0.203140.357961.384940.11671
7−0.158620.292301.267880.11184
8−0.185970.363631.392710.10367
9−0.204820.366021.370140.11242
10−0.180640.342051.159550.11336

1–10 represents the serial number of 10-fold cross-validation.

Table 8

Prediction performance of different methods using 10-fold cross-validation.

MethodsPerformance evaluation standard
AccSnSpMcc
Model 40.76430.81260.69470.5111
BP neural network0.72380.75850.69230.4537

Bold text represents the maximum value of every performance evaluation criterion.

Parameter values of Model 4 in 10-fold cross-validation. 1–10 represents the serial number of 10-fold cross-validation. Prediction performance of different methods using 10-fold cross-validation. Bold text represents the maximum value of every performance evaluation criterion. Table 8 showed that the Acc, Sn, Sp, and Mcc obtained by Model 4 were the largest of the two different methods. Therefore, randomness of the threshold setting had little effect on experimental results, and our proposed method was effective in predicting gene expression.

Conclusion

Cell reprogramming has been a hot issue in the field of life sciences and has played a significant role in medicine, such as in tissue repair, organ reconstruction, disease pathogenesis, and new drug development. Oct4 has especially played an important regulatory role in the process of cell reprogramming. However, there was no scientific method to quantify the relationship between Oct4 combination intensity and gene expression. Therefore, data from the eight selected typical genes were extracted from mouse transcriptome data and ChIP-seq data for quantifying the relationship between gene expression values and Oct4 combination intensity in promoter regions. Firstly, a generalized linear regression model was constructed based on gene expression with eight different time periods during cell development and Oct4 combination intensity in promoter regions. Then, the relationship between Oct4 combination intensity and gene expression at whole and each stage of cell development was analyzed. Finally, the 27 genes were divided into positive and negative samples based on Model 4 and the BP neural network. Experimental results showed that width of combination influenced gene expression by a logarithmic function of time (day). Additionally, accuracy obtained by the models was 4.05% higher than that obtained by the BP neural network, which indicated that our proposed model was effective in predicting gene expression. Several additional factors, including extent of histone modification, degree of chromatin opening, strength of promoter and binding sites of transcription factors and promoter regions, also affected gene expression. Non-linear relations between gene expression and Oct4 combination intensity were also ignored due to large non-linear relations. Therefore, in the future, multiple factors and non-linear relations should be considered to analyze key factors affecting gene expression.

Author Contributions

SL: design experiment and analyze experiment result; ML: data processing and accomplish experiment; HL: extract and clean data from biological experiment and public database; YZ: provide idea from biological significance.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
  5 in total

1.  Computational Modeling of Gene-Specific Transcriptional Repression, Activation and Chromatin Interactions in Leukemogenesis by LASSO-Regularized Logistic Regression.

Authors:  Nickolas Steinauer; Kevin Zhang; Chun Guo; Jinsong Zhang
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2021-12-08       Impact factor: 3.710

2.  Biomarker Signatures of Quality for Engineering Nasal Chondrocyte-Derived Cartilage.

Authors:  M Adelaide Asnaghi; Laura Power; Andrea Barbero; Martin Haug; Ruth Köppl; David Wendt; Ivan Martin
Journal:  Front Bioeng Biotechnol       Date:  2020-04-07

3.  Chitosan Nanoparticles Inactivate Alfalfa Mosaic Virus Replication and Boost Innate Immunity in Nicotiana glutinosa Plants.

Authors:  Ahmed Abdelkhalek; Sameer H Qari; Mohamed Abd Al-Raheem Abu-Saied; Abdallah Mohamed Khalil; Hosny A Younes; Yasser Nehela; Said I Behiry
Journal:  Plants (Basel)       Date:  2021-12-08

4.  A hybrid approach to enhance the lifespan of WSNs in nuclear power plant monitoring system.

Authors:  Md Ershadul Haque; Tanvir Hossain; Mahidur R Sarker; Manoranjan Paul; Md Samiul Hoque; Salah Uddin; Abdulla Al Suman; Mohamad Hanif Md Saad; Tanvir Ul Huque
Journal:  Sci Rep       Date:  2022-03-14       Impact factor: 4.379

5.  Analysis of the Stage Performance Effect of Environmental Protection Music and Dance Drama Based on Artificial Intelligence Technology.

Authors:  Li Zeng
Journal:  J Environ Public Health       Date:  2022-09-19
  5 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.