Koji Iwayama1, Yuri Aisaka2, Natsumaro Kutsuna2,3, Atsushi J Nagano4,5,6. 1. Research Institute for Food and Agriculture, Ryukoku University, Otsu, Shiga, Japan. 2. LPixel Inc, Hongo, Bunkyo-ku, Tokyo, Japan. 3. Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa-shi, Chiba, Japan. 4. Department of Plant Life Science, Faculty of Agriculture, Ryukoku University, Otsu, Shiga, Japan. 5. Center for Ecological Research, Kyoto University, Otsu, Shiga, Japan. 6. JST PRESTO, Kawaguchi, Saitama, Japan.
Abstract
MOTIVATION: Considerable attention has been given to the quantification of environmental effects on organisms. In natural conditions, environmental factors are continuously changing in a complex manner. To reveal the effects of such environmental variations on organisms, transcriptome data in field environments have been collected and analyzed. Nagano et al. proposed a model that describes the relationship between transcriptomic variation and environmental conditions and demonstrated the capability to predict transcriptome variation in rice plants. However, the computational cost of parameter optimization has prevented its wide application. RESULTS: : We propose a new statistical model and efficient parameter optimization based on the previous study. We developed and released FIT, an R package that offers functions for parameter optimization and transcriptome prediction. The proposed method achieves comparable or better prediction performance within a shorter computational time than the previous method. The package will facilitate the study of the environmental effects on transcriptomic variation in field conditions. AVAILABILITY AND IMPLEMENTATION: Freely available from CRAN ( https://cran.r-project.org/web/packages/FIT/ ). CONTACT: : anagano@agr.ryukoku.ac.jp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Considerable attention has been given to the quantification of environmental effects on organisms. In natural conditions, environmental factors are continuously changing in a complex manner. To reveal the effects of such environmental variations on organisms, transcriptome data in field environments have been collected and analyzed. Nagano et al. proposed a model that describes the relationship between transcriptomic variation and environmental conditions and demonstrated the capability to predict transcriptome variation in rice plants. However, the computational cost of parameter optimization has prevented its wide application. RESULTS: : We propose a new statistical model and efficient parameter optimization based on the previous study. We developed and released FIT, an R package that offers functions for parameter optimization and transcriptome prediction. The proposed method achieves comparable or better prediction performance within a shorter computational time than the previous method. The package will facilitate the study of the environmental effects on transcriptomic variation in field conditions. AVAILABILITY AND IMPLEMENTATION: Freely available from CRAN ( https://cran.r-project.org/web/packages/FIT/ ). CONTACT: : anagano@agr.ryukoku.ac.jp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Variation in environmental factors affects various aspects of organisms. The comprehensive quantification of environmental effects on organisms is an emerging problem. For example, gene-environment interactions have been studied to explain the heritability of complex diseases that could not be explained by conventional genome-wide association studies (Thomas, 2010). In addition, concerns about organisms’ adaptation and response to changes in environmental conditions has been growing because of climate change (Ahuja ; Hoffmann and Sgrò, 2011; Nicotra ; Weston ; Xu, 2016).There is a trade-off between precise control of and minimal intervention to the environment (Jones, 2013). Any results obtained in the field more accurately reflect a plant’s environmental response in natural conditions. However, these results are difficult to interpret, because environmental factors are continuously changing in a complex manner and exhibit diurnal oscillations, seasonal changes, and long-term trends. On the other hand, experiments conducted in controlled conditions provide results that are more precise but not necessarily reflective of the plant’s behavior in natural conditions. For example, photosynthetic responses to fluctuating, environments like natural conditions differ from those to controlled, constant environmental conditions (Yamori, 2016). At the molecular level, traits observed in laboratory conditions are not always consistent with those observed in natural conditions (Malmberg ; Mishra ; Weinig ). Further, even if similar physiological plasticity is observed, the molecular responses of plants to drought stress can vary between the controlled conditions of greenhouses and the uncontrolled conditions of the field (Lovell ).To reveal the effects of such fluctuating environments, transcriptome data in field environments have been collected (Alvarez ). In particular, the transcriptomic variation of plants in fields has been studied (Hayes ; Izawa ; Nagano ; Plessis ; Sato ; Richards ). To bridge the gap between natural and laboratory conditions, Nagano proposed a model to relate the transcriptomic variation of plants in a field to environmental conditions and applied it to large-scale transcriptome data of samples collected in a field. They demonstrated that the model can precisely predict transcriptome variations using meteorological data. However, the vast computational cost required to optimize parameters and select the best model has restricted the use of the model.To accelerate transcriptomic studies in fields, we propose a new statistical model based on that proposed by Nagano . The previous model contains some distinct functions representing diurnal changes in sensitivity and the characteristics of responses to environmental stimuli. The new model reduces the computational cost by unifying these functions. We also propose a cluster-based approach for optimizing parameters, in which we reuse the optimization result of one gene for other genes in the same cluster. This approach significantly reduces the cost of searching for the optimum parameter value. Both the previous and new models are intended for application to microarray data; however, the use of RNA-Seq (Wang ) technology is rapidly increasing. To apply our model to RNA-Seq data, we incorporated precision weights for normalized log-counts into the model as the voom method (Law ). In addition, we developed and released FIT, an R package that provides efficient parameter optimization, model selection, and transcriptome prediction of unsequenced samples. This is the first tool to integrate the transcriptome data from field samples and meteorological data by modeling their relation.
2 Methods
2.1 Previous model
Before explaining our new model, we briefly review the previous plant transcriptome model (Nagano ). Let N be the number of samples and be an N-element vector, where each element is the -transformed value of the normalized expressions for gene i in each sample. In the model, is described as
where and are an design matrix and regression coefficients, respectively. The independently and identically distributed noise is drawn from a Gaussian distribution. The design matrix is constructed as
Here, 1 is a vector in which all elements are 1 and , and are values designating the plant’s age, circadian clock, response to environmental stimuli, and genotype, respectively. The element-wise products and are the interactions between the age and the clock and the age and the environmental response, respectively. The plant’s age is a vector of the numbers of days after transplanting scaled to have a range 1 and a mean 0. In the previous study, the authors used two rice cultivars (Nipponbare and Norin 8) as samples. If the genotype of sample j is Norin 8, the j-th element of the genotype is set to 1; otherwise, to 0.Whereas and are independent of the genes, the circadian clock and the response to environmental stimuli are gene specific. Hence, we place a superscript (i), e.g. and . The circadian clock on gene i in sample j is described by a cosine curve with a period and the gene specific phase . It is defined as
where t is the time when sample j was obtained.The response to environmental stimuli of gene i in sample j takes the form
where g(T) is a gate aperture used to explain time-of-day-specific responses to an environmental stimulus, is the time period during which gene i was affected by an environmental stimulus, w is the value of a meteorological parameter at time T, is the response threshold to the stimulus, and f(x) determines the response to the environmental stimulus. is scaled by factors and as the plant’s age. For f(x), two types of response model are considered: dose-dependent (dd) and dose-independent (di). In the dd model, the response function is defined as
Similarly, the function in the di model is defined as
These two response model types assume that a plant responds to an environmental stimulus over the threshold. Two other types can also be considered, that is, a plant responds to a stimulus under the threshold, such asThere are three types of gate function g(T). The first model is a no-gate model, which takes 1 for any T. In the second model, the gate function is defined as
The third model assumes that the gate fully opens for a specific duration and completely closes at other times, i.e.,
Here, is the opening time of the gate and is the opening length.Parameters are optimized to maximize the likelihood by using the Nelder–Mead algorithm (Nelder and Mead, 1965). Because the likelihood function has multiple local maxima, a grid search is performed before the optimization. Regression coefficients and the phase of the circadian clock are optimized using the nonlinear least-squares method and the likelihood is calculated on each grid point of the remaining parameters. The optimization by the Nelder–Mead algorithm starts from the parameter values of the grid point with the largest likelihood. The response function to environmental stimuli with the largest likelihood is selected from four response model types (, and ). Because different gate function types have different degrees of freedom, the optimum gate function type is selected by performing likelihood-ratio tests. It can be considered that some variables in Equation (1) may contribute to neither the explanation nor the prediction for the expression patterns of many genes. Thus, a parameter reduction process that repeats parameter optimizations and likelihood-ratio tests is performed to obtain the simplest models.
2.2 New model
Although the previous model achieved a detailed description of transcriptome fluctuations in complex environments, the parameter optimization for all genes is computationally expensive. The computational cost of the optimization is due mainly to the existence of different model types, i.e., different types of response models and gate functions, and variable selection. We need to optimize and compare each model repeatedly. Therefore, to reduce computational cost, we unified the different model types.To unify the dd and di responses to environmental stimuli, we redefined the response model, when a plant responds to an environmental stimulus over the threshold, as
As γ approaches minus infinity, this function approaches the dd response function. Conversely, this function approaches the di response function in the limit of (Fig. 1(a)). Here, ensures that the outputs of this function have almost constant scales regardless of the value of γ. When a plant responds to a stimulus under the threshold, the dd and di responses are unified as
Fig. 1
Response functions (a) and gate functions (b) for various parameter values
Response functions (a) and gate functions (b) for various parameter valuesWe also defined the new gate function as
where γ controls the shape of the gate as γ in Equations (10) and (11), and θ determines the opening length. When the value of θ is smaller than –1, this function corresponds to the no-gate model in the limit . The gate function in Equation (8) is approximated by this function with a smaller value of γ and that in Equation (9) is approximated with a larger value of γ [Fig. 1(b)].To eliminate the need to optimize the phase of the circadian clock by the nonlinear least-squares method, we represent the circadian clock by the weighted sum of sine and cosine curves instead of one cosine curve (Equation 3). The weights of these sine and cosine curves are selected by linear regression together with other regression coefficients. Let and be the regression coefficients of the sine and cosine components of the clock. Using these coefficients, the clock phase is obtained by
Similarly, the interactions between the plant’s age and the circadian clock are also represented as the weighted sum of the interactions between the age and cosine and sine curves, and its phase is obtained by these regression coefficients.As in the previous model, the parameters of the new model are optimized through two steps: the determination of the initial value by a grid search, and nonlinear optimization by the Nelder–Mead algorithm. In the first step, we optimize the parameters related to environmental responses, which are , γ, γ, and θ, and the response function ( or ) by a grid search. The remaining parameter is optimized by linear regression for each setting and the setting that achieves the lowest mean squared error is searched. In the second step, parameters other than are optimized by the Nelder–Mead algorithm. After the optimization of the environmental response parameters, we select variables simultaneously with the optimization of the regression coefficients using an L1 regularization instead of repeating log-likelihood tests. To select sine and cosine curves representing the circadian clock (and the interaction between the age and the circadian clock) into or out of a model together, we used group lasso (Yuan and Lin, 2006) rather than lasso (Tibshirani, 1996). By using group lasso, sparse (i.e. a small number of non-zero) coefficients can be obtained. Let be the predicted value of the expression of gene i in sample j. The cost function to be minimized is defined as
Here, is the index set of the regression coefficients of the plant’s age, the response to environmental stimuli, the interaction between the age and the response, and the genotype. In this equation, the first term is the sum of the squared residuals and the second term is the regularization term. With a larger value of λ, we obtain fewer non-zero coefficients. To achieve consistent variable selection, we use adaptive weights ζ for penalizing different covariates as in the adaptive lasso (Zou, 2006) or the adaptive group lasso (Wang and Leng, 2008). Let denote the coefficients obtained by ordinary least squares regression. The adaptive weights are defined as , , and . Because the responses to environmental stimuli contain seven parameters, the adaptive weights for the response and the interaction between the response and the age are multiplied by seven.The optimum value of λ is selected by cross validation (CV) for each gene. We chose the largest value of λ for which the CV error was smaller than the sum of the minimum CV error and the standard error of CV errors. Because CV incurs high computational costs, we used the approximation of leave-one-out CV errors (Obuchi and Kabashima, 2016). A leave-one-out CV error and its standard error is approximated by the mean and the standard deviation of
over all samples j. Here, is the matrix given by
where and denote the submatrix of corresponding to the set of covariates with non-zero regression coefficients, and its j-th row. Although this approximation is based on the assumption that the numbers of samples and covariates are sufficiently large, we confirmed that it is sufficiently accurate even for our model, which contains only several covariates.
2.3 Cluster-based optimization for computational time reduction
For further computational reduction, we omit the grid search (which is the most time-consuming step in the optimization) for almost all genes. Similar models would be optimum for genes exhibiting similar expression patterns. Hence, it is expected that the optimum model for one representative gene of a cluster can be used as the initial value of the Nelder–Mead optimization for genes, the expression pattern of which is similar to the pattern of the representative gene of the cluster.Before optimization, we perform clustering of expression patterns using the affinity-propagation method (Frey and Dueck, 2007), which automatically provides an appropriate number of clusters and their exemplars. For each cluster, we optimize the parameter values for the exemplar of the cluster using the procedure mentioned above, and use the optimized value as the initial value for the Nelder–Mead method for other genes in the cluster.
2.4 Application to RNA-seq data
Recently, RNA-Seq (Wang ) has become a widely used technology to quantify transcriptomes. While we employed the lognormal distribution for microarray data, RNA-Seq data are discrete in nature and modeled by a negative binomial distribution (Robinson ). We associated a precision weight with each individual normalized observation as in the voom method (Law ), which allows us to apply methods developed for microarray data to RNA-Seq data. Precision weights are estimated based on the mean-variance relationship of the data according to the following procedure. The log-count per million (log-cpm) value for each read count is defined as
where denotes the read count of gene i for RNA sample j and is the total number of reads for sample j. For each gene, a model is fitted to the log-cpm value and its residual standard deviation is obtained. In this study, we smoothed the time series of the log-cpm values and calculated residual standard deviations. Let denote the average log-count value defined as
Here, and are the average log-cpm value and the geometric mean of the total number of reads plus one, respectively. Similarly, each smoothed log-cpm value of gene g for sample i is converted to the smoothed log-count . Fitting the LOWESS curve (Cleveland, 1979) to square-root standard deviations as a function of mean log-counts , a piecewise linear function is yielded. Then, the predicted square-root standard deviation of is and the voom precision weight is defined as
Precision weights are incorporated into the model by replacing the first term in Equation (14) with the weighted sum of squared residuals.
3 Datasets
3.1 Meteorological data
We used meteorological data measured every at a meteorological station (Tsukuba (Tateno), attitude ) of the Japan Meteorological Agency. The data consists of wind intensity = (), air temperature (), relative humidity (%), atmospheric pressure (), precipitation (), and global radiation ().
3.2 Synthetic gene expression data
In order to confirm that correct models can be selected by the proposed method, we synthetically generated RNA-Seq data assuming the following situation. Rice plants were transplanted into a paddy field on June 1. Samples were collected every week from June 12 to September 18, 2008 for . In order to verify influence of sampling design on model selection and parameter optimization, five types of sampling were considered: one sample at each time at intervals of , two samples at each time at intervals of , three samples at each time at intervals of , four samples at each time at intervals of , and six samples at each time at intervals of . The total number of samples for each type was 180. For evaluation of the predictive capability, we also computed gene expressions of samples assumed to be collected every week from June 12 to September 18, 2009 for at intervals of .The read count of gene i for RNA sample j was generated from a negative binomial distribution:
where and are the mean and the dispersion parameter, respectively. The mean values varied according to the circadian clock, the response to environmental stimuli, and the plant’s age at the times of sampling, according to the 31 model equations shown in Supplementary Table S1. In these equations, the logarithm of the average expression value for each gene α followed a normal distribution with an average of 5 and a standard deviation of 1.To decide dispersion parameters, we estimated the mean-dispersion trend from the Pickrell real RNA-Seq data set (Pickrell ), which is available from the tweeDESeqCountData R package (http://www.creal.cat/jrgonzalez/software.htm), using the method implemented in the edgeR package (Zhou ). From the estimated trend, we decided the value of the dispersion parameter corresponding to the average expression value for each gene α. Because we use the log-count per million values for analysis, 9969 constantly expressed genes () were also considered in order to suppress the influence of the variation of the total read counts. The dispersion parameters of these constantly expressed genes were also decided according to the estimated mean-dispersion trend.
3.3 Real gene expression data
We also analyzed the same data as Nagano , which consisted of microarray data from 461 samples of mature leaves of rice plants in a paddy field at Tsukuba collected in 2008, and 108 samples collected in 2009. We used data from the samples collected in 2008 for model selection and parameter fitting, and those from 2009 for evaluation of the model’s predictive capability. The samples collected in 2008 can be categorized into six groups: of samples collected at intervals of starting at 7:00 am on August 12, 2008 (8 samples at each time × 13 time points samples); of samples collected at intervals of starting at 10:00 am on June 5, June 19, July 3, July 17, August 7, August 14, August 21, August 28, and September 11, 2008 (225 samples); samples collected at noon every other week from June 3 to September 23, 2008 (three samples at each time × 17 time points samples); samples collected at 12:00 am every other week from June 4 to September 24, 2008 (2 samples at each time × 17 time points samples); samples collected from 5:00 pm to 8:00 pm at intervals of on August 7, 2008 (19 samples); samples collected from 3:50 am to 6:00 am at intervals of on August 8, 2008 (two samples at each time × 14 time points samples). The samples collected in 2009 can be categorized into six groups: of samples collected at intervals of starting at noon on August 10, 2009 (two samples at each time × 9 time points samples); of samples collected at intervals of starting at 7:00 am on August 24, 2009 (6 samples at each time × 13 time points samples); two samples collected at noon on August 31, 2009; two samples collected at 6:00 pm on August 31, 2009; four samples collected at 11:00 am on October 8, 2009; four samples collected at 11:00 am on October 9, 2009. This information is also summarized in Supplementary Tables S2 and S3. We extracted 17 616 genes having log2-transformed signals larger than 5 in 80% of the samples from 2008.To validate the proposed model’s applicability to RNA-Seq data, we generated pseudo RNA-Seq data from the same microarray data. The pseudo RNA-Seq data of sample j were sampled from the multinomial distribution:
Here, is the total number of reads.
4 Results
4.1 Synthetic gene expression data
We used synthetic RNA-Seq data collected in 2008 for parameter optimization and model selection; we used synthetic RNA-Seq data collected in 2009 to evaluate the prediction capabilities of the optimized model. Mean squared errors (MSE) and correlation coefficients between the predicted and synthetic gene expressions are plotted in Supplementary Figures S1 and S2. These figures indicate that sampling at intervals of provides better predictive performance than sampling at longer intervals. It can be considered that sampling at effectively captures diurnal variation of gene expression patterns. Although the predictive performance of sampling at intervals was comparable to that of sampling at intervals, longer intervals are preferable in terms of sampling labor. Hence, only results of sampling at intervals are shown below.The selected models for variably expressed genes are summarized in Supplementary Table S4. This results indicate that the correct models were selected for 29 out of 31 genes. It is important to note that absolute values of coefficients do not always correspond to those of the true models in Supplementary Table S1, because of the normalization and nonlinear transformation of inputs. Further, the constant model, where all coefficients are zero, was correctly selected for 8441 out of 9969 constantly expressed genes. These results indicate that the variable selection through L1 regularization worked well.We also confirmed that the optimized parameter values were consistent with those of the true models for variably expressed genes. The polar plot of the optimized coefficients of circadian clocks is shown in Supplementary Figure S3. The phases of circadian clocks clearly correspond to those of the true models. The optimized memory time periods were sharply distributed around the values, which were very close to the true values (Supplementary Fig. S4). To confirm whether the optimized parameter values of responses to environmental stimuli were consistent with those of the true models, we computed the responses as follows:
Here, T denotes temperature, and indicates the type of response, that is, if a plant responds to temperatures over the threshold , otherwise, . The optimized responses of variably expressed genes are plotted in Supplementary Figure S5. The optimized models of dose-independent genes tend to exhibit step function-like responses compared to those of dose-dependent genes. Although some dose-independent genes with positive (negative) coefficients of responses respond to temperatures over (under) the threshold contrary to the true models, they can match the response functions by adding (subtracting) certain constant values.
4.2 Real gene expression data
We optimized the parameters of the model using the microarray data of samples collected in 2008 by normal optimization, in which we performed both the grid search and the nonlinear optimization for all genes, and cluster-based optimization (section 2.3). We also applied cluster-based optimization to pseudo RNA-Seq data generated from the same microarray data. We performed all optimizations on an Amazon EC2 m3.medium instance (one Intel Xeon E5-2670 v2 processor, memory). The normal optimization took per gene on average. If we performed the normal optimization using a single m3.medium instance, we could optimize parameters for 17 616 genes within 17 days. The affinity-propagation method yielded 500 clusters and their exemplars. After normal optimization for these exemplars, we performed nonlinear optimization for other genes using the Nelder–Mead method. The nonlinear optimization took per gene. We could optimize parameters for 17 616 genes within 5 days by using a single instance. In the previous study (Nagano ), the parameter optimization took about 30 days using a high-performance cluster computer. Even after taking into consideration advances in computer technology, the new model improved computational efficiency.The fitness of the new model was compared to that of the previous model by coefficients of determination R2, defined as
for microarray data and
for pseudo RNA-Seq data. Here, and are the predictions of models for microarray and pseudo RNA-Seq data, and and denote the average log2-transformed signals and pseudo log-rpm values, respectively. Figure 2 shows the R2 values. The R2 values of three different optimizations were comparable to those of the previous model.
Fig. 2
Comparison of coefficients of determination (R2). Distributions of R2 from the previous model are compared to those of the new model with normal optimization (a: "normal"), cluster-based optimization (b: "cluster"), and cluster-based optimization with pseudo RNA-Seq data (c: "RNA-Seq"). (d) Boxplot of R2 values
Comparison of coefficients of determination (R2). Distributions of R2 from the previous model are compared to those of the new model with normal optimization (a: "normal"), cluster-based optimization (b: "cluster"), and cluster-based optimization with pseudo RNA-Seq data (c: "RNA-Seq"). (d) Boxplot of R2 valuesTo assess the new model’s capability to predict gene expression, we predicted the microarray data or pseudo RNA-Seq data of samples collected in 2009. We compared gene-wise mean squared errors across samples. The mean squared errors of the new model with normal optimization, cluster-based optimization, and cluster-based optimization using pseudo RNA-Seq data were smaller than those of the previous model for 12 125, 12 209 and 9305 out of 17 616 genes, respectively. Hence, the new model yields better predictions than the previous model, regardless of whether the parameters are optimized by the normal or the cluster-based methods and whether the data are measured using a microarray or RNA-Seq. We also calculated gene-wise correlation across samples and sample-wise correlation across genes between prediction and observation. The gene-wise correlation is illustrated in Figure 3. The distributions of the correlation coefficients of the new model were comparable to that of the previous model, regardless of the methods of measurement and parameter optimization. Further, the sample-wise correlations of all three tests were higher than 0.9 for most samples and improved in comparison to those of the previous model (Fig. 4).
Fig. 3
Comparison of gene-wise correlation across samples. Each panel is illustrated as in Fig. 2. The null model, in which the expression is the constant, was selected for some genes in the previous study. Because correlation coefficients for such genes cannot be defined, the total number of genes contained in the plots of the previous study (dashed) is smaller than that of the new model (solid)
Fig. 4
Comparison of sample-wise correlation across genes. Each panel is illustrated as in Figure 2
Comparison of gene-wise correlation across samples. Each panel is illustrated as in Fig. 2. The null model, in which the expression is the constant, was selected for some genes in the previous study. Because correlation coefficients for such genes cannot be defined, the total number of genes contained in the plots of the previous study (dashed) is smaller than that of the new model (solid)Comparison of sample-wise correlation across genes. Each panel is illustrated as in Figure 2Next, we compared the models selected by the normal optimization to those obtained in the previous study (Nagano ). Table 1 shows the numbers of genes for which each environmental factor was selected in the previous study and the normal optimization of the new model. Each row and column shows the number of genes for which the corresponding environmental factor was selected in the previous study and the normal optimization of the new model, respectively. The bottom row represents the model without the environmental effects. In the normal optimization, if the weight for the environmental effects is zero, the gene is counted as "new none." The bold-faced numbers in the diagonal cells of the table indicate the numbers of genes for which the same environmental factor was selected in the previous study and the normal optimization of the new model. The same environmental factor was selected in both models for 4523 (sum of the first six diagonal cells) out of 7496 genes, for which the selected models contain the response to any environmental factors in both the previous and new studies. Meanwhile, in the previous study, the model without the environmental effects was selected for 10 120 genes (sum of the bottom row); it was selected by the normal optimization of the new model for only 5714 genes (sum of the rightmost column). The previous model tends to select the model without the environmental effects about twice as frequently as the new model. This difference is probably due to differences in the model selection, i.e., likelihood ratio tests and group lasso.
Table 1
Comparison of selected models
New wind
New temperature
New humidity
New atmosphere
New radiation
New precipitation
New none
Previous wind
56
20
8
3
7
4
113
Previous temperature
36
3762
99
68
92
15
877
Previous humidity
10
49
340
7
27
14
237
Previous atmosphere
6
18
6
38
7
6
78
Previous radiation
26
238
91
49
309
20
647
Previous precipitation
4
8
13
5
10
18
55
Previous none
233
4434
641
343
643
119
3707
Comparison of selected modelsWe investigated the associations between gene annotations and time-of-day characteristics of expressions to assess the consistency between model parameters and biological knowledge. For each gene ontology term with which more than 10 genes are annotated, we compared the distribution of clock phases of genes annotated with the term to its background distribution, which is that of genes not annotated with the term. The left-hand panel in Figure 5 indicates that the distribution of of genes annotated with protein serine/threonine kinase activity (GO:0004674) was significantly different from its background distribution (, Watson–Wheeler test, Bonferroni correction). The values of of many genes annotated with the term were distributed from before to after midnight. Because there were only four genes annotated with flavonoid biosynthetic process (GO:0009813), we did not perform the statistical test. However, the distribution of of those genes was clearly concentrated in the early morning (right-hand panel in Fig. 5). These results are consistent with those of the previous study and a laboratory study of Arabidopsis in which the genes implicated in phenylpropanoid biosynthesis showed expressional peaks before subjective dawn (Harmer ). Figure 6 shows the distributions of of genes annotated with rRNA processing (GO:0006364), small ribosomal subunit (GO:0015935), ribosome (GO:0005840), large ribosomal subunit (GO:0015934), and aminoacyl-tRNA ligase activity (GO:0004812). Although the distribution of genes annotated with rRNA processing (GO:0006364) was not significantly different from its background distribution (P = 0.0659; Watson–Wheeler test, Bonferroni correction), the values of seem to concentrate in the afternoon. Other distributions (GO:0015935, GO:0005840, GO:0015934, and GO:0004812) were significantly different from their background distributions (, P = 0.0168, and , respectively) and the peaks of the distributions were shifted from afternoon to evening in the same order as their biological order in rRNA processing. These results are also consistent with the previous results, which implies that the entrained circadian clock in the field controls the order of the acceleration of translation during the same time period.
Fig. 5
Distributions of clock phases of genes annotated with protein serine/threonine kinase activity (GO:0004674) and flavonoid biosynthetic process (GO:0009813). Fractions of genes annotated with the term and not annotated with the term are shown in red and gray, respectively
Fig. 6
Distributions of clock phases of genes annotated with GO:0006364 (rRNA processing), GO:0015935 (small ribosomal subunit), GO:0005840 (ribosome), GO:0015934 (large ribosomal subunit), and GO:0004812 (aminoacyl-tRNA ligase activity), shown as in Figure 5
Distributions of clock phases of genes annotated with protein serine/threonine kinase activity (GO:0004674) and flavonoid biosynthetic process (GO:0009813). Fractions of genes annotated with the term and not annotated with the term are shown in red and gray, respectivelyDistributions of clock phases of genes annotated with GO:0006364 (rRNA processing), GO:0015935 (small ribosomal subunit), GO:0005840 (ribosome), GO:0015934 (large ribosomal subunit), and GO:0004812 (aminoacyl-tRNA ligase activity), shown as in Figure 5We can also use model parameters to form a hypothesis about biological processes occurring in a field. The associations between gene annotations and regression coefficients were also investigated. Before the investigation, we divided the regression coefficients for each gene by the standard deviation of the observed expression of the gene to normalize coefficients. The absolute values of the normalized coefficients of genotype for genes annotated with photosynthesis (GO:0015979), thylakoid (GO:0009579), and photosysthetic membrane (GO:0034357) were significantly larger than the background distributions of those values (, and , respectively, Wilcoxon rank sum test, Bonferroni correction). This result implies that the difference in genotype affects photosynthesis. Further, the absolute values of normalized coefficients of the response for genes, for which temperature was selected as an environmental factor, indicated a significant association with phosphorus metabolic process (GO:0006793) (, Wilcoxon rank sum test, Bonferroni correction). This association suggests that phosphorum metabolism may be affected by fluctuations in temperature.
5 Conclusion
In this paper, we proposed a new gene expression model for organisms in a field, based on a previous model (Nagano ). The new model vastly reduces the computational time required for parameter optimization and model selection by unifying various types of gate functions and response functions, and introducing group lasso (Yuan and Lin, 2006). By applying the model to synthetically generated RNA-Seq data, we confirmed that the optimized model was consistent with the true gene expression dynamics for most genes. Further, to assess the capability of the new model, we applied it to the same ric plant data that were used in the previous study. The new model offered a comparable or slightly better prediction for most genes.The model parameters were consistent with those of the previous study and the biological knowledge. This consistency indicates the model’s capability to provide biological insights. In fact, the investigation of model parameters found associations between genotypes and photosynthesis and between the response to temperature and phosphorum metabolism, which were not discovered in molecular biological studies. We can form a hypothesis based on such associations; it will be validated by further experimental studies.Whereas the previous model was targeted only for microarray data, the new model is applicable to RNA-Seq data. The results of applying the new model to synthetic RNA-Seq data assuming known true models and pseudo RNA-Seq data generated from real microarray data indicated that the model may be useful for the analysis of RNA-Seq data. However, it should be noted that, in this study, the applicability was verified with only simulated data rather than real data. Further verification of performance with real data is required.In this study, we focused on the time variation of gene expression and analyzed transcriptome data sampled over time. However, the proposed model is applicable to data collected by other sampling strategies, such as multiple treatments at a single time point, only by preparing meteorological data of sufficient length. Further, although we applied the model to the transcriptome data of rice plants in this study, it is applicable to those of other organisms.The developed package (FIT) offers efficient parameter optimization and model selection. While the parameter optimization and model selection processes for all genes in the previous study required 30 days when a high-performance cluster computer was used, our new package does not incur such a high computational cost. Because the model describes the expression of each gene independently, the parameter optimization and model selection processes can be easily performed in parallel by dividing genes into several groups and performing these processes for each group. For example, it is expected that the parameter optimization and model selection processes can be completed by 10 Amazon EC2 m3.medium instances in half a day. This package will accelerate field transcriptomic studies.
Funding
This work was supported by Precursory Research for Embryonic Science and Technology (PRESTO), Japan Science and Technology Agency (JST) to A.J.N.; Core Research for Evolutional Science and Technology (CREST), JST to A.J.N.; Accelerated Innovation Research Initiative Turning Top Science and Ideas into High-Impact Values (ACCEL), JST to A.J.N.; and KAKENHI [JP16H06171, JP16H01473 to A.J.N.].Conflict of Interest: none declared.Click here for additional data file.
Authors: S L Harmer; J B Hogenesch; M Straume; H S Chang; B Han; T Zhu; X Wang; J A Kreps; S A Kay Journal: Science Date: 2000-12-15 Impact factor: 47.728
Authors: A B Nicotra; O K Atkin; S P Bonser; A M Davidson; E J Finnegan; U Mathesius; P Poot; M D Purugganan; C L Richards; F Valladares; M van Kleunen Journal: Trends Plant Sci Date: 2010-10-21 Impact factor: 18.313
Authors: Kevin R Hayes; Mary Beatty; Xin Meng; Carl R Simmons; Jeffrey E Habben; Olga N Danilevskaya Journal: PLoS One Date: 2010-09-23 Impact factor: 3.240
Authors: Anne Plessis; Christoph Hafemeister; Olivia Wilkins; Zennia Jean Gonzaga; Rachel Sarah Meyer; Inês Pires; Christian Müller; Endang M Septiningsih; Richard Bonneau; Michael Purugganan Journal: Elife Date: 2015-11-26 Impact factor: 8.140