Generic and specific recurrent neural network models: Applications for large and small scale biopharmaceutical upstream processes.

Jens Smiatek^1,2, Christoph Clemens^3, Liliana Montano Herrera^4, Sabine Arnold^4, Bettina Knapp^5, Beate Presser^5, Alexander Jung^2, Thomas Wucherpfennig^4, Erich Bluhmki^5,6.

Abstract

The calculation of temporally varying upstream process outcomes is a challenging task. Over the last years, several parametric, semi-parametric as well as non-parametric approaches were developed to provide reliable estimates for key process parameters. We present generic and product-specific recurrent neural network (RNN) models for the computation and study of growth and metabolite-related upstream process parameters as well as their temporal evolution. Our approach can be used for the control and study of single product-specific large-scale manufacturing runs as well as generic small-scale evaluations for combined processes and products at development stage. The computational results for the product titer as well as various major upstream outcomes in addition to relevant process parameters show a high degree of accuracy when compared to experimental data and, accordingly, a reasonable predictive capability of the RNN models. The calculated values for the root-mean squared errors of prediction are significantly smaller than the experimental standard deviation for the considered process run ensembles, which highlights the broad applicability of our approach. As a specific benefit for platform processes, the generic RNN model is also used to simulate process outcomes for different temperatures in good agreement with experimental results. The high level of accuracy and the straightforward usage of the approach without sophisticated parameterization and recalibration procedures highlight the benefits of the RNN models, which can be regarded as promising alternatives to existing parametric and semi-parametric methods.

Keywords: 37N25; 46N60; 62M10; 92B20; Autocorrelation functions; Generic and specific machine learning models; Principal component analysis; Recurrent neural networks; Simulation of bioreactor processes; Temporal evolution; Upstream processes

Year: 2021 PMID: 34159058 PMCID: PMC8193373 DOI: 10.1016/j.btre.2021.e00640

Source DB: PubMed Journal: Biotechnol Rep (Amst) ISSN: 2215-017X

Introduction

Over the last years, modelling and simulation has become an important field of research for biotherapeutical manufacturing and process development. Due to increasing computational power as well as the improved use of process analytical technologies, novel computational approaches for complex upstream and downstream processes are in the focus of recent interest [1], [2], [3]. While mechanistic kinetic-dispersive models are nowadays considered as standard methods for the study of capturing and polishing steps in downstream operations [4], [5], [6], [7], [8], [9], [10], there exist a plethora of distinct models for upstream processes with certain advantages and shortcomings. The large number of modelling approaches may be related to the importance of correlated molecular mechanisms at distinct length scales as well as the broad variability of biological parameters among living organisms. At the largest length and time scales, active pharmaceutical ingredients (APIs) like monoclonal antibodies (mAbs) are produced by distinct cells in bioreactors whose optimal design is nowadays studied by computational fluid dynamics or Lattice-Boltzmann simulations [11], [12], [13], [14], [15], [16], [17], [18]. At smaller or even molecular scale, one is usually interested in modelling the cell metabolism, which helps to identify optimal feeding strategies as well as improved process protocols for higher product titers and improved product quality [19]. Specifically often used standard Chinese hamster ovary (CHO) cells show a rather complex cell metabolism [20] in combination with diverse post-translational modification profiles [21], such that the understanding of the cell metabolism in terms of high API quality is of fundamental interest. 
In addition to detailed metabolic flux pathway models [22], [23], [24], [25], [26], [27], often also simpler mechanistic models are used to predict the time-dependent concentration profiles from standard cell cultures [28], [29], [30], [31], [32], [33], [34], [35]. The mathematical framework is represented by coupled partial differential equations which may also include the temperature as well as pH values in order to provide a more detailed representation of experimental conditions. Although most models show an overall good agreement with the experimental results, certain systematic deviations are often evident, which can be attributed to an incomplete knowledge of the cell metabolism as well as the use of oversimplified pseudo first-order and Monod reaction kinetics [36]. As a specific example, complex and varying feed strategies in terms of bolus addition are often not reliably reproduced [36]. Thus, certain deviations from experimental outcomes as well as the neglected or simplified influence of intrinsic parameters like temperatures or pH values for mechanistic growth models become evident. Recently, so-called hybrid models were introduced in order to improve process simulations [37], [38], [36], [39], [40], [41], [42]. In combination with a mechanistic framework, experimental data are used to derive time-dependent rate constants in combination with relevant process parameters like the temperature and the pH value as well as feeding rates in terms of an artificial neural network approach or other advanced regression techniques [36], [37], [38], [43], [44], [45], [3], [46], [42], [41]. 
Despite slight differences between the approaches, a hybrid model usually extracts the temporally varying rate ν(t) for a biomass-related parameter x or for the product titer p from an ANN approach, which is then introduced according to

$$\frac{dx(t)}{dt} = \nu_x(t)\,x(t) - D(t)\,x(t) \tag{1}$$

and

$$\frac{dp(t)}{dt} = \theta(t)\,\nu_p(t)\,x(t) - D(t)\,p(t) \tag{2}$$

where θ(t) is the Heaviside function, which can be either 0 or 1, dependent on the presence or absence of induction, and D(t) is the dilution factor, which contains information about bolus addition or sampling. The corresponding temporal values for p(t) or x(t) are then calculated by standard numerical integration schemes [43]. Hence, hybrid models are able to reproduce the growth and metabolic rates of fed-batch processes [36], [43] in combination with complex feeding strategies. Such a detailed description is not achieved by mechanistic models; however, their benefit for simple predictions of growth parameters even for perfusion processes was recently demonstrated [33], [47]. Notably, the determination of rate constants for mechanistic and hybrid models as well as the parameterization of the approaches is still a challenging task. Moreover, it has to be noted that the corresponding mechanistic framework provides a rather coarse-grained picture when compared to more sophisticated metabolic flux pathway models [26]. Hence, the corresponding insights in terms of Eqs. (1) and (2) into growth, death and production behavior are of limited value for more refined considerations due to simplified descriptions as well as unphysical temporal variations of the rate constants. Although hybrid models can be used as a beneficial tool to complement Design of Experiment studies with regard to an adequate exploration of the design space [44], [41], it has to be noted that experimental work in terms of initial parameter scans remains essential. 
Thus, given the limitations with regard to the complex parameterization procedure in combination with the rather limited insights, it can be assumed that straightforward non-parametric machine learning approaches provide comparable outcomes with less effort. Moreover, such data-driven methods circumvent the consideration of temporal variations for the rate constants, which is thus in agreement with quasi-equilibrium thermodynamics. Over the last years, considerable effort has been devoted to the development of neural networks and other advanced regression algorithms [48], [49], [50] and their application for bioprocess control and prediction [51], [52], [1], [53], [54], [55]. Frequently used approaches are artificial neural networks (ANNs), which can be regarded as high-dimensional regression methods for connecting input parameters to target variables [56], [48]. ANNs are nowadays widely used in the natural sciences, with applications ranging from the calculation of molecular properties and the prediction of chemical reactions to drug screening [57], [58], [59], [60], [61], [62], [63], [64]. Although ANNs are well suited to connect static features, they are often limited for data showing temporal evolution. Promising candidates in this regard are multivariate recurrent neural network (RNN) models [65], [66], [67], [55], whose benefits for the calculation and simulation of temporal process data in various contexts were recently described [51], [55]. In this article, we present specific and generic RNN models for the simulation of multivariate large- and small-scale upstream processes. Our approach can be used for the control and study of single product-specific large-scale manufacturing runs (specific RNN model) as well as small-scale evaluations in terms of combined processes for distinct products at development stage (generic RNN model). All RNN calculations rely on experimental data with broad variability. 
Certain variations at well-defined time points can be attributed to differences in process conditions as well as biological factors. Despite these challenges, our results show only small deviations between calculated and experimental values, which are significantly smaller than the ensemble experimental standard deviation. The main advantages of our method are the straightforward implementation without complex parameterization procedures in combination with a high predictive accuracy. In contrast to hybrid or mechanistic models, the proposed RNN approach can be used without further approximations, pre-defined boundary conditions or knowledge about the underlying metabolic connections. Moreover, the questionable introduction of temporally varying rate constants is avoided. Without further adaptation, fully automated and pre-trained RNN models can also be used by non-experts, which promotes their usage for the calculation and simulation of modern biotherapeutical manufacturing and development processes in real time. The results for the platform-dependent generic RNN model approach underpin such assumptions. The article is organized as follows. In the next section, we provide a short introduction to the theoretical background of RNNs. Details about the numerical implementation and the data sets are presented in Section 3. All numerical results are shown in Section 4. We conclude and summarize in the last section.

Theoretical background: recurrent neural networks

Over the last years, recurrent neural networks (RNNs) have attracted interest as promising approaches to process and evaluate large amounts of temporal sequences [67]. Typical applications of RNNs include speech recognition [68], [67] as well as weather, climate and finance forecasting [69], [70], [71]. In principle, RNNs can be regarded as a modified version of standard feed-forward ANNs [56], [48], [64]. The basic network structure is represented by one input layer, one or multiple hidden layers and one output layer with a varying number of nodes in each layer. In contrast to feed-forward ANNs, direct connections between two successive layers of nodes are implemented as recurrent loops. Hence, the RNN is able to process temporal sequences and to predict the evolution of outcomes. The basic algorithm of an RNN [67], [68] includes the consideration of an input interval $\mathbf{x} = (x_1, \ldots, x_T)$ of length T as fed into the nodes of the input layer, the hidden vector $\mathbf{h} = (h_1, \ldots, h_T)$ as calculated in the hidden layers and the final output vector $\mathbf{y} = (y_1, \ldots, y_T)$, where bold letters denote vectors. The following iterative algorithm connects the input sequence to the elements of the hidden vector

$$h_t = \mathcal{H}\left(\omega_{ih}\,x_t + \omega_{hh}\,h_{t-1} + b_h\right) \tag{3}$$

and hence also to the output vector

$$y_t = \omega_{ho}\,h_t + b_o \tag{4}$$

respectively, with sequence or time points t = 1, …, T, biases $b_j$ and the weights $\omega_{jl}$, where the indices j, l ∈ {i, h, o} denote the corresponding input (i), hidden (h) and output (o) layers. The function $\mathcal{H}$ represents a standard hidden layer activation function like in ANNs, which is typically a logistic, hyperbolic tangent or sigmoidal function with a smooth differentiable form [56]. A scheme of a standard RNN is shown in Fig. 1.
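The hidden-state update and output read-out described above can be illustrated with a few lines of NumPy. This is only a minimal sketch of a plain (non-LSTM) RNN forward pass: the function name `rnn_forward`, the zero initial hidden state, the tanh activation and the toy layer sizes are assumptions chosen for illustration and are not taken from the article.

```python
import numpy as np

def rnn_forward(x_seq, w_ih, w_hh, w_ho, b_h, b_o):
    """Run a single-hidden-layer RNN over a sequence.

    x_seq has shape (T, n_in); the returned outputs have shape (T, n_out).
    """
    h = np.zeros(w_hh.shape[0])  # assumed initial hidden state h_0 = 0
    outputs = []
    for x_t in x_seq:
        # hidden update: activation(w_ih x_t + w_hh h_{t-1} + b_h)
        h = np.tanh(w_ih @ x_t + w_hh @ h + b_h)
        # linear read-out: w_ho h_t + b_o
        outputs.append(w_ho @ h + b_o)
    return np.array(outputs)

# Toy sizes and random weights (hypothetical, not the article's settings)
rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 9, 6, 4, 6
x_seq = rng.normal(size=(T, n_in))
y_seq = rnn_forward(
    x_seq,
    w_ih=rng.normal(scale=0.1, size=(n_hid, n_in)),
    w_hh=rng.normal(scale=0.1, size=(n_hid, n_hid)),
    w_ho=rng.normal(scale=0.1, size=(n_out, n_hid)),
    b_h=np.zeros(n_hid),
    b_o=np.zeros(n_out),
)
print(y_seq.shape)
```

The loop carries `h` from one sequence point to the next, which is the recurrent connection that distinguishes the RNN from a feed-forward network.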

Fig. 1

Scheme of a recurrent neural network with one hidden layer. The green squares denote the input layer, the black circles the hidden layer and the blue diamonds the output layer. All arrows denote data flow and calculations in the corresponding direction. A compact structure of the RNN is shown on the left side with the recurrent loop. The temporal unfolding of the recurrent loop and the hidden layer shows the network structures on the right side. It can be seen that the individual representations focus on distinct time or sequence points t − 1, t, t + 1 as represented by the connections between the input and output sequence points. The recurrent loop is implemented as a connection between the nodes $h_{t-1}$, $h_t$, …, $h_T$ such that $h_{t-1}$ and $h_t$, respectively, are communicated (as denoted by the black square with C) to $h_t$ and $h_{t+1}$. The dots on the right side mark the remaining and not shown unfolded connections with final sequence calculations for $x_T$, $h_T$ and $y_T$. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Notably, a significant improvement for the stability and accuracy of RNNs was the introduction of the long short-term memory (LSTM) approach [65], which allows the consideration of long times within sequences. The recent interest in modern RNN architectures can also be attributed to the development of advanced training algorithms [72]. A reliable and highly efficient training algorithm is of fundamental importance for all iterative multivariate regression approaches. For RNNs, a so-called backpropagation through time (BPTT) method [73] is often used, which requires a temporal unfolding of the network in accordance with Fig. 1. We refer the reader to the supplementary material for more details on the LSTM approach, stacked hidden layers and advanced training algorithms.

Numerical details

In this section, we present the features and the characteristics of the corresponding process experimental data sets. Moreover, we discuss the numerical details of the specific and generic RNN models.

Data sets

Large scale process runs

The large scale data set for the specific RNN model considers individual manufacturing runs for a single API with values for the titer, total cell density (TCD), viable cell density (VCD), viability, glucose and lactate concentration at distinct time points. Non-considered process parameters like seeding cell densities, time points for bolus addition or feeds as well as set points for temperatures were identical for the runs. The data set included 118 process runs with 9 measurements each at different time points with roughly comparable time intervals of 24 h. All large scale data correspond to a validated process that is executed in a 12,000 L bioreactor. The measurements were performed by a standard set of analytical methods to determine the product, cell or metabolite concentration in samples taken from cell suspension. The raw process data is shown in the supplementary material.

Small scale process runs

The ensemble process data set for the generic RNN model combines the runs of four individual mAb production processes at development scale. All processes were subject to the same platform procedure including identical growth and feed media as well as CHO clone cells. The data set included 90 processes in total with 16 runs for mAb A, 25 runs for mAb B, 25 runs for mAb C and 24 runs for mAb D including values for the product titer, TCD, VCD, viability, glucose and lactate concentration, actual pH value, bioreactor volume and cultivation temperature. All parameters were systematically and consistently varied between the runs. The individual runs included 15 measurements from initial start time with comparable time intervals of roughly 24 h for each mAb production process. Values for the product titer were only measured for the last 6 time points due to nearly negligible values for the previous lag and exponential growth phase. The glucose concentration was manually changed for some processes at later process time points due to modified feeding strategies. Despite being platform processes, minor differences between the individual products and processes can be noted for the temperature, seeding cell density, upper pH values, power inputs, medium equilibration times, and gassing rates which vary slightly among the products and the processes. Moreover, the different mAbs were of similar product type, but had slight differences in their genetic sequence and hence expression behavior. In contrast to the large scale data set, further variability can be attributed to the analytical methods, the used equipment and the corresponding calibration procedure in non-good manufacturing practice (non-GMP) and hence non-validated environment. All key parameter raw data for mAbs A, B, C and D are shown in the supplementary material.

Details of the RNNs

Specific RNN model for large scale process runs

All RNN models were programmed in Python 3.7.1 by using the modules Pandas and Numpy. The architecture of the RNN was implemented through the Keras module (version 2.3.1) [74] relying on the TensorFlow backend (version 1.13.1) [75]. Each of the three hidden layers in the RNN was formed by 120 nodes and the first and the second layer (LSTM layers) were made recurrent while the third layer only considers a feed-forward connection to the dense output layer. A hyperbolic tangent (tanh) was chosen as corresponding activation function. The learning rate was set to 0.001. For all input values of the generic and the specific RNN models, we used a robust scaler which removes the median and scales the data according to the interquartile range. The interquartile range is the range between the first quartile (25% quantile) and the third quartile (75% quantile). No further data preprocessing was performed. All calculations for the principal component analysis were performed with a standard scaler, which subtracts the mean value and normalizes by division with the standard deviation. For specific and generic RNN model training, we used the Adaptive Moment Estimation (Adam) optimizer [76] with the mean absolute error (MAE) as loss function. As a standard procedure to avoid overfitting, we added a dropout function [77] with a fraction of 0.1 to the LSTM layers. We considered 118 experimental process runs at large scale and 20 randomly chosen runs were used for validation in terms of a standard training/test splitting procedure [48]. The test data are not included in the training data set. Input and target values were the titer, TCD, VCD, viability as well as glucose and lactate concentrations. The training phase initially included 500 epochs with an early stopping function [78]. As stopping criterion, we used a standard MAE loss function [48] which needs to show a convergent behavior within 20 epochs. 
The RNN batch size for data processing was chosen as 4, which was used for each input vector. Instead of predicting or learning the whole temporal sequence at once, we introduced an interval procedure which uses the two input values from previous time points for the calculation of the output value at the next time point in terms of

$$\{\mathbf{x}_{t-1}, \mathbf{x}_{t}\} \rightarrow \mathbf{y}_{t+1} \tag{5}$$

for all t = 1, …, T − 1. Thus, only the first (initial measurement at process start) and the second value from the measurements in each run need to be known for the RNN calculations. The choice of this interval length was motivated by the presented results for the autocorrelation functions, which show a pronounced non-Markovian behavior. Notably, such an approach allows a simulation of process outcomes also from random starting configurations as outlined in the remainder of this article.
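The robust scaling and the interval procedure described above can be sketched in NumPy as follows. The helper names `robust_scale` and `make_windows` and the synthetic 9 × 6 run are hypothetical and serve only as illustration. In practice the scaling statistics would presumably be estimated on the training data rather than on a single run.

```python
import numpy as np

def robust_scale(data):
    """Subtract the median and divide by the interquartile range, per column."""
    median = np.median(data, axis=0)
    q1, q3 = np.percentile(data, [25, 75], axis=0)
    iqr = np.where(q3 - q1 == 0, 1.0, q3 - q1)  # guard against constant columns
    return (data - median) / iqr

def make_windows(run, n_prev=2):
    """Pair each block of n_prev consecutive time points with the next one."""
    inputs = np.stack([run[i:i + n_prev] for i in range(len(run) - n_prev)])
    targets = run[n_prev:]
    return inputs, targets

# Synthetic run: 9 time points x 6 process parameters (placeholder values)
run = np.arange(9 * 6, dtype=float).reshape(9, 6)
scaled = robust_scale(run)
inputs, targets = make_windows(scaled)
print(inputs.shape, targets.shape)
```

Each row of `inputs` holds two consecutive measurements and the matching row of `targets` holds the measurement that follows them, so a run with 9 time points yields 7 training pairs.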

Generic RNN model for small scale runs

Due to the smaller data set for the development runs, the RNN included only two hidden layers with 120 nodes each and the first layer (LSTM layer) was chosen as recurrent while the second layer relies on a feed-forward connection to the dense output layer. The learning rate was set to 0.001. In order to avoid overfitting, we added a dropout function [77] with a fraction of 0.1 to the LSTM layer. All other hyperparameter settings and chosen algorithms were identical to the specific RNN models for the large scale runs as discussed in the previous subsection. For purposes of training, we used 86 process experimental runs (15 from mAb A, 24 from mAb B, 24 from mAb C and 23 from mAb D) and one randomly chosen process experiment for each mAb in terms of validation procedures. As an extension of a simple leave-one-out procedure, further evaluation with regard to random shuffling of training and test data for 100 repetitions finally provided reliable estimates for important statistical quantities in terms of validation procedures. For all repetitions, we ensured that the test data was not included in the training data set. The training phase initially included 500 epochs with an early stopping function [78]. Input and output values included the titer, TCD, VCD, viability, glucose and lactate concentration, actual pH value, bioreactor volume and the cultivation temperature. In contrast to the specific RNN models, the bioreactor volume, the actual pH value and the considered cultivation temperatures (between 307.65 K and 308.65 K) were additionally taken into consideration. The RNN batch size was chosen as 8 which was used for each input vector. An identical interval learning procedure (Eq. (5)) like for the specific RNN model was used for all calculations.
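The repeated random splitting into training and test runs, with one held-out run per mAb, can be sketched with the Python standard library. The function name `split_one_per_product`, the run labels and the fixed seed are hypothetical; only the run counts per product (16, 25, 25 and 24) and the 100 repetitions follow the text.

```python
import random

def split_one_per_product(runs_by_product, rng):
    """Hold out one randomly chosen run per product and train on the rest."""
    train, test = [], []
    for product, runs in runs_by_product.items():
        held_out = rng.choice(runs)
        test.append((product, held_out))
        train.extend((product, r) for r in runs if r != held_out)
    return train, test

# Placeholder run labels; counts per product follow the text
runs_by_product = {
    "mAb A": [f"A{i}" for i in range(16)],
    "mAb B": [f"B{i}" for i in range(25)],
    "mAb C": [f"C{i}" for i in range(25)],
    "mAb D": [f"D{i}" for i in range(24)],
}

rng = random.Random(42)  # fixed seed only for reproducibility of the sketch
splits = [split_one_per_product(runs_by_product, rng) for _ in range(100)]
train, test = splits[0]
print(len(train), len(test))
```

Because the held-out run is removed before the training list is built, the test data cannot leak into the training set in any repetition.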

Simulations: impact of different temperatures

The generic RNN model was also used for simulations including different temperatures. Although one can in principle also study other effects, e.g. pH variations, we concentrate on the impact of temperatures as these induce the most significant changes in the process outcomes. Each individual run was started at a fixed temperature of 307.65 K, 308.15 K or 308.65 K. For each temperature, we performed 2500 independent process simulations based on the pre-trained small scale RNN models. For the initial and the first time point, the corresponding values for the titer, TCD, VCD, viability, glucose and lactate concentration, pH value and bioreactor volume were drawn from a normal distribution with mean value $\mu_p(\tau)$ and variance $\sigma_p^2(\tau)$, where τ denotes the measurement time. The values for $\mu_p(\tau)$ and $\sigma_p^2(\tau)$ were calculated for the individual process parameters from the original experimental process data sets at the corresponding first two measurement points in accordance with the simulated temperatures. The generic RNN model uses these values as random input parameters drawn from normal distributions with the same moments and provides the corresponding outcomes for the later time points in terms of fixed interval calculations (Eq. (5)). The temperature was kept constant during the simulations while all other parameters were subject to intrinsic changes. The corresponding mean values and standard deviations for the combined simulation runs are calculated at distinct time points in order to study the influence of different temperatures on key process outcomes. For purposes of independent validation, experimental values of mAb E for the VCD, TCD, product titer and viability at comparable time points in terms of a platform process at fixed temperatures of 307.65 K, 308.15 K and 308.65 K were monitored. 
The corresponding processes and values related to mAb E were not used for training or validation of the RNN models and serve as an independent experimental confirmation of the simulations.
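The simulation protocol can be sketched as a Monte Carlo rollout: the first two time points are drawn from normal distributions and every later point is computed from the two preceding ones. In the sketch below, `toy_predictor` is a purely hypothetical stand-in for the pre-trained generic RNN, and the means and standard deviations are invented placeholder numbers rather than experimental values. The temperature, which is held constant in the article, would enter as an additional fixed input and is omitted here for brevity.

```python
import numpy as np

def simulate_runs(predict_next, mean, std, n_steps, n_runs, seed=0):
    """Roll out n_runs trajectories from randomly drawn first two time points.

    mean and std have shape (2, n_params) and describe the first two
    measurements; the result has shape (n_runs, n_steps, n_params).
    """
    rng = np.random.default_rng(seed)
    n_params = mean.shape[1]
    traj = np.empty((n_runs, n_steps, n_params))
    traj[:, :2] = rng.normal(mean, std, size=(n_runs, 2, n_params))
    for t in range(2, n_steps):
        # the next point depends only on the two preceding points
        traj[:, t] = predict_next(traj[:, t - 2], traj[:, t - 1])
    return traj

def toy_predictor(prev, curr):
    """Placeholder for the trained RNN: damped linear extrapolation."""
    return curr + 0.5 * (curr - prev)

# Invented moments for two parameters at the first two time points
mean = np.array([[1.0, 95.0], [2.0, 96.0]])
std = np.array([[0.1, 1.0], [0.2, 1.0]])

traj = simulate_runs(toy_predictor, mean, std, n_steps=15, n_runs=2500)
ensemble_mean = traj.mean(axis=0)  # mean per time point and parameter
ensemble_std = traj.std(axis=0)    # spread per time point and parameter
print(ensemble_mean.shape, ensemble_std.shape)
```

Replacing `toy_predictor` with a call to a trained model would leave the rest of the rollout unchanged, which is the main reason for passing the predictor in as a function.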

Validation methods

Each RNN model was validated by comparison between the computed and the experimental (target) values. As standard statistical quantities, we used the mean absolute error of prediction (MAE) and the root-mean-squared error of prediction (RMSE). When divided by the standard deviation of the ensemble experimental values for the process parameter σExp(x), the corresponding normalized MAEs (nMAEs) and normalized RMSEs (nRMSEs) provide an unbiased estimate for the precision of the predictions. All our results revealed minor values for nMAEs and nRMSEs (nMAE < 0.31 and nRMSE < 0.43), which highlights the fact that the corresponding RNN model achieved a significantly higher accuracy when compared to a simple standard 3σExp(x) deviation criterion. In addition, we computed the corresponding values for the validation and the training data set in order to detect overfitting issues. With regard to the used dropout procedure in combination with the early stopping convention, all our results for the nMAEs and nRMSEs revealed that issues of overfitting can be largely ignored. Furthermore, the Pearson correlation coefficients showed high values (for most values R² ≥ 0.94), which demonstrates the linear relationship between computed and experimental values. Detailed values will be discussed and presented in the remainder of the article. Notably, the unknown functional relationship between the target and the input values does not allow us to compute confidence intervals in order to estimate the statistical accuracy of the calculations. In order to overcome this shortcoming, we split the experimental data sets for the small and the large scale runs into training and test data. The test data was not considered for the training of the RNN and the nRMSE and nMAE values were used for model validation. 
If the calculations for the test data reveal significantly lower nRMSE and nMAE values than unity, a higher precision when compared to randomly drawn parameter values from the underlying experimental ensemble distribution is assumed. Moreover, such an approach also allows a straightforward detection of outliers.
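The validation quantities described above can be computed in a few lines. The sketch below assumes the usual definitions of MAE and RMSE and normalizes both by a supplied ensemble standard deviation. The function name `prediction_errors` and the three-point example are hypothetical.

```python
import math

def prediction_errors(calculated, target, sigma_exp):
    """Return MAE, RMSE and their values normalized by sigma_exp."""
    n = len(target)
    residuals = [c - t for c, t in zip(calculated, target)]
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return {
        "MAE": mae,
        "RMSE": rmse,
        "nMAE": mae / sigma_exp,    # below 1: better than the ensemble spread
        "nRMSE": rmse / sigma_exp,
    }

# Invented numbers purely for illustration
errors = prediction_errors([1.0, 2.0, 4.0], [1.0, 3.0, 2.0], sigma_exp=2.0)
print(errors)
```

Normalized values well below unity indicate that the prediction error is smaller than the run-to-run spread of the experimental ensemble, which is the criterion used in the text.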

Numerical results

In this section, we first discuss the application of the specific RNN model for large scale manufacturing processes. Hereafter, we present a generic RNN model for the prediction of distinct mAb production processes at small scale. The corresponding generic RNN approach will also be used to simulate process outcomes for different temperatures.

Specific RNN model for large scale processes

Principal component analysis and autocorrelation functions

In principle, one may ask why an RNN model should provide reasonable results for key process parameters. Such a question is closely related to the temporal evolution of variables as well as the corresponding Markovian properties. As a further important property, the correlation between the individual parameters can be studied through a principal component analysis (PCA) [48]. The covariance matrix for a process parameter vector x is defined by

$$\mathbf{C} = \left\langle (\mathbf{x} - \langle \mathbf{x} \rangle)(\mathbf{x} - \langle \mathbf{x} \rangle)^{T} \right\rangle$$

where 〈 · 〉 denote mean values. With regard to the use of orthogonal basis transformations to a new vector z in terms of

$$\mathbf{z} = \mathbf{U}^{T}(\mathbf{x} - \langle \mathbf{x} \rangle)$$

and equivalently

$$\mathbf{x} = \mathbf{U}\,\mathbf{z} + \langle \mathbf{x} \rangle$$

one can obtain the following expression

$$\left\langle \mathbf{z}\,\mathbf{z}^{T} \right\rangle = \mathbf{U}^{T}\,\mathbf{C}\,\mathbf{U} = \mathbf{\Lambda}$$

with the diagonal matrix $\mathbf{\Lambda}$, where the jth column of $\mathbf{U}$ corresponds to the principal component PC $j$ with eigenvalue $\lambda_j$. In addition to the introduction of independent and orthogonal eigenvectors (principal components), PCA also provides insights into essential fluctuations. Herewith, the explained variance can be calculated, which sheds light onto concerted process parameter variations [48]. For such an analysis, we considered the K principal components as calculated from the experimental data set, such that the explained variance for the cumulative contribution of fluctuations including all principal components PC $j$ with j = 1, …, α can be written as

$$\mathrm{EV}(\alpha) = \frac{\sum_{j=1}^{\alpha} \lambda_j}{\sum_{j=1}^{K} \lambda_j}$$

with the condition α ≤ K. The corresponding results for the large scale runs are shown in Fig. 2. As can be seen, roughly 67% of all variations within the data set can be assigned to the principal component PC 1. In combination with PC 2, it follows that nearly 95% of all fluctuations and variations can be described by only two PCs. Such extremely high values for the first two PCs forming the essential subspace are remarkable and point to the fact that most of the process outcomes are highly correlated. In addition to the correlations, one can also observe a temporal evolution of the process outcomes as monitored by the first two principal components. Hence, the values in the lower left corner of Fig. 2 (right side) can be attributed to initial process conditions while the final values for key process parameters in terms of PC 1 and PC 2 are located in the upper right corner. The corresponding correlations as shown in the supplementary material reveal that PC 1 is mainly dominated by the titer, the viability and the glucose concentration, while PC 2 shows its highest correlations with the TCD and the lactate concentration.
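The cumulative explained variance can be reproduced with a short NumPy sketch based on an eigendecomposition of the covariance matrix of standard-scaled data. The function name `pca_explained_variance` and the synthetic, strongly correlated test data are assumptions for illustration and do not represent the process data of the article.

```python
import numpy as np

def pca_explained_variance(data):
    """Return eigenvalues (descending), eigenvectors and cumulative explained variance."""
    # standard scaling: zero mean and unit standard deviation per column
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    cov = np.cov(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return eigvals, eigvecs, cumulative

# Synthetic data: four noisy copies of one latent signal (placeholder only)
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
data = np.hstack([latent + 0.05 * rng.normal(size=(200, 1)) for _ in range(4)])

eigvals, eigvecs, cumulative = pca_explained_variance(data)
print(np.round(cumulative, 3))
```

When the columns are strongly correlated, as in this synthetic example, the first entry of `cumulative` is already close to one, which mirrors the observation that a few principal components capture most of the variation.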

Fig. 2

Left side: Explained variance in terms of principal components (PC) for the large scale manufacturing runs. The values for the explained variance of the individual PCs are presented in the inset. Right side: Values of PC 1 and PC 2 for individual large scale manufacturing process runs.

Closely related, the results for the individual autocorrelation functions (ACFs) [79], [80], [81], [82] in terms of actual process parameter values x at certain time points τ and $\tau_0$ with $\tau_0 \le \tau \le \tau_f$ provide an estimate for the temporal correlation and the full decorrelation time, at which ACF(τ) ≈ 0. The corresponding results for all process outcomes are presented in Fig. 3. As can be seen, the autocorrelation functions for the titer, lactate concentration and viability show a comparable decay and thus strong temporal correlations. In addition, all correlations vanish for these parameters at $\tau/\tau_f$ = 0.4, where $\tau_f$ denotes the final time point. Notably, also the values for the TCD, VCD and glucose concentration show a concerted temporal decay with a shorter decorrelation time of $\tau/\tau_f$ = 0.2. Such findings can be rationalized by the strong correlation between lactate and titer production as well as glucose consumption [83]. Due to different slopes, the individual phases of the process in terms of exponential growth phase and stationary non-growth phase can be clearly identified [84]. Despite the fact that one recognizes two relevant decorrelation times for the initial decay of the process variables, the broad comparability of the individual process parameter autocorrelation functions becomes evident. As already mentioned, such characteristics are highly beneficial for any RNN in terms of non-Markovian processes which facilitate meaningful predictions for reasonable changes in the process outcomes. Finally, the negative values for the ACF can be attributed to an anticorrelated behavior in which the temporal change of the process parameter values is reversed.
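As a rough illustration of how such a decorrelation analysis might be carried out, the sketch below computes a normalized autocorrelation of fluctuations around the ensemble mean, relative to a reference time point. This particular definition is an assumption made here for illustration and may differ from the one used in the article. The function name `ensemble_acf` and the synthetic data are hypothetical as well.

```python
import numpy as np

def ensemble_acf(runs, ref=0):
    """Normalized autocorrelation relative to the reference time index ref.

    runs has shape (n_runs, n_times) and holds one process parameter.
    Assumed definition: <dx(tau_0) dx(tau)> / <dx(tau_0)^2>, where dx is the
    deviation from the ensemble mean and <.> averages over runs.
    """
    fluct = runs - runs.mean(axis=0)
    norm = np.mean(fluct[:, ref] ** 2)
    return np.mean(fluct[:, [ref]] * fluct[:, ref:], axis=0) / norm

# Synthetic ensemble: 118 runs, 9 time points, decaying memory (placeholder)
rng = np.random.default_rng(2)
runs = np.empty((118, 9))
runs[:, 0] = rng.normal(size=118)
for t in range(1, 9):
    runs[:, t] = 0.5 * runs[:, t - 1] + rng.normal(scale=0.5, size=118)

acf = ensemble_acf(runs)
print(np.round(acf, 2))
```

With this definition the first entry equals one by construction, and negative entries indicate anticorrelated fluctuations relative to the reference time point.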

Fig. 3

Autocorrelation functions for the corresponding temporal process parameter changes with reference to distinct time points $\tau/\tau_f$.

Results of the specific RNN model

The experimental data sets for the large scale runs in terms of mean values and standard deviations for certain time points are presented in the supplementary material. As expected for large scale manufacturing processes, individual variations due to slight process parameter changes are noticeable, but not highly pronounced. Nevertheless, such small variations are a challenging task for a specific RNN approach. Arbitrarily chosen experimental values for key parameters in combination with the outcomes of specific RNN model calculations are presented in Fig. 4. In general, one can recognize a good agreement between the calculated and the experimental values, which also includes accurate predictions for rapid changes in the glucose concentration at $\tau/\tau_f$ = 0.5–0.6 as well as for the TCD (significant increase for $\tau/\tau_f$ > 0.6). With regard to larger standard deviations at certain time points in the training data sets (as shown in the supplementary material), one would specifically assume less precise RNN calculations for TCD and VCD values at $\tau/\tau_f$ > 0.2. With reference to the RNN results, it can indeed be seen that the calculated TCD and VCD values show some slight variations at exactly these time points. In terms of the chosen interval approach (Eq. (5)), one would assume that such inaccuracies also progress to later time points, which rationalizes the slight discrepancies between experimental and computed results. Corresponding conclusions can also be drawn for some outliers in the glucose and lactate concentration at later process times. In terms of the experimental values as shown in the supplementary material, it can be seen that the standard deviation of the data points increases with process time. Hence, such an increasing variability can be regarded as a challenge for the interval learning approach (Eq. (5)) in terms of error progression which rationalizes the observed slight deviations. 
In addition, the glucose concentration is slightly changed by non-monitored external bolus additions whereas the lactate concentration strongly depends on the cell metabolism. Despite such slight deviations, it has to be noted that all trends in the process parameters are well reproduced.

Fig. 4

Specific RNN model calculations (blue triangles) and experimental results (red circles) for the large scale process data sets in terms of randomly chosen process runs including TCD (top left), VCD (top right), viability (middle left), titer (middle right) as well as glucose (bottom left) and lactate concentration (bottom right). The blue lines correspond to cubic spline functions as guides for the eye. The error bars denote the global root-mean-squared errors of prediction for the RNN in terms of the test data set and the corresponding target variable (see text for more details). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The results for the computed values based on the training and the test data sets are shown in Fig. 5 and the corresponding key statistical values are presented in Table 1. All computed results show a high linear correlation with the experimental data, with Pearson correlation coefficients R² ≥ 0.94. Notably, the lowest values of R² are observed for the VCD, the TCD and the glucose concentration. The larger deviations from linearity for these process outcomes can be related to the more pronounced variations in the experimental data set as discussed before. Although some outliers deviate significantly from the black line with unit slope, the vast majority of predictions agree closely with the corresponding experimental values, as reflected by the rather low mean absolute errors (MAE) for the validation data set as defined by

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

as well as the root-mean-squared error (RMSE)

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

where $y_i$ and $\hat{y}_i$ denote the corresponding calculated and target values for N samples. Such assumptions are further underpinned by the values of the normalized MAEs

$$\mathrm{nMAE} = \frac{\mathrm{MAE}}{\sigma_{\mathrm{Exp}}}$$

and the normalized RMSEs

$$\mathrm{nRMSE} = \frac{\mathrm{RMSE}}{\sigma_{\mathrm{Exp}}}$$

where σExp denotes the standard deviation of the experimentally measured ensemble data for the corresponding process parameter.

Specifically, the calculations for the titer, the viability and the lactate concentration reveal a high level of accuracy in terms of low nMAEs and nRMSEs (Table 1). Slightly larger deviations can be observed for the TCD, the VCD and the glucose concentration. Nevertheless, the corresponding nMAEs and nRMSEs are smaller than unity, which highlights the applicability of the RNN model even for predictions of more complex process outcomes. Hence, the results of the specific RNN model provide a significantly higher accuracy when compared to statistical estimates in terms of experimental standard deviations and the often used 3σExp criterion. In addition, a comparison of the nMAE and nRMSE values in Table 1 for the training and the test data reveals values of comparable order. Hence, significant issues of overfitting can be largely ruled out, such that the aforementioned outliers can mainly be attributed to the broader experimental variability at the corresponding process stages. In consequence, the corresponding nMAEs and nRMSEs show a good predictive accuracy, which rationalizes the use of this approach for large scale manufacturing runs. With regard to this point, it has to be noted that large scale processes show only minor variations due to already well-defined process conditions when compared to exploratory small scale development processes. The application of RNNs for such processes will be discussed in more detail in the next subsection.
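The error measures described above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `error_metrics` and the numerical values are hypothetical and are not taken from the study.

```python
import numpy as np

def error_metrics(y_calc, y_target, sigma_exp):
    """Return MAE, RMSE and their counterparts normalized by sigma_exp."""
    y_calc = np.asarray(y_calc, dtype=float)
    y_target = np.asarray(y_target, dtype=float)
    residuals = y_calc - y_target
    mae = float(np.mean(np.abs(residuals)))
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    return {
        "MAE": mae,
        "RMSE": rmse,
        "nMAE": mae / sigma_exp,    # below 1: mean error smaller than sigma_exp
        "nRMSE": rmse / sigma_exp,
    }

# Illustrative numbers only (not data from the study)
y_target = np.array([1.0, 2.0, 3.0, 4.0])
y_calc = np.array([1.1, 1.9, 3.2, 3.8])
sigma_exp = 0.5  # assumed ensemble standard deviation of the measurements
metrics = error_metrics(y_calc, y_target, sigma_exp)
```

Normalized values below unity indicate that the typical prediction error is smaller than the spread of the experimental ensemble, which is the criterion applied in the text.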

Fig. 5

Specific RNN model calculations for the test data set (red diamonds) and for the training data set (blue circles) with regard to the experimentally measured data (x-axis) and the predicted values (y-axis) including the TCD (top left), VCD (top right), viability (middle left), product titer (middle right), glucose (bottom left) and lactate concentration (bottom right). The black solid line has a slope of unity and represents full coincidence between measured and predicted values while the straight blue lines represent the ensemble standard deviation σExp of the experimental data set. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 1

Mean Pearson correlation coefficients R², fraction of computed values x which are located within the ensemble experimental standard deviation P(x < σExp), normalized mean absolute error (nMAE) and normalized root-mean-squared error (nRMSE) between computed and experimental values for the specific RNN model when averaged over the test data set (columns nMAE and nRMSE) and over the training data set (columns nMAE_tr and nRMSE_tr).

Value	R²	P(x < σ_Exp)	nMAE	nRMSE	nMAE_tr	nRMSE_tr
Titer	0.99	1.0	0.09	0.14	0.04	0.06
TCD	0.94	0.95	0.30	0.42	0.07	0.11
VCD	0.94	0.95	0.30	0.40	0.08	0.12
Viability	0.99	1.0	0.16	0.22	0.07	0.09
Glucose	0.97	0.95	0.25	0.33	0.06	0.10
Lactate	0.99	0.98	0.16	0.23	0.04	0.08
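The two remaining statistics reported for the specific model, the Pearson correlation coefficient and the fraction P(x < σExp), can be sketched as follows. The sketch assumes that "located within the ensemble experimental standard deviation" means an absolute deviation below σExp; this reading, the function names and the example numbers are assumptions and not taken from the source. Because the table header is written as R², both the coefficient and its square are computed.

```python
import numpy as np

def pearson_r(y_calc, y_target):
    """Pearson correlation coefficient between calculated and measured values."""
    return float(np.corrcoef(y_calc, y_target)[0, 1])

def fraction_within_sigma(y_calc, y_target, sigma_exp):
    """Fraction of calculated values closer than sigma_exp to the measurement.

    Assumed reading of P(x < sigma_Exp): |y_calc - y_target| < sigma_exp.
    """
    deviations = np.abs(np.asarray(y_calc, float) - np.asarray(y_target, float))
    return float(np.mean(deviations < sigma_exp))

# Illustrative numbers only (not data from the study)
y_target = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_calc = np.array([1.2, 1.9, 3.1, 4.7, 5.0])
r = pearson_r(y_calc, y_target)
r_squared = r ** 2
p_within = fraction_within_sigma(y_calc, y_target, sigma_exp=0.5)
```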

Generic RNN model for small scale processes

With regard to the last section, it can be concluded that a specific RNN model for large scale runs provides meaningful results. However, the question remains whether a generic model can also be developed which is able to compute the outcomes of mAb production processes at smaller scales. Such a model would be helpful to study optimal process conditions and to predict general trends for the performance of novel development candidates. Motivated by these points, we combined the small scale run data sets for four mAb products in order to study the properties as well as the validity of such a generic RNN model. As a first step, we performed a PCA on the corresponding data. The results for the explained variance of the combined process data as well as a projection of the process data on the first principal components are shown in Fig. 6. Due to the larger number of input variables, we have to consider 9 PCs, in contrast to the large scale runs. In consequence, the first two principal components PC 1 and PC 2 alone explain only roughly 58% of the variance. Hence, the corresponding values are somewhat smaller when compared to the large scale runs, which can be rationalized by the larger number of input vectors as well as the distinct process characteristics (as shown in the supplementary material). A projection of the process data on the first two principal components is depicted on the right side of Fig. 6. As can be seen, the individual process data differ slightly in terms of mean positions and ranges, but significant overlap regions can also be identified. Thus, the individual processes show slight deviations but also some similarities, which rationalizes their use for the development of a generic RNN model. Specifically, the individual values for PC 2 highlight the clustering of the data into separated mAb processes. Notably, the points in the lower left corner in Fig. 6 can be assigned to initial process parameter values while the symbols in the upper right corner correspond to final process outcomes. Such conclusions are further supported by the individual correlation coefficients of the principal components with the considered process parameters as shown in the supplementary material, which reveal high correlations of PC 1 with the product titer, the pH value and the viability, and of PC 2 with the product titer, the volume and the TCD. In comparison to the specific RNN model, it can be assumed that the accuracy of the generic RNN approach will be lower, which is due to the broader variation of the corresponding process parameters with regard to the individual platform projects.
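The explained-variance analysis and the projection on PC 1 and PC 2 can be illustrated with a short NumPy sketch. It assumes that the process variables are standardized before the decomposition, which the source does not state; the synthetic data matrix with 9 columns and the function name `pca_explained_variance` are purely hypothetical.

```python
import numpy as np

def pca_explained_variance(X):
    """PCA via SVD on standardized columns.

    Returns the explained-variance ratio per component and the scores
    (projection of every sample on the principal components).
    """
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # assumed standardization
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = Xs @ Vt.T
    return explained, scores

# Synthetic stand-in data: 200 samples of 9 process variables (not study data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
X[:, 1] += 0.8 * X[:, 0]  # introduce some correlation between two variables

explained, scores = pca_explained_variance(X)
cumulative_two = explained[:2].sum()  # share of variance captured by PC 1 and PC 2
pc1, pc2 = scores[:, 0], scores[:, 1]  # coordinates for a PC 1 / PC 2 scatter plot
```

Plotting `pc1` against `pc2` with one marker per product would give a projection of the kind described for the right side of Fig. 6.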

Fig. 6

Left side: Explained variance in terms of principal components (PC) for the small scale process runs. The corresponding value of the explained variance for the individual PCs is presented in the inset. Right side: Values for principal component 1 and principal component 2 in terms of individual process runs for mAb A (blue circles), mAb B (red squares), mAb C (gray triangles) and mAb D (black diamonds). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Despite these slight discrepancies, the results for the autocorrelation functions (Fig. 7) highlight a comparable temporal evolution of the corresponding process variables. Thus, all mAb process outcomes show a similar decay pattern for the titer, VCD and TCD with a decorrelation time of τ/τmax ≈ 0.35. In contrast to the large scale runs (Fig. 3), the temporal evolution of the titer is inherently coupled to the VCD and the decorrelation time is significantly larger. These findings can be rationalized by the complex biological metabolism of the CHO cells as described in the literature [83], [84], [85], [20]. Notably, the comparable temporal decay of all process outcomes for distinct products can be considered a consequence of the underlying platform process. With regard to this point, the autocorrelation functions for the viability as well as the glucose and lactate concentrations also reveal a comparable decay. In consequence, the outcomes of the PCA and the ACF highlight the potential applicability of a generic RNN model. Moreover, the pronounced non-Markovian behavior for the first three time points (τ/τmax < 0.18) rationalizes the use of the proposed interval learning scheme (Eq. (5)).
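A normalized autocorrelation function and a simple estimate of the decorrelation time can be sketched as follows. The estimator, the 1/e threshold used to define the decorrelation time and the synthetic logistic profile are assumptions made for illustration; the definition used by the authors is not given in this section.

```python
import numpy as np

def autocorrelation(x):
    """Normalized autocorrelation of the fluctuations of x for all lags."""
    dx = np.asarray(x, dtype=float)
    dx = dx - dx.mean()
    n = len(dx)
    acf = np.array([np.dot(dx[: n - k], dx[k:]) / (n - k) for k in range(n)])
    return acf / acf[0]

def decorrelation_index(acf, threshold=1.0 / np.e):
    """Index of the first lag at which the ACF drops below the threshold."""
    below = np.nonzero(acf < threshold)[0]
    return int(below[0]) if below.size else None

# Synthetic sigmoidal profile on a normalized time grid (not study data)
t = np.linspace(0.0, 1.0, 21)               # plays the role of tau / tau_max
x = 1.0 / (1.0 + np.exp(-10.0 * (t - 0.5)))

acf = autocorrelation(x)
idx = decorrelation_index(acf)
tau_decorr = t[idx]  # decorrelation time in units of tau_max
```

A slow decay of the function over the first lags indicates that a value depends on several preceding time points, which is the kind of memory the text refers to as non-Markovian behavior.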

Fig. 7

Autocorrelation functions for the four mAb production processes as denoted by circles (mAb A), squares (mAb B), triangles (mAb C) as well as diamonds (mAb D) for the product titer (blue color), TCD (red color) and VCD (gray color). The corresponding results for the viability, the glucose and the lactate concentration are shown on the right side. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The raw data for the individual process runs are presented in the supporting material. In contrast to the large scale runs, the variance of the ensemble small-scale runs due to distinct mAb production processes is more pronounced. However, the question remains whether a generic RNN model is able to distinguish between the four distinct mAbs in terms of individual process predictions. Examples of predicted values for randomly chosen process runs for products mAb A, mAb B, mAb C and mAb D are presented in Fig. 8. As can be seen, the computed results show somewhat larger variations, which are expected given the larger variance in the experimental data as shown in the supplementary material. However, it is worth noting that the RNN is even able to reproduce the complex glucose concentration profiles observed in the experiments. Notably, the deviations for all values become larger at τ/τmax ≥ 0.75, which can be related to the propagation of uncertainties as discussed in the previous subsection. Despite some discrepancies, the corresponding results agree well with the experimental data such that general trends are well reproduced.

Fig. 8

Specific RNN model results (blue diamonds) for randomly chosen processes from four mAb development candidates in combination with the corresponding experimental results (red squares) including the TCD (top left), VCD (top right), viability (middle left), titer (middle right) as well as glucose (bottom left) and lactate concentration (bottom right). Measured data for the titer at τ/τmax ≤ 0.6 are not available. The predicted profiles (blue lines) are cubic spline functions which connect the outcomes of the individual RNN calculations. The error bars denote the global mean absolute errors of calculations for the RNN in terms of the corresponding target variables (see text for more details). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The comparison between all predicted and experimental values for different mAbs is shown in Fig. 9. As can be seen, the accuracy is lower than for the specific RNN predictions for the large scale runs, but the predictions still agree reasonably well with the experimental outcomes. With regard to the corresponding statistical values in Table 2, the normalized MAEs and normalized RMSEs reveal a satisfactory accuracy. Slightly larger deviations can mainly be observed for the glucose and the lactate concentrations, which are subject to modified feeding strategies within the process as well as metabolic properties. In summary, the corresponding results for the titer, the TCD, the VCD and the viability reveal a high predictive accuracy. Although certain predictions for individual mAb process outcomes differ from the experimental values, e.g. larger differences for mAb A between predicted and experimental values in Fig. 9, one can conclude that the generic RNN model is validated for processes with comparable parameter variation ranges. Such conclusions are also underpinned by the low nMAE and nRMSE values, which rationalize the validity of our approach.

Fig. 9

RNN calculations for the training data set (blue diamonds) and for test process data from mAb A (brown circles), mAb B (black circles), mAb C (red circles) and mAb D (magenta circles) with regard to the experimentally measured data (x-axis) and the predicted values (y-axis) for the titer (top left), TCD (top right), VCD (middle left), viability (middle right), glucose (bottom left) and lactate concentration (bottom right). The black solid line has a slope of one and represents full coincidence between measured and predicted values. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2

Mean Pearson correlation coefficients R², fraction of computed values x which are located within the ensemble experimental standard deviation P(x < σExp), normalized mean absolute error (nMAE) and normalized root-mean-squared error (nRMSE) between computed and experimental values for the generic small-scale RNN model when averaged over the test data set (columns nMAE and nRMSE) and over the training data set (columns nMAE_tr and nRMSE_tr).

Value	R²	P(x < σ_Exp)	nMAE	nRMSE	nMAE_tr	nRMSE_tr
Titer	0.98	0.99	0.08	0.18	0.05	0.12
TCD	0.99	0.93	0.20	0.22	0.06	0.09
VCD	0.99	0.93	0.22	0.23	0.06	0.09
Viability	0.99	0.98	0.13	0.14	0.04	0.12
Glucose	0.83	0.95	0.29	0.38	0.07	0.14
Lactate	0.95	0.94	0.30	0.39	0.07	0.10

Simulated processes: temperature effects

In this subsection, we use the generic RNN model to study the influence of distinct conditions on small scale process outcomes. Here, we explicitly focus on different temperatures and how these affect the corresponding key process outcomes. To this end, we simulated artificial process runs for fixed temperatures T = 307.65 K, 308.15 K and 308.65 K. The corresponding results with standard deviations (vertical bars) are presented in Fig. 10. The simulations are compared to experimental outcomes for key parameters of mAb E (filled symbols), which was produced with a comparable platform process. Notably, the values for mAb E were not used for training of the generic or specific RNN models. As can be seen, the corresponding values for the product titer, TCD, VCD and viability are mainly located within the standard deviation of the process simulations. Slight deviations can only be observed for the viability at the lowest considered temperature. Despite these differences, the computed results are in good agreement with independent experimental values as monitored for the product mAb E subject to the same platform process.

Fig. 10

Generic RNN model simulations for the key process outcomes product titer, TCD, VCD and viability at fixed temperatures T = 307.65 K, 308.15 K and 308.65 K. The individual results are obtained from 2500 simulations each with random starting conditions. The corresponding bars represent standard deviations for the individual temperatures as obtained by averaging over the 2500 simulations. The circles denote individual experimental values for one process run of mAb E at T = 307.65 K, while the squares and the triangles represent the outcomes for 308.15 K and 308.65 K, respectively.

Besides predictions, one can obtain insights into the impact of distinct temperatures on growth rates and metabolite concentrations. For instance, increasing temperatures lead to higher titer as well as TCD and VCD values. In contrast, the values for the viability decrease with increasing temperatures. These findings can be rationalized by the faster metabolism at higher temperatures as known for mammalian cells [85]. In consequence, the generic RNN model can be used to gain deeper insights into modified process conditions and how they affect the process outcomes.
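The ensemble procedure described above (many simulations with random starting conditions per temperature, summarized by mean and standard deviation) can be sketched as follows. The function `toy_model` is a hypothetical stand-in that only makes the example runnable; it is not the trained RNN, and the size and direction of its temperature response as well as the assumed range of starting conditions are arbitrary. The three temperatures correspond to 34.5 °C, 35.0 °C and 35.5 °C.

```python
import numpy as np

def toy_model(start_value, temperature_k):
    """Hypothetical stand-in for a trained model (NOT the RNN from the paper).

    The linear temperature response is arbitrary and for illustration only.
    """
    return start_value * (1.0 + 0.1 * (temperature_k - 308.15))

def simulate_ensemble(model, temperature_k, n_runs=2500, seed=0):
    """Run n_runs simulations with random starting conditions.

    Returns the ensemble mean and the sample standard deviation.
    """
    rng = np.random.default_rng(seed)
    starts = rng.uniform(0.9, 1.1, size=n_runs)  # assumed starting-condition range
    outcomes = np.array([model(s, temperature_k) for s in starts])
    return float(outcomes.mean()), float(outcomes.std(ddof=1))

temperatures_k = (307.65, 308.15, 308.65)
results = {T: simulate_ensemble(toy_model, T) for T in temperatures_k}
```

Each entry of `results` holds the pair (mean, standard deviation) for one temperature, which corresponds to a symbol and its vertical bar in a plot of the kind shown in Fig. 10.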

Summary and conclusions

We presented a novel approach for the calculation and prediction of upstream process outcomes in terms of specific and generic RNN models which, in contrast to semi-parametric or fully parametric approaches, do not rely on specific calibration procedures. We demonstrated the validity of the models for large scale runs as well as for distinct individual small scale processes in terms of a platform-dependent generic RNN model. The corresponding results reveal a good agreement with the experimental data, which highlights the validity of our approach. All calculated values show minor variations when compared to ensemble experimental standard deviations such that the normalized MAE and normalized RMSE values are smaller than unity. Thus, our models provide a high accuracy and can also be used to simulate key process outcomes for small scale upstream processes in order to support the identification of suitable process conditions. In principle, one can use such simulations to study the effect of varying temperatures, pH values or other process parameter variations like modified feeding strategies on the growth rates as well as metabolite concentrations. Even for large scale runs with minor parameter variations, the corresponding approach can be considered a useful alternative to hybrid or mechanistic models. In particular, the proposed method reveals its benefits in terms of tighter process control and the identification of potential outliers.

In contrast to parametric models like mechanistic approaches, the proposed RNN modelling strategy is also able to consider intensive parameters like temperatures, pH values or dissolved oxygen content. Comparable conclusions can be drawn with regard to modified bolus additions or feeding strategies, which often require a singular and ad-hoc change of the parameters in mechanistic models. Notably, such variations contradict the differentiable form of reaction dynamics in thermodynamic equilibrium and also violate the minimum entropy production principle [86], thereby pointing to the fact that mechanistic models which only rely on mass balance conditions reveal certain shortcomings. Similar conclusions are also valid for hybrid models, which crucially rely on temporally varying rate constants. As for mechanistic models, certain aspects of these models are inconsistent with equilibrium thermodynamics as well as linear non-equilibrium thermodynamics in terms of rapid and non-continuous changes of the entropy production. Thus, the RNN models circumvent the missing detailed knowledge about the underlying reactions, such that a prediction of process outcomes only relies on non-Markovian properties. Hence, although hybrid models may provide a comparable functionality and predictive capability when compared to the RNN approach, it has to be stated that these are often in conflict with the underlying thermodynamic principles.

In consequence, we highlight the straightforward and fast development of RNN models for cultivation processes. The underlying conflicts with thermodynamic boundary conditions can be circumvented by the proposed non-parametric functional form. To the best of our knowledge, such a broad applicability for generic and specific process description has not yet been established for any other modeling approach. Although it has to be noted that hybrid as well as mechanistic models reveal their benefits depending on the level of considered detail [87], a comparably complex parameter calibration procedure as known for parametric models is not needed for our approach. Furthermore, intrinsic parameter values like the temperature as well as the pH value, which are not part of mass balance conditions, can be straightforwardly included in the model. Moreover, the use of non-parametric methods also provides a fast and straightforward retraining of the model if more experimental data become available. The straightforward and fast calculation procedures in terms of full automation and thus in-line process control can be seen as the largest benefits when compared to other parametric or semi-parametric models. With regard to the recent discussions about the importance of integrated process models, digital twins as well as holistic process models [1], it also has to be noted that RNN approaches can be implemented straightforwardly in any software platform. In summary, the presented RNN models are highly flexible, straightforward to train and they can be used for distinct platform projects in upstream as well as downstream development and manufacturing.

Conflict of interest

The authors declare no conflict of interest.
