
Optimization to the Phellinus experimental environment based on classification forecasting method.

Zhongwei Li1, Yuezhen Xin1, Xuerong Cui1, Xin Liu1, Leiquan Wang1, Weishan Zhang1, Qinghua Lu1, Hu Zhu2.   

Abstract

Phellinus is a fungus known as one of the elemental components in anticancer drugs. With the purpose of finding optimized culture conditions for Phellinus production in the laboratory, plenty of single-factor experiments were performed and a large amount of experimental data was generated. In previous work, we used regression analysis and a gene-set based Genetic Algorithm (GA) to predict the production, but the data we used depended on experimental experience and only a small part of the data was used. In this work we use the values of the parameters involved in the culture conditions, including inoculum size, pH value, initial liquid volume, temperature, seed age, fermentation time and rotation speed, to establish a high-yield versus low-yield classification model. Subsequently, a BP neural network prediction model is established for the high-yield data set, and GA is used to find the best culture conditions. The forecast accuracy rate is more than 90%, and the yield we obtained shows a slight increase over the real yield.


Year:  2017        PMID: 28957375      PMCID: PMC5619749          DOI: 10.1371/journal.pone.0185444

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


1 Introduction

Phellinus is a fungus of great medicinal value, known as one of the elemental components in anticancer drugs [1, 2]. Phellinus flavonoids come from one of the most popular parasitifers of Phellinus in nature [3]. Research on Phellinus focuses on polysaccharides, proteoglycans, medicinal mechanisms, composition, etc., which are mostly extracted from the fruiting bodies [4]. Phellinus rarely exists in the wild [5], so cultivating Phellinus in the laboratory has become a promising research branch. Through mycelial growth by liquid fermentation, flavonoids, polysaccharides, alkaloids and other active substances can be produced in the fermentation broth. These products have high physiological activity, a short fermentation period and mass production, thus providing a feasible way of producing Phellinus in the laboratory [6]. In recent years, updated machine learning approaches [7, 8] have been developed and applied to biological data processing. From the understanding of the wild conditions of Phellinus, it is found that pH value, temperature and fermentation time all affect the production. In addition, in general biochemical experiments, we need to consider the inoculum size, initial liquid volume, seed age and rotation speed [9, 10]. In the laboratory, plenty of experiments have been designed and performed to maximize Phellinus production. Artificial algorithms and models have been applied to bio-processes, particularly for the optimization of culture conditions. In [11], artificial neural networks (ANN) are used to optimize the extraction process of azalea flavonoids. Neural networks combined with evolutionary algorithms have also been used to optimize experimental environments; for example, a neural network combined with particle swarm optimization is used to find culture conditions that maximize the production of pleuromutilin from Pleurotus mutilus in [12].
The concept of classification is to learn a classification function from existing data, or to construct a classification model (what we usually call a classifier). The function or model maps data records in a database to a given category, and it can be applied to data prediction [13, 14]. Recently, many significant artificial intelligence algorithms and data processing strategies have been applied to data mining, such as a self-adaptive artificial bee colony algorithm based on global best for global optimization [15], a public auditing protocol with a novel dynamic structure for cloud data [16], a privacy-preserving smart semantic search method for conceptual graphs over encrypted outsourced data [17], and a privacy-preserving and copy-deterrence content-based scheme for image data processing with retrieval in cloud computing [18]; machine learning methods have also been applied to experimental condition design, see e.g. a secure and dynamic multi-keyword ranked search scheme over encrypted cloud data [19]. The Genetic Algorithm (GA) derives from computer simulation studies of biological systems [20] and has been widely used in function optimization, combinatorial optimization, job shop scheduling problems [21], complex network clustering and pattern mining [22-24]. However, it still has some disadvantages, the most obvious being low efficiency and a tendency to fall into local optima [25, 26]. In our previous paper [27], we used the data collected during these experiments and statistical methods to establish a mathematical model to forecast the flavonoid yield, which is the most important product of Phellinus. With the purpose of finding the best Phellinus culture environment, the mathematical model was used as the fitness function of the GA. The result we obtained corresponds closely to the conclusions given by biologists.
However, during this process, the data we chose to establish the mathematical model mainly relied on the prior knowledge of biologists, so we only used a small part of the whole data set and missed some information. Besides, the method does not work well in areas where prior knowledge is lacking. In addition, a regression or BP neural network model established on the whole data set cannot give an accurate result. Therefore, in this paper, we apply a classification algorithm to the whole sample set and achieve a good classification accuracy. On the basis of the high-yield data set, a BP neural network and GA are used to optimize the yield. Finally, we find a better result than both our previous work and the real data. This method can be applied more extensively in biological experiments.

2 Data collected and data classification

2.1 Data collected

In this section, biological experiments are performed to find the optimal value of a single factor at a time. Table 1 lists the experiments performed to collect these data. Rows 1-14 correspond to experiments with pH values ranging from 1 to 14, where the temperature is fixed at 28°C, the initial volume is 100 ml, the rotation speed is 140 r/m and the seed age is 8 days. Rows 15 to 20 are 6 experiments with initial volume ranging from 40 ml to 140 ml, where the pH value is set to 6, the best value obtained from the experiments with pH ranging from 1 to 14.
Table 1

Experiments with pH values ranging from 1 to 14 and initial volume ranging from 40 ml to 140 ml.

| pH | Temp (°C) | Initial volume (ml) | Rotation speed (r/m) | Inoculum size (%) | Seed age (days) | Fermentation time | Phellinus yield (μg/ml) | Class |
|----|-----------|---------------------|----------------------|-------------------|-----------------|-------------------|-------------------------|-------|
| 1  | 28 | 100 | 140 | 5 | 8 | 8 | 45.929   | 0 |
| 2  | 28 | 100 | 140 | 5 | 8 | 8 | 35.077   | 0 |
| 3  | 28 | 100 | 140 | 5 | 8 | 8 | 45.654   | 0 |
| 4  | 28 | 100 | 140 | 5 | 8 | 8 | 534.39   | 0 |
| 5  | 28 | 100 | 140 | 5 | 8 | 8 | 702.81   | 0 |
| 6  | 28 | 100 | 140 | 5 | 8 | 8 | 1467.7   | 1 |
| 7  | 28 | 100 | 140 | 5 | 8 | 8 | 189.20   | 0 |
| 8  | 28 | 100 | 140 | 5 | 8 | 8 | 91.049   | 0 |
| 9  | 28 | 100 | 140 | 5 | 8 | 8 | 60.841   | 0 |
| 10 | 28 | 100 | 140 | 5 | 8 | 8 | 57.225   | 0 |
| 11 | 28 | 100 | 140 | 5 | 8 | 8 | 43.238   | 0 |
| 12 | 28 | 100 | 140 | 5 | 8 | 8 | 36.288   | 0 |
| 13 | 28 | 100 | 140 | 5 | 8 | 8 | 20.943   | 0 |
| 14 | 28 | 100 | 140 | 5 | 8 | 8 | 22.306   | 0 |
| 6  | 28 | 40  | 140 | 5 | 8 | 8 | 508.495  | 0 |
| 6  | 28 | 60  | 140 | 5 | 8 | 8 | 900.662  | 0 |
| 6  | 28 | 80  | 140 | 5 | 8 | 8 | 1273.594 | 1 |
| 6  | 28 | 100 | 140 | 5 | 8 | 8 | 1153.937 | 0 |
| 6  | 28 | 120 | 140 | 5 | 8 | 8 | 1123.330 | 0 |
| 6  | 28 | 140 | 140 | 5 | 8 | 8 | 1088.064 | 0 |
In Table 2, experiments with inoculum size ranging from 2% to 16% and temperature ranging from 25°C to 40°C are performed. Experiments with fermentation time ranging from 1 to 12 hours are shown in Table 3. From the total of 45 experiments, we collect data on culture conditions for the production of Phellinus. Different culture conditions have a fundamental influence on the production of Phellinus; however, the optimized culture conditions remain unknown.
Table 2

Experiments with inoculum size ranging from 2% to 16% and temperature ranging from 25°C to 40°C.

| pH | Temp (°C) | Initial volume (ml) | Rotation speed (r/m) | Inoculum size (%) | Seed age (days) | Fermentation time | Phellinus yield (μg/ml) | Class |
|----|-----------|---------------------|----------------------|-------------------|-----------------|-------------------|-------------------------|-------|
| 6 | 28 | 100 | 140 | 2  | 8 | 8 | 546.609  | 0 |
| 6 | 28 | 100 | 140 | 4  | 8 | 8 | 606.345  | 0 |
| 6 | 28 | 100 | 140 | 6  | 8 | 8 | 1320.794 | 1 |
| 6 | 28 | 100 | 140 | 8  | 8 | 8 | 1447.519 | 1 |
| 6 | 28 | 100 | 140 | 10 | 8 | 8 | 1841.729 | 1 |
| 6 | 28 | 100 | 140 | 12 | 8 | 8 | 1631.990 | 1 |
| 6 | 28 | 100 | 140 | 14 | 8 | 8 | 481.1172 | 0 |
| 6 | 28 | 100 | 140 | 16 | 8 | 8 | 449.5187 | 0 |
| 6 | 25 | 40  | 140 | 10 | 8 | 8 | 1145.669 | 0 |
| 6 | 30 | 60  | 140 | 10 | 8 | 8 | 1506.055 | 1 |
| 6 | 35 | 80  | 140 | 10 | 8 | 8 | 1374.982 | 1 |
| 6 | 40 | 100 | 140 | 10 | 8 | 8 | 875.341  | 0 |
Table 3

Experiments with fermentation time ranging from 1 to 12 hours.

| pH | Temp (°C) | Initial volume (ml) | Rotation speed (r/m) | Inoculum size (%) | Seed age (days) | Fermentation time | Phellinus yield (μg/ml) | Class |
|----|-----------|---------------------|----------------------|-------------------|-----------------|-------------------|-------------------------|-------|
| 6 | 28 | 100 | 150 | 2  | 8 | 1  | 56.606   | 0 |
| 6 | 28 | 100 | 150 | 4  | 8 | 2  | 83.435   | 0 |
| 6 | 28 | 100 | 150 | 6  | 8 | 3  | 303.984  | 0 |
| 6 | 28 | 100 | 150 | 8  | 8 | 4  | 449.919  | 0 |
| 6 | 28 | 100 | 150 | 10 | 8 | 5  | 777.331  | 0 |
| 6 | 28 | 100 | 150 | 12 | 8 | 6  | 1103.987 | 0 |
| 6 | 28 | 100 | 150 | 14 | 8 | 7  | 1619.554 | 1 |
| 6 | 28 | 100 | 150 | 16 | 8 | 8  | 1597.995 | 1 |
| 6 | 28 | 100 | 150 | 10 | 8 | 9  | 1546.336 | 1 |
| 6 | 28 | 100 | 150 | 10 | 8 | 10 | 1502.487 | 1 |
| 6 | 28 | 100 | 150 | 10 | 8 | 11 | 1489.364 | 1 |
| 6 | 28 | 100 | 150 | 10 | 8 | 12 | 1465.664 | 1 |
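Throughout the rest of the paper, each experiment is treated as a seven-dimensional condition vector plus a yield. As a minimal sketch of how such rows can be encoded and labeled (Python with numpy assumed; the three rows below are transcribed from Table 1, and the 1273 μg/ml boundary is the one chosen in Section 2.2):

```python
import numpy as np

# Each condition vector: [pH, temperature (°C), initial volume (ml),
# rotation speed (r/m), inoculum size (%), seed age (days), fermentation time].
# Three rows transcribed from Table 1 for illustration.
conditions = np.array([
    [5, 28, 100, 140, 5, 8, 8],
    [6, 28, 100, 140, 5, 8, 8],
    [7, 28, 100, 140, 5, 8, 8],
], dtype=float)
yields = np.array([702.81, 1467.7, 189.20])  # Phellinus yield, μg/ml

# Label each sample with the 1273 μg/ml boundary chosen in Section 2.2:
# class 1 = high yield, class 0 = low yield.
BOUNDARY = 1273.0
labels = (yields >= BOUNDARY).astype(int)
print(labels)  # [0 1 0]
```

The full data set is then a 51 × 7 condition matrix with one 0/1 label per row.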

2.2 Data classification

In this section, we divide the data set into two parts: a high-yield data set and a low-yield data set. In our previous work, we found that the data collected from biological experiments are similar to each other and the gradient is limited, so a conventional prediction method is difficult to apply with good results on the whole data set. We therefore use classification, focus only on the important data, and increase the sample differences within the classified data set. Two factors must be considered. First, we need to keep the balance between the two data sets [28]. A larger imbalance leads to more deviation in the classifier. For example, if we have one high-yield sample and 99 low-yield samples, it is clear that the prediction accuracy on low-yield data can reach 99% without any learning, but the classifier may not actually reach 99%. This is the imbalance caused by the data: even if the accuracy of the model is high, the model is certainly not good at predicting high-yield data and is not the model we want. With such a model, our classifier cannot find the high-yield factors or provide a training data set for the BP neural network to establish a prediction model. Second, the high-yield and low-yield data sets must cover all single-factor experimental conditions.

We considered two classification strategies. In the first one, we take the median of flavonoid production as the classification boundary (1100 μg/ml in our experiment), so that the high-yield and low-yield collections have the same number of elements. A number of experiments show that the classification effect is acceptable; the classification results are given in Table 4. However, we realized that this classification method can cause all the data from one single-factor experiment to be classified entirely as high yield or entirely as low yield: in our experiment, all data belonging to the seed age factor are placed in the high-yield data set. Seed age is then no longer a decision-making factor for our classifier, which leads to a large prediction error; see Table 5.
Table 4

1100 μg/ml boundary classification accuracy (logistic regression).

| Type  | Predicted 0 | Predicted 1 | Correct percentage |
|-------|-------------|-------------|--------------------|
| 0     | 20          | 6           | 76.9               |
| 1     | 3           | 22          | 88                 |
| Total |             |             | 82.4               |
Table 5

Experiments with seed age ranging from 4 to 10 days.

| pH | Temp (°C) | Initial volume (ml) | Rotation speed (r/m) | Inoculum size (%) | Seed age (days) | Fermentation time | Phellinus yield (μg/ml) | Class |
|----|-----------|---------------------|----------------------|-------------------|-----------------|-------------------|-------------------------|-------|
| 6 | 28 | 100 | 150 | 2  | 4  | 1 | 1272.384 | 0 |
| 6 | 28 | 100 | 150 | 4  | 5  | 2 | 1453.231 | 1 |
| 6 | 28 | 100 | 150 | 6  | 6  | 3 | 1428.025 | 1 |
| 6 | 28 | 100 | 150 | 8  | 7  | 4 | 1477.273 | 1 |
| 6 | 28 | 100 | 150 | 10 | 8  | 5 | 2164.513 | 1 |
| 6 | 28 | 100 | 150 | 12 | 9  | 6 | 2127.726 | 1 |
| 6 | 28 | 100 | 150 | 14 | 10 | 7 | 1741.498 | 1 |
The other strategy is to select a boundary within each set of single-factor experimental data, so that each single-factor experiment contributes data to both classes, while keeping the number of elements in the two classes as close as possible. Combining the above conditions, we chose a flavonoid yield of 1273 μg/ml as our boundary. Under this boundary condition, we obtain 20 sets of high-yield data and 30 sets of low-yield data, which cover the conditions of each group of single-factor experiments. The classification results are shown in Table 6.
Table 6

1273 μg/ml boundary classification accuracy (logistic regression).

| Type  | Predicted 0 | Predicted 1 | Correct percentage |
|-------|-------------|-------------|--------------------|
| 0     | 21          | 10          | 67.7               |
| 1     | 4           | 16          | 80                 |
| Total |             |             | 72.5               |
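The accuracy figures in Tables 4 and 6 come from a logistic regression classifier. A minimal self-contained sketch of such a classifier, fitted by gradient descent on hypothetical stand-in data (the real inputs would be the normalized condition vectors of Tables 1-3 with their boundary labels; the solver and split used in the paper are not stated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 50 condition vectors with 7 features each,
# labeled 1 (high yield) / 0 (low yield).
X = rng.normal(size=(50, 7))
true_w = np.array([0.5, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
y = (X @ true_w + rng.normal(scale=0.3, size=50) > 0).astype(float)

# Plain logistic regression fitted by batch gradient descent.
w, b = np.zeros(7), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

pred = ((X @ w + b) > 0).astype(float)
for c in (0, 1):  # per-class correct percentage, as in Tables 4 and 6
    print(c, round(100 * float((pred[y == c] == c).mean()), 1))
print("total", round(100 * float((pred == y).mean()), 1))
```

The per-class loop reproduces the layout of Tables 4 and 6: one correct percentage per class, plus a total.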

3 Methods

Our experiment is mainly composed of three parts. First, the high-yield data set is determined by the classification model; then a BP neural network is used for forecasting; finally, the weights and thresholds of the BP neural network are used to build the fitness function of a GA, which searches for the optimal yield.

3.1 Classification model

From the above boundary we determine the two data sets, high yield and low yield; the high-yield class is labeled 1 and the low-yield class is labeled 0. We use two classifiers to assess the classification effect: logistic regression and a BP neural network classifier. We use the SMOTE algorithm to improve the data set [29]. The idea of the SMOTE algorithm is to synthesize new samples of the minority class (here, the high-yield class). The synthesis strategy is to choose, for each minority-class sample A, a nearest neighbor B, and then randomly select a new sample between A and B [30]. Such hybrid computational methods, e.g. combinations with SVM and AGA, have intelligent learning ability and can overcome the limitations of large-scale biotic experiments [31-36]. The procedure is as follows:

(1) For each sample x in the minority class, the distances to all minority-class samples are computed using the Euclidean distance as the criterion, and the k nearest neighbors are obtained.

(2) According to the sample imbalance ratio, a sampling ratio is set to determine the sampling rate N. For each minority-class sample x, several samples are selected randomly from its k nearest neighbors; let one of them be xn.

(3) For each randomly selected neighbor xn, a new sample xm is constructed as xm = x + rand(0,1) × (xn − x).

Compared with other data expansion methods, the SMOTE algorithm generates new data instead of directly copying minority-class samples, which increases the sample differences within the class. Biological experiments set certain experimental gradients for each group of experiments, and the variation of the data between adjacent gradients is usually approximately linear. For example, if the yield is 300 at pH 5, 1000 at pH 6 and 500 at pH 7, we usually assume that the yield at pH 5.5 lies between 300 and 1000.
If we set the classification boundary at a yield of 300, then a sample at pH 5.5 can be assigned to one of the classes as a synthetic sample. In this way, we increase the sensitivity of the classifier to some experimental conditions and improve the classification accuracy. We do not use these newly generated samples for production forecasting, because we are not sure of their exact yields. In each of our experiments, each experimental gradient was set as a unit when comparing the distances between experiments. Since the numbers of samples in the two classes are different, the classification results are inevitably better for the majority class. In addition, the overall number of samples is small, so the classification effect fluctuates greatly. The SMOTE algorithm is used to increase the sample size of the minority class, which makes the overall distribution of the data more balanced and, by increasing the total number of samples, reduces this volatility. Tables 7 and 8 show that the classification effect is improved by the SMOTE algorithm.
Table 7

1273 μg/ml boundary classification accuracy after SMOTE (logistic regression).

| Type  | Predicted 0 | Predicted 1 | Correct percentage |
|-------|-------------|-------------|--------------------|
| 0     | 21          | 10          | 67.7               |
| 1     | 3           | 27          | 90                 |
| Total |             |             | 79.7               |
Table 8

Comparison of classification accuracy with and without the SMOTE algorithm.

| Type                | Without SMOTE | With SMOTE |
|---------------------|---------------|------------|
| Logistic regression | 72.5          | 79.7       |
| BP                  | 80            | 87         |
(With actual yield x and predicted yield y, the correct percentage is computed from the relative error z = |(y − x)/x|.) In this section, we establish a reliable classification model that can separate high-yield and low-yield data; if a set of experimental conditions belongs to the high-yield data set, its yield is predicted in the next step.
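The three SMOTE steps above can be sketched as follows (a simplified illustration in Python/numpy; the actual k and sampling rate used in the paper are not stated, and the minority samples here are hypothetical):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples via the SMOTE steps:
    k nearest neighbours by Euclidean distance, random neighbour choice,
    then interpolation x_m = x + rand(0,1) * (x_n - x)."""
    rng = rng or np.random.default_rng()
    X_min = np.asarray(X_min, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)  # distances to all minority samples
        d[i] = np.inf                          # exclude the sample itself
        neighbours = np.argsort(d)[:k]         # its k nearest neighbours
        xn = X_min[rng.choice(neighbours)]     # pick one at random
        new.append(x + rng.random() * (xn - x))  # interpolate between x and xn
    return np.array(new)

# Hypothetical minority (high-yield) class: 6 samples with 7 features.
rng = np.random.default_rng(1)
X_high = rng.normal(size=(6, 7))
X_syn = smote(X_high, n_new=10, k=3, rng=rng)
print(X_syn.shape)  # (10, 7)
```

Because each synthetic sample lies on the segment between a minority sample and one of its neighbours, the new points always stay inside the region already occupied by the minority class.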

3.2 BP neural network

The BP (Back Propagation) neural network was developed by Rumelhart and McClelland in 1986. It is a multi-layer feed-forward neural network trained by the error back-propagation algorithm, and it is the most widely used neural network [37]. The basic BP algorithm consists of the forward propagation of the signal and the backward propagation of the error: the error is computed in the input-to-output direction, while the weights and thresholds are adjusted in the output-to-input direction. After training, the network approximates the mapping from sample inputs to outputs with minimal error and can handle non-linear transformations of the information [38, 39]. Each time, we randomly selected 16 sets of data as the training set to establish a forecast model relating the experimental conditions to the output, and 4 sets of data as the test set to verify the reliability of the model; the experiment was repeated seven times. The results are shown in Table 9. After repeated tests, the number of intermediate-layer nodes was determined to be 9. The hidden-layer transfer functions are set to "tansig", "logsig" and "tansig", and the training function is set to "trainlm". Each time, 15 sets of data are selected for modeling and 5 sets for verification. The number of training epochs is set to 1000, and the training convergence error is set to 0.00001. Over the seven repeated experiments, the average error is 133.53 and the average percentage of error is 8.7%. The error values are shown in Fig 1 and the percentages of error in Fig 2. We can conclude that our model achieves a good result.
Table 9

Experimental results.

| Type | Actual yield | Forecast yield | Error       | Percentage of error |
|------|--------------|----------------|-------------|---------------------|
| 1    | 1447.519173  | 1587.9         | 140.380827  | 9.7%   |
| 2    | 1374.982592  | 1273.6         | 101.382592  | 7.3%   |
| 3    | 1502.487     | 1632.0         | 129.513     | 8.62%  |
| 4    | 1453.230569  | 1274.9         | 178.3305688 | 12.27% |
| 5    | 1506.05569   | 1453.0896      | 52.9660896  | 3.52%  |
| 6    | 1489.364     | 1420.734       | 68.63       | 4.61%  |
| 7    | 2127.725793  | 2103.7928      | 23.9329928  | 1.12%  |
| 8    | 1453.230569  | 1423.2688      | 29.9617688  | 2.06%  |
| 9    | 1467.790541  | 1321.5         | 146.2905408 | 9.97%  |
| 10   | 1273.594991  | 1320.8         | 47.2050088  | 3.71%  |
| 11   | 1447.519173  | 1360.8         | 86.7191728  | 5.99%  |
| 12   | 1841.729358  | 1380.6         | 461.1293584 | 25.04% |
| 13   | 1374.982592  | 1592.9         | 217.9174081 | 15.85% |
| 14   | 1619.554     | 1473.6         | 145.954     | 9.01%  |
| 15   | 1597.995     | 1586.4         | 11.595      | 0.73%  |
| 16   | 1502.487     | 1394.3         | 108.187     | 7.20%  |
| 17   | 1506.05569   | 1454.8         | 51.2556896  | 3.40%  |
| 18   | 1465.664     | 1278.7         | 186.964     | 12.76% |
| 19   | 1477.273482  | 1376.9         | 100.3734816 | 6.79%  |
| 20   | 1631.990382  | 1368.2         | 263.7903824 | 16.16% |
| 21   | 1447.519173  | 1300.50        | 147.0191728 | 10.16% |
| 22   | 1597.995     | 1560.90        | 37.095      | 2.32%  |
| 23   | 1320.794994  | 1317.00        | 3.7949936   | 0.29%  |
| 24   | 1453.230569  | 1699.80        | 246.5694312 | 16.97% |
| 25   | 1841.729358  | 1571.40        | 270.3293584 | 14.86% |
| 26   | 1489.364     | 1315.70        | 173.664     | 11.66% |
| 27   | 1320.794994  | 1274.00        | 46.7949936  | 3.54%  |
| 28   | 1546.336     | 1285.30        | 261.036     | 16.88% |
Fig 1

The difference between the real value and the predicted value.

Fig 2

Percentage of error.

The forecast yield is the yield calculated by the BP neural network under the same experimental conditions. With actual yield x and forecast yield y, the error is z = |x − y| and the percentage of error is z/x. In this section, we built a prediction model for the high-yield data set and verified its reliability.
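As an illustration of the forward/backward passes described above, here is a minimal BP network with one hidden layer of 9 nodes, trained by plain gradient descent on hypothetical stand-in data (the paper itself uses three MATLAB-style hidden layers with "tansig"/"logsig" transfer functions and the "trainlm" Levenberg-Marquardt trainer, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for the high-yield set: 20 samples with 7 condition
# features and yields in the range seen in Table 9 (~1300-1900 μg/ml).
X = rng.normal(size=(20, 7))
y = 1500.0 + 200.0 * np.tanh(X @ rng.normal(size=7))

# Scale targets to [0, 1] for training.
y_lo, y_hi = y.min(), y.max()
t = ((y - y_lo) / (y_hi - y_lo)).reshape(-1, 1)

# One hidden layer with 9 tanh nodes, trained by plain gradient descent.
W1 = rng.normal(scale=0.5, size=(7, 9)); b1 = np.zeros(9)
W2 = rng.normal(scale=0.5, size=(9, 1)); b2 = np.zeros(1)
lr = 0.05
for _ in range(5000):
    h = np.tanh(X @ W1 + b1)           # forward propagation of the signal
    out = h @ W2 + b2
    err = out - t                      # backward propagation of the error
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# Undo the scaling, then compute error z = |x - y| and percentage z / x.
pred = (np.tanh(X @ W1 + b1) @ W2 + b2).ravel() * (y_hi - y_lo) + y_lo
z = np.abs(y - pred)
pct = z / y
print(round(100 * float(pct.mean()), 2))
```

The last two lines compute exactly the error and percentage-of-error columns reported in Table 9, here on training data only.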

3.3 GA process

In this part we use the established model and GA to optimize the yield. The genetic algorithm is a randomized search method based on the evolution of biological populations [40]; it was first proposed by Professor J. Holland in the United States in 1975 [41]. Its main features are that it operates directly on structural objects, without requiring derivatives or function continuity, and that it has inherent implicit parallelism and good global optimization ability. GA uses a probabilistic optimization method and can automatically acquire and guide the search over the optimization space [42]. Thanks to these properties, genetic algorithms have been widely used in the fields of combinatorial optimization, machine learning, signal processing, adaptive control and artificial life, and are a key technology in modern intelligent computing [43]. The GA process is shown in Fig 3.
Fig 3

GA process.

The parameters of the GA are set as follows: population size 300, chromosome size 6, number of generations 1000, crossover rate 1, mutation rate 0.01. The mutation rate and crossover rate affect the number of iterations of the GA process. Because the number of generations we set is much larger than the number actually required, after many tests the mutation rate is set to its minimum value and the crossover rate to its maximum value; this is the ideal condition for the genetic algorithm. The encoding mechanism is real-number encoding. The weights and hidden thresholds extracted from the trained BP neural network are used as the fitness function of the GA. After about 30 to 500 iterations the GA process returns the best individual. The training process is shown in Fig 4. The test was repeated seven times, with the results listed in Table 10. We can see that the yield we obtained shows a slight increase over the real yield.
Fig 4

GA result after training.

Table 10

Optimal conditions and yield obtained by simulation.

| pH | Temp (°C) | Initial volume (ml) | Rotation speed (r/m) | Inoculum size (%) | Seed age (days) | Fermentation time | Phellinus yield (μg/ml) | Iterations |
|----|-----------|---------------------|----------------------|-------------------|-----------------|-------------------|-------------------------|------------|
| 6 | 29 | 100 | 150 | 12 | 7 | 8  | 2164.8 | 39  |
| 6 | 28 | 90  | 150 | 12 | 8 | 11 | 2204.1 | 31  |
| 6 | 30 | 90  | 150 | 12 | 7 | 12 | 2121.6 | 208 |
| 6 | 30 | 90  | 141 | 9  | 8 | 8  | 2045.2 | 430 |
| 6 | 28 | 90  | 150 | 12 | 8 | 11 | 2204.1 | 52  |
| 6 | 29 | 100 | 150 | 12 | 9 | 11 | 2207.6 | 44  |
| 6 | 29 | 100 | 150 | 12 | 8 | 8  | 2171.8 | 56  |
In this section, we used the weights and thresholds of the BP neural network to build the optimization objective and applied the GA to find the optimal experimental conditions.
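The GA loop with real-number encoding can be sketched as follows. The fitness function below is a hypothetical smooth stand-in for the trained BP network, peaked near the optimum reported in Table 10, and the parameter bounds come from the experimental ranges in Tables 1-3; the operators (tournament selection, arithmetic crossover) are illustrative choices, not necessarily the paper's exact ones:

```python
import numpy as np

rng = np.random.default_rng(3)

# Parameter bounds from Tables 1-3: pH, temperature (°C), initial volume (ml),
# rotation speed (r/m), inoculum size (%), seed age (days), fermentation time.
lo = np.array([1.0, 25.0, 40.0, 140.0, 2.0, 4.0, 1.0])
hi = np.array([14.0, 40.0, 140.0, 150.0, 16.0, 10.0, 12.0])

def fitness(pop):
    # Hypothetical stand-in for the trained BP network: a smooth bump
    # peaked near the optimum reported in Table 10.
    target = np.array([6.0, 29.0, 100.0, 150.0, 12.0, 8.0, 10.0])
    return 2200.0 * np.exp(-np.square((pop - target) / (hi - lo)).sum(axis=1))

POP, GENS, MUT = 300, 200, 0.01
pop = lo + rng.random((POP, 7)) * (hi - lo)    # real-number encoding
for _ in range(GENS):
    f = fitness(pop)
    a, b = rng.integers(POP, size=(2, POP))    # tournament selection
    parents = np.where((f[a] > f[b])[:, None], pop[a], pop[b])
    w = rng.random((POP, 1))                   # arithmetic crossover (rate 1)
    pop = w * parents + (1.0 - w) * np.roll(parents, 1, axis=0)
    m = rng.random(pop.shape) < MUT            # mutation: redraw gene in range
    pop = np.where(m, lo + rng.random(pop.shape) * (hi - lo), pop)

best = pop[fitness(pop).argmax()]
print(np.round(best, 1), round(float(fitness(pop).max()), 1))
```

Because crossover takes convex combinations and mutation redraws genes within their bounds, every candidate stays inside the experimentally explored parameter ranges.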

4 Conclusion

In this work, we first classify the collected data sets and establish a classification model, whose accuracy reaches more than 80%. We then use the selected high-yield data set for modeling, with a forecast accuracy of more than 90%. Finally, the weights and thresholds of the BP neural network are used as the fitness function of a GA to optimize the yield. We have thus established a complete forecast and optimization process for flavonoid production. When biologists give us a new set of experimental conditions, we first use the classification model to verify whether these are high-yield conditions; if so, we use the established BP neural network to predict the yield. In the comparison of results, the predicted pH value of 6 is credible, and the temperature is within the appropriate range of 28°C to 30°C. Taking into account environmental factors in the laboratory, the initial volume, rotation speed and inoculum size we predicted are also reliable. The predicted seed age of 7 or 8 is close to the original value of 8. The predicted fermentation time ranges from 8 to 11, longer than the original value of 8; however, this can be explained in terms of the biological experiments: once the fermentation time reaches a certain limit and the fungal community reaches its limit, the output depends mainly on the supply of nutrients, so the data we obtain are acceptable. The average Phellinus yield we predicted is 2159.9 μg/ml, higher than the original value of 2127 μg/ml. The experimental results show that the predicted optimal values of the parameters are in accordance with the biological experimental results, which indicates that our method has good predictive power for culture condition optimization. For further research, neural-like computing models, e.g., spiking neural P systems [44], can be used for the optimization of Welan gum production.
Likewise, some recently developed data processing and mining methods, such as a speculative approach to spatial-temporal efficiency for multi-objective optimization in cloud computing [45], privacy-preserving similarity search based on simhash over encrypted data in cloud computing [45], a k-degree anonymity with vertex and edge modification algorithm [46], and kernel quaternion principal component analysis for object recognition [47], might be applied to the optimization of the Phellinus experimental environment. In the aspect of data preparation, decision trees [48] can be used to deal with missing attribute values in the data set.