Application of genetic algorithm for discovery of core effective formulae in TCM clinical data.

Ming Yang, Josiah Poon, Shaomo Wang, Lijing Jiao, Simon Poon, Lizhi Cui, Peiqi Chen, Daniel Man-Yuen Sze, Ling Xu.

Abstract

Research on core and effective formulae (CEF) not only summarizes traditional Chinese medicine (TCM) treatment experience but also helps to reveal the underlying knowledge in the formulation of a TCM prescription. In this paper, CEF discovery from tumor clinical data is discussed. The concepts of confidence, support, and effectiveness of the CEF are defined. A genetic algorithm (GA) is applied to find the CEF in a lung cancer dataset with 595 records from 161 patients. The search yielded 9 CEF with positive fitness values, comprising 15 distinct herbs. All of the CEF had relatively high average confidence and support. A herb-herb network was constructed, and it shows that all the herbs in the CEF are core herbs. The dataset was divided into a CEF group and a non-CEF group; the effective proportions of the former are significantly greater than those of the latter. A synergy index (SI) was defined to evaluate the interaction between two herbs. Four pairs of herbs had high SI values, indicating synergy between the herbs. All the results agreed with TCM theory, which demonstrates the feasibility of our approach.

Year: 2013   PMID: 24288577   PMCID: PMC3830796   DOI: 10.1155/2013/971272

Source DB: PubMed   Journal: Comput Math Methods Med   ISSN: 1748-670X   Impact factor: 2.238


1. Introduction

Traditional Chinese medicine (TCM) has been developed and practiced in China for thousands of years, and herbal prescriptions have played a key role in medical treatment. A large number of herbal prescriptions have been recorded over the years, in which valuable TCM knowledge is hidden. It is urgent and critical to analyze these data so that TCM models can be developed in the modernization of this ancient knowledge. Although TCM is still in practice and more countries consider it an alternative treatment method [1], the principles of formulating TCM prescriptions remain unknown. Analyzing such a large body of data manually, however, is a daunting task. The methods of knowledge discovery in databases (KDD) have been suggested as viable approaches. KDD allows TCM researchers to find interesting patterns efficiently, and these patterns may direct further laboratory work that leads to discovery [2]. Many successful projects have been reported. For example, Wang et al. [3] illustrated the use of structural equation modeling (SEM) to explore the diagnosis of suboptimal health status (SHS) and provided evidence for the standardization of TCM patterns. Multilabel learning models [4, 5] were introduced for TCM syndrome identification. Complex networks were built for clinical data mining in TCM [6-8]. Generally, KDD research in TCM falls into two main categories. The first attempts to extend our understanding using existing TCM knowledge, while the second attempts to identify core knowledge from existing TCM data, so that each piece of extracted knowledge can be further validated using scientific evidence. This paper belongs to the latter category and, in particular, focuses on the study of TCM formulae from clinical data. The efficacy of a formula can be interpreted as a collaboration of its member herbs. It is common to find that most prescriptions are built on some relatively small fixed composition(s), each of which can be called a core formula (CF).
Herbs are usually added to and/or removed from CFs in order to personalize the treatment. For example, although there are 113 prescriptions in one of the greatest TCM classics, “Shang Han Lun”, only 8 CFs exist, such as Gui zhi Tang, which forms the basis of Guizhi Jia Gui Tang, Guizhi Xinjia Tang, Gegeng Tang, and Dang gui Si ni Tang [9]. Research on CFs not only summarizes TCM treatment experience but also helps to reveal the underlying knowledge in the formulation of a TCM prescription. Several computational models have been proposed in the past decade to mine TCM formulae, such as factor analysis [10], information theory based association rule algorithms [11, 12] or clustering methods [13], machine learning models [14], latent tree (LT) models [15], and network analysis [16-20]. These methods can reveal the core herbs and herb-collaboration patterns in TCM prescriptions or uncover the relationships between herbs and symptoms, but they seldom consider the associated clinical effect. In clinical practice, a number of herbs are combined to form a formula and different formulae are prescribed to different patients, but not all the formulae are effective. It is vital to determine whether a herb combination is effective in order to arrive at the valuable formulae. Such core and effective formulae (CEF) are of great interest to TCM practitioners as well as to pharmaceutical companies that manufacture medicine using Chinese herbs. Integrated tumor treatment using Chinese and western medicine is becoming standardized in China and has become an important method of prevention and treatment. Many clinical studies [21, 22] considered that TCM is effective and potentially meets the demands of treatment with multitarget therapeutics. 
Although the current evaluation of cancer treatment still uses tumor response and survival as the main indices, TCM is concerned with the patient as a whole rather than just the tumor, which means that the overall effect should be evaluated instead. Many researchers have suggested the use of quality of life (QOL) as a proxy to evaluate the efficacy of TCM treatment [23-25]. More specifically, treatment efficacy is assessed via the reduction in symptom severity [26]. For that reason, the herb combination patterns that improve symptoms significantly can be regarded as core and effective formulae in TCM tumor clinics. Therefore, a major goal of this paper is to discuss approaches and strategies for the discovery of core and effective formulae (CEF) in tumor clinical data. A genetic algorithm (GA), a search heuristic that mimics the process of natural evolution [27], was applied. GA generates solutions to optimization problems using techniques inspired by biological evolution, such as inheritance, mutation, selection, and crossover. This is similar to the process of TCM development: prescriptions were created from different herbal combinations for various symptoms, and only the effective prescriptions made their way into texts and records. Later practitioners used these effective prescriptions and adapted them to create more effective ones. Our previous work [28-30] has shown that, given a proper fitness function and search space, GA is suitable for complex combinatorial optimization in TCM. This paper is organized as follows. The Materials and Dataset section describes the data acquisition process and the data. The Methods section is the methodological part of this paper; it contains the definitions related to the assessment of CEF and a description of the genetic algorithm, including the definition of the fitness function. 
A complex network is presented to address the analysis of core herbs across prescriptions, and the analysis of herb-herb interactions is performed in the Results section.

2. Materials and Dataset

2.1. Data Source

The dataset used in this paper came from the inpatient lung cancer (LC) records of Shanghai Longhua Hospital of TCM. A total of 161 patients at different stages of LC (both early and metastatic) who received only TCM therapy were included. Their prescriptions and symptoms were recorded between February 2010 and August 2012. Traditional Chinese medical herbs were taken as a decoction, and fifteen LC symptoms were recorded: cough, expectoration, shortness of breath, chest tightness, chest pain, fatigue, loss of appetite, bloody sputum, dry mouth and throat, fever, spontaneous and night sweating, insomnia, diarrhea, nocturia, and five upset hot. A 4-point response scale (0: not at all, 1: a little, 2: quite a bit, 3: very much) was used to indicate the severity of the symptoms. Since the efficacy of a prescription becomes known only when the patient is seen again at the next consultation, only patients with multiple records (visits) were chosen in order to evaluate the efficacy of prescriptions and to find the TCM treatment principles.

2.2. Data Preprocessing and Description

There were 595 transaction records for the 161 patients, with 1 to 9 visits per patient and an average of close to 4 records per patient. Each record contains the symptom information and the corresponding prescription. The interval between two visits was one or two weeks, during which the patient took the same prescription. After excluding the patients who had only one visit, 586 transaction records for 152 patients remained, in which a total of 230 different herbs were used. In the next stage, the symptom score (SS) of each record was calculated as follows:

$$SS = \sum_{i=1}^{m} \mathrm{Score}_i,$$

where $\mathrm{Score}_i$ represents the score for symptom $i$, and $m$ represents the number of symptoms. The symptom change value (SCV) was then calculated from the symptom scores of two consecutive visits. An illustrative example for data format and SCV calculation is shown in Figure 1, where there are 10 transaction records for 4 patients, with 1 to 4 visits per patient. The first patient is excluded because of his single visit. Since the symptom score for evaluating a prescription is recorded at the next (following) visit, SCV1, which evaluates prescription “P2”, is calculated from “SS2” and “SS3”. Consequently, every prescription that belongs to neither a single-visit patient nor a last visit has a corresponding SCV.
Figure 1

An illustrative example for data format and SCV calculation.

After removing missing values, 419 SCVs for 150 patients were obtained. According to TCM theory, a prescription is regarded as effective when its SCV is greater than or equal to 30% [60]; such a record is a positive outcome and is coded as 1, while any other record is coded as 0. At the end of this step, 93 of the 419 records have a positive outcome, which makes this an imbalanced dataset with 22.2% of the records being effective. Summary statistics for the number of herbs are shown in Table 1. The 50 most frequently used herbs, based on records and on patients, are shown in Table 2.
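The preprocessing described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the symptom score is the sum of the 0-3 ratings and that the SCV is the relative reduction in symptom score between two consecutive visits. The function names and the example ratings are hypothetical.

```python
from typing import List, Optional


def symptom_score(ratings: List[int]) -> int:
    """Sum the 0-3 severity ratings of all recorded symptoms for one visit."""
    return sum(ratings)


def symptom_change_value(ss_before: int, ss_after: int) -> Optional[float]:
    """Relative reduction in symptom score between two consecutive visits.

    ASSUMPTION: SCV = (SS_before - SS_after) / SS_before. The ratio is
    undefined when SS_before is 0, so None is returned in that case.
    """
    if ss_before == 0:
        return None
    return (ss_before - ss_after) / ss_before


def outcome(scv: float, threshold: float = 0.30) -> int:
    """Binary outcome: 1 if the SCV reaches the 30% criterion, otherwise 0."""
    return 1 if scv >= threshold else 0


# Hypothetical patient with three visits; each list holds 15 symptom ratings.
visits = [
    [2, 1, 1, 0, 0, 2, 1, 0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
]
ss = [symptom_score(v) for v in visits]

# The prescription given at visit k is evaluated with SS_k and SS_(k+1),
# so the last visit has no SCV of its own.
scvs = [symptom_change_value(a, b) for a, b in zip(ss, ss[1:])]
outcomes = [outcome(s) for s in scvs if s is not None]
```

With three visits, only the first two prescriptions receive an SCV, which mirrors the rule that a last visit has no corresponding SCV.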
Table 1

Statistic information for number of herbs.

| | Per record | Per patient | Average number per patient per record |
|---|---|---|---|
| Minimum | 9 | 9 | 9 |
| Average | 23 | 29 | 22 |
| Maximum | 36 | 73 | 33 |
Table 2

Top 50 frequently used herbs.

| Rank | Herb (record based) | Frequency (record based) | Herb (patient based) | Frequency (patient based) |
|---|---|---|---|---|
| 1 | Chinese sage herb | 395 | Chinese sage herb | 147 |
| 2 | Doederlein's spikemoss herb | 393 | Doederlein's spikemoss herb | 146 |
| 3 | Akebia fruit | 359 | Akebia fruit | 133 |
| 4 | Herba oldenlandiae | 332 | Herba oldenlandiae | 129 |
| 5 | Atractylis ovata | 321 | Atractylis ovata | 127 |
| 6 | Rice-grain sprout | 268 | Astragalus root | 116 |
| 7 | Malt | 268 | Pachyma cocos | 107 |
| 8 | Pachyma cocos | 266 | Chicken's gizzard-membrane | 107 |
| 9 | Astragalus root | 263 | Rice-grain sprout | 106 |
| 10 | Chicken's gizzard-membrane | 252 | Malt | 106 |
| 11 | Common selfheal spike | 239 | Common selfheal spike | 104 |
| 12 | Rhizoma batatatis | 235 | Rhizoma batatatis | 97 |
| 13 | Coix seed | 223 | Coix seed | 96 |
| 14 | Tangerine peel | 195 | Coastal glehnia root | 86 |
| 15 | Coastal glehnia root | 195 | Oysters | 85 |
| 16 | Oysters | 183 | Tangerine peel | 80 |
| 17 | Rhizoma amorphophalli | 172 | Pericarpium trichosanthis | 79 |
| 18 | Pericarpium trichosanthis | 162 | Asparagus cochinchinensis | 72 |
| 19 | Asparagus cochinchinensis | 158 | Rhizoma amorphophalli | 68 |
| 20 | Edible tulip | 139 | Edible tulip | 64 |
| 21 | Crataegus pinnatifida | 133 | Crataegus pinnatifida | 62 |
| 22 | Ophiopogon japonicus | 122 | Tatarian aster root and rhizome | 58 |
| 23 | Chinese date | 120 | Shorthorned epimedium herb | 58 |
| 24 | Glycyrrhiza uralensis | 119 | Pilose asiabell root | 58 |
| 25 | Pilose asiabell root | 119 | Ophiopogon japonicus | 57 |
| 26 | Tatarian aster root and rhizome | 112 | Glycyrrhiza uralensis | 55 |
| 27 | Chinese taxillus herb | 111 | Chinese taxillus herb | 53 |
| 28 | Shorthorned epimedium herb | 109 | Chinese date | 53 |
| 29 | Pinellia tuber | 99 | Pinellia tuber | 51 |
| 30 | Baikal skullcap root | 98 | Heartleaf houttuynia herb | 48 |
| 31 | Suberect spatholobus stem | 97 | Suberect spatholobus stem | 46 |
| 32 | Heartleaf houttuynia herb | 95 | Glossy privet fruit | 46 |
| 33 | Glossy privet fruit | 91 | Baikal skullcap root | 45 |
| 34 | Noble dendrobium stem herb | 89 | Balloon flower root | 43 |
| 35 | Chekiang fritillary bulb | 82 | Chekiang fritillary bulb | 42 |
| 36 | Paris polyphylla smith | 79 | Eucommia bark | 40 |
| 37 | Almond | 78 | Paris polyphylla smith | 40 |
| 38 | Eucommia bark | 76 | Barbary wolfberry fruit | 40 |
| 39 | Balloon flower root | 75 | Almond | 40 |
| 40 | Barbary wolfberry fruit | 73 | Noble dendrobium stem herb | 39 |
| 41 | Pyrrosia leaf | 65 | Cherokee rose fruit | 35 |
| 42 | Cherokee rose fruit | 63 | Chinese dodder seed | 32 |
| 43 | Reed rhizome | 57 | Pyrrosia leaf | 31 |
| 44 | Toad skin | 57 | Reed rhizome | 30 |
| 45 | Radix semiaquilegia | 55 | Toad skin | 30 |
| 46 | Fingered citron fruit | 55 | Common macrocarpa fruit | 29 |
| 47 | Chinese dodder seed | 54 | Immature bitter orange | 28 |
| 48 | Common macrocarpa fruit | 53 | Radix semiaquilegia | 25 |
| 49 | Dragon's bones | 51 | Radish seed | 24 |
| 50 | Immature bitter orange | 51 | Dragon's bones | 24 |

3. Method

The aim of this paper is to find the core and effective formulae (CEF). The measure of effectiveness of a formula helps to determine the efficacy of the herbal interactions in TCM, while the coreness of the prescriptions can help us summarize the TCM treatment principles. CEF are identified in a high-dimensional search space of symptoms and herbs; hence, the discovery of CEF can be described as a complicated combinatorial optimization problem. The analytic process of this paper consists of three steps: recognizing and defining the problem, constructing and solving a model for the problem, and validating the obtained solutions. The following sections discuss these steps in detail.

3.1. Recognizing and Defining the Problem

Our problem focuses on how to choose the best combination of herbs. The typical data format is shown in Figure 2.
Figure 2

Data format for combinatorial optimization problem of discovery of CEF.

A combinatorial optimization problem $H = (Q, f)$ can be defined by a set of herbs $X = \{\mathrm{Herb}_1, \mathrm{Herb}_2, \ldots, \mathrm{Herb}_h\}$; an outcome variable $\mathrm{SCV} = \{y_1, y_2, \ldots, y_n\}$; herb domains $D_1, \ldots, D_h$, $D_j \in \{B\}$, where $B$ indicates the set of binary values $\{0, 1\}$; and an objective function $f$ to be maximized, where $f : D_1 \times D_2 \times \cdots \times D_h \to \mathbb{R}$ and $f \propto y$. Here $h$ is the number of herbs and $n$ is the number of prescriptions. The set of possible feasible combinations is

$$Q = \{\, q = (q_1, \ldots, q_h) \mid q_j \in D_j \,\},$$

where $Q$ is a search space which contains all herbs in the data, as each combination of herbs can be seen as a candidate solution. To solve the combinatorial optimization problem of discovering CEF, we have to find a solution $q^* \in Q$ with the maximum objective function value, that is, $f(q^*) \ge f(q)$ for all $q \in Q$. Before using such a formulation, we have to select the evaluation criteria for CEF. According to the meaning of CEF, the definitions of three heuristics are introduced here.

(i) Average Confidence of a Core Effective Formula (CEF), Called $\alpha$. Let the prescriptions in the dataset be $x_1, x_2, \ldots, x_n$, where $n$ is the number of prescriptions. The average confidence of a given CEF is

$$\alpha = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{number}(\mathrm{CEF} \cap x_i)}{\mathrm{number}(\mathrm{CEF})},$$

where $\mathrm{number}()$ is a counting function, $\cap$ is the intersection operator, and $\mathrm{number}(\mathrm{CEF} \cap x_i)/\mathrm{number}(\mathrm{CEF})$ represents the percentage of herbs common to the CEF and $x_i$ with respect to the number of herbs in the CEF. The value of $\alpha$ is between 0 and 1 inclusive. The larger the value of $\alpha$ is, the more representative the given CEF is. When $\alpha$ is 1, every prescription carries all the herbs in the given CEF.

(ii) Support under Confidence $\alpha$, Called $S_{\alpha}$. For different confidences, we define support as follows:

$$S_{\alpha} = \frac{1}{n}\, \mathrm{number}\!\left(\left\{ x_i : \frac{\mathrm{number}(\mathrm{CEF} \cap x_i)}{\mathrm{number}(\mathrm{CEF})} \ge \alpha \right\}\right).$$

The higher the value of $S_{\alpha}$, the higher the occurrence of the CEF in the dataset. For example, a CEF with $S_{0.8} = 0.25$ means that 25% of the prescriptions in the dataset are composed of at least 80% of the herbs from the given CEF.

(iii) Effectiveness Value (EV). The effectiveness value (EV) is the difference of SCV between two groups. Let the prescriptions in the dataset be $x_i$, $i = 1, 2, \ldots, n$, and their effectiveness (outcome variable) be $y_i$, $i = 1, 2, \ldots, n$. 
If $x^*$ is a CEF and $\mathrm{number}(x^* \cap x_i)/\mathrm{number}(x^*) \ge \alpha$, then the confidence of $x_i$ with respect to the CEF is greater than or equal to $\alpha$; the greater $\alpha$ is, the higher the proportion of herbs from $x^*$ that must be used in $x_i$. In other words, $x_i$ is an application of $x^*$. Let $x_1^*$ denote the group of all such $x_i$, namely, the CEF group, and let $x_0^*$ denote the remaining prescriptions, namely, the non-CEF group. The effectiveness value (EV) is defined as

$$\mathrm{EV} = \frac{\bar{y}_{x_1^*} - \bar{y}_{x_0^*}}{\mathrm{std}}.$$

If $y$ is continuous, then $\bar{y}_{x_1^*}$ and $\bar{y}_{x_0^*}$ represent the average effectiveness of the groups $x_1^*$ and $x_0^*$, respectively. If $y$ is binary, then $\bar{y}_{x_1^*}$ and $\bar{y}_{x_0^*}$ represent the effective proportions of the groups $x_1^*$ and $x_0^*$, respectively. std represents the joint standard deviation. The bigger the value of EV is, the better the effectiveness is. Furthermore, we should consider the minimum number of herbs contained in $x^*$.
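The three heuristics of this section can be illustrated with a short sketch in which prescriptions and formulae are represented as sets of herb names. The function names and toy data are hypothetical, and the "joint standard deviation" is assumed here to be the population standard deviation of the outcomes of both groups taken together; the text does not specify which estimator was used.

```python
from statistics import mean, pstdev
from typing import List, Set


def confidence(cef: Set[str], prescription: Set[str]) -> float:
    """Share of the CEF's herbs that appear in one prescription."""
    return len(cef & prescription) / len(cef)


def average_confidence(cef: Set[str], prescriptions: List[Set[str]]) -> float:
    """Average confidence (alpha) of the CEF over all prescriptions."""
    return mean(confidence(cef, x) for x in prescriptions)


def support(cef: Set[str], prescriptions: List[Set[str]], alpha: float) -> float:
    """Fraction of prescriptions whose confidence is at least alpha."""
    hits = sum(1 for x in prescriptions if confidence(cef, x) >= alpha)
    return hits / len(prescriptions)


def effectiveness_value(cef: Set[str], prescriptions: List[Set[str]],
                        outcomes: List[float], alpha: float) -> float:
    """Standardized difference in outcome between CEF and non-CEF groups."""
    in_group = [y for x, y in zip(prescriptions, outcomes)
                if confidence(cef, x) >= alpha]
    out_group = [y for x, y in zip(prescriptions, outcomes)
                 if confidence(cef, x) < alpha]
    if not in_group or not out_group:
        return 0.0  # ASSUMPTION: no contrast is possible with an empty group
    std = pstdev(outcomes)  # ASSUMPTION: std of the combined sample
    if std == 0:
        return 0.0
    return (mean(in_group) - mean(out_group)) / std


# Toy data: four prescriptions with binary outcomes.
prescriptions = [{"A", "B", "C", "D"}, {"A", "B", "E"},
                 {"A", "C", "D"}, {"B", "E", "F"}]
outcomes = [1, 1, 0, 0]
cef = {"A", "B"}

alpha_bar = average_confidence(cef, prescriptions)
s_1 = support(cef, prescriptions, alpha=1.0)
ev = effectiveness_value(cef, prescriptions, outcomes, alpha=1.0)
```

In the toy data, two of the four prescriptions contain both herbs of the candidate formula, so the support at confidence 1 is one half, and those two prescriptions are exactly the effective ones.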

3.2. Constructing and Solving a Model for the Problem

Constructing and solving a model for combinatorial optimization is a difficult task: in general, we start with a possible solution and then optimize it iteratively. As a computational model of evolutionary processes, GA can not only solve nonparametric combinatorial optimization problems but, in contrast to most other algorithms, which find one solution at a time, can also find multiple Pareto-optimal solutions in parallel. This is compatible with TCM treatment, in which multiple formulae are applicable to a set of symptoms; that is, it exhibits equifinality. The concept of equifinality refers to many alternative ways of attaining the same objective. Using the definitions in Section 3.1, the sequence of GA steps for the discovery of CEF is shown in Figure 3.
Figure 3

Flow chart of GA and explanations of the sequence of GA steps for the discovery of CEF.

Step 1 (encoding and initial population)

The herb combination to be optimized is represented by a chromosome in which each herb is encoded as a binary character, called a gene, according to the original herb space. Since there were 230 distinct herbs, the chromosome was a string of 230 binary characters, each with the value “0” or “1”, that describes a prescription. A population, which consisted of a given number of chromosomes, was initially created by randomly assigning “1” to each gene with probability $P_i$. The value “1” in a gene meant that the corresponding herb was used in the prescription; otherwise, the herb was not used.
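A minimal sketch of this encoding is given below. The population size of 1000 and the selection probability of 8/230 follow the settings reported later in the paper (Table 3, with the minimum formula size set to 8); the function names and the fixed random seed are hypothetical choices made for illustration.

```python
import random
from typing import List

N_HERBS = 230  # number of distinct herbs in the dataset


def random_chromosome(p_select: float, rng: random.Random) -> List[int]:
    """Create one chromosome: each gene is 1 with probability p_select."""
    return [1 if rng.random() < p_select else 0 for _ in range(N_HERBS)]


def initial_population(size: int, p_select: float, seed: int = 0) -> List[List[int]]:
    """Create the initial population of randomly generated chromosomes."""
    rng = random.Random(seed)
    return [random_chromosome(p_select, rng) for _ in range(size)]


def decode(chromosome: List[int], herb_names: List[str]) -> List[str]:
    """Translate a chromosome back into the list of herbs it selects."""
    return [name for gene, name in zip(chromosome, herb_names) if gene == 1]


population = initial_population(size=1000, p_select=8 / N_HERBS)
```

With a selection probability of 8/230, a random chromosome selects about 8 herbs on average, so the initial candidates are close to the required minimum formula size.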

Step 2 (the design of fitness function (objective function))

A crucial point in using GA is the design of the fitness function, which determines what the GA should optimize. The goal of this study is to find CEF, each of which is a small subset of herbs that are frequently used and most significant for effectiveness. Fitness was measured by the two criteria of CEF: coreness, which is represented by $\alpha$, $S_{\alpha}$, and $N$ (the minimum number of herbs contained in a CEF), and effectiveness, which is represented by EV. $N$ is typically decided by TCM theory, while the choice of $S_{\alpha}$ depends on how representative and frequently used the required CEF are. An important characteristic of GA is the way it deals with infeasible solutions (unsatisfactory CEF), since offspring may be infeasible after solutions are recombined. The most general and simple way is to reject infeasible solutions. However, penalizing infeasible solutions in the fitness function that measures the quality of a solution is more appropriate, and this is the approach taken in our research. Hence, the fitness function $f$ of a candidate $x^*$ is built from EV together with a penalty: $S_{\text{set}}$ and $N_{\text{set}}$ are the predefined values of the support under a certain confidence and of the minimum number of herbs contained in $x^*$ (mentioned above), and $R$ is a penalty constant, which is used to penalize the infeasible solutions. Thus, the evaluation of fitness started by decoding a randomly generated chromosome into the prescription composed of all herbs whose genes were coded as “1.” Then the prescription's coreness and effectiveness were evaluated by $S_{\alpha}$, $N$, and EV. Finally, the fitness was measured by the fitness function $f$. Under $f$, the prescriptions whose coreness meets the requirements and whose EV is high have a higher probability of surviving.
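One possible form of such a penalized fitness function is sketched below. The exact expression used by the authors is not reproduced here; the sketch assumes that a feasible solution scores its EV and that an infeasible one has the penalty constant subtracted. The default values (support threshold 2%, minimum size 8, penalty 200) are the settings reported in Section 4.1, and the function name is hypothetical.

```python
def fitness(ev: float, support: float, n_herbs: int,
            s_set: float = 0.02, n_set: int = 8,
            penalty: float = 200.0) -> float:
    """Penalized fitness of one candidate formula.

    ASSUMPTION: a candidate is feasible when its support reaches s_set and
    it contains at least n_set herbs; feasible candidates score their EV,
    infeasible ones score EV minus the penalty constant.
    """
    feasible = support >= s_set and n_herbs >= n_set
    return ev if feasible else ev - penalty


feasible_score = fitness(ev=3.5, support=0.05, n_herbs=9)
rare_score = fitness(ev=3.5, support=0.01, n_herbs=9)
small_score = fitness(ev=3.5, support=0.05, n_herbs=5)
```

Because the penalty is large relative to any attainable EV, infeasible candidates always rank below feasible ones but still remain comparable with each other, which is the practical difference from rejecting them outright.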

Step 3 (design of the GA operator)

After the fitness of the population was evaluated, chromosomes were selected by means of tournament selection, which involved running several “tournaments” among a few chromosomes chosen at random from the population. The winner of each tournament (the one with the best fitness) was more likely to be selected. Child chromosomes were then created from parent chromosomes by a multipoint crossover operator. After that, the chromosomes were mutated with a three-way swap of three randomly chosen genes, which could lead to new chromosomes in the search space and sometimes to new and better results. In terms of the search, crossover helps to find a locally optimal solution, while mutation helps to discover new and better optima.
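The three operators can be sketched as follows. This is an illustration under simplifying assumptions rather than the authors' implementation: the tournament winner is always selected (a deterministic variant), the number of crossover points is a free parameter, the three-way swap is implemented as a rotation of the values at three random positions, and the crossover probability is left to the caller. All function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Chromosome = List[int]


def tournament_select(population: Sequence[Chromosome],
                      fitnesses: Sequence[float],
                      size: int, rng: random.Random) -> Chromosome:
    """Pick `size` random contenders and return the fittest one."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def multipoint_crossover(a: Chromosome, b: Chromosome, n_points: int,
                         rng: random.Random) -> Tuple[Chromosome, Chromosome]:
    """Exchange alternating segments of two parents at random cut points."""
    cuts = sorted(rng.sample(range(1, len(a)), n_points))
    child1: Chromosome = []
    child2: Chromosome = []
    swap = False
    prev = 0
    for cut in cuts + [len(a)]:
        src1, src2 = (b, a) if swap else (a, b)
        child1.extend(src1[prev:cut])
        child2.extend(src2[prev:cut])
        swap = not swap
        prev = cut
    return child1, child2


def three_way_swap(chromosome: Chromosome, rng: random.Random) -> Chromosome:
    """Rotate the values of three randomly chosen genes."""
    i, j, k = rng.sample(range(len(chromosome)), 3)
    mutated = list(chromosome)
    mutated[i], mutated[j], mutated[k] = chromosome[k], chromosome[i], chromosome[j]
    return mutated
```

A swap-based mutation keeps the number of selected herbs unchanged, so it moves herbs in and out of a candidate formula without altering its size.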

Step 4 (terminal condition)

GA is an iterative search method that approaches the optimal region but may not arrive at the optimal solution, so a termination condition is needed. Here, we terminated the GA process after a predefined number of generations. The chromosomes of the last generation with the highest values of $f$ were considered to be the CEF candidates.

3.3. Validating the Obtained Solutions

After finding optimal or near-optimal solutions (prescriptions), we had to evaluate them. Based on the meaning of CEF, solutions were evaluated on both coreness and effectiveness. In this study, confidence and support serve as proxies for the coreness of a formula: $\alpha$ and $S_{\alpha}$, which denote the confidence and support of a solution, were used to evaluate coreness. Generally speaking, when $S_{\alpha}$ is calculated, $\alpha$ should have a minimum value of 0.7 according to TCM expert practitioners; otherwise, the representativeness of the CEF may be undermined. The greater these two values are, the better the coreness of a solution is; that is, the constituent herbs are more widely used in the prescriptions and the formula is more frequently used. As for the evaluation of effectiveness, the dataset can be divided into two groups, namely, the CEF group and the non-CEF group. By the definition in Section 3.1, the prescription of every record in the CEF group carried at least the preset proportion $\alpha$ of the herbs in the specific solution, while the prescriptions in the non-CEF group did not. A Z-test for the difference between the two effective proportions (EP) was then carried out at a 5% significance level (P < 0.05).
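The comparison of two effective proportions can be sketched with a standard two-proportion Z-test. The sketch assumes a pooled standard error and a two-sided P value, neither of which is stated in the text. The counts in the example are illustrative values chosen so that the proportions round to 0.778 and 0.210; the actual group sizes are not given in the text.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_1: int, n_1: int,
                          success_2: int, n_2: int) -> Tuple[float, float]:
    """Z statistic and two-sided P value for the difference of two proportions."""
    p1 = success_1 / n_1
    p2 = success_2 / n_2
    # Pooled proportion under the null hypothesis of equal proportions.
    pooled = (success_1 + success_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts: 7 of 9 effective records in the CEF group,
# 86 of 410 effective records in the non-CEF group.
z, p = two_proportion_z_test(7, 9, 86, 410)
```

For very small CEF groups the normal approximation behind the Z-test is rough, so an exact test could be considered as a cross-check.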

4. Results

4.1. GA Results

The results in the following discussion were averaged over three executions using the same parameters. To compare the effect of changing the GA parameters on efficiency and results, we needed to fix all the parameters of the fitness function. The fixed configuration was as follows: $R$ (the penalty constant) was set to 200, $N_{\text{set}}$ was set to 8, $\alpha$ was set to 1, and the corresponding $S_1$ was set to 2%. First, we investigated the effect of population size. Here, the number of generations was initially set to 100 and the other basic parameters of the GA followed the defaults. The population sizes ranged from 100 to 1200 with a step size of 100. The results are summarized in Figure 4, which compares only the averages of the best fitness values. Figure 4(a) shows that the fitness values are acceptable when the population size is greater than 400 and better (exceeding 3.5) when the population size is greater than 900. Because a bigger population needs more running time, we use a population size of 1000 in our experiments. Second, we fixed the population size at this value and set the number of generations to 400. It can be seen from Figure 4(b) that the number of generations has a positive effect on the fitness value found. When the number of generations reaches 160, the fitness value is the best and remains unchanged thereafter. We therefore select 200 generations in this paper. Similarly, we compared the effect of the other GA parameters in turn. We considered the initial herb selection probability $P_i = k \cdot N_{\text{set}}/230$ with $k \in \{0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4\}$, the crossover probability $P_c \in \{0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$, and the tournament selection size $T_s \in \{5, 10, 15, 20, 25, 30, 35, 40\}$. Figures 4(c)~4(e) show the results. The fitness values are acceptable when $k$ is smaller than 4. When $k$ equals 4, the initial number of selected herbs is close to the maximum number of herbs in a prescription, which causes most candidate solutions to be filtered out as infeasible. We therefore set $k$ to 1 in this paper. Because $P_c$ and $T_s$ have no significant effect on the fitness value, we set them to the default values. Table 3 lists the parameters of the GA for the experiments in this paper.
Figure 4

Parameter selection in GA.

Table 3

Parameters for GA.

| Parameter | Value |
|---|---|
| Population size | 1000 |
| Initial herb selection probability ($P_i$) | $N_{\text{set}}/230$ |
| Crossover probability ($P_c$) | 0.7 |
| Tournament selection size ($T_s$) | 15 |
| Generation | 200 |

As for the parameters of the fitness function, $\alpha$ was set to 1 in order to obtain CEF with the highest confidence. The other parameters were set as follows: $R$ (the penalty constant) was set to 200, $N_{\text{set}}$ was varied from 8 to 11, and $S_1$ started at 2% and was increased in steps of 1% until it reached 10%. The heat map in Figure 5 shows the sensitivity of the fitness values to the different settings of $S_1$ and $N_{\text{set}}$. The fitness values are mostly positive when $N_{\text{set}}$ is 8 or 9 and $S_1$ is in the range of 2~6%. After removing duplicates, there were 9 CEF, that is, solutions with positive fitness values. These 9 CEF are composed of 15 distinct herbs (Tables 4 and 5). We summarize the traditional indications of these herbs and their effects in cancer treatment in Table 4. According to clinical experience and literature reports, these herbs, as well as their extracts or isolated compounds, can exert their anticancer effects in several ways: (a) they can enhance immunity and body resistance; (b) they have antiproliferative activity in cancer cells; (c) they can improve quality of life and prolong the life span of the patients. Table 5 shows that the maximum number of herbs in a CEF is 11, which is much smaller than the average number of herbs in a transaction in the dataset. Several herbs are common to at least 6 CEF, namely AF, AO, AR, DS, PC, PT, RB, and TP. In TCM terms, these common herbs are related to nourishing Yin, regulating Qi, and strengthening the spleen function, which is generally consistent with the TCM principle in LC treatment.
Figure 5

Fitness value by GA with different combinations of $N_{\text{set}}$ and $S_1$.

Table 4

Traditional indications and biological effects of herbs.

| Herb | Abbreviation | Traditional indications# | Effects of cancer treatment | Reference |
|---|---|---|---|---|
| Astragalus root | AR | To reinforce Qi and invigorate the function of the spleen. | Immune stimulating effect. Improving quality of life for patients with nonsmall cell lung cancer. | [31, 32] |
| Akebia fruit | AF | To regulate Qi, to promote blood circulation and relieve pain, and to cause diuresis. | Popularly used for primary liver cancer treatment in China. | [33] |
| Atractylis ovata | AO | To invigorate the function of the spleen and replenish Qi and to eliminate dampness by causing diuresis. | Antiangiogenic activity. Inhibiting the growth of B16 cancer cells. | [34-38] |
| Chinese date | CD | To tonify the spleen, replenish Qi and to nourish blood. | Antiproliferative activity in human breast cancer cells. | [39] |
| Chinese sage herb | CS | To remove toxic heat and blood stasis and relieve pain. | Antiangiogenic activity. | [40] |
| Coix seed | CSE | To transform dampness and promote water metabolism, to strengthen the spleen, and to clear heat and eliminate pus. | Affecting cellular pathways in neoplasia: to inhibit NFkappaB and protein kinase C signaling. | [41] |
| Doederlein's spikemoss herb | DS | To remove toxic heat and dampness and to promote blood circulation and remove blood stasis. | Antiproliferative activity in three types of human cancer cells in vitro. | [42] |
| Herba Oldenlandiae | HO | To eliminate heat and toxic material, to promote blood circulation and remove blood stasis, and to clear dampness heat. | Antiproliferative activity in eight cancer cell lines. Strengthening the patient's resistance. | [43-45] |
| Malt | MA | To invigorate the function of the spleen, to regulate the function of the stomach, and to promote the flow of milk. | Proliferative function of colonic epithelial cells. | [46-49] |
| Pachyma cocos | PC | To cause diuresis, to invigorate the spleen function, and to calm the mind. | Inhibiting the growth of nonsmall cell lung cancer cells. | [50, 51] |
| Pyrrosia leaf | PL | To induce diuresis, relieve dysuria, remove heat, and arrest bleeding. | Its active components: isomangiferin has capability of inhibiting virus replication within cells, and fumaric acid has chemopreventive potential for tobacco-nitrosamine-induced lung tumors. | [52, 53] |
| Pinellia tuber | PT | To remove damp and phlegm, to relieve nausea and vomiting, and to eliminate stuffiness in the chest and the epigastrium. | Antiproliferative activity in five cancer cell lines in vitro. | [54, 55] |
| Rhizoma batatatis | RB | To replenish the spleen and stomach, to promote fluid secretion, and to benefit the lung. | Inhibiting the cancer cell line of melanoma B16 and Lewis lung cancer in mice in vivo. | [56] |
| Rice-grain sprout | RS | To promote digestion, invigorate the function of the spleen, and improve appetite. | Popularly used for strengthening function of the spleen and the stomach during cancer treatment in China. | [57] |
| Tangerine peel | TP | To regulate the flow of Qi, to invigorate the spleen function, to eliminate damp, and to resolve phlegm. | Antioxidative and anti-inflammatory functions. Antiproliferative activity in human gastric cancer cells. | [58, 59] |

#Information is queried from TCM-ID database (http://bidd.nus.edu.sg/group/TCMsite/).

Table 5

CEF obtained by GA.

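| No. | Number of herbs |
|---|---|
| 1 | 10 |
| 2 | 9 |
| 3 | 8 |
| 4 | 9 |
| 5 | 8 |
| 6 | 10 |
| 7 | 9 |
| 8 | 11 |
| 9 | 10 |

Number of CEF in which each herb occurs:

| AR | AF | AO | CD | CS | CSE | DS | HO | MA | PC | PL | PT | RB | RS | TP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 8 | 9 | 1 | 5 | 2 | 7 | 4 | 4 | 7 | 1 | 9 | 9 | 4 | 8 |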

4.2. Evaluation

4.2.1. Coreness

A herb-herb network was constructed using a cooccurrence frequency-based method. The degree value of a node (herb) was defined as the number of other nodes (herbs) that it connects to; it is a simple but important property of any complex network. A node plays a more significant role if it has a higher degree value. The importance of a herb was studied according to its degree value and its frequency in the dataset. These values were sorted in descending order and are shown in Table 6. Among the 230 herbs in the dataset, the 15 herbs that make up the CEF are all ranked in the top 50 in terms of degree and of frequency based on both records and patients. In other words, this is a good indication that these 15 herbs in the CEF are core herbs.
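A cooccurrence-based herb network and its degree ranking can be sketched as follows. The text does not say how cooccurrence frequencies were turned into edges, so the `min_count` threshold is a hypothetical parameter; with the default of 1, two herbs are linked if they appear together in at least one prescription. The function names and the toy prescriptions are also hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def cooccurrence_network(prescriptions: Iterable[Iterable[str]],
                         min_count: int = 1) -> Dict[str, Set[str]]:
    """Link two herbs when they cooccur in at least min_count prescriptions."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for herbs in prescriptions:
        for a, b in combinations(sorted(set(herbs)), 2):
            counts[(a, b)] += 1
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for (a, b), count in counts.items():
        if count >= min_count:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def degree_ranking(neighbours: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Herbs sorted by degree (descending), ties broken alphabetically."""
    degrees = [(herb, len(linked)) for herb, linked in neighbours.items()]
    return sorted(degrees, key=lambda item: (-item[1], item[0]))


# Toy prescriptions written with herb abbreviations.
toy_prescriptions = [["DS", "CS", "AF"], ["DS", "CS"], ["DS", "AO"]]
ranking = degree_ranking(cooccurrence_network(toy_prescriptions))
```

In the toy example the herb that appears in every prescription is linked to all the others and therefore ranks first by degree.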
Table 6

Core herb identification.

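| Herb | Degree | Degree rank | Frequency (record based) | Frequency rank (record based) | Frequency (patient based) | Frequency rank (patient based) |
|---|---|---|---|---|---|---|
| DS | 225 | 1 | 393 | 2 | 146 | 2 |
| CS | 225 | 2 | 395 | 1 | 147 | 1 |
| AF | 223 | 3 | 359 | 3 | 133 | 3 |
| AO | 220 | 4 | 321 | 5 | 127 | 5 |
| HO | 219 | 5 | 332 | 4 | 129 | 4 |
| PC | 207 | 6 | 266 | 8 | 107 | 7 |
| AR | 198 | 9 | 263 | 9 | 116 | 6 |
| CSE | 197 | 10 | 223 | 13 | 96 | 13 |
| RB | 194 | 11 | 235 | 12 | 97 | 12 |
| RS | 191 | 12 | 268 | 6 | 106 | 9 |
| MA | 191 | 13 | 268 | 7 | 106 | 10 |
| TP | 184 | 16 | 195 | 14 | 80 | 16 |
| CD | 158 | 30 | 120 | 23 | 53 | 28 |
| PT | 152 | 34 | 99 | 29 | 51 | 29 |
| PL | 127 | 47 | 65 | 41 | 31 | 43 |
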
The average confidence of the prescriptions ($\alpha$) and the support under different confidence levels ($S_{\alpha}$) were calculated in order to evaluate the coreness of the CEF. To check whether the use of a CEF is concentrated within a few individuals, patient-based support (PBS) was also calculated for each CEF (Table 7). The values of $\alpha$ and $S_{\alpha}$ are all relatively high. In particular, the $\alpha$ value of the second CEF (CEF2) exceeds 0.7, which means that the prescriptions in the dataset contain, on average, more than 70% of the herbs in CEF2. The $S_{0.7}$ values of both CEF8 and CEF9 exceed 0.5, which means that more than 50% of the prescriptions contain 70% or more of the herbs from these two CEF. As for PBS, a CEF with a small PBS is of little value, because its use is concentrated in a minority of patients. The results show that every PBS is larger than the corresponding $S_1$ (record-based support), which indicates that no CEF is used in a concentrated way at the patient level.
Table 7

Confidence and support of CEF.

No.  α      S_0.7  S_0.8  S_0.9  S_1    PBS
1    0.549  0.396  0.248  0.084  0.021  0.040
2    0.707  0.499  0.239  0.041  0.041  0.067
3    0.598  0.375  0.198  0.043  0.043  0.073
4    0.653  0.356  0.191  0.050  0.050  0.073
5    0.675  0.489  0.294  0.084  0.084  0.120
6    0.616  0.461  0.265  0.122  0.036  0.060
7    0.670  0.406  0.196  0.041  0.041  0.067
8    0.678  0.501  0.310  0.167  0.041  0.067
9    0.637  0.511  0.332  0.181  0.041  0.067
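As a rough illustration of how the quantities in Table 7 could be computed, the sketch below assumes that the confidence of one prescription is the fraction of the CEF's herbs it contains, that α is the mean of this value over all prescriptions, that S_α is the share of prescriptions whose confidence is at least α, and that PBS is the share of patients with at least one prescription containing the whole CEF. These readings are inferred from the description above; the function names and toy records are hypothetical.

```python
def confidence(prescription, cef):
    """Fraction of the CEF's herbs that appear in one prescription."""
    cef = set(cef)
    return len(cef & set(prescription)) / len(cef)


def average_confidence(prescriptions, cef):
    """Mean confidence over all prescriptions (assumed meaning of alpha)."""
    return sum(confidence(p, cef) for p in prescriptions) / len(prescriptions)


def support(prescriptions, cef, alpha=1.0):
    """Share of prescriptions whose confidence is at least alpha."""
    hits = sum(1 for p in prescriptions if confidence(p, cef) >= alpha - 1e-12)
    return hits / len(prescriptions)


def patient_based_support(records, cef):
    """Share of patients with at least one record containing the whole CEF.

    `records` is a list of (patient_id, herbs) pairs.
    """
    patients = {pid for pid, _ in records}
    covered = {pid for pid, herbs in records if confidence(herbs, cef) >= 1.0}
    return len(covered) / len(patients)


# Toy data (illustrative only).
toy_cef = {"A", "B", "C", "D"}
toy_records = [
    ("p1", ["A", "B", "C", "D", "X"]),
    ("p1", ["A", "B", "C"]),
    ("p2", ["A", "X"]),
    ("p3", ["Y"]),
]
toy_prescriptions = [herbs for _, herbs in toy_records]
alpha_bar = average_confidence(toy_prescriptions, toy_cef)
s_1 = support(toy_prescriptions, toy_cef, alpha=1.0)
s_07 = support(toy_prescriptions, toy_cef, alpha=0.7)
pbs = patient_based_support(toy_records, toy_cef)
```

With α = 1 the support reduces to the usual itemset support: the share of prescriptions that contain every herb of the CEF.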

4.2.2. Effectiveness

To test the effectiveness of the CEF, the dataset was divided into two groups for each CEF: the CEF group and the non-CEF group. In this study, α was set to 1; that is, every prescription in the CEF group contained all the herbs of the specific CEF, while the prescriptions in the non-CEF group lacked at least one of them. A Z-test for the difference between the two effective proportions (EP) was performed for each CEF. Table 8 shows that the EP of every CEF group is significantly higher than that of the corresponding non-CEF group.
Table 8

Z-test for the difference of EP for CEF.

No.  EP of non-CEF group  EP of CEF group  P value
1    0.210                0.778            0.000
2    0.206                0.588            0.002
3    0.204                0.611            0.000
4    0.206                0.524            0.004
5    0.203                0.429            0.009
6    0.210                0.533            0.013
7    0.206                0.588            0.002
8    0.206                0.588            0.002
9    0.206                0.588            0.002
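A Z-test for the difference between two proportions can be sketched as below. The text does not state which variant was used, so this example assumes a two-sided test with an unpooled (Wald) standard error; the function name and the counts in the usage lines are hypothetical, since the tables report only proportions.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided Z-test for p1 - p2 with an unpooled standard error.

    x1, x2: numbers of effective records; n1, n2: group sizes.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided P value.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 12 of 20 effective versus 80 of 400 effective.
z_stat, p_val = two_proportion_z_test(12, 20, 80, 400)
```

A pooled standard error is an equally common choice and would give slightly different P values for unbalanced groups.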
Sampling is a simple and well-known method for parameter studies and robustness evaluations [61]. To test the robustness of the effectiveness results, a leave-one-(patient)-out analysis was performed. After one patient was removed from the original data, the effectiveness of each CEF was remeasured on the remaining patients. This was repeated so that each patient in the data was removed exactly once. The EP values and the P value of the Z-test were calculated for each CEF each time. The results are shown in Table 9. Each P value was transformed into −log10(P), where a value of −log10(P) larger than 1.301 indicates a P value smaller than 0.05. Table 9 shows that the EP of both groups changes little between the original and the perturbed data and that all −log10(P) values exceed 1.33, which indicates good robustness of the effectiveness evaluation under a small perturbation of the sample (at the patient level).
Table 9

Leave one (patient) out analysis to test the robustness of effectiveness (total 150 times).

No.  EP of non-CEF group      EP of CEF group          −log10(P)
     Mean   Range             Mean   Range             Mean   Range
1    0.209  [0.204, 0.213]    0.778  [0.750, 0.857]    4.215  [3.308, 5.867]
2    0.206  [0.201, 0.210]    0.589  [0.563, 0.643]    2.764  [2.314, 3.194]
3    0.204  [0.199, 0.208]    0.612  [0.588, 0.667]    3.273  [2.791, 3.779]
4    0.206  [0.201, 0.209]    0.524  [0.500, 0.579]    2.360  [1.990, 2.936]
5    0.203  [0.197, 0.206]    0.429  [0.412, 0.455]    2.036  [1.757, 2.337]
6    0.210  [0.205, 0.214]    0.533  [0.500, 0.583]    1.859  [1.335, 2.160]
7    0.206  [0.201, 0.210]    0.589  [0.563, 0.643]    2.764  [2.314, 3.194]
8    0.206  [0.201, 0.210]    0.589  [0.563, 0.643]    2.764  [2.314, 3.194]
9    0.206  [0.201, 0.210]    0.589  [0.563, 0.643]    2.764  [2.314, 3.194]
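The leave-one-(patient)-out procedure can be sketched as a loop that drops all records of one patient, rebuilds the CEF and non-CEF groups, and recomputes EP and −log10(P). The sketch assumes a two-sided Z-test with an unpooled standard error and a CEF group defined by α = 1; the record layout, function names, and toy data are hypothetical.

```python
import math


def neg_log10_p(x1, n1, x2, n2):
    """-log10 of the two-sided Z-test P value (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        return None  # test undefined for this split
    p_value = math.erfc(abs(p1 - p2) / se / math.sqrt(2))
    return -math.log10(max(p_value, 1e-300))  # guard against log10(0)


def leave_one_patient_out(records, cef):
    """Map each removed patient to (EP non-CEF, EP CEF, -log10 P).

    `records` is a list of (patient_id, herbs, effective) triples.
    """
    cef = set(cef)
    results = {}
    for left_out in sorted({pid for pid, _, _ in records}):
        in_cef, non_cef = [], []
        for pid, herbs, effective in records:
            if pid == left_out:
                continue  # drop every record of the removed patient
            group = in_cef if cef <= set(herbs) else non_cef
            group.append(bool(effective))
        if not in_cef or not non_cef:
            results[left_out] = None  # one group is empty
            continue
        score = neg_log10_p(sum(in_cef), len(in_cef), sum(non_cef), len(non_cef))
        results[left_out] = (sum(non_cef) / len(non_cef),
                             sum(in_cef) / len(in_cef), score)
    return results


# Toy records (illustrative only).
toy_records = [
    ("p1", {"A", "B"}, True), ("p1", {"A", "B", "C"}, True),
    ("p2", {"A", "B"}, False), ("p3", {"A", "B"}, True),
    ("p4", {"A"}, False), ("p4", {"C"}, False),
    ("p5", {"B"}, True),
    ("p6", {"C"}, False), ("p6", {"A", "C"}, False),
]
loo = leave_one_patient_out(toy_records, {"A", "B"})
```

The mean and range reported per CEF would then be the mean, minimum, and maximum of each component over the entries of `loo`.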

4.3. Assumption Analysis

The GA process generated 9 CEF containing 15 distinct core herbs. Since the number of distinct herbs across all the CEF was relatively small, we examined whether a CEF consisting of all 15 core herbs exists in the dataset and, if so, how effective it is. Such a combination of herbs was found in the dataset, and its coreness and effectiveness were evaluated (Table 10). Although its α and EP values were relatively high and may be acceptable, its S_1 value was only 0.009, which meant that only 4 records covered this combination. This support is too small for the combination to be considered a core formula, but a future clinical trial is still worthwhile because of its high effectiveness.
Table 10

Coreness and effectiveness evaluation of the combination of 15 core herbs.

α      S_1    EP of non-CEF group  EP of CEF group  P value
0.605  0.009  0.217                0.750            0.014

4.4. Analysis of Herb-Herb Interactions in CEF

A herb combination is chosen to promote desirable herb-herb interactions; the efficacy of a TCM formula comes from the synergistic effects of its constituent herb pairs. Practitioners are therefore interested in identifying the potentially interacting herbs in a prescription. Based on the previous work [7] on the analysis of herb-herb interactions, the synergy index (SI) was calculated for each herb pair in the CEF as follows:

$$\mathrm{SI} = \frac{E_{11}}{E_{01} \vee E_{10}},$$

where $E_{11}$ denotes the EP value when the two herbs cooccur, $E_{01}$ and $E_{10}$ are the EP values when each herb is used without the other, and $\vee$ denotes the maximum function, that is, $E_{01} \vee E_{10} = \max(E_{01}, E_{10})$. An SI equal to 1 indicates no real advantage in putting the two herbs together. An SI greater than 1 shows potential synergy, and the larger the SI, the stronger the indicated synergistic interaction between the two herbs in the pair. Figure 6 shows the distribution of SI over all core herb pairs. Although most SI values are close to 1, the distribution is skewed toward the positive side (greater than 1), which indicates the existence of some potential synergies. All the SI values are greater than 0.9, which implies no obvious antagonistic effect among the core herb pairs. A permutation test [7] was performed to test the significance of SI by permuting the outcome variable 2000 times. As a result, 4 significantly synergistic core herb pairs were obtained (Table 11). Table 11 shows that most of these pairs are related to the functions of regulating Qi to promote diuresis and eliminating dampness to eliminate phlegmon according to TCM theory.
Figure 6

Distribution of SI.

Table 11

Analysis of herb-herb interactions in CEF.

No.  Herb pair  SI     P value
1    PL, PT     1.673  0.004
2    CD, PT     1.419  0.012
3    CSE, PL    1.363  0.028
4    PT, TP     1.077  0.025
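The synergy index and its permutation test can be sketched as follows, taking SI as the EP of records containing both herbs divided by the larger of the two single-herb EP values, which is how the definition above reads. The details of the permutation test in [7] are not given in this section, so the one-sided P value (share of permuted SI values at least as large as the observed one, with the usual +1 correction), the treatment of undefined SI values, the function names, and the toy data are all assumptions.

```python
import random


def effective_proportion(outcomes):
    """EP of a group; an empty group is treated as EP = 0 (assumption)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def synergy_index(records, herb_a, herb_b):
    """SI = E11 / max(E01, E10); returns None when it is undefined.

    `records` is a list of (herbs, effective) pairs.
    """
    both, only_a, only_b = [], [], []
    for herbs, effective in records:
        has_a, has_b = herb_a in herbs, herb_b in herbs
        if has_a and has_b:
            both.append(effective)
        elif has_a:
            only_a.append(effective)
        elif has_b:
            only_b.append(effective)
    denominator = max(effective_proportion(only_a), effective_proportion(only_b))
    if not both or denominator == 0:
        return None
    return effective_proportion(both) / denominator


def permutation_test(records, herb_a, herb_b, n_permutations=2000, seed=0):
    """Return (observed SI, one-sided permutation P value)."""
    observed = synergy_index(records, herb_a, herb_b)
    if observed is None:
        raise ValueError("SI is undefined for this herb pair")
    rng = random.Random(seed)
    herb_sets = [herbs for herbs, _ in records]
    outcomes = [effective for _, effective in records]
    valid = at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(outcomes)  # permute the outcome variable only
        permuted = synergy_index(list(zip(herb_sets, outcomes)), herb_a, herb_b)
        if permuted is None:
            continue  # skip permutations where SI is undefined
        valid += 1
        at_least += permuted >= observed
    return observed, (at_least + 1) / (valid + 1)


# Toy records (illustrative only): E11 = 0.75, E10 = 0.25, E01 = 0.5.
toy = ([({"A", "B"}, e) for e in (1, 1, 1, 0)]
       + [({"A"}, e) for e in (1, 0, 0, 0)]
       + [({"B"}, e) for e in (1, 1, 0, 0)]
       + [({"C"}, 0), ({"C"}, 0)])
si_value, si_p = permutation_test(toy, "A", "B", n_permutations=200)
```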

5. Discussion

Prescribing for a diagnosis is a complicated and flexible procedure that integrates the knowledge of TCM theory. TCM practitioners place heavy emphasis on individual differences when prescribing formulae in clinical practice. This is very different from modern western medical therapies, which usually comply with common, operational clinical guidelines. Revealing the regularity in prescriptions is an important step toward revealing the underpinning TCM theory, and discovering this regularity has generated much research interest. Although computational models have been applied to reveal core herbs and herb-collaboration patterns, little effort has been spent on studying their effectiveness. It is therefore important to discover the hidden patterns that are both core and effective herbal formulae. Mathematically, the discovery of CEF can be described as a complicated combinatorial optimization problem, which is concerned with efficiently combining herbs to meet given requirements. The purpose of this study is to set the stage and outline the properties of optimization problems that are relevant to the discovery of CEF in TCM. We described how to define a problem model that can be solved by the GA method. In brief, the analytic process consisted of recognizing and defining the problem, constructing and solving the model, and evaluating the solutions. Furthermore, we examined important properties of CEF that can be used as validation criteria. For CEF, there are two key questions to be answered: how to evaluate the coreness of a TCM formula and how to assess its clinical effectiveness. In this study, confidence and support serve as proxies for the coreness of a formula; the greater these two values are, the more widely the constituent herbs are used in the prescriptions and the more frequently the formula is used.
The symbols α and S_α denote the confidence and support of a given CEF, respectively. It is quite common for a TCM practitioner to pick a subset of formulae, which are CEF, as templates and personalize them for a patient. Starting from the selected template(s), the practitioner can add, remove, or replace herbs. The confidence value (α) captures this flexible usage of the CEF and the personalized adaptation in practice. Regarding the assessment of clinical effectiveness, the primary outcome measure in our study quantified the changes in symptoms during cancer treatment. In an internal panel meeting of TCM cancer experts, the most common LC symptoms were identified, and they were consistent with the literature [26, 62]. Our results show that the total proportion of symptom improvement was only 22.19%, which indicates that LC treatment remains a great challenge for TCM practitioners. Of course, it would make no sense for the frequently used herb combinations (CEF candidates) not to have high efficacy. GA can solve combinatorial optimization problems, as reported in the literature [63-65]. A basic GA has the following implementation steps. First, the feature values are encoded into chromosomes to form the initial population. Second, the fitness of every chromosome is calculated using the defined fitness function. Third, according to the fitness values, genetic operators are applied to select chromosomes and form a new population. This process is repeated until a stopping condition is satisfied. In our previous work [29], GA helped us to find a meaningful relationship between herbs and symptoms once a proper fitness function had been designed. Therefore, it is our belief that the usefulness of GA for other combinatorial optimization problems in TCM cannot be fairly assessed on the basis of its performance on the discovery of herb-symptom relationships alone.
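The basic GA steps listed above (encode, evaluate fitness, apply genetic operators, repeat) can be sketched with a binary chromosome in which each gene marks whether a herb is included. This is a generic illustration, not the Matlab toolbox configuration used in the study: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), all parameter values, and the toy fitness function with its penalty constant are assumptions made for the example.

```python
import random


def tournament(population, fitness, rng):
    """Pick the fitter of two randomly chosen chromosomes."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def run_ga(fitness, n_genes, pop_size=30, generations=40,
           p_cross=0.7, p_mut=0.02, seed=0):
    """Minimal binary GA; returns (best chromosome, best fitness)."""
    rng = random.Random(seed)
    # Step 1: encode candidates as bit strings to form the initial population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    # Step 2: evaluate fitness.
    best = max(population, key=fitness)
    for _ in range(generations):
        # Step 3: selection, crossover, and mutation build a new population.
        new_population = [best[:]]  # elitism keeps the best chromosome
        while len(new_population) < pop_size:
            parent1 = tournament(population, fitness, rng)
            parent2 = tournament(population, fitness, rng)
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_genes)
                child = parent1[:cut] + parent2[cut:]
            else:
                child = parent1[:]
            child = [g ^ 1 if rng.random() < p_mut else g for g in child]
            new_population.append(child)
        population = new_population
        best = max(population, key=fitness)
    return best, fitness(best)


# Toy fitness (hypothetical): reward matching a target herb pattern and
# subtract a penalty constant larger than the maximum reward when the
# candidate uses too many herbs, so rejected candidates score below zero.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]
PENALTY = 9


def toy_fitness(chromosome):
    reward = sum(g == t for g, t in zip(chromosome, TARGET))
    return reward - (PENALTY if sum(chromosome) > 4 else 0)


best_chromosome, best_score = run_ga(toy_fitness, n_genes=len(TARGET))
```

Because the best chromosome is carried over unchanged, the best fitness never decreases from one generation to the next.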
In this study, we gave an outline description of how a genetic algorithm works. A crucial point in using GA is the design of the fitness function, which determines what the GA should optimize. We designed the fitness function based on two evaluation criteria of CEF: coreness, which is represented by the confidence and support defined in the present paper, and effectiveness, which is evaluated by the statistical difference in effective proportion between the CEF group and the non-CEF group. The proposed fitness function is flexible and suitable for both binary and continuous outcomes. Applying a penalty constant R in the fitness function is the strategy for removing unsatisfactory CEF. This constant can be set to a value greater than the maximum value of EV to identify the CEF that meet the requirement; that is, a CEF is dropped if its fitness value is negative (not meeting expectation) and kept otherwise. Parameter tuning is always a challenging task for GA. The GA toolbox for Matlab developed by the University of Sheffield was used in these experiments. We implemented and ran the algorithm using different configurations and compared the results. The results show that some parameters, such as population size, generation, and P, need careful selection. Others have no significant effect on the fitness value and can keep their defaults. As for the fitness function, the additional key parameters in our approach are N and S. In this study, α was set to 1 and S_1 started from 2%. When S_1 was high or N was large, there were not many CEF with positive fitness values. A small but reasonable number of CEF were reported after proper values were set for S and N. In particular, for multiple-record data, which can also be regarded as longitudinal data, there are three types of correlation effects: (1) correlation between variables (herbs), (2) correlation within an individual (patient), and (3) correlation between individuals (patients).
As for the research on TCM formulae, the first one can be seen as the herb-herb relationship; such relationships are meaningful patterns of herb combination, which has prompted many researchers to develop methods to uncover the underlying rules. For this purpose, support- and confidence-based association rule algorithms are generally used. Motivated by the idea of association algorithms, we presented the support- and confidence-based criteria (α and S_α) to evaluate the coreness of a herb combination. When α is equal to 1, S_α is effectively the concept of support as commonly used in association mining algorithms such as the Apriori algorithm [66-69]. The second correlation is hard to tackle and may undermine the evaluation of a herb combination. For example, when a CEF is used for only one patient who visits frequently, its support may be relatively high because of the large number of visits, yet such a CEF is meaningless. This disadvantage can be reduced by choosing a large sample size. Individual- (patient-) based support analysis can also help to identify correlation within patients. In this paper, we reported patient-based support and carried out a robustness analysis of the effectiveness of the CEF by the leave-one-(patient)-out method. The results showed no concentrated use of any CEF at the patient level, and the good robustness implied that the effectiveness evaluation is stable under a small perturbation of the sample (at the patient level). Thus, correlation within patients did not undermine our evaluation of the effectiveness of the CEF in this study, and our sample size was appropriate for discovering reliable solutions. The last correlation is related to individual factors, such as age, gender, pathology, family history, pulmonary function, and TCM syndrome.
To reveal the relationship between patient patterns and CEF, another mathematical pattern recognition model needs to be established, which will be part of our future work. A total of 9 CEF were reported with good core properties and high effectiveness. When the EP value was calculated for the single use of each core herb, the maximum value was 31.3%, which was significantly lower than that of any combination (CEF). These results highlight the advantages and rationality of the combined use of herbs in TCM and are also meaningful for further experimental research. In TCM theory, deficiency is an important cause and pathogenesis in the occurrence and development of tumors. Lack of vital Qi and deficiency of both Qi and spleen can lead to a series of pathological changes, such as Qi stagnation, blood stasis, dampness, and phlegm, and eventually to a tumor [7, 70–73]. For that reason, the TCM treatment of lung cancer is guided by strengthening body resistance, including benefiting Qi and nourishing Yin, and is supplemented by eliminating pathogens, including dissipating phlegm, promoting blood circulation to dispel blood stasis, and detoxification [70, 73]. The prescription can be divided into two major parts [74]. Strengthening Body Resistance. The emphasis of the treatment is on invigorating the spleen and kidney. Si jun zi decoction, which is made up of Codonopsis, AO, PC, and Licorice, is a classic prescription to invigorate the spleen and replenish Qi. RB, which is characterized by spleen, lung, and kidney, can nourish the spleen and kidney and benefit the lung to promote the production of fluid. All 9 CEF reflect the medication intentions mentioned above; for example, Codonopsis and AR, which are both characterized by spleen and lung, can benefit Qi to promote the production of fluid, and AR also has the functions of invigorating Qi to consolidate the superficies and of expelling pathogens by strengthening vital Qi and expelling pus.
As a common Chinese herb, AR is often used instead of Codonopsis to strengthen resistance and to remove toxic substances. Eliminating Pathogens. The common function of PT and CSE is eliminating dampness and phlegm. AO combined with PC can invigorate the spleen to eliminate dampness, enhancing the effect of softening and resolving hard masses. CS and AF can regulate the flow of Qi to promote blood circulation and remove blood stasis. DS and HO can clear heat toxicity. All these functions complement each other so as to treat the disease by addressing both its root cause and its symptoms. What is more, because lung cancer is a consumptive disease, digestive function declines over time, so RS, MA, TP, and CD help to resolve food stagnation and promote herb absorption. One interesting observation is the similarity among the CEF, which can help in understanding the underlying TCM therapeutic principles for LC. Since it is fairly common for doctors in the same hospital to use similar sets of herbs for the same disease (LC), it is necessary and beneficial to compare these CEF with those from an LC dataset from another hospital. It is also worthwhile to observe which CEF are discovered if a larger dataset with higher supports is used. The herb-herb interactions in the CEF were also studied and reported. Four herb pairs had high and significant SI values, which indicates that they are synergistic. Some of them are present in classic TCM formulae. For example, PT and TP are in Ban xia Chen pi Tang, where they contribute to relieving cough and reducing sputum. Therefore, all the results conformed with TCM theory, which indicates the feasibility and validity of the proposal. However, dosage, which is a key aspect of CEF, was not considered in this work and should be addressed in future work. GA is capable of representing its chromosomes in real numbers, and a reformulation of the fitness function can accommodate this change.
A mathematical model of dose-effect needs to be defined. This may increase the complexity of the definition of the fitness functions, but the valuable results will make the effort worthwhile.

6. Conclusions

After the confidence, support, and effectiveness values of a CEF were introduced, GA was used to discover CEF from a TCM cancer clinical dataset. The results indicated that GA is suitable for discovering CEF that can be interpreted in terms of TCM principles. This study is only a first attempt to use data mining to discover CEF from TCM clinical data. More work is still required to explore the strengths, limitations, and appropriateness of the measures and whether they are relevant to other types of diseases.
  51 in total

1.  Identification of bioactive constituents of Ziziphus jujube fruit extracts exerting antiproliferative and apoptotic effects in human breast cancer cells.

Authors:  Pierluigi Plastina; Daniela Bonofiglio; Donatella Vizza; Alessia Fazio; Daniela Rovito; Cinzia Giordano; Ines Barone; Stefania Catalano; Bartolo Gabriele
Journal:  J Ethnopharmacol       Date:  2012-01-24       Impact factor: 4.360

2.  Oldenlandia diffusa extracts exert antiproliferative and apoptotic effects on human breast cancer cells through ERα/Sp1-mediated p53 activation.

Authors:  Guowei Gu; Ines Barone; Luca Gelsomino; Cinzia Giordano; Daniela Bonofiglio; Giancarlo Statti; Francesco Menichini; Stefania Catalano; Sebastiano Andò
Journal:  J Cell Physiol       Date:  2012-10       Impact factor: 6.384

3.  Astragalus polysaccharide injection integrated with vinorelbine and cisplatin for patients with advanced non-small cell lung cancer: effects on quality of life and survival.

Authors:  Li Guo; Shu-Ping Bai; Ling Zhao; Xiao-Hong Wang
Journal:  Med Oncol       Date:  2011-09-18       Impact factor: 3.064

4.  Traditional Chinese medicine in cancer care: perspectives and experiences of patients and professionals in China.

Authors:  W Xu; A D Towers; P Li; J-P Collet
Journal:  Eur J Cancer Care (Engl)       Date:  2006-09       Impact factor: 2.520

5.  Application of genetic algorithm-kernel partial least square as a novel nonlinear feature selection method: activity of carbonic anhydrase II inhibitors.

Authors:  Mehdi Jalali-Heravi; Anahita Kyani
Journal:  Eur J Med Chem       Date:  2007-01-12       Impact factor: 6.514

6.  Effects of sesquiterpenes isolated from largehead atractylodes rhizome on growth, migration, and differentiation of B16 melanoma cells.

Authors:  Gui-Xin Chou; Jian-Hong Chu; Wang-fun Fong; Zhi-Ling Yu
Journal:  Integr Cancer Ther       Date:  2010-08-16       Impact factor: 3.279

7.  Prescription pattern of traditional Chinese medicine for climacteric women in Taiwan.

Authors:  Y-H Yang; P-C Chen; J-D Wang; C-H Lee; J-N Lai
Journal:  Climacteric       Date:  2009-12       Impact factor: 3.005

8.  [Treatment of operated late gastric carcinoma with prescription of strengthening the patient's resistance and dispelling the invading evil in combination with chemotherapy: follow-up study of 158 patients and experimental study in animals].

Authors:  G T Wang
Journal:  Zhong Xi Yi Jie He Za Zhi       Date:  1990-12

9.  [Professor Ling Changquan's experience in treating primary liver cancer: an analysis of herbal medication].

Authors:  Zhen Sun; Yong-hua Su; Xiao-qiang Yue
Journal:  Zhong Xi Yi Jie He Xue Bao       Date:  2008-12

10.  Diagnosis Analysis of 4 TCM Patterns in Suboptimal Health Status: A Structural Equation Modelling Approach.

Authors:  Li-Min Wang; Xin Zhao; Xi-Ling Wu; Yang Li; Dan-Hui Yi; Hua-Ting Cui; Jia-Xu Chen
Journal:  Evid Based Complement Alternat Med       Date:  2012-04-10       Impact factor: 2.629
