Literature DB >> 24804251

Supervised clustering based on DPClusO: prediction of plant-disease relations using Jamu formulas of KNApSAcK database.

Sony Hartono Wijaya1, Husnawati Husnawati2, Farit Mochamad Afendi3, Irmanida Batubara4, Latifah K Darusman4, Md Altaf-Ul-Amin5, Tetsuo Sato5, Naoaki Ono5, Tadao Sugiura5, Shigehiko Kanaya5.   

Abstract

Indonesia has the largest medicinal plant species in the world and these plants are used as Jamu medicines. Jamu medicines are popular traditional medicines from Indonesia and we need to systemize the formulation of Jamu and develop basic scientific principles of Jamu to meet the requirement of Indonesian Healthcare System. We propose a new approach to predict the relation between plant and disease using network analysis and supervised clustering. At the preliminary step, we assigned 3138 Jamu formulas to 116 diseases of International Classification of Diseases (ver. 10) which belong to 18 classes of disease from National Center for Biotechnology Information. The correlation measures between Jamu pairs were determined based on their ingredient similarity. Networks are constructed and analyzed by selecting highly correlated Jamu pairs. Clusters were then generated by using the network clustering algorithm DPClusO. By using matching score of a cluster, the dominant disease and high frequency plant associated to the cluster are determined. The plant to disease relations predicted by our method were evaluated in the context of previously published results and were found to produce around 90% successful predictions.

Entities:  

Mesh:

Year:  2014        PMID: 24804251      PMCID: PMC3997850          DOI: 10.1155/2014/831751

Source DB:  PubMed          Journal:  Biomed Res Int            Impact factor:   3.411


1. Introduction

Big data biology, which is a discipline of data-intensive science, has emerged because of the rapid increasing of data in omics fields such as genomics, transcriptomics, proteomics, and metabolomics as well as in several other fields such as ethnomedicinal survey. The number of medicinal plants is estimated to be 40,000 to 70,000 around the world [1] and many countries utilize these plants as blended herbal medicines, for example, China (traditional Chinese medicine), Japan (Kampo medicine), India (Ayurveda, Siddha, and Unani), and Indonesia (Jamu). Nowadays, the use of traditional medicines is rapidly increasing [2, 3]. These medicines consist of ingredients made from plants, animals, minerals, or combination of them. The traditional medicines have been used for generations for treatments of diseases or maintaining health of people and the most popular form of traditional medicine is herbal medicine. Blended herbal medicines as well as single herb medicines include a large number of constituent substances which exert effects on human physiology through a variety of biological pathways. The KNApSAcK Family database systems can be used to comprehensively understand the medicinal usage of plants based upon traditional and modern knowledge [4, 5]. This database has information about the selected herbal ingredients, that is, the formulas of Kampo and Jamu, omics information of plants and humans, and physiological activities in humans. Jamu is generally composed based on the experience of the users for decades or even hundreds of years. However, versatile scientific analyses are needed to support their efficacy and their safety. Attaining this objective is in accordance with the 2010 policy of the Ministry of Health of Indonesian Government about scientification of Jamu. Thus, it is required to systemize the formulations and develop basic scientific principles of Jamu to meet the requirement of Indonesian Healthcare System. Afendi et al. initiated and conducted scientific analysis of Jamu for finding the correlation between plants, Jamu, and their efficacy using statistical methods [6-8]. They used Biplot, partial least squares (PLS), and bootstrapping methods to summarize the data and also focused on prediction of Jamu formulations. These methods give a good understanding about relationship between plants, Jamu, and their efficacy. Among 465 plants used in 3138 Jamu, 190 plants were shown to be effective for at least one efficacy and these plants were considered to be the main ingredients of Jamu. The other 275 plants are considered to be supporting ingredients in Jamu because their efficacy has not been established yet. Network biology can be defined as the study of the network representations of molecular interactions, both to analyze such networks and to use them as a tool to make biological predictions [9]. This study includes modelling, analysis, and visualizations, which holds important task in life science today [10]. Network analysis has been increasingly utilized in interpreting high throughput data on omics information, including transcriptional regulatory networks [11], coexpression networks [12], and protein-protein interactions [13]. We can easily describe relationship between entities in the network and also concentrate on part of the network consisting of important nodes or edges. These advantages can be adopted for analyzing medicinal usage of plants in Jamu and diseases. Network analysis provides information about groups of Jamu that are closely related to each other in terms of ingredient similarity and thus allows precise investigation to relate plants to diseases. On the other hand, multivariate statistical methods such as PLS can assign plants to efficacy by global linear modeling of the Jamu ingredients and efficacy. However, there is still lack of appropriate network based methods to learn how and why many plants are grouped in certain Jamu formula and the combination rule embedding numerous Jamu formulas. It is needed to explore the relationship between Indonesian herbal plants used in Jamu medicines and the diseases which are treated using Jamu medicines. When effectiveness of a plant against a disease is firmly established, then further analysis about that plant can be proceeded to molecular level to pinpoint the drug targets. The present study developed a network based approach for prediction of plant-disease relations. We utilized the Jamu data from the KNApSAcK database. A Jamu network was constructed based on the similarity of their ingredients and then Jamu clusters were generated using the network clustering algorithm DPClusO [14, 15]. Plant-disease relations were then predicted by determining the dominant diseases and plants associated with selected Jamu clusters.

2. Methods

2.1. Concept of the Methodology

Jamu medicines consist of combination of medicinal plants and are used to treat versatile diseases. In this work we exploit the ingredient similarity between Jamu medicines to predict plant-disease relations. The concept of the proposed method is depicted in Figure 1. In step 1 a network is constructed where a node is a Jamu medicine and an edge represents high ingredient similarity between the corresponding Jamu pair. In Figure 1, the nodes of the same color indicate the Jamu medicines used for the same disease. The similarity is represented by Pearson correlation coefficient [16, 17]; that is, where x is the weight of plant-i in Jamu X, y is the weight of plant-i in Jamu Y, is mean of Jamu X, and is mean of Jamu Y. The higher similarity between Jamu pairs the higher the correlation value. In the present study, x and y are assigned as 1 or 0 in cases the ith plant is, respectively, included or not included in the formula. Under such condition, Pearson correlation corresponds to fourfold point correlation coefficient; that is, where a, b, c, and d represent the numbers of plants included in both X and Y, in only X, in only Y, and in neither X nor Y, respectively.
Figure 1

Concept of the methodology: network construction based on ingredient similarity between individual Jamu medicines, network clustering, and classification of medicinal plants to dominant disease.

In step 2 the Jamu clusters are generated using network clustering algorithm DPClusO. DPClusO can generate clusters characterized by high density and identified by periphery; that is, the Jamu medicines belonging to a cluster are highly cohesive and separated by a natural boundary. Such clusters contain potential information about plant-disease relations. In step 3 we assess disease-dominant clusters based on matching score represented by the following equation: Matching score of a cluster is the ratio of the highest number of Jamu associated with a single disease to the total number of Jamu in the cluster. We assign a disease to a cluster for which the matching score is greater than a threshold value. In step 4, we determine the frequency of plants associated with a cluster if and only if a disease is assigned to it in the previous step. The highest frequency plant associated to a cluster is considered to be related to the disease assigned to that cluster. True positive rates (TPR) or sensitivity was used to evaluate resulting plants. TPR is the proportion of the true positive predictions out of all the true predictions, defined by the following formula [18]: where true positive (TP) is the number of correctly classified and false negative (FN) is the number of incorrectly rejected entities. We refer to the proposed method as supervised clustering because after generation of the clusters we narrow down the candidate clusters for further analysis based on supervised learning and thus improve the accuracy of prediction of the proposed method.

3. Result and Discussion

3.1. Construction and Comparison of Jamu and Random Networks

We used the same number of Jamu formulas from previous research [6], 3138 Jamu formulas, and the set union of all formulas consists of 465 plants. We assigned 3138 Jamu formulas to 116 diseases of International Classification of Diseases (ICD) version 10 from World Health Organization (WHO, Table 1) [19]. Those 116 diseases are mapped to 18 classes of disease, which contains 16 classes of disease from National Center for Biotechnology Information (NCBI) [20] and 2 additional classes. Table 2 shows distribution of 3138 Jamu into 18 classes of disease. According to this classification, most Jamu formulas are useful for relieving muscle and bone, nutritional and metabolic diseases, and the digestive system. Furthermore, there is no Jamu formula classified into glands and hormones and neonatal disease classes. We excluded 4 Jamu formulas which are used to treat fever in the evaluation process because this symptom is very general and almost appeared in all disease classes. Jamu-plant-disease relations can be represented using 2 matrices: first matrix is Jamu-plant relation with dimension 3138 × 465 and the second matrix is Jamu-disease relation with dimension 3138 × 18.
Table 1

List of diseases using International Classification of Diseases ver. 10 (class of disease IDs correspond to Table 2).

IDDiseaseClass of disease
1Abdominal pain3
2Abdominal pain, diarrhea3
3Acne16
4Acne, skin problems (cosmetics)16
5Amenorrhoea, dysmenorrhea6
6Amenorrhoea, irregular menstruation6
7Anaemia1
8Appendicitis, urinary tract infection, tonsillitis3
9Arthralgia11
10Arthralgia, arthritis11
11Asthma15
12Benign prostatic hyperplasia (Bph)10
13Breast disorder6
14Bromhidrosis16
15Bronchitis15
16Cancer2
17Cancer pain2
18Cancer, inflammation2
19Colic abdomen, bloating (in infant)3
20Common cold15
21Common cold, dyspepsia, insect bites15, 3, 16
22Common cold, influenza15
23Cough15
24Degenerative disease14
25Dermatitis, urticaria, erythema16
26Diabetes14
27Diabetic gangrene16
28Diarrhea3
29Diarrhea, abdominal pain3
30Diseases of the eye5
31Disorders in pregnancy6
32Dysmenorrhea6
33Dysmenorrhea, irregular menstruation6
34Dysmenorrhea, menstrual syndrome6
35Dyspepsia3
36Dyspnoea15
37Dyspnoea, cough, orthopnoea15
38Fatigue11
39Fatigue, anaemia, loss appetite1
40Fatigue, lack of sexual function6
41Fatigue, low back pain11
42Fatigue, myalgia, arthralgia11
43Fatigue, osteoarthritis11
44Fertility problem6, 10
45Fever0
46Gastritis, gastric ulcer3
47Haemorrhoids1
48Headache13
49Heart diseases8
50Heartburn3, 8
51Hepatitis, other diseases of liver3
52Hypercholesterolaemia14
53Hypertension8
54Hypertension, diabetes14
55Hypertension, hypercholesterolaemia14
56Hyperuricemia1
57Immunodefficiency9
58Indigestion (K.30)3
59Indigestion, lose appetite3
60Infertility6, 10
61Irregular menstruation, menstruation syndrome6
62Kidney diseases17
63Lactation problems6
64Leukorrhoea (Vaginalis)6
65Leukorrhoea (Vaginalis), dysmenorrhoea6
66Lose appetite3
67Lose appetite, underweight14
68Low back pain, myalgia, arthralgia11
69Low back pain, myalgia, constipation11
70Low back pain, urinary tract infection17
71Lung diseases15
72Malaise and Fatigue11
73Malaise and Fatigue, Constipation11
74Malaise and Fatigue, Fertility Problems10, 11
75Malaise and Fatigue, Low Back Pain11
76Malaise and Fatigue, Sexual Dysfunction11, 6, 10
77Malaise and Fatigue, Skin Problems (Cosmetics)16
78Malaria, anaemia1
79Meno-metrorrhagia6
80Menopausal syndrome6
81Menopause/menstrual syndrome, leukorrhoea (vaginalis)6
82Menstrual syndrome6
83Menstrual syndrome, fatigue6
84Migraine13
85Mood disorder18
86Myalgia, arthralgia11
87Nausea/vomiting of pregnancy6
88Osteoarthritis11
89Osteoarthritis, fatigue11
90Overweight, obesity14
91Paralysis13
92Post partum syndrome6
93Prevent from overweight14
94Respiratory infection due to smoking15
95Respiratory tract infection15
96Rheumatoid arthritis, gout11
97Secondary amenorrhea6
98Secondary amenorrhea, irregular menstruation6
99Sexual dysfunction, fatigue6, 10
100Skin diseases16
101Skin problems (cosmetics)16
102Sleeping and Mood Disorders18
103Sleeping disorders18
104Stomatitis3
105Stomatitis, gingivitis, tonsilitis3
106Stone in kidney (N20.0)17
107Stone in kidney (N20.0), urinary bladder stone (N21.0)17
108Tonsilitis4
109Tonsilofaringitis4
110Toothache13
111Typhoid, dyspepsia3
112Ulcer of anus and rectum3
113Underweight, lose appetite3
114Urinary tract infection (urethritis)17
115Vaginal discharges6
116Vaginal diseases6
Table 2

Distribution of Jamu formulas according to 18 classes of disease (classes of diseases are determined by NCBI in ID1 to ID16 and by the present study in ID17 and ID18 represented by asterisks in Ref. columns).

IDClass of disease (NCBI)Ref.Number of JamuPercentage
1Blood and lymph diseasesNCBI201 6.41
2CancersNCBI32 1.02
3The digestive systemNCBI457 14.56
4Ear, nose, and throatNCBI2 0.06
5Diseases of the eyeNCBI1 0.03
6Female-specific diseasesNCBI382 12.17
7Glands and hormonesNCBI0
8The heart and blood vesselsNCBI57 1.82
9Diseases of the immune systemNCBI22 0.70
10Male-specific diseasesNCBI17 0.54
11Muscle and boneNCBI649 20.68
12Neonatal diseasesNCBI0
13The nervous systemNCBI32 1.02
14Nutritional and metabolic diseasesNCBI576 18.36
15Respiratory diseasesNCBI313 9.97
16Skin and connective tissueNCBI163 5.19
17The urinary system90 2.87
18Mental and behavioral disorders21 0.67

The number of Jamu classified into multiple disease classes1193.79
The number of Jamu unclassified40.13

Total Jamu formulas3138100.00
After completion of data acquisition process, we calculated the similarity between Jamu pairs using correlation measure. The similarity measures between Jamu pairs were determined based on their ingredients. Corresponding to K (3138 in present case) Jamu formulas, there can be maximum (K × (K − 1)/2) = (3138 × (3137/2)) = 4,921,953 Jamu pairs. We sorted the Jamu pairs based on correlation value using descending order and selected top-n (0.7%, 0.5%, and 0.3%) pairs of Jamu formula to create 3 sets of Jamu pairs. The number of Jamu pairs for 0.7%, 0.5%, and 0.3% datasets is 34,454 pairs, 24,610 pairs, and 14,766 pairs and the corresponding minimum correlation values are 0.596, 0.665, and 0.718, respectively. The three datasets of Jamu pairs can be regarded as three undirected networks (step 1 in Figure 1) consisting of 2779, 2496, and 2085 Jamu formulas, respectively (Table 3). Figure 2 shows visualization of 0.7% Jamu networks using Cytoscape Spring Embedded layout. We verified that the degree distributions of the Jamu networks are somehow close to those of scale-free networks, that is, roughly are of power law type. However, in the high-degree region the power law structure is broken (Figure 3). Nearly accurate relation of power laws between medicinal herbs and the number of formulas utilizing them was observed in Jamu system but not in Kampo (Japanese crude drug system) [4]. The difference of formulas between Jamu and Kampo can be explained by herb selection by medicinal researchers based on the optimization process of selection [4]. Thus, the broken structure of power law corresponding to Jamu networks is associated with the fact that selection of Jamu pairs based on ingredient correlation leads to nonrandom selection. We also constructed random networks according to Erdős-Rényi (ER) model [21], Barabási-Albert (BA) model [22], and Vazquez's Connecting Nearest Neighbor (CNN) model [23] of the same size corresponding to each of the real Jamu network. We used Cytoscape Network Analyzer plugin [24] and R software for analyzing the characteristics of both the Jamu and the random networks.
Table 3

Statistics of three datasets.

Parameters0.7%0.5%0.3%
Network statisticsTotal pairs34,45424,61014,766
Minimum correlation0.5960.6650.718
Number of Jamu formulas2,7792,4962,085
Average degree24.819.714.2
(Random network: ER)(24.8 ± 0.0)(19.7 ± 0.0)(14.2 ± 0.0)
(Random network: BA)(24.7 ± 0.1)(19.7 ± 0.1)(14.1 ± 0.1)
(Random network: CNN)(24.7 ± 0.4)(19.7 ± 0.4)(14.0 ± 0.4)
Clustering coefficient0.5210.5200.540
(Random network: ER)(0.009 ± 0.000)(0.008 ± 0.000)(0.007 ± 0.000)
(Random network: BA)(0.030 ± 0.001)(0.028 ± 0.001)(0.026 ± 0.001)
(Random network: CNN)(0.246 ± 0.008)(0.239 ± 0.008)(0.233 ± 0.010)
Number of connected components69119254
(Random networks: ER, BA, CNN)(1)(1)(1)
Network diameter 151720
(Random network: ER)(4.0 ± 0.0)(4.0 ± 0.0)(5.0 ± 0.0)
(Random network: BA)(10.8 ± 0.8)(11.2 ± 1.5)(10.8 ± 0.9)
(Random network: CNN)(14.6 ± 1.9)(14.1 ± 1.4)(14.7 ± 1.3)
Network density0.0080.0080.007
(Random network: ER)(0.009 ± 0.000)(0.008 ± 0.000)(0.007 ± 0.000)
(Random network: BA)(0.009 ± 0.000)(0.008 ± 0.000)(0.007 ± 0.000)
(Random network: CNN)(0.009 ± 0.000)(0.008 ± 0.000)(0.007 ± 0.000)

DPClusOTotal number of clusters1,7461,411938
Number of clusters with more than 2 Jamu1,296873453
(%)(74.2)(61.9)(48.3)
Number of Jamu formulas in the biggest cluster11810489
Figure 2

The network consisting of 0.7% Jamu pairs (correlation value above or equal to 0.596).

Figure 3

Degree distributions of three Jamu networks roughly follow power law. The x-axis corresponds to the log of degree of a node in the Jamu network and the y-axis corresponds to the log of the number of Jamu.

We determined five statistical indexes, that is, average degree, clustering coefficient, number of connected component, network diameter, and network density of each Jamu network and also of each random network. The clustering coefficient C of a node n is defined as C = 2e /(k (k − 1)), where k is the number of neighbors of n and e is the number of connected pairs between all neighbors of n. The network diameter is the largest distance between any two nodes. If a network is disconnected, its diameter is the maximum of all diameters of its connected components. A network's density is the ratio of the number of edges in the network over the total number of possible edges between all pairs of nodes (which is n(n − 1)/2, where n is the number of vertices, for an undirected graph). The average number of neighbors and the network density are the same for the real and random networks of the same size as it is shown in Table 3. In case of 0.7% and 0.5% real networks, the clustering coefficient is roughly the same and in case of 0.3% the clustering coefficient is somewhat larger. The number of connected components and the diameter of the Jamu networks gradually decrease as the network grows bigger by addition of more nodes and edges. Very different values corresponding to clustering coefficient, connected component, and network diameter imply that the Jamu networks are quite different from all 3 types of random networks. The differences between Jamu networks and ER random networks are the largest. Random networks constructed based on other two models are also substantially different from Jamu networks. Based on the fact that the random networks constructed based on all three types of models are different from the Jamu networks, it can be concluded that structure of Jamu networks is reasonably biased and thus might contain certain information about plant-disease relations. Specially, much higher value corresponding to clustering coefficient indicates that there are clusters in the networks worthy to be investigated. To extract clusters from the Jamu networks (step 2 in Figure 1) we applied DPClusO network clustering algorithm [14] to generate overlapping clusters based on density and periphery tracking.

3.2. Supervised Clustering Based on DPClusO

DPClusO is a general-purpose clustering algorithm and useful for finding overlapping cohesive groups in an undirected simple graph for any type of application. It ensures coverage and performs robustly in case of random addition, removal, and rearrangement of edges in protein-protein interaction (PPI) networks [14]. While applying DPClusO, the parameter values of density and cluster property that we used in this experiment are 0.9 and 0.5, respectively [15]. Table 3 shows the summary of clustering result by DPClusO. Because clusters consisting of two Jamu formulas are trivial clusters, for the next steps we only use clusters each of which consists of 3 or more Jamu formulas. The number of total clusters increases along with the larger dataset, although the threshold correlation between Jamu pairs decreases. We evaluated the clustering result using matching score to determine dominant disease for every cluster (step 3 in Figure 1). Matching score of a cluster is the ratio of the highest number of Jamu associated with the same disease to the total number of Jamu in the cluster. Thus matching score is a measure to indicate how strongly a disease is associated to a cluster. Figure 4 shows the distribution of the clusters with respect to matching score from three datasets. All datasets have the highest frequency of clusters at matching score >0.9 and overall most of the clusters have higher matching score, which means most of the DPClusO generated clusters can be confidently related to a dominant disease. Furthermore the number of clusters with matching score >0.9 is remarkably larger compared to the same in other ranges of matching score in case of the 0.3% dataset (Figure 4(c)). If we compare the ratio of frequency of clusters at matching score >0.9 for every dataset, the 0.3% dataset has the highest ratio with 40.84% (of 453), compared to 29.67% (of 873) and 21.91% (of 1296), in case of 0.5% and 0.7% datasets, respectively. Thus, the most reliable species to disease relations can be predicted at matching score >0.9 corresponding to the clusters generated from 0.3% dataset.
Figure 4

Distribution of clusters based on matching score.

Figure 5(a) shows the success rate for all 3 datasets with respect to threshold matching scores. Success rate is defined as the ratio of the number of clusters with matching score larger than the threshold to the total number of clusters. As expected it tends to produce lower success rate if we decrease correlation value to create the datasets. However more clusters are generated and more information can be extracted when we lower the threshold correlation value. The success rate increases rapidly as the matching score decreases from 0.9 to 0.6 and after that the slope of increase of success rate decreases. Therefore in this study we empirically decide 0.6 as the threshold matching score to predict plant-disease relations.
Figure 5

(a) Success rate and (b) number of predicted plants with respect to matching score thresholds.

3.3. Assignment of Plants to Disease

By using DPClusO resulting clusters, we assigned plants to classes of disease. Based on a threshold matching score we assigned dominant disease to a cluster. Then we assign a plant to a cluster by way of analyzing the ingredients of the Jamu formulas belonging to that cluster and determining the highest frequency plant, that is, the plant that is used for maximum number Jamu belonging to that cluster (step 4 in Figure 1). Thus we assign a disease and a plant to each cluster having matching score greater than a threshold. Our hypothesis is that the disease and the plant assigned to the same cluster are related. The total number of assigned plants depends on matching score value. Figure 5(b) shows the number of predicted plants that can be assigned to diseases in the context of matching score. With higher matching score value, the number of predicted plants assigned to classes of disease is supposed to remain similar or decrease but the reliability of prediction increases. In Figure 5(b) a sudden change in the number of predicted plants is seen at matching score 0.6 which we consider as empirical threshold in this work. Based on the 0.7% dataset, the largest number of plants (135 plants, Table 4) was assigned to diseases. There are 63 plants assigned to only one class of disease, whereas the other 72 plants are assigned to at least two or more classes of disease (Figure 6).
Table 4

List of plants assigned to each disease.

NumberPlants  nameHit-miss  status
A.  Disease:  blood  and  lymph diseases     
1 Tamarindus  indica Hit
2 Allium  sativum Hit
3 Tinospora  tuberculata Hit
4 Piper  retrofractum Hit   
5 Syzygium  aromaticum Hit
6 Bupleurum  falcatum Hit   
7 Graptophyllum  pictum Hit   
8 Plantago  major Hit   
9 Zingiber  officinale Hit
10 Cinnamomum   burmannii Hit
11 Soya  max Miss
12 Kaempferia  galanga Hit   
13 Curcuma  longa Hit
14 Piper  nigrum Hit   
15 Zingiber  aromaticum Hit
16 Phyllanthus  urinaria Hit
17 Oryza  sativa Hit   
18 Myristica  fragrans Hit
19 Alstonia  scholaris Hit
20 Syzygium  polyanthum Miss   
21 Andrographis  paniculata Hit
22 Sida  rhombifolia Miss   
23 Cyperus  rotundus Hit   
24 Sonchus  arvensis Miss   
25 Curcuma  aeruginosa Hit
26 Curcuma  xanthorrhiza Hit   

B.  Disease:  cancers   
1 Catharanthus  roseus Hit   

C.  Disease:  the  digestive  system   
1 Foeniculum  vulgare Hit   
2 Glycyrrhiza  uralensis Hit
3 Imperata  cylindrica Hit   
4 Zingiber  purpureum Hit
5 Physalis  peruviana Hit   
6 Punica  granatum Hit
7 Echinacea  purpurea Hit   
8 Zingiber  officinale Hit
9 Psidium  guajava Hit   
10 Baeckea  frutescens Hit
11 Amomum  compactum Hit   
12 Cinnamomum  burmannii Hit
13 Melaleuca  leucadendra Hit   
14 Caesalpinia  sappan Hit
15 Parkia  roxburghii Hit   
16 Rheum  tanguticum Hit   
17 Kaempferia  galanga Hit   
18 Coriandrum  sativum Hit   
19 Curcuma  longa Hit   
20 Zingiber  aromaticum Hit   
21 Phyllanthus  urinaria Hit   
22 Myristica  fragrans Hit   
23 Hydrocotyle  asiatica Hit
24 Carica  papaya Hit   
25 Mentha  arvensis Hit   
26 Lepiniopsis  ternatensis Hit   
27 Helicteres  isora Hit   
28 Andrographis  paniculata Hit   
29 Symplocos  odoratissima Hit   
30 Schisandra  chinensis Hit   
31 Blumea  balsamifera Hit   
32 Silybum  marianum Hit
33 Cinnamomum sintoc Hit   
34 Elephantopus  scaber Hit   
35 Curcuma  aeruginosa Hit   
36 Kaempferia  pandurata Hit   
37 Curcuma  xanthorrhiza Hit   
38 Curcuma  mangga Hit
39 Curcuma  zedoaria Hit   
40 Daucus  carota Hit
41 Matricaria  chamomilla Hit
42 Cymbopogon  nardus Hit

D.  Disease:  female-specific  diseases   
1 Foeniculum  vulgare Hit   
2 Imperata  cylindrica Hit   
3 Tamarindus  indica Hit   
4 Pluchea  indica Hit
5 Piper  retrofractum Hit   
6 Punica  granatum Hit   
7 Uncaria  rhynchophylla Hit   
8 Zingiber  officinale Hit   
9 Guazuma  ulmifolia Hit
10 Nigella  sativa Hit   
11 Terminalia  bellirica Hit   
12 Baeckea  frutescens Hit   
13 Phaseolus  radiatus Hit   
14 Amomum  compactum Hit
15 Sauropus  androgynus Hit   
16 Usnea  misaminensis Hit   
17 Cinnamomum burmannii Hit   
18 Melaleuca  leucadendra Hit   
19 Parameria  laevigata Hit   
20 Parkia  roxburghii Hit   
21 Piper  cubeba Hit   
22 Kaempferia  galanga Hit   
23 Coriandrum  sativum Hit   
24 Kaempferia  angustifolia Hit   
25 Curcuma  longa Hit   
26 Zingiber  aromaticum Hit   
27 Languas  galanga Hit   
28 Galla  lusitania Hit   
29 Quercus  lusitanica Hit   
30 Hydrocotyle  asiatica Hit   
31 Areca  catechu Hit   
32 Lepiniopsis  ternatensis Hit   
33 Helicteres  isora Hit
34 Piper  betle Hit   
35 Elephantopus  scaber Hit
36 Kaempferia  pandurata Hit   
37 Curcuma  xanthorrhiza Hit   
38 Sesbania  grandiflora Hit   

E.  Disease:  the  heart  and  blood  vessels   
1 Allium  sativum Hit   
2 Curcuma  longa Hit
3 Morinda  citrifolia Hit
4 Homalomena  occulta Hit
5 Hydrocotyle  asiatica Hit   
6 Alstonia  scholaris Hit
7 Syzygium  polyanthum Miss
8 Andrographis  paniculata Hit
9 Apium  graveolens Miss   
10 Imperata  cylindrica Hit   

F.  Disease:  male-specific  diseases   
1 Cucurbita  pepo Miss   
2 Serenoa  repens Miss   
3 Baeckea  frutescens Hit   
4 Phaseolus  radiatus Hit   
5 Curcuma  longa Hit   
6 Elephantopus  scaber Hit   

G.  Disease:  muscle  and  bone   
1 Foeniculum  vulgare Hit   
2 Clausena  anisum-olens Hit
3 Zingiber  purpureum Hit   
4 Allium  sativum Hit   
5 Strychnos  ligustrina Hit   
6 Tinospora  tuberculata Hit
7 Piper  retrofractum Hit   
8 Syzygium  aromaticum Hit   
9 Cola  nitida Hit
10 Ginkgo  biloba Hit
11 Panax  ginseng Hit   
12 Equisetum  debile Hit
13 Zingiber  officinale Hit   
14 Ganoderma  lucidum Hit   
15 Nigella  sativa Hit   
16 Terminalia  bellirica Hit
17 Baeckea  frutescens Hit
18 Amomum  compactum Hit   
19 Cinnamomum burmannii Hit   
20 Melaleuca  leucadendra Hit   
21 Parameria  laevigata Hit
22 Psophocarpus  tetragonolobus Hit
23 Parkia  roxburghii Hit   
24 Piper  cubeba Hit
25 Kaempferia  galanga Hit   
26 Coriandrum  sativum Hit   
27 Cola  acuminata Hit   
28 Coffea  arabica Hit   
29 Orthosiphon  stamineus Hit   
30 Curcuma  longa Hit   
31 Piper  nigrum Hit   
32 Alpinia  galanga Hit   
33 Vitex  trifolia Hit   
34 Zingiber  amaricans Hit
35 Zingiber  zerumbet Hit   
36 Zingiber  aromaticum Hit   
37 Languas  galanga Hit   
38 Massoia  aromatica Hit   
39 Morinda  citrifolia Hit   
40 Carum  copticum Hit
41 Panax  pseudoginseng Hit
42 Oryza  sativa Hit   
43 Myristica  fragrans Hit   
44 Pandanus  amaryllifolius Hit   
45 Eurycoma  longifolia Hit   
46 Hydrocotyle  asiatica Hit   
47 Areca  catechu Hit
48 Mentha  arvensis Hit
49 Lepiniopsis  ternatensis Hit   
50 Pimpinella  pruatjan Hit   
51 Andrographis  paniculata Hit   
52 Blumea  balsamifera Hit   
53 Cymbopogon  nardus Hit   
54 Sida  rhombifolia Hit   
55 Cinnamomum sintoc Hit   
56 Piper  betle Hit
57 Talinum  paniculatum Hit   
58 Elephantopus  scaber Hit   
59 Cyperus  rotundus Hit   
60 Curcuma  aeruginosa Hit   
61 Kaempferia  pandurata Hit
62 Curcuma  xanthorrhiza Hit   
63 Tribulus  terrestris Hit   
64 Corydalis  yanhusuo Hit   
65 Pausinystalia  yohimbe Hit   

H.  Disease:  nutritional  and metabolic  diseases   
1 Foeniculum  vulgare Hit   
2 Glycyrrhiza  uralensis Hit   
3 Zingiber  purpureum Hit   
4 Allium  sativum Hit   
5 Tinospora  tuberculata Hit   
6 Pandanus  conoideus Hit   
7 Syzygium  aromaticum Hit   
8 Punica  granatum Hit   
9 Zingiber  officinale Hit   
10 Guazuma  ulmifolia Hit   
11 Nigella  sativa Hit   
12 Amomum  compactum Hit
13 Cinnamomum burmannii Hit   
14 Parameria  laevigata Hit   
15 Caesalpinia  sappan Hit   
16 Soya  max Hit
17 Cocos  nucifera Hit   
18 Rheum  tanguticum Hit   
19 Piper  cubeba Hit
20 Murraya  paniculata Hit   
21 Kaempferia  galanga Hit
22 Coffea  arabica Hit
23 Orthosiphon  stamineus Hit   
24 Curcuma  longa Hit   
25 Piper  nigrum Hit
26 Zingiber  aromaticum Hit   
27 Aloe  vera Hit   
28 Phaleria  papuana Hit   
29 Galla  lusitania Hit   
30 Quercus  lusitanica Hit   
31 Morinda  citrifolia Hit   
32 Myristica  fragrans Hit
33 Momordica  charantia Hit   
34 Areca  catechu Hit   
35 Lepiniopsis  ternatensis Hit   
36 Alstonia  scholaris Hit   
37 Hibiscus  sabdariffa Hit   
38 Laminaria  japonica Hit   
39 Syzygium  polyanthum Hit   
40 Andrographis  paniculata Hit   
41 Sindora  sumatrana Hit
42 Cassia  angustifolia Hit   
43 Woodfordia  floribunda Hit   
44 Piper  betle Hit   
45 Spirulina Hit   
46 Stevia  rebaudiana Hit   
47 Theae  sinensis Hit   
48 Sonchus  arvensis Hit   
49 Curcuma  heyneana Hit   
50 Curcuma  aeruginosa Hit   
51 Kaempferia  pandurata Hit
52 Curcuma  xanthorrhiza Hit   
53 Curcuma  zedoaria Hit
54 Olea  europaea Hit   

I.  Disease  respiratory  diseases   
1 Foeniculum  vulgare Hit   
2 Clausena  anisum-olens Hit   
3 Glycyrrhiza  uralensis Hit   
4 Zingiber  purpureum Hit   
5 Piper  retrofractum Hit
6 Syzygium  aromaticum Hit   
7 Gaultheria  punctata Hit   
8 Panax  ginseng Hit   
9 Equisetum  debile Hit
10 Zingiber  officinale Hit   
11 Citrus  aurantium Hit
12 Nigella  sativa Hit
13 Amomum  compactum Hit   
14 Cinnamomum burmannii Hit   
15 Melaleuca  leucadendra Hit   
16 Parkia  roxburghii Hit   
17 Cocos  nucifera Hit   
18 Piper  cubeba Hit   
19 Kaempferia  galanga Hit   
20 Coriandrum  sativum Hit   
21 Curcuma  longa Hit   
22 Piper  nigrum Hit   
23 Zingiber  aromaticum Hit   
24 Languas  galanga Hit   
25 Mentha  piperita Hit   
26 Oryza  sativa Hit
27 Myristica  fragrans Hit   
28 Pandanus  amaryllifolius Hit
29 Hydrocotyle  asiatica Hit
30 Mentha  arvensis Hit   
31 Lepiniopsis  ternatensis Hit   
32 Helicteres  isora Hit   
33 Blumea  balsamifera Hit   
34 Cymbopogon  nardus Hit   
35 Piper  betle Hit   
36 Curcuma  xanthorrhiza Hit   
37 Salix  alba Hit
38 Matricaria  chamomilla Miss

J.  Disease:  skin  and  connective  tissue   
1 Strychnos  ligustrina Hit   
2 Merremia  mammosa Hit
3 Piper  retrofractum Hit
4 Santalum  album Hit   
5 Zingiber  officinale Hit
6 Citrus  aurantium Hit   
7 Citrus  hystrix Hit   
8 Cassia  siamea Hit   
9 Cocos  nucifera Hit   
10 Trigonella  foenum-graecum Hit   
11 Orthosiphon  stamineus Hit   
12 Curcuma  longa Hit   
13 Vetiveria  zizanioides Hit   
14 Aloe  vera Hit   
15 Rosa  chinensis Hit   
16 Jasminum  sambac Hit   
17 Phyllanthus  urinaria Hit   
18 Mentha  piperita Hit   
19 Oryza  sativa Hit   
20 Myristica  fragrans Hit
21 Hydrocotyle  asiatica Hit   
22 Lepiniopsis  ternatensis Hit   
23 Alstonia  scholaris Hit   
24 Andrographis  paniculata Hit   
25 Cymbopogon  nardus Hit   
26 Piper  betle Hit   
27 Theae  sinensis Hit   
28 Curcuma  heyneana Hit   
29 Kaempferia  pandurata Hit
30 Curcuma  xanthorrhiza Hit   
31 Melaleuca  leucadendra Hit   
32 Matricaria  chamomilla Miss

K.  Disease:  the  urinary  system   
1 Foeniculum  vulgare Hit
2 Imperata  cylindrica Hit
3 Strychnos  ligustrina Hit
4 Plantago  major Hit   
5 Zingiber  officinale Hit
6 Cinnamomum burmannii Hit
7 Strobilanthes  crispus Hit   
8 Kaempferia  galanga Hit
9 Orthosiphon  stamineus Hit   
10 Phyllanthus  urinaria Hit   
11 Blumea  balsamifera Hit
12 Sonchus  arvensis Hit   
13 Curcuma  xanthorrhiza Hit   

*indicates that plant will not assigned if we use matching score >0.7.

Figure 6

Distribution of 135 plants assigned based on 0.7% dataset with respect to the number of diseases they are assigned to.

3.4. Evaluation of the Supervised Clustering Based on DPClusO

We used previously published results [6] as gold standard to evaluate our results. The previous study assigned plants to 9 kinds of efficacy whereas we assigned the plants to 18 disease classes (16 from NCBI and 2 additional classes). For the sake of evaluation we got done a mapping of the 18 disease classes to 9 efficacy classes by a professional doctor, which is shown in Table 5. Table 6 shows the prediction result of plant-disease relations for all 3 datasets, corresponding to clusters with matching score greater than 0.6. Table 6 also shows corresponding efficacy, the number of assigned plants, number of correctly predicted plants, and true positive rates (TPR), respectively.
Table 5

Relation between disease classes in NCBI and efficacy classes reported by Afendi et al. [6].

Class of diseaseRef.Efficacy class
D1 Blood and lymph diseasesNCBI E7 Pain/inflammation (PIN)
D2 CancersNCBI E7 Pain/inflammation (PIN)
D3 The digestive systemNCBI E4 Gastrointestinal disorders (GST)
E7 Pain/inflammation (PIN)
D4 Ear, nose, and throatNCBI E7 Pain/inflammation (PIN)
D5 Diseases of the eyeNCBI E7 Pain/inflammation (PIN)
D6 Female-specific diseasesNCBI E5 Female reproductive organ problems (FML)
D7 Glands and hormonesNCBI E7 Pain/inflammation (PIN)
D8 The heart and blood vesselsNCBI E7 Pain/inflammation (PIN)
D9 Diseases of the immune systemNCBI E7 Pain/inflammation (PIN)
D10 Male-specific diseasesNCBI E6 Musculoskeletal and connective tissue disorders (MSC)
D11 Muscle and boneNCBI E6 Musculoskeletal and connective tissue disorders (MSC)
D12 Neonatal diseasesNCBI E7 Pain/inflammation (PIN)
D13 The nervous systemNCBI E7 Pain/inflammation (PIN)
D14 Nutritional and metabolic diseasesNCBI E2 Disorders of appetite (DOA)
E4 Gastrointestinal disorders (GST)
D15 Respiratory diseasesNCBI E8 Respiratory disease (RSP)
E7 Pain/inflammation (PIN)
D16 Skin and connective tissueNCBI E9 Wounds and skin infections (WND)
D17 The urinary system E1 Urinary related problems (URI)
D18 Mental and behavioural disorders E3 Disorders of mood and behavior (DMB)
Table 6

The prediction result of plant-disease relations using matching score >0.6.

Class of disease Corresponding efficacy0.7% dataset0.5% dataset0.3% dataset
Number of assigned plantsCorrect predictionTrue positive rateNumber of assigned plantsCorrect predictionTrue positive rateNumber of assigned plantsCorrect predictionTrue positive rate
D1E72622 0.85 2420 0.83 2420 0.83
D2E711 1.00 55 1.00 11 1.00
D3E4 4242 1.00 3333 1.00 2828 1.00
E738 0.90 30 0.91 25 0.89
D4E7000000
D5E7000000
D6E53838 1.00 3737 1.00 3232 1.00
D7E7000000
D8E7108 0.80 87 0.88 65 0.83
D9E7000011 1.00
D10E664 0.67 2031 0.33
D11E66565 1.00 7171 1.00 6060 1.00
D12E7000000
D13E7000055 1.00
D14E2 5444 0.81 4536 0.80 3526 0.74
E454 1.00 45 1.00 35 1.00
D15E7 3837 0.97 3434 1.00 3333 1.00
E831 0.82 30 0.88 29 0.88
D16E93231 0.97 3232 1.00 2727 1.00
D17E11313 1.00 99 1.00 88 1.00
D18E30055 1.00 44 1.00

Total assigned plants135129117
We determined TPR corresponding to a disease/efficacy class by calculating the ratio of the number of correct prediction to the number of all predictions. When a disease corresponds to more than one kind of efficacy, the highest TPR can be considered the TPR for the corresponding disease. For all 3 datasets the TPR corresponding to each disease is roughly 90% or more. The 0.3% dataset consists of Jamu pairs with higher correlation values and based on this dataset 117 plants are assigned to 14 disease classes. The 0.7% dataset contains more Jamu pairs and assigned plants to 11 disease classes, one less disease class compared to 0.5% dataset. The two disease classes covered by 0.3% dataset but not covered by 0.5% and 0.7% datasets are the nervous system (D13) and disease of the immune system (D9). The only disease class covered by 0.3% and 0.5% datasets but not covered by 0.7% dataset is mental and behavioural disorders (D18). The larger dataset network tends to have lower coverage of disease classes. The number of Jamu pairs, that is, the number of edges in the network, affect the number of DPClusO resulting clusters and number of Jamu formulas per cluster. As a consequence, for the larger dataset networks, the success rate becomes lower and the coverage of disease classes is lower but prediction of more plant-disease relations can be achieved.

4. Conclusions

This paper introduces a novel method called supervised clustering for analyzing big biological data by integrating network clustering and selection of clusters based on supervised learning. In the present work we applied the method for data mining of Jamu formulas accumulated in KNApSAcK database. Jamu networks were constructed based on correlation similarities between Jamu formulas and then network clustering algorithm DPClusO was applied to generate high density Jamu modules. For the analysis of the next steps potential clusters were selected by supervised learning. The successful clusters containing several Jamu related to the same disease might be useful for finding main ingredient plant for that disease and the lower matching score value clusters will be associated with varying plants which might be supporting ingredients. By applying the proposed method important plants from Jamu formulas for every classes of disease were determined. The plant to disease relations predicted by proposed network based method were evaluated in the context of previously published results and were found to produce a TPR of 90%. For the larger dataset networks, success rate and the coverage of disease classes become lower but prediction of more plant-disease relations can be achieved.
  11 in total

1.  Emergence of scaling in random networks

Authors: 
Journal:  Science       Date:  1999-10-15       Impact factor: 47.728

2.  Herbal medicines put into context.

Authors:  E Ernst
Journal:  BMJ       Date:  2003-10-18

3.  Growing network with local rules: preferential attachment, clustering hierarchy, and degree correlations.

Authors:  Alexei Vázquez
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2003-05-07

4.  KNApSAcK family databases: integrated metabolite-plant species databases for multifaceted plant research.

Authors:  Farit Mochamad Afendi; Taketo Okada; Mami Yamazaki; Aki Hirai-Morita; Yukiko Nakamura; Kensuke Nakamura; Shun Ikeda; Hiroki Takahashi; Md Altaf-Ul-Amin; Latifah K Darusman; Kazuki Saito; Shigehiko Kanaya
Journal:  Plant Cell Physiol       Date:  2011-11-28       Impact factor: 4.927

5.  BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network.

Authors:  Xiaohui Chen; Ming Chen; Kaida Ning
Journal:  Bioinformatics       Date:  2006-09-27       Impact factor: 6.937

6.  Efficacy prediction of jamu formulations by PLS modeling.

Authors:  Farit M Afendi; Latifah K Darusman; Aki Hirai Morita; Md Altaf-Ul-Amin; Hiroki Takahashi; Kensuke Nakamura; Ken Tanaka; Shigehiko Kanaya
Journal:  Curr Comput Aided Drug Des       Date:  2013-03       Impact factor: 1.606

7.  BisoGenet: a new tool for gene network building, visualization and analysis.

Authors:  Alexander Martin; Maria E Ochagavia; Laya C Rabasa; Jamilet Miranda; Jorge Fernandez-de-Cossio; Ricardo Bringas
Journal:  BMC Bioinformatics       Date:  2010-02-17       Impact factor: 3.169

Review 8.  Data Mining Methods for Omics and Knowledge of Crude Medicinal Plants toward Big Data Biology.

Authors:  Farit M Afendi; Naoaki Ono; Yukiko Nakamura; Kensuke Nakamura; Latifah K Darusman; Nelson Kibinge; Aki Hirai Morita; Ken Tanaka; Hisayuki Horai; Md Altaf-Ul-Amin; Shigehiko Kanaya
Journal:  Comput Struct Biotechnol J       Date:  2013-03-23       Impact factor: 7.271

9.  Modifying the DPClus algorithm for identifying protein complexes based on new topological structures.

Authors:  Min Li; Jian-er Chen; Jian-xin Wang; Bin Hu; Gang Chen
Journal:  BMC Bioinformatics       Date:  2008-09-25       Impact factor: 3.169

Review 10.  Topology of molecular interaction networks.

Authors:  Wynand Winterbach; Piet Van Mieghem; Marcel Reinders; Huijuan Wang; Dick de Ridder
Journal:  BMC Syst Biol       Date:  2013-09-16
View more
  3 in total

1.  Development of a biomarker database toward performing disease classification and finding disease interrelations.

Authors:  Shaikh Farhad Hossain; Ming Huang; Naoaki Ono; Aki Morita; Shigehiko Kanaya; Md Altaf-Ul-Amin
Journal:  Database (Oxford)       Date:  2021-03-11       Impact factor: 3.451

2.  Finding an appropriate equation to measure similarity between binary vectors: case studies on Indonesian and Japanese herbal medicines.

Authors:  Sony Hartono Wijaya; Farit Mochamad Afendi; Irmanida Batubara; Latifah K Darusman; Md Altaf-Ul-Amin; Shigehiko Kanaya
Journal:  BMC Bioinformatics       Date:  2016-12-07       Impact factor: 3.169

3.  Development and mining of a volatile organic compound database.

Authors:  Azian Azamimi Abdullah; Md Altaf-Ul-Amin; Naoaki Ono; Tetsuo Sato; Tadao Sugiura; Aki Hirai Morita; Tetsuo Katsuragi; Ai Muto; Takaaki Nishioka; Shigehiko Kanaya
Journal:  Biomed Res Int       Date:  2015-09-30       Impact factor: 3.411

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.