
Technical Aspects of Nominal Partitions on Accuracy of Data Mining Classification of Intestinal Microbiota - Comparison between 7 Restriction Enzymes.

Abstract

The application of data mining analyses (DM) is effective for the quantitative classification of human intestinal microbiota (HIM). However, various technical problems remain to be overcome. This paper deals with one major technical problem, the number of nominal partitions (NP) of the target dataset. We used terminal restriction fragment length polymorphism data obtained from the feces of 92 Japanese men. The data comprised operational taxonomic units (OTUs) and the smoking and drinking habits of the subjects, which had been classified effectively with two NP (2-NP; Yes or No). Using the same OTU data, 3-NP and 5-NP were examined here, and the accuracy of prediction and the reliability of the OTUs selected by DM were compared with those of the former 2-NP. The primer-restriction enzyme combinations also affected accuracy, and 7 enzymes were compared. Some subjects possessed HIM at the border zones of partitions, and the greater the number of partitions, the lower the DM accuracy obtained. The application of balance nodes, which boost and duplicate the data, was able to improve accuracy. More accurate and reliable DM operations are applicable to the classification of unknown subjects for identifying various characteristics, including disease.


Keywords: accuracy of classification; balance node; data mining analysis; decision tree; human intestinal microbiota; nominal partitions of data; operational taxonomic unit

Year: 2014 PMID: 25032086 PMCID: PMC4098652 DOI: 10.12938/bmfh.33.129

Source DB: PubMed Journal: Biosci Microbiota Food Health ISSN: 2186-3342

INTRODUCTION

Human intestinal microbiota (HIM) is related to our health, and practical research on its relationship with the human immune system and diseases is now being widely performed. Our previous papers [1,2,3] assessed HIM data by data mining analysis (DM) for quantitative classification of the relationship between HIM and subject characteristics. The results were fruitful, but because this application of DM to HIM is unique, more case studies must be accumulated for further DM operations. The selection of primer-restriction enzymes and the number of nominal partitions (NP) of assigned characteristics are important factors for reliable applications. This paper aims to compare the effects of both factors, which are the major technical problems in practical applications, on obtaining accurate and dependable DM results. The number of NP, which is a partition of assigned characteristics and depends on the purpose of the analysis, directly affects the accuracy of the DM results. In other words, proper NP application to the data is necessary. Our previous paper [3] dealt with a simple 2 nominal partition (2-NP), i.e., Yes or No, and compared accuracy among the 7 restriction enzymes. Here, we examine 2 further types of NP, 3-NP and 5-NP, and compare the 3 types of NP, including 2-NP, for 2 characteristics, as shown in Table 1; the 2-NP results were reported previously [3] but are included here for comparison. The original operational taxonomic unit (OTU) data applied in this paper were the same as reported in our previous papers [1,2,3,4], but the detailed NPs are different. As in the previous paper [3], dietary factors for the healthy male subjects were controlled, which is an important starting point for the quantitative analysis of HIM.

Table 1.

2 to 5 nominal partition (NP) of the 92 subjects

N.: number; cess. P.: smoking cessation period; 5Y: 5 years; w.: week; d.: day; AlOH, A: alcohol; Shadows at '2-NP' indicate that the results have been reported previously [3], but are shown here for comparison to 3-NP and 5-NP.

HIM are represented here as OTUs by terminal restriction fragment length polymorphism (T-RFLP) analysis. The relationship between OTUs and subject characteristics was assessed by cluster analysis, using the methods of Jin [4] and Andoh [5, 6], or by Pearson correlation coefficients and principal component analysis. To date, DM has been applied to the relationships between genes, single nucleotide polymorphisms (Merelli [7]) and inflammatory bowel disease (Merelli [8]), as well as to age-dependent genes (Kirschner [9]) and hormone levels (Modlin [10]), but has not been applied to general HIM, i.e., OTUs. OTUs are thought to contain numerous types of bacteria, and their composition directly affects the accuracy of DM classifications. We therefore applied 7 restriction enzymes for better comparisons of subject classification. DM will be applied to classify all OTU data whose characteristics have various NPs, e.g., types or symptoms of diseases; thus, for effective DM operation, systematic comparisons are required and are examined here.

MATERIALS AND METHODS

As reported previously [4], to avoid the influence of dietary factors, we designed identical meals (1,879 kcal/day), which were fed for 3 days to 92 healthy male volunteers living in Japan. The age and body mass index (BMI) of the subjects were 21–59 years (average: 36.8 years) and 17.3–30.2 kg/m² (average: 22.6 kg/m²), respectively. Fecal samples were analyzed by T-RFLP using 7 restriction enzymes [2, 4]. T-RFLP was applied due to its reproducibility, comparatively low cost and convenience with regard to DM operation. Studies were performed in accordance with the protocols approved by the Riken Research Ethics Committee (Wakou 2009-3rd 21-13), and the OTU data were accumulated by the Benno Laboratory, Riken, Japan. Bacterial DNA was isolated from feces using a modification of the method described by Matsuki [11]. Amplification of fecal 16S rRNA, restriction enzyme digestion, size fractionation of T-RFs and T-RFLP analysis were carried out as described previously [12,13,14]. Details of amplification and T-RFLP analysis with the 7 restriction enzymes, i.e., 516f-BslI, 516f-HaeIII, 27f-MspI, 27f-AluI, 35f-HhaI, 35f-MspI and 35f-AluI, were as described in our previous papers [2, 4]. The amounts for each OTU represent the fluorescence intensity and concentration. The obtained OTU data are abbreviated here as B--- (---: base pair number) for 516f-BslI, HA--- for 516f-HaeIII, M--- for 27f-MspI, A--- for 27f-AluI, QHh--- for 35f-HhaI, QM--- for 35f-MspI and QA--- for 35f-AluI. We had 2 groups of OTUs: 516f- + 27f- (4 restriction enzymes), and 35f- (3 restriction enzymes). The component numbers of these 7 enzyme groups were 27·B, 33·HA, 20·M, 40·A, 31·QHh, 34·QM and 48·QA; thus, when all the enzyme components within each group were combined, the former had a maximum of 120 OTUs (27 + 33 + 20 + 40), and the latter had a maximum of 113 OTUs (31 + 34 + 48). Considering the balance between the number of subjects (92) and the number of OTU components, and to avoid the problem of field alignment sequences described in previous reports [2, 3], we did not mix the data from the 2 groups.

Various sets of restriction enzymes were combined, and the data were arranged with the answers of the 92 subjects. The resulting 2-dimensional Excel data were analyzed using DM software (IBM-SPSS, Clementine14). The Classification and Regression Tree (C&RT) modeling system, a DM algorithm and the most typical method of DM, provides a decision tree (Dt). The Dt explicitly classifies the various groups of subjects according to the assigned characteristics, as shown in Table 1. C&RT divides subjects into two subsets by comparing the Gini coefficient according to the OTU data, such that the subjects within each subset are more homogeneous than in the previous subset. Compared with other DM modeling systems, the C&RT system is flexible and allows unequal misclassification costs to be considered. A distinctive feature of DM and the constructed Dt is that a single selected OTU is used at each step of the Dt. The default setting of the C&RT system grows a Dt to 5 steps. The balance nodes applied here correct the imbalances in the dataset, which readily develop with higher NPs, so that the data conform to the specified test criteria and more accurate results can be obtained. If necessary, balancing is carried out by boosting the occurrence of infrequent values at the time of Dt construction.
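The Gini-based splitting that C&RT performs at each step can be illustrated with a minimal Python sketch. This is not the Clementine implementation: the function names (`gini`, `best_split`, `best_otu`), the OTU names and the intensity values are hypothetical, and candidate cut-off values are assumed to be midpoints between adjacent sorted intensities.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Best binary cut-off on one OTU as (threshold, weighted impurity).

    Returns (None, impurity of the unsplit node) if no cut-off improves on it.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut-off can separate identical intensities
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((low + high) / 2, score)
    return best


def best_otu(otu_table, labels):
    """Select the single OTU whose best cut-off gives the purest two subsets."""
    results = {name: best_split(vals, labels) for name, vals in otu_table.items()}
    name = min(results, key=lambda k: results[k][1])
    return name, results[name][0], results[name][1]


# Hypothetical intensities of two OTUs for six subjects (illustration only).
otu_table = {
    "OTU_x": [0.5, 1.0, 2.0, 4.0, 5.0, 6.0],
    "OTU_y": [1.0, 5.0, 2.0, 4.0, 3.0, 6.0],
}
habit = ["No", "No", "No", "Yes", "Yes", "Yes"]
selected, cutoff, impurity = best_otu(otu_table, habit)
```

Repeating this selection inside each resulting subset, down to 5 steps, would give a tree in which every step uses one selected OTU and one quantitative cut-off value.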

RESULTS AND DISCUSSION

Comparison of NPs and restriction enzymes

The Dt produced with 2-NP is shown in Fig. 1 as a simple example, chosen for ease of understanding and to save space; the smoking habit of the subjects was explicitly classified into several nodes by certain OTUs. Applying 3-NP and 5-NP as shown in Table 1, the subjects were divided according to the various purposes of DM analysis. Here, the number of partitions was limited to 5 because of the small numbers of subjects in the smallest groups in Table 1 (4 for SBH, 5 for SP5 and 6 for DB3). Because most of the results with higher NPs require more space to show, an actual Dt with 5-NP for smoking habit is given in Appendix Fig. A1.

Fig. 1.

Decision-tree (Dt) by 2-NP for smoking habit with 53 OTUs. OTUs: 33HA+20M; marked as * in Table 2; large solid arrows: 7 nodes containing all 16 smokers, ‘SB’; large dotted arrow: node of 63 nonsmokers, ‘SA’.

Table 2.

Comparison of nominal partitions for accuracy of DM and smoking habit

R.Enz.: primer restriction enzymes; N.: number; NP: nominal partition; N. of wrongly classified subjects: number of misclassified subjects up to 5th step=accuracy; Combination of R.Enz. indicated sequences in DM processing; *: detailed Dt is shown in Fig. 1; $: detailed Dt is shown in Appendix Fig. A1; #1 – #3: compared with balance nodes in Table 6; &1 – &5: OTUs obtained up to 3rd step are shown in Table 4; Shadow at '2-NP' indicates that the results have been reported previously [3], but are shown here for comparison to 3-NP and 5-NP.

Flow-chart at utilization of balance nodes. m.rate: multiple rate for boosting data; Balance nodes are used to correct imbalances in a target dataset. Practical criteria for application were unclear between 10% and 20%. The applied results are shown in Table 6.

Table 6.

Application of balance nodes, accuracy and Dt configuration

The details of the Dt and the pathways to reach the terminal nodes in these figures clearly show the related OTUs that played a role in dividing the subjects into the various nodes. The Dt also provides quantitative cut-off values; namely, the 92 men were divided at the 1st step by HA291 into 2 subsets at the left end of Fig. 1 and were subsequently divided further. The 1st step was divided at 3.13 by HA291, and the cut-off value at the lower 2nd step was 4.28. A feature of this Dt was that only 7 of the 53 OTUs were active, with 2 of them, i.e., HA291 and HA83, being applied twice, which indicated that the remaining 46 OTUs were neglected in constructing this Dt. In other words, the 7 OTUs were closely related to the smoking characteristics of the subjects, and the other 46 OTUs were recognized as unrelated to smoking. When Fig. 1 was compared with the similar result in a previous report (Fig. 1 in [3]), which had applied 80 OTUs (27·BslI+33·HaeIII+20·MspI), HA291 had the same cut-off value at the 1st step, but the OTUs later than the 2nd step were different. In addition, 2 wrongly classified subjects were observed in Fig. 1. These were the effects of the applied OTU combinations, but we focused only on the accuracies of the Dt up to the 5th step. With regard to smoking habit, Biedermann et al. [15] recently examined the effects of smoking cessation in 5 subjects, as compared with 10 controls, by T-RFLP and PCA. Their results showed an increase in Firmicutes and Actinobacteria and a lower proportion of Bacteroidetes and Proteobacteria at the phylum level. Similarly, in Appendix Fig. A1, 12 OTUs out of 80 were selected to construct the Dt, including HA83. The 92 men were divided at the 1st step by B369 into 2 subsets at the left end of Appendix Fig. A1. The 1st step was divided at 1.17 by B369; the upper 2nd node (Node-1) included 86 subjects, and the lower 2nd node (Node-2) had only 6. This selection of OTUs is the main difference from former classification methods for HIM, such as clustering, PCA and the Pearson correlation coefficient, which consider all OTU data without any selection, so that the results inevitably become obscure. Table 2 compares the results for 2-NP, 3-NP and 5-NP, with some combinations of the 7 restriction enzymes, for smoking habit. Similarly, with regard to drinking habit, Table 3 also shows the OTUs for the 1st step and the number of wrongly classified subjects among the 92. The latter indicates the accuracy of evaluation for each set of NP and restriction enzymes, and its best value is 0.
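Because accuracy is reported in Tables 2 and 3 as the number of wrongly classified subjects among the 92, it can help to convert such counts to percentages. The following is a minimal sketch; the function name `accuracy_percent` is hypothetical and not part of the DM software.

```python
def accuracy_percent(n_misclassified, n_subjects=92):
    """Percentage of correctly classified subjects for a misclassification count."""
    if not 0 <= n_misclassified <= n_subjects:
        raise ValueError("count must lie between 0 and the number of subjects")
    return 100.0 * (n_subjects - n_misclassified) / n_subjects


best_case = accuracy_percent(0)             # no wrongly classified subjects
one_error = round(accuracy_percent(1), 1)   # a single wrongly classified subject
```

With 92 subjects, each additional misclassified subject lowers the accuracy by roughly 1.1 percentage points.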

Table 3.

Comparison of nominal partitions for accuracy of DM and drinking habit

All notations are the same as in Table 2.

Table 4.

Comparison of detailed Dts with better 3-NP, marked as &1 – &5 in Table 2

Smoking habit; all of their Dt 1st step: QM134; &2, &4 and &5 are as described in the text; Other notations are the same as for Tables 1 and 2.

Tables 2 and 3 show that accuracy is closely related to the combination of restriction enzymes, not only horizontally in the tables, but also vertically, for the same restriction enzymes with different NPs. The best accuracies were recognized as having the same OTUs at the 1st step. Higher NPs gave worse accuracy, with the exception of smoking at 3-NP with 3 combinations of restriction enzymes, i.e., QHh+QM, QHh+QM+QA and QM+QA+QHh (marked as &2, &4 and &5 in the lower middle of Table 2), where only 1 subject was misclassified. Comparing the 2 groups of restriction enzymes, i.e., 516f-+27f- and 35f-, the former generally seemed to have slightly better accuracy than the latter. Typical OTUs such as HA291 for heavy smokers [1,2,3] were only observed at 2-NP for smoking in Table 2, but A47 for drinking was widely obtained at the 1st step in Table 3. Comparing the 2 characteristics, i.e., smoking and drinking, the former was rather easier to classify than the latter, as was previously reported [3] for 2-NP only.

Detailed aspects of better accuracy

To trace the details of the exceptional, better cases marked as &1 to &5 in the lower half of Table 2, Table 4 shows the detailed components of the Dt from the 1st step to the 3rd step. For all 5 cases, the 1st step was the same, i.e., QM134, which indicates exceptional accuracy. The reason for these results lies in the structure of the Dt configurations. The 3 cases that had the best values, i.e., &2, &4 and &5, revealed that the structure of OTUs was the same until the 2nd step, and that the 3rd step was slightly different. Furthermore, the restriction enzymes in these 3 cases, i.e., QHh+QM, QHh+QM+QA and QM+QA+QHh, had a similar Dt configuration until the 5th step. Even though the selected OTUs were different, the locations of missing nodes were similar at the 4th and 5th steps, which are not shown in Table 4. This suggests that the OTUs used in each Dt after the 4th step were replaceable with certain other OTUs, and that QA was less workable for this classification than QHh and QM. Finally, the OTU QM134 played the most important role in subject classification for smoking with 3-NP and the 3 restriction enzymes (35f-), while QHh178 and QHh574 at the 2nd step played secondary roles.

Subject features for good classification

Although the values for accuracy were simply compared in Tables 2 and 3, each subject had individual OTU features, which were classified with varying levels of ease. In other words, some subjects might have ambiguous or boundary features that make classification difficult. Thus, for single use of the 4 restriction enzymes, i.e., B, HA, M and A, with 3-NP and 5-NP, the misclassified subjects were individually traced and examined. The number of misclassified subjects, the number of redundantly observed subjects, the rate of wrongly observed subjects among the 92 and the rate of always properly classified subjects are listed in Table 5. Interestingly, the latter 2 rates were informative in themselves, as they are readily understood in terms of the OTU features of the subjects. Furthermore, the remainder between these 2 rates, i.e., 100 - 26.1 - 55.4 = 18.5% for smoking and 29.4% for drinking, represents the subjects with intermediate features among the 92, who were classified properly at either 3-NP or 5-NP. These features were closely combined with smoking or drinking characteristics. Differences and specificities were clearly observed in the values in Table 5; smoking was comparatively easy to classify, and drinking was more ambiguous than smoking, which was recognized as reflecting the physiological stresses on the subjects.
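The arithmetic behind the intermediate rate for smoking can be reproduced with a short sketch. The assignment of 26.1% to the wrongly observed subjects and 55.4% to the always properly classified subjects is an assumption based on the order in which the two rates are listed; the function names are hypothetical, and the subject counts are back-calculated from the rates rather than read from Table 5.

```python
N_SUBJECTS = 92


def intermediate_rate(rate_wrong, rate_always_correct):
    """Remainder (%) left after the two rates are subtracted from 100."""
    return 100.0 - rate_wrong - rate_always_correct


def rate_to_count(rate_percent, n_subjects=N_SUBJECTS):
    """Nearest whole number of subjects corresponding to a percentage."""
    return round(rate_percent * n_subjects / 100.0)


# Smoking rates quoted in the text (assignment of the two values is assumed).
smoking_wrong = 26.1
smoking_always_correct = 55.4
smoking_intermediate = round(intermediate_rate(smoking_wrong, smoking_always_correct), 1)

# Back-calculated subject counts for the three groups.
counts = [
    rate_to_count(r)
    for r in (smoking_wrong, smoking_always_correct, smoking_intermediate)
]
```

Under these assumptions the three back-calculated groups add up to the full set of 92 subjects, which is a quick consistency check on the quoted percentages.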

Table 5.

Comparison of wrongly classified subjects with NPs and characteristics

Single use of 4 R.Enz.: B, HA, M and A; (···): rates among 92 subjects (%).

Balance node, application of boosted apparent subjects

As shown in Table 1, with a large NP, i.e., 5-NP, the numbers of subjects in some components became small and imbalanced, e.g., SB5, SBH and DB3. If the ratio of the smallest component to the largest was less than 15%, the obtained Dt was considered to be less stable, and was easily shifted by a slight change in the smallest component. To overcome these problems, the DM software provides a special method, the balance node, which boosts and duplicates subjects during Dt construction. Boosting refers to the multiple use of minor data components, which allows the total apparent data to be balanced. Naturally, the total number of apparent subjects increases depending on the multiple rate applied to each component. After the Dt was constructed, the original data for the 92 subjects without any boosting were applied to the obtained Dt, and the accuracy was examined in the normal way. The detailed mechanisms of boosting and preparing the subjects are shown in Fig. 2 and Table 6 for the cases of smoking marked as #2 and #3 in the upper middle part of Table 2. The left half of Table 6 shows the original dataset without boosting, and the middle of the table indicates the multiple rates for boosting and the number of apparent subjects. The right side shows the results examined normally with the original dataset, i.e., 92 subjects. Comparing the results in Table 6 with and without the balance nodes, accuracy improved, particularly in the case of the imbalanced dataset, i.e., 5-NP. Namely, the accuracy improved from 77.2% to 89.1%, which shows the advantage of balance nodes. However, in the case of 3-NP, whose components are less imbalanced (i.e., 28.1%), the improvement was slight, from 87.0% to 88.0%, but the obtained Dt configuration was different.
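The boosting step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the balance node of the DM software: the partition names and sizes are hypothetical, and the rule used to derive the multiple rate (the rounded ratio of the largest component to each component) is an assumption made for this example.

```python
from collections import Counter


def imbalance_percent(labels):
    """Smallest component as a percentage of the largest component."""
    counts = Counter(labels)
    return 100.0 * min(counts.values()) / max(counts.values())


def multiple_rates(labels):
    """Integer multiple rate per component (assumed rule: round(largest / size))."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: max(1, round(largest / n)) for cls, n in counts.items()}


def boost(rows, labels, rates):
    """Duplicate each subject according to the multiple rate of its component."""
    boosted_rows, boosted_labels = [], []
    for row, label in zip(rows, labels):
        boosted_rows.extend([row] * rates[label])
        boosted_labels.extend([label] * rates[label])
    return boosted_rows, boosted_labels


# Hypothetical partition sizes for 92 real subjects (not taken from Table 1).
labels = ["P1"] * 63 + ["P2"] * 24 + ["P3"] * 5
rows = list(range(len(labels)))  # stand-ins for the OTU records of each subject

rates = multiple_rates(labels)
boosted_rows, boosted_labels = boost(rows, labels, rates)
apparent = dict(Counter(boosted_labels))  # apparent subjects per component
```

A tree would then be constructed on `boosted_rows`, while its accuracy would be examined on the original, unboosted `rows`, mirroring the procedure in the text in which the 92 real subjects are applied to the obtained Dt.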

Fig. 2.

Smoking habit; R.Enz.: 27B+33HA+20M; NP: nominal partition; N.: number; constr.: construction; N. of real subjects: original dataset: 92; m.rate: multiple rate at boosting the data; N. of apparent subjects: boosted subjects for Dt construction; #1 – #3: compared with balance nodes in Table 2; 1st step – 3rd step were connected only vertically, not related to the left end column, i.e., SAA – SP5; "-" at Dt configuration: missing OTU; Shadow at '2-NP' indicates that the results have been reported previously [3], but are shown here for comparison to 3-NP and 5-NP.

The basic application scheme for balance nodes is shown in Fig. 2 as a flow chart. The OTUs obtained up to the 3rd step are also shown in Table 6 for comparison. The configuration of the Dt indicates the effects of balance nodes. First, the OTUs obtained with the balance node differed from those obtained with normal DM. Second, HA291, which was recognized as the OTU most closely related to heavy smokers [2, 3], appeared after the application of balance nodes with 5-NP at the 1st step and the lower 2nd step, which are underlined in Table 6. This indicates that the Dt structure after applying balance nodes is similar to the Dt with 2-NP, which is shown in the upper middle part of Table 6. If the OTU dataset has imbalanced components, a more stable Dt configuration and more stable OTUs are obtained with the application of balance nodes. Wide imbalances in a dataset, such as uneven components, occasionally occur in HIM analyses with large NPs (5 or more).

Effects of nominal partitions

With regard to the selection of effective restriction enzymes for obtaining accurate DM results, Tables 2 and 3 provide a good example for smoking and drinking habits. Combinations of 2 to 3 restriction enzymes gave better results. Furthermore, in these limited cases, the 516f- + 27f- group exhibited better results than the 35f- group. Focusing on the effects of NPs, which were also observed in Tables 2 and 3, the more NPs were applied, the lower the accuracy generally obtained. These results provided valuable information about the selection of related OTUs and confirmed an effective and stable method for DM processing. Moreover, we obtained in parallel lists of the classified subjects situated in the terminal Dt nodes. This means that one is able to classify or discriminate individual subjects, as can be seen visually in Appendix Fig. A1 with 5-NP. The OTU of the 1st step indicated in the figures and tables was the OTU most closely related to the assigned characteristics, and the 4th and 5th steps were thought to show some indirect effects, such as local effects in certain areas of OTUs. Focusing only on the OTU of the 1st step, lower DM accuracy was observed with increasing NP (3-NP or more), as shown in Tables 2 and 3, which is an essential problem of DM processing. Specifically, 5-NP has 4 borders within the dataset and gave worse accuracy than 2-NP, which has only 1 border. The greater the number of NPs, the more subjects are situated in the border zones of partitions. Therefore, to obtain a clear and simple Dt structure and steady OTUs, it is preferable to use small NPs (2-NP or 3-NP) rather than large NPs (5-NP or more). On the other hand, there is a remedy for large NPs and imbalanced components: application of the balance node, as shown in Fig. 2 and Table 6. However, there is a limit to the improvement in accuracy owing to the inherent properties of a dataset, such as the existence of subjects situated at the border zones.
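The relationship between the number of NPs and the number of subjects near a border can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the characteristic is treated as a score on an ordered scale from 0 to 1, the partitions are of equal width, the 92 scores are evenly spaced, and a subject counts as being in a border zone when its score lies within 0.05 of a border. None of these values come from the study data.

```python
def borders(k):
    """Internal borders of k equal-width partitions on [0, 1); there are k - 1."""
    return [i / k for i in range(1, k)]


def in_border_zone(score, k, half_width=0.05):
    """True if the score lies within half_width of any border of a k-NP."""
    return any(abs(score - b) < half_width for b in borders(k))


# 92 evenly spaced hypothetical scores, one per subject.
scores = [(i + 0.5) / 92 for i in range(92)]

# Number of subjects falling in a border zone for 2-NP, 3-NP and 5-NP.
zone_counts = {k: sum(in_border_zone(s, k) for s in scores) for k in (2, 3, 5)}
```

Under these assumptions the count of subjects in border zones grows with the number of borders, which is the mechanism the text proposes for the lower accuracy at 5-NP.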


1. Oligonucleotide microarray data mining: search for age-dependent gene expression.

Authors: Marc Kirschner; Gemma Pujol; Aurelian Radu
Journal: Biochem Biophys Res Commun Date: 2002-11-15 Impact factor: 3.575

2. Comparison of the fecal microbiota profiles between ulcerative colitis and Crohn's disease using terminal restriction fragment length polymorphism analysis.

Authors: Akira Andoh; Hirotsugu Imaeda; Tomoki Aomatsu; Osamu Inatomi; Shigeki Bamba; Masaya Sasaki; Yasuharu Saito; Tomoyuki Tsujikawa; Yoshihide Fujiyama
Journal: J Gastroenterol Date: 2011-01-21 Impact factor: 7.527

3. Analysis of the human intestinal microbiota from 92 volunteers after ingestion of identical meals.

Authors: J S Jin; M Touyama; R Kibe; Y Tanaka; Y Benno; T Kobayashi; M Shimakawa; T Maruo; T Toda; I Matsuda; H Tagami; M Matsumoto; G Seo; O Chonan; Y Benno
Journal: Benef Microbes Date: 2013-06-01 Impact factor: 4.205

4. Principal component analysis, hierarchical clustering, and decision tree assessment of plasma mRNA and hormone levels as an early detection strategy for small intestinal neuroendocrine (carcinoid) tumors.

Authors: Irvin M Modlin; Björn I Gustafsson; Ignat Drozdov; Boaz Nadler; Roswitha Pfragner; Mark Kidd
Journal: Ann Surg Oncol Date: 2008-12-03 Impact factor: 5.344

5. Dynamics of fecal microbiota in hospitalized elderly fed probiotic LKM512 yogurt.

Authors: Mitsuharu Matsumoto; Mitsuo Sakamoto; Yoshimi Benno
Journal: Microbiol Immunol Date: 2009-08 Impact factor: 1.955

6. Quantitative PCR with 16S rRNA-gene-targeted species-specific primers for analysis of human intestinal bifidobacteria.

Authors: Takahiro Matsuki; Koichi Watanabe; Junji Fujimoto; Yukiko Kado; Toshihiko Takada; Kazumasa Matsumoto; Ryuichiro Tanaka
Journal: Appl Environ Microbiol Date: 2004-01 Impact factor: 4.792

7. SNPranker 2.0: a gene-centric data mining tool for diseases associated SNP prioritization in GWAS.

Authors: Ivan Merelli; Andrea Calabria; Paolo Cozzi; Federica Viti; Ettore Mosca; Luciano Milanesi
Journal: BMC Bioinformatics Date: 2013-01-14 Impact factor: 3.169

8. IBDsite: a Galaxy-interacting, integrative database for supporting inflammatory bowel disease high throughput data analysis.

Authors: I Merelli; F Viti; L Milanesi
Journal: BMC Bioinformatics Date: 2012-09-07 Impact factor: 3.169

9. Identification of Heavy Smokers through Their Intestinal Microbiota by Data Mining Analysis.

Authors: Toshio Kobayashi; Kenji Fujiwara
Journal: Biosci Microbiota Food Health Date: 2013-04-27

10. Comparison of the accuracy and mechanism of data mining identification of the intestinal microbiota with 7 restriction enzymes.

Authors: Toshio Kobayashi; Kenji Fujiwara
Journal: Biosci Microbiota Food Health Date: 2013-10-30
