Toshio Kobayashi, Kenji Fujiwara.
Abstract
The application of data mining analyses (DM) is effective for the quantitative classification of human intestinal microbiota (HIM). However, various technical problems remain to be overcome. This paper addresses one major technical problem: the number of nominal partitions (NP) of the target dataset. We used terminal restriction fragment length polymorphism data obtained from the feces of 92 Japanese men. The data comprised operational taxonomic units (OTUs) and the subjects' smoking and drinking habits, which had previously been classified effectively with two NP (2-NP; Yes or No). Using the same OTU data, we examined 3-NP and 5-NP and compared the results with the former 2-NP, focusing on prediction accuracy and the reliability of the OTUs selected by DM. Accuracy was further affected by the restriction enzymes for PCR, and 7 enzymes were compared. Some subjects possessed HIM at the border zones of partitions, and the greater the number of partitions, the lower the DM accuracy obtained. Applying balance nodes, which boost and duplicate the data, improved accuracy. More accurate and reliable DM operations are applicable to the classification of unknown subjects for identifying various characteristics, including disease.
Keywords: accuracy of classification; balance node; data mining analysis; decision tree; human intestinal microbiota; nominal partitions of data; operational taxonomic unit
Year: 2014 PMID: 25032086 PMCID: PMC4098652 DOI: 10.12938/bmfh.33.129
Source DB: PubMed Journal: Biosci Microbiota Food Health ISSN: 2186-3342
Table 1. 2 to 5 nominal partitions (NP) of the 92 subjects
N.: number; cess. P.: smoking cessation period; 5Y: 5 years; w.: week; d.: day; AlOH, A: alcohol; Shading at '2-NP' indicates that the results have been reported previously [3], but are shown here for comparison with 3-NP and 5-NP.
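The idea of re-partitioning one habit variable into 2, 3 or 5 nominal classes can be illustrated with a minimal Python sketch. The cut points and the numeric "habit score" below are hypothetical and chosen only for illustration; they are not the partition criteria used in the paper.

```python
from bisect import bisect_right

# Hypothetical cut points on a numeric habit score.
# These are NOT the paper's partition criteria.
CUTS = {
    2: [1],              # 2-NP: No / Yes
    3: [1, 20],          # 3-NP: three nominal classes
    5: [1, 10, 20, 30],  # 5-NP: five nominal classes
}


def partition(value, n_partitions):
    """Return the 0-based nominal class of a habit score under n_partitions."""
    return bisect_right(CUTS[n_partitions], value)


def labels(values, n_partitions):
    """Assign a nominal class to every subject."""
    return [partition(v, n_partitions) for v in values]


# Two subjects with nearly identical scores, close to a hypothetical border.
border_pair = [19, 21]
print(labels(border_pair, 2))  # [1, 1]: same class under 2-NP
print(labels(border_pair, 3))  # [1, 2]: separated under 3-NP
print(labels(border_pair, 5))  # [2, 3]: separated under 5-NP
```

Under these assumed cut points, two subjects who share a class in 2-NP fall on opposite sides of a border once more partitions are introduced, which is the kind of border-zone case the abstract associates with lower accuracy.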
Fig. 1. Decision tree (Dt) by 2-NP for smoking habit with 53 OTUs. OTUs: 33HA+20M; marked as * in Table 2; large solid arrows: 7 nodes containing all 16 smokers, ‘SB’; large dotted arrow: node of 63 nonsmokers, ‘SA’.

Table 2. Comparison of nominal partitions for accuracy of DM and smoking habit
R.Enz.: primer restriction enzymes; N.: number; NP: nominal partition; N. of wrongly classified subjects: number of misclassified subjects up to the 5th step (= accuracy); Combination of R.Enz. indicates the sequence in DM processing; *: detailed Dt is shown in Fig. 1; $: detailed Dt is shown in Appendix Fig. A1; #1 – #3: compared with balance nodes in Table 6; &1 – &5: OTUs obtained up to the 3rd step are shown in Table 4; Shading at '2-NP' indicates that the results have been reported previously [3], but are shown here for comparison with 3-NP and 5-NP.
Table 6. Application of balance nodes, accuracy and Dt configuration
Smoking habit; R.Enz.: 27B+33HA+20M; NP: nominal partition; N.: number; constr.: construction; N. of real subjects: original dataset: 92; m.rate: multiple rate for boosting the data; N. of apparent subjects: boosted subjects for Dt construction; #1 – #3: compared with balance nodes in Table 2; the 1st to 3rd steps are connected only vertically and are not related to the leftmost column, i.e., SAA – SP5; "-" in Dt configuration: missing OTU; Shading at '2-NP' indicates that the results have been reported previously [3], but are shown here for comparison with 3-NP and 5-NP; Basic application schemes for balance nodes are shown in Fig. 2 as a flow-chart.
Table 3. Comparison of nominal partitions for accuracy of DM and drinking habit
All notations are the same as in Table 2.
Table 4. Comparison of detailed Dts with better 3-NP, marked as &1 – &5 in Table 2
Smoking habit; 1st step of all of these Dts: QM134; &2, &4 and &5 are as described in the text; other notations are the same as for Tables 1 and 2.
Table 5. Comparison of wrongly classified subjects with NPs and characteristics
Single use of 4 R.Enz.: B, HA, M and A; (···): rates among 92 subjects (%).
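The rate of wrongly classified subjects is a simple percentage of the 92 subjects. A minimal Python sketch follows; the count of 23 is an illustrative value, not a result from the paper, and the function name is hypothetical.

```python
def misclassification_rate(n_wrong, n_subjects=92):
    """Percentage of subjects that were wrongly classified."""
    if not 0 <= n_wrong <= n_subjects:
        raise ValueError("n_wrong must lie between 0 and n_subjects")
    return 100.0 * n_wrong / n_subjects


# Illustrative count only (not taken from the paper's tables).
print(misclassification_rate(23))  # 25.0
```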
Fig. 2. Flow-chart for the utilization of balance nodes. m.rate: multiple rate for boosting data; Balance nodes are used to correct imbalances in a target dataset. Practical criteria for application were unclear between 10% and 20%. The applied results are shown in Table 6.
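The boosting step that a balance node performs can be sketched in Python as duplication of minority-class records by an integer multiple rate. This is a minimal sketch under stated assumptions, not the procedure of the data mining software used in the paper: the rounding rule for the rate and the function names are hypothetical, and the class sizes assume that the 16 smokers in Fig. 1 leave 76 nonsmokers among the 92 subjects under 2-NP.

```python
from collections import Counter


def multiple_rates(labels):
    """Integer boosting factor per class (assumed rule: round largest/size)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: max(1, round(largest / n)) for cls, n in counts.items()}


def apply_balance(records, labels, rates):
    """Duplicate each record rates[label] times to build the apparent dataset."""
    boosted_records, boosted_labels = [], []
    for rec, lab in zip(records, labels):
        boosted_records.extend([rec] * rates[lab])
        boosted_labels.extend([lab] * rates[lab])
    return boosted_records, boosted_labels


# Assumed 2-NP class sizes: 16 smokers (Fig. 1) and 92 - 16 = 76 nonsmokers.
labels = ["smoker"] * 16 + ["nonsmoker"] * 76
records = list(range(len(labels)))  # stand-ins for per-subject OTU profiles

rates = multiple_rates(labels)
boosted_records, boosted_labels = apply_balance(records, labels, rates)

print(rates)                    # {'smoker': 5, 'nonsmoker': 1}
print(len(boosted_records))     # 156 apparent subjects
print(Counter(boosted_labels))  # Counter({'smoker': 80, 'nonsmoker': 76})
```

Duplication adds no new information; it only changes the class weights seen during tree construction, so accuracy is best reported against the real subjects rather than the apparent ones.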