
Research on Hybrid Feature Selection Method Based on Iterative Approximation Markov Blanket.

Keding Li, Canyi Huang, Jianqiang Du, Bin Nie, Guoliang Xu, Wangping Xiong, Jigen Luo.

Abstract

The basic experimental data of traditional Chinese medicine (TCM) are generally obtained by high-performance liquid chromatography and mass spectrometry. These data are typically high-dimensional with few samples and contain many irrelevant and redundant features, which complicates the in-depth exploration of TCM material information. This paper proposes a hybrid feature selection method based on an iterative approximate Markov blanket (CI_AMB). The method first uses the maximum information coefficient to measure the correlation between each feature and the target variable and filters out irrelevant features according to an evaluation criterion. An iterative approximate Markov blanket strategy then analyzes redundancy between features, eliminates redundant features, and finally selects an effective feature subset. Comparative experiments on basic TCM material experimental data and several UCI public datasets show that, compared with Lasso, XGBoost, and the classic approximate Markov blanket method, the new method is better at selecting a small number of highly explanatory features.
Copyright © 2020 Canyi Huang et al.

Year:  2020        PMID: 32328156      PMCID: PMC7166270          DOI: 10.1155/2020/8308173

Source DB:  PubMed          Journal:  Comput Math Methods Med        ISSN: 1748-670X            Impact factor:   2.238


1. Introduction

At present, with the rapid development of science and technology, information acquisition and storage capacity have greatly improved, and the data obtained carry richer information and grow ever larger in scale. In basic research on traditional Chinese medicine (TCM) materials, high-performance liquid chromatography (Waters H-Class) and mass spectrometry (Synapt G2-Si) are usually used to obtain experimental data. These data often involve thousands of substances, making them high-dimensional and prone to the curse of dimensionality. At the same time, because the number of experiments is limited, the data also have few samples, which easily leads to problems such as overfitting. Conventional statistical methods, such as multiple linear regression, principal component regression, and ridge regression, use regression coefficients to reflect the relationships between variables [1-3]; however, they cannot effectively delete irrelevant and redundant features, and so cannot screen important substances from high-dimensional, small-sample basic TCM data. Likewise, traditional feature selection methods such as Lasso and K-split Lasso [4] can delete irrelevant and redundant features only to a limited extent and cannot meet the processing requirements of high-dimensional small-sample data. Therefore, given that high-dimensional small-sample TCM data contain considerable irrelevant and redundant information, there is an urgent need for an analytical model that can select effective features from such data and improve model accuracy and efficiency, providing technical support for researchers. The rest of this article is organized as follows: Section 2 reviews related work; Section 3 elaborates the new method; Section 4 analyzes two basic TCM material datasets and three public UCI datasets with the new method and compares it with several existing algorithms to verify its feasibility and effectiveness; Section 5 concludes the paper.

2. Related Work

Feature selection is an effective way to combat the curse of dimensionality and achieve dimensionality reduction. By analyzing the intrinsic relationships between features and the target variable, and among the features themselves [5, 6], it preserves the features most useful for regression (or classification) while eliminating redundant features and features unrelated to the target variable, thereby reducing algorithmic complexity and improving accuracy. According to how they combine with machine learning, feature selection methods can be divided into filter, wrapper, embedded, and ensemble methods [7]. Filter methods are independent of any specific learning model; they typically use feature ranking or feature-space search to obtain feature subsets, with representative criteria including mutual information, symmetric uncertainty, and the maximum information coefficient [8-10]. Wrapper methods integrate the learning algorithm into the selection process: the classifier is treated as a black box for evaluating subset performance, with the goal of maximizing classification accuracy. Embedded methods incorporate feature selection directly into the learning algorithm, avoiding the high retraining cost that wrappers incur on different datasets. Ensemble methods first run multiple feature selection methods separately and then integrate their results by some rule; they outperform single methods and help address the instability of feature selection. Feature selection has attracted wide attention from scholars at home and abroad. For example, in biomedicine, Yao et al. [11] proposed a hypergraph-based multimodal feature selection method for multitask feature selection, ultimately selecting informative brain regions; Sun et al. [12] proposed a hybrid Lasso-based feature selection algorithm that selects informative gene subsets with strong classification ability; Mingquan et al. [13] proposed an informative gene selection method based on symmetric uncertainty and support vector machine (SVM) recursive feature elimination, which effectively removes genes unrelated to the class labels. Feature selection methods are also well applied in other fields: Nagaraja [14] used partial least squares regression and optimized experimental design to select features strongly correlated with the classes; Hu et al. [15] proposed a feature selection algorithm combining spectral clustering and neighborhood mutual information, which removes class-independent features. However, the methods above can remove irrelevant features or eliminate redundant features only to a certain extent and cannot meet the processing needs of high-dimensional small-sample TCM data. Some researchers have therefore studied two-stage analysis of feature relevance and redundancy, introducing the approximate Markov blanket (AMB) into the feature selection process to screen out a small number of effective features [16]. Among them, the literature [17] proposed approximating the Markov blanket using cross entropy: the method first uses the Pearson coefficient to compute feature correlations and remove irrelevant features, and then uses the approximate Markov blanket to delete redundant features. The paper [18] proposed a maximum correlation minimum redundancy feature selection algorithm using approximate Markov blankets.
The method first ranks features by the maximum correlation minimum redundancy criterion and then approximates the Markov blanket with mutual information to remove irrelevant and redundant features. The literature [19] proposed a feature selection method based on the maximum information coefficient and approximate Markov blanket (FCBF-MIC), which first measures the correlation between features and classes by symmetric uncertainty, deleting features that are unrelated or only weakly related to the classes, and then approximates the Markov blanket using the maximum information coefficient to delete redundant features. However, experimental analysis shows that the above methods define the approximate Markov blanket too strictly, making it impossible to select a small number of highly explanatory features from high-dimensional small-sample TCM data, so further research into TCM data analysis methods is still needed. In feature selection research, higher-quality methods should exhibit the following characteristics [20]: (1) interpretability, meaning the selected features have scientific significance; (2) acceptable model stability; (3) avoidance of bias in hypothesis testing; and (4) manageable computational complexity. In addition, the literature [21] classifies features into four categories when defining an optimal feature subset: irrelevant features, weakly correlated redundant features, weakly correlated nonredundant features, and strongly correlated features; the optimal subset should contain only the latter two. Extensive experimental comparisons have shown that this standard yields lower time complexity and better feature selection results [22, 23]. In view of this, this paper proposes a hybrid feature selection method based on an iterative approximate Markov blanket (CI_AMB), which works in two phases. In the first phase, the maximum information coefficient measures the correlation between each feature and the target variable, and irrelevant features are filtered out according to an evaluation criterion to obtain a candidate feature subset. In the second phase, the candidate features are sorted and divided into K subsets, and redundant features are iteratively culled using the MIC-based approximate Markov blanket, yielding the weakly correlated nonredundant features and strongly correlated features. The algorithm not only effectively filters irrelevant features and eliminates redundant ones, but also reduces the time complexity of the model and improves its interpretability, making it a new model suited to high-dimensional small-sample TCM data analysis.

3. Research on Hybrid Feature Selection Method Based on Iterative Approximation Markov Blanket (CI_AMB)

The maximum information coefficient (MIC) is an information-based metric proposed by Reshef et al. [24] in 2011. It not only reflects the correlation between features and target variables (and among features) well, but also overcomes the problems that mutual information cannot be normalized and is sensitive to discretization, and that metrics such as information gain and symmetric uncertainty cannot effectively measure nonfunctional dependence between features. Many experimental analyses have also demonstrated that MIC is stable and able to measure relationships among features effectively [25-27]. A Markov blanket is a minimal feature subset that preserves the maximum information about the target variable, such that, conditioned on this subset, the remaining features are independent of the target [19, 28]. Although the Markov blanket can achieve feature dimensionality reduction, its independence conditions are too strict and discovering it is an NP-hard problem, so feature selection methods often adopt an approximate Markov blanket strategy instead. Therefore, combining the advantages of the maximum information coefficient, this paper uses MIC to approximate the Markov blanket (see Definition 1) in order to better eliminate redundant features, achieving optimal feature subset screening and model optimization.
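As an illustration, a simplified MIC score can be computed by searching only uniform nx-by-ny grids subject to nx·ny ≤ B(n) = n^0.6 and keeping the best normalized mutual information. True MIC also optimizes the partition boundaries, so the uniform-grid values in this sketch are only a rough lower-bound approximation; all names and data below are illustrative.

```python
import numpy as np

def mic_approx(x, y):
    """Rough MIC sketch: search uniform nx-by-ny grids with nx*ny <= B(n),
    B(n) = n**0.6, and keep the largest MI(O, nx, ny) / log(min(nx, ny)).
    (True MIC also optimizes partition boundaries; uniform bins give only
    a lower-bound approximation, used here for illustration.)"""
    n = len(x)
    b = int(n ** 0.6)
    best = 0.0
    for nx in range(2, b + 1):
        for ny in range(2, b + 1):
            if nx * ny > b:
                continue
            joint, _, _ = np.histogram2d(x, y, bins=(nx, ny))  # uniform grid
            p = joint / n
            px = p.sum(axis=1, keepdims=True)   # marginal over x-bins
            py = p.sum(axis=0, keepdims=True)   # marginal over y-bins
            nz = p > 0
            mi = (p[nz] * np.log(p[nz] / (px @ py)[nz])).sum()
            best = max(best, mi / np.log(min(nx, ny)))
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
print(round(mic_approx(x, x), 3))                       # strong dependence: near 1
print(round(mic_approx(x, rng.uniform(0, 1, 200)), 3))  # independent noise: near 0
```

Qualitatively this reproduces the behavior described above: scores near 1 for strongly dependent pairs and near 0 for independent noise.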

Definition 1 .

(approximate Markov blanket). Assume that f_i and f_j (i ≠ j) are two different features in the feature set and Y is the target variable. If MIC(f_i, Y) ≥ MIC(f_j, Y) and MIC(f_i, f_j) ≥ MIC(f_j, Y), then f_i is considered an approximate Markov blanket of f_j; that is, f_i is retained while f_j is a redundant feature and is removed from the feature set.

Definition 2 .

(weakly correlated nonredundant features and strongly correlated features). A feature f is a weakly correlated nonredundant feature or a strongly correlated feature only if no approximate Markov blanket exists for f, namely f ∈ {F − F_irrelevant − F_redundant}, where F is the complete feature set and F_irrelevant and F_redundant are the irrelevant and redundant feature sets, respectively.

The CI_AMB method is divided into two main stages. In the first stage, the MIC method measures the correlation between each feature and the target variable, and irrelevant features are filtered out according to the evaluation criteria to obtain a candidate feature subset. The features selected by the MIC method are usually highly correlated with the target but accompanied by redundant features; a larger number of redundant features not only increases the time and space complexity of the model but also reduces its interpretability. Therefore, in the second stage, the new method further analyzes feature redundancy: according to the feature scores obtained by the MIC method, the candidate features are arranged in ascending order and divided equally into K parts, and the approximate Markov blanket (AMB) is then used to iteratively eliminate redundant features, so that weakly correlated nonredundant features and strongly correlated features can be selected (Algorithm 1). The flow of the algorithm is shown in Figure 1.
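The redundancy test of Definition 1 reduces to two score comparisons; a minimal sketch, using hand-picked toy scores rather than real MIC values:

```python
def is_amb(mic_fi_y, mic_fj_y, mic_fi_fj):
    """Definition 1: f_i is an approximate Markov blanket of f_j when f_i is
    at least as relevant to the target Y as f_j, and f_i carries at least as
    much information about f_j as f_j carries about Y."""
    return mic_fi_y >= mic_fj_y and mic_fi_fj >= mic_fj_y

# Toy scores (not real MIC values): f_i explains both Y and f_j well,
# so f_j is judged redundant and would be removed.
print(is_amb(mic_fi_y=0.9, mic_fj_y=0.6, mic_fi_fj=0.8))  # True
print(is_amb(mic_fi_y=0.5, mic_fj_y=0.6, mic_fi_fj=0.8))  # False: f_i less relevant
```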
Algorithm 1

CI_AMB algorithm.

Figure 1

CI_AMB model.

The specific construction process of the model is as follows:

Phase 1. Filtering irrelevant features.

Step 1. MIC calculation: MIC is computed on the original data with m features; that is, the maximum information coefficient of each feature with the target variable is calculated as

MIC(X, Y) = max_{x·y < B(n)} [ MI(O, x, y) / log min(x, y) ],

yielding a score sequence Tlist = (t1, t2, …, tm) over all features, with values in [0, 1]. The closer a feature's score is to 1, the stronger its correlation with the target variable; the closer to 0, the weaker the correlation. Here MI(O, x, y) is the maximum mutual information of O under grid partitioning [19, 29], O is the set of ordered sample pairs, x is the number of segments into which the value range of feature X is divided, y is the number of segments for the dependent variable Y, and B(n) is the upper limit of the grid size, generally taken as B(n) = n^0.6, where n is the sample size.

Step 2. Determining the candidate feature subset: the score sequence Tlist is sorted in descending order and truncated at a certain ratio, and the currently top-ranked feature subset is selected. If the selected subset achieves the best value of the evaluation index RMSE, where RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2), the candidate feature subset D (with m′ features, m′ < m) is selected directly; otherwise, the filtering and judging process continues.

Step 3. Data division and initialization: the candidate feature subset D is arranged in reverse order of the feature scores, giving an ordered candidate subset Dlist; ranking the features in this way ensures that features highly correlated with the regression target are maximally retained in the later processing. Dlist is then divided into K groups, where Dlist(i) denotes the i-th (1 ≤ i ≤ K) feature subset after division, and the optimal feature set Tbest is initialized to be empty.

Phase 2. Eliminating redundant features.

Step 4. Feature redundancy analysis: first, redundant features are removed from the first subset Dlist(1) using the AMB method (i.e., Definition 1), and the nonredundant features are placed into Tbest. Next, Tbest is merged with the second subset Dlist(2) as the current feature subset, the AMB method is applied again to delete redundant features, and Tbest is updated. Repeating this process over all K subsets yields the optimal feature subset Tbest with the remaining m″ features (m″ < m′).

Step 5. Model evaluation: the weakly correlated nonredundant and strongly correlated optimal feature subset (Tbest) obtained in the above steps is used to compare and evaluate the various strategies.
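The two phases above can be sketched end to end as follows. This is an illustrative reimplementation, not the authors' code: absolute Pearson correlation stands in for MIC scoring, and a fixed keep_ratio stands in for the RMSE-driven truncation of Step 2.

```python
import numpy as np

def score(f, y):
    # stand-in relevance score; the paper uses MIC here
    return abs(np.corrcoef(f, y)[0, 1])

def ci_amb(X, y, keep_ratio=0.8, k=3):
    """Sketch of the two-phase CI_AMB procedure (stand-in scores, not MIC)."""
    m = X.shape[1]
    s = np.array([score(X[:, j], y) for j in range(m)])
    # Phase 1: keep the top keep_ratio fraction of features, sorted by
    # descending relevance score (the candidate subset)
    cand = np.argsort(-s)[: max(1, int(m * keep_ratio))]
    groups = np.array_split(cand, k)        # split candidates into k groups
    # Phase 2: iteratively cull redundant features with the AMB rule
    t_best = []
    for g in groups:
        pool = sorted(t_best + list(g), key=lambda j: -s[j])
        kept = []
        for j in pool:
            # j is redundant if some already-kept feature i is an AMB of j:
            # s[i] >= s[j] and score(f_i, f_j) >= s[j]  (Definition 1)
            if not any(s[i] >= s[j] and score(X[:, i], X[:, j]) >= s[j]
                       for i in kept):
                kept.append(j)
        t_best = kept
    return sorted(int(j) for j in t_best)

rng = np.random.default_rng(1)
f0 = rng.normal(size=100)
X = np.column_stack([f0,
                     f0 + 0.01 * rng.normal(size=100),   # near-duplicate of f0
                     rng.normal(size=100)])               # pure noise
y = f0 + 0.1 * rng.normal(size=100)
print(ci_amb(X, y, keep_ratio=1.0, k=2))   # the duplicate pair collapses to one
```

On this toy data, exactly one member of the near-duplicate pair survives the AMB culling, which is the behavior Step 4 aims for.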

4. Experimental Design

4.1. Experimental Data Description

Five experimental datasets were used in this paper: the basic TCM material experimental data (WYHXB and NYWZ) from the Modern Chinese Medicine Preparation Ministry of Education laboratory, together with the Residential Building (RBuild), Communities and Crime (CCrime), and BlogFeedback (BlogData for short) datasets from the UCI repository; the basic information of each dataset is described in Table 1. The WYHXB data contain 798 features, 1 dependent variable, and 54 samples; the NYWZ data contain 10283 features, 1 dependent variable, and 54 samples; BlogData describes blog posts and includes 280 features, 1 dependent variable, and 60021 samples; RBuild describes residential buildings and includes 103 features, 1 dependent variable, and 372 samples; CCrime describes community crime and includes 127 features, 1 dependent variable, and 1994 samples. It is worth noting that the datasets obtained from the UCI Machine Learning Repository generally have many missing values, so the mean filling method is used for data preprocessing in the experiments. BlogData, RBuild, and CCrime are used to compare the regression effects of the new model on public datasets, verifying its reliability and generalization.
Table 1

Basic dataset information (default task: regression).

Datasets | Number of samples | Number of attributes
WYHXB | 54 | 799 (798 + 1)
NYWZ | 54 | 10284 (10283 + 1)
BlogData | 60021 | 281 (280 + 1)
RBuild | 372 | 104 (103 + 1)
CCrime | 1994 | 128 (127 + 1)
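The mean-filling step mentioned above for the UCI datasets can be sketched with NumPy as follows (a generic sketch on a toy array; the paper does not give its preprocessing code):

```python
import numpy as np

# Toy matrix with missing values (NaN); real UCI columns are analogous.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
col_mean = np.nanmean(X, axis=0)        # per-column mean, ignoring NaN
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_mean[cols]          # replace each NaN with its column mean
print(X)                                # second row becomes [2., 4.], third [3., 3.]
```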
Both WYHXB and NYWZ are basic experimental data on Shenfu injection in the treatment of cardiogenic shock. The experimenters replicated a mid-stage cardiogenic shock rat model using the left anterior descending coronary artery near the cardiac apex and administered Shenfu injection (unit: ml·kg−1) to the shocked rats, which were divided into 7 dose groups (0.1, 0.33, 1.0, 3.3, 10, 15, and 20, respectively) with 6 rats in each group; a model group and a blank group were also set up throughout the experiment. Sixty minutes after administration, the pharmacodynamic indicator, red blood cell flow rate, was collected. The substance information contained in the Shenfu injection itself is called the exogenous substances (the WYHXB data, as shown in Table 2), and the substance information of the experimental individuals is called the endogenous substances (the NYWZ data, as shown in Table 3). In both datasets, the substance information forms the features, and the red blood cell flow rate is the dependent variable.
Table 2

Partial data of basic experiments with traditional Chinese medicine substances (WYHXB).

0.34_237.0119 m/z0.35_735.1196 m/z0.36_588.0942 m/z0.36_590.0903 m/zRed blood cell flow rate (μ·m/s)
0.48808302.16027.8589750
100.07862.01603.807121400
11.699252.50587.610054.85059785
143.643284.1130456.607790
7.7508954.453500670
18.24990014.6621680
28.5783002.3551850
2.91064016.16243.41406620
Table 3

Partial data of basic experiments with traditional Chinese medicine substances (NYWZ).

11.10_787.5077 m/z12.29_526.1784 m/z12.29_531.2005 m/z12.47_631.3847 m/zRed blood cell flow rate (μ·m/s)
53.371911557.6764.3291795.792200
43.47177971.33875.4651842.392750
76.5073399.9870.1611562.811980
153.14551027.4916.0641619.621860
16.319710694.4942.6991612.422100
42.283611048.1714.5361649.232000
55.50214702.83748.8441632.92481
153.2178912.8835.241647.552970

4.2. Results and Discussion

The programming tool used in this experiment is Python 3.6, the operating system is Windows 10, the memory is 8 GB, and the CPU is Intel (R) Core (TM) i5-3230M.

4.2.1. Filtering of Irrelevant Features

To ensure the reliability of the new model, the RMSE (root mean square error) of two regression models, GBDT [30] and XGBoost [31], was adopted as a comprehensive evaluation index; that is, the average RMSE of the two regression models was taken as the evaluation index. The features of the original dataset were then filtered gradually by a ratio P (if the number of features was fractional, the result was rounded in the experiment), so as to determine the ratio P at which the corresponding RMSE value is best, and thus how many irrelevant features should be deleted to effectively filter them out. The experimental results are shown in Table 4.
Table 4

Experimental results of filtering irrelevant features on the five datasets.

(Each cell: number of features / Ave-RMSE.)
P | WYHXB | NYWZ | BlogData | RBuild | CCrime
0.95 | 758 / 234.960328 | 9768 / 233.324863 | 266 / 12.645784 | 97 / 354.101779 | 120 / 0.131535
0.9 | 718 / 235.019819 | 9254 / 233.324863 | 252 / 12.645784 | 92 / 354.090179 | 114 / 0.131695
0.85 | 678 / 234.800187 | 8740 / 233.324863 | 238 / 12.645784 | 87 / 354.134252 | 107 / 0.131858
0.8 | 638 / 235.133101 | 8226 / 233.324863 | 224 / 12.645784 | 82 / 354.146541 | 101 / 0.131792
0.75 | 598 / 235.104648 | 7712 / 233.388367 | 210 / 12.645784 | 77 / 353.914801 | 95 / 0.131897
0.7 | 558 / 235.132128 | 7198 / 233.388367 | 196 / 12.645784 | 72 / 353.914801 | 88 / 0.131853
0.65 | 518 / 235.191663 | 6683 / 233.385479 | 182 / 12.645784 | 66 / 353.923275 | 82 / 0.131902
0.6 | 478 / 235.202756 | 6169 / 233.394604 | 168 / 12.645784 | 61 / 354.042364 | 76 / 0.132113
0.55 | 438 / 235.263138 | 5655 / 233.394604 | 154 / 12.645784 | 56 / 354.050328 | 69 / 0.132164
0.5 | 399 / 235.962421 | 5141 / 233.357302 | 140 / 12.645784 | 51 / 354.053246 | 63 / 0.132310
0.45 | 359 / 235.941428 | 4627 / 233.355757 | 126 / 12.649723 | 46 / 354.770411 | 57 / 0.132497
0.4 | 319 / 236.399412 | 4113 / 233.354086 | 112 / 12.651157 | 41 / 354.849084 | 50 / 0.132620
0.35 | 279 / 236.574098 | 3599 / 233.354248 | 98 / 12.657242 | 36 / 355.659524 | 44 / 0.133428
0.3 | 239 / 376.546789 | 3084 / 233.358374 | 84 / 12.664293 | 30 / 355.714190 | 38 / 0.133759
0.25 | 199 / 406.768586 | 2570 / 233.399275 | 70 / 12.671595 | 25 / 355.700106 | 31 / 0.134865
0.2 | 159 / 445.621765 | 2056 / 233.437486 | 56 / 12.676944 | 20 / 355.714027 | 25 / 0.136386
0.15 | 119 / 545.521345 | 1542 / 233.539485 | 42 / 12.677181 | 15 / 355.722452 | 19 / 0.137433
0.1 | 79 / 553.326100 | 1028 / 233.550540 | 28 / 12.677343 | 10 / 355.785519 | 12 / 0.139937
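The ratio search described above can be sketched as follows. As assumptions: the data are synthetic, absolute Pearson correlation stands in for the MIC feature scores, and RandomForestRegressor stands in for XGBoost so that the sketch needs only scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def avg_rmse(X, y):
    """Mean test RMSE of two regressors (RandomForest stands in for XGBoost)."""
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.4, random_state=0)
    rmses = []
    for model in (GradientBoostingRegressor(random_state=0),
                  RandomForestRegressor(random_state=0)):
        pred = model.fit(Xtr, ytr).predict(Xte)
        rmses.append(mean_squared_error(yte, pred) ** 0.5)
    return sum(rmses) / len(rmses)

# Synthetic data: only the first two of 20 features matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 20))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=150)

# Stand-in relevance scores (the paper uses MIC here).
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

best_p, best_err = None, float("inf")
for p in np.arange(0.1, 1.0, 0.1):                     # candidate keep-ratios P
    keep = np.argsort(-scores)[: int(round(X.shape[1] * p))]
    err = avg_rmse(X[:, keep], y)
    if err < best_err:
        best_p, best_err = round(float(p), 2), err
print(best_p, round(best_err, 3))
```

The best ratio is simply the one whose kept feature subset minimizes the averaged RMSE, mirroring how Table 4 is read.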
According to the experimental results in Table 4, for the WYHXB data the average RMSE is best at P = 0.85, filtering 120 irrelevant features (798 original features); for the NYWZ data it is best at P = 0.8, filtering 2057 irrelevant features (10283 original features); for BlogData it is best at P = 0.5, filtering 140 irrelevant features (280 original features); for RBuild it is best at P = 0.7, filtering 31 irrelevant features (103 original features); and for CCrime it is best at P = 0.95, filtering 7 irrelevant features (127 original features). After filtering irrelevant features with the MIC method as above, candidate feature subsets for the five experimental datasets are obtained. Further analysis shows that the RMSE of the original data differs little from that of the candidate feature subsets (the experimental results are shown in Table 5); therefore, the features deleted in this experiment have little effect on the accuracy of the model, and the process effectively filters out irrelevant features while preserving the features associated with the target variables.
Table 5

Comparison of experimental data between raw data and candidate feature subsets.

(Each cell: number of features / RMSE.)
Dataset | Original data | Candidate feature subset
WYHXB | 798 / 234.967849 | 678 / 234.800187
NYWZ | 10283 / 234.052699 | 8226 / 233.324863
BlogData | 280 / 12.645784 | 140 / 12.645784
RBuild | 103 / 352.473674 | 72 / 353.914801
CCrime | 128 / 0.131377 | 120 / 0.131535

4.2.2. Elimination of Redundant Features

Through the above experiments, candidate feature subsets are obtained and irrelevant features are filtered out. However, according to the construction of the new model, the candidate feature subsets (in ascending order) must be divided equally during the experiment, and different partitioning strategies affect the final results, so the parameter K needs further discussion and analysis (its value range is set to 1 to 15) to determine the optimal value and ensure reliable results. At the same time, to avoid experimental contingency as much as possible, the RMSE of GBDT and XGBoost (i.e., their mean RMSE) is again adopted as the comprehensive evaluation index. Experimental analysis (results shown in Figures 2-6) shows that the corresponding RMSE value is best at k = 5 for the WYHXB data, k = 6 for the NYWZ data, k = 5 for BlogData, k = 3 for RBuild, and k = 14 for CCrime. After dividing the candidate feature subsets, feature redundancy can be analyzed in the subsequent experiments to select the optimal feature subsets.
Figure 2

WYHXB parameter K selection.

Figure 3

NYWZ parameter K selection.

Figure 4

BlogData parameter K selection.

Figure 5

RBuild parameter K selection.

Figure 6

CCrime parameter K selection.

To analyze the model further, each dataset was randomly divided into a training set and a test set at a ratio of 6 : 4, and XGBoost [31], Lasso [32], FCBF-MIC [19], and the improved algorithm (CI_AMB) were used for training; regression experiments were then run on the test set (with model parameters consistent with the above experimental results), using RMSE as the model index. To ensure reliable results, each test was repeated 10 times and the average taken as the final experimental result. To verify the effect and effectiveness of feature selection, the original data were also evaluated with the GBDT and XGBoost regression models. The experimental results are shown in Tables 6 and 7:
Table 6

Comparison of experimental results between CI_AMB and other methods (RMSE evaluation index of GBDT).

(Each cell: number of features / RMSE; for CI_AMB, features are given as total (strongly correlated + weakly correlated nonredundant).)
Dataset | Original data | CI_AMB | XGBoost | Lasso | FCBF-MIC
WYHXB | 798 / 267.5115 | 80 (19 + 61) / 232.7352 | 83 / 269.1644 | 89 / 255.9661 | 15 / 265.0474
NYWZ | 10283 / 258.4021 | 220 (59 + 161) / 234.8831 | 212 / 263.3908 | 215 / 256.2172 | 60 / 265.2352
BlogData | 280 / 22.7247 | 48 (5 + 43) / 7.4822 | 43 / 14.5660 | 47 / 18.7933 | 9 / 24.2629
RBuild | 103 / 458.0302 | 35 (16 + 19) / 417.1441 | 23 / 458.2780 | 26 / 466.8546 | 3 / 461.7130
CCrime | 127 / 0.1067 | 37 (3 + 34) / 0.1091 | 37 / 0.1176 | 31 / 0.1121 | 5 / 0.1231
Average value | 201.3550 | 178.4708 | 201.1034 | 199.5887 | 203.2763
Table 7

Comparison of experimental results of CI_AMB with other methods (RMSE evaluation index of XGBoost).

(Each cell: number of features / RMSE; for CI_AMB, features are given as total (strongly correlated + weakly correlated nonredundant).)
Dataset | Original data | CI_AMB | XGBoost | Lasso | FCBF-MIC
WYHXB | 798 / 227.9061 | 80 (19 + 61) / 205.0669 | 83 / 221.8774 | 89 / 214.0560 | 15 / 229.7367
NYWZ | 10283 / 219.7160 | 220 (59 + 161) / 201.5748 | 212 / 220.3312 | 215 / 225.1712 | 60 / 225.1525
BlogData | 280 / 8.6356 | 48 (5 + 43) / 4.1587 | 43 / 10.0949 | 47 / 10.2909 | 9 / 10.8045
RBuild | 103 / 264.5195 | 35 (16 + 19) / 255.1114 | 23 / 269.8928 | 26 / 261.3095 | 3 / 278.6242
CCrime | 127 / 0.1447 | 37 (3 + 34) / 0.1443 | 37 / 0.1487 | 31 / 0.1483 | 5 / 0.1492
Average value | 144.1844 | 133.2112 | 144.4690 | 142.1952 | 148.8934
It can be seen from the experimental results in the tables above that CI_AMB feature selection was performed on the test sets of the five raw datasets, with the following results: the WYHXB data have 798 original features, and after redundant features are removed, the optimal feature subset contains 80 features, including 19 strongly correlated features and 61 weakly correlated nonredundant features; the NYWZ data have 10283 original features, and the selected optimal subset contains 220 features (59 strongly correlated, 161 weakly correlated nonredundant); BlogData has 280 original features, and the selected optimal subset contains 48 features (5 strongly correlated, 43 weakly correlated nonredundant); RBuild has 103 original features, and the selected optimal subset contains 35 features (16 strongly correlated, 19 weakly correlated nonredundant); and CCrime has 127 original features, and the selected optimal subset contains 37 features (3 strongly correlated, 34 weakly correlated nonredundant). It is worth noting that, after filtering irrelevant features and eliminating redundant ones, the strongly correlated and weakly correlated nonredundant features are distinguished by the degree of correlation between each feature and the target variable: a feature with a MIC score greater than 0.6 is strongly correlated; otherwise, it is weakly correlated nonredundant.
After CI_AMB feature selection, it can be found that (1) compared with the original data (without feature selection), the new method gives a slightly inferior result on the CCrime data (an error 0.0024 greater than that of the original data, using the RMSE of GBDT as the evaluation index, Table 6), but on the other datasets the results are better than those of the original data (see Tables 6 and 7); and (2) compared with XGBoost, Lasso, and FCBF-MIC, with a similar number of features, the RMSE values of the CI_AMB method are better than those of the other methods. To observe and compare the experimental results more intuitively, trend graphs of the two evaluation indicators (GBDT and XGBoost) were plotted (Figures 7 and 8) to reflect the overall fluctuation of the RMSE. Combining the tables with the results in Figures 7 and 8, the improved algorithm is generally superior to the other algorithms, indicating that the new model effectively removes the influence of irrelevant and redundant features. In summary, the improved algorithm not only filters out strongly correlated and weakly correlated nonredundant features well, but also improves the regression accuracy of the model to some extent.
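The evaluation protocol used above (random 6 : 4 split, 10 repetitions, mean RMSE) can be sketched as follows, assuming scikit-learn and synthetic stand-in data (GBDT only, for brevity):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def repeated_rmse(X, y, n_repeats=10):
    """Average test RMSE over repeated random 6:4 train/test splits."""
    errs = []
    for seed in range(n_repeats):
        Xtr, Xte, ytr, yte = train_test_split(
            X, y, test_size=0.4, random_state=seed)   # 6:4 split per repeat
        pred = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr).predict(Xte)
        errs.append(mean_squared_error(yte, pred) ** 0.5)
    return float(np.mean(errs))

# Synthetic stand-in data; the paper applies this protocol to each selected subset.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 5))
y = X[:, 0] + 0.2 * rng.normal(size=120)
print(round(repeated_rmse(X, y), 3))
```

Averaging over repeated random splits, as done here, is what guards the reported RMSE values against a single lucky or unlucky partition.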
Figure 7

The average RMSE trend of the five datasets (GBDT evaluation index).

Figure 8

The average RMSE trend of the five datasets (XGBoost evaluation index).

5. Conclusions

Aiming at the problem that basic experimental TCM data are high-dimensional with few samples and contain considerable irrelevant and redundant information, a hybrid feature selection method based on an iterative approximate Markov blanket is proposed. The method performs two-stage feature analysis, using the maximum information coefficient to filter unrelated features and the iterative approximate Markov blanket to cull redundant features, thereby screening the optimal feature subset. Experimental comparison on basic TCM data and UCI datasets shows that the improved algorithm significantly reduces the feature dimension and improves the interpretability of the model; it is an analysis method suitable for high-dimensional small-sample TCM data. In future work, we will continue to optimize the algorithm, and the reasonable setting of the relevant parameters when building the model can be studied further.
References (5 shown):

1.  Assessment of the orthogonality in two-dimensional separation systems using criteria defined by the maximal information coefficient.

Authors:  Ahmad Mani-Varnosfaderani; Mostafa Ghaemmaghami
Journal:  J Chromatogr A       Date:  2015-08-28       Impact factor: 4.759

2.  Fridge: Focused fine-tuning of ridge regression for personalized predictions.

Authors:  Kristoffer H Hellton; Nils Lid Hjort
Journal:  Stat Med       Date:  2018-01-03       Impact factor: 2.373

3.  Detecting novel associations in large data sets.

Authors:  David N Reshef; Yakir A Reshef; Hilary K Finucane; Sharon R Grossman; Gilean McVean; Peter J Turnbaugh; Eric S Lander; Michael Mitzenmacher; Pardis C Sabeti
Journal:  Science       Date:  2011-12-16       Impact factor: 47.728

4.  A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.

Authors:  Andrea Bommert; Jörg Rahnenführer; Michel Lang
Journal:  Comput Math Methods Med       Date:  2017-08-01       Impact factor: 2.238

5.  Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree.

Authors:  Chang Zhou; Hua Yu; Yijie Ding; Fei Guo; Xiu-Jun Gong
Journal:  PLoS One       Date:  2017-08-08       Impact factor: 3.240

