Diego Marcondes, Adilson Simonis, Junior Barrera.
Abstract
This paper uses a classical approach to feature selection: minimization of a cost function applied to estimated joint distributions. In this new formulation, however, the optimization search space is extended. The original search space is the Boolean lattice of feature sets (BLFS), while the extended one is a collection of Boolean lattices of ordered pairs (CBLOP), that is, pairs (features, associated value), indexed by the elements of the BLFS. With this approach, we may select not only the features that are most related to a variable Y, but also the values of the features that most influence the variable or that are most prone to be associated with a specific value of Y. A local formulation of Shannon's mutual information, which generalizes Shannon's original definition, is applied on a CBLOP to generate a multiple resolution scale for characterizing variable dependence, the Local Lift Dependence Scale (LLDS). The main contribution of this paper is to define and apply the LLDS to analyse local properties of joint distributions that are neglected by Shannon's classical global measure, in order to select features. This approach is applied to select features based on the dependence between: (i) the performance of students on university entrance exams and in the courses of their first semester at the university; (ii) the party of a congress representative and their votes on different matters; (iii) the cover type of terrains and several terrain properties.
Keywords: feature selection; local lift dependence scale; mutual information; variable dependence; variable selection
Year: 2018; PMID: 33265188; PMCID: PMC7512664; DOI: 10.3390/e20020097
Source DB: PubMed; Journal: Entropy (Basel); ISSN: 1099-4300; Impact factor: 2.524
Figure 1. Example of multi-resolution tree for feature selection. The circle nodes and solid lines form a BLFS. The rectangular nodes and dashed lines around each circle node form a Boolean lattice. The whole tree is a CBLOP.
Figure 2. Discretization of the joint performance on Mathematics and Physics of Statistics students that enrolled at the University of São Paulo in 2011 and 2012 by the tertiles of the Mahalanobis distance inside each year.
The Lift Function between the weighted mean grade, discretized by year, and the joint performance on Mathematics and Physics, discretized by the Mahalanobis distance inside each year, of Statistics students that enrolled at the University of São Paulo in 2011 and 2012. The numbers in parentheses represent the quantity of students in each category.
| Mathematics and Physics | Weighted Mean Grade: Tertile 1 | Weighted Mean Grade: Tertile 2 | Weighted Mean Grade: Tertile 3 | Relative Frequency |
|---|---|---|---|---|
| Tertile 1 | 0.975 (9) | 1.46 (13) | 0.563 (5) | 0.34 |
| Tertile 2 | 1.01 (9) | 0.935 (8) | 1.05 (9) | 0.33 |
| Tertile 3 | 1.01 (9) | 0.584 (5) | 1.4 (12) | 0.33 |
| Relative Frequency | 0.342 | 0.329 | 0.329 | 1 |
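The lift values in the table above can be reproduced from the student counts in parentheses. The sketch below assumes the usual definition of lift, L(x, y) = P(x, y) / (P(x) P(y)), estimated by relative frequencies. The names `counts` and `lift_table` are illustrative and do not come from the paper.

```python
# Counts in parentheses from the table above.
# Rows: Mathematics and Physics tertiles 1-3.
# Columns: weighted mean grade tertiles 1-3.
counts = [
    [9, 13, 5],
    [9, 8, 9],
    [9, 5, 12],
]


def lift_table(counts):
    """Estimate L(x, y) = P(x, y) / (P(x) * P(y)) from a table of counts.

    Assumption: probabilities are estimated by relative frequencies.
    """
    total = sum(sum(row) for row in counts)
    row_freq = [sum(row) / total for row in counts]
    col_freq = [sum(col) / total for col in zip(*counts)]
    return [
        [(n / total) / (row_freq[i] * col_freq[j]) for j, n in enumerate(row)]
        for i, row in enumerate(counts)
    ]


lift = lift_table(counts)
for row in lift:
    # Three significant digits, matching the precision used in the table.
    print([f"{value:.3g}" for value in row])
```

A lift above 1 means the pair of categories occurs more often than it would under independence, and a lift below 1 means it occurs less often.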
The Lift Function between the weighted mean grade and the joint performance on . The numbers in parentheses represent the quantity of students in each category.
| Performance in | Weighted Mean Grade: Tertile 1 | Weighted Mean Grade: Tertile 2 | Weighted Mean Grade: Tertile 3 | Relative Frequency |
|---|---|---|---|---|
| Tertile 1 | 1.33 (1277) | 1.1 (1018) | 0.566 (533) | 0.34 |
| Tertile 2 | 0.992 (921) | 1.06 (951) | 0.954 (871) | 0.33 |
| Tertile 3 | 0.669 (630) | 0.848 (775) | 1.49 (1377) | 0.33 |
| Relative Frequency | 0.339 | 0.329 | 0.333 | 1 |
The Lift Function between the weighted mean grade and the performance on Mathematics. The numbers in parentheses represent the quantity of students in each category.
| Performance in Mathematics | Weighted Mean Grade: Tertile 1 | Weighted Mean Grade: Tertile 2 | Weighted Mean Grade: Tertile 3 | Relative Frequency |
|---|---|---|---|---|
| Tertile 1 | 1.3 (1398) | 1.06 (1111) | 0.631 (667) | 0.38 |
| Tertile 2 | 0.935 (843) | 1.11 (972) | 0.956 (847) | 0.32 |
| Tertile 3 | 0.689 (587) | 0.8 (661) | 1.51 (1267) | 0.30 |
| Relative Frequency | 0.339 | 0.329 | 0.333 | 1 |
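Shannon's mutual information can be written as the expected value of the logarithm of the lift, which is why the lift acts as a local version of the global measure. The sketch below illustrates this standard identity with the counts of the Mathematics table above. It uses natural logarithms and relative-frequency estimates, both of which are assumptions made here, and the variable names are illustrative.

```python
import math

# Counts in parentheses from the Mathematics table above.
# Rows: performance in Mathematics, tertiles 1-3.
# Columns: weighted mean grade, tertiles 1-3.
counts = [
    [1398, 1111, 667],
    [843, 972, 847],
    [587, 661, 1267],
]

total = sum(sum(row) for row in counts)
row_freq = [sum(row) / total for row in counts]
col_freq = [sum(col) / total for col in zip(*counts)]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability cells."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Mutual information as the expectation of log-lift over the joint distribution.
mutual_information = 0.0
for i, row in enumerate(counts):
    for j, n in enumerate(row):
        if n == 0:
            continue  # empty cells contribute nothing to the sum
        p_xy = n / total
        lift = p_xy / (row_freq[i] * col_freq[j])
        mutual_information += p_xy * math.log(lift)

# Cross-check against the entropy form I(X, Y) = H(X) + H(Y) - H(X, Y).
joint = [n / total for row in counts for n in row]
check = entropy(row_freq) + entropy(col_freq) - entropy(joint)

print(f"mutual information (nats): {mutual_information:.4f}")
print(f"entropy-based check (nats): {check:.4f}")
```

The global measure averages over all cells, so it cannot show that the dependence here is concentrated in the corner cells, where the lift is furthest from 1.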
Features of the Congressional Voting Records dataset.
| ID | Matter (Feature) |
|---|---|
| HI | Handicapped infants |
| WP | Water project cost sharing |
| AB | Adoption of the budget resolution |
| PF | Physician fee freeze |
| SA | El Salvador aid |
| RG | Religious groups in schools |
| ST | Anti satellite test ban |
| AN | Aid to Nicaraguan contras |
| MM | MX missile |
| IM | Immigration |
| SC | Synfuels corporation cutback |
| ES | Education spending |
| SR | Superfund right to sue |
| CR | Crime |
| DF | Duty Free exports |
| EA | Export administration act South Africa |
Selected profiles obtained applying Algorithm 3 to the Congressional Voting Records dataset with the restriction that only the profiles with relative frequency greater than are considered. The instances with missing data were excluded at each iteration of the algorithm, i.e., L is calculated using only the instances that have all the observations on the features .
| Party | Features | LF | Profile | Sample Size |
|---|---|---|---|---|
| democrat | (AB,PF,SA,RG,MM,ES,SR,EA) | 1.94 | (y,n,n,n,y,n,n,y) | 277 |
| | (HI,AB,PF,SA,RG,MM,ES,SR,EA) | 1.94 | (y,y,n,n,n,y,n,n,y) | 275 |
| | (AB,PF,RG,ST,MM,ES,SR,EA) | 1.94 | (y,n,n,y,y,n,n,y) | 279 |
| | (HI,AB,PF,RG,ST,MM,ES,SR,EA) | 1.94 | (y,y,n,n,y,y,n,n,y) | 277 |
| | (AB,PF,SA,RG,ST,MM,ES,SR,EA) | 1.94 | (y,n,n,n,y,y,n,n,y) | 276 |
| | (HI,AB,PF,SA,RG,ST,MM,ES,SR,EA) | 1.94 | (y,y,n,n,n,y,y,n,n,y) | 274 |
| | (AB,PF,SA,RG,MM,ES,SR,CR,EA) | 1.94 | (y,n,n,n,y,n,n,n,y) | 275 |
| | (AB,PF,RG,ST,MM,ES,SR,CR,EA) | 1.94 | (y,n,n,y,y,n,n,n,y) | 276 |
| | (AB,PF,SA,RG,ST,MM,ES,SR,CR,EA) | 1.94 | (y,n,n,n,y,y,n,n,n,y) | 274 |
| | (AB,PF,RG,ST,MM,ES,SR,DF,EA) | 1.94 | (y,n,n,y,y,n,n,y,y) | 269 |
| | (AB,PF,SA,RG,ST,MM,ES,SR,DF,EA) | 1.94 | (y,n,n,n,y,y,n,n,y,y) | 266 |
| republican | (WP,PF,SC,ES,CR) | 2.65 | (n,y,n,y,y) | 342 |
| | (AB,PF,AN,SC,CR,DF) | 2.64 | (n,y,n,n,y,n) | 369 |
| | (PF,AN,IM,ES,CR,DF) | 2.64 | (y,n,y,y,y,n) | 361 |
| | (PF,AN,SC,CR,DF) | 2.64 | (y,n,n,y,n) | 373 |
| | (AB,PF,AN,SC,ES) | 2.63 | (n,y,n,n,y) | 376 |
| | (HI,AB,PF,AN,SC,ES) | 2.63 | (n,n,y,n,n,y) | 373 |
| | (AB,PF,AN,SC,ES,CR) | 2.63 | (n,y,n,n,y,y) | 368 |
| | (HI,AB,PF,AN,SC,ES,CR) | 2.63 | (n,n,y,n,n,y,y) | 365 |
| | (PF,AN,SC,DF) | 2.63 | (y,n,n,n) | 380 |
| | (AB,PF,AN,SC,DF) | 2.63 | (n,y,n,n,n) | 376 |
| | (PF,AN,IM,ES,DF) | 2.63 | (y,n,y,y,n) | 368 |
| | (AB,PF,AN,SC,ES,DF) | 2.63 | (n,y,n,n,y,n) | 360 |
| | (HI,AB,PF,AN,SC,CR,DF) | 2.63 | (n,n,y,n,n,y,n) | 365 |
| | (PF,AN,SC,ES,CR,DF) | 2.63 | (y,n,n,y,y,n) | 356 |
| | (AB,PF,AN,SC,ES,CR,DF) | 2.63 | (n,y,n,n,y,y,n) | 353 |
| | (HI,AB,PF,AN,SC,ES,CR,DF) | 2.63 | (n,n,y,n,n,y,y,n) | 350 |
y = yes; n = no.
Features of the Covertype dataset that are considered in this application.
| ID | Feature |
|---|---|
| E | Elevation |
| A | Aspect |
| S | Slope |
| HH | Horizontal distance to hydrology |
| HR | Horizontal distance to roadways |
| HF | Horizontal distance to fire points |
| H9 | Hillshade 9 a.m. |
| HN | Hillshade Noon |
| H3 | Hillshade 3 p.m. |
| VH | Vertical distance to hydrology |
The Lift Function between the cover type of the terrain and the features Elevation, Horizontal distance to hydrology and Horizontal distance to fire points discretized by the sample quintiles of the Mahalanobis distance to zero. The numbers in parentheses represent the sample size of each category.
| | Cover Type 1 | Cover Type 2 | Cover Type 3 | Cover Type 4 | Cover Type 5 | Cover Type 6 | Cover Type 7 | Relative Frequency |
|---|---|---|---|---|---|---|---|---|
| Quintile 1 | 0.0766 (3244) | 0.961 (54,473) | 4.94 (35,344) | 5 (2747) | 1.78 (3385) | 4.9 (17,010) | 0 (0) | 0.20 |
| Quintile 2 | 0.444 (18,816) | 1.6 (90,872) | 0.0573 (410) | 0 (0) | 2.98 (5663) | 0.103 (357) | 0.0205 (84) | 0.20 |
| Quintile 3 | 0.949 (40,195) | 1.33 (75,562) | 0 (0) | 0 (0) | 0.234 (445) | 0 (0) | 0 (0) | 0.20 |
| Quintile 4 | 1.66 (70,427) | 0.8 (45,314) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0.112 (461) | 0.20 |
| Quintile 5 | 1.87 (79,158) | 0.301 (17,080) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 4.87 (19,965) | 0.20 |
| Relative Frequency | 0.365 | 0.488 | 0.0615 | 0.00473 | 0.0163 | 0.0299 | 0.0353 | 1 |
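Reading the table above column by column shows which quintile is most strongly associated with each cover type. The sketch below takes the row with the largest lift in each column of the printed values. It only illustrates how to read the table and is not the paper's Algorithm 3; the names `lift` and `best` are illustrative.

```python
# Lift values copied from the table above.
# Each list holds the values for cover types 1-7, in order.
lift = {
    "Quintile 1": [0.0766, 0.961, 4.94, 5, 1.78, 4.9, 0],
    "Quintile 2": [0.444, 1.6, 0.0573, 0, 2.98, 0.103, 0.0205],
    "Quintile 3": [0.949, 1.33, 0, 0, 0.234, 0, 0],
    "Quintile 4": [1.66, 0.8, 0, 0, 0, 0, 0.112],
    "Quintile 5": [1.87, 0.301, 0, 0, 0, 0, 4.87],
}

# For each cover type, find the quintile with the largest lift.
best = {}
for cover_type in range(1, 8):
    column = cover_type - 1
    window = max(lift, key=lambda quintile: lift[quintile][column])
    best[cover_type] = (window, lift[window][column])

for cover_type, (window, value) in best.items():
    print(f"cover type {cover_type}: {window} (lift {value})")
```

For the feature set (E,HH,HF), cover types 3, 4 and 6 peak in Quintile 1 and cover types 1 and 7 peak in Quintile 5, so the two extreme quintiles carry most of the local dependence.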
Features selected applying Algorithm 2 to the Covertype dataset.
| Features | Window | |
|---|---|---|
| (E,HH,HF) | Quintile 5 | 0.38 |
| (E,A,HH,HF) | Quintile 5 | 0.38 |
| (E,HH) | Quintile 5 | 0.37 |
| (E,A,HH) | Quintile 5 | 0.37 |
| (E,HH,VH,HF) | Quintile 5 | 0.37 |
| (E,A,HH,VH,HF) | Quintile 5 | 0.36 |
| (E,HH,HF) | Quintiles 1 & 5 | 0.36 |
| (E,A,HH,VH) | Quintile 5 | 0.36 |
| (E,HH,VH) | Quintile 5 | 0.36 |
| (E,HH) | Quintiles 1 & 5 | 0.36 |
Profiles selected applying Algorithm 3 to the Covertype dataset for .
| Cover Type | Features | LF Maximum | Profile |
|---|---|---|---|
| 1 | (E,HH,HF) | 1.87 | Quintile 5 |
| 2 | (E,HH,HR) | 1.63 | Quintile 2 |
| 3 | (E,HH,HR,HF) | 4.96 | Quintile 1 |
| 4 | (E,HH,HF) | 5 | Quintile 1 |
| 5 | (E,HR,HF) | 3.31 | Quintile 2 |
| 6 | (E,HH,HF) | 4.90 | Quintile 1 |
| 7 | (E,HH) | 4.89 | Quintile 5 |
Among other profiles.