Héctor Climente-González, Chloé-Agathe Azencott, Samuel Kaski, Makoto Yamada.
Abstract
MOTIVATION: Finding non-linear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have important drawbacks, including, among others, lack of parsimony, non-convexity and computational overhead. Here we propose block HSIC Lasso, a non-linear feature selector that avoids these drawbacks.
Year: 2019 PMID: 31510671 PMCID: PMC6612810 DOI: 10.1093/bioinformatics/btz333
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Summary description of benchmark datasets
| Type | Dataset | Features | Samples | Classes |
|---|---|---|---|---|
| Image | AR10P | 2400 | 130 | 10 |
| | PIE10P | 2400 | 210 | 10 |
| | PIX10P | 10 000 | 100 | 10 |
| | ORL10P | 10 000 | 100 | 10 |
| Microarray | CLL-SUB-111 | 11 340 | 111 | 3 |
| | GLIOMA | 4434 | 50 | 4 |
| | SMK-CAN-187 | 19 993 | 187 | 2 |
| | TOX-171 | 5748 | 171 | 4 |
| scRNA-seq | | 15 972 | 7216 | 19 |
| | | 25 393 | 13 302 | 8 |
| | | 23 395 | 1140 | 10 |
| GWA data | RA versus controls | 352 773 | 3451 | 2 |
| | T1D versus controls | 352 853 | 3443 | 2 |
| | T2D versus controls | 353 046 | 3456 | 2 |
Fig. 1. Percentage of true causal features extracted by different feature selectors. Each data point represents the mean over 10 replicates, and the error bars represent the standard error of the mean. Lines are discontinued when the algorithm required more memory than was provided (50 GB). Note that in some conditions mRMR's line cannot be seen because it overlaps with that of LARS.
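The error bars described in the Fig. 1 caption can be illustrated with a minimal sketch of the mean and standard error of the mean over replicates. The function name `mean_and_sem` and the replicate values are hypothetical and are not taken from the paper; the sketch assumes the sample standard deviation (n − 1 denominator) is used.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for replicate scores.

    Assumes the sample standard deviation (n - 1 denominator).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical percentages of causal features recovered in 10 replicates
# (illustrative values only, not results from the paper).
replicates = [80.0, 85.0, 90.0, 75.0, 80.0, 85.0, 95.0, 80.0, 85.0, 90.0]
mean, sem = mean_and_sem(replicates)
print(f"mean = {mean:.1f}, SEM = {sem:.2f}")
```

With 10 replicates per data point, the standard error is the sample standard deviation divided by the square root of 10.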
Fig. 2. Computational resources used by the different methods. (A) Time elapsed in a multiprocess setting. (B) Memory usage in a single-core setting. (C) Number of correct features retrieved on synthetic data (20 causal features) by block HSIC Lasso at different block sizes B and numbers of permutations M.
Fig. 3. Random forest classification accuracy of microarray gene expression samples after feature extraction by the different methods. The gray line represents the mean accuracy of 10 classifiers trained on the entire dataset.
Fig. 4. Manhattan plot of the GWA datasets using P-values from the genotypic test. A constant was added to all P-values to allow plotting P-values of 0. SNPs in black are the SNPs selected by block HSIC Lasso (B = 20), 10 per phenotype. When SNPs are located within the boundaries of a gene (±50 kb), the gene name is indicated. The red line represents the Bonferroni threshold.
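The two transformations mentioned in the Fig. 4 caption can be sketched as follows: a Bonferroni threshold, and the −log10 transform of P-values after adding a small constant so that P-values of 0 remain plottable. The significance level `alpha = 0.05` and the offset `1e-10` are assumptions made for illustration, since this record does not state either value. The function names are hypothetical, and using the number of SNPs in the RA dataset as the number of tests is also an assumption.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def neg_log10_p(p_value, offset):
    """-log10 of a P-value after adding a constant, so p = 0 stays finite."""
    return -math.log10(p_value + offset)


alpha = 0.05      # assumed significance level (not given in this record)
offset = 1e-10    # assumed constant added to P-values (not given in this record)
n_snps = 352_773  # feature count of the RA versus controls dataset

threshold = bonferroni_threshold(alpha, n_snps)
print(f"Bonferroni threshold: {threshold:.3e}")
print(f"Height of the threshold line: {neg_log10_p(threshold, 0.0):.2f}")
print(f"Height of a P-value of 0: {neg_log10_p(0.0, offset):.2f}")
```

On a Manhattan plot, the threshold line sits at −log10 of the Bonferroni threshold, and the offset caps the height at which a P-value of 0 is drawn.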