Xiaotian Dai, Guifang Fu, Randall Reese.
Abstract
BACKGROUND: Feature screening plays a critical role in the analysis of ultrahigh dimensional data, where the number of features exponentially exceeds the number of observations. It is increasingly common in biomedical research to have a case-control (binary) response and an extremely large number of categorical features. However, approaches designed for such data types are limited in the extant literature. In this article, we propose a new feature screening approach based on the iterative trend correlation (ITC-SIS, for short) to detect important susceptibility loci associated with polycystic ovary syndrome (PCOS) affection status by screening 731,442 SNP features collected from genome-wide association studies.
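To make the idea of marginal feature screening concrete, the following is a minimal sketch of a generic correlation-based screen for a binary response and categorical features coded 0/1/2. It is not the ITC-SIS statistic, whose definition is not given in this excerpt; the absolute Pearson correlation is swapped in purely for illustration. The function name `marginal_screen`, the simulated data, and the choice of feature 0 as the signal are all assumptions.

```python
import numpy as np

def marginal_screen(X, y, d):
    """Rank features by |Pearson correlation| with y and keep the top d.

    Illustrative stand-in for a screening utility; NOT the ITC-SIS score.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the response
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features have zero variance; give them score 0 instead of NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        scores = np.where(denom > 0, np.abs(Xc.T @ yc) / denom, 0.0)
    top = np.argsort(-scores, kind="stable")[:d]   # indices of the d largest scores
    return top, scores

# Hypothetical toy data: n observations, p SNP-like features coded 0, 1, 2.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.integers(0, 3, size=(n, p))
logit = 1.5 * (X[:, 0] - 1)          # assumed signal: only feature 0 matters
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

selected, scores = marginal_screen(X, y, d=20)
```

Each feature is scored on its own, so the cost is linear in the number of features, which is what makes this kind of screen feasible at the scale of hundreds of thousands of SNPs.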
Keywords: Categorical data analysis; Feature screening; GWAS; Sure screening consistency; Ultrahigh dimensionality
Year: 2020 PMID: 32366216 PMCID: PMC7199379 DOI: 10.1186/s12859-020-3492-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Success rates of four feature screening approaches in selecting each, and all, of the truly influential features X within thresholds d = 20, 40, 60 for Simulation Study 1
| Run Time | ||||||||
| DC-SIS | 0 | 20927.89 | 0.45 | 0.46 | 0.27 | 0.60 | 0.48 | |
| MMLE-SIS | 0 | 13982.14 | 0.77 | 0.78 | 0.25 | 0.39 | 0.78 | |
| TC-SIS | 0.03 | 28202.48 | 0.99 | 0.67 | 0.29 | 0.77 | 0.45 | 5.042s |
| ITC-SIS | 0.22 | 31.69 | 1.00 | 0.99 | 0.47 | 0.76 | 0.99 | 11.025s |
| Run Time | ||||||||
| DC-SIS | 0 | 20927.89 | 0.54 | 0.55 | 0.30 | 0.61 | 0.56 | |
| MMLE-SIS | 0 | 13982.14 | 0.79 | 0.79 | 0.27 | 0.39 | 0.79 | |
| TC-SIS | 0.03 | 28202.48 | 0.99 | 0.67 | 0.31 | 0.77 | 0.45 | 5.042s |
| ITC-SIS | 0.89 | 31.69 | 1.00 | 1.00 | 0.95 | 0.94 | 1.00 | 11.025s |
| Run Time | ||||||||
| DC-SIS | 0 | 20927.89 | 0.61 | 0.62 | 0.31 | 0.62 | 0.63 | |
| MMLE-SIS | 0 | 13982.14 | 0.79 | 0.79 | 0.27 | 0.39 | 0.79 | |
| TC-SIS | 0.03 | 28202.48 | 0.99 | 0.67 | 0.31 | 0.78 | 0.45 | 5.042s |
| ITC-SIS | 0.94 | 31.69 | 1.00 | 1.00 | 0.97 | 0.97 | 1.00 | 11.025s |
Values of π used to simulate data in Simulation Study 2
| 0.3 | 0.4 | 0.6 | 0.7 | 0.2 | 0.4 | 0.3 | 0.8 | 0.4 | 0.2 | |
| 0.6 | 0.1 | 0.1 | 0.4 | 0.8 | 0.7 | 0.9 | 0.2 | 0.7 | 0.6 |
Success rates of four feature screening approaches in selecting each truly influential feature X within threshold d = 15 for Simulation Study 2
| MMLE-SIS | 150.340 | 0.384 | 0.746 | 0.756 | 0.404 | 0.822 | 0.844 | 0.354 | 0.742 | 0.320 | 0.400 |
| DC-SIS | 64.990 | 0.900 | 0.984 | 0.990 | 0.898 | 0.998 | 0.998 | 0.894 | 0.984 | 0.888 | 0.894 |
| PC-SIS | 93.018 | 0.862 | 0.974 | 0.980 | 0.864 | 0.994 | 0.998 | 0.860 | 0.966 | 0.854 | 0.862 |
| TC-SIS | 54.674 | 0.916 | 0.988 | 0.994 | 0.922 | 1.000 | 0.998 | 0.912 | 0.982 | 0.904 | 0.908 |
Values of (κ,κ) used to simulate data in Simulation Study 3
| 0 | 0 | 0.2 | 0 | -0.2 | 0.2 | 0 | 0.1 | -0.2 | 0.2 | |
| 0.7 | 1 | 0.8 | 0.9 | 1.2 | 1 | 1 | 1 | 1.2 | 0.8 |
Success rates of four feature screening approaches in selecting each truly influential feature X within threshold d = 15 for Simulation Study 3
| MMLE-SIS | 508.672 | 0.072 | 0.060 | 0.066 | 0.078 | 0.554 | 0.054 | 0.388 | 0.098 | 0.032 | 0.204 |
| DC-SIS | 125.258 | 0.876 | 0.884 | 0.886 | 0.904 | 0.886 | 0.880 | 0.880 | 0.910 | 0.878 | 0.906 |
| PC-SIS | 171.829 | 0.820 | 0.806 | 0.816 | 0.842 | 0.876 | 0.806 | 0.876 | 0.830 | 0.820 | 0.858 |
| TC-SIS | 112.627 | 0.876 | 0.876 | 0.882 | 0.904 | 0.924 | 0.874 | 0.920 | 0.906 | 0.878 | 0.900 |
Values of used to simulate data in Simulation Study 4
| 0 | -5 | 2 | -6 | 1 | |
| 3 | -3 | 4 | -4 | 3 | |
| 5 | -1 | 6 | -2 | 5 |
Success rates of four feature screening approaches in selecting each truly influential feature X within threshold d = 15 for Simulation Study 4
| MMLE-SIS | 41.934 | 1.000 | 0.856 | 0.868 | 0.842 | 0.870 |
| DC-SIS | 46.470 | 1.000 | 0.860 | 0.850 | 0.838 | 0.866 |
| PC-SIS | 93.270 | 1.000 | 0.794 | 0.758 | 0.778 | 0.790 |
| TC-SIS | 41.976 | 1.000 | 0.860 | 0.858 | 0.842 | 0.862 |
Fig. 1 The MSPE used to select d1 in Real Data Analyses
Model selection of three approaches applied in Real Data Analyses
| Two-stage Method | Model size | AIC | Misclassification Rate |
|---|---|---|---|
| DC-SIS + Multiple Logistic Regression | 70 | 4188.33 | 21.74% |
| TC-SIS + Multiple Logistic Regression | 86 | 3601.02 | 19.51% |
| ITC-SIS ( | 88 | 3581.96 | 19.27% |
Fig. 2 The Manhattan plot of the ITC scores in Real Data Analyses
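The AIC and misclassification rate reported in the model selection table above can be computed from a fitted logistic model's predicted probabilities. The sketch below uses the standard definition AIC = 2k - 2 log L with the Bernoulli log-likelihood. The 0.5 classification cutoff, the function names, and the example values are assumptions; the excerpt does not state how the authors computed these quantities or whether the model size counts the intercept.

```python
import numpy as np

def logistic_aic(y, p_hat, n_params):
    """AIC = 2k - 2 log L for a binary-response model with fitted probabilities p_hat."""
    y = np.asarray(y, dtype=float)
    # Clip to avoid log(0) when a fitted probability is exactly 0 or 1.
    p = np.clip(np.asarray(p_hat, dtype=float), 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return 2 * n_params - 2 * loglik

def misclassification_rate(y, p_hat, cutoff=0.5):
    """Share of observations whose thresholded prediction differs from y.

    The cutoff of 0.5 is an assumed default, not taken from the source.
    """
    pred = (np.asarray(p_hat, dtype=float) >= cutoff).astype(int)
    return float(np.mean(pred != np.asarray(y)))

# Hypothetical fitted probabilities for four observations.
y_obs = np.array([1, 0, 1, 0])
p_fit = np.array([0.9, 0.2, 0.4, 0.6])
aic = logistic_aic(y_obs, p_fit, n_params=2)
err = misclassification_rate(y_obs, p_fit)
```

A lower AIC indicates a better trade-off between fit and model size, which is how the three two-stage methods in the table are compared.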