Abhik Ghosh, María Jaenada, Leandro Pardo.
Abstract
Coronavirus disease 2019 (COVID-19) has triggered a global pandemic affecting millions of people. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, is hypothesized to enter the human body through the airway epithelium, where it initiates a host response. The expression levels of upper-airway genes that interact with SARS-CoV-2 could therefore be a telltale sign of infection. However, gene expression data are suspected of containing contamination errors introduced by the techniques used to extract them, and clinical diagnoses may contain labelling errors because diagnostic tests have imperfect sensitivity and specificity. We propose fitting a regularized logistic regression model as a classifier for COVID-19 diagnosis; the model simultaneously identifies genes related to the disease and predicts COVID-19 cases from the expression values of the selected genes. We apply robust estimation methods based on the density power divergence (DPD) to obtain results that remain stable under contamination or labelling errors in the data, and we compare their performance with that of the classical maximum likelihood estimator under different penalties, including the LASSO and the general adaptive LASSO.
Keywords: COVID-19; Density power divergence; Gene expression; High-dimensional data; Sparse logistic regression
Year: 2022; PMID: 36164412; PMCID: PMC9491676; DOI: 10.1007/s42519-022-00295-3
Source DB: PubMed; Journal: J Stat Theory Pract; ISSN: 1559-8608
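To make the estimator described in the abstract concrete, the sketch below writes out a DPD loss for a logistic (Bernoulli) model together with a weighted L1 penalty. It is an illustration under stated assumptions, not the authors' implementation: the function names, the argument names `tuning` and `lam`, the omission of an intercept, and the way penalty weights are passed in are all choices made here. The per-observation loss is the standard density power divergence for a Bernoulli response with success probability p and tuning value t > 0, namely p^(1+t) + (1-p)^(1+t) - (1 + 1/t) f(y)^t, where f(y) is the model probability of the observed label.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dpd_logistic_loss(coefs, X, y, tuning):
    """Average DPD loss of a logistic model (hypothetical helper, tuning > 0)."""
    if tuning <= 0:
        raise ValueError("tuning must be positive")
    total = 0.0
    for row, label in zip(X, y):
        # Linear predictor without intercept (an assumption of this sketch).
        p = sigmoid(sum(c * x for c, x in zip(coefs, row)))
        # Model probability assigned to the label that was actually observed.
        model_prob = p if label == 1 else 1.0 - p
        total += (p ** (1 + tuning) + (1 - p) ** (1 + tuning)
                  - (1 + 1 / tuning) * model_prob ** tuning)
    return total / len(y)


def penalized_dpd_objective(coefs, X, y, tuning, lam, weights=None):
    """DPD loss plus a weighted L1 penalty.

    weights=None gives the plain LASSO penalty; passing per-coefficient
    weights gives an adaptive-LASSO-style penalty (weights are assumed given).
    """
    if weights is None:
        weights = [1.0] * len(coefs)
    penalty = lam * sum(w * abs(c) for w, c in zip(weights, coefs))
    return dpd_logistic_loss(coefs, X, y, tuning) + penalty
```

For any fixed positive tuning value the per-observation loss is bounded, unlike the negative log-likelihood, so a single mislabelled sample cannot dominate the objective. That boundedness is the mechanism behind the robustness the abstract refers to.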
Accuracy measures when training the logistic regression model with uncontaminated data
| Method | Training with all data | | | | Training with subsamples | | | |
|---|---|---|---|---|---|---|---|---|
| | MS | Rate | TP | TN | MS | Rate | TP | TN |
| LASSO | 24 | 0.950 | 0.926 | 0.965 | 18.400 | 0.908 | 0.872 | 0.932 |
| Ad LASSO | 9 | 0.929 | 0.904 | 0.944 | 7.600 | 0.903 | 0.868 | 0.925 |
| AW-LASSO | 24 | 0.929 | 0.904 | 0.944 | 7.600 | 0.903 | 0.868 | 0.925 |
| Ad DPD-LASSO | 12 | 0.954 | 0.936 | 0.965 | 9.8 | 0.932 | 0.915 | 0.943 |
| Ad DPD-LASSO | 11 | 0.954 | 0.936 | 0.965 | 9.6 | 0.937 | 0.915 | 0.951 |
| Ad DPD-LASSO | 9 | 0.950 | 0.915 | 0.972 | 9.0 | 0.935 | 0.917 | 0.947 |
| Ad DPD-LASSO | 9 | 0.950 | 0.915 | 0.972 | 9.2 | 0.940 | 0.915 | 0.957 |
| Ad DPD-LASSO | 9 | 0.950 | 0.915 | 0.972 | 7.8 | 0.939 | 0.911 | 0.957 |
| AW DPD-LASSO | 18 | 0.958 | 0.947 | 0.965 | 11.8 | 0.930 | 0.909 | 0.944 |
| AW DPD-LASSO | 18 | 0.958 | 0.936 | 0.972 | 11.8 | 0.933 | 0.913 | 0.946 |
| AW DPD-LASSO | 19 | 0.958 | 0.936 | 0.972 | 11.6 | 0.933 | 0.911 | 0.947 |
| AW DPD-LASSO | 19 | 0.950 | 0.926 | 0.965 | 11.6 | 0.934 | 0.909 | 0.950 |
| AW DPD-LASSO | 19 | 0.950 | 0.926 | 0.965 | 11.8 | 0.935 | 0.913 | 0.950 |
| LASSO | 12 | 0.761 | 0.468 | 0.951 | 12.800 | 0.759 | 0.457 | 0.956 |
| Ad LASSO | 7 | 0.777 | 0.553 | 0.924 | 6.600 | 0.781 | 0.568 | 0.919 |
| AW-LASSO | 7 | 0.777 | 0.553 | 0.924 | 6.600 | 0.781 | 0.568 | 0.919 |
| Ad DPD-LASSO | 6 | 0.845 | 0.766 | 0.896 | 8.0 | 0.820 | 0.679 | 0.912 |
| Ad DPD-LASSO | 6 | 0.845 | 0.766 | 0.896 | 7.8 | 0.785 | 0.545 | 0.942 |
| Ad DPD-LASSO | 6 | 0.840 | 0.755 | 0.896 | 7.0 | 0.816 | 0.662 | 0.917 |
| Ad DPD-LASSO | 6 | 0.840 | 0.755 | 0.896 | 7.0 | 0.787 | 0.549 | 0.942 |
| Ad DPD-LASSO | 6 | 0.845 | 0.766 | 0.896 | 6.6 | 0.813 | 0.660 | 0.914 |
| AW DPD-LASSO | 7 | 0.840 | 0.745 | 0.903 | 6.6 | 0.821 | 0.732 | 0.879 |
| AW DPD-LASSO | 7 | 0.840 | 0.755 | 0.896 | 6.8 | 0.829 | 0.747 | 0.882 |
| AW DPD-LASSO | 7 | 0.845 | 0.766 | 0.896 | 6.4 | 0.822 | 0.730 | 0.882 |
| AW DPD-LASSO | 7 | 0.845 | 0.766 | 0.896 | 6.4 | 0.817 | 0.717 | 0.882 |
| AW DPD-LASSO | 7 | 0.840 | 0.745 | 0.903 | 6.0 | 0.814 | 0.711 | 0.882 |
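The column headings are not defined in this extract. Assuming MS is the model size (number of nonzero coefficients), Rate is the overall proportion of correctly classified samples, and TP and TN are the true positive and true negative rates, the measures could be computed as in the following sketch. The function names and the 0/1 label coding are choices made for illustration.

```python
def model_size(coefs, tol=0.0):
    """Number of coefficients whose magnitude exceeds tol (assumed meaning of MS)."""
    return sum(1 for c in coefs if abs(c) > tol)


def accuracy_measures(y_true, y_pred):
    """Return Rate, TP and TN for binary labels coded as 0/1."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "Rate": (true_pos + true_neg) / len(y_true),  # overall accuracy
        "TP": true_pos / positives,                   # sensitivity
        "TN": true_neg / negatives,                   # specificity
    }
```

Reporting TP and TN alongside Rate matters when the classes are unbalanced, because a classifier can reach a high overall rate while missing most cases of the smaller class.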
Fig. 1 AUC for the different methods with uncontaminated (top) and contaminated (bottom) data
Fig. 2 Venn diagrams of gene sets selected by penalized DPD-based methods for different values of under uncontaminated and contaminated data
Fig. 3 Venn diagrams of gene sets selected by different methods under uncontaminated and contaminated data
Fig. 4 Correlation between the genes identified with DPD-based methods
Estimated coefficients and odds ratios (OR) associated with the selected genes with adaptive penalized DPD-based methods
| Gene name | Coef. | OR | Coef. | OR | Coef. | OR | Coef. | OR | Coef. | OR |
|---|---|---|---|---|---|---|---|---|---|---|
| IFI6 | 1.42 | 4.14 | 1.75 | 5.76 | 1.22 | 3.37 | 1.27 | 3.56 | 1.31 | 3.71 |
| RGPD2 | | 0.59 | | 0.49 | | 0.63 | | 0.61 | | 0.60 |
| PLK4 | 0.50 | 1.64 | 0.68 | 1.97 | 0.48 | 1.62 | 0.52 | 1.68 | 0.55 | 1.74 |
| DGKI | 0.27 | 1.32 | 0.44 | 1.55 | – | – | – | – | – | – |
| TIMP1 | | 0.33 | | 0.25 | | 0.31 | | 0.30 | | 0.29 |
| TRO | 0.74 | 2.10 | 0.85 | 2.35 | 0.80 | 2.22 | 0.81 | 2.25 | 0.80 | 2.23 |
| FAM83A | 0.57 | 1.76 | 0.92 | 2.51 | 0.57 | 1.76 | 0.61 | 1.84 | 0.67 | 1.95 |
| KRT13 | | 0.72 | | 0.65 | | 0.81 | | 0.79 | | 0.76 |
| IGLL5 | 0.33 | 1.40 | 0.43 | 1.53 | 0.20 | 1.23 | 0.22 | 1.25 | 0.26 | 1.30 |
| SPECC1L-ADORA2A | | 0.71 | | 0.66 | – | – | – | – | – | – |
| HBA1 | | 0.78 | – | – | – | – | – | – | – | – |
| IFI6 | 1.25 | 3.49 | 1.41 | 4.10 | 1.48 | 4.40 | 0.95 | 2.59 | 1.00 | 2.71 |
| IFI44L | 0.31 | 1.37 | 0.41 | 1.50 | 0.57 | 1.77 | 0.26 | 1.29 | 0.28 | 1.32 |
| RGPD2 | | 0.64 | | 0.55 | | 0.51 | | 0.65 | | 0.64 |
| PPEF2 | 0.52 | 1.69 | 0.63 | 1.89 | 0.72 | 2.05 | 0.42 | 1.52 | 0.44 | 1.56 |
| PLK4 | 0.79 | 2.20 | 0.97 | 2.64 | 1.26 | 3.54 | 0.58 | 1.78 | 0.63 | 1.87 |
| DGKI | 0.55 | 1.73 | 0.77 | 2.17 | 0.91 | 2.47 | 0.46 | 1.58 | 0.49 | 1.63 |
| TIMP1 | | 0.38 | | 0.33 | | 0.32 | | 0.42 | | 0.41 |
| TRO | 0.42 | 1.52 | 0.49 | 1.63 | 0.50 | 1.64 | 0.34 | 1.41 | 0.36 | 1.43 |
| FAM83A | 0.60 | 1.82 | 0.73 | 2.08 | 0.70 | 2.02 | 0.49 | 1.63 | 0.51 | 1.67 |
| ADM | 0.32 | 1.37 | 0.43 | 1.54 | 0.65 | 1.92 | 0.23 | 1.26 | 0.27 | 1.31 |
| WDR74 | 0.17 | 1.19 | 0.18 | 1.20 | 0.27 | 1.31 | 0.16 | 1.17 | 0.17 | 1.18 |
| HBA1 | | 0.67 | | 0.61 | | 0.51 | | 0.84 | | 0.81 |
| DCUN1D3 | | 0.97 | | 0.96 | | 0.83 | | 0.96 | | 0.94 |
| KRT13 | | 0.69 | | 0.62 | | 0.55 | | 0.74 | | 0.72 |
| ICAM4 | | 0.78 | | 0.65 | | 0.58 | | 0.76 | | 0.74 |
| IGLL5 | 0.39 | 1.48 | 0.47 | 1.60 | 0.60 | 1.81 | 0.28 | 1.32 | 0.31 | 1.36 |
| SPECC1L-ADORA2A | | 0.65 | | 0.59 | | 0.50 | | 0.77 | | 0.74 |
| SMARCA1 | – | – | – | – | 0.26 | 1.30 | – | – | – | – |
| AL928654.3 | – | – | – | – | – | – | | 0.95 | | 0.96 |
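In a logistic regression model, the odds ratio for a one-unit increase in a covariate is the exponential of its coefficient, so each OR column can be checked against the adjacent Coef. column. A minimal sketch follows; the function names are illustrative, and because the printed values are rounded to two decimals, a recomputed OR can differ from the table in the last digit.

```python
import math


def odds_ratio(coef):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coef)


def coefficient_from_or(or_value):
    """Coefficient implied by an odds ratio (the log-odds ratio)."""
    if or_value <= 0:
        raise ValueError("an odds ratio must be positive")
    return math.log(or_value)


# IFI6, first column of the table: coefficient 1.42, reported OR 4.14.
print(round(odds_ratio(1.42), 2))

# An odds ratio below 1 always corresponds to a negative coefficient.
print(coefficient_from_or(0.59) < 0)
```

Read this way, genes with OR above 1 raise the estimated odds of a positive classification as their expression increases, and genes with OR below 1 lower them.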
Accuracy measures when training the logistic regression model with uncontaminated data for the problem of differentiating between COVID-19 and other viruses
| Method | Training with all data | | | | Training with subsamples | | | |
|---|---|---|---|---|---|---|---|---|
| | MS | Rate | TP | TN | MS | Rate | TP | TN |
| LASSO | 17 | 0.919 | 0.989 | 0.756 | 12.600 | 0.902 | 0.977 | 0.732 |
| Ad LASSO | 6 | 0.904 | 0.968 | 0.756 | 6.800 | 0.911 | 0.957 | 0.805 |
| AW-LASSO | 23 | 0.956 | 0.989 | 0.878 | 6.800 | 0.911 | 0.957 | 0.805 |
| Ad DPD-LASSO | 8 | 0.963 | 0.968 | 0.951 | 8.600 | 0.932 | 0.957 | 0.873 |
| Ad DPD-LASSO | 10 | 0.970 | 0.989 | 0.927 | 7.800 | 0.930 | 0.968 | 0.844 |
| Ad DPD-LASSO | 10 | 0.970 | 0.989 | 0.927 | 7.600 | 0.935 | 0.974 | 0.844 |
| Ad DPD-LASSO | 10 | 0.970 | 0.989 | 0.927 | 7.400 | 0.933 | 0.972 | 0.844 |
| Ad DPD-LASSO | 8 | 0.956 | 0.979 | 0.902 | 6.800 | 0.939 | 0.968 | 0.873 |
| AW DPD-LASSO | 9 | 0.963 | 0.979 | 0.927 | 9.600 | 0.947 | 0.972 | 0.888 |
| AW DPD-LASSO | 10 | 0.963 | 0.979 | 0.927 | 9.800 | 0.945 | 0.968 | 0.893 |
| AW DPD-LASSO | 10 | 0.963 | 0.979 | 0.927 | 9.000 | 0.942 | 0.970 | 0.878 |
| AW DPD-LASSO | 10 | 0.963 | 0.979 | 0.927 | 8.800 | 0.942 | 0.970 | 0.878 |
| AW DPD-LASSO | 10 | 0.963 | 0.979 | 0.927 | 9.200 | 0.932 | 0.964 | 0.859 |
| LASSO | 5 | 0.807 | 0.989 | 0.390 | 3.2 | 0.750 | 0.991 | 0.195 |
| Ad LASSO | 4 | 0.844 | 0.957 | 0.585 | 2.2 | 0.753 | 0.989 | 0.210 |
| AW-LASSO | 4 | 0.844 | 0.957 | 0.585 | 2.2 | 0.753 | 0.989 | 0.210 |
| Ad DPD-LASSO | 5 | 0.881 | 0.926 | 0.780 | 3.4 | 0.760 | 0.974 | 0.268 |
| Ad DPD-LASSO | 5 | 0.881 | 0.926 | 0.780 | 3.0 | 0.759 | 0.974 | 0.263 |
| Ad DPD-LASSO | 5 | 0.807 | 0.989 | 0.390 | 3.0 | 0.759 | 0.974 | 0.263 |
| Ad DPD-LASSO | 5 | 0.807 | 0.989 | 0.390 | 3.0 | 0.759 | 0.974 | 0.263 |
| Ad DPD-LASSO | 5 | 0.807 | 0.989 | 0.390 | 3.0 | 0.759 | 0.974 | 0.263 |
| AW DPD-LASSO | 5 | 0.904 | 0.947 | 0.805 | 9.2 | 0.855 | 0.879 | 0.800 |
| AW DPD-LASSO | 5 | 0.904 | 0.947 | 0.805 | 7.6 | 0.855 | 0.883 | 0.790 |
| AW DPD-LASSO | 5 | 0.904 | 0.947 | 0.805 | 6.4 | 0.870 | 0.906 | 0.785 |
| AW DPD-LASSO | 5 | 0.904 | 0.947 | 0.805 | 4.8 | 0.862 | 0.923 | 0.722 |
| AW DPD-LASSO | 5 | 0.904 | 0.947 | 0.805 | 4.4 | 0.865 | 0.930 | 0.717 |
Estimated coefficients and OR associated with the selected genes with adaptive penalized DPD-based methods for differentiating between viral ARIs
| Gene name | Coef. | OR | Coef. | OR | Coef. | OR | Coef. | OR | Coef. | OR |
|---|---|---|---|---|---|---|---|---|---|---|
| LGR6 | 1.17 | 3.23 | 0.74 | 2.10 | 0.76 | 2.15 | 0.88 | 2.41 | 0.89 | 2.43 |
| TIMP1 | 0.43 | 0.42 | 0.36 | 0.40 | ||||||
| TRO | 2.17 | 1.61 | 4.98 | 1.58 | 4.85 | 1.60 | 4.95 | 1.74 | 5.69 | |
| SMARCA1 | 1.13 | 1.13 | 3.09 | 1.09 | 2.98 | 1.04 | 2.83 | 1.21 | 3.35 | |
| WDR74 | 0.79 | 0.90 | 2.45 | 1.04 | 2.82 | 0.91 | 2.48 | 1.10 | 2.99 | |
| AL928654.3 | 0.66 | 0.61 | 0.76 | 0.62 | ||||||
| ICAM4 | 0.48 | 0.42 | 0.41 | 0.38 | ||||||
| IGLL5 | 0.03 | 1.03 | 0.11 | 1.12 | 0.05 | 1.06 | – | – | 0.06 | 1.06 |
| GSTA2 | 0.34 | 1.40 | 0.34 | 1.40 | – | – | 0.30 | 1.35 | ||
| LGR6 | 1.26 | 3.52 | 0.44 | 1.55 | 0.48 | 1.61 | 0.53 | 1.70 | 0.67 | 1.96 |
| GSTA2 | 0.35 | 1.42 | 0.54 | 1.71 | 0.59 | 1.80 | 0.63 | 1.88 | 0.77 | 2.17 |
| TRO | 1.90 | 6.68 | 0.82 | 2.28 | 0.92 | 2.51 | 1.03 | 2.81 | 1.44 | 4.23 |
| SMARCA1 | 1.36 | 3.90 | 1.23 | 3.41 | 1.34 | 3.82 | 1.46 | 4.31 | 1.85 | 6.37 |
| WDR74 | 0.99 | 2.69 | 0.62 | 1.86 | 0.67 | 1.95 | 0.72 | 2.05 | 0.85 | 2.33 |
| IGLL5 | 0.38 | 1.47 | 0.40 | 1.48 | 0.43 | 1.54 | 0.47 | 1.59 | 0.56 | 1.76 |
| TIMP1 | | 0.28 | – | – | – | – | – | – | – | – |
| ICAM4 | | 0.41 | – | – | – | – | – | – | – | – |
| PLEK | – | – | | 0.85 | | 0.84 | | 0.82 | | 0.79 |
| PDGFRB | – | – | | 0.77 | | 0.76 | | 0.75 | | 0.75 |
| PCSK5 | – | – | 0.03 | 1.03 | 0.03 | 1.03 | 0.04 | 1.04 | 0.04 | 1.04 |
Estimated coefficients and OR associated with the selected genes with penalized MLE methods
| Gene name | LASSO | | Ad-LASSO | | AW-LASSO | |
|---|---|---|---|---|---|---|
| | Coef. | OR | Coef. | OR | Coef. | OR |
| IFI6 | | 0.98 | 1.02 | 2.78 | 0.71 | 2.03 |
| LGR6 | 0.28 | 1.33 | 0.04 | 1.04 | 0.00 | 1.00 |
| RGPD2 | | 0.76 | | 0.99 | | 0.83 |
| PPEF2 | 0.04 | 1.04 | 0.08 | 1.09 | 0.07 | 1.07 |
| PLK4 | | 0.99 | 0.02 | 1.02 | 0.08 | 1.08 |
| TIMP1 | | 0.97 | | 0.62 | | 0.70 |
| TRO | 0.02 | 1.02 | 0.65 | 1.92 | 0.22 | 1.24 |
| SMARCA1 | 0.00 | 1.00 | 0.23 | 1.26 | 0.04 | 1.04 |
| FAM83A | | 1.00 | 0.08 | 1.08 | 0.09 | 1.10 |
| DCUN1D3 | | 0.88 | | 0.92 | | 0.85 |
| ICAM4 | 0.02 | 1.02 | | 0.97 | | 0.85 |
| GPR153 | – | – | | 0.98 | – | – |
| H2AC20 | – | – | 0.28 | 1.33 | – | – |
| GLUL | – | – | | 0.76 | – | – |
| AFF1 | – | – | | 0.97 | – | – |
| CASP3 | – | – | 0.00 | 1.00 | – | – |
| RNF39 | – | – | | 1.00 | – | – |
| CDKN1A | – | – | | 0.80 | – | – |
| FBXW2 | – | – | | 0.88 | – | – |
| RALGDS | – | – | | 0.99 | – | – |
| TOLLIP | – | – | | 0.92 | – | – |
| BORCS7 | – | – | 0.06 | 1.06 | – | – |
| CKAP2 | – | – | 0.02 | 1.02 | – | – |
| IFI44L | 1.02 | 2.78 | – | – | 0.09 | 1.09 |
| DGKI | 0.08 | 1.09 | – | – | 0.07 | 1.07 |
| PCSK5 | | 0.80 | – | – | 0.01 | 1.01 |
| ADM | | 0.62 | – | – | | 0.92 |
| WDR74 | 0.65 | 1.92 | – | – | 0.22 | 1.24 |
| AL928654.3 | 0.23 | 1.26 | – | – | | 0.92 |
| HBA1 | 0.08 | 1.08 | – | – | | 0.95 |
| EIF3CL | | 0.99 | – | – | | 0.98 |
| KRT13 | | 0.92 | – | – | | 0.90 |
| TGM3 | 0.06 | 1.06 | – | – | | 1.00 |
| IGLL5 | | 0.92 | – | – | 0.11 | 1.11 |
| SPECC1L-ADORA2A | | 0.97 | – | – | | 0.93 |
Estimated coefficients and OR associated with the selected genes with penalized MLE methods for differentiating between ARIs
| Gene name | LASSO | | Ad-LASSO | | AW-LASSO | |
|---|---|---|---|---|---|---|
| | Coef. | OR | Coef. | OR | Coef. | OR |
| LGR6 | 0.20 | 1.22 | 0.42 | 1.52 | 0.59 | 1.81 |
| TIMP1 | 0.98 | 0.76 | ||||
| TRO | 0.20 | 1.22 | 0.25 | 1.29 | 0.38 | 1.47 |
| SMARCA1 | 0.09 | 1.10 | 0.14 | 1.15 | ||
| WDR74 | 0.20 | 1.22 | 0.26 | 1.29 | 0.17 | 1.19 |
| GPR153 | – | – | – | – | | 0.92 |
| GIPC2 | – | – | – | – | 0.02 | 1.02 |
| GLUL | – | – | – | – | | 0.72 |
| ORC4 | – | – | – | – | 0.38 | 1.46 |
| ZSCAN23 | – | – | – | – | 0.07 | 1.07 |
| LYRM2 | – | – | – | – | 0.12 | 1.13 |
| PMPCB | – | – | – | – | 0.09 | 1.09 |
| CPNE3 | – | – | – | – | 0.04 | 1.04 |
| RALGDS | – | – | – | – | | 0.87 |
| BORCS7 | – | – | – | – | 0.14 | 1.15 |
| KIAA0586 | – | – | – | – | 0.04 | 1.04 |
| SNRPN | – | – | – | – | 0.07 | 1.08 |
| NUP88 | – | – | – | – | 0.07 | 1.07 |
| HS3ST3B1 | – | – | – | – | | 0.92 |
| R3HDM4 | – | – | – | – | | 0.94 |
| TRIP10 | – | – | – | – | | 0.97 |
| TPM4 | – | – | – | – | | 0.97 |
| PLEK | | 0.82 | | 0.79 | – | – |
| GSTA2 | 0.27 | 1.31 | 0.36 | 1.43 | – | – |
| PDGFRB | | 0.93 | – | – | – | – |
| NRSN1 | 0.11 | 1.11 | – | – | – | – |
| PCSK5 | 0.06 | 1.06 | – | – | – | – |
| HCAR2 | | 0.97 | – | – | – | – |
| AL928654.3 | | 0.98 | – | – | – | – |
| DCUN1D3 | | 0.96 | – | – | – | – |
| KRT13 | | 0.99 | – | – | – | – |
| ICAM4 | | 0.97 | – | – | – | – |
| IGLL5 | 0.08 | 1.08 | – | – | – | – |