Md Shahjaman, Md Rezanur Rahman, S M Shahinul Islam, Md Nurul Haque Mollah.
Abstract
Background and objectives: Identifying cancer biomarkers that are differentially expressed (DE) between two biological conditions is an important task in many microarray studies. Several methods exist in the literature for this purpose, but most are designed for unpaired samples and are not suitable for paired samples. Furthermore, traditional methods use either p-values or fold change (FC) values to detect DE genes. However, p-value-based results sometimes disagree with FC-based results because of a small pooled variance of gene expressions, which occurs when the variance within each condition is small. Some methods combine p-values and FC values to address this problem, but they also perform poorly for small samples in the presence of outlying expressions. To overcome this problem, this paper proposes a hybrid robust SAM-FC approach for paired samples that combines the ranks of FC values with the ranks of p-values computed from the SAM statistic using the minimum β-divergence method. Materials and
Entities:
Keywords: DEGs; FC; cancer biomarkers; candidate drugs; minimum β-divergence estimation; p-value; paired samples; robustness
Year: 2019 PMID: 31212673 PMCID: PMC6631768 DOI: 10.3390/medicina55060269
Source DB: PubMed Journal: Medicina (Kaunas) ISSN: 1010-660X Impact factor: 2.430
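The abstract describes combining the rank of FC values with the rank of p-values. A minimal sketch of that idea follows, assuming the two ranks are simply summed; the exact combination rule, the function names `ranks` and `combined_rank_order`, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
def ranks(values, descending=False):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def combined_rank_order(log_fc, p_values):
    """Order genes by (rank of |log FC|, largest first) + (rank of p-value, smallest first).

    Summing the two ranks is an assumption made for illustration only.
    """
    fc_rank = ranks([abs(x) for x in log_fc], descending=True)
    p_rank = ranks(p_values)
    score = [a + b for a, b in zip(fc_rank, p_rank)]
    return sorted(range(len(score)), key=lambda i: score[i])


# Hypothetical toy data: gene 1 has the smallest p-value but a negligible fold change.
log_fc = [2.1, 0.1, -1.8, 0.3]
p_values = [0.001, 0.0005, 0.002, 0.6]
order = combined_rank_order(log_fc, p_values)
```

In this toy example gene 1 would top a p-value-only ranking despite its tiny fold change, whereas the combined ordering places gene 0 first, which is the kind of disagreement between p-values and FC values the abstract refers to.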
Figure 1. Plot of average p-values against FC values for different test procedures on the simulated dataset. Small sample cases (n1 = n2 = 3): (a) without outlying samples and (c) with one outlying sample. Large sample cases (n1 = n2 = 15): (b) without outlying samples and (d) with one or two outlying samples.
Figure 2. M-A plot and scatter plot of β-weights using data type 1 for the small sample case (n1 = n2 = 3). (a) Without outlying samples. (b) One outlying sample in each of 5% of the genes. (c) Scatter plot of the smallest β-weight for (a). (d) Scatter plot of the smallest β-weight for (b). Here, the smallest β-weight is the minimum of the β-weights of the n = 3 fold change expressions corresponding to the n1 = 3 samples of the first condition and the n2 = 3 samples of the second condition.
Figure 3. Plot of average false discovery rate (FDR) versus the top 300 differentially expressed genes (DEGs) estimated by different methods using data type 1. Small sample cases (n1 = n2 = 3): (a) without outlying samples and (c) with one outlying sample across the genome. Large sample cases (n1 = n2 = 15): (b) without outlying samples and (d) with one or two outlying samples across the genome.
Performance evaluation based on simulated gene expression profiles using data type 1.
| Methods | TPR | FPR | TNR | FNR | MER | FDR | AUC | pAUC |
|---|---|---|---|---|---|---|---|---|
| | 0.461 (0.132) | 0.017 (0.027) | 0.983 (0.973) | 0.539 (0.868) | 0.033 (0.052) | 0.538 (0.864) | 0.459 (0.131) | 0.090 (0.026) |
| | 0.889 (0.141) | 0.003 (0.026) | 0.997 (0.974) | 0.111 (0.859) | 0.007 (0.052) | 0.112 (0.854) | 0.889 (0.141) | 0.178 (0.028) |
| | 0.802 (0.145) | 0.006 (0.026) | 0.994 (0.974) | 0.198 (0.855) | 0.012 (0.051) | 0.202 (0.850) | 0.802 (0.145) | 0.160 (0.029) |
| | 0.924 (0.145) | 0.002 (0.026) | 0.998 (0.974) | 0.076 (0.855) | 0.005 (0.051) | 0.081 (0.850) | 0.924 (0.145) | 0.185 (0.029) |
| | 0.785 (0.002) | 0.007 (0.031) | 0.993 (0.969) | 0.215 (0.998) | 0.013 (0.060) | 0.214 (0.998) | 0.785 (0.002) | 0.157 (0.000) |
| | 0.926 (0.010) | 0.002 (0.031) | 0.998 (0.969) | 0.074 (0.990) | 0.005 (0.060) | 0.080 (0.989) | 0.926 (0.010) | 0.185 (0.002) |
| | 0.909 (0.146) | 0.003 (0.026) | 0.997 (0.974) | 0.091 (0.854) | 0.006 (0.051) | 0.096 (0.849) | 0.909 (0.146) | 0.182 (0.029) |
| | 0.924 (0.855) | 0.002 (0.010) | 0.998 (0.990) | 0.076 (0.145) | 0.005 (0.019) | 0.081 (0.153) | 0.924 (0.850) | 0.185 (0.166) |
| | | | | | | | | |
| | 0.927 (0.144) | 0.003 (0.026) | 0.997 (0.974) | 0.073 (0.856) | 0.005 (0.051) | 0.089 (0.852) | 0.927 (0.144) | 0.185 (0.028) |
| | 0.927 (0.893) | 0.003 (0.004) | 0.997 (0.996) | 0.073 (0.107) | 0.005 (0.007) | 0.089 (0.122) | 0.927 (0.893) | 0.185 (0.178) |
| | 0.927 (0.144) | 0.003 (0.026) | 0.997 (0.974) | 0.073 (0.856) | 0.005 (0.051) | 0.089 (0.521) | 0.927 (0.144) | 0.185 (0.029) |
| | 0.927 (0.145) | 0.003 (0.026) | 0.997 (0.974) | 0.073 (0.855) | 0.005 (0.051) | 0.089 (0.851) | 0.927 (0.145) | 0.185 (0.029) |
| | 0.909 (0.012) | 0.006 (0.031) | 0.993 (0.969) | 0.090 (0.988) | 0.011 (0.059) | 0.102 (0.990) | 0.909 (0.011) | 0.181 (0.002) |
| | 0.927 (0.926) | 0.003 (0.003) | 0.997 (0.997) | 0.073 (0.074) | 0.005 (0.005) | 0.089 (0.090) | 0.927 (0.926) | 0.185 (0.185) |
| | 0.927 (0.927) | 0.003 (0.003) | 0.997 (0.997) | 0.073 (0.073) | 0.005 (0.005) | 0.089 (0.089) | 0.927 (0.927) | 0.185 (0.185) |
| | 0.927 (0.927) | 0.003 (0.003) | 0.997 (0.997) | 0.073 (0.073) | 0.005 (0.005) | 0.089 (0.089) | 0.927 (0.927) | 0.185 (0.185) |
Average performance results of eight methods (t-test, SAM, LIMMA, Wilcoxon, WAD, RP, FCROS, and Proposed) based on 100 datasets of data type 1 for both the small and large sample cases (n1 = n2 = 3 and n1 = n2 = 15). Each dataset in each case included 300 true DEGs. The performance measures TPR, FPR, TNR, FNR, FDR, MER, AUC, and pAUC were calculated for each method based on its top 300 estimated DEGs, treating the remaining 9700 genes as equally expressed genes (EEGs). The pAUC was calculated at FPR = 0.2 for each method and each dataset. Values in parentheses are the average performance results in the presence of one or two outlying samples across the genome.
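The rate-type measures named in the caption can be illustrated with the standard confusion-matrix definitions. The sketch below assumes those standard definitions (with MER taken as the misclassification error rate, (FP + FN) divided by the total number of genes); the function name `top_k_metrics` and the example gene indices are hypothetical, and AUC and pAUC are not covered.

```python
def top_k_metrics(selected, true_degs, n_genes):
    """Confusion-matrix summaries for a set of selected genes.

    Standard definitions are assumed; they are not taken from the source.
    """
    selected, true_degs = set(selected), set(true_degs)
    tp = len(selected & true_degs)   # true DEGs that were selected
    fp = len(selected - true_degs)   # EEGs wrongly selected
    fn = len(true_degs - selected)   # true DEGs that were missed
    tn = n_genes - tp - fp - fn      # EEGs correctly left out
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),
        "FNR": fn / (tp + fn),
        "FDR": fp / (tp + fp),
        "MER": (fp + fn) / n_genes,
    }


# Hypothetical example: 10,000 genes, genes 0-299 are the true DEGs, and a
# method's top 300 list contains 150 of them plus 150 EEGs.
true_degs = range(300)
selected = list(range(150)) + list(range(300, 450))
metrics = top_k_metrics(selected, true_degs, n_genes=10_000)
```

When the number of selected genes equals the number of true DEGs, as in this setup, FP equals FN, so FDR and FNR coincide and TPR + FNR = 1.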
Performance evaluation based on spike gene expression profiles using data type 2 for sample size n1 = n2 = 9.
| Methods | TPR | TNR | FPR | FNR | FDR | MER | AUC | pAUC |
|---|---|---|---|---|---|---|---|---|
| | 0.818 (0.217) | 0.979 (0.909) | 0.021 (0.091) | 0.182 (0.783) | 0.182 (0.783) | 0.038 (0.163) | 0.816 (0.216) | 0.161 (0.043) |
| | 0.810 (0.540) | 0.978 (0.947) | 0.022 (0.053) | 0.190 (0.460) | 0.190 (0.460) | 0.040 (0.096) | 0.805 (0.534) | 0.158 (0.102) |
| | 0.833 (0.215) | 0.981 (0.909) | 0.019 (0.091) | 0.167 (0.785) | 0.167 (0.785) | 0.035 (0.163) | 0.832 (0.214) | 0.165 (0.042) |
| | 0.830 (0.217) | 0.980 (0.909) | 0.020 (0.091) | 0.170 (0.783) | 0.170 (0.783) | 0.035 (0.163) | 0.828 (0.216) | 0.164 (0.043) |
| | 0.831 (0.297) | 0.980 (0.918) | 0.020 (0.082) | 0.169 (0.703) | 0.169 (0.703) | 0.035 (0.146) | 0.830 (0.284) | 0.165 (0.046) |
| | 0.824 (0.743) | 0.980 (0.970) | 0.020 (0.030) | 0.176 (0.257) | 0.176 (0.257) | 0.037 (0.053) | 0.822 (0.741) | 0.163 (0.146) |
| | 0.834 (0.799) | 0.981 (0.977) | 0.019 (0.023) | 0.166 (0.201) | 0.166 (0.201) | 0.035 (0.042) | 0.833 (0.798) | 0.166 (0.158) |
| | 0.837 (0.832) | 0.981 (0.981) | 0.019 (0.019) | 0.163 (0.168) | 0.163 (0.168) | 0.032 (0.035) | 0.837 (0.831) | 0.170 (0.165) |
The summary statistics (TPR, TNR, FPR, FNR, FDR, MER, AUC, and pAUC) were calculated based on the top 1944 DEGs estimated by each method (t-test, Wilcoxon, SAM, LIMMA, WAD, RP, FCROS, and Proposed). Values in parentheses are the summary statistics in the presence of one outlying sample across the genome.
Performance evaluation based on spike gene expression profiles using data type 2 for small sample cases n1 = n2 = 3.
| Methods | TPR | TNR | FPR | FNR | FDR | MER | AUC | pAUC |
|---|---|---|---|---|---|---|---|---|
| | 0.6939 (0.2253) | 0.9645 (0.9102) | 0.0355 (0.0898) | 0.3061 (0.7747) | 0.3061 (0.7747) | 0.0636 (0.1610) | 0.6888 (0.2234) | 0.1337 (0.0432) |
| | 0.3405 (0.2238) | 0.9235 (0.9100) | 0.0765 (0.0900) | 0.6595 (0.7762) | 0.6595 (0.7762) | 0.1371 (0.1613) | 0.3278 (0.2178) | 0.0553 (0.0388) |
| | 0.7701 (0.1456) | 0.9733 (0.9009) | 0.0267 (0.0991) | 0.2299 (0.8544) | 0.2299 (0.8544) | 0.0478 (0.1776) | 0.7683 (0.1445) | 0.1522 (0.0281) |
| | 0.7675 (0.2068) | 0.9730 (0.9080) | 0.0270 (0.0920) | 0.2325 (0.7932) | 0.2325 (0.7932) | 0.0483 (0.1649) | 0.7659 (0.2060) | 0.1519 (0.0405) |
| | 0.7639 (0.2850) | 0.9726 (0.9171) | 0.0274 (0.0829) | 0.2361 (0.7150) | 0.2361 (0.7150) | 0.0491 (0.1486) | 0.7627 (0.2715) | 0.1516 (0.0435) |
| | 0.7711 (0.1898) | 0.9735 (0.9060) | 0.0265 (0.0940) | 0.2289 (0.8102) | 0.2289 (0.8102) | 0.0476 (0.1684) | 0.7696 (0.1796) | 0.1527 (0.0277) |
| | 0.7685 (0.3853) | 0.9732 (0.9287) | 0.0268 (0.0713) | 0.2315 (0.6147) | 0.2315 (0.6147) | 0.0481 (0.1278) | 0.7671 (0.3757) | 0.1523 (0.0675) |
| | 0.7716 (0.7582) | 0.9735 (0.9720) | 0.0265 (0.0280) | 0.2284 (0.2418) | 0.2284 (0.2418) | 0.0475 (0.0502) | 0.7702 (0.7565) | 0.1529 (0.1499) |
| | 0.000 (0.000) | 0.000 (0.000) | 0.000 (0.000) | 0.000 (0.000) | 0.000 (0.000) | 0.000 (0.000) | 0.000 (0.000) | 0.000 (0.000) |
Average performance results of eight methods (t-test, Wilcoxon, SAM, LIMMA, WAD, RP, FCROS, and Proposed) based on 100 bootstrap datasets of data type 2 for the small sample case (n1 = n2 = 3). In each bootstrap dataset, 3 expression data points were selected from each condition for each gene of the original dataset. The summary statistics (TPR, TNR, FPR, FNR, FDR, MER, AUC, and pAUC) were calculated based on the top 1944 DEGs estimated by each method. Values in parentheses are the summary statistics in the presence of one outlying sample across the genome.
Figure 4. Comparison of the top selected genes by different methods for the head and neck cancer dataset. Venn diagram of the top 50 genes estimated by (a) t-test, SAM, LIMMA, and the proposed method, or by (b) WAD, RP, FCROS, and the proposed method. (c) Plot of the ordered smallest β-weight for each gene and (d) histogram of β-weights. The smallest β-weight is the minimum of the 22 β-weights for the 22 paired samples of each gene. Outlier genes are shown in red. The gray line indicates the maximum cutoff value, δ = 0.2, for outlying genes.
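The idea of flagging a gene as outlying when its smallest β-weight falls below the cutoff δ = 0.2 can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian-type weight exp(−β(x − μ)²/(2σ²)), the tuning value β = 0.2, and the use of the median and scaled MAD in place of the minimum β-divergence estimates of μ and σ are all assumptions, and the function names and data are hypothetical.

```python
import math
import statistics


def beta_weights(values, beta=0.2):
    """Gaussian-type weights exp(-beta * (x - mu)^2 / (2 * sigma^2)).

    mu and sigma are robust plug-ins (median and scaled MAD), used here as a
    simple stand-in for the minimum beta-divergence estimates.
    """
    mu = statistics.median(values)
    mad = statistics.median([abs(v - mu) for v in values])
    sigma = 1.4826 * mad  # MAD rescaled to be consistent with a normal SD
    if sigma == 0:
        raise ValueError("zero spread: weights are undefined for this sketch")
    return [math.exp(-beta * (v - mu) ** 2 / (2 * sigma ** 2)) for v in values]


def is_outlying(values, beta=0.2, delta=0.2):
    """Flag a gene when its smallest weight falls below the cutoff delta."""
    return min(beta_weights(values, beta)) < delta


# Hypothetical paired log fold changes for two genes.
clean_gene = [0.9, 1.1, 1.0, 1.2, 0.8, 1.05, 0.95, 1.15]
contaminated_gene = [0.9, 1.1, 1.0, 1.2, 0.8, 1.05, 0.95, 6.0]
flags = (is_outlying(clean_gene), is_outlying(contaminated_gene))
```

With these assumed settings, an observation must lie roughly four robust standard deviations from the center before its weight drops below 0.2, so ordinary variation is left unflagged while a gross outlier receives a weight near zero.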
KEGG pathways for the two (2) differentially expressed (DE) genes identified by the proposed method only.
| KEGG ID | Pathway Name | Adjusted p-value |
|---|---|---|
| hsa00591 | Linoleic acid metabolism | 0.004 |
| hsa00983 | Drug metabolism-other enzymes | 0.006 |
| hsa00140 | Steroid hormone biosynthesis | 0.008 |
| hsa00830 | Retinol metabolism | 0.009 |
| hsa00982 | Drug metabolism-cytochrome P450 | 0.009 |
| hsa04976 | Bile secretion | 0.009 |
| hsa00980 | Metabolism of xenobiotics by cytochrome P450 | 0.010 |
| hsa05204 | Chemical carcinogenesis | 0.010 |
| hsa01100 | Metabolic pathways | 0.010 |
Disease association results for the two (2) genes identified by the proposed method only.
| ID | Disease Name | Adjusted p-value |
|---|---|---|
| umls:C0040479 | Torsades de Pointes | 0.0005 |
| umls:C0019196 | Hepatitis C | 0.0016 |
| umls:C0029463 | Osteosarcoma | 0.0041 |
| umls:C1458155 | Mammary Neoplasms | 0.040 |
| umls:C0033578 | Prostatic Neoplasms | 0.051 |
Figure 5. Protein–protein interaction (PPI) network using the 2 genes detected by the proposed method.
Figure 6. Kaplan–Meier plots using the 2 genes (CYP3A4 and NOVA1) detected by the proposed method.
Candidate drugs for the two (2) DE genes identified by the proposed method, obtained from the GLAD4U and DrugBank databases.
| ID | Name of the Drug | Adjusted p-value |
|---|---|---|
| PA163522472 | darunavir | 7.11 × 10⁻⁴ |
| PA165111677 | Cremophor EL | 7.11 × 10⁻⁴ |
| PA165958385 | nilvadipine | 7.11 × 10⁻⁴ |
| PA448333 | alprazolam | 7.11 × 10⁻⁴ |
| PA449591 | felodipine | 7.11 × 10⁻⁴ |
| PA451753 | triazolam | 7.11 × 10⁻⁴ |
| PA164712364 | Androgen, progestogen and estrogen in combination | 8.53 × 10⁻⁴ |
| PA164713223 | Quinine and derivatives | 8.53 × 10⁻⁴ |
| PA164776964 | desloratadine | 8.53 × 10⁻⁴ |
| PA165983955 | acetaminophen glucuronide | 8.53 × 10⁻⁴ |
| DB01211 | Clarithromycin | 2.22 × 10⁻³ |
| DB00199 | Erythromycin | 3.11 × 10⁻³ |
| DB01267 | Paliperidone | 7.55 × 10⁻³ |