Bin Peng, Dianwen Zhu, Bradley P Ander, Xiaoshuai Zhang, Fuzhong Xue, Frank R Sharp, Xiaowei Yang.
Abstract
The discovery of genetic or genomic markers plays a central role in the development of personalized medicine. A notable challenge is the high dimensionality of the data sets, as thousands of genes or millions of genetic variants are collected on a relatively small number of subjects. Traditional gene-wise selection methods based on univariate analyses have difficulty incorporating correlational, structural, or functional relationships among the molecular measures. For microarray gene expression data, we first summarize solutions for dealing with 'large p, small n' problems, and then propose an integrative Bayesian variable selection (iBVS) framework for simultaneously identifying causal or marker genes and regulatory pathways. A novel partial least squares (PLS) g-prior for iBVS is developed to allow the incorporation of prior knowledge on gene-gene interactions or functional relationships. From the point of view of systems biology, iBVS enables users to directly target the joint effects of multiple genes and pathways in a hierarchical modeling diagram to predict disease status or phenotype. The estimated posterior selection probabilities offer probabilistic and biological interpretations. Both simulated data and a set of microarray data for predicting stroke status are used to validate the performance of iBVS in a Probit model with binary outcomes. iBVS offers a general framework for effective discovery of various molecular biomarkers by combining data-based statistics and knowledge-based priors. Guidelines on making posterior inferences, determining Bayesian significance levels, and improving computational efficiency are also discussed.
Year: 2013 PMID: 23844055 PMCID: PMC3700986 DOI: 10.1371/journal.pone.0067672
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. An Example of KEGG Pathway.
Figure 2. Gene and Pathway selection results in Scenario 1.
The top figure corresponds to the posterior distribution of gene with effect size , and second figure . The two smaller figures on the bottom demonstrate the posterior pathway selection probabilities, with the left one corresponding to , and right one . The labeled red lines indicate causal genes or causal pathways (those containing causal genes). These distributions were obtained by averaging over the 100 simulated sets of data.
Figure 3. Gene and Pathway selection results in Scenario 2.
The top figure corresponds to the posterior probabilities of gene selection with effect size , and second figure . The two smaller figures on the bottom demonstrate the posterior probabilities of pathway selection, with the left one corresponding to , and the right one . The red lines indicate causal genes or causal pathways (those containing causal genes). These distributions were obtained by averaging over the 100 simulated sets of data.
Figure 4. Posterior Gene Selection Probabilities when P = 2000.
The top figure shows the result for Scenario 3, and the bottom one Scenario 4.
Figure 5. Mean Square Error for Gene Selections.
Averaged over 100 simulated data sets in Scenario 1 for two sets of gene effect sizes . The top one is for and bottom one .
Figure 6. ROC Curves for iBVS and YS-BVS (Yang & Song's BVS).
Figure 7. Gene and Pathway Selection Results for Stroke Data.
Top 30 genes selected using BVS on Stroke Data.
| No | BVS.ID | Post.Prob. | Probe.Set.ID | Gene.Symbol | Gene.Title |
| 1 | 196 | 0.951 | 206177_s_at | ARG1 | arginase, liver |
| 2 | 61 | 0.26 | 202635_s_at | POLR2K | polymerase (RNA) II (DNA directed) polypeptide K, 7.0kDa |
| 3 | 356 | 0.184 | 205067_at | IL1B | interleukin 1, beta |
| 4 | 486 | 0.15 | 1552912_a_at | IL23R | interleukin 23 receptor |
| 5 | 634 | 0.126 | 235086_at | THBS1 | thrombospondin 1 |
| 6 | 514 | 0.125 | 207445_s_at | CCR9 | chemokine (C-C motif) receptor 9 |
| 7 | 576 | 0.114 | 207113_s_at | TNF | tumor necrosis factor |
| 8 | 103 | 0.096 | 203939_at | NT5E | 5'-nucleotidase, ecto (CD73) |
| 9 | 541 | 0.091 | 206126_at | CXCR5 | chemokine (C-X-C motif) receptor 5 |
| 10 | 95 | 0.087 | 219308_s_at | AK5 | adenylate kinase 5 |
| 11 | 559 | 0.085 | 214146_s_at | PPBP | pro-platelet basic protein (chemokine (C-X-C motif) ligand 7) |
| 12 | 524 | 0.082 | 210549_s_at | CCL23 | chemokine (C-C motif) ligand 23 |
| 13 | 339 | 0.076 | 205291_at | IL2RB | interleukin 2 receptor, beta |
| 14 | 530 | 0.074 | 216598_s_at | CCL2 | chemokine (C-C motif) ligand 2 |
| 15 | 472 | 0.071 | 205445_at | PRL | prolactin |
| 16 | 343 | 0.069 | 207072_at | IL18RAP | interleukin 18 receptor accessory protein |
| 17 | 26 | 0.067 | 223359_s_at | PDE7A | phosphodiesterase 7A |
| 18 | 397 | 0.066 | 211333_s_at | FASLG | Fas ligand (TNF superfamily, member 6) |
| 19 | 1098 | 0.059 | 52255_s_at | COL5A3 | collagen, type V, alpha 3 |
| 20 | 394 | 0.058 | 241819_at | TNFSF8 | tumor necrosis factor (ligand) superfamily, member 8 |
| 21 | 89 | 0.056 | 212739_s_at | NME4 | non-metastatic cells 4, protein expressed in |
| 22 | 158 | 0.056 | 203302_at | DCK | deoxycytidine kinase |
| 23 | 334 | 0.055 | 205327_s_at | ACVR2A | activin A receptor, type IIA |
| 24 | 448 | 0.054 | 210755_at | HGF | hepatocyte growth factor (hepapoietin A; scatter factor) |
| 25 | 119 | 0.054 | 205757_at | ENTPD5 | ectonucleoside triphosphate diphosphohydrolase 5 |
| 26 | 346 | 0.053 | 205403_at | IL1R2 | interleukin 1 receptor, type II |
| 27 | 344 | 0.053 | 206618_at | IL18R1 | interleukin 18 receptor 1 |
| 28 | 1107 | 0.053 | 204614_at | SERPINB2 | serpin peptidase inhibitor, clade B (ovalbumin), member 2 |
| 29 | 560 | 0.052 | 215101_s_at | CXCL5 | chemokine (C-X-C motif) ligand 5 |
| 30 | 80 | 0.051 | 1553587_a_at | POLE4 | polymerase (DNA-directed), epsilon 4 (p12 subunit) |
We list the detailed information on the top 30 genes. BVS.ID refers to the variables in the model: e.g. 196 refers to in our model. Post.Prob. is the posterior probability of the particular variable.
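As a rough illustration of how a Post.Prob. column like the one above is commonly produced in Bayesian variable selection, the sketch below estimates each variable's posterior selection probability as the fraction of MCMC draws in which its inclusion indicator equals 1, and then ranks the variables. The record does not state the estimator the authors used, so this is an assumption; the function names and the three-variable toy draws are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def posterior_selection_probs(draws: List[List[int]]) -> List[float]:
    """Fraction of draws in which each variable's inclusion indicator is 1."""
    if not draws:
        raise ValueError("need at least one draw")
    n_vars = len(draws[0])
    totals = [0] * n_vars
    for draw in draws:
        if len(draw) != n_vars:
            raise ValueError("all draws must have the same length")
        for j, indicator in enumerate(draw):
            totals[j] += indicator
    return [t / len(draws) for t in totals]


def top_k(probs: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (1-based variable ID, probability) pairs, highest probability first."""
    ranked = sorted(enumerate(probs, start=1), key=lambda p: (-p[1], p[0]))
    return ranked[:k]


# Hypothetical toy draws: 4 MCMC iterations over 3 candidate variables.
toy_draws = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
probs = posterior_selection_probs(toy_draws)
ranking = top_k(probs, 2)
print(ranking)
```

In a real analysis the draws would come from the sampler after burn-in, and the number of iterations would be far larger than in this toy input.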
Top Pathways Selected via BVS.
| No | KEGG.ID | Name | Top.genes.extracted | Total # of genes |
| 1 | Hsa05214 | Glioma - Homo sapiens (human) | BVS.ID356 (IL1B), BVS.ID486 (IL23R) | 253 |
| 2 | Hsa04060 | Cytokine-cytokine receptor interaction- Homo sapiens (human) | BVS.ID61 (POLR2K) | 160 |
| 3 | Hsa05222 | Small cell lung cancer - Homo sapiens (human) | BVS.ID196 (ARG1) | 106 |
| 4 | Hsa04623 | Cytosolic DNA-sensing pathway- Homo sapiens (human) | BVS.ID196 (ARG1) | 55 |
| 5 | Hsa04640 | Hematopoietic cell lineage - Homo sapiens (human) | | 107 |
We list the 5 pathways that have the highest posterior probabilities. Top.genes.extracted refers to the gene with the highest posterior probability within a pathway, and Total # of genes refers to the total number of genes within a pathway.
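The lookup described in this note can be sketched as a small cross-reference between the two tables: for each pathway, pick the listed gene with the highest posterior probability. The probabilities and BVS IDs below are copied from the tables above; `pathway_genes` holds only the genes named in the Top.genes.extracted column, not the full pathway membership, and the variable and function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# BVS.ID -> (Gene.Symbol, Post.Prob.), copied from the top-30 gene table.
gene_post_prob: Dict[int, Tuple[str, float]] = {
    196: ("ARG1", 0.951),
    61: ("POLR2K", 0.26),
    356: ("IL1B", 0.184),
    486: ("IL23R", 0.15),
}

# KEGG.ID -> BVS.IDs listed for that pathway in the pathway table
# (partial membership only; the full gene lists are not given here).
pathway_genes: Dict[str, List[int]] = {
    "Hsa05214": [356, 486],
    "Hsa04060": [61],
    "Hsa05222": [196],
    "Hsa04623": [196],
}


def top_gene(bvs_ids: List[int]) -> Tuple[int, str, float]:
    """Return (BVS.ID, symbol, probability) of the highest-probability gene."""
    best = max(bvs_ids, key=lambda i: gene_post_prob[i][1])
    symbol, prob = gene_post_prob[best]
    return best, symbol, prob


summary = {kegg: top_gene(ids) for kegg, ids in pathway_genes.items()}
for kegg, (bvs_id, symbol, prob) in summary.items():
    print(f"{kegg}: BVS.ID{bvs_id} ({symbol}), Post.Prob. = {prob}")
```

Pathway Hsa04640 is left out because the table lists no extracted gene for it.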