Madhuchhanda Bhattacharjee, Mikko J Sillanpää.
Abstract
Both molecular marker and gene expression data were considered, alone as well as jointly, as additive predictors for two pathogen-activity phenotypes in real recombinant inbred lines of soybean. For prediction of unobserved phenotypes, we used Bayesian hierarchical regression modeling, in which the number of possible predictors in the model was controlled by the different selection strategies tested. Our initial findings were submitted to DREAM5 (the 5th Dialogue on Reverse Engineering Assessment and Methods challenge) and were judged to be the best in sub-challenge B3, in which both functional genomic and genetic data were used to predict the phenotypes. In this work we improve upon that submission by considering various predictor selection strategies, using cross-validation to measure the accuracy of in-data and out-data predictions. The results from the various model choices indicate that, for these data, using both data types (functional genomic and genetic) simultaneously improves out-data prediction accuracy. Adequate goodness-of-fit can be easily achieved with more complex models for both phenotypes, since the number of potential predictors is large and the sample size is not small. We also studied gene-set enrichment (for continuous phenotype) in the biological process in question and chromosomal enrichment of the gene set. The methodological contribution of this paper is the exploration of variable selection techniques to alleviate the problem of over-fitting. Different strategies based on the nature of the covariates were explored, and all methods were implemented under the Bayesian hierarchical modeling framework with indicator-based covariate selection. All models based on a careful variable selection procedure were found to produce significant results in a permutation test.
Year: 2011 PMID: 22087238 PMCID: PMC3210128 DOI: 10.1371/journal.pone.0026959
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Normal Q–Q plots and box plots for the given sets of data.
For each Q–Q plot, the observed values are on the X-axis and the expected normal values are on the Y-axis.
Figure 2. Percentile distributions of the original data and of the 5 folds created for k-fold (k = 5) cross-validation.
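The fold construction behind a 5-fold cross-validation can be sketched as follows. This is a minimal illustration, assuming a simple random partition; the function names `make_folds` and `learning_and_test_sets` and the random seed are hypothetical, and the sample count of 260 is taken from the table footnotes rather than from this caption.

```python
import numpy as np

def make_folds(n_samples, k=5, seed=0):
    """Randomly partition the indices 0..n_samples-1 into k near-equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    return np.array_split(shuffled, k)

def learning_and_test_sets(folds):
    """Yield (learning_indices, test_indices), holding out one fold at a time."""
    for i, test_idx in enumerate(folds):
        learn_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield learn_idx, test_idx

# Assumed sample size: 260 plants, giving 5 folds of 52 plants each.
folds = make_folds(260, k=5)
```

Comparing `np.percentile` of the phenotype within each fold against the full data would give the kind of check the figure describes.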
Prediction correlations for the different variable selection processes and model types, presented by data type (computed separately for each phenotype).
| Cross validation subset | Variable selection | Further data reduction | Model | In-data/goodness of fit: Phenotype 1 | In-data/goodness of fit: Phenotype 2 | Out-data/out-of-sample prediction: Phenotype 1 | Out-data/out-of-sample prediction: Phenotype 2 |
|---|---|---|---|---|---|---|---|
| SFP-data | | | | | | | |
| Split-sample | None | Shrinkage | Indicator model | | | −0.05 (−0.12) | 0.08 (0.01) |
| Best performance in DREAM5 | | | | NA | NA | NA | NA |
| | T-test based | None | Non-indicator model | 0.44 (0.34) | 0.49 (0.47) | 0.33 (0.25) | |
| | | Vague prior | Indicator model | 0.44 (0.34) | 0.59 (0.56) | | 0.25 (0.25) |
| Expression-data | | | | | | | |
| Split-sample | Correlation based | Shrinkage | Indicator model | 0.79 (0.77) | 0.88 (0.86) | 0.22 (0.26) | −0.06 (0.03) |
| Best performance in DREAM5 | | | | NA | NA | (0.31) | (0.26) |
| | Correlation based | Supervised PCA | Non-indicator model | | | 0.36 (0.32) | 0.37 (0.32) |
| | | Common subset selection | Non-indicator model | 0.48 (0.42) | 0.71 (0.66) | | 0.47 (0.38) |
| | | Common subset selection | Indicator model | 0.46 (0.40) | 0.69 (0.62) | 0.39 (0.33) | |
| SFP and Expression-data | | | | | | | |
| Split-sample | Genes: Correlation based | Shrinkage | Indicator model | | | 0.19 (0.31) | 0.18 (0.24) |
| Best performance in DREAM5 | | | | NA | NA | (0.31) | (0.24) |
| | SFP: T-test based | Common subset selection | Non-indicator model | 0.63 (0.57) | 0.87 (0.84) | | |
| | | Common subset selection | Indicator model | 0.61 (0.56) | 0.77 (0.73) | 0.48 (0.44) | 0.47 (0.42) |
Pearson correlation is presented first, followed by Spearman correlation in brackets.
SFPs are ranked by their (absolute) marginal t-statistics, and entries are selected from the top.
Genes are ranked by the (absolute) correlation between expression and phenotype. The top 10% were selected. Expression information on 260 plants was used for this purpose.
A shrinkage-parameter-based (a priori independent) prior distribution was used for the inclusion-indicator variables in the model, with shrinkage of 0.1 for SFPs and 0.01 for gene expression data.
A vague Uniform(0,1) prior distribution was used for the inclusion-indicator variable of each individual SFP/gene in the model.
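The two prior choices for the inclusion indicators can be written out roughly as below. This is a sketch under assumptions: the notation ($y_i$, $x_{ij}$, $\beta_j$, $\gamma_j$, $\pi_j$) is illustrative and not taken from the source, and reading the "shrinkage" value as a fixed prior inclusion probability is one plausible interpretation, not a confirmed detail of the model.

```latex
% Indicator-based additive regression (illustrative notation)
y_i = \alpha + \sum_{j=1}^{p} \gamma_j \, \beta_j \, x_{ij} + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2),
\qquad \gamma_j \in \{0, 1\}

% Inclusion indicators, a priori independent
\gamma_j \mid \pi_j \sim \mathrm{Bernoulli}(\pi_j)

% Shrinkage prior (assumed reading): fixed inclusion probability
\pi_j = 0.1 \ \text{(SFPs)}, \qquad \pi_j = 0.01 \ \text{(gene expression)}

% Vague prior (assumed reading): inclusion probability itself uniform
\pi_j \sim \mathrm{Uniform}(0, 1)
```

Under this reading, a covariate contributes to the prediction only when $\gamma_j = 1$, so a smaller $\pi_j$ favors sparser models a priori.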
Top components from a PCA of the gene expression data, involving only those genes first selected on the basis of phenotype/expression correlation.
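A supervised PCA of this kind might look like the sketch below: genes are first screened by their marginal correlation with the phenotype, and principal components are then computed from the retained genes only. The function name `supervised_pca`, the default of three components, and the use of an SVD are assumptions made for illustration; the source does not state how many components were kept.

```python
import numpy as np

def supervised_pca(X, y, top_fraction=0.10, n_components=3):
    """X: (n_samples, n_genes) expression matrix; y: (n_samples,) phenotype.

    Returns the indices of the retained genes and the component scores.
    Assumes no gene is constant across samples (nonzero variance).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Marginal Pearson correlation of each gene with the phenotype.
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = (Xc * yc[:, None]).sum(axis=0) / denom
    # Keep the top fraction of genes by absolute correlation.
    n_keep = max(1, int(round(top_fraction * X.shape[1])))
    keep = np.argsort(-np.abs(corr))[:n_keep]
    # PCA via SVD of the centered, screened expression matrix.
    U, S, _ = np.linalg.svd(Xc[:, keep], full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    return keep, scores
```

The returned `scores` would then serve as covariates in place of the individual gene expression values.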
Correlations of expression with phenotype were computed for each gene based on (a) all 260 plants in the data and (b) each of the 5 learning sets created by 5-fold cross-validation. Genes common to these 6 sets with the highest correlations were identified, and top subsets were used for analysis.
The results presented are the best obtained using a top subset, for each phenotype separately. The cumulative top sets were created and explored for prediction with up to 50 SFPs and/or 100 gene expression measurements.
Predictive results were obtained using the same top subset that produced the best results with the non-indicator model; however, since the model is indicator based, the effective number of covariates is smaller than in the non-indicator model.
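The marginal ranking and common-subset steps described in the footnotes above can be sketched as follows. Several details are assumptions made for illustration: SFPs are taken to be coded 0/1, the t-statistic is the Welch two-sample form, "common" is read as the intersection of the top sets from the full data and each learning set, and all function names are hypothetical.

```python
import numpy as np

def abs_t_statistics(G, y):
    """|t| per marker: Welch two-sample t-test of y between the two
    genotype classes of each column of G (assumed coded 0/1)."""
    t = np.zeros(G.shape[1])
    for j in range(G.shape[1]):
        a, b = y[G[:, j] == 0], y[G[:, j] == 1]
        if len(a) < 2 or len(b) < 2:
            continue  # too few plants in one class; leave |t| at 0
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        t[j] = abs(a.mean() - b.mean()) / se if se > 0 else 0.0
    return t

def abs_correlations(X, y):
    """|Pearson r| between each column of X and y (non-constant columns assumed)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs((Xc * yc[:, None]).sum(axis=0) / denom)

def top_indices(scores, n_top):
    """Indices of the n_top largest scores, best first."""
    return np.argsort(-scores)[:n_top]

def common_top_genes(X, y, learning_sets, top_fraction=0.10):
    """Genes in the top fraction on the full data AND on every learning set,
    ordered by their full-data correlation."""
    n_top = max(1, int(round(top_fraction * X.shape[1])))
    index_sets = [np.arange(X.shape[0])] + list(learning_sets)
    common = None
    for idx in index_sets:
        top = set(top_indices(abs_correlations(X[idx], y[idx]), n_top).tolist())
        common = top if common is None else common & top
    full = abs_correlations(X, y)
    return sorted(common, key=lambda j: -full[j])
```

Cumulative top subsets for model fitting would then be prefixes of these rankings, for example the first 10, 20, ... entries.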
Figure 3. Scatter plots of SFP-specific t-statistics and phenotype/expression correlations of the probes common to the SFP data and the gene expression data.
Figure 4. Correlations calculated between observed and predicted phenotypes with varying numbers of covariates in the model.
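The accuracy measures used throughout (Pearson and Spearman correlation between observed and predicted phenotypes) and a permutation check of their significance can be sketched with NumPy alone. The one-sided permutation scheme, the number of permutations, and the function names are assumptions for illustration; the source does not describe how its permutation test was set up.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def ranks(x):
    """1-based ranks; tied values share their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        mask = x == v
        r[mask] = r[mask].mean()
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_p_value(observed, predicted, n_perm=2000, seed=0):
    """One-sided p-value: how often a random relabelling of the observed
    phenotypes correlates with the predictions at least as strongly."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    stat = pearson(observed, predicted)
    null = np.array([pearson(rng.permutation(observed), predicted)
                     for _ in range(n_perm)])
    return float((1 + np.sum(null >= stat)) / (n_perm + 1))
```

Evaluating `pearson` and `spearman` on held-out plants for each candidate number of covariates would trace out curves of the kind this figure summarizes.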
Figure 5. Deviance (on the vertical axis) with varying numbers of SFPs (on the X-axis) and expression data (on the Y-axis) in the model.
Figure 6. Comparison of variable selection measures.
On the X-axis, the (absolute) t-statistic/correlation (i.e. marginal estimates) is presented; on the Y-axis, the estimated weighted genetic variation (for SFPs) or weighted coefficients (for genes) from the joint distribution based on vague priors (i.e. joint estimates) are presented.
Figure 7. GO biological process enrichment estimated using the indicator model (with 100 SFPs and/or 200 gene expressions).
Figure 8. Chromosomal enrichment estimated using the indicator model (with 100 SFPs and/or 200 gene expressions).