Deepak Mav, Ruchir R Shah, Brian E Howard, Scott S Auerbach, Pierre R Bushel, Jennifer B Collins, David L Gerhold, Richard S Judson, Agnes L Karmaus, Elizabeth A Maull, Donna L Mendrick, B Alex Merrick, Nisha S Sipes, Daniel Svoboda, Richard S Paules.
Abstract
Changes in gene expression can help reveal the mechanisms of disease processes and the mode of action for toxicities and adverse effects on cellular responses induced by exposures to chemicals, drugs and environmental agents. The U.S. Tox21 Federal collaboration, which currently quantifies the biological effects of nearly 10,000 chemicals via quantitative high-throughput screening (qHTS) in in vitro model systems, is now making an effort to incorporate gene expression profiling into the existing battery of assays. Whole-transcriptome analysis of large numbers of samples using microarrays or RNA-Seq is currently cost-prohibitive. Accordingly, the Tox21 Program is pursuing a high-throughput transcriptomics (HTT) method that focuses on the targeted detection of gene expression for a carefully selected subset of the transcriptome, which can potentially reduce the cost 10-fold and allow the analysis of larger numbers of samples. To identify the optimal transcriptome subset, genes were sought that are (1) representative of the highly diverse biological space, (2) capable of serving as a proxy for expression changes in unmeasured genes, and (3) sufficient to provide coverage of well-described biological pathways. A hybrid method for gene selection is presented herein that combines data-driven and knowledge-driven concepts into one cohesive method. Our approach is modular, applicable to any species, and facilitates a robust, quantitative evaluation of performance. In particular, we were able to perform gene selection such that the resulting set of "sentinel genes" adequately represents all known canonical pathways from the Molecular Signatures Database (MSigDB v4.0) and can be used to infer expression changes for the remainder of the transcriptome.
The resulting computational model allowed us to choose a purely data-driven subset of 1500 sentinel genes, referred to as the S1500 set, which was then augmented using a knowledge-driven selection of additional genes to create the final S1500+ gene set. Our results indicate that the selected sentinel genes can be used to accurately predict pathway perturbations and biological relationships for samples under study.
Year: 2018 PMID: 29462216 PMCID: PMC5819766 DOI: 10.1371/journal.pone.0191105
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. S1500+ gene selection workflow.
To compile the S1500+ gene set, modular data-driven algorithms were combined with manual, crowd-sourced, knowledge-based gene nominations to optimize for pathway coverage and the ability to extrapolate to the whole transcriptome.
Fig 2. Dimension reduction plot.
The X-axis shows the percentage of the total principal components (eigengenes) and the Y-axis shows the percentage of variability captured. The red line represents the expected relationship given statistically independent gene expression, whereas the blue curve shows the observed relationship.
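The curve described in this caption can be sketched as a cumulative sum over per-component variances. The snippet below is only an illustration: the eigenvalues are made-up numbers, the function names are hypothetical, and the "independent" baseline assumes that statistically independent, equally scaled genes give every component the same variance, so that the expected curve is the diagonal.

```python
def cumulative_variance_curve(eigenvalues):
    """Return (percent of components, percent of variance captured) pairs.

    Components are ordered from largest to smallest variance, as in PCA.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    curve = []
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        curve.append((100.0 * i / n, 100.0 * running / total))
    return curve


def independent_baseline(n_components):
    """Assumed baseline: equal variance per component gives a diagonal line."""
    return cumulative_variance_curve([1.0] * n_components)


# Hypothetical eigenvalues, not values from the paper.
eigenvalues = [8.0, 4.0, 2.0, 1.0, 1.0]
observed = cumulative_variance_curve(eigenvalues)
expected = independent_baseline(len(eigenvalues))

for (pct_pc, pct_obs), (_, pct_exp) in zip(observed, expected):
    print(f"{pct_pc:5.1f}% of PCs: observed {pct_obs:6.2f}%, independent {pct_exp:6.2f}%")
```

The further the observed curve rises above the diagonal, the more the expression data are dominated by a small number of eigengenes.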
Fig 3. Clustered experiments.
K-means clustering (k = 10) was used to cluster experiment data using the first 20 principal components. Fold change values shown are for the top 20 eigengenes. The columns denote principal component indices and percentage of captured variability in parentheses.
Pathway coverage.
| Pathways Covered (total 1320) | Mean Coverage | Median Coverage | Mean Multiplicity |
|---|---|---|---|
| 659 | 0.12 | 0.08 | 7.73 |
| 1320 | 0.26 | 0.25 | 10.38 |
| 1320 | 0.43 | 0.43 | 11.44 |
| 541 (443, 695) | 0.07 (0.05, 0.09) | 0.07 (0.05, 0.09) | 5.88 (5.60, 6.19) |
| 852 (759, 946) | 0.13 (0.11, 0.15) | 0.12 (0.11, 0.14) | 5.87 (5.66, 6.18) |
| 906 | 0.17 | 0.16 | 12.97 |
Pathways covered are calculated relative to the 1320 canonical pathways (MSigDB gene sets) in MSigDB version 4.0. Pathway-level coverage is defined as the fraction of genes in a pathway that overlap the selected gene set. Mean and median coverage values are derived from the pathway-level coverage of these 1320 canonical pathways. The gene-level multiplicity metric is the number of pathways a gene belongs to. Mean multiplicity is computed from the gene-level multiplicity across all selected genes. For the “Random 1500” and “Random 2739” gene sets, values represent the mean and, in parentheses, the range (min, max) across 20 alternative randomizations.
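The coverage and multiplicity definitions above can be illustrated with a small sketch. The pathway names, gene symbols, and function names are hypothetical. Two details are assumptions rather than statements from the source: a pathway is counted as "covered" when at least one of its genes is selected, and selected genes that belong to no pathway contribute a multiplicity of zero to the mean.

```python
from statistics import mean, median


def pathway_coverage(pathways, selected):
    """Fraction of each pathway's genes that overlap the selected gene set."""
    selected = set(selected)
    return {
        name: len(set(genes) & selected) / len(set(genes))
        for name, genes in pathways.items()
        if genes
    }


def gene_multiplicity(pathways, selected):
    """Number of pathways each selected gene is part of."""
    counts = {gene: 0 for gene in selected}
    for genes in pathways.values():
        for gene in set(genes):
            if gene in counts:
                counts[gene] += 1
    return counts


def summarize(pathways, selected):
    coverage = pathway_coverage(pathways, selected)
    multiplicity = gene_multiplicity(pathways, selected)
    return {
        # Assumption: "covered" means at least one pathway gene is selected.
        "pathways_covered": sum(1 for c in coverage.values() if c > 0),
        "mean_coverage": mean(coverage.values()),
        "median_coverage": median(coverage.values()),
        "mean_multiplicity": mean(multiplicity.values()),
    }


# Toy data, not taken from MSigDB.
toy_pathways = {
    "P1": ["A", "B", "C", "D"],
    "P2": ["B", "E"],
    "P3": ["F", "G"],
}
toy_selected = ["A", "B", "H"]
summary = summarize(toy_pathways, toy_selected)
print(summary)
```

A gene set with high mean multiplicity concentrates on genes shared by many pathways, which is one way a fixed-size set can cover more pathways than a random set of the same size.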
Summary of gene and pathway level extrapolation performance of S1500 gene set on cross-validated training set.
| Pearson Correlation | Concordance Rate | Significance Overlap | Mean Squared Error |
|---|---|---|---|
| 0.79 (0.64, 0.99) | 0.94 (0.91, 1.00) | 0.34 (0.27, 0.50) | 0.22 (0.12, 0.32) |
| 0.79 (0.65, 0.99) | 0.93 (0.91, 1.00) | 0.33 (0.26, 0.51) | 0.23 (0.12, 0.31) |
| 0.79 (0.44, 0.91) | 0.85 (0.55, 0.93) | 0.51 (0.43, 0.72) | 0.10 (0.02, 0.62) |
| 0.75 (0.51, 0.89) | 0.82 (0.55, 0.92) | 0.41 (0.34, 0.68) | 0.12 (0.03, 0.56) |
Values represent the mean and range (min, max) across 20-fold cross-validation. Gene-level analyses were conducted using fold change. Pathway-level analyses were conducted on GSEA scores.
a Pearson correlations reflect agreement between extrapolated and measured values.
b Concordance rates reflect the agreement between the extrapolated and the measured data, calculated as (TP + TN)/(TP + TN + FP + FN).
c Significance overlap is the proportion of genes/pathways having values (i.e., fold change or GSEA scores) in the top 1% in both the measured and extrapolated datasets.
d Mean squared error measures the average squared difference between the extrapolated and measured values.
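The four metrics in footnotes a to d can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the positive call used for the concordance rate (absolute value at or above a threshold) and the ranking by absolute value for the significance overlap are guesses at details the footnotes leave open, and the toy numbers are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def concordance_rate(measured, extrapolated, threshold):
    """(TP + TN) / (TP + TN + FP + FN).

    Assumption: a value is a positive call when |value| >= threshold.
    """
    agree = sum(
        (abs(m) >= threshold) == (abs(e) >= threshold)
        for m, e in zip(measured, extrapolated)
    )
    return agree / len(measured)


def significance_overlap(measured, extrapolated, top_fraction=0.01):
    """Share of the top-ranked items that are top-ranked in both datasets.

    Assumption: items are ranked by absolute value.
    """
    k = max(1, round(len(measured) * top_fraction))

    def top(values):
        order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
        return set(order[:k])

    return len(top(measured) & top(extrapolated)) / k


def mean_squared_error(measured, extrapolated):
    """Average squared difference between extrapolated and measured values."""
    return sum((m - e) ** 2 for m, e in zip(measured, extrapolated)) / len(measured)


# Invented log fold changes for five genes.
measured = [2.0, -1.5, 0.1, 0.0, 1.0]
extrapolated = [1.8, -1.0, 0.3, -0.2, 0.2]

r = pearson(measured, extrapolated)
conc = concordance_rate(measured, extrapolated, threshold=1.0)
# top_fraction is raised to 0.4 only because the toy list has five items.
overlap = significance_overlap(measured, extrapolated, top_fraction=0.4)
mse = mean_squared_error(measured, extrapolated)
print(r, conc, overlap, mse)
```

Reporting all four together is useful because they can disagree: a high concordance rate can coexist with a modest significance overlap when most items are unchanged in both datasets.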
Summary of gene and pathway level extrapolation performance of the S1500 and S1500+ gene sets using independent test set.
| Pearson Correlation | Concordance Rate | Significance Overlap | Mean Squared Error |
|---|---|---|---|
| 0.72 | 0.93 | 0.33 | 0.22 |
| 0.72 (0.72, 0.73) | 0.93 (0.93, 0.93) | 0.34 (0.33, 0.34) | 0.24 (0.24, 0.25) |
| 0.75 | 0.94 | 0.37 | 0.20 |
| 0.76 (0.75, 0.76) | 0.93 (0.93, 0.93) | 0.38 (0.37, 0.38) | 0.22 (0.22, 0.22) |
| 0.81 | 0.87 | 0.52 | 0.07 |
| 0.74 (0.73, 0.75) | 0.84 (0.84, 0.84) | 0.39 (0.37, 0.40) | 0.10 (0.09, 0.10) |
| 0.87 | 0.90 | 0.60 | 0.05 |
| 0.78 (0.77, 0.79) | 0.86 (0.86, 0.86) | 0.44 (0.42, 0.46) | 0.08 (0.08, 0.08) |
For the evaluation of random gene lists, a random set of genes was selected 20 times and the results were averaged. The values presented for random gene lists are the mean followed by the minimum and maximum values in parentheses.
Fig 4. Pathway performance analysis for Follicular Lymphoma vs. Tonsillectomy case study comparison (concordance Venn diagrams).
All significantly enriched pathways, identified using an enrichment score > 0.5 and a Kolmogorov-Smirnov p-value < 0.001, were included in this analysis. Recall is the percentage of the observed up-/down-regulated genes (Obs-Up and Obs-Down) that were also correctly predicted as up-/down-regulated (Pred-Up/Down). Precision is the percentage of the predicted up- and down-regulated genes that were observed as up- and down-regulated.
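The recall and precision definitions in this caption reduce to set overlaps, as in the minimal sketch below. The identifiers and the function name are hypothetical, and the same calculation would be run separately for the up-regulated and down-regulated sets.

```python
def precision_recall(observed, predicted):
    """Return (precision, recall) as percentages for two sets of identifiers.

    Precision: share of predicted items that were also observed.
    Recall: share of observed items that were also predicted.
    """
    observed, predicted = set(observed), set(predicted)
    true_positives = len(observed & predicted)
    precision = 100.0 * true_positives / len(predicted) if predicted else 0.0
    recall = 100.0 * true_positives / len(observed) if observed else 0.0
    return precision, recall


# Invented identifiers for illustration only.
obs_up = {"P1", "P2", "P3", "P4"}
pred_up = {"P2", "P3", "P5"}
precision_up, recall_up = precision_recall(obs_up, pred_up)
print(f"precision {precision_up:.1f}%, recall {recall_up:.1f}%")
```

In Venn-diagram terms, the intersection is the numerator of both quantities; only the denominator changes between the predicted circle and the observed circle.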