Shaoke Lou1,2, Kellie A Cotter3, Tianxiao Li1,2, Jin Liang4, Hussein Mohsen1,2,5, Jason Liu1,2, Jing Zhang1,2, Sandra Cohen6, Jinrui Xu1,2, Haiyuan Yu4,7, Mark A Rubin3,8, Mark Gerstein1,2.
Abstract
There has been much effort to prioritize genomic variants with respect to their impact on "function". However, function is often not precisely defined: sometimes it is the disease association of a variant; on other occasions, it reflects a molecular effect on transcription or epigenetics. Here, we coupled multiple genomic predictors to build GRAM, a GeneRAlized Model, to predict a well-defined experimental target: the expression-modulating effect of a non-coding variant on its associated gene, in a transferable, cell-specific manner. First, we performed feature engineering: using LASSO, a regularized linear model, we found transcription factor (TF) binding to be the most predictive feature, especially for TFs that are hubs in the regulatory network; in contrast, evolutionary conservation, a popular feature in many other variant-impact predictors, contributes almost nothing. Moreover, TF binding inferred from in vitro SELEX is as effective as that from in vivo ChIP-Seq. Second, we implemented GRAM integrating only SELEX features and expression profiles; thus, the program combines a universal regulatory score with an easily obtainable modifier reflecting the particular cell type. We benchmarked GRAM on large-scale MPRA datasets, achieving AUROC scores of 0.72 in GM12878 and 0.66 in a multi-cell line dataset. We then evaluated the performance of GRAM on targeted regions using luciferase assays in the MCF7 and K562 cell lines. We noted that changing the insertion position of the construct relative to the reporter gene gave very different results, highlighting the importance of carefully defining the exact prediction target of the model. Finally, we illustrated the utility of GRAM in fine-mapping causal variants and developed a practical software pipeline to carry this out. In particular, we demonstrated in specific examples how the pipeline could pinpoint variants that directly modulate gene expression within a larger linkage-disequilibrium block associated with a phenotype of interest (e.g., for an eQTL).
Year: 2019 PMID: 31469829 PMCID: PMC6742416 DOI: 10.1371/journal.pgen.1007860
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Fig 1. Overall flow of GRAM.
The model predicts functional effects given the genotype in three steps: the first step predicts a universal regulatory activity using TF binding features; the second step predicts a cell type-specific modifier score using the TF binding score and expression profiles; the final step integrates the results from the previous two steps to predict the expression-modulating effect of the variant.
Fig 2. Preliminary selection of predictive features.
(a) Enrichment of TF binding peaks in emVAR and non-emVAR sets. The x-axis represents the ratio of variants overlapping the TF peaks to all variants in the same set. The TFs are sorted by p-value from the hypergeometric test, in decreasing order. The number in brackets indicates the observed motif break event count. TFs with a sufficient number of observations are highlighted in bold. (b) Motif break scores in reference and alternative alleles for TFs with a sufficient observed event count.
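The enrichment p-value used to sort TFs in panel (a) can be illustrated with a one-sided hypergeometric test. The following is a minimal sketch, not the authors' code: the function name, the argument names, and the choice of an upper-tail test are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(k, n_draws, n_success, n_total):
    """Upper-tail hypergeometric p-value P(X >= k).

    Assumed mapping (illustrative only):
      n_total   -- all tested variants (emVAR + non-emVAR)
      n_success -- variants overlapping a given TF's peaks
      n_draws   -- size of the emVAR set
      k         -- emVAR variants overlapping the TF's peaks
    """
    upper = min(n_success, n_draws)
    # Sum exact integer counts first, then divide once to limit rounding error.
    numerator = sum(
        comb(n_success, i) * comb(n_total - n_success, n_draws - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_draws)


# Toy numbers: 10 variants, 5 overlap the TF peak, 5 are emVARs,
# and all 5 emVARs overlap the peak.
p_value = hypergeom_enrichment_pvalue(k=5, n_draws=5, n_success=5, n_total=10)
```

A small p-value here indicates that the TF's peaks overlap emVARs more often than expected by chance under this toy setup.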
Fig 3. Model-based feature selection.
(a) Importance of the top-ranked features for SELEX- and ChIP-Seq-derived models. The features are sorted according to the mean of LASSO stability selection and Random Forest importance scores. (b) Regulatory network degree of relevant TFs for the top-ranked and bottom-ranked TFs in LASSO stability selection and Random Forest. (c) Comparison of the performance of different feature sets, including cell-line specific ChIP-Seq TF binding scores and SELEX TF binding scores, as well as features defined from previous disease-association prediction tools.
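The ordering described in panel (a), sorting features by the mean of two importance scores, can be sketched as follows. This is an illustration under stated assumptions: the feature names and score values are hypothetical, and both score sets are assumed to be on comparable scales already.

```python
def rank_by_mean_importance(lasso_stability, rf_importance):
    """Return feature names sorted by the mean of two importance scores.

    Both arguments are dicts mapping feature name -> score. Only features
    present in both dicts are ranked (an assumption of this sketch).
    """
    shared = set(lasso_stability) & set(rf_importance)
    mean_score = {
        name: (lasso_stability[name] + rf_importance[name]) / 2.0
        for name in shared
    }
    # Highest mean importance first.
    return sorted(shared, key=lambda name: mean_score[name], reverse=True)


# Hypothetical scores for three placeholder TF features.
lasso_scores = {"TF_A": 0.9, "TF_B": 0.2, "TF_C": 0.5}
rf_scores = {"TF_A": 0.7, "TF_B": 0.4, "TF_C": 0.3}
ranking = rank_by_mean_importance(lasso_scores, rf_scores)
```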
Fig 4. Performance of the GRAM multi-step model.
(a) ROC curve for regulatory activity prediction. (b) The prediction of the cell type modifier score using TF expression profiles. (c) ROC for the model trained with both ChIP-Seq and SELEX DeepBind features on GM12878. (d) LASSO cross validation results with different regularization parameters of the final GRAM generalized model using SELEX features on a multiple cell line dataset.
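The AUROC values summarizing these ROC curves can be computed from ranks alone. Below is a minimal, dependency-free sketch using the Mann-Whitney U relationship; it is not the evaluation code used in the study, and the labels and scores are made-up toy values.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels -- iterable of 0/1 (1 = positive, e.g. an emVAR)
    scores -- iterable of predicted scores, higher = more likely positive
    Tied scores receive the average of their ranks.
    """
    labels = list(labels)
    scores = list(scores)
    n = len(scores)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    order = sorted(range(n), key=lambda idx: scores[idx])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the value is 0.75.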
Fig 5. Experimental validation.
(a) AUROC versus the absolute log2 odds cutoff [0.5, 2.0] in the MCF7 cell line luciferase assay. The x-axis represents the log odds ratio from the luciferase assay. (b) AUROC versus the absolute log2 odds cutoff [0.5, 2.0] in the K562 cell line luciferase assay. (c) Experimental results (as odds ratios) for the luciferase assay in the K562 cell line. The 5' terminal and 3' terminal insertions are compared.
Fig 6. Fine mapping of variants in prostate cancer.
(a) General pipeline of the fine-mapping analysis. The first panel shows the position of the variants in the LD block (chr6:160081543–161382029, tag SNP rs9364554). The second panel shows the FunSeq scores of these variants, where little variation or significance is observed. The third panel shows the average GRAM score across patients, with the three variants with the highest average scores labelled in specific colors. Personalized GRAM scores for these three variants in three selected patients are presented subsequently. (b)–(d) Correlation between the GRAM scores of high-scoring variants and the expression of their corresponding target genes.
Pseudocode of GRAM.
| i: variant id |
| j: TF id |
| V: the total number of variants |
| N: the total number of TFs |
| c: cell type or sample |
| Step 1: a simple universal score for being a regulatory element, using a Random Forest classifier, |
| Step 2: TF binding and gene expression cell type modifier score, |
| Step 3: molecular effect score, |
| Objective function: |
Time complexity for training: The complexity of both Random Forest and LASSO depends on the implementation. In brief, the worst-case training cost of Random Forest is O(MKN² log N) [59], where N is the total number of rows, K is the number of features considered at each split, and M is the number of trees; the time complexity of LASSO is O(K²N), which is almost linear in N when K ≪ N, where N is the total number of rows and K is the number of features [60].
The 2x2 categorical matrix for computation of Vodds.
| Reads | Reference | Alternative |
|---|---|---|
| Assay | n1 | n3 |
| Null-control | n2 | n4 |
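The formula for Vodds is not reproduced in this record, so the following is only a sketch of one common way to turn such a 2×2 read-count table into a log2 odds ratio. The direction of the ratio (alternative versus reference, assay relative to null control), the function name, and the optional pseudocount are all assumptions, not the paper's stated definition.

```python
from math import log2


def log2_odds_ratio(n1, n2, n3, n4, pseudocount=0.0):
    """log2 odds ratio from the 2x2 read-count table.

    Cell layout follows the table above:
      n1 -- assay reads, reference allele
      n2 -- null-control reads, reference allele
      n3 -- assay reads, alternative allele
      n4 -- null-control reads, alternative allele

    Assumed definition (illustrative): (n3 / n1) / (n4 / n2).
    A pseudocount > 0 can be passed to avoid division by zero.
    """
    a1, a2 = n1 + pseudocount, n2 + pseudocount
    a3, a4 = n3 + pseudocount, n4 + pseudocount
    if a1 == 0 or a4 == 0 or a2 == 0 or a3 == 0:
        raise ValueError("zero count; pass a positive pseudocount")
    return log2((a3 * a2) / (a1 * a4))


# Toy counts: the alternative allele has twice the assay reads of the
# reference, while the control reads are balanced.
toy_value = log2_odds_ratio(n1=100, n2=100, n3=200, n4=100)
```

Under this assumed definition, a positive value means the alternative allele is over-represented in the assay relative to the null control.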