Surag Nair1, Daniel S Kim2, Jacob Perricone3, Anshul Kundaje1,4.
Abstract
MOTIVATION: Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types.
Year: 2019 PMID: 31510655 PMCID: PMC6612838 DOI: 10.1093/bioinformatics/btz352
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Improved training methods and new architecture design enhance model performance. (A) Model architecture for the ResNet model. The RNA-seq inputs and mean accessibility (if used) are concatenated after the convolutional layers. (B) The validation set loss over training steps for a model (Basset architecture for the sequence model) with and without two-stage learning (without mean accessibility as an input feature). In two-stage learning, the weights of the convolutional layers of the model are initialized from a model first trained to map sequence to chromatin accessibility for all training cell types. (C) The test set AUPRC of the original Basset model, Factorized model and ResNet model under four training paradigms: with and without mean accessibility as an input feature, and with (Tune) and without (Freeze) fine-tuning the convolutional layers in the second stage. Numbers are reported on a fixed training, validation and test split with 103 training, 10 validation and 10 test cell types. Models using mean accessibility as an input feature significantly outperform models without mean accessibility. (D) Five-fold cross-validation performance of the ResNet model compared to the Factorized model with and without mean locus accessibility as an input to the model. Each fold contains a split over the 123 cell types in the dataset. All models were trained using the two-stage scheme with all weights tunable in the second stage. A Wilcoxon signed-rank test (single-tailed) was performed with n = 5; n.s., not significant; *P < 0.05. (E) Binned AUPRC of the Factorized model without mean accessibility, the ResNet model without mean accessibility and the ResNet model with mean accessibility. Loci are binned by the fraction of training cell types in which they are accessible, and AUPRC is computed for predictions on test cell types for each bin. Note that AUPRC is computed for the minority class: when the fraction of accessible cell types is >0.5, AUPRC is computed on non-accessible regions. Gray bars indicate the fraction of loci having a certain fraction of accessible cell types. Numbers are reported on the same training, validation and test split as for (C).
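The binned, minority-class AUPRC described for panel (E) can be illustrated with a minimal NumPy sketch. It is not the authors' code. The function names, the number of bins, the use of average precision as the AUPRC estimate, and the rule of flipping labels in bins whose midpoint lies above 0.5 are all assumptions made for illustration. Inputs are taken to be one value per locus for a single test cell type.

```python
import numpy as np


def average_precision(y_true, scores):
    """Estimate AUPRC as average precision: mean precision at each true positive."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(y_true.sum())
    if n_pos == 0:
        return float("nan")  # undefined when the bin has no positives
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = y_true[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision[hits].sum() / n_pos)


def binned_minority_auprc(train_frac_accessible, y_test, y_score, n_bins=10):
    """AUPRC per bin of loci, binned by fraction of accessible training cell types.

    Assumption: in bins whose midpoint is above 0.5, labels and scores are
    flipped so that the non-accessible (minority) class is scored as positive.
    """
    frac = np.asarray(train_frac_accessible, dtype=float)
    y_test = np.asarray(y_test, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin ids 0..n_bins-1; a fraction of 1.0 lands in the last bin.
    bin_ids = np.digitize(frac, edges[1:-1])
    auprc = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        labels, scores = y_test[mask], y_score[mask]
        if (edges[b] + edges[b + 1]) / 2 > 0.5:
            labels, scores = ~labels, 1.0 - scores
        auprc[b] = average_precision(labels, scores)
    return edges, auprc
```

Flipping the labels in mostly accessible bins keeps the metric focused on the rarer outcome, so a model cannot score well simply by predicting the majority state of a locus.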
Fig. 2. Cell-type-specific TF motifs distilled from the ResNet model. (A) Gradient × input contribution scores of each nucleotide (columns) in an example genomic sequence (chr8: 128929715–128931715) across different cellular contexts (tasks shown as rows). The nucleotide-resolution contribution scores obtained for the same genomic sequence can differ between cell types, reflecting differential chromatin accessibility and differences in regulation of the sequence, as shown in this example locus. (B) Summary of motifs learned by the model for individual cell types. The TF-MoDISco method is used to distill consolidated motifs learned by the model for each cell type using a subset of sequences that are accessible in the respective cell type. The returned motifs are then matched to known motifs of TFs using Tomtom (Gupta )
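To make the gradient × input scores of panel (A) concrete, the following sketch computes them for a toy scorer rather than the ResNet model. The scorer (one filter scanned along the sequence, ReLU, then a sum over positions), the function names and the example filter are hypothetical stand-ins. Its gradient is written out by hand so that the example needs only NumPy.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix in A, C, G, T order."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x


def toy_model(x, filt):
    """Hypothetical scorer: scan one filter, apply ReLU, sum over positions."""
    k = filt.shape[0]
    pre = np.array([np.sum(x[i:i + k] * filt) for i in range(x.shape[0] - k + 1)])
    return float(np.maximum(pre, 0.0).sum()), pre


def gradient_times_input(x, filt):
    """Per-nucleotide contribution: d(output)/d(input) times input, summed over bases."""
    k = filt.shape[0]
    _, pre = toy_model(x, filt)
    grad = np.zeros_like(x)
    for i, activation in enumerate(pre):
        if activation > 0:  # ReLU passes gradient only where the pre-activation is positive
            grad[i:i + k] += filt
    return (grad * x).sum(axis=1)
```

For a trained network the gradient would come from automatic differentiation, and one score track would be computed per cell type (task), which is how the same sequence can receive different scores in different cellular contexts.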
Fig. 3. Predictive cis-sequence features and trans-regulators inferred from the models: (A) TF-MoDISco motifs (left) and row-normalized log TPM RNA-expression values (right) for each training cell type, where each row is a matching motif and TF. (B) t-SNE embedding of 250 additional cell types (points) based on the RNA-seq profiles of 1630 TFs (left) compared to t-SNE embedding of the same cell types based on the predicted chromatin accessibility profiles