Mengjun Wu, Manfred Schmid, Torben Heick Jensen, Albin Sandelin.
Abstract
The RNA exosome degrades transcripts in the nucleoplasm of mammalian cells. Its substrate specificity is mediated by two adaptors: the 'nuclear exosome targeting (NEXT)' complex and the 'poly(A) exosome targeting (PAXT)' connection. Previous studies have revealed some DNA/RNA elements that differ between the two pathways, but how informative these features are for distinguishing pathway targeting, or whether additional genomic features that are informative for such classifications exist, is unknown. Here, we leverage the wealth of available genomic data and develop machine learning models that predict exosome targets and subsequently rank the features the models use by their predictive power. As expected, features around transcript end sites were most predictive; specifically, the lack of canonical 3' end processing was highly predictive of NEXT targets. Other associated features, such as promoter-proximal G/C content and 5' splice sites, were informative, but only for distinguishing NEXT and not PAXT targets. Finally, we discovered predictive features not previously associated with exosome targeting, in particular RNA helicase DDX3X binding sites. Overall, our results demonstrate that nucleoplasmic exosome targeting is to a large degree predictable, and our approach can assess the predictive power of previously known and new features in an unbiased way.
Year: 2022 PMID: 36128426 PMCID: PMC9477074 DOI: 10.1093/nargab/lqac071
Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268
Figure 1. Overview of the computational framework for identifying molecular features for NEXT/PAXT targeting.
Figure 2. Exosome target dataset, feature design and machine learning framework. (A) Schematic overview of nucleoplasmic exosome target definition based on RNA-seq libraries of siRNA-depleted cofactors of NEXT and PAXT pathways versus wild-type (WT) cells. (B) Characterization of nucleoplasmic exosome targets. Bar plots on the left show the percentage (%) (Y-axis) of specific RNA biotypes for each exosome target category [X-axis; the ‘Others’ biotype consists of 17 types: unprocessed_pseudogene, transcribed_unitary_pseudogene, polymorphic_pseudogene, processed_pseudogene, rRNA_pseudogene, misc_RNA, TEC, snRNA, snoRNA, histone_coding, miRNA, NAT, intragenic, intergenic, TtT3, overlapping, nTtT, ambiguous; for details, see (28)]. Combined violin box plots in the middle show the distribution of the number of exons (Y-axis) for each exosome target category; combined violin box plots on the right show the distribution of TU length (Y-axis) for each exosome target category. (C) Schematic representation of the feature classes. For detailed feature descriptions, see Table 1 and the ‘Materials and Methods’ section. (D) Schematic overview of the machine learning framework.
Table 1. Molecular features used in this study

| Class | Feature category | Feature definition | Feature extraction | No. before filtering | No. after filtering |
|---|---|---|---|---|---|
| 1 | Chromatin environment | Levels of histone modification and chromatin remodeling (ChIP-seq for H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27me3, H3K27ac, H3K36me3, H4K20me1, H2A.Z; DNase-seq) | Sum of signals in ±500 bp window around TSS | 11 | 11 |
| 1 | Transcription levels | Nascent RNA levels by NET-seq | Sum of signals in −100 to +500 bp window around TSS | 1 | 1 |
| 1 | Transcription levels | Polymerase II loading (Pol II ChIP-seq) | Sum of signals in ±500 bp window around TSS | 1 | 1 |
| 1 | Sequence features | G/C content | G or C content in ±500 bp window around TSS | 1 | 1 |
| 1 | Sequence features | GC spread | Calculated (see the ‘Materials and Methods’ section) in +1 to +500 bp window to TSS | 1 | 1 |
| 1 | Sequence features | Presence of TATA box | Frequency of motif hit in −50 to +1 bp window to TSS | 1 | 0 |
| 1 | Sequence features | Presence of INR element | Frequency of motif hit in −15 to +10 bp window to TSS | 1 | 0 |
| 2 | Sequence features | Presence of 5′ SS motif | Frequency of motif hit in +1 to +500 bp window to TSS | 1 | 1 |
| 2 | Sequence features | Presence of PAS motif | Same as above | 1 | 1 |
| 2 | RBP motifs and binding sites | Presence of RBP binding motifs | Same as above | 193 | 77 |
| 2 | RBP motifs and binding sites | Presence of RBP binding sites by CLIP-seq | Frequency of binding sites in +1 to +500 bp window to TSS | 401 | 26 |
| 3 | Sequence features | PAS strength | PAS score calculated by APARENT (see the ‘Materials and Methods’ section) | 1 | 1 |
| 3 | Sequence features | Cleaved PAS | Frequency of motif hit in −50 to −1 bp window upstream of TES | 1 | 1 |
| 3 | Sequence features | Presence of 5′ SS motif | Frequency of motif hit in −500 to +1 bp window to TES | 1 | 1 |
| 3 | Sequence features | Presence of PAS motif | Same as above | 1 | 1 |
| 3 | RBP motifs and binding sites | Presence of RBP binding motifs | Same as above | 193 | 96 |
| 3 | RBP motifs and binding sites | Presence of RBP binding sites by CLIP-seq | Frequency of binding sites in −500 to +1 bp window to TES | 401 | 58 |
| 4 | Chromatin environment | Levels of histone modification and chromatin remodeling (ChIP-seq for H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27me3, H3K27ac, H3K36me3, H4K20me1, H2A.Z; DNase-seq) | Sum of signals in ±500 bp window around TES | 11 | 11 |
The features are divided into four classes by their locations with respect to TUs. To avoid misleadingly high z-scores from features with low information content, features are further filtered by entropy; only those with entropy > 0.8 across the three exosome target categories are retained. The columns ‘No. before filtering’ and ‘No. after filtering’ give the number of features before and after entropy filtering.
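The entropy filter in the note above can be illustrated with a minimal sketch. The record does not say how entropy was computed, so everything beyond the 0.8 threshold is an assumption made for illustration: Shannon entropy in bits, equal-width binning into 10 bins, and the rule that a feature must exceed the threshold in each of the three target categories. The function names `shannon_entropy` and `passes_entropy_filter` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins=10):
    """Shannon entropy (bits) of a feature after equal-width binning.

    ASSUMPTION: the binning scheme and log base are illustrative only;
    the source does not specify them.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no information
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    counts = Counter(bins)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def passes_entropy_filter(values_by_category, threshold=0.8):
    """Keep a feature only if its entropy exceeds the threshold in
    every target category (ASSUMPTION: 'across' is read as 'in all')."""
    return all(
        shannon_entropy(values) > threshold
        for values in values_by_category.values()
    )


# A sparse feature (for example, a rare motif hit) versus a well-spread one.
sparse = [0] * 99 + [1]
spread = list(range(100))
keep = passes_entropy_filter({"NEXT": spread, "PAXT": spread, "non": spread})
drop = passes_entropy_filter({"NEXT": spread, "PAXT": sparse, "non": spread})
```

Under these assumptions, a feature that is zero for almost every TU is dropped, which would be consistent with sparse motif features such as the TATA box and INR element going from 1 to 0 in the table.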
Figure 3. Predictive model of NEXT versus non-NEXT/PAXT targets. (A) Number of features retained after iterative selection, per feature class. Bar plot shows the number of features (X-axis) for each feature class (Y-axis) after iterative selection. Numbers in the parentheses show the number of initial features in each feature class. (B) Classification performance by random forest using single feature classes or combinations thereof. Bar plots in the upper panel show the average performance (F1 score on the left, AUC on the right) over 10 repetitions for initial feature set and consistent feature set after iterative selection, as indicated by bar color. Error bars show the standard deviation of the performance over 10 repetitions. The lower panel shows which feature class (dots) or combinations thereof (dots connected by black lines) were used for classification. (C) Feature importance of Class 1 features. Bar plot (left panel) shows the feature importance score (X-axis) of top ranked features (Y-axis) ordered by importance score. The distributions of feature values of selected features, split by NEXT and non-NEXT/PAXT targets, are shown as density plots to the right. (D–F) Feature importance of Class 2, 3 and 4 features, organized as in panel (C). Distributions to the right are shown as either density plots or histograms, depending on data type.
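The performance summary in panel (B), mean F1 score and AUC with standard deviation over repetitions, can be sketched as follows. This is not the authors' code: it only shows how the two metrics and their summary could be computed from per-repetition predictions, using the Python standard library. The function names and the use of the sample standard deviation are assumptions, and the toy numbers are invented for illustration.

```python
import statistics


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (pairwise definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def summarize(per_repetition_values):
    """Mean and sample standard deviation (ASSUMPTION: sample, not population)."""
    return (
        statistics.mean(per_repetition_values),
        statistics.stdev(per_repetition_values),
    )


# Toy example with invented labels, predictions and scores.
labels = [1, 1, 0, 0]
toy_f1 = f1_score(labels, [1, 0, 1, 0])
toy_auc = roc_auc(labels, [0.9, 0.4, 0.6, 0.1])
mean_f1, sd_f1 = summarize([0.8, 0.9, 1.0])
```

In practice one value of each metric would be collected per repetition (10 in the figure) and passed to `summarize` to obtain the bar height and error bar.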
Figure 4. Predictive model of PAXT versus non-NEXT/PAXT targets. Panels (A)–(F) are organized as those in Figure 3A–F, but based on PAXT versus non-NEXT/PAXT targets.
Figure 5. Predictive model of PAXT versus NEXT targets. Panels (A)–(F) are organized as those in Figure 3A–F, but based on PAXT versus NEXT targets.
Figure 6. Exosome target pathway-specific features. Schematic Venn diagram in the top panel shows the definition of pathway-specific features. Lower panels show the corresponding Venn diagrams of Class 2 (upper box) and Class 3 (lower box) features; the pathway-specific and shared features are listed in colored boxes. Arrows in front of NEXT- and non-NEXT/PAXT-specific features indicate whether the feature is depleted (arrow downward) or enriched (arrow upward) in NEXT or non-NEXT/PAXT targets compared to the other two target categories.
Figure 7. Multiclass classification. Bar plots (upper panel) show the average accuracy over 10 repetitions using either all features of the four classes or the combined set of consistent significant features from the four feature classes across all three comparisons. Error bars show the standard deviation of the performance over 10 repetitions. The lower panel shows the feature class (dark dots) or combination of feature classes (dark dots connected by a black solid line) used for classification.