DeeReCT-PolyA: a robust and generic deep learning method for PAS identification.

Zhihao Xia¹, Yu Li², Bin Zhang³, Zhongxiao Li², Yuhui Hu³, Wei Chen³, Xin Gao².

Abstract

MOTIVATION: Polyadenylation is a critical step for gene expression regulation during the maturation of mRNA. An accurate and robust method for poly(A) signal (PAS) identification is not only desired for better annotation of transcript ends, but can also help us gain deeper insight into the underlying regulatory mechanism. Although many methods have been proposed for PAS recognition, most of them are PAS motif-specific and human-specific, which leads to high risks of overfitting, low generalization power, and inability to reveal the connections between the underlying mechanisms of different mammals.
RESULTS: In this work, we propose a robust, PAS motif agnostic, and highly interpretable and transferrable deep learning model for accurate PAS recognition, which requires no prior knowledge or human-designed features. We show that our single model trained over all human PAS motifs not only outperforms the state-of-the-art methods trained on specific motifs, but can also be generalized well to two mouse datasets. Moreover, we further increase the prediction accuracy by transferring the deep learning model trained on the data of one species to the data of a different species. Several novel underlying poly(A) patterns are revealed through the visualization of important oligomers and positions in our trained models. Finally, we interpret the deep learning models by converting the convolutional filters into sequence logos and quantitatively compare the sequence logos between human and mouse datasets.
AVAILABILITY AND IMPLEMENTATION: https://github.com/likesum/DeeReCT-PolyA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances: Poly A

Year: 2019 PMID: 30500881 PMCID: PMC6612895 DOI: 10.1093/bioinformatics/bty991

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Polyadenylation, a critical and pervasive process (Proudfoot, 1991) during the maturation of mRNA, is essentially composed of two coupled steps: cleavage at the poly(A) site and addition of an adenosine tail. Studies have found a great number of poly(A) signals (PASs) in most eukaryotes (Shen et al., 2008a,b). It is well accepted that the recognition of polyadenylation sites requires a signal residing ∼10–30 nt upstream of the cleavage site (Proudfoot, 2011), which consists of a 6-nt sequence known as the PAS motif. Studies on PAS motifs and their surrounding regions are crucial, as they provide insights into how transcription is terminated, thereby determining the fate of the RNA transcripts (Shaw and Kamen, 1986), and into the association of PAS mutations with diseases (Lin et al.; Pastrello et al.). In the past few decades, numerous studies have pursued this goal by characterizing the sequence information around PASs and identifying core elements and their roles in polyadenylation. It has been shown that the sequences surrounding a poly(A) site are U-rich in general (Tian et al.). Similarly, GU-rich elements found downstream of the cleavage site are also believed to regulate polyadenylation and have been regarded as informative patterns for specifying true PASs (Zarudnaya et al.). Previous works have made great efforts to predict PASs in mRNA and genomic DNA sequences. Specifically, most methods aim to distinguish true PASs from pseudo ones, which have the same hexamer (e.g. ATTAAA) as a true PAS but are not involved in polyadenylation. Earlier studies (Helden et al.; Matis et al.; Tabaska and Zhang, 1999) focus on exploring the statistical information of the sequences surrounding PASs. Based on prior knowledge of DNA sequences, experts have proposed many carefully hand-crafted features, which form the basis of most PAS recognition models (Akhtar et al.; Cheng et al.; Hu et al.; Liu et al.; Salamov and Solovyev, 1997; Tabaska and Zhang, 1999).
Recently, several methods have greatly advanced the accuracy of human poly(A) recognition. Kalkatawi et al. provided a benchmark (denoted as the Dragon human data) containing 14 740 sequences for 12 human PAS motif variants. Based on many expert-crafted features, they proposed an artificial neural network-based method and a random forest (RF)-based method. After that, Xie et al. derived a new set of latent features for DNA sequences with a hidden Markov model (HMM) and fed these features to an SVM model for classification (referred to as HSVM hereafter), which significantly increased the accuracy on the Dragon human data. Very recently, Magana-Mora et al. developed a new set of hand-crafted features in combination with multiple classification models, including decision trees, RF, etc. Their model, Omni-PolyA, improved on the results of HSVM. However, methods like RF, HSVM and Omni-PolyA have two limitations. First, they are all feature-based methods, which require a large amount of prior knowledge and cannot cope with the rapidly increasing size of data. Second and more importantly, they are not robust or generic, in the sense that they require training a separate model for each of the 12 human motif variants. Moreover, because their features are hand-crafted for human PAS motifs, they cannot be extended to different species; for example, PAS motifs are slightly different between human and mouse. In this article, we propose an accurate and robust deep learning model, i.e. Deep Regulatory Code and Tools for Polyadenylation (DeeReCT-PolyA), for PAS recognition, which is PAS motif variant agnostic. We train one single generic convolutional neural network (CNN) that can deal with all 12 human PAS motif variants and still significantly reduces the error rate on both standard human benchmarks, compared with the state-of-the-art methods, which require training 12 separate models.
Further experiments on C57BL/6J (BL) and SPRET/EiJ (SP) mouse data demonstrate that our model performs consistently well across different species. Moreover, we adopt transfer learning in PAS identification by transferring a deep neural network pre-trained on one dataset to recognize PAS motifs in a new dataset of a different species. We show that transfer learning can further improve the accuracy and, more importantly, address the problem of insufficient training data. We propose several methods to visualize our model and investigate its biological significance. We reveal some novel patterns in poly(A) regulation and the similarities of such patterns across different species, including human, BL mouse and SP mouse.

2 Materials and methods

2.1 The proposed CNN for PAS recognition

We propose a deep neural network-based method, DeeReCT-PolyA, for automatic feature extraction and PAS identification. Figure 1 illustrates the architecture of DeeReCT-PolyA. The first layer is a convolutional layer consisting of 16 filters, which are essentially motif detectors. When processing a raw DNA sequence (encoded with one-hot encoding), each motif detector searches for sequence patterns that can discriminate true PASs from pseudo ones. The outputs of the convolutional layer are divided into several groups and normalized within each group by a subsequent filter-group normalization (GN) layer. We propose this novel normalization layer with the motivation that some motif detectors are correlated, and show that it improves the PAS recognition accuracy compared with other normalization techniques in deep learning (e.g. batch normalization). More importantly, results show that, compared with batch normalization, GN enables the model to be naturally transferrable from a pre-trained task to a new target dataset (Section 2.5). Details and results for GN can be found in Supplementary Material S3.
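The first two stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the exact form of GN is given only in Supplementary Material S3, so the normalization below (zero mean and unit variance computed jointly over all positions and all filters of a group, for one sequence) is an assumption, as are the function names and the absence of learned scale and shift parameters.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix; unknown bases stay all-zero."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def conv1d_valid(x, filters):
    """Slide each filter over the sequence without padding.

    x: (length, 4); filters: (n_filters, width, 4)
    returns: (length - width + 1, n_filters)
    """
    n_filters, width, _ = filters.shape
    n_pos = x.shape[0] - width + 1
    out = np.empty((n_pos, n_filters), dtype=np.float32)
    for p in range(n_pos):
        window = x[p:p + width]  # (width, 4)
        out[p] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return out

def filter_group_norm(feat, n_groups, eps=1e-5):
    """Normalize filter outputs jointly within each group of filters (assumed form)."""
    n_pos, n_filters = feat.shape
    if n_filters % n_groups != 0:
        raise ValueError("n_filters must be divisible by n_groups")
    grouped = feat.reshape(n_pos, n_groups, n_filters // n_groups)
    mean = grouped.mean(axis=(0, 2), keepdims=True)
    var = grouped.var(axis=(0, 2), keepdims=True)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(n_pos, n_filters)
```

With 16 filters of length 10 (the values used in Section 2.2) and an illustrative 200 nt input, `conv1d_valid` yields a 191 × 16 feature map, which `filter_group_norm` then normalizes group by group.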

Fig. 1.

The architecture of the proposed DeeReCT-PolyA network. The output feature channels (shown as a column) of the conv layer are divided into groups (green arrows) and each group is jointly normalized by the group normalization layer. After tunable parameters are learned from the data, two visualization methods (shown as dashed lines in green and gray) are applied to the model without normalization to extract cis-elements and variants for the regulation of polyadenylation (Color version of this figure is available at Bioinformatics online.)

A rectified linear unit is applied to the normalized results as the activation function. After a max-pooling layer, all feature vectors are concatenated and fed to the fully connected (FC) network, followed by a softmax function that normalizes the prediction to values between 0 and 1. We examined several other deep learning architectures (Supplementary Material S1) before choosing the proposed one. In addition to the proposed normalization layer, we equip the proposed CNN with several other techniques. Dropout is applied to the hidden neurons in the FC network (Srivastava et al.) to alleviate overfitting. Namely, at each training iteration, some hidden neurons are randomly set to 0, which forces the model to make predictions with a subset of the parameters and yields a more robust model. In addition, our experiments show empirically that dropout also prevents the optimization of the model from getting stuck at a local optimum. As for the initialization of the network, we find that the standard Xavier initializer (Glorot and Bengio, 2010) often leads to unsatisfactory results. We therefore sample the initial value of each parameter from a normal distribution with zero mean and a random variance, and then do a random search on the validation dataset to find the best initialization variance for each layer.
Our loss function comprises two terms: a cross-entropy loss of the prediction against true labels and a regularization term of the weights in the convolutional layer and FC layers, which is known as weight decay. The loss function is minimized by stochastic gradient descent with momentum. We also apply an exponential decay to the learning rate every 3000 iterations to make the training process more stable.
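The training objective and schedule described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the L2 form of the weight-decay term, the staircase shape of the decay, and all numeric defaults (`decay_rate`, `momentum`, `weight_decay`) are assumptions; only the 3000-iteration interval comes from the text.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood.

    probs: (batch, n_classes) softmax outputs; labels: (batch,) integer classes.
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def total_loss(probs, labels, weights, weight_decay):
    """Cross-entropy plus an L2 penalty over the listed weight arrays (assumed form)."""
    l2 = sum(float(np.sum(w ** 2)) for w in weights)
    return cross_entropy(probs, labels) + weight_decay * l2

def learning_rate(step, base_lr, decay_rate, decay_every=3000):
    """Staircase exponential decay: multiply by decay_rate once every decay_every steps."""
    return base_lr * decay_rate ** (step // decay_every)

def momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update; returns the new parameter and the new velocity."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity
```

For example, with a hypothetical `base_lr=0.1` and `decay_rate=0.5`, the learning rate stays at 0.1 for iterations 0 to 2999 and drops to 0.05 at iteration 3000.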

2.2 Cross validation and hyper-parameter search

To be consistent with previous studies, we use the standard 5-fold cross validation in all experiments. Specifically, we randomly partition the data into five equally sized folds and use three of them for training, one for validation and the remaining one for testing. In the DeeReCT-PolyA model, there are several hyper-parameters we need to specify for the training process, including the number of groups in GN, the learning rate, etc. The validation fold is used for hyper-parameter search. As in Alipanahi et al., the search randomly samples sets of hyper-parameters and tests the performance of each set on the validation data. The model with the best accuracy on the validation fold is retained as the final model and evaluated on the test set. The convolutional layer is set to have 16 filters with a length of 10. The number of hidden nodes in the FC network is searched within {32, 64, 128} and finally fixed to 64. We also sample the number of groups in GN and the keep probability of the dropout layer. Details can be found in Supplementary Material S2 and Supplementary Table S1.
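A minimal sketch of this protocol is given below, assuming that the validation fold is simply the fold following the test fold (the text does not say how it is chosen). The helper names are hypothetical; of the search space, only the hidden-node candidates {32, 64, 128} come from the text.

```python
import numpy as np

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train, val, test) index arrays; every fold serves as the test fold once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        v = (k + 1) % n_folds  # assumed choice of the validation fold
        train = np.concatenate(
            [folds[j] for j in range(n_folds) if j not in (k, v)]
        )
        yield train, folds[v], folds[k]

def sample_hyperparameters(rng, space):
    """Draw one random configuration from a dict mapping names to candidate lists."""
    return {name: choices[rng.integers(len(choices))]
            for name, choices in space.items()}
```

For the 14 740 sequences of the Dragon human data, each split has 8844 training, 2948 validation and 2948 test sequences. A random search would repeatedly call `sample_hyperparameters`, train on the training folds, and keep the configuration with the best validation accuracy.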

2.3 Evaluation metric

To measure the performance of our model, we adopt the classification error rate as the evaluation metric. The error rate is defined as

$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN},$$

where TP, TN, FP and FN stand for the numbers of true positive, true negative, false positive and false negative predictions, respectively.
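As a small illustration, the error rate (the fraction of misclassified sequences) can be computed from binary labels as below. The function names are hypothetical, and labels are assumed to be 1 for a true PAS and 0 for a pseudo-PAS.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = true PAS, 0 = pseudo-PAS)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def error_rate(tp, tn, fp, fn):
    """Fraction of wrong predictions among all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (fp + fn) / total
```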

2.4 Datasets

We comprehensively evaluate our model on four different datasets, including two human PAS motif benchmarks. The first benchmark, the Dragon human poly(A) dataset, was proposed by Kalkatawi et al. and contains 14 740 sequences for the 12 main human PAS motif variants. Many previously proposed methods have used this dataset as a standard benchmark for comparative analysis. The second benchmark (referred to as the Omni human poly(A) dataset) was proposed very recently by Magana-Mora et al. Omni human poly(A) is a much larger dataset that consists of 18 786 true PAS sequences for 12 human PAS motif variants. An equal number of pseudo-PAS sequences are extracted from human Chromosome 21 after excluding all the true PAS sequences. To assess the performance of our model beyond human, we apply it to two additional poly(A) datasets from the C57BL/6J (BL) and SPRET/EiJ (SP) mouse strains, respectively. Overall, there are 46 224 genomic sequences in the BL mouse dataset and 40 230 sequences in the SP dataset. Both datasets consist of an equal number of true PAS and pseudo-PAS sequences. The positive sequences are generated from the poly(A) sites identified in a previous study, which applied a well-established genome-wide experimental approach to capture the poly(A) sites (Xiao et al.). In brief, the poly(A)-containing RNAs, which are extracted from fibroblasts of C57BL/6J and SPRET/EiJ mice, are fragmented and only the 3’-end is captured for high-throughput sequencing. The 200 nt genomic sequences flanking the poly(A) sites are extracted from the BL and SP genomes, respectively. We obtain candidate pseudo-PAS sequences by scanning the genomic sequences of the transcripts expressed in the same cell lines and selecting those that are not close to any annotated transcription end in GENCODE or to any poly(A) site identified by the experimental data. From these candidates, we then randomly retrieve the same number of pseudo-PAS sequences as true PAS sequences, containing the same PAS motifs.
Note that for the mouse data, there are 13 potential PAS motif variants instead of 12 for the human poly(A) data.

2.5 Transfer pre-trained models to new datasets

Transfer learning has been widely used (Do and Ng, 2006; Li et al.; Yosinski et al.). The idea of transfer learning is that one can solve a problem with the knowledge gained by solving another similar problem. There are two major advantages of applying transfer learning to the PAS recognition problem. First, integrating the information from two related datasets with transfer learning is likely to further boost the performance, as we will show in Section 3.3. Second and most importantly, in many species there is only a limited amount of annotated poly(A) data (Ji et al., 2014), which poses great challenges to most machine-learning methods, especially data-hungry ones such as deep learning. One possible solution is to transfer an available model learned from a larger dataset (e.g. human poly(A) data or mouse poly(A) data) to the other species, and then fine-tune the network with a small set of new data (Section 3.3). To investigate the efficacy of pre-training and transfer learning in PAS identification, we adopt transfer learning in two different scenarios. In the first scenario, we incorporate a different dataset to increase the classification accuracy. Specifically, we take a model pre-trained on one dataset and fine-tune it with the whole target dataset (cross-species or not). The fine-tuned model is then evaluated on the target dataset in comparison with a baseline model that is trained only on target data without pre-training. In the second scenario, we evaluate whether transfer learning can address the problem of insufficient training data, in which we are limited by the number of annotated PAS sequences in the target species. The experiment is conducted on the dataset of a new species, a rat poly(A) dataset proposed very recently by Wang et al. Models are pre-trained on human or mouse data and tested on the new rat dataset. We investigate scenarios in which different numbers of training sequences are available to fine-tune the pre-trained model.
We split one fifth of the rat dataset as the test dataset, on which all pre-trained and fine-tuned models are evaluated. Different numbers of rat sequences are sampled from the rest of the dataset for fine-tuning. The detailed procedure of transfer learning is presented in Supplementary Material S4.

2.6 Visualization of oligomer importance at different positions

A particular advantage of our model is that it can provide a characterization of critical features in the sequence context flanking the poly(A) motif. Following Xie et al., we investigate the importance of all possible 2 nt subsequences at different locations. As shown by the green arrows in Figure 1, the method we use is to construct a special sequence to feed the trained model and take the outputs for visualization (Supplementary Material S5). We have also visualized the importance of positions by examining where the convolutional filters are mostly ‘looking at’ (Supplementary Material S5 and Supplementary Fig. S2).

2.7 Convolutional filter visualization and similarity measurement

Convolutional filters, which directly scan through the whole genomic sequence, are believed to carry abundant information about polyadenylation. We generate sequence logos to interpret this information. Specifically, for each filter of DeeReCT-PolyA, we derive a position frequency matrix (PFM) by inputting all genomic sequences in the validation dataset to that filter. The filter assigns a score to every fixed-length subsequence of the input sequence. The subsequence with the maximum score (if larger than 0) is considered an instance of the cis-element captured by that filter from the input sequence. The instances captured by a filter over all validation data are aligned to generate a PFM, which is subsequently transformed into a sequence logo (Supplementary Material S5). We empirically find that only 11 filters are active in the model trained with the Dragon human data. This is probably due to the limited size of the dataset. Furthermore, we quantify the similarity of the extracted cis-elements by measuring the correlation between the convolutional layers of different models. This can be directly computed from the PFM of each filter. Among the many existing measures of PFM similarity, we adopt the one proposed by Pape et al. Since each model has 16 filters, when comparing two models (e.g. A and B), for each filter in model A we compute its similarity against all 16 filters in model B and take the maximum similarity score as the score for this filter. We then average these scores over all filters in A.
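The PFM construction and the model-level aggregation can be sketched as follows. This is an illustrative sketch under stated assumptions: the filter is applied here as a plain dot product with each one-hot window, and the pairwise PFM similarity of Pape et al. is not implemented; `model_similarity` takes a precomputed similarity matrix `sim` as input. All function names are hypothetical.

```python
import numpy as np

def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        j = alphabet.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def best_instance(x, filt):
    """Return (score, start) of the highest-scoring window of a one-hot sequence."""
    width = filt.shape[0]
    scores = np.array([np.sum(x[p:p + width] * filt)
                       for p in range(x.shape[0] - width + 1)])
    start = int(np.argmax(scores))
    return float(scores[start]), start

def position_frequency_matrix(seqs_one_hot, filt):
    """Sum the one-hot windows that maximally activate the filter (score > 0 only)."""
    width = filt.shape[0]
    pfm = np.zeros((width, 4))
    for x in seqs_one_hot:
        score, start = best_instance(x, filt)
        if score > 0:
            pfm += x[start:start + width]
    return pfm

def model_similarity(sim):
    """sim[i, j]: similarity of filter i in model A to filter j in model B.

    Take the best match in B for each filter of A, then average over A's filters.
    """
    return float(np.max(sim, axis=1).mean())
```

Each row of the resulting PFM counts how often A, C, G and T occur at that position among the captured instances, which is the input a sequence-logo tool expects.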

3 Results

We comprehensively evaluated the performance of DeeReCT-PolyA on the Dragon and Omni human datasets, and the SP and BL mouse datasets. We first showed that DeeReCT-PolyA outperforms the state-of-the-art PAS recognition methods on both Dragon and Omni human datasets (Section 3.1). We then evaluated the robustness of DeeReCT-PolyA by evaluating it on the SP and BL mouse datasets, challenging it to recognize unseen motifs by cross-motif validation and testing it on noisy regions in sequencing data (Section 3.2). Transfer learning across datasets and species was evaluated under two scenarios where we had sufficient and insufficient number of sequences for training in Section 3.3. Results showed that with transfer learning, DeeReCT-PolyA could achieve significantly better performance than models without pre-training when there are insufficient training data. The importance of dimers at different positions is measured in Section 3.4. We finally interpreted the deep learning model by constructing the sequence logos from the convolutional filters and measuring the similarities of the logos between human and mouse species (Section 3.5).

3.1 Performance comparison on human PAS prediction

We first evaluated our model, DeeReCT-PolyA, on two human poly(A) benchmarks. We report the average error rates over the 5-fold cross-validation. As shown in Table 1, our model is significantly more accurate than previous state-of-the-art methods on the Dragon human poly(A) dataset (Kalkatawi et al.), reducing the average error rate by 2.86 percentage points. Note that while RF, HSVM and Omni-PolyA each trained 12 variant-specific models for the 12 human PAS motif variants, DeeReCT-PolyA uses a single generic model that deals with all variants simultaneously, which is much more challenging. Our results show that in spite of this, our variant-agnostic model still outperforms the variant-specific models of Omni-PolyA on most PAS motif variants (11 out of 12).

Table 1.

Error rate comparison between RF, HSVM, Omni-PolyA and our model (DeeReCT-PolyA) on the Dragon human poly(A) data

Variants	Size	Error Rate (%)
		RF	HSVM	Omni-PolyA	DeeReCT-PolyA	Rel
AATAAA	5190	20.06	18.59	14.02	11.81	2.21
ATTAAA	2400	18.42	16.21	12.50	9.00	3.50
AAAAAG	1250	16.64	9.36	10.80	5.77	3.59
AAGAAA	1230	11.06	5.45	4.87	7.76	−2.89
TATAAA	880	19.55	15.34	13.52	7.69	5.83
AATACA	780	19.36	11.15	13.85	10.45	0.70
AGTAAA	690	27.83	16.96	14.49	9.55	4.94
ACTAAA	670	22.09	14.33	13.13	10.72	2.41
GATAAA	460	20.00	9.57	8.48	8.04	0.44
CATAAA	410	18.54	9.27	13.41	9.02	0.25
AATATA	410	24.88	12.68	14.39	8.78	3.90
AATAGA	370	18.38	5.14	11.62	4.59	0.55
Average	–	19.19	14.42	12.43	9.57	2.86

Note: Rel denotes the improvement of DeeReCT-PolyA with respect to the best of the other three methods. Bold indicates the error rate of the best model for each PAS motif variant. Average is the weighted average of all motif variants with the size as weights. While results of all three previous methods are reported for 12 variant-specific models, the results of DeeReCT-PolyA are the performance of one single generic model that deals with all 12 variants.
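As a worked check of the size-weighted average described in the note, the short sketch below recomputes the DeeReCT-PolyA average from the sizes and error rates listed in Table 1. The variable and function names are illustrative only.

```python
# (motif, size, DeeReCT-PolyA error rate in %) copied from Table 1
table1 = [
    ("AATAAA", 5190, 11.81), ("ATTAAA", 2400, 9.00), ("AAAAAG", 1250, 5.77),
    ("AAGAAA", 1230, 7.76), ("TATAAA", 880, 7.69), ("AATACA", 780, 10.45),
    ("AGTAAA", 690, 9.55), ("ACTAAA", 670, 10.72), ("GATAAA", 460, 8.04),
    ("CATAAA", 410, 9.02), ("AATATA", 410, 8.78), ("AATAGA", 370, 4.59),
]

def weighted_average(rows):
    """Average the error rates, weighting each motif variant by its dataset size."""
    total = sum(size for _, size, _ in rows)
    return sum(size * err for _, size, err in rows) / total

total_size = sum(size for _, size, _ in table1)   # 14740 sequences in total
average = round(weighted_average(table1), 2)      # 9.57, matching the Average row
```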

Similar conclusions can be drawn from Table 2, in which we evaluated our model on the recent Omni human poly(A) dataset (Magana-Mora et al.). Results for RF and HSVM were reported by Magana-Mora et al. The results show that our generic model performs better than the variant-specific models on the majority of the PAS motif variants, which leads to a clear improvement in the average error rate. Although our model is agnostic to PAS motif variants, we still investigated the differences between PASs. Specifically, we visualized whether the model ‘looks for’ different cis-elements and ‘looks at’ different positions for different PAS motifs (Supplementary Material S6).

Table 2.

Error rate comparison between RF, HSVM, Omni-PolyA and our model (DeeReCT-PolyA) on the Omni human poly(A) data

Variants	Size	Error Rate (%)
		RF	HSVM	Omni-PolyA	DeeReCT-PolyA	Rel
AATAAA	24310	25.49	27.91	23.96	21.99	1.97
ATTAAA	7098	25.59	33.48	24.20	23.01	1.09
AAAAAG	1640	26.52	36.83	25.86	27.76	−1.90
AAGAAA	1306	26.67	34.77	23.07	26.80	−3.73
TATAAA	682	30.88	38.38	26.91	23.60	3.31
AATACA	634	24.41	36.98	22.06	22.00	0.06
AGTAAA	528	28.11	37.31	23.26	20.21	3.05
ACTAAA	368	32.97	33.89	24.72	25.79	−1.07
GATAAA	342	31.18	41.76	29.41	22.15	7.26
CATAAA	314	28.89	39.03	24.51	25.54	−1.03
AATATA	250	31.60	36.00	26.80	17.82	8.98
AATAGA	100	34.00	40.00	23.00	20.00	3.00
Average	–	25.93	30.43	24.15	22.64	1.51

Note: Rel denotes the improvement of DeeReCT-PolyA with respect to the best of the other three methods. Bold indicates the error rate of the best model for each PAS motif variant. Average is the weighted average of all motif variants with the size as weights.

3.2 Robustness of DeeReCT-PolyA to PAS variants and different species

To demonstrate the robustness of our model, we validated DeeReCT-PolyA on two mouse poly(A) datasets, i.e. SP and BL (Section 2.4), which cannot be handled by conventional human-feature-based methods. Table 3 shows that our model achieved an error rate of 24.11 and 23.49% on SP and BL mouse data, respectively. While there is no existing method for poly(A) identification of SP and BL data, for the purpose of comparison we trained a linear regression (LR) model. The LR model yielded an average error rate of 27.26 and 26.98% on SP and BL mouse data, which is significantly worse than DeeReCT-PolyA. We have also tried adding some hand-crafted features to our DeeReCT-PolyA model; however, the improvement is marginal. Detailed results can be found in Supplementary Material S7.

Table 3.

Error rate of DeeReCT-PolyA on SP and BL mouse poly(A) data

Variants	SP		BL
Variants	Size	Error Rate (%)	Size	Error Rate (%)
AATAAA	17 708	26.50	20 250	25.48
ATTAAA	7550	25.30	9056	24.89
TTTAAA	2336	19.95	2688	18.19
TATAAA	2178	22.91	2518	22.44
AGTAAA	2224	22.88	2376	21.63
CATAAA	1432	20.53	1760	19.77
AATATA	1334	23.55	1528	23.23
AATACA	1210	21.40	1326	22.55
GATAAA	1032	17.84	1176	18.54
AAGAAA	1022	15.07	1126	15.81
AATGAA	982	18.84	1108	18.86
ACTAAA	728	19.37	776	20.24
AATAGA	494	18.64	536	21.24
Average	–	24.11	–	23.49

Although results from four benchmarks show that our model can generalize over PAS motif variants, we want to ask a further question: is the model able to recognize a PAS motif that it has never seen before? If so, DeeReCT-PolyA could be used to process non-annotated DNA sequences, which may help us find PASs that are currently unknown. We answered this question by evaluating our model with a leave-one-motif-out validation on the Dragon human data. Specifically, we trained a DeeReCT-PolyA model with human poly(A) data that contains only 11 types of PAS variants. All data for the remaining left-out variant, which our model did not see during training, were used as the test set. For each PAS motif variant, Table 4 reports the prediction error rate of two models evaluated on the data of this variant: one was trained with the regular 5-fold cross-validation (Section 3.1) and the other was trained with data of all PAS motif variants excluding this held-out variant. The results show that our model works surprisingly well even on PAS motif variants that were not included in the training dataset. Notably, for most PAS motif variants (8 out of 12), the model trained with leave-one-motif-out data surpasses the model in 5-fold cross-validation. The reason is that although the network could not see any sequence containing the test motif, it did see more training examples, because all the data from the remaining 11 variants were included in the training. As an example, when testing the motif AATACA, the leave-one-motif-out model was trained with 14 740 – 780 = 13 960 sequences, while the 5-fold cross-validation model only had 14 740 × 4/5 = 11 792 examples for training. The similarity of these twelve leave-one-motif-out models is visualized in Supplementary Material S5 and Supplementary Figure S3.
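The leave-one-motif-out split and the training-set sizes quoted above can be reproduced with the following sketch. The splitting helper is a hypothetical illustration; the per-motif sizes are those listed for the Dragon human data in Table 1.

```python
# Number of Dragon human sequences per PAS motif variant (from Table 1)
DRAGON_SIZES = {
    "AATAAA": 5190, "ATTAAA": 2400, "AAAAAG": 1250, "AAGAAA": 1230,
    "TATAAA": 880, "AATACA": 780, "AGTAAA": 690, "ACTAAA": 670,
    "GATAAA": 460, "CATAAA": 410, "AATATA": 410, "AATAGA": 370,
}

def leave_one_motif_out(motifs):
    """Yield (held_out, train_idx, test_idx) for every distinct motif label.

    motifs: one motif label per sequence, in dataset order.
    """
    for held_out in sorted(set(motifs)):
        train_idx = [i for i, m in enumerate(motifs) if m != held_out]
        test_idx = [i for i, m in enumerate(motifs) if m == held_out]
        yield held_out, train_idx, test_idx

# Stand-in label list with the same per-motif counts as the Dragon human data
labels = [m for m, n in DRAGON_SIZES.items() for _ in range(n)]
split_sizes = {m: (len(tr), len(te)) for m, tr, te in leave_one_motif_out(labels)}

total = sum(DRAGON_SIZES.values())   # 14740
cv_train_size = total * 4 // 5       # 11792, the 5-fold figure quoted in the text
```

For AATACA, `split_sizes["AATACA"]` is `(13960, 780)`, so the leave-one-motif-out model sees 2168 more training sequences than the 5-fold cross-validation model.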

Table 4.

DeeReCT-PolyA with leave-one-motif-out test on the Dragon human dataset

Variants	Size	Error Rate (%)
Variants	Size	5-fold cross-validation	leave-one-motif-out
AATAAA	5190	11.81	14.20
ATTAAA	2400	9.00	8.17
AAAAAG	1250	5.77	5.45
AAGAAA	1230	7.76	7.52
TATAAA	880	7.69	7.18
AATACA	780	10.45	8.86
AGTAAA	690	9.55	7.46
ACTAAA	670	10.72	11.16
GATAAA	460	8.04	8.04
CATAAA	410	9.02	11.46
AATATA	410	8.78	8.54
AATAGA	370	4.59	4.05
Average	–	9.57	10.08

Note: For the leave-one-motif-out test, for each PAS variant, a DeeReCT-PolyA model was trained with data of all the other motif variants and then tested only on this variant. Bold indicates the error rate of the best model for each PAS variant.

We further investigated whether the proposed model is robust to noise by introducing mismatches in the whole sequence or in certain sub-regions. The results show that DeeReCT-PolyA is generally robust to mismatches: even with 40 mismatches in the whole sequence, it still has an error rate of only around 18% on the Dragon human data (Supplementary Material S8).
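One way to generate such noisy inputs is sketched below. The exact perturbation procedure is described only in Supplementary Material S8, so this version (substituting k distinct, uniformly chosen positions with a different base) is an assumption, and the function name and parameters are hypothetical.

```python
import random

def introduce_mismatches(seq, k, start=0, end=None, seed=None):
    """Substitute k distinct positions of seq[start:end] with a different base."""
    rng = random.Random(seed)
    end = len(seq) if end is None else end
    if k > end - start:
        raise ValueError("k exceeds the number of positions in the region")
    chars = list(seq)
    for pos in rng.sample(range(start, end), k):
        # Choose among the bases that differ from the current one
        chars[pos] = rng.choice([b for b in "ACGT" if b != chars[pos]])
    return "".join(chars)
```

Calling `introduce_mismatches(seq, 40)` perturbs the whole sequence, while passing `start` and `end` restricts the mismatches to a sub-region such as the flank upstream of the motif.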

3.3 Transfer learning for PAS recognition

To validate the idea of transfer learning in the PAS recognition problem, we tested its performance in two cases, in which we have sufficient and insufficient target training sequences, respectively. In the first case, we evaluated transfer learning on three similarly sized datasets: the Omni human dataset and the SP and BL mouse datasets. The Dragon human dataset was not considered in this scenario since it is relatively small. Table 5 shows the performance of transferred models evaluated on the SP dataset. We first pre-trained the model on the Omni human or BL mouse dataset and then fine-tuned it with all SP training data. Note that before fine-tuning, the model has not seen any data from SP, the dataset on which it is evaluated. Yet, even without fine-tuning, the transferred model achieves a result comparable to that of the baseline model, which is trained only on the target dataset with no pre-training, especially when we transfer the model between close species. After fine-tuning, the transferred model outperforms the model without pre-training. Similar conclusions can be drawn from Tables 6 and 7, where we transfer models to the BL data and the Omni human data, respectively. The consistent results in these tables illustrate that even when we already have sufficient data for training, transferring a pre-trained model from a related dataset can further improve the prediction accuracy, even when the transfer is cross-species (human to mouse or mouse to human).

Table 5.

Evaluation of transferred DeeReCT-PolyA models on SP mouse poly(A) data before and after fine-tuning

	Average Error Rate (%)
Pre-trained on	None	Omni	BL
Before fine-tuning	–	30.23	23.67
After fine-tuning	24.11	24.04	22.57

Note: None denotes a model of no pre-training and trained with SP mouse data. Models respectively pre-trained on Omni and BL dataset are evaluated on SP mouse dataset before and after fine-tuning with SP data. Average error rate over all PAS motif variants is reported.

Table 6.

Evaluation of transferred DeeReCT-PolyA models on BL mouse poly(A) data before and after fine-tuning

	Average Error Rate (%)
Pre-trained on	None	Omni	SP
Before fine-tuning	–	29.75	23.13
After fine-tuning	23.49	23.38	22.08

Table 7.

Evaluation of transferred DeeReCT-PolyA models on Omni human poly(A) data before and after fine-tuning

	Average Error Rate (%)
Pre-trained on	None	SP	BL
Before fine-tuning	–	29.58	29.07
After fine-tuning	22.64	22.40	22.44

In the second case, we addressed the problem of insufficient training data by transferring a pre-trained model to a new species. Specifically, models were pre-trained on human or mouse data and tested on the new rat dataset. We investigated scenarios in which different numbers of rat sequences are available to fine-tune the pre-trained model. The baseline is a model without any pre-training. The results (Table 8) show that without pre-training, the CNN fails when the training data are insufficient. However, with pre-training on a different species, even if the pre-trained model itself (e.g. the model pre-trained on the Dragon human data) is not suitable for prediction on the new dataset, the model achieves much better performance after fine-tuning with a very small number of sequences. As expected, as the number of training sequences increases, the error rate of the fine-tuned models becomes smaller.

Table 8.

Transfer learning for insufficient amount of sequences in the rat poly(A) dataset

	Average Error Rate (%)
Pre-trained on	None	Dragon	Omni	SP	BL
n = 0	–	40.55	29.30	22.11	22.40
n = 100	50.00±0.00	39.32±1.84	28.94±0.31	22.65±0.74	22.27±0.12
n = 500	48.90±1.47	29.72±3.79	25.61±0.36	22.63±0.37	22.22±0.22
n = 1000	49.71±0.76	26.44±0.68	24.77±0.22	22.10±0.16	22.03±0.18
n = 2000	49.06±1.40	25.26±0.54	24.35±0.23	22.04±0.20	22.43±0.33
n = 5000	26.88±8.74	24.25±0.22	23.65±0.16	21.91±0.21	21.90±0.25
n = 10 000	22.63±0.16	23.13±0.36	22.68±0.08	21.67±0.34	21.40±0.18
n = 42 233	20.23	20.48	20.38	19.82	19.99

Note: n denotes the number of rat sequences used for fine-tuning. For every n except 0 and 42 233 (the total size of rat training data), n sequences are randomly sampled from the rat training dataset and used to fine-tune the pre-trained model. Such step is repeated 10 times for every n. The table shows the average error rate of these 10 repeats with the standard deviation on the rat test set. None indicates a model without any pre-training.
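The repeated-sampling protocol in the note above can be sketched as follows. This is a minimal illustration, not the authors' code: `repeated_subsample_eval`, `fine_tune_and_eval`, `toy_train` and `toy_eval` are hypothetical names, sampling without replacement is an assumption, and the sample standard deviation is used because the text does not say which estimator was reported.

```python
import random
import statistics


def repeated_subsample_eval(train_set, n, fine_tune_and_eval, repeats=10, seed=0):
    """Mean and standard deviation of the test error over random subsamples.

    fine_tune_and_eval is a hypothetical callable: it receives the sampled
    sequences, fine-tunes a copy of the pre-trained model on them, and
    returns the error rate (%) on the fixed test set.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        subset = rng.sample(train_set, n)  # n sequences, without replacement (assumption)
        errors.append(fine_tune_and_eval(subset))
    # Sample standard deviation; the source does not state which estimator was used.
    return statistics.mean(errors), statistics.stdev(errors)


# Toy stand-in so the sketch runs: the "error" is just the share of
# negative labels in the subsample, not a real model evaluation.
toy_train = [("ACGT" * 5, i % 2) for i in range(1000)]


def toy_eval(subset):
    return 100.0 * sum(1 for _, label in subset if label == 0) / len(subset)


mean_err, std_err = repeated_subsample_eval(toy_train, n=100, fine_tune_and_eval=toy_eval)
print(f"{mean_err:.2f}±{std_err:.2f}")
```

In a real run, `fine_tune_and_eval` would start every repeat from the same pre-trained weights, so that the reported spread reflects only the choice of the n fine-tuning sequences.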

We also observed that if the model is pre-trained on a less similar species, e.g. pre-trained on human data and tested on rats, fine-tuning is crucial, as it can adapt the pre-trained model to the new data even when only very limited data are available for fine-tuning. For similar species, i.e. mice to rats, the pre-trained model is already quite good before any fine-tuning. Note that we did not perform any hyper-parameter search (e.g. over learning rate or dropout rate) for fine-tuning because of insufficient data; instead, we used the best set of hyper-parameters found during pre-training. In general, the results not only demonstrate that our proposed DeeReCT-PolyA model can generalize across species, but also offer an effective method for addressing the shortage of PAS annotation in some species.

3.4 Visualizing importance of dimers and positions

One advantage of our method is that it can extract important patterns of polyadenylation by learning from data. It thus provides a direct way to understand the underlying regulatory elements of polyadenylation through visualization of the trained model. Figure 2 presents the importance of all 16 dimers at different positions surrounding the PAS motif. Several interesting patterns can be observed. For the Dragon human poly(A) dataset, dimer AA was found in Xie to be an informative subsequence within 30 nt downstream of the candidate PAS motif, and our deep learning model yields the same observation. Another finding is that the appearance of GT or TG in the region downstream of the PAS motif strongly suggests a true PAS motif. This finding coincides with Hu, whose results show that there are many GU-rich elements downstream of poly(A) sites.

Fig. 2.

Visualization of the importance of different dimers at different positions for models trained with four datasets. The colors denote the contribution of the dimer at that position to determining a true PAS motif: the darker the blue, the greater the contribution; the whiter, the smaller. The x-axis shows the position of the dimer in the sequence, where Position 0 is the first base of the PAS motif. The y-axis lists all possible dimers.

On the other hand, our model also suggested some novel patterns that have not been observed before. Dimer CT is very informative when occurring 20–nt downstream of the PAS, especially for the BL mouse data. In general, the downstream region contains more information than the upstream one, which makes sense since the cleavage site usually resides 10–30 nt downstream of the PAS motif. More importantly, Figure 2 clearly shows that human, BL mouse and SP mouse share many similar poly(A) patterns. The importance of positions can be found in Supplementary Figure S2. We have also visualized the importance of some known motifs, i.e. RNA-binding protein motifs (Supplementary Material S5 and Supplementary Fig. S1).
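To make the idea of a position-wise dimer importance map concrete, the sketch below uses a generic occlusion approach: each dimer is masked in turn and the drop in a model score is recorded. This is an assumption chosen for illustration and is not necessarily the procedure behind Figure 2, which is defined in the Methods; `dimer_importance` and `toy_score` are hypothetical names, and positions here are 0-based offsets in the input string rather than offsets from the PAS motif.

```python
from itertools import product

# All 16 possible dimers over the DNA alphabet (the y-axis of such a map).
DIMERS = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dimer_importance(seq, score):
    """Map (position, dimer) to the score drop when that dimer is masked.

    score is any callable that maps a sequence to a number, e.g. the
    predicted probability of a true PAS from a trained model.
    """
    base = score(seq)
    importance = {}
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        masked = seq[:i] + "NN" + seq[i + 2:]  # occlude the two bases
        importance[(i, dimer)] = base - score(masked)
    return importance


def toy_score(seq):
    """Toy scorer (not a real model): counts GT/TG dimers in the sequence."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] in ("GT", "TG"))


example = "AATAAACCGTGT"  # a PAS hexamer followed by a GT-rich stretch
imp = dimer_importance(example, toy_score)
for (pos, dimer), value in sorted(imp.items()):
    print(pos, dimer, value)
```

Averaging such per-sequence values over many sequences, grouped by position and dimer, would give a matrix with one row per entry of `DIMERS` and one column per position.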

3.5 Interpreting convolutional filters and measuring similarities between human and mouse

Deep learning models are often criticized as ‘black-box’ models that lack interpretability. Here we tried to overcome this limitation by interpreting the learned convolutional filters as cis-elements and visualizing them as sequence logos (Section 2.7). Figure 3 presents the sequence logos for each filter of the four DeeReCT-PolyA models trained with different datasets.

Fig. 3.

Sequence logos of cis-elements identified by each convolutional filter in four DeeReCT-PolyA models trained with different datasets. Thymine (T) is replaced by uracil (U) for the purpose of comparison with previous works on statistics of PASs in mRNA sequences. Many subsequences, such as U/GU-rich elements, UC elements and UGUA, are shown to have great influence on polyadenylation in both human and mouse.

As shown in Figure 3, our model reveals many cis-elements, including some subsequences that have been found and validated before. Evidently, U-rich elements are essential for PAS motif recognition in both human and mouse, which is in line with many previous studies. Other elements, such as the GU-rich elements shown to be of great importance in Figure 2 and in Hu, can also be found in the sequence logos. Specifically, the UGUA element, which is implicated by Hu and Venkataraman in human poly(A) site recognition, is also shown by our model to be an informative cis-element in both human and mouse datasets. Our results also indicate some novel elements that have not been reported before, such as the UC and AAUA elements. The results in Figure 3 clearly demonstrate that there are high similarities among the cis-elements in human, BL mouse and SP mouse.

To quantitatively measure the similarity of patterns extracted from different datasets, we adopted the measurement method proposed in Pape. As a baseline for comparison, we generated two additional sets of sequence logos with two randomly initialized deep neural networks. Similarity scores are reported in Table 9. As expected, the similarity between human and mouse poly(A) cis-elements is much higher than that between the randomly initialized models. In addition, the similarities of cis-elements between the two human datasets and between the two mouse datasets are often higher than those between human and mouse datasets. Interestingly, the similarity between the sequence logos of the Omni human model and those of the BL mouse model is higher than that between the Omni human model and the SP model, which is consistent with the visual similarity between the 2-mer importance distributions of the Omni human and the BL mouse models (Fig. 2).
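A sequence logo is drawn from a position frequency matrix, so a small sketch may help clarify what is being compared. The code below builds such a matrix from a set of equal-length subsequences and computes the per-position information content in bits. The way subsequences are selected for each convolutional filter is described in Section 2.7 and is not reproduced here; the input list, the function names, and the omission of a small-sample correction are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGU"


def position_frequency_matrix(subseqs):
    """Column-wise base frequencies for equal-length subsequences.

    T is shown as U, following the convention used for the logos.
    """
    length = len(subseqs[0])
    pfm = []
    for i in range(length):
        column = [s[i].upper().replace("T", "U") for s in subseqs]
        pfm.append({base: column.count(base) / len(column) for base in ALPHABET})
    return pfm


def information_content(freqs):
    """Bits at one position: 2 minus the Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy


# Hypothetical subsequences that strongly activate one filter.
hits = ["TGTA", "TGTA", "TGTC", "TGTG"]
pfm = position_frequency_matrix(hits)
for position, freqs in enumerate(pfm):
    print(position, freqs, round(information_content(freqs), 3))
```

In a logo, the total letter height at each position is its information content, and each letter's share of that height is its frequency, so a fully conserved position reaches the 2-bit maximum.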

Table 9.

Similarity of sequence logos generated for different models

Models	Dragon-vs-Omni	Dragon-vs-SP	Dragon-vs-BL	Omni-vs-SP	Omni-vs-BL	SP-vs-BL	random1-vs-random2
Similarity (×10^–3)	1.15	1.03	1.02	1.07	1.24	1.40	0.36

Note: random1 and random2 denote two randomly initialized CNNs.
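As a reading aid for Table 9, the short sketch below expresses each similarity score as a multiple of the random-vs-random baseline and picks out the most similar pair. The values are copied from the table; the variable names are illustrative only.

```python
# Similarity scores from Table 9 (in units of 10^-3).
similarity = {
    "Dragon-vs-Omni": 1.15,
    "Dragon-vs-SP": 1.03,
    "Dragon-vs-BL": 1.02,
    "Omni-vs-SP": 1.07,
    "Omni-vs-BL": 1.24,
    "SP-vs-BL": 1.40,
}
baseline = 0.36  # random1-vs-random2

# Fold increase of each trained-model pair over the random baseline.
fold = {pair: round(value / baseline, 2) for pair, value in similarity.items()}
most_similar = max(similarity, key=similarity.get)

for pair, ratio in fold.items():
    print(f"{pair}: {ratio}x baseline")
print("most similar pair:", most_similar)
```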


4 Discussions and conclusions

Here we highlight the main differences between our method and a recent work (Leung) that also makes use of deep neural networks for human poly(A) code inference. The problem they tried to solve is also a binary classification problem, but a different one, i.e. to determine which of two true poly(A) sites has the higher usage. Additional supervision is required for training their model, as it is trained on sequences containing true poly(A) sites together with the strength of each site, while our model is trained with the sequences only. They also carried out an experiment on poly(A) site discovery, but the discovery task is in fact a much simpler problem than the PAS recognition problem studied in this paper, because in the former, to identify a false sequence, one often only needs to check whether there is a PAS hexamer (e.g. AATAAA) upstream of the center of the sequence. In addition, their study focused on human data only, whereas we propose a robust PAS recognition method across different motifs and different species.

In this study, we proposed a deep learning model, DeeReCT-PolyA, along with a novel filter-GN method for automatic feature extraction and PAS recognition that is applicable to multiple species. The proposed method is generic and PAS variant agnostic, and it outperforms the state-of-the-art variant-specific methods on two standard human benchmarks. We obtained two additional poly(A) datasets for BL and SP mouse and showed that DeeReCT-PolyA can consistently achieve high accuracy across species. Furthermore, our results demonstrate that transfer learning can further improve the PAS recognition accuracy on each individual dataset. In particular, by transferring a pre-trained model, DeeReCT-PolyA can outperform the previous state-of-the-art method while using much less training data in the target dataset, which also provides a way to address the problem of insufficient data in many species.

Visualization methods applied to DeeReCT-PolyA revealed some interesting features, including several novel cis-elements. We visualized the convolutional filters as sequence logos and quantitatively measured the similarity of the trained models to show that human, BL mouse and SP mouse share a number of similar features in polyadenylation.

28 in total

1. Detection of polyadenylation signals in human DNA sequences.

Authors: J E Tabaska; M Q Zhang
Journal: Gene Date: 1999-04-29 Impact factor: 3.688

2. Analysis of a noncanonical poly(A) site reveals a tripartite mechanism for vertebrate poly(A) site recognition.

Authors: Krishnan Venkataraman; Kirk M Brown; Gregory M Gilmartin
Journal: Genes Dev Date: 2005-06-01 Impact factor: 11.361

3. Natural similarity measures between position frequency matrices with an application to clustering.

Authors: Utz J Pape; Sven Rahmann; Martin Vingron
Journal: Bioinformatics Date: 2008-01-02 Impact factor: 6.937

4. Recognition of 3'-processing sites of human mRNA precursors.

Authors: A A Salamov; V V Solovyev
Journal: Comput Appl Biosci Date: 1997-02

5. Stability of BAT26 in tumours of hereditary nonpolyposis colorectal cancer patients with MSH2 intragenic deletion.

Authors: Chiara Pastrello; Silvana Baglioni; Maria Grazia Tibiletti; Laura Papi; Mara Fornasarig; Alberto Morabito; Marco Agostini; Maurizio Genuardi; Alessandra Viel
Journal: Eur J Hum Genet Date: 2006-01 Impact factor: 4.246

Review 6. Downstream elements of mammalian pre-mRNA polyadenylation signals: primary, secondary and higher-order structures.

Authors: Margarita I Zarudnaya; Iryna M Kolomiets; Andriy L Potyahaylo; Dmytro M Hovorun
Journal: Nucleic Acids Res Date: 2003-03-01 Impact factor: 16.971

7. POLYAR, a new computer program for prediction of poly(A) sites in human sequences.

Authors: Malik Nadeem Akhtar; Syed Abbas Bukhari; Zeeshan Fazal; Raheel Qamar; Ilham A Shahmuradov
Journal: BMC Genomics Date: 2010-11-19 Impact factor: 3.969

8. A large-scale analysis of mRNA polyadenylation of human and mouse genes.

Authors: Bin Tian; Jun Hu; Haibo Zhang; Carol S Lutz
Journal: Nucleic Acids Res Date: 2005-01-12 Impact factor: 16.971

9. DEEPre: sequence-based enzyme EC number prediction by deep learning.

Authors: Yu Li; Sheng Wang; Ramzan Umarov; Bingqing Xie; Ming Fan; Lihua Li; Xin Gao
Journal: Bioinformatics Date: 2018-03-01 Impact factor: 6.937

10. Poly(A) motif prediction using spectral latent features from human DNA sequences.

Authors: Bo Xie; Boris R Jankovic; Vladimir B Bajic; Le Song; Xin Gao
Journal: Bioinformatics Date: 2013-07-01 Impact factor: 6.937
