Zhihao Xia, Yu Li, Bin Zhang, Zhongxiao Li, Yuhui Hu, Wei Chen, Xin Gao.
Abstract
MOTIVATION: Polyadenylation is a critical step in the regulation of gene expression during mRNA maturation. An accurate and robust method for identifying poly(A) signals (PASs) is desirable not only for better annotation of transcript ends, but also for deeper insight into the underlying regulatory mechanism. Although many methods have been proposed for PAS recognition, most of them are PAS motif-specific and human-specific, which leads to a high risk of overfitting, low generalization power, and an inability to reveal connections between the underlying mechanisms of different mammals.
Year: 2019 PMID: 30500881 PMCID: PMC6612895 DOI: 10.1093/bioinformatics/bty991
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The architecture of the proposed DeeReCT-PolyA network. The output feature channels (shown as a column) of the conv layer are divided into groups (green arrows), and each group is jointly normalized by the group normalization layer. After the tunable parameters are learned from the data, two visualization methods (shown as dashed lines in green and gray) are applied to the model without normalization to extract cis-elements and variants for the regulation of polyadenylation. (Color version of this figure is available at Bioinformatics online.)
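The grouping step described in the caption can be illustrated with a minimal pure-Python sketch of group normalization. This is not the authors' implementation: the function name `group_norm`, the toy feature values, and the choice of consecutive channel groups are assumptions made for illustration, and the learnable scale and shift that a trainable layer would normally carry are omitted.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize conv feature channels group by group.

    channels: list of C channels, each a list of L activations.
    The channels are split into num_groups consecutive groups, and every
    value in a group is normalized with that group's shared mean and variance.
    """
    num_channels = len(channels)
    if num_channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    per_group = num_channels // num_groups

    normalized = []
    for start in range(0, num_channels, per_group):
        group = channels[start:start + per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.extend(
            [[(v - mean) * scale for v in channel] for channel in group]
        )
    return normalized


# Toy input (made-up values): 4 channels of length 3, split into 2 groups.
features = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [10.0, 10.0, 40.0],
    [0.0, 0.0, 0.0],
]
out = group_norm(features, num_groups=2)
```

After the call, the values in each group have approximately zero mean and unit variance, regardless of how differently the two groups were scaled on input.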
Error rate comparison between RF, HSVM, Omni-PolyA and our model (DeeReCT-PolyA) on the Dragon human poly(A) data
| Variants | Size | RF (%) | HSVM (%) | Omni-PolyA (%) | DeeReCT-PolyA (%) | Rel |
|---|---|---|---|---|---|---|
|  | 5190 | 20.06 | 18.59 | 14.02 |  | 2.21 |
|  | 2400 | 18.42 | 16.21 | 12.50 |  | 3.50 |
|  | 1250 | 16.64 | 9.36 | 10.80 |  | 3.59 |
|  | 1230 | 11.06 | 5.45 |  | 7.76 | −2.89 |
|  | 880 | 19.55 | 15.34 | 13.52 |  | 5.83 |
|  | 780 | 19.36 | 11.15 | 13.85 |  | 0.70 |
|  | 690 | 27.83 | 16.96 | 14.49 |  | 4.94 |
|  | 670 | 22.09 | 14.33 | 13.13 |  | 2.41 |
|  | 460 | 20.00 | 9.57 | 8.48 |  | 0.44 |
|  | 410 | 18.54 | 9.27 | 13.41 |  | 0.25 |
|  | 410 | 24.88 | 12.68 | 14.39 |  | 3.90 |
|  | 370 | 18.38 | 5.14 | 11.62 |  | 0.55 |
| Average | – | 19.19 | 14.42 | 12.43 |  | 2.86 |
Note: Rel denotes the improvement of DeeReCT-PolyA with respect to the best of the other three methods. Bold indicates the error rate of the best model for each PAS motif variant. Average is the weighted average of all motif variants with the size as weights. While results of all three previous methods are reported for 12 variant-specific models, the results of DeeReCT-PolyA are the performance of one single generic model that deals with all 12 variants.
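The two quantities defined in the note can be checked with a short sketch. It assumes that Rel is the absolute difference, in percentage points, between the lowest error rate of the other three methods and the DeeReCT-PolyA error rate (which is consistent with Rel being negative where another method is best). The helper names are hypothetical. For the Rel check, the DeeReCT-PolyA error rate of the size-1250 variant is taken to be the 5.77 listed under 5-fold cross-validation in the leave-one-motif-out table, which is also an assumption.

```python
def weighted_average(values, weights):
    """Average of `values`, each weighted by the matching entry of `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def rel_improvement(other_errors, our_error):
    """Lowest competing error rate minus ours, in percentage points (assumed)."""
    return min(other_errors) - our_error


# Sizes and RF error rates (%) of the 12 variants in the Dragon table.
sizes = [5190, 2400, 1250, 1230, 880, 780, 690, 670, 460, 410, 410, 370]
rf_errors = [20.06, 18.42, 16.64, 11.06, 19.55, 19.36,
             27.83, 22.09, 20.00, 18.54, 24.88, 18.38]

rf_average = round(weighted_average(rf_errors, sizes), 2)
print(rf_average)  # 19.19, the RF entry of the Average row

# Size-1250 variant: RF 16.64, HSVM 9.36, Omni-PolyA 10.80; 5.77 is assumed
# to be the DeeReCT-PolyA error rate (see the lead-in).
rel_1250 = round(rel_improvement([16.64, 9.36, 10.80], 5.77), 2)
print(rel_1250)  # 3.59, the Rel entry of that row
```

Both printed values agree with the table, which supports reading Average as a size-weighted mean and Rel as a difference in percentage points.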
Error rate comparison between RF, HSVM, Omni-PolyA and our model (DeeReCT-PolyA) on the Omni human poly(A) data
| Variants | Size | RF (%) | HMM (%) | Omni-PolyA (%) | DeeReCT-PolyA (%) | Rel |
|---|---|---|---|---|---|---|
|  | 24310 | 25.49 | 27.91 | 23.96 |  | 1.97 |
|  | 7098 | 25.59 | 33.48 | 24.20 |  | 1.09 |
|  | 1640 | 26.52 | 36.83 |  | 27.76 | −1.90 |
|  | 1306 | 26.67 | 34.77 |  | 26.80 | −3.73 |
|  | 682 | 30.88 | 38.38 | 26.91 |  | 3.31 |
|  | 634 | 24.41 | 36.98 | 22.06 |  | 0.06 |
|  | 528 | 28.11 | 37.31 | 23.26 |  | 3.05 |
|  | 368 | 32.97 | 33.89 |  | 25.79 | −1.07 |
|  | 342 | 31.18 | 41.76 | 29.41 |  | 7.26 |
|  | 314 | 28.89 | 39.03 |  | 25.54 | −1.03 |
|  | 250 | 31.60 | 36.00 | 26.80 |  | 8.98 |
|  | 100 | 34.00 | 40.00 | 23.00 |  | 3.00 |
| Average | – | 25.93 | 30.43 | 24.15 |  | 1.51 |
Note: Rel denotes the improvement of DeeReCT-PolyA with respect to the best of the other three methods. Bold indicates the error rate of the best model for each PAS motif variant. Average is the weighted average of all motif variants with the size as weights.
Error rate of DeeReCT-PolyA on SP and BL mouse poly(A) data
| Variants | SP size | SP error rate (%) | BL size | BL error rate (%) |
|---|---|---|---|---|
|  | 17 708 | 26.50 | 20 250 | 25.48 |
|  | 7550 | 25.30 | 9056 | 24.89 |
|  | 2336 | 19.95 | 2688 | 18.19 |
|  | 2178 | 22.91 | 2518 | 22.44 |
|  | 2224 | 22.88 | 2376 | 21.63 |
|  | 1432 | 20.53 | 1760 | 19.77 |
|  | 1334 | 23.55 | 1528 | 23.23 |
|  | 1210 | 21.40 | 1326 | 22.55 |
|  | 1032 | 17.84 | 1176 | 18.54 |
|  | 1022 | 15.07 | 1126 | 15.81 |
|  | 982 | 18.84 | 1108 | 18.86 |
|  | 728 | 19.37 | 776 | 20.24 |
|  | 494 | 18.64 | 536 | 21.24 |
| Average | – | 24.11 | – | 23.49 |
DeeReCT-PolyA with leave-one-motif-out test on the Dragon human dataset
| Variants | Size | 5-fold cross-validation error rate (%) | Leave-one-motif-out error rate (%) |
|---|---|---|---|
|  | 5190 |  | 14.20 |
|  | 2400 | 9.00 |  |
|  | 1250 | 5.77 |  |
|  | 1230 | 7.76 |  |
|  | 880 | 7.69 |  |
|  | 780 | 10.45 |  |
|  | 690 | 9.55 |  |
|  | 670 |  | 11.16 |
|  | 460 |  |  |
|  | 410 |  | 11.46 |
|  | 410 | 8.78 |  |
|  | 370 | 4.59 |  |
| Average | – |  | 10.08 |
Note: For the leave-one-motif-out test, for each PAS variant, a DeeReCT-PolyA model was trained with data of all the other motif variants and then tested only on this variant. Bold indicates the error rate of the better model for each PAS variant.
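The leave-one-motif-out protocol in the note can be sketched as a split generator. This is a minimal illustration, not the authors' code: the function name `leave_one_motif_out`, the `(motif, sequence, label)` record layout, and the toy records (with placeholder sequences and illustrative motif names) are all assumptions.

```python
def leave_one_motif_out(samples):
    """Yield (held_out_motif, train, test) once per motif variant.

    samples: list of (motif, sequence, label) tuples.
    The training set holds every sample of the other motifs; the test set
    holds only the samples of the held-out motif.
    """
    motifs = sorted({motif for motif, _, _ in samples})
    for held_out in motifs:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Toy records: motif names are illustrative, sequences are placeholders.
samples = [
    ("AATAAA", "seq1", 1), ("AATAAA", "seq2", 0),
    ("ATTAAA", "seq3", 1),
    ("AGTAAA", "seq4", 0), ("AGTAAA", "seq5", 1),
]
splits = list(leave_one_motif_out(samples))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Each split would train one model, so a dataset with 12 variants yields 12 models, each evaluated on a motif it never saw during training.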
Evaluation of transferred DeeReCT-PolyA models on SP mouse poly(A) data before and after fine-tuning
| Average error rate (%), pre-trained on | None | Omni | BL |
|---|---|---|---|
| Before fine-tuning | – | 30.23 | 23.67 |
| After fine-tuning | 24.11 | 24.04 |  |
Note: None denotes a model trained on SP mouse data without any pre-training. Models pre-trained on the Omni and BL datasets, respectively, are evaluated on the SP mouse dataset before and after fine-tuning with SP data. The average error rate over all PAS motif variants is reported.
Evaluation of transferred DeeReCT-PolyA models on BL mouse poly(A) data before and after fine-tuning
| Average error rate (%), pre-trained on | None | Omni | SP |
|---|---|---|---|
| Before fine-tuning | – | 29.75 | 23.13 |
| After fine-tuning | 23.49 | 23.38 |  |
Evaluation of transferred DeeReCT-PolyA models on Omni human poly(A) data before and after fine-tuning
| Average error rate (%), pre-trained on | None | SP | BL |
|---|---|---|---|
| Before fine-tuning | – | 29.58 | 29.07 |
| After fine-tuning | 22.64 |  | 22.44 |
Transfer learning for insufficient amount of sequences in the rat poly(A) dataset
| Pre-trained on | None | Dragon | Omni | SP | BL |
|---|---|---|---|---|---|
| n = 0 | – | 40.55 | 29.30 | 22.11 | 22.40 |
|  | 50.00±0.00 | 39.32±1.84 | 28.94±0.31 | 22.65±0.74 | 22.27±0.12 |
|  | 48.90±1.47 | 29.72±3.79 | 25.61±0.36 | 22.63±0.37 | 22.22±0.22 |
|  | 49.71±0.76 | 26.44±0.68 | 24.77±0.22 | 22.10±0.16 | 22.03±0.18 |
|  | 49.06±1.40 | 25.26±0.54 | 24.35±0.23 | 22.04±0.20 | 22.43±0.33 |
|  | 26.88±8.74 | 24.25±0.22 | 23.65±0.16 | 21.91±0.21 | 21.90±0.25 |
|  | 22.63±0.16 | 23.13±0.36 | 22.68±0.08 | 21.67±0.34 | 21.40±0.18 |
| n = 42 233 | 20.23 | 20.48 | 20.38 | 19.82 | 19.99 |
Note: n denotes the number of rat sequences used for fine-tuning. For every n except 0 and 42 233 (the total size of the rat training data), n sequences are randomly sampled from the rat training dataset and used to fine-tune the pre-trained model. This step is repeated 10 times for every n. The table shows the average error rate of these 10 repeats on the rat test set, together with the standard deviation. None indicates a model without any pre-training.
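The repeated-sampling protocol in the note can be sketched as follows. This is only an outline under stated assumptions: sampling is taken to be without replacement, the spread is computed as the sample standard deviation (the note does not say which estimator was used), and `fine_tune_and_eval` is a hypothetical callback standing in for the real fine-tuning and test-set evaluation. The dummy callback below returns made-up numbers, not results.

```python
import random
import statistics


def repeated_subsample_eval(train_pool, n, fine_tune_and_eval, repeats=10, seed=0):
    """Fine-tune on n randomly sampled sequences, `repeats` times.

    Returns (mean error rate, sample standard deviation) over the repeats.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        subset = rng.sample(train_pool, n)  # assumed: without replacement
        errors.append(fine_tune_and_eval(subset))
    return statistics.mean(errors), statistics.stdev(errors)


def dummy_fine_tune_and_eval(subset):
    """Stand-in for real fine-tuning; returns a made-up error rate in %."""
    return 25.0 + (sum(subset) % 7) / 10.0


# Integers stand in for training sequences in this toy pool.
pool = list(range(1000))
mean_err, std_err = repeated_subsample_eval(pool, 100, dummy_fine_tune_and_eval)
print(f"{mean_err:.2f}±{std_err:.2f}")
```

Fixing the seed makes the ten subsets reproducible, so different pre-trained models can be compared on identical fine-tuning samples.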
Fig. 2. Visualization of the importance of different dimers at different positions for models trained with four datasets. The colors denote the contribution of the dimer at that position to determining a true PAS motif: the darker the blue, the larger the contribution; the closer to white, the smaller the contribution. The x-axis shows the position of the dimer in the sequence, where Position 0 is the first base of the PAS motif. The y-axis lists all possible dimers.
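The axes of such a heat map can be set up with a small sketch. It only enumerates the 16 dimers and indexes them relative to the motif start; it does not compute the importance scores, whose method is not described here. The helper name `dimer_positions` and the toy sequence are assumptions for illustration.

```python
from itertools import product

# The 16 possible dimers over the DNA alphabet (the rows of the heat map).
DIMERS = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dimer_positions(sequence, motif_start):
    """Map each relative position to the dimer starting there.

    Position 0 is the first base of the PAS motif, so upstream dimers get
    negative positions and downstream dimers get positive ones.
    """
    return {
        i - motif_start: sequence[i:i + 2]
        for i in range(len(sequence) - 1)
    }


# Toy sequence (made up); the motif AATAAA starts at index 3.
seq = "GCTAATAAAGT"
positions = dimer_positions(seq, seq.index("AATAAA"))
print(positions[0], positions[-3], positions[6])
```

A per-dimer, per-position score from any attribution method could then be stored in a 16-row grid indexed by `DIMERS` and these relative positions.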
Fig. 3. Sequence logos of cis-elements identified by each convolutional filter in four DeeReCT-PolyA models trained with different datasets. Thymine (T) is replaced by uracil (U) for the purpose of comparison with previous works on statistics of PASs in mRNA sequences. Many subsequences, such as U/GU-rich elements, UC elements and UGUA, are shown to strongly influence polyadenylation in both human and mouse.
Similarity of sequence logos generated for different models
| Models | Dragon-vs-Omni | Dragon-vs-SP | Dragon-vs-BL | Omni-vs-SP | Omni-vs-BL | SP-vs-BL | random1-vs-random2 |
|---|---|---|---|---|---|---|---|
| Similarity (×10⁻³) | 1.15 | 1.03 | 1.02 | 1.07 | 1.24 | 1.40 | 0.36 |
Note: random1 and random2 denote two randomly initialized CNNs.