Yu Bao1, Morihiro Hayashida2, Tatsuya Akutsu2.
Abstract
BACKGROUND: Dicer is necessary for the formation of mature microRNA (miRNA) because the Dicer enzyme cleaves pre-miRNA at the correct position to generate miRNA with correct seed regions. Nonetheless, the mechanism underlying the selection of a Dicer cleavage site is still not fully understood. Several studies have addressed this problem; for example, a recent discovery indicates that the loop/bulge structure plays a central role in the selection of Dicer cleavage sites. Building on this finding, a support vector machine (SVM)-based method called PHDCleav was developed to predict Dicer cleavage sites, and it outperforms other methods based on random forest and naive Bayes. PHDCleav, however, tests only whether a position in the shift window belongs to a loop/bulge structure.

RESULT: In this paper, we used the length of loop/bulge structures (in addition to their presence or absence) to develop an improved method, LBSizeCleav, for predicting Dicer cleavage sites. To evaluate our method, we used 810 empirically validated sequences of human pre-miRNAs and performed fivefold cross-validation. In both the 5p and 3p arms of pre-miRNAs, LBSizeCleav showed greater prediction accuracy than PHDCleav. This result suggests that the length of loop/bulge structures is useful for predicting Dicer cleavage sites.
Keywords: Dicer cleavage site; Loop/bulge length; Support vector machine
Year: 2016 PMID: 27887571 PMCID: PMC5124314 DOI: 10.1186/s12859-016-1353-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Binary patterns for nucleotides, A, U, C, G, and a loop/bulge structure, denoted by L, in PHDCleav [14] and LBSizeCleav with k ones based on sequences and predicted secondary structures
| Mapping | Symbol | Sequence | Structure |
|---|---|---|---|
| PHDCleav | A | [1,0,0,0] | [1,0,0,0] |
| | U | [0,1,0,0] | [0,1,0,0] |
| | C | [0,0,1,0] | [0,0,1,0] |
| | G | [0,0,0,1] | [0,0,0,1] |
| | L | − | [0,0,0,0] |
| Extended PHDCleav | A | − | [1,0,0,0,0] |
| | U | − | [0,1,0,0,0] |
| | C | − | [0,0,1,0,0] |
| | G | − | [0,0,0,1,0] |
| | L | − | [0,0,0,0,1] |
| LBSizeCleav | A | − | [1,0,0,0,0, … 0] |
| | U | − | [0,1,0,0,0, … 0] |
| | C | − | [0,0,1,0,0, … 0] |
| | G | − | [0,0,0,1,0, … 0] |
| | L | − | |
In the PHDCleav binary patterns, each nucleotide is represented by a 4-dimensional vector, and in the extended PHDCleav patterns by a 5-dimensional vector. In LBSizeCleav, the vector has dimension 3+k+N, where N denotes the maximum loop/bulge length among all the pre-miRNAs in the training dataset
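The three mappings in the table can be sketched in a few lines of Python. The PHDCleav and extended PHDCleav patterns follow the table directly. The LBSizeCleav pattern for L is not recoverable from the table, so the placement of the k ones below (k consecutive ones starting at the slot indexed by the loop/bulge length) is an assumption chosen only because it fits the stated dimension 3+k+N. The function names and the `loop_len` parameter are hypothetical.

```python
NUCLEOTIDES = "AUCG"


def phdcleav_vector(symbol):
    """4-dim pattern: one-hot for A, U, C, G; all zeros for a loop/bulge 'L'."""
    vec = [0] * 4
    if symbol != "L":
        vec[NUCLEOTIDES.index(symbol)] = 1
    return vec


def extended_phdcleav_vector(symbol):
    """5-dim pattern: one-hot over A, U, C, G, L."""
    vec = [0] * 5
    vec["AUCGL".index(symbol)] = 1
    return vec


def lbsizecleav_vector(symbol, k, n_max, loop_len=None):
    """(3 + k + n_max)-dim pattern.

    Nucleotides use the first four slots, as in the table.
    ASSUMPTION: a loop/bulge position of length `loop_len` sets k
    consecutive ones starting at 0-based index 3 + loop_len.
    """
    vec = [0] * (3 + k + n_max)
    if symbol == "L":
        if loop_len is None or not 1 <= loop_len <= n_max:
            raise ValueError("loop_len must be in 1..n_max for symbol 'L'")
        start = 3 + loop_len
        for i in range(start, start + k):
            vec[i] = 1
    else:
        vec[NUCLEOTIDES.index(symbol)] = 1
    return vec
```

Under this assumption, loop/bulge positions of similar length share some of their ones when k > 1, which is one plausible reason to use k ones rather than a single one.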
Fig. 1 Illustration of the feature space mapping of LBSizeCleav. CD-5p and CD-3p denote cleavage sites in the 5p and 3p arms, respectively. For two sites of CD-5p and six nucleotides far from CD-3p, the feature vectors of LBSizeCleav with k=3 and w=6 are shown. The red rectangles represent the window of the positive pattern of CD-5p and the window of the negative pattern of CD-3p
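As a rough illustration of the windowing in Fig. 1, the sketch below cuts a window of w symbols around a site and concatenates the per-position vectors into one feature vector. Several details are assumptions rather than the authors' procedure: the window is taken as w/2 symbols on each side of the site, the negative window is placed six positions downstream, the example string is made up, and a plain 5-symbol one-hot encoding stands in for the LBSizeCleav pattern.

```python
def one_hot(symbol, alphabet="AUCGL"):
    """One-hot vector over A, U, C, G and the loop/bulge symbol L."""
    vec = [0] * len(alphabet)
    vec[alphabet.index(symbol)] = 1
    return vec


def window(symbols, site, w):
    """Return the w symbols around `site` (w // 2 on each side).

    ASSUMPTION: `site` is the 0-based index of the first symbol after
    the cut, and w is even, as with the window sizes 8 to 14 used here.
    """
    half = w // 2
    start, end = site - half, site + half
    if start < 0 or end > len(symbols):
        raise ValueError("window falls outside the sequence")
    return symbols[start:end]


def window_features(symbols, site, w, encode=one_hot):
    """Concatenate the encoded vectors of every position in the window."""
    features = []
    for symbol in window(symbols, site, w):
        features.extend(encode(symbol))
    return features


# Hypothetical structure string: nucleotides, with loop/bulge positions as L.
structure = "AUCGLLAUCGAUCGLLAUCG"
cleavage_site = 5

positive = window_features(structure, cleavage_site, w=6)      # at the site
negative = window_features(structure, cleavage_site + 6, w=6)  # 6 nt away
```

Each window of size w then yields a vector of length w times the per-position dimension, which is what a classifier such as an SVM would take as input.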
Results on average specificity, sensitivity, accuracy, and MCC for both 5p and 3p arms by five-fold cross-validation using PHDCleav and LBSizeCleav (k=1,⋯,5) with window sizes 8,10,12,14 based on sequences and secondary structures predicted by quikfold server
| Method | Window size | 5p Sn | 5p Sp | 5p Ac | 5p MCC | 3p Sn | 3p Sp | 3p Ac | 3p MCC |
|---|---|---|---|---|---|---|---|---|---|
| PHDCleav (sequence) | 8 | 0.602 | 0.503 | 0.552 | 0.105 | 0.662 | 0.625 | 0.644 | 0.287 |
| | 10 | 0.541 | 0.573 | 0.557 | 0.115 | 0.661 | 0.642 | 0.652 | 0.303 |
| | 12 | 0.560 | 0.555 | 0.557 | 0.115 | 0.660 | 0.656 | 0.658 | 0.316 |
| | 14 | 0.539 | 0.572 | 0.555 | 0.111 | 0.654 | 0.702 | 0.678 | 0.356 |
| PHDCleav (structure) | 8 | 0.753 | 0.814 | 0.784 | 0.568 | 0.670 | 0.661 | 0.665 | 0.330 |
| | 10 | 0.784 | 0.827 | 0.806 | 0.612 | 0.702 | 0.719 | 0.710 | 0.421 |
| | 12 | 0.790 | 0.842 | 0.816 | 0.633 | 0.739 | 0.764 | 0.752 | 0.503 |
| | 14 | 0.799 | 0.857 | 0.828 | 0.657 | 0.779 | 0.783 | 0.781 | 0.562 |
| Extended PHDCleav | 8 | 0.750 | 0.798 | 0.774 | 0.548 | 0.652 | 0.716 | 0.684 | 0.369 |
| | 10 | 0.779 | 0.827 | 0.803 | 0.607 | 0.674 | 0.783 | 0.729 | 0.460 |
| | 12 | 0.809 | 0.845 | 0.827 | 0.654 | 0.714 | 0.790 | 0.752 | 0.506 |
| | 14 | 0.813 | 0.868 | 0.840 | 0.682 | | 0.801 | 0.791 | 0.582 |
| LBSizeCleav ( | 8 | 0.668 | 0.924 | 0.796 | 0.612 | 0.630 | 0.684 | 0.657 | 0.315 |
| | 10 | 0.709 | 0.947 | 0.828 | 0.675 | 0.651 | 0.776 | 0.713 | 0.430 |
| | 12 | 0.774 | 0.945 | 0.859 | 0.730 | 0.686 | 0.847 | 0.766 | 0.540 |
| | 14 | 0.808 | 0.933 | 0.871 | 0.747 | 0.758 | 0.874 | 0.816 | 0.637 |
| LBSizeCleav ( | 8 | 0.662 | | 0.808 | 0.645 | 0.626 | 0.723 | 0.674 | 0.351 |
| | 10 | 0.725 | 0.946 | 0.835 | 0.688 | 0.642 | 0.806 | 0.724 | 0.455 |
| | 12 | 0.784 | 0.938 | 0.861 | 0.731 | 0.665 | 0.882 | 0.773 | 0.560 |
| | 14 | 0.820 | 0.925 | 0.872 | | 0.734 | 0.916 | 0.825 | 0.661 |
| LBSizeCleav ( | 8 | 0.692 | 0.949 | 0.821 | 0.664 | 0.619 | 0.735 | 0.677 | 0.356 |
| | 10 | 0.752 | 0.941 | 0.846 | 0.706 | 0.618 | 0.822 | 0.720 | 0.450 |
| | 12 | 0.803 | 0.932 | 0.867 | 0.741 | 0.635 | 0.914 | 0.774 | 0.571 |
| | 14 | 0.825 | 0.912 | 0.869 | 0.740 | 0.719 | | | |
| LBSizeCleav ( | 8 | 0.695 | 0.949 | 0.822 | 0.667 | 0.614 | 0.736 | 0.675 | 0.353 |
| | 10 | 0.767 | 0.938 | 0.853 | 0.716 | 0.621 | 0.835 | 0.728 | 0.467 |
| | 12 | 0.815 | 0.927 | 0.871 | 0.747 | 0.639 | 0.912 | 0.776 | 0.573 |
| | 14 | 0.835 | 0.909 | 0.872 | 0.746 | 0.723 | 0.924 | 0.823 | 0.660 |
| LBSizeCleav ( | 8 | 0.700 | 0.947 | 0.824 | 0.668 | 0.594 | 0.771 | 0.682 | 0.371 |
| | 10 | 0.772 | 0.936 | 0.854 | 0.717 | 0.578 | 0.862 | 0.720 | 0.459 |
| | 12 | 0.821 | 0.924 | 0.872 | | 0.634 | 0.921 | 0.777 | 0.579 |
| | 14 | | 0.909 | | | 0.724 | 0.932 | 0.828 | 0.671 |
Sn, Sp, Ac, and MCC denote sensitivity, specificity, accuracy, and Matthews correlation coefficient, respectively
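For reference, the four measures reported in these tables can be computed from the counts of a binary confusion matrix using their standard definitions. The sketch below is a minimal illustration; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def metrics(tp, tn, fp, fn):
    """Return (Sn, Sp, Ac, MCC) from confusion-matrix counts.

    Assumes at least one positive and one negative example, so the
    sensitivity and specificity denominators are non-zero.
    """
    sn = tp / (tp + fn)                    # sensitivity (true positive rate)
    sp = tn / (tn + fp)                    # specificity (true negative rate)
    ac = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC as 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, ac, mcc


# Hypothetical counts, for illustration only.
sn, sp, ac, mcc = metrics(tp=80, tn=90, fp=10, fn=20)
```

When the positive and negative sets have the same size, accuracy is simply the mean of sensitivity and specificity.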
Results on average specificity, sensitivity, accuracy, and MCC for both 5p and 3p arms by five-fold cross-validation using PHDCleav and LBSizeCleav (k=1,⋯,5) with window sizes 8,10,12,14 based on secondary structures predicted by RNAFold
| Method | Window size | 5p Sn | 5p Sp | 5p Ac | 5p MCC | 3p Sn | 3p Sp | 3p Ac | 3p MCC |
|---|---|---|---|---|---|---|---|---|---|
| Extended PHDCleav | 8 | 0.746 | 0.744 | 0.745 | 0.490 | 0.772 | 0.750 | 0.761 | 0.522 |
| | 10 | 0.792 | 0.783 | 0.787 | 0.575 | 0.779 | 0.800 | 0.790 | 0.580 |
| | 12 | 0.798 | 0.799 | 0.798 | 0.597 | 0.785 | 0.830 | 0.808 | 0.616 |
| | 14 | 0.778 | 0.813 | 0.795 | 0.591 | 0.805 | | 0.829 | 0.659 |
| LBSizeCleav ( | 8 | 0.739 | 0.805 | 0.772 | 0.545 | 0.785 | 0.790 | 0.788 | 0.576 |
| | 10 | 0.798 | 0.820 | 0.809 | 0.618 | 0.795 | 0.815 | 0.805 | 0.610 |
| | 12 | 0.792 | 0.815 | 0.803 | 0.607 | 0.822 | 0.840 | 0.831 | 0.662 |
| | 14 | 0.815 | | | | 0.851 | 0.852 | | |
| LBSizeCleav ( | 8 | 0.753 | 0.788 | 0.771 | 0.542 | 0.792 | 0.788 | 0.790 | 0.580 |
| | 10 | 0.816 | 0.795 | 0.806 | 0.612 | 0.811 | 0.794 | 0.803 | 0.606 |
| | 12 | 0.836 | 0.784 | 0.810 | 0.621 | 0.814 | 0.803 | 0.808 | 0.617 |
| | 14 | 0.845 | 0.769 | 0.807 | 0.616 | 0.867 | 0.800 | 0.834 | 0.669 |
| LBSizeCleav ( | 8 | 0.751 | 0.794 | 0.773 | 0.546 | 0.784 | 0.797 | 0.790 | 0.581 |
| | 10 | 0.808 | 0.808 | 0.808 | 0.615 | 0.795 | 0.813 | 0.804 | 0.608 |
| | 12 | 0.822 | 0.800 | 0.811 | 0.623 | 0.808 | 0.835 | 0.821 | 0.643 |
| | 14 | 0.816 | 0.803 | 0.809 | 0.619 | 0.853 | 0.838 | 0.846 | 0.692 |
| LBSizeCleav ( | 8 | 0.764 | 0.772 | 0.768 | 0.536 | 0.809 | 0.772 | 0.790 | 0.581 |
| | 10 | 0.824 | 0.762 | 0.793 | 0.587 | 0.824 | 0.766 | 0.795 | 0.590 |
| | 12 | 0.841 | 0.737 | 0.789 | 0.581 | 0.842 | 0.756 | 0.799 | 0.600 |
| | 14 | 0.871 | 0.678 | 0.774 | 0.559 | 0.898 | 0.697 | 0.797 | 0.607 |
| LBSizeCleav ( | 8 | 0.782 | 0.747 | 0.764 | 0.529 | 0.822 | 0.744 | 0.783 | 0.568 |
| | 10 | 0.836 | 0.732 | 0.784 | 0.572 | 0.829 | 0.726 | 0.777 | 0.558 |
| | 12 | 0.867 | 0.699 | 0.783 | 0.574 | 0.864 | 0.682 | 0.773 | 0.556 |
| | 14 | | 0.626 | 0.763 | 0.546 | | 0.619 | 0.768 | 0.562 |
Sn, Sp, Ac, and MCC denote sensitivity, specificity, accuracy, and Matthews correlation coefficient, respectively
Variances of specificity, sensitivity, accuracy, and MCC for both 5p and 3p arms by five-fold cross-validation using PHDCleav and LBSizeCleav (k=1,⋯,5) with window sizes 8,10,12,14 based on sequences and secondary structures predicted by quikfold server
| Feature extraction method | Window size | CD-5p Sn | CD-5p Sp | CD-5p Ac | CD-5p MCC | CD-3p Sn | CD-3p Sp | CD-3p Ac | CD-3p MCC |
|---|---|---|---|---|---|---|---|---|---|
| PHDCleav (sequence) | 8 | 0.0137 | 0.0008 | 0.0074 | 0.0036 | 0.0072 | 0.0015 | 0.0094 | 0.0066 |
| | 10 | 0.0184 | 0.0004 | 0.0111 | 0.0018 | 0.0044 | 0.0005 | 0.0072 | 0.0024 |
| | 12 | 0.0208 | 0.0001 | 0.0223 | 0.0003 | 0.0078 | 0.0011 | 0.0031 | 0.0046 |
| | 14 | 0.0293 | 0.0009 | 0.0174 | 0.0037 | 0.0065 | 0.0007 | 0.0048 | 0.0029 |
| PHDCleav (structure) | 8 | 0.0042 | 0.0039 | 0.0067 | 0.0155 | 0.0187 | 0.0013 | 0.0091 | 0.0062 |
| | 10 | 0.0026 | 0.0043 | 0.0088 | 0.0177 | 0.0100 | 0.0014 | 0.0050 | 0.0059 |
| | 12 | 0.0042 | 0.0027 | 0.0034 | 0.0109 | 0.0051 | 0.0011 | 0.0024 | 0.0045 |
| | 14 | 0.0047 | 0.0031 | 0.0034 | 0.0125 | 0.0039 | 0.0012 | 0.0014 | 0.0047 |
| Extended PHDCleav | 8 | 0.0029 | 0.0032 | 0.0063 | 0.0128 | 0.0123 | 0.0025 | 0.0043 | 0.0103 |
| | 10 | 0.0030 | 0.0038 | 0.0061 | 0.0154 | 0.0064 | 0.0019 | 0.0016 | 0.0075 |
| | 12 | 0.0040 | 0.0033 | 0.0050 | 0.0136 | 0.0054 | 0.0015 | 0.0011 | 0.0059 |
| | 14 | 0.0059 | 0.0027 | 0.0016 | 0.0108 | 0.0032 | 0.0013 | 0.0010 | 0.0052 |
| LBSizeCleav ( | 8 | 0.0030 | 0.0025 | 0.0044 | 0.0115 | 0.0100 | 0.0004 | 0.0074 | 0.0019 |
| | 10 | 0.0022 | 0.0015 | 0.0013 | 0.0064 | 0.0078 | 0.0011 | 0.0015 | 0.0042 |
| | 12 | 0.0050 | 0.0024 | 0.0010 | 0.0090 | 0.0077 | 0.0015 | 0.0002 | 0.0051 |
| | 14 | 0.0075 | 0.0035 | 0.0010 | 0.0132 | 0.0036 | 0.0007 | 0.0002 | 0.0026 |
| LBSizeCleav ( | 8 | 0.0042 | 0.0018 | 0.0008 | 0.0066 | 0.0053 | 0.0010 | 0.0036 | 0.0041 |
| | 10 | 0.0038 | 0.0020 | 0.0009 | 0.0076 | 0.0034 | 0.0010 | 0.0010 | 0.0039 |
| | 12 | 0.0051 | 0.0029 | 0.0016 | 0.0115 | 0.0042 | 0.0017 | 0.0008 | 0.0063 |
| | 14 | 0.0051 | 0.0028 | 0.0012 | 0.0107 | 0.0043 | 0.0008 | 0.0002 | 0.0024 |
| LBSizeCleav ( | 8 | 0.0025 | 0.0013 | 0.0008 | 0.0050 | 0.0070 | 0.0010 | 0.0021 | 0.0042 |
| | 10 | 0.0039 | 0.0022 | 0.0012 | 0.0086 | 0.0064 | 0.0013 | 0.0006 | 0.0048 |
| | 12 | 0.0063 | 0.0031 | 0.0012 | 0.0119 | 0.0039 | 0.0015 | 0.0005 | 0.0055 |
| | 14 | 0.0060 | 0.0033 | 0.0015 | 0.0130 | 0.0073 | 0.0016 | 0.0003 | 0.0046 |
| LBSizeCleav ( | 8 | 0.0029 | 0.0016 | 0.0009 | 0.0064 | 0.0066 | 0.0020 | 0.0030 | 0.0078 |
| | 10 | 0.0046 | 0.0025 | 0.0011 | 0.0095 | 0.0071 | 0.0014 | 0.0006 | 0.0049 |
| | 12 | 0.0061 | 0.0032 | 0.0014 | 0.0124 | 0.0033 | 0.0011 | 0.0007 | 0.0041 |
| | 14 | 0.0051 | 0.0030 | 0.0015 | 0.0120 | 0.0088 | 0.0025 | 0.0002 | 0.0082 |
| LBSizeCleav ( | 8 | 0.0029 | 0.0017 | 0.0009 | 0.0066 | 0.0062 | 0.0021 | 0.0031 | 0.0082 |
| | 10 | 0.0055 | 0.0029 | 0.0013 | 0.0113 | 0.0032 | 0.0011 | 0.0005 | 0.0042 |
| | 12 | 0.0051 | 0.0029 | 0.0015 | 0.0114 | 0.0029 | 0.0011 | 0.0009 | 0.0044 |
| | 14 | 0.0047 | 0.0029 | 0.0016 | 0.0116 | 0.0076 | 0.0023 | 0.0002 | 0.0076 |
Sn, Sp, Ac, and MCC denote sensitivity, specificity, accuracy, and Matthews correlation coefficient, respectively
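The averages and variances reported for five-fold cross-validation summarise one value per fold for each measure. A minimal sketch of that summary follows. The per-fold values are made up, and since the source does not say whether the population or the sample variance was used, both are shown.

```python
from statistics import mean, pvariance, variance


def summarize(fold_values):
    """Return (mean, population variance, sample variance) over folds."""
    return mean(fold_values), pvariance(fold_values), variance(fold_values)


# Hypothetical per-fold accuracies from a five-fold cross-validation.
fold_accuracy = [0.86, 0.88, 0.87, 0.85, 0.89]
avg, pop_var, sample_var = summarize(fold_accuracy)
```

With only five folds the two variance definitions differ by a factor of 5/4, so the choice matters when comparing against tabulated values.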
Fig. 2 Results on ROC curves by LBSizeCleav and PHDCleav with window size w=14 for the 5p arm. The ROC curves of LBSizeCleav for k=1 to k=5 are significantly better than those of the binary pattern and extended binary pattern of PHDCleav for both 5p and 3p arms
Fig. 3 Results on ROC curves by LBSizeCleav and PHDCleav with window size w=14 for the 3p arm. The ROC curves of LBSizeCleav for k=1 to k=5 are significantly better than those of the binary pattern and extended binary pattern of PHDCleav for both 5p and 3p arms
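An ROC curve such as those in Figs. 2 and 3 is obtained by sweeping the decision threshold over the classifier scores and recording the false positive rate and true positive rate at each step. The sketch below shows the construction and a trapezoidal area under the curve in plain Python. The scores and labels are invented for illustration and are not the study's data.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the decision threshold is lowered.

    `labels` holds 1 for positive and 0 for negative examples.
    Examples with tied scores are added to the curve together.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical decision scores and true labels.
curve = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
area = auc(curve)
```

A curve that lies closer to the top-left corner, and hence has a larger area, indicates a better trade-off between sensitivity and specificity.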
Fig. 4 Regression analysis examples of LBSizeCleav (k=5) compared with the PHDCleav extended binary pattern
Fig. 5 Accuracy of LBSizeCleav (k=5) compared with the PHDCleav extended binary pattern and SGL for prediction in CD-5p
Fig. 6 Secondary structures of hsa-mir-221, hsa-mir-138-1, and hsa-mir-15a predicted by the quikfold server. The black arrows indicate the cleavage sites validated by biological experiments
Number of patterns predicted only by LBSizeCleav(k=1,4)/PHDCleav(extended binary) using secondary structure predicted by quikfold
| Pattern | Prediction | 5’-arm | 3’-arm |
|---|---|---|---|
| Positive | Only predicted by LBSizeCleav ( | 39 | 39 |
| | Only predicted by PHDCleav (extended binary) compared with LBSizeCleav ( | 57 | 38 |
| Negative | Only predicted by LBSizeCleav ( | 82 | 65 |
| | Only predicted by PHDCleav (extended binary) compared with LBSizeCleav ( | 23 | 12 |
| Positive | Only predicted by LBSizeCleav ( | 39 | 39 |
| | Only predicted by PHDCleav compared with LBSizeCleav ( | 57 | 38 |
| Negative | Only predicted by LBSizeCleav ( | 82 | 65 |
| | Only predicted by PHDCleav (extended binary) compared with LBSizeCleav ( | 23 | 12 |