
M6A-BiNP: predicting N6-methyladenosine sites based on bidirectional position-specific propensities of polynucleotides and pointwise joint mutual information.

Mingzhao Wang1,2, Juanying Xie2, Shengquan Xu1.   

Abstract

N6-methyladenosine (m6A) plays an important role in various biological processes, and identifying m6A sites is a key step in exploring its biological functions. One of the biggest challenges in identifying m6A sites is how to extract features that carry rich categorical information for distinguishing m6A from non-m6A sites. To address this challenge, we propose bidirectional dinucleotide and trinucleotide position-specific propensities. Based on these, we propose two feature-encoding algorithms: Position-Specific Propensities and Pointwise Mutual Information (PSP-PMI) and Position-Specific Propensities and Pointwise Joint Mutual Information (PSP-PJMI). PSP-PMI is based on the bidirectional dinucleotide position-specific propensity and pointwise mutual information, while PSP-PJMI is based on the bidirectional trinucleotide position-specific propensity and the pointwise joint mutual information proposed in this paper. We introduce parameters α and β in PSP-PMI and PSP-PJMI, respectively, to represent the distance from a nucleotide to its forward or backward nucleotide (α) or dinucleotide (β), so as to extract features containing both local and global classification information. Finally, we propose the M6A-BiNP predictor, which combines PSP-PMI or PSP-PJMI with an SVM classifier. The 10-fold cross-validation results on benchmark datasets of non-single-base resolution and single-base resolution demonstrate that PSP-PMI and PSP-PJMI can extract features with strong capabilities to distinguish m6A from non-m6A sites. The M6A-BiNP predictor based on PSP-PJMI outperforms the state-of-the-art predictors and, on these benchmarks, is the best model reported so far for identifying m6A and non-m6A sites.


Keywords:  N6-methyladenosine (m6A); feature representation; nucleotide position-specific propensities; pointwise joint mutual information; predictive model


Year:  2021        PMID: 34161188      PMCID: PMC8632114          DOI: 10.1080/15476286.2021.1930729

Source DB:  PubMed          Journal:  RNA Biol        ISSN: 1547-6286            Impact factor:   4.652


Introduction

Epigenetics is the study of heritable changes in gene expression that occur while the nucleotide sequence of genes remains unchanged [1]. Of the ~150 chemical modifications, RNA methylation is the most important epigenetic modification. It is the process in which a methyl group is catalytically transferred from an active methyl donor, such as S-adenosylmethionine, to different positions of an RNA molecule, chemically modifying it to form methylated products [2,3]. Common RNA methylation types include N6-methyladenosine (m6A), N1-methyladenosine (m1A) and C5-methylcytidine (m5C), and m6A modification exists in bacteria [4], Homo sapiens [5], Arabidopsis thaliana [6], etc. The m6A modification is dynamic and reversible. It is regulated by methyltransferase-complex components such as METTL3/14 and WTAP, by binding proteins such as YTHDF2 [7-9], and by demethylases such as FTO and ALKBH5 [10,11]. It plays an important role in many molecular processes, such as protein translation and localization [12], splicing [13], RNA stability [14], mRNA longevity control and degradation [12], and the promotion of cell differentiation [15]. It is also associated with the occurrence of complex diseases [16], such as glioblastoma formation [17], breast cancer [18] and obesity [11]. Therefore, identifying m6A sites will benefit the diagnosis and treatment of complex diseases and the understanding of their mechanisms, and it has scientific and practical value in personalized medicine and drug development. With the development of second-generation sequencing technology, a number of non-single-base resolution m6A site identification protocols, such as m6A-seq [19] and MeRIP-Seq [13], and single-base resolution protocols, such as miCLIP [20], m6A-CLIP [21] and m6A-REF-seq [22], have been proposed based on high-throughput sequencing.
At present, the m6A sites of Saccharomyces cerevisiae [23], Arabidopsis thaliana [6], Oryza sativa [24], Mus musculus [5] and Homo sapiens [5] have been identified at the full-transcriptome level. These studies show that the distribution of m6A sites is highly conserved and that most sites share the consensus motif DRACH (A = m6A; D = A or G or U; R = A or G; H = A or C or U) [5,13,20]. This lays a theoretical foundation for identifying m6A sites using machine learning techniques. However, m6A site identification methods based on high-throughput sequencing are time-consuming and not accurate enough, so they cannot be applied to large-scale genomic data. Therefore, many m6A site predictive models have been proposed in recent years, based on various sequence feature representation methods combined with traditional machine learning algorithms or deep learning frameworks [25-48]. Several of the latest predictors, such as Gene2Vec [38], DeepPromise [39], WHISTLE [40], im6A-TS-CNN [42], iRNA-m6A [43] and HSM6AP [44], were developed to identify and predict m6A sites using gold-standard datasets at single-base resolution. Although many computational models exist, accurately distinguishing m6A from non-m6A sites remains a challenging task. The key issue is how to extract features containing more categorical information from RNA sequences. Therefore, this paper proposes two new feature encoding algorithms, Position-Specific Propensities and Pointwise Mutual Information (PSP-PMI) and Position-Specific Propensities and Pointwise Joint Mutual Information (PSP-PJMI). PSP-PMI combines the proposed bidirectional dinucleotide position-specific propensity with Pointwise Mutual Information (PMI), and PSP-PJMI combines the proposed bidirectional trinucleotide position-specific propensity with our Pointwise Joint Mutual Information (PJMI).
The parameters α and β are introduced to represent, respectively, the distance between the two nucleotides of a pair in PSP-PMI and the distance from a nucleotide to its forward or backward consecutive dinucleotide in PSP-PJMI, so as to extract more discriminative features from RNA sequences. The features corresponding to different values of α (or β) are concatenated into a high-dimensional feature vector that embodies both local and global position-specific information distinguishing m6A from non-m6A sites. Finally, the novel m6A site predictor M6A-BiNP is built from these feature encodings and a Support Vector Machine (SVM) classifier. We test our M6A-BiNP models on a number of non-single-base resolution and single-base resolution m6A benchmark datasets of different species. The 10-fold cross-validation results demonstrate that PSP-PMI and PSP-PJMI extract features with much stronger discriminative capability for identifying m6A sites from RNA sequences. The M6A-BiNP predictor based on PSP-PJMI is superior to the state-of-the-art predictive models and, on these benchmarks, is the best model reported so far for identifying m6A sites.
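The DRACH consensus motif described above can be illustrated with a short scan over an RNA string. This is only a sketch of the motif definition given in the text (D = A/G/U, R = A/G, H = A/C/U); the function name `find_drach_sites` and the example sequence are hypothetical and are not part of M6A-BiNP.

```python
import re

# DRACH as character classes: D = A/G/U, R = A/G, A = candidate m6A site,
# C = C, H = A/C/U. The lookahead lets overlapping motifs be reported.
DRACH = re.compile(r"(?=([AGU][AG]AC[ACU]))")

def find_drach_sites(rna):
    """Return the 0-based index of the central A of every DRACH match."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    return [m.start() + 2 for m in DRACH.finditer(rna)]

if __name__ == "__main__":
    # Hypothetical toy sequence with two DRACH occurrences (GGACU and AGACA).
    print(find_drach_sites("CCGGACUAAGACAUU"))  # [4, 10]
```

A motif hit is only a candidate: the text notes that most, not all, m6A sites follow DRACH, which is why a trained predictor is needed on top of the motif.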

Materials and methods

Datasets

Two types of benchmark datasets are used to test our feature encoding algorithms PSP-PMI and PSP-PJMI and our M6A-BiNP predictors. The first type is non-single-base resolution data covering four species, Arabidopsis thaliana [27,49], Musculus [5,34], Homo sapiens [50] and Saccharomyces cerevisiae [25], generated by the low-resolution technique MeRIP-Seq. The detailed information of the non-single-base resolution datasets is shown in Table 1. The second type is single-base resolution data covering three species, human, mouse and rat, generated by the single-base resolution m6A sequencing techniques miCLIP or m6A-REF-seq. The datasets of the three species across different tissues, based on m6A-REF-seq, are downloaded from Dao's study [42], and the human dataset based on miCLIP is obtained from Xing's study [31]; the latter is denoted as Human51. The detailed information of the single-base resolution datasets is shown in Table 2. These m6A benchmark datasets have been used to test m6A site predictive models [30-32,35,37,41-43,50-52].
Table 1.

The detailed information of the non-single-base resolution benchmark datasets

| Species | # positive samples | # negative samples | # total samples | Sequence length (nt) |
|---|---|---|---|---|
| Arabidopsis thaliana | 394 | 394 | 788 | 25 |
| Musculus | 725 | 725 | 1450 | 41 |
| Homo sapiens | 1130 | 1130 | 2260 | 41 |
| Saccharomyces cerevisiae | 1307 | 1307 | 2614 | 51 |
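Table 2.

The detailed information of the single-base resolution benchmark datasets

| Species | Tissue | Name | Training # positive | Training # negative | Independent # positive | Independent # negative | Identification method | Sequence length (nt) |
|---|---|---|---|---|---|---|---|---|
| Rat | Brain | RB | 2352 | 2352 | 2351 | 2351 | m6A-REF-seq | 41 |
| Rat | Kidney | RK | 3433 | 3433 | 3432 | 3432 | m6A-REF-seq | 41 |
| Rat | Liver | RL | 1762 | 1762 | 1762 | 1762 | m6A-REF-seq | 41 |
| Mouse | Brain | MB | 8025 | 8025 | 8025 | 8025 | m6A-REF-seq | 41 |
| Mouse | Heart | MH | 2201 | 2201 | 2200 | 2200 | m6A-REF-seq | 41 |
| Mouse | Kidney | MK | 3953 | 3953 | 3952 | 3952 | m6A-REF-seq | 41 |
| Mouse | Liver | ML | 4133 | 4133 | 4133 | 4133 | m6A-REF-seq | 41 |
| Mouse | Testis | MT | 4707 | 4707 | 4706 | 4706 | m6A-REF-seq | 41 |
| Human | Brain | HB | 4605 | 4605 | 4604 | 4604 | m6A-REF-seq | 41 |
| Human | Kidney | HK | 4574 | 4574 | 4573 | 4573 | m6A-REF-seq | 41 |
| Human | Liver | HL | 2634 | 2634 | 2634 | 2634 | m6A-REF-seq | 41 |
| Human | | Human51 (a) | 8366 | 8366 | | | miCLIP | 51 |

(a) Human51 is listed with a single set of 8366 positive and 8366 negative samples.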

PMI and PJMI theory

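Mutual information measures the correlation between two random variables $X$ and $Y$ [53-55]. When $X$ and $Y$ are discrete random variables, it is calculated as in (1):

$$I(X;Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)} \tag{1}$$

where $p(x)$ and $p(y)$ are the marginal probability distribution functions of $X$ and $Y$, respectively, and $p(x,y)$ is the joint probability distribution function of $X$ and $Y$. The mutual information between random variables $X$, $Y$ and $Z$ is calculated as in (2):

$$I(X;Y,Z)=\sum_{x\in X}\sum_{y\in Y}\sum_{z\in Z}p(x,y,z)\log\frac{p(x,y,z)}{p(x)\,p(y,z)} \tag{2}$$

where $p(x,y,z)$ is the joint probability distribution function of $X$, $Y$ and $Z$, and $p(y,z)$ is that of $Y$ and $Z$.

Pointwise mutual information $PMI(x;y)$ is the pointwise counterpart of $I(X;Y)$. In information theory it records the amount of uncertainty reduction in $x$ when $y$ is given, and it is also used to measure the correlation between $x$ and $y$. It is calculated as in (3):

$$PMI(x;y)=\log\frac{p(x,y)}{p(x)\,p(y)} \tag{3}$$

$PMI(x;y)$ can be negative, zero or positive, and $PMI(x;y)=0$ iff $x$ and $y$ are independent of each other. In addition, PMI is symmetric, that is, $PMI(x;y)=PMI(y;x)$. Proof S1 in the supplementary material proves this symmetry.

Inspired by pointwise mutual information, we propose and define the pointwise joint mutual information $PJMI(x;y,z)$ in (4) to measure the amount of uncertainty reduction in $x$ when $y$ and $z$ are given. It can also measure the correlation between $x$, $y$ and $z$:

$$PJMI(x;y,z)=\log\frac{p(x,y,z)}{p(x)\,p(y,z)} \tag{4}$$

Here $x$, $y$ and $z$ are specific events of the random variables $X$, $Y$ and $Z$. $PJMI(x;y,z)$ can likewise be negative, zero or positive, and it equals 0 iff $x$ is independent of the pair $(y,z)$, which holds in particular when $x$, $y$ and $z$ are independent of each other. PJMI is symmetric and does not depend on the order of $y$ and $z$, that is, $PJMI(x;y,z)=PJMI(y,z;x)$ and $PJMI(x;y,z)=PJMI(x;z,y)$ both hold. The proofs of these two properties are Proof S2 and Proof S3 in the supplementary material.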
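The two pointwise quantities can be written as one-line functions. This is a minimal sketch, assuming the natural logarithm and assuming that PJMI takes the form log p(x,y,z) / (p(x) p(y,z)), which is how the definition is read here; the function names are illustrative only.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of two events with joint probability p_xy."""
    return math.log(p_xy / (p_x * p_y))

def pjmi(p_xyz, p_x, p_yz):
    """Pointwise joint mutual information of x with the pair (y, z).

    Assumed form: log p(x, y, z) / (p(x) * p(y, z)).
    """
    return math.log(p_xyz / (p_x * p_yz))

if __name__ == "__main__":
    print(pmi(0.25, 0.5, 0.5))    # independent events give 0.0
    print(pmi(0.40, 0.5, 0.5))    # positive: x and y co-occur more than by chance
    print(pjmi(0.125, 0.5, 0.25)) # x independent of (y, z) gives 0.0
```

Both functions are symmetric in the sense stated in the text because the numerator and the product in the denominator do not change when the arguments are swapped.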

Sequence encoding algorithms

Position-specific propensity has been applied to mine and identify the functional sites of biological sequences [29,31,36,56-58]. The basic principle is to calculate the position-wise frequencies of each nucleotide or amino acid over all sequences, and to convert the input sequences into feature vectors using the difference between the frequencies in the positive and negative datasets. To extract features containing rich category information from RNA sequences using position-specific propensity, we propose bidirectional dinucleotide and trinucleotide position-specific propensities, and the feature-encoding algorithms PSP-PMI and PSP-PJMI, which combine them with PMI and PJMI, respectively. To extract both local and global position-specific information of nucleotides from RNA sequences, we introduce parameters α and β in PSP-PMI and PSP-PJMI, respectively, to represent the spacing between nucleotides.

We formalize the m6A benchmark datasets in Tables 1 and 2 as follows. Let $D$ represent the m6A dataset, $D^+$ the positive dataset, that is, the true m6A dataset, and $D^-$ the negative dataset, that is, the non-m6A dataset, so that $D = D^+ \cup D^-$. An RNA sequence in $D$ is written $S = s_1 s_2 \cdots s_L$, where $L$ is its length and $s_i$ is the nucleotide at position $i$. The position-specific occurrence frequency of the four nucleotides at position $i$ in $D^+$ is denoted as the vector $f_i^+ = \left(f^+(A,i), f^+(C,i), f^+(G,i), f^+(U,i)\right)^T$, whose elements are the occurrence frequencies of nucleotides A, C, G and U at position $i$ in $D^+$, respectively. We define the nucleotide position-specific propensity matrix $P^+$ in (5) to represent the statistical information of the four nucleotides in $D^+$:

$$P^+ = \left(f_1^+, f_2^+, \ldots, f_L^+\right) = \begin{pmatrix} f^+(A,1) & f^+(A,2) & \cdots & f^+(A,L) \\ f^+(C,1) & f^+(C,2) & \cdots & f^+(C,L) \\ f^+(G,1) & f^+(G,2) & \cdots & f^+(G,L) \\ f^+(U,1) & f^+(U,2) & \cdots & f^+(U,L) \end{pmatrix} \tag{5}$$
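As an illustration of the nucleotide position-specific propensity matrix, the sketch below counts, for one class of equal-length sequences, how often each nucleotide occurs at each position. The function name and the toy sequences are hypothetical.

```python
NUCLEOTIDES = "ACGU"

def position_frequency_matrix(seqs):
    """Return a 4 x L matrix (rows A, C, G, U) of per-position frequencies."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    counts = [[0] * length for _ in NUCLEOTIDES]
    for s in seqs:
        for i, ch in enumerate(s):
            counts[NUCLEOTIDES.index(ch)][i] += 1
    # Convert counts to frequencies within this class of sequences.
    return [[c / len(seqs) for c in row] for row in counts]

if __name__ == "__main__":
    toy_positive = ["ACGU", "AAGU"]  # hypothetical positive samples
    for nuc, row in zip(NUCLEOTIDES, position_frequency_matrix(toy_positive)):
        print(nuc, row)
```

The same function applied to the negative samples gives the corresponding matrix for the non-m6A class; each column of either matrix sums to 1.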

PSP-PMI algorithm

The PSP-PMI algorithm uses the bidirectional dinucleotide position-specific propensity proposed in this paper, so as to extract more position-specific information about each nucleotide from both the forward and backward directions. Furthermore, to extract both local and global category information from RNA sequences, we introduce the parameter α to represent the distance between the two nucleotides of a pair; α = 1 means that the two nucleotides are adjacent. Figure 1 illustrates the idea of the bidirectional dinucleotide position-specific propensity on an example RNA sequence, where Figure 1(a,b) correspond to two different values of α.
Figure 1.

The bidirectional dinucleotide position-specific propensity. Panels (a) and (b) show two different values of α

We first take the positive dataset $D^+$ into consideration. The position-specific frequency of the forward dinucleotides at position $i$ is a vector of 16 elements, whose elements are the frequencies in $D^+$ of the 16 dinucleotides AA, AC, ..., UU. For example, the element $f^+_{\rightarrow}(AA,i)$ represents the frequency of the dinucleotide pair AA in $D^+$, where the two nucleotides A and A appear at positions $i$ and $i+\alpha$, respectively. We then define the forward dinucleotide position-specific propensity matrix $P^+_{\rightarrow}$ in (6), whose columns are these vectors, to represent the statistical information of the 16 types of dinucleotides in $D^+$. Similarly, the position-specific frequency of the backward dinucleotides at position $i$ of $D^+$ is a vector of 16 elements, whose element $f^+_{\leftarrow}(AA,i)$ denotes the frequency of the dinucleotide pair AA with the two nucleotides A and A at positions $i-\alpha$ and $i$, respectively. We define $P^+_{\leftarrow}$ in (7) as the backward dinucleotide position-specific propensity matrix for $D^+$.

Assume that the nucleotides at positions $i-\alpha$, $i$ and $i+\alpha$ are A, G and C, respectively. Then the forward PMI value of the nucleotide at position $i$ is calculated in (8), and its backward PMI value in (9):

$$PMI^+_{\rightarrow}(i)=\log\frac{f^+_{\rightarrow}(GC,i)}{f^+(G,i)\,f^+(C,i+\alpha)} \tag{8}$$

$$PMI^+_{\leftarrow}(i)=\log\frac{f^+_{\leftarrow}(AG,i)}{f^+(A,i-\alpha)\,f^+(G,i)} \tag{9}$$

The $f^+_{\rightarrow}(GC,i)$ in (8) and $f^+_{\leftarrow}(AG,i)$ in (9) are obtained from the matrices $P^+_{\rightarrow}$ and $P^+_{\leftarrow}$, respectively. The $f^+(A,i-\alpha)$, $f^+(G,i)$ and $f^+(C,i+\alpha)$ come from the nucleotide position-specific propensity matrix $P^+$ in (5); they are the occurrence probabilities of nucleotides A, G and C at positions $i-\alpha$, $i$ and $i+\alpha$ of $D^+$, respectively. The PMI encoding value of the nucleotide at position $i$ is defined as the average of its forward and backward PMI values, that is, $PMI^+(i)=\left(PMI^+_{\rightarrow}(i)+PMI^+_{\leftarrow}(i)\right)/2$. The PMI feature encoding vector $V^+$ of an RNA sequence of length $L$ in $D^+$ therefore contains one PMI value per encoded position.

Similarly, we obtain the nucleotide position-specific propensity matrix $P^-$ and the forward and backward dinucleotide position-specific propensity matrices $P^-_{\rightarrow}$ and $P^-_{\leftarrow}$ for $D^-$. We then calculate the forward PMI value $PMI^-_{\rightarrow}(i)$, the backward PMI value $PMI^-_{\leftarrow}(i)$ and the PMI encoding value $PMI^-(i)$ of the nucleotide at position $i$, which give the PMI feature encoding vector $V^-$ of the RNA sequence with respect to $D^-$. Finally, we encode the RNA sequence of length $L$ into a feature vector $V$ by subtracting $V^-$ from $V^+$, as in (10):

$$V = V^+ - V^-, \qquad v_i = PMI^+(i) - PMI^-(i) \tag{10}$$

It should be noted that the PMI in (8) and (9) is not the strict theoretical PMI, because it does not satisfy the symmetry property. Nucleotides in an RNA sequence are ordered, so swapping the two nucleotides of a pair changes the dinucleotide, and in general $PMI(x;y)\neq PMI(y;x)$ when encoding RNA sequences. Proof S4 in the supplementary material addresses this fact.

We summarize PSP-PMI in Figure S1 in the supplemental material. We first partition the dataset $D$ into the positive dataset $D^+$ and the negative dataset $D^-$. Then the mononucleotide position-specific propensity matrix and the bidirectional dinucleotide position-specific propensity matrices are calculated for $D^+$ and $D^-$, respectively. The PMI values of the nucleotides are calculated from these six matrices. Finally, the RNA sequence of length $L$ is encoded into a feature vector. The parameter α, the distance between the two nucleotides of a pair, lets us capture both local and global categorical information from RNA sequences: the encoded feature vectors corresponding to different values of α are concatenated into one final feature vector.
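The PSP-PMI steps summarized above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation, and it rests on several assumptions that the text does not settle: only positions that have both a forward and a backward partner are encoded, a pseudocount `eps` keeps the logarithm finite for unseen nucleotides or pairs, the natural logarithm is used, and the values in `alphas` are arbitrary. All function names are hypothetical.

```python
import math
from collections import Counter

def build_tables(seqs, alpha):
    """Per-position nucleotide counts and gapped-pair counts for one class."""
    length = len(seqs[0])
    mono = [Counter(s[i] for s in seqs) for i in range(length)]
    pair = [Counter(s[i] + s[i + alpha] for s in seqs)
            for i in range(length - alpha)]
    return mono, pair, len(seqs)

def smoothed_pmi(pair_count, count_x, count_y, n, eps):
    # eps is an assumed pseudocount (16 pair types, 4 nucleotide types).
    p_xy = (pair_count + eps) / (n + 16 * eps)
    p_x = (count_x + eps) / (n + 4 * eps)
    p_y = (count_y + eps) / (n + 4 * eps)
    return math.log(p_xy / (p_x * p_y))

def class_pmi(seq, tables, alpha, eps=0.5):
    """Average of forward and backward PMI for every encodable position."""
    mono, pair, n = tables
    values = []
    for i in range(alpha, len(seq) - alpha):
        x, nxt, prev = seq[i], seq[i + alpha], seq[i - alpha]
        forward = smoothed_pmi(pair[i][x + nxt], mono[i][x],
                               mono[i + alpha][nxt], n, eps)
        backward = smoothed_pmi(pair[i - alpha][prev + x],
                                mono[i - alpha][prev], mono[i][x], n, eps)
        values.append((forward + backward) / 2)
    return values

def psp_pmi(seq, positives, negatives, alphas=(1, 2)):
    """Concatenate (positive PMI - negative PMI) over several alpha values."""
    features = []
    for a in alphas:
        pos = class_pmi(seq, build_tables(positives, a), a)
        neg = class_pmi(seq, build_tables(negatives, a), a)
        features += [p - q for p, q in zip(pos, neg)]
    return features

if __name__ == "__main__":
    positives = ["GGACU", "AGACA", "UAACC"]  # hypothetical toy data
    negatives = ["CCAUU", "GCAGG", "UUAGC"]
    print(psp_pmi("GGACU", positives, negatives))
```

With sequences of length 5, α = 1 yields three values and α = 2 yields one, so the concatenated vector in this toy run has four elements.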

PSP-PJMI algorithm

To extract more meaningful information from RNA sequences, we further propose the PSP-PJMI feature encoding algorithm, which is built on the bidirectional trinucleotide position-specific propensity. It calculates the forward and backward trinucleotide position-specific propensity matrices for the positive and negative datasets, respectively, and uses our proposed PJMI in (4) to encode RNA sequences. We introduce the parameter β into the bidirectional trinucleotide position-specific propensity to represent the distance from a nucleotide to its forward or backward successive dinucleotide; β = 1 means that the three nucleotides are successive. Figure 2(a,b) illustrates the bidirectional trinucleotide position-specific propensity on an example RNA sequence for two different values of β.
Figure 2.

The bidirectional trinucleotide position-specific propensity. Panels (a) and (b) show two different values of β

The forward trinucleotide position-specific frequency at position $i$ for RNA sequences in the positive dataset $D^+$ can be expressed as a vector of 64 elements, which holds the frequencies in $D^+$ of the 64 trinucleotides AAA, AAC, ..., UUU. Its element $f^+_{\rightarrow}(AAA,i)$ represents the frequency of the trinucleotide AAA, where the three nucleotides A, A and A are at positions $i$, $i+\beta$ and $i+\beta+1$, respectively. The forward trinucleotide position-specific propensity matrix $T^+_{\rightarrow}$ of $D^+$, whose columns are these vectors, is given in (11). The backward trinucleotide position-specific frequency at position $i$ of $D^+$ is likewise a vector of 64 elements, whose first element $f^+_{\leftarrow}(AAA,i)$ represents the frequency of the trinucleotide AAA with the nucleotides A, A and A at positions $i-\beta-1$, $i-\beta$ and $i$, respectively. The backward trinucleotide position-specific propensity matrix $T^+_{\leftarrow}$ of $D^+$ is defined in (12).

Write $x=s_i$ for the nucleotide at position $i$, $yz=s_{i+\beta}s_{i+\beta+1}$ for its forward dinucleotide and $uv=s_{i-\beta-1}s_{i-\beta}$ for its backward dinucleotide. The forward and backward PJMI values of the nucleotide at position $i$ are calculated in (13) and (14), respectively:

$$PJMI^+_{\rightarrow}(i)=\log\frac{f^+_{\rightarrow}(xyz,i)}{f^+(x,i)\,f^+(yz,i+\beta)} \tag{13}$$

$$PJMI^+_{\leftarrow}(i)=\log\frac{f^+_{\leftarrow}(uvx,i)}{f^+(x,i)\,f^+(uv,i-\beta-1)} \tag{14}$$

The trinucleotide and dinucleotide frequencies in (13) are obtained from the forward trinucleotide matrix $T^+_{\rightarrow}$ and the dinucleotide position-specific propensity matrix, and those in (14) from the backward trinucleotide matrix $T^+_{\leftarrow}$ and the dinucleotide position-specific propensity matrix. The $f^+(x,i)$ in (13) and (14) is from the nucleotide position-specific propensity matrix $P^+$. The PJMI value of the nucleotide at position $i$ of an RNA sequence is defined as the average of the forward and backward values, $PJMI^+(i)=\left(PJMI^+_{\rightarrow}(i)+PJMI^+_{\leftarrow}(i)\right)/2$, and the PJMI feature vector $V^+$ of an RNA sequence of length $L$ with respect to $D^+$ collects these values over the encoded positions.

Similarly, we calculate the forward and backward trinucleotide position-specific propensity matrices $T^-_{\rightarrow}$ and $T^-_{\leftarrow}$ of $D^-$. The forward and backward PJMI values $PJMI^-_{\rightarrow}(i)$ and $PJMI^-_{\leftarrow}(i)$ are then calculated using our PJMI in (4), and their average is the PJMI encoding value $PJMI^-(i)$ of the nucleotide at position $i$, giving the PJMI feature encoding vector $V^-$ with respect to $D^-$. Finally, we encode the given RNA sequence of length $L$ into a feature vector $V$ in (15) by subtracting $V^-$ from $V^+$:

$$V = V^+ - V^-, \qquad v_i = PJMI^+(i) - PJMI^-(i) \tag{15}$$

It should be noted that the PJMI in (13) and (14) is not the strict theoretical PJMI, because the nucleotides in an RNA sequence are ordered. The PJMI in (13) and (14) does not satisfy the symmetry property, and the order of $y$ and $z$ is not irrelevant. The detailed proof is given in Proof S5 in the supplementary material.

Figure S2 in the supplemental material shows the schematic of PSP-PJMI. It introduces the bidirectional trinucleotide position-specific propensity matrices for $D^+$ and $D^-$. For a given RNA sequence of length $L$, the PJMI values $PJMI^+(i)$ and $PJMI^-(i)$ of the nucleotide at position $i$ are calculated using our PJMI theory, and the RNA sequence is converted into a feature vector by subtracting the PJMI feature vector $V^-$ from $V^+$. The parameter β is introduced to extract both local and global categorical information from RNA sequences: the feature vectors for different values of β are concatenated into one high-dimensional vector.
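A compact sketch of the PSP-PJMI encoding follows. As with any reconstruction from the description, it makes assumptions that should be checked against the original equations: the forward trinucleotide of position i occupies positions i, i+β and i+β+1, the backward one occupies i-β-1, i-β and i, the denominator multiplies the single-nucleotide frequency by the dinucleotide frequency, only positions with both neighbours are encoded, and a pseudocount `eps` avoids log(0). Names such as `pjmi_profile` are hypothetical.

```python
import math

def pjmi_profile(seq, seqs, beta, eps=0.5):
    """Average forward/backward PJMI of `seq` against one class `seqs`."""
    n, length = len(seqs), len(seq)

    def prob(positions):
        # Smoothed frequency, within `seqs`, of seq's own nucleotides
        # occurring jointly at the given positions.
        hits = sum(all(s[p] == seq[p] for p in positions) for s in seqs)
        return (hits + eps) / (n + (4 ** len(positions)) * eps)

    values = []
    for i in range(beta + 1, length - beta - 1):
        fwd_di = (i + beta, i + beta + 1)   # forward dinucleotide positions
        bwd_di = (i - beta - 1, i - beta)   # backward dinucleotide positions
        forward = math.log(prob((i,) + fwd_di) / (prob((i,)) * prob(fwd_di)))
        backward = math.log(prob(bwd_di + (i,)) / (prob((i,)) * prob(bwd_di)))
        values.append((forward + backward) / 2)
    return values

def psp_pjmi(seq, positives, negatives, betas=(1,)):
    """Concatenate (positive PJMI - negative PJMI) over several beta values."""
    features = []
    for b in betas:
        pos = pjmi_profile(seq, positives, b)
        neg = pjmi_profile(seq, negatives, b)
        features += [p - q for p, q in zip(pos, neg)]
    return features

if __name__ == "__main__":
    positives = ["GGACUAA", "AGACAUC", "UAACCGG"]  # hypothetical toy data
    negatives = ["CCAUUGA", "GCAGGAU", "UUAGCCA"]
    print(psp_pjmi(positives[0], positives, negatives, betas=(1, 2)))
```

For sequences of length 7, β = 1 encodes three positions and β = 2 encodes one, so the toy run prints a four-element vector.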

Support vector machine

SVM was proposed by Cortes and Vapnik [59]. It uses kernel functions to map samples that are not linearly separable in the low-dimensional input space into a high-dimensional feature space in which they become linearly separable. SVM has excellent learning and generalization capability, and has been widely used in complex disease diagnosis, biological functional site prediction and other bioinformatics fields [60-64]. We adopt the LibSVM toolbox (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) developed by Chang et al. [65] to train our m6A predictive model, with the radial basis function (RBF) as the kernel function. A grid search is used to find the optimal parameter pair (C, γ), so as to obtain the best predictive model. The penalty factor C and the RBF parameter γ are each searched over their own range, both with a step of 1.

Metrics to evaluate m6A predictors

To test the power of PSP-PMI and PSP-PJMI in extracting features with rich categorical information, we evaluate the performance of the M6A-BiNP predictor built on features extracted by PSP-PMI or PSP-PJMI using widely adopted metrics, namely Accuracy (Acc), Sensitivity (Sn), Specificity (Sp) and Matthews correlation coefficient (MCC) [31,38-44,52,62,66,67], together with two comprehensive indexes, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). AUROC and AUPRC evaluate the overall performance of a binary classifier [28,41-43,52,68,69]. AUROC is the area under the ROC (receiver operating characteristic) curve. The ROC curve [70] is plotted from the pairs of false-positive rate (FPR) and true-positive rate (TPR) obtained at different thresholds, in a two-dimensional space with FPR as the x-axis and TPR as the y-axis. AUPRC is the area under the Precision-Recall (P-R) curve [71]. The P-R curve is more informative than the ROC curve when dealing with imbalanced binary classification problems [72].
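The threshold-based metrics follow directly from the confusion matrix, and AUROC can be computed as the probability that a positive sample is scored above a negative one. The sketch below uses the standard definitions of these metrics; the function names and the example counts are illustrative.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Acc, Sn, Sp and MCC from the four confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # sensitivity: recall on m6A sites
    sp = tn / (tn + fp)   # specificity: recall on non-m6A sites
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}

def auroc(scores_pos, scores_neg):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = sum((p > q) + 0.5 * (p == q)
               for p in scores_pos for q in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

if __name__ == "__main__":
    print(confusion_metrics(tp=90, tn=80, fp=20, fn=10))
    print(auroc([0.9, 0.8], [0.1, 0.85]))  # 0.75
```

The pairwise AUROC is quadratic in the number of samples, which is fine for illustration but slow for large datasets.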

Framework of M6A-BiNP

Figure 3 shows the framework of our M6A-BiNP predictor. We first encode each RNA sequence into feature vectors using PSP-PMI or PSP-PJMI, concatenate the feature vectors obtained for different values of α (for PSP-PMI) or β (for PSP-PJMI) into the final feature vector, and normalize it by min–max normalization. SVM classifiers are then trained in 10-fold cross-validation experiments to build the M6A-BiNP predictor, and the results of these experiments are used to evaluate it.
Figure 3.

Framework of our M6A-BiNP predictor
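Two steps of this framework, min–max normalization and the 10-fold split, can be sketched without any external library. The helper names, the random seed and the choice to map constant features to 0 are assumptions made for this illustration; the text does not say whether normalization statistics are computed on the whole dataset or within each training fold, and the sketch leaves that decision to the caller.

```python
import random

def min_max_normalize(rows):
    """Scale every feature (column) to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

if __name__ == "__main__":
    print(min_max_normalize([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]]))
    for train, test in k_fold_indices(25, k=10):
        print(len(train), len(test))
```

Each sample appears in exactly one test fold, so the per-fold predictions can be pooled or averaged to obtain the cross-validation metrics.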


Results

Since non-single-base resolution m6A sequencing data have been widely used in existing studies, we first test PSP-PMI and PSP-PJMI on the non-single-base resolution datasets of the four species in Table 1. We then test them on the single-base resolution datasets in Table 2. We compare the performance of our M6A-BiNP predictors with state-of-the-art predictive models on the non-single-base resolution and single-base resolution datasets, respectively.

Performance evaluation on non-single-base resolution m6A datasets

Analysis of nucleotide position-specific propensities

To reveal the nucleotide position-specific propensities of four species m6A datasets, we adopt Two Sample Logo [73] tool (t-test, p = 0.05) to calculate whether there is a significant difference in the distribution of nucleotides at each site in m6A dataset between its positive and negative samples and to visualize the significant distribution difference of the nucleotide at each site using a nucleotide symbol proportional to the significant difference. The result is shown in Figure 4.
Figure 4.

The nucleotide position-specific propensity. (a) Arabidopsis thaliana, (b) Musculus, (c) Homo sapiens and (d) Saccharomyces cerevisiae. The nucleotide A at position 0 is the m6A site in positive sequences and the non-m6A site in negative sequences. Nucleotide symbols in the upper part of each panel indicate nucleotides enriched in the positive dataset, symbols in the lower part indicate nucleotides depleted in the positive dataset, and symbols in the middle indicate the consensus motif shared by both datasets

The results in Figure 4 show the consensus motif AC at positions 0 and 1 in Arabidopsis thaliana and Musculus sequences, the consensus motif GAC at positions −1 to 1 in Saccharomyces cerevisiae sequences, and the consensus motif A at position 0 in Homo sapiens sequences. Nucleotide position-specific propensity exists both upstream and downstream of the m6A site in all four species; for example, nucleotide A is enriched while nucleotide U is depleted both upstream and downstream of the m6A site in Saccharomyces cerevisiae. Figure 4 also shows that the closer a position is to the m6A site, the more significant the difference in nucleotide distribution. For example, nucleotides A, G and U at positions −4, −2 and 4 are significantly enriched while nucleotide A at position −2 is significantly depleted in Saccharomyces cerevisiae; nucleotides A and C are significantly enriched while nucleotide U is significantly depleted at position 2 in Arabidopsis thaliana, Musculus and Homo sapiens; and nucleotide U at position −2 is significantly enriched in Musculus and Homo sapiens. These analyses show that nucleotide distributions vary across sequence positions and across the four species, which means that nucleotide position-specific propensity is key predictive information for distinguishing m6A from non-m6A samples.
This supports the design of PSP-PMI and PSP-PJMI and the capability of the features they encode to recognize m6A sites.

Effects of parameters α and β

To extract features with local and global categorical information from RNA sequences, the parameters α and β are introduced into PSP-PMI and PSP-PJMI, respectively. They represent the distance from a nucleotide to its forward or backward nucleotide in PSP-PMI (α), or to its forward or backward dinucleotide in PSP-PJMI (β). The features encoded by PSP-PMI or PSP-PJMI differ for different values of α and β. We take Saccharomyces cerevisiae in Table 1 as an example to investigate the impact of α in PSP-PMI and of β in PSP-PJMI. The features corresponding to different values of α (or β) are also concatenated to train SVM models. Figure 5 displays the 10-fold cross-validation results. The results on Arabidopsis thaliana, Musculus and Homo sapiens are shown in Figure S3 in the supplemental materials.
Figure 5.

The performance of SVM built on features encoded by (a) PSP-PMI and (b) PSP-PJMI when varying parameters α and β on Saccharomyces cerevisiae. The bar chart represents the performance of the SVM built on features corresponding to individual values of α and β. The line chart represents the performance of the SVM built on the concatenated features

The results in Figure 5 show that the SVM model built on features encoded by PSP-PMI or PSP-PJMI for a single parameter value performs worse and worse as α and β increase, and is worst when α reaches 24 in PSP-PMI and β reaches 23 in PSP-PJMI. The Acc is 0.5 in the worst case, which means the model classifies m6A and non-m6A sequences of Saccharomyces cerevisiae at random. The reason is that the distance between nucleotides in the dinucleotides and trinucleotides grows as α and β increase, so fewer nucleotides can be encoded by PSP-PMI and PSP-PJMI and the dimensionality of the encoded features decreases, until no useful categorical information can be extracted from the RNA sequences. Figure 5 also shows that the performance of the SVM model improves when it is built on the concatenation of the features encoded by PSP-PMI or PSP-PJMI at various values of α or β, respectively. This indicates that each value of α or β yields features with local classification information, and that concatenating these features yields features with global categorical information that maximize the performance of the SVM classifier. This further supports introducing the parameters α and β into PSP-PMI and PSP-PJMI, respectively. Moreover, Figure 5 shows that the SVM predictor built on features encoded by PSP-PJMI outperforms the one built on features encoded by PSP-PMI on Saccharomyces cerevisiae by nearly 11%. This demonstrates that PSP-PJMI can extract features containing far more categorical information than PSP-PMI.

Comparison with other feature encoding algorithms

To test the performance of PSP-PMI and PSP-PJMI, we compare them with seven other feature encoding algorithms on the non-single-base resolution datasets of the four species in Table 1: position-specific nucleotide propensities (PSNP) [29], position-specific dinucleotide propensities (PSDP) [29], K-nucleotide frequencies (KNF) [32], K-spaced nucleotide pair frequencies (KSNPF) [32], nucleotide pair position specificity (NPPS) [31], positional binary encoding (PBE) [74] and nucleotide chemical property and nucleotide composition (NCPNC) [27]. The parameter of NPPS takes values from a predefined set. The performance of each feature encoding algorithm is shown in Table 3 in terms of the Acc, Sn, Sp and MCC of the SVM classifier, and Figure 6 displays the AUROC and AUPRC of each algorithm. The mean values of the optimal SVM parameters obtained by grid search for each feature encoding algorithm in the 10-fold cross-validation experiments are shown in Table S1 in the supplemental material.
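Table 3.

Performance comparison between our PSP-PMI, PSP-PJMI and other seven feature representation algorithms on four non-single-base resolution datasets

| Algorithms | Arabidopsis thaliana Acc | Sn | Sp | MCC | Musculus Acc | Sn | Sp | MCC |
|---|---|---|---|---|---|---|---|---|
| PSP-PMI | 0.830 | 0.825 | 0.835 | 0.661 | 0.901 | 0.909 | 0.894 | 0.804 |
| PSP-PJMI | 0.961 | 0.960 | 0.962 | 0.924 | 0.994 | 0.996 | 0.992 | 0.988 |
| PSNP | 0.840= | 0.683+ | 0.998 | 0.717= | 0.856+ | 0.712+ | 1.000 | 0.745+ |
| PSDP | 0.843= | 0.685+ | 1.000 | 0.723= | 0.885= | 0.771+ | 1.000 | 0.793= |
| KNF | 0.787+ | 0.622+ | 0.952 | 0.609= | 0.732+ | 0.657+ | 0.808+ | 0.472+ |
| KSNPF | 0.666+ | 0.617+ | 0.715+ | 0.334+ | 0.663+ | 0.648+ | 0.678+ | 0.329+ |
| NPPS | 0.925 | 0.891 | 0.959 | 0.854 | 0.918= | 0.855= | 0.981 | 0.844= |
| PBE | 0.840= | 0.683+ | 0.998 | 0.717= | 0.885= | 0.771+ | 1.000 | 0.793= |
| NCPNC | 0.843= | 0.685+ | 1.000 | 0.723= | 0.881= | 0.763+ | 1.000 | 0.786= |
| +/=/- | 2/4/2 | 6/0/2 | 1/0/7 | 1/5/2 | 3/4/1 | 6/1/1 | 2/0/6 | 3/4/1 |

| Algorithms | Homo sapiens Acc | Sn | Sp | MCC | Saccharomyces cerevisiae Acc | Sn | Sp | MCC |
|---|---|---|---|---|---|---|---|---|
| PSP-PMI | 0.849 | 0.858 | 0.841 | 0.700 | 0.905 | 0.916 | 0.894 | 0.810 |
| PSP-PJMI | 0.986 | 0.982 | 0.989 | 0.972 | 0.995 | 0.996 | 0.994 | 0.990 |
| PSNP | 0.902 | 0.804+ | 1.000 | 0.821 | 0.747+ | 0.751+ | 0.743+ | 0.495+ |
| PSDP | 0.903 | 0.806+ | 1.000 | 0.822 | 0.766+ | 0.764+ | 0.769+ | 0.534+ |
| KNF | 0.797+ | 0.695+ | 0.899 | 0.607+ | 0.692+ | 0.741+ | 0.643+ | 0.387+ |
| KSNPF | 0.680+ | 0.612+ | 0.749+ | 0.365+ | 0.651+ | 0.712+ | 0.591+ | 0.307+ |
| NPPS | 0.908 | 0.817+ | 0.999 | 0.830 | 0.874+ | 0.884+ | 0.864= | 0.749+ |
| PBE | 0.908 | 0.817+ | 1.000 | 0.831 | 0.727+ | 0.727+ | 0.728+ | 0.456+ |
| NCPNC | 0.909 | 0.818+ | 1.000 | 0.832 | 0.731+ | 0.735+ | 0.726+ | 0.463+ |
| +/=/- | 2/0/6 | 7/0/1 | 1/0/7 | 2/0/6 | 7/0/1 | 7/0/1 | 6/1/1 | 7/0/1 |

Values of the compared algorithms that carry no symbol are the ones counted as '-' in the +/=/- row.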
Figure 6.

ROC and P-R curves of nine feature representation algorithms on four datasets. (a) – (d) AUROC, (e) – (h) AUPRC

Furthermore, we apply a paired two-tailed t-test at the 5% significance level (p = 0.05) to the 10-fold cross-validation results of PSP-PMI, PSP-PJMI and the seven other feature encoding algorithms on the four species m6A datasets, to verify whether the differences between these algorithms are significant. The symbols '+', '=' and '-' denote that PSP-PMI is significantly better than, not significantly different from, and significantly worse than the compared algorithm, respectively. For each of Acc, Sn, Sp and MCC we count the number of '+', '=' and '-' symbols, so as to compare PSP-PMI with PSP-PJMI and the seven other algorithms on the four datasets. These test results are also shown in Table 3.
The results in Table 3 show that PSP-PJMI outperforms PSP-PMI and the seven compared algorithms in terms of Acc, Sn and MCC. On Saccharomyces cerevisiae in particular, it is superior to every compared feature encoding algorithm, including PSP-PMI, in terms of Acc, Sn, Sp and MCC. PSP-PMI outperforms the seven compared feature encoding algorithms on Saccharomyces cerevisiae. In terms of Acc, Sn and MCC, it is statistically better than or equal to the seven other algorithms on Musculus, and to six of them (all except NPPS) on Arabidopsis thaliana. It performs poorly on Homo sapiens, where it is superior only to KSNPF on all four metrics and to KNF in terms of Acc, Sn and MCC, although it still beats all seven compared algorithms in terms of Sn.
The paired two-tailed t-test results in Table 3 show that, relative to the seven compared feature encoding algorithms, PSP-PMI performs best on Saccharomyces cerevisiae and worst on Homo sapiens. Its capability to extract features that identify non-m6A sites (Sp) is inferior to that of every compared algorithm except KSNPF on Arabidopsis thaliana and Homo sapiens, and to that of every compared algorithm except KSNPF and KNF on Musculus. However, we are more interested in the capability to identify true m6A sites, that is, the larger the Sn, the better the algorithm. Therefore, although PSP-PMI is inferior to PSP-PJMI, it extracts more useful features from RNA sequences than the seven other algorithms.
The results in Figure 6 show that PSP-PJMI is far better than PSP-PMI and the seven compared algorithms; its AUROC and AUPRC reach the maximal value of 1 on Musculus and Saccharomyces cerevisiae. PSP-PMI beats the seven other feature encoding algorithms on Saccharomyces cerevisiae in terms of AUROC and AUPRC. However, in terms of AUROC it is inferior to NPPS on Arabidopsis thaliana, Musculus and Homo sapiens, and also inferior to PSDP on Homo sapiens. In terms of AUPRC, it is inferior to NPPS, PSNP and PSDP on Arabidopsis thaliana and to NPPS on Musculus, and on Homo sapiens it is superior only to KNF and KSNPF. From the above analyses, we conclude that PSP-PJMI is superior to PSP-PMI and the seven compared feature encoding algorithms, and that it extracts features with strong categorical discernibility from RNA sequences. Although PSP-PMI is inferior to PSP-PJMI, it is superior to the seven other algorithms in encoding features to identify m6A sites.
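The '+', '=' and '-' labels discussed above come from paired two-tailed t-tests on 10-fold cross-validation results. The following is a minimal sketch of how one such label could be derived, assuming ten per-fold scores for each algorithm. The per-fold scores, the function name and the handling of zero variance are illustrative assumptions; the paper's own test script is not shown in the text.

```python
import math
import statistics

# Two-tailed critical value of Student's t at the 5% level for 9 degrees of
# freedom, i.e. for exactly 10 paired folds.
T_CRIT_DF9 = 2.262

def paired_t_symbol(ours, other, t_crit=T_CRIT_DF9):
    """Return '+', '=' or '-' for two lists of per-fold scores (10 folds each)."""
    diffs = [a - b for a, b in zip(ours, other)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd_d == 0:
        # Every fold differs by the same amount; the t statistic is undefined,
        # so this sketch falls back on the sign of the constant difference.
        return "=" if mean_d == 0 else ("+" if mean_d > 0 else "-")
    t_stat = mean_d / (sd_d / math.sqrt(len(diffs)))
    if abs(t_stat) <= t_crit:
        return "="
    return "+" if t_stat > 0 else "-"

# Made-up per-fold accuracies for two encodings.
ours = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.90, 0.89, 0.92, 0.91]
other = [0.85, 0.86, 0.86, 0.85, 0.84, 0.87, 0.85, 0.86, 0.85, 0.84]
print(paired_t_symbol(ours, other))
```

Counting these symbols per metric over all compared algorithms gives tallies of the form shown in the '+/=/-' rows of Table 3.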

Comparison with the state-of-the-art predictors

To comprehensively evaluate our M6A-BiNP predictors built on features encoded by PSP-PMI or PSP-PJMI, we collected as many state-of-the-art predictive models as we could find, based on traditional machine learning algorithms and deep learning frameworks, for the four species benchmark m6A datasets, and compared them with our M6A-BiNP predictors. The performance of the state-of-the-art models and of the M6A-BiNP predictors on the four datasets is shown in Table 4, where the best results are in bold. CNN, EL, RF and DNN denote convolutional neural network, ensemble learning, random forest and deep neural network, respectively.
Table 4.

Performance comparison between our M6A-BiNP and the state-of-the-art predictors on four species m6A benchmark datasets

DatasetsPredictorsClassifiersExperiment methodsEvaluation criteria
AccSnSpMCCAUROCAUPRC
Arabidopsis thalianaM6ATH [27]SVMjackknife0.8440.6881.0000.7200.8460.870
RAM-NPPS [31]SVMjackknife0.8950.8730.9160.790
m6A-word2vec [75]CNN10-fold cross-validation0.9050.9500.8590.8100.928
M6A-BiNPSVM (PSP-PMI)10-fold cross-validation0.8300.8250.8350.6610.8970.904
SVM (PSP-PJMI)10-fold cross-validation0.9610.9600.9620.9240.9950.994
MusculusiN6-Methyl [37]CNN10-fold cross-validation0.8950.7891.0000.8080.913
M6AMRFS [34]XGBoost10-fold cross-validation0.7930.8280.7580.588
MethyRNA [50]SVMjackknife0.8840.7781.000
iMRM [41]XGboostjackknife0.8900.7830.9960.7790.820
m6A-NeuralTool [76]CNN10-fold cross-validation0.9580.9151.0000.9120.960
pm6A-CNN [77]CNN10-fold cross-validation0.9380.9040.9720.8800.970
Second order-MM [78]Markov model10-fold cross-validation0.8830.8750.8890.775
SRAMP [28]RF10-fold cross-validation0.8890.7781.0000.798
M6A-BiNPSVM (PSP-PMI)10-fold cross-validation0.9010.9090.8940.8040.9620.962
SVM (PSP-PJMI)10-fold cross-validation0.9940.9960.9920.9881.0001.000
Homo sapiensM6AMRFS [34]XGBoost10-fold cross-validation0.9100.8201.0000.834
MethyRNA [50]SVMjackknife0.9040.8170.991
iRNA-Methyl [25]SVMjackknife0.6720.5750.769
iN6-Methyl [37]CNN10-fold cross-validation0.9110.8211.0000.8350.903
iMRM [41]XGboostjackknife0.9100.8250.9960.8200.940
m6A-NeuralTool [76]CNN10-fold cross-validation0.9600.9201.0000.8820.950
pm6A-CNN [77]CNN10-fold cross-validation0.9360.8860.9860.8780.960
m6A-word2vec [75]CNN10-fold cross-validation0.9270.9810.8820.8500.951
Second order-MM [78]Markov model10-fold cross-validation0.9060.8650.9470.814
SRAMP [28]RF10-fold cross-validation0.8980.7971.0000.814
M6A-BiNPSVM (PSP-PMI)10-fold cross-validation0.8490.8580.8410.7000.9280.928
SVM (PSP-PJMI)10-fold cross-validation0.9860.9820.9890.9720.9990.999
Saccharomyces cerevisiaeM6APredict-EL [35]EL10-fold cross-validation0.8080.8070.8100.6200.9020.901
RAM-NPPS [31]SVM10-fold cross-validation0.7990.7900.8080.598
M6AMRFS [34]XGBoost10-fold cross-validation0.7430.7520.7330.485
M6A-HPCS [79]SVMjackknife0.7240.7740.6740.4500.782
iRNA-Methyl [25]SVMjackknife0.6560.7060.6060.2900.705
pRNAm-PC [30]SVMjackknife0.6970.6970.6980.4000.763
RAM-ESVM [80]SVMjackknife0.7480.7890.7780.570
BERMP [81]DL and RFindependent0.7130.7300.6960.4300.800
iMethyl-STTNC [52]SVM10-fold cross-validation0.6980.7030.6820.380
iN6-Methyl [37]CNN10-fold cross-validation0.7540.7620.7460.5080.803
M6A-PXGB [51]XGBoost10-fold cross-validation0.7710.7640.7600.5350.839
DeepM6APred [36]SVM10-fold cross-validation0.8050.7950.8150.610
iMRM [41]XGboostjackknife0.7780.7700.7850.5550.85
m6A-NeuralTool [76]CNN10-fold cross-validation0.7900.7830.7960.614
pm6A-CNN [77]CNN10-fold cross-validation0.8500.8460.8550.7030.920
m6A-word2vec [75]CNN10-fold cross-validation0.8320.8650.7990.6600.901
iMethyl-deep [82]CNN10-fold cross-validation0.8920.8850.8990.7800.931
DNN-m6A [83]DNN10-fold cross-validation0.7850.7870.7830.571
M6A-BiNPSVM (PSP-PMI)10-fold cross-validation0.9050.9160.8940.8100.9680.967
SVM (PSP-PJMI)10-fold cross-validation0.9950.9960.9940.9901.0001.000
The results in Table 4 show that our M6A-BiNP predictor built on features encoded by PSP-PJMI is far superior to the state-of-the-art predictors on the four species m6A benchmark datasets in terms of Acc, Sn, MCC, AUROC and AUPRC. On Saccharomyces cerevisiae it is superior to all available predictors on every metric, including Sp. Although it is not the best in terms of Sp on Arabidopsis thaliana, Musculus and Homo sapiens, it is the best there in terms of Acc, Sn, MCC, AUROC and AUPRC. Its accuracy is 6.19%, 3.76%, 2.71% and 11.55% higher, in relative terms, than that of the best competing predictors m6A-word2vec, m6A-NeuralTool, m6A-NeuralTool and iMethyl-deep on Arabidopsis thaliana, Musculus, Homo sapiens and Saccharomyces cerevisiae, respectively.
The results in Table 4 also show that the M6A-BiNP predictor built on features encoded by PSP-PMI beats all the state-of-the-art models on Saccharomyces cerevisiae in terms of Acc, Sn, MCC, AUROC and AUPRC, and is only slightly inferior to iMethyl-deep in terms of Sp. However, this predictor does not perform well on Arabidopsis thaliana, Musculus and Homo sapiens. On Homo sapiens it is superior only to iRNA-Methyl in terms of Acc. On Arabidopsis thaliana it is the worst in terms of Acc, Sp and MCC. On Musculus it beats the compared models except m6A-NeuralTool and pm6A-CNN in terms of Acc. Although the PSP-PMI-based M6A-BiNP predictor is not as good as the one based on PSP-PJMI, it is still a comparatively good predictive model for identifying m6A sites.
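The four accuracy gains quoted above are consistent with relative (not absolute) differences between the Acc values in Table 4. A short check, with a helper name chosen here for illustration:

```python
def relative_gain(ours, best_other):
    """Relative improvement of `ours` over `best_other`, in percent."""
    return 100.0 * (ours - best_other) / best_other

# (Acc of M6A-BiNP with PSP-PJMI, Acc of the best competing predictor), from Table 4.
comparisons = {
    "Arabidopsis thaliana vs m6A-word2vec": (0.961, 0.905),
    "Musculus vs m6A-NeuralTool": (0.994, 0.958),
    "Homo sapiens vs m6A-NeuralTool": (0.986, 0.960),
    "Saccharomyces cerevisiae vs iMethyl-deep": (0.995, 0.892),
}

for label, (ours, other) in comparisons.items():
    print(f"{label}: {relative_gain(ours, other):.2f}%")
# prints 6.19%, 3.76%, 2.71% and 11.55%, in that order
```

In absolute terms the same gaps are 5.6, 3.6, 2.6 and 10.3 percentage points.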

Performance evaluation on the single-base resolution m6A datasets

This section will test the performance of our PSP-PMI and PSP-PJMI feature encoding algorithms and M6A-BiNP models based on PSP-PMI and PSP-PJMI respectively, on the single-base resolution datasets in Table 2. We first carry out experiments on the Human51 data which is based on miCLIP technique, then on the other single-base resolution data based on m6A-REF-seq technique in Table 2.

Performance comparison with RAM-NPPS model on Human51 dataset

Reference [31] reports only the AUROC and AUPRC of the RAM-NPPS model on Human51. To obtain its results on the other evaluation metrics, we re-implemented the RAM-NPPS prediction model. The experimental results of the M6A-BiNP and RAM-NPPS models on Human51 are shown in Table 5, where the best value of each criterion is displayed in bold.
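Table 5.

Comparison of M6A-BiNP models and RAM-NPPS on Human51 dataset

| Model | Acc | Sn | Sp | MCC | AUROC | AUPRC |
| --- | --- | --- | --- | --- | --- | --- |
| M6A-BiNP (PSP-PMI) | 0.711 | 0.733 | 0.689 | 0.423 | 0.782 | 0.772 |
| M6A-BiNP (PSP-PJMI) | 0.851 | 0.856 | 0.845 | 0.702 | 0.927 | 0.927 |
| RAM-NPPS | 0.722 | 0.733 | 0.710 | 0.443 | 0.794 | 0.785 |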
The experimental results in Table 5 show that our M6A-BiNP model based on the PSP-PJMI algorithm obtains the best performance on the single-base resolution Human51 dataset under every evaluation criterion. Although the M6A-BiNP model based on PSP-PMI is the worst of the three compared models in most cases on Human51, its performance is very similar to that of the RAM-NPPS model.

Performance comparison with existing models on datasets of Human, Mouse and Rat

The base-resolution data based on the m6A-REF-seq technique in Table 2 contain training and independent datasets. We trained the M6A-BiNP models on the 11 training datasets using 10-fold cross-validation, and compared their performance with that of the predictors iRNA-m6A [43], im6A-TS-CNN [42] and DNN-m6A [83]; the results are shown in Table 6. The M6A-BiNP models were then tested on the independent datasets and compared with iRNA-m6A, im6A-TS-CNN and DNN-m6A; the results are shown in Table 7. The best values of each criterion in Tables 6 and 7 are shown in bold.
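The evaluation protocol described above (10-fold cross-validation on each training set, followed by a test on a separate independent set) can be sketched with a plain index splitter. Shuffling with a fixed seed and the absence of stratification are assumptions of this sketch; the text does not state how the folds were drawn.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # assumed: shuffle once, fixed seed
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# Each sample is used for testing exactly once across the 10 folds.
for fold_id, (train, test) in enumerate(k_fold_indices(25, k=10)):
    print(fold_id, len(train), len(test))
```

The independent datasets play no part in this loop: a model fitted on the full training set is scored on them once, which is what Table 7 reports.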
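Table 6.

Performance comparison of our M6A-BiNP with iRNA-m6A, im6A-TS-CNN and DNN-m6A models on the training datasets

| Species | Tissues | Name | Methods | Acc | Sn | Sp | MCC | AUROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human | Brain | HB | M6A-BiNP (PSP-PMI) | 0.720 | 0.711 | 0.729 | 0.440 | 0.793 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.820 | 0.810 | 0.831 | 0.641 | 0.900 |
|  |  |  | iRNA-m6A | 0.713 | 0.748 | 0.662 | 0.410 | 0.776 |
|  |  |  | im6A-TS-CNN | 0.725 | 0.754 | 0.697 | 0.452 | 0.803 |
|  |  |  | DNN-m6A | 0.738 | 0.785 | 0.691 | 0.480 | 0.817 |
|  | Kidney | HK | M6A-BiNP (PSP-PMI) | 0.746 | 0.755 | 0.738 | 0.493 | 0.832 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.816 | 0.809 | 0.823 | 0.633 | 0.896 |
|  |  |  | iRNA-m6A | 0.790 | 0.809 | 0.763 | 0.570 | 0.863 |
|  |  |  | im6A-TS-CNN | 0.800 | 0.817 | 0.783 | 0.601 | 0.878 |
|  |  |  | DNN-m6A | 0.805 | 0.836 | 0.774 | 0.610 | 0.884 |
|  | Liver | HL | M6A-BiNP (PSP-PMI) | 0.775 | 0.769 | 0.781 | 0.550 | 0.856 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.874 | 0.874 | 0.874 | 0.748 | 0.951 |
|  |  |  | iRNA-m6A | 0.801 | 0.813 | 0.781 | 0.590 | 0.874 |
|  |  |  | im6A-TS-CNN | 0.802 | 0.797 | 0.799 | 0.599 | 0.881 |
|  |  |  | DNN-m6A | 0.813 | 0.822 | 0.804 | 0.630 | 0.891 |
| Mouse | Brain | MB | M6A-BiNP (PSP-PMI) | 0.732 | 0.744 | 0.720 | 0.464 | 0.818 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.772 | 0.768 | 0.775 | 0.544 | 0.858 |
|  |  |  | iRNA-m6A | 0.788 | 0.793 | 0.769 | 0.580 | 0.870 |
|  |  |  | im6A-TS-CNN | 0.787 | 0.815 | 0.759 | 0.575 | 0.871 |
|  |  |  | DNN-m6A | 0.794 | 0.818 | 0.770 | 0.590 | 0.878 |
|  | Heart | MH | M6A-BiNP (PSP-PMI) | 0.794 | 0.807 | 0.780 | 0.588 | 0.880 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.937 | 0.937 | 0.936 | 0.873 | 0.984 |
|  |  |  | iRNA-m6A | 0.728 | 0.752 | 0.690 | 0.440 | 0.795 |
|  |  |  | im6A-TS-CNN | 0.730 | 0.784 | 0.676 | 0.463 | 0.812 |
|  |  |  | DNN-m6A | 0.762 | 0.775 | 0.748 | 0.520 | 0.844 |
|  | Kidney | MK | M6A-BiNP (PSP-PMI) | 0.775 | 0.795 | 0.754 | 0.550 | 0.859 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.848 | 0.842 | 0.853 | 0.696 | 0.929 |
|  |  |  | iRNA-m6A | 0.800 | 0.826 | 0.773 | 0.600 | 0.873 |
|  |  |  | im6A-TS-CNN | 0.805 | 0.799 | 0.810 | 0.609 | 0.884 |
|  |  |  | DNN-m6A | 0.820 | 0.832 | 0.807 | 0.640 | 0.895 |
|  | Liver | ML | M6A-BiNP (PSP-PMI) | 0.728 | 0.755 | 0.701 | 0.456 | 0.813 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.851 | 0.856 | 0.845 | 0.702 | 0.927 |
|  |  |  | iRNA-m6A | 0.706 | 0.749 | 0.656 | 0.410 | 0.774 |
|  |  |  | im6A-TS-CNN | 0.713 | 0.724 | 0.702 | 0.429 | 0.795 |
|  |  |  | DNN-m6A | 0.736 | 0.776 | 0.696 | 0.470 | 0.814 |
|  | Testis | MT | M6A-BiNP (PSP-PMI) | 0.743 | 0.777 | 0.709 | 0.487 | 0.824 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.850 | 0.848 | 0.852 | 0.701 | 0.930 |
|  |  |  | iRNA-m6A | 0.744 | 0.781 | 0.700 | 0.480 | 0.816 |
|  |  |  | im6A-TS-CNN | 0.754 | 0.752 | 0.756 | 0.509 | 0.838 |
|  |  |  | DNN-m6A | 0.766 | 0.810 | 0.723 | 0.530 | 0.849 |
| Rat | Brain | RB | M6A-BiNP (PSP-PMI) | 0.785 | 0.784 | 0.786 | 0.570 | 0.869 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.926 | 0.918 | 0.935 | 0.853 | 0.980 |
|  |  |  | iRNA-m6A | 0.760 | 0.770 | 0.735 | 0.500 | 0.828 |
|  |  |  | im6A-TS-CNN | 0.766 | 0.790 | 0.742 | 0.538 | 0.847 |
|  |  |  | DNN-m6A | 0.783 | 0.791 | 0.775 | 0.570 | 0.868 |
|  | Kidney | RK | M6A-BiNP (PSP-PMI) | 0.781 | 0.792 | 0.771 | 0.563 | 0.868 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.875 | 0.867 | 0.883 | 0.750 | 0.946 |
|  |  |  | iRNA-m6A | 0.818 | 0.825 | 0.801 | 0.630 | 0.888 |
|  |  |  | im6A-TS-CNN | 0.825 | 0.842 | 0.808 | 0.650 | 0.902 |
|  |  |  | DNN-m6A | 0.834 | 0.843 | 0.825 | 0.670 | 0.910 |
|  | Liver | RL | M6A-BiNP (PSP-PMI) | 0.826 | 0.826 | 0.826 | 0.653 | 0.912 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.950 | 0.946 | 0.954 | 0.900 | 0.989 |
|  |  |  | iRNA-m6A | 0.809 | 0.831 | 0.763 | 0.600 | 0.877 |
|  |  |  | im6A-TS-CNN | 0.806 | 0.816 | 0.796 | 0.613 | 0.883 |
|  |  |  | DNN-m6A | 0.826 | 0.842 | 0.811 | 0.650 | 0.899 |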
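Table 7.

Performance comparison of our M6A-BiNP with iRNA-m6A, im6A-TS-CNN and DNN-m6A models on the independent datasets

| Species | Tissues | Name | Methods | Acc | Sn | Sp | MCC | AUROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human | Brain | HB | M6A-BiNP (PSP-PMI) | 0.708 | 0.746 | 0.670 | 0.417 | 0.779 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.767 | 0.580 | 0.954 | 0.576 | 0.894 |
|  |  |  | iRNA-m6A | 0.711 | 0.695 | 0.730 | 0.420 | 0.785 |
|  |  |  | im6A-TS-CNN | 0.727 | 0.752 | 0.702 | 0.454 | 0.806 |
|  |  |  | DNN-m6A | 0.733 | 0.750 | 0.715 | 0.470 | 0.815 |
|  | Kidney | HK | M6A-BiNP (PSP-PMI) | 0.694 | 0.883 | 0.506 | 0.419 | 0.807 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.682 | 0.964 | 0.400 | 0.441 | 0.879 |
|  |  |  | iRNA-m6A | 0.778 | 0.771 | 0.784 | 0.560 | 0.857 |
|  |  |  | im6A-TS-CNN | 0.792 | 0.800 | 0.785 | 0.585 | 0.873 |
|  |  |  | DNN-m6A | 0.799 | 0.832 | 0.766 | 0.600 | 0.878 |
|  | Liver | HL | M6A-BiNP (PSP-PMI) | 0.739 | 0.650 | 0.829 | 0.487 | 0.824 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.862 | 0.920 | 0.805 | 0.730 | 0.948 |
|  |  |  | iRNA-m6A | 0.790 | 0.782 | 0.799 | 0.580 | 0.868 |
|  |  |  | im6A-TS-CNN | 0.799 | 0.848 | 0.750 | 0.601 | 0.881 |
|  |  |  | DNN-m6A | 0.810 | 0.818 | 0.801 | 0.620 | 0.885 |
| Mouse | Brain | MB | M6A-BiNP (PSP-PMI) | 0.719 | 0.596 | 0.842 | 0.451 | 0.815 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.756 | 0.838 | 0.674 | 0.518 | 0.849 |
|  |  |  | iRNA-m6A | 0.783 | 0.772 | 0.794 | 0.570 | 0.861 |
|  |  |  | im6A-TS-CNN | 0.785 | 0.862 | 0.707 | 0.577 | 0.872 |
|  |  |  | DNN-m6A | 0.786 | 0.751 | 0.821 | 0.570 | 0.876 |
|  | Heart | MH | M6A-BiNP (PSP-PMI) | 0.774 | 0.651 | 0.898 | 0.566 | 0.881 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.838 | 0.681 | 0.996 | 0.712 | 0.983 |
|  |  |  | iRNA-m6A | 0.713 | 0.705 | 0.721 | 0.430 | 0.788 |
|  |  |  | im6A-TS-CNN | 0.736 | 0.758 | 0.714 | 0.472 | 0.816 |
|  |  |  | DNN-m6A | 0.751 | 0.773 | 0.730 | 0.500 | 0.834 |
|  | Kidney | MK | M6A-BiNP (PSP-PMI) | 0.765 | 0.707 | 0.822 | 0.533 | 0.854 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.832 | 0.906 | 0.758 | 0.672 | 0.925 |
|  |  |  | iRNA-m6A | 0.793 | 0.784 | 0.803 | 0.590 | 0.870 |
|  |  |  | im6A-TS-CNN | 0.808 | 0.805 | 0.810 | 0.615 | 0.886 |
|  |  |  | DNN-m6A | 0.809 | 0.812 | 0.806 | 0.620 | 0.889 |
|  | Liver | ML | M6A-BiNP (PSP-PMI) | 0.735 | 0.676 | 0.795 | 0.474 | 0.817 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.828 | 0.699 | 0.957 | 0.680 | 0.937 |
|  |  |  | iRNA-m6A | 0.688 | 0.678 | 0.699 | 0.380 | 0.762 |
|  |  |  | im6A-TS-CNN | 0.716 | 0.756 | 0.676 | 0.433 | 0.793 |
|  |  |  | DNN-m6A | 0.730 | 0.764 | 0.695 | 0.460 | 0.808 |
|  | Testis | MT | M6A-BiNP (PSP-PMI) | 0.746 | 0.836 | 0.657 | 0.501 | 0.832 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.851 | 0.857 | 0.845 | 0.702 | 0.928 |
|  |  |  | iRNA-m6A | 0.735 | 0.722 | 0.751 | 0.470 | 0.818 |
|  |  |  | im6A-TS-CNN | 0.762 | 0.835 | 0.689 | 0.529 | 0.847 |
|  |  |  | DNN-m6A | 0.771 | 0.801 | 0.742 | 0.540 | 0.854 |
| Rat | Brain | RB | M6A-BiNP (PSP-PMI) | 0.766 | 0.612 | 0.920 | 0.559 | 0.883 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.866 | 0.988 | 0.744 | 0.755 | 0.982 |
|  |  |  | iRNA-m6A | 0.751 | 0.739 | 0.765 | 0.500 | 0.827 |
|  |  |  | im6A-TS-CNN | 0.770 | 0.781 | 0.758 | 0.539 | 0.852 |
|  |  |  | DNN-m6A | 0.780 | 0.777 | 0.783 | 0.560 | 0.862 |
|  | Kidney | RK | M6A-BiNP (PSP-PMI) | 0.777 | 0.786 | 0.767 | 0.553 | 0.861 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.771 | 0.966 | 0.575 | 0.588 | 0.936 |
|  |  |  | iRNA-m6A | 0.814 | 0.802 | 0.828 | 0.630 | 0.897 |
|  |  |  | im6A-TS-CNN | 0.827 | 0.849 | 0.806 | 0.655 | 0.908 |
|  |  |  | DNN-m6A | 0.830 | 0.853 | 0.807 | 0.660 | 0.911 |
|  | Liver | RL | M6A-BiNP (PSP-PMI) | 0.829 | 0.857 | 0.801 | 0.659 | 0.912 |
|  |  |  | M6A-BiNP (PSP-PJMI) | 0.887 | 0.989 | 0.786 | 0.791 | 0.986 |
|  |  |  | iRNA-m6A | 0.799 | 0.777 | 0.823 | 0.600 | 0.876 |
|  |  |  | im6A-TS-CNN | 0.802 | 0.845 | 0.759 | 0.607 | 0.885 |
|  |  |  | DNN-m6A | 0.816 | 0.828 | 0.805 | 0.630 | 0.896 |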
The results in Table 6 show that our M6A-BiNP model based on the PSP-PJMI algorithm performs best in most cases, the exception being the MB dataset. Its performance is better than that of the SVM-based predictor iRNA-m6A and the deep-learning-based predictors im6A-TS-CNN and DNN-m6A in most cases. However, the M6A-BiNP model based on PSP-PMI is not as good as the one based on PSP-PJMI on these 11 training datasets; it outperforms iRNA-m6A, im6A-TS-CNN and DNN-m6A only on the MH, RB and RL datasets. The experimental results in Table 7 show that the PSP-PJMI-based M6A-BiNP model is superior to iRNA-m6A, im6A-TS-CNN, DNN-m6A and the PSP-PMI-based M6A-BiNP model on 8 of the 11 datasets in terms of most criteria. Its performance on the MB, HK and RK datasets is not good.
The above analyses show that our proposed M6A-BiNP model based on the PSP-PJMI algorithm obtains the best performance on the datasets produced by two different base-resolution techniques. Its performance is better than that of the m6A site prediction models based on deep learning frameworks. This demonstrates the effectiveness of the PSP-PJMI feature encoding algorithm proposed in this paper and supports our choice of the SVM classifier to build the prediction model. Although the M6A-BiNP model based on PSP-PMI does not perform as well as the one based on PSP-PJMI, it is still good enough and better than the other compared prediction models in most cases, which shows that PSP-PMI is a useful feature encoding algorithm. Furthermore, the PSP-PMI algorithm can be combined with other encoding algorithms to enhance its capability to extract informative features.

Conclusions

Two feature encoding algorithms named PSP-PMI and PSP-PJMI are proposed in this paper to extract features with more nucleotide position information and strong categorical information from RNA sequences. The bidirectional dinucleotide and trinucleotide position-specific propensities are proposed in PSP-PMI and PSP-PJMI based on PMI and our PJMI theories, respectively. The parameters α and β are introduced to represent the distance between the two nucleotides of a pair in PSP-PMI, and the distance from a nucleotide to its forward or backward consecutive dinucleotide in PSP-PJMI, respectively. The features corresponding to different α and β are, respectively, concatenated to form the high-dimensional features containing both local and global categorical information in PSP-PMI and PSP-PJMI. The SVM-based M6A-BiNP predictors are built on features encoded by PSP-PMI or PSP-PJMI. The 10-fold cross-validation experimental results on the m6A benchmark datasets, which include four species non-single-base resolution datasets and three species single-base resolution datasets obtained with two different m6A site detection techniques, demonstrate that the parameters α and β in PSP-PMI and PSP-PJMI help to extract features with much more categorical information from RNA sequences. There are few redundant features in the final features obtained by concatenating the features corresponding to different α in PSP-PMI and various β in PSP-PJMI, respectively. Our PSP-PMI and PSP-PJMI are superior to the state-of-the-art feature encoding algorithms in extracting features with a much better capability to identify m6A sites from RNA sequences. PSP-PJMI is better than PSP-PMI. The M6A-BiNP predictor based on the PSP-PJMI feature encoding algorithm outperforms the existing predictors for identifying m6A sites.
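To make the role of the distance parameter concrete, the sketch below shows only the indexing step for nucleotide pairs: for each position it looks up the partner nucleotide a fixed distance backward and forward, then concatenates the results for several distances. It does not compute the propensity or mutual-information scores of PSP-PMI. The function names, the use of None for out-of-range partners and the direction convention (forward meaning a higher index) are assumptions made for illustration.

```python
def spaced_partners(seq, gap):
    """Return (backward, forward) partner lists for every position of seq.

    backward[i] is the nucleotide `gap` positions before i and forward[i] the
    one `gap` positions after i; None marks partners outside the sequence.
    """
    n = len(seq)
    backward = [seq[i - gap] if i - gap >= 0 else None for i in range(n)]
    forward = [seq[i + gap] if i + gap < n else None for i in range(n)]
    return backward, forward

def concatenated_pairs(seq, gaps):
    """Concatenate the per-position (gap, backward, nucleotide, forward) tuples."""
    features = []
    for gap in gaps:
        backward, forward = spaced_partners(seq, gap)
        for nt, b, f in zip(seq, backward, forward):
            features.append((gap, b, nt, f))
    return features

# Toy 5-nt sequence; gap plays the role of the distance parameter.
pairs = concatenated_pairs("GGACU", gaps=[1, 2])
print(len(pairs))  # one tuple per position per gap value
```

Small gaps capture local context around a site, while larger gaps relate nucleotides that are far apart, which is the sense in which concatenating over several distances mixes local and global information.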
  72 in total

1.  pRNAm-PC: Predicting N(6)-methyladenosine sites in RNA sequences via physical-chemical properties.

Authors:  Zi Liu; Xuan Xiao; Dong-Jun Yu; Jianhua Jia; Wang-Ren Qiu; Kuo-Chen Chou
Journal:  Anal Biochem       Date:  2015-12-31       Impact factor: 3.365

2.  Phogly-PseAAC: Prediction of lysine phosphoglycerylation in proteins incorporating with position-specific propensity.

Authors:  Yan Xu; Ya-Xin Ding; Jun Ding; Ling-Yun Wu; Nai-Yang Deng
Journal:  J Theor Biol       Date:  2015-04-24       Impact factor: 2.691

3.  Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer.

Authors:  Xiaowei Zhao; Ye Zhang; Qiao Ning; Hongrui Zhang; Jinchao Ji; Minghao Yin
Journal:  J Theor Biol       Date:  2019-01-31       Impact factor: 2.691

4.  Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences.

Authors:  Zhen Chen; Pei Zhao; Fuyi Li; Yanan Wang; A Ian Smith; Geoffrey I Webb; Tatsuya Akutsu; Abdelkader Baggag; Halima Bensmail; Jiangning Song
Journal:  Brief Bioinform       Date:  2019-11-11       Impact factor: 11.622

5.  Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome.

Authors:  Bastian Linder; Anya V Grozhik; Anthony O Olarerin-George; Cem Meydan; Christopher E Mason; Samie R Jaffrey
Journal:  Nat Methods       Date:  2015-06-29       Impact factor: 28.547

6.  Gene2vec: gene subsequence embedding for prediction of mammalian N 6-methyladenosine sites from mRNA.

Authors:  Quan Zou; Pengwei Xing; Leyi Wei; Bin Liu
Journal:  RNA       Date:  2018-11-13       Impact factor: 4.942

7.  Single-base mapping of m6A by an antibody-independent method.

Authors:  Zhang Zhang; Li-Qian Chen; Yu-Li Zhao; Cai-Guang Yang; Ian A Roundtree; Zijie Zhang; Jian Ren; Wei Xie; Chuan He; Guan-Zheng Luo
Journal:  Sci Adv       Date:  2019-07-03       Impact factor: 14.136

8.  m6A-Atlas: a comprehensive knowledgebase for unraveling the N6-methyladenosine (m6A) epitranscriptome.

Authors:  Yujiao Tang; Kunqi Chen; Bowen Song; Jiongming Ma; Xiangyu Wu; Qingru Xu; Zhen Wei; Jionglong Su; Gang Liu; Rong Rong; Zhiliang Lu; João Pedro de Magalhães; Daniel J Rigden; Jia Meng
Journal:  Nucleic Acids Res       Date:  2021-01-08       Impact factor: 16.971

9.  Identification and analysis of the N(6)-methyladenosine in the Saccharomyces cerevisiae transcriptome.

Authors:  Wei Chen; Hong Tran; Zhiyong Liang; Hao Lin; Liqing Zhang
Journal:  Sci Rep       Date:  2015-09-07       Impact factor: 4.379

10.  iRNA-m7G: Identifying N7-methylguanosine Sites by Fusing Multiple Features.

Authors:  Wei Chen; Pengmian Feng; Xiaoming Song; Hao Lv; Hao Lin
Journal:  Mol Ther Nucleic Acids       Date:  2019-08-28       Impact factor: 8.886
