Literature DB >> 35521556

DeepMC-iNABP: Deep learning for multiclass identification and classification of nucleic acid-binding proteins.

Feifei Cui^1,2,3, Shuang Li⁴, Zilong Zhang^1,2,3, Miaomiao Sui⁵, Chen Cao⁶, Abd El-Latif Hesham⁷, Quan Zou^2,3.

Abstract

Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play vital roles in gene expression. Accurate identification of these proteins is crucial. However, there are two existing challenges: one is the problem of ignoring DNA- and RNA-binding proteins (DRBPs), and the other is a cross-predicting problem referring to DBP predictors predicting DBPs as RBPs, and vice versa. In this study, we proposed a computational predictor, called DeepMC-iNABP, with the goal of solving these difficulties by utilizing a multiclass classification strategy and deep learning approaches. DBPs, RBPs, DRBPs and non-NABPs as separate classes of data were used for training the DeepMC-iNABP model. The results on test data collected in this study and two independent test datasets showed that DeepMC-iNABP has a strong advantage in identifying the DRBPs and has the ability to alleviate the cross-prediction problem to a certain extent. The web-server of DeepMC-iNABP is freely available at http://www.deepmc-inabp.net/. The datasets used in this research can also be downloaded from the website.

Entities: Chemical

Keywords: DNA-binding protein; Deep learning; Multiclass classification; Nucleic acid-binding protein; RNA-binding protein

Year: 2022 PMID： 35521556 PMCID： PMC9065708 DOI： 10.1016/j.csbj.2022.04.029

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN： 2001-0370 Impact factor: 6.155

Introduction

Interactions between proteins and nucleic acids (general terms of DNA and RNA) define and regulate diverse cellular functions, such as splicing, translation, posttranscriptional modifications, protein synthesis, DNA transcription and replication [1], [2], [3], [4], [5], [6], [7]. They are also closely related to some diseases such as cancer [8], [9], [10], [11], [12] and involved in virus infection [13], [14]. Therefore, it is important to accurately identify the proteins that bind to nucleic acids (DNA or RNA). Nucleic acid-binding proteins (NABPs) are traditionally divided into proteins that have the ability to bind DNA or RNA, namely, DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs). However, there are some proteins that can bind to both DNA and RNA (DRBPs), and these types of protein also play an important role in gene expression [15], [16], [17]. Therefore, accurate identification of DBPs, RBPs and DRBPs is vitally important. In recent years, with the rapid development of high-throughput sequencing technologies and the reduction of sequencing costs, a large amount of bioinformatics data has been accumulated [18], [19]. Machine learning and deep learning techniques have been widely applied for the analysis of large amounts of biological data [20], [21], [22], [23], [24], [25]. Among them, protein sequences have shown exponential growth, which makes deep learning models for the prediction of nucleic acid-binding proteins from primary sequences strongly feasible [26], [27], [28]. Existing computational models treat DBP identification and RBP identification as two independent tasks, and most of them are binary-class prediction tasks. Because DBPs and RBPs have high similarities, the existing computational models generally have cross-predicting problems. Cross-prediction refers to the DBP predictor identifying true RBPs as DBPs, whereas the RBP predictor predicts true DBPs as RBPs. Moreover, they usually ignore the existence of DRBPs. That is because in these independent binary-class classification models, only DBPs or RBPs are used to build predictors of DNA binding or RNA binding, respectively. Thus, the predictors identified either “DBPs and non-DNA binding proteins (non-DBPs)”, or “RBPs and non-RNA binding proteins (non-RBPs)”. Recent studies [2], [29] on the prediction of nucleic acid-binding residues show that although the existing DNA-binding or RNA-binding residue predictors have strong predictive performance, they cannot discriminate DNA- and RNA-binding residues. Because independent predictors are built using exclusive data (either DNA-binding or RNA-binding residues), they are unable to learn to separate DNA-binding from RNA-binding residues. William et al. revealed through experiments that DRBPs are an important part of cellular proteins and play important cellular roles [17]. To solve the above problems, in this study, we proposed a new computational predictor called DeepMC-iNABP by applying a multiclass learning strategy to identify NABPs based on deep learning. Four categories, “DBP”, “RBP”, “DRBP” and “non-nucleic acid binding protein (non-NABP)”, were defined and used for learning the multiclass classification model. By applying the multiclass classification model based on a convolutional neural network and a recurrent neural network, the DeepMC-iNABP predictor predicts a protein as each of these classes. The DeepMC-iNABP predictor overcomes the problem of cross-predicting because both DBPs and RBPs are used for training the multiclass learning model. DeepMC-iNABP can not only identify nuclei acid-binding protein but also detect DBPs and RBPs, which will help with the functional annotation of proteins. Moreover, because DRBPs as separate class data are used for training, DRBPs can be identified by the DeepMC-iNABP predictor.

Material and methods

Data resources

In this study, an initial dataset containing 15,654 DBP chains, 15,009 RBP chains,1,218 DRBP chains and 146,900 non-NABP chains was first collected from the UnitProt Knowledgebase (UniProtKB) [30]. DBP chains were collected by searching reviewed data with the Gene Ontology (GO) annotation keywords “DNA binding” and “NOT RNA binding”, whereas RBP chains were collected by searching reviewed data with the GO annotation keywords “RNA binding” and “NOT DNA binding”. DRBP chains were obtained by searching reviewed data with the GO annotation keywords “DNA binding” and “RNA binding”. Non-NABP chains were collected by searching the reviewed data with the keyword “NOT nucleic acid binding”. Then, we deleted duplicate chains and filtered the chains with lengths of less than 40 or more than 1,000. The redundancy between each dataset was reduced by the BLASTP program [31] with the bit-score set as greater than 50.0. After data preprocessing, 12,471 DBP chains, 12,068 RBP chains,1,218 DRBP chains and 143,650 non-NABP chains were remained. Then, we randomly selected sequences data to form the final datasets containing 12,000 DBP chains, 12,000 RBP chains,1,200 DRBP chains and 12,000 non-NABP chains. We first split the final dataset into two (training and testing) at a ratio of 10:1. After this, we kept aside the test set and randomly chose 90% of our training data to be the actual training set and the remaining 10% to be the validation set. The statistical information of the training data, validation data and test data we collected is listed in Table 1.

Table 1

Data collected in this study for training, validation and testing.

Classes	Train data	Validation data	Test data	In total
DBP chains	9,720	1,080	1,200	12,000
RBP chains	9,720	1,080	1,200	12,000
DRBP chains	972	108	120	1,200
Non-NABP chains	9,720	1,080	1,200	12,000
In total	30,132	3,348	3,720	37,200

Abbreviations: DBP, DNA-binding protein; RBP, RNA-binding protein; DRBP, DNA- and RNA- binding protein; non-NABP, non-nucleic acid-binding protein.

Data collected in this study for training, validation and testing. Abbreviations: DBP, DNA-binding protein; RBP, RNA-binding protein; DRBP, DNA- and RNA- binding protein; non-NABP, non-nucleic acid-binding protein. In this study, we also used two independent test datasets (Table 2) to evaluate the performance of the DeepMC-iNABP predictor, including TEST474 [32] and DRBP206 [32]. Due to very little information found in the prior studies on the identification of DRBPs, TEST474 and DRBP206 are rare independent datasets containing DRBP data. A total of 175 DBP chains, 68 RBP chains, 8 DRBP chains and 233 non-NABP chains collected from the Swiss-Prot database to construct the TEST474 dataset. The DRBP206 dataset contains only 103 DRBP chains, and 103 non-NABP chains were also collected from the Swiss-Prot database.

Table 2

Independent test datasets.

Test datasets	DBPs	RBPs	DRBPs	non-NABP	In total
TEST474	175	68	8	233	474
DRBP206	103	0	0	103	206

Independent test datasets.

Multiclass learning strategy

RBP identification and DBP identification are usually considered two separate prediction tasks in most previous studies, resulting in cross-prediction problems. Although a few studies [33], [34] have focused on the cross-prediction problem for the identification of nucleic acid-binding protein, they obtained low recognition precision, possibly due to the limitations of datasets or sequence feature representations, such as small training datasets and manual selection of sequence features. Here, we treat the identification of NABPs as a multiclass learning task. Multiclass classification refers to the prediction of more than two classes; in fact, there are four protein categories containing DBPs, RBPs, DRBPs and non-NABPs. In multiclass classification, we first design the classifier model, then train the model using training data and validation data, and finally predict the collected test data or independent test data into multiple class labels. In this study, we applied the “one-vs.-all” multiclass classification technique. In one-vs.-all classification, the number of class labels present in the dataset and the number of generated binary classifiers must be the same [35], [36]. Thus, we create four classifiers here for four respective classes. For each data instance, our model will output one variable containing four different values (score) to show the probabilities of the input protein predicted as the corresponding class. The output value with the largest score will be taken as the class predicted by the model. Therefore, a protein can be identified as one of four categories of proteins (DBP, RBP, DRBP or non-NABP). In this study, we modelled a multiclass classification problem using deep neural networks, including convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). It should be noted that we applied one hot encoding to reshape the categorical variable (label variable) for the data instance to implement this deep multiclass classification model.

Network architecture of DeepMC-iNABP

The fundamental architecture of the DeepMC-iNABP model is shown in Fig. 1. The architecture outlines several parts: (1) deep representation learning for protein sequences, (2) the combination of the two neural networks (CNN and LSTM), and (3) a multiclass fully connected network. First, one-hot encoding and deep representation learning with neural networks were utilized to convert the protein sequence into a fixed-dimension matrix of vector embeddings. CNNs have the advantage of selecting good feature, and LSTMs have a good ability to learn sequential data, which is of great importance in protein structure and function prediction [37], [38], [39], [40], [41], [42], [43]. In this study, CNN and LSTM were combined for protein sequence feature learning [44], [45]. Moreover, a one-vs.-all multiclass fully connected network was employed to address the cross-prediction problem, in which each of the output nodes represents a different class. More specifically, the network of the DeepMC-iNABP model contains 15 layers, including an input layer, an embedding layer, 3 convolution layers, 3 pooling layers, 4 dropout layers, a long short -term layer, a dense layer, and an output layer.

Fig. 1

Fundamental architecture of the DeepMC-iNABP model.

Sequence representation learning

In protein-related tasks, deep learning approaches can directly learn rich data representations directly from primary sequences instead of manually extracting feature information for protein sequence numeric representation, as in traditional machine learning methods [46], [47], [48]. Deep representation learning in an end-to-end model [49], [50], [51] can be relatively easy to implement because the intermediate layers, including sequence embedding layers and other neural network layers for prediction between inputs and outputs, are trained as a whole part (i.e., treated as a “black box”). Here, we utilized the widely used end-to-end representation scheme, a one-hot encoding-based deep representation of protein sequences.

Combination of CNNs and LSTMs

In this study, we combine CNNs and LSTMs to conduct a multiclass classification model. Multiple convolutional filters slide over the embedded sequence matrix to produce a new feature map in convolutional layers. In the max-pooling layer, the maximum value is calculated as a feature corresponding feature to the specific filter. Then, the outputs of the max-pooling layer are inputted into the cell of the LSTM layer to learn the long-term dependencies of motifs (features of sequences). The output vectors of the LSTM cells are concatenated and become inputs to the dense layer. A softmax activation function is applied to generate the final outputs containing 4 output values (one for each class).

Loss function

The loss function assesses the distance between the predicted output and the actual output to evaluate the performance of the neural network during training. For multiclass classification tasks, the loss function of the neural network usually chooses categorical cross entropy, since one example can be considered to belong to a specific category with probability 1, and to other categories with probability 0. The categorical cross-entropy loss function calculates the loss of an example by computing the following sum:where is the i-th value in the model output, is the corresponding target value (true value), and N is the number of values in the model output (total number of classes).

Deep neural network parameters

As the length of input sequence was set to 1,000 (processing of padding or truncating), the number of nodes in the input layers of DeepMC-iNABP model was set as the number of the length of each sequence. As DeepMC-iNABP was designed to implement multi-class classification, the number of nodes in output layer was set just the same as the number of categories, i.e., 4. In embedding layer, the output dimensionality was set to 100 which showed the best performance after comparing the results of serval situations (64, 100, 110, 120). In DeepMC-iNABP, there are three convolutional layers, and the number of filters in each convolutional layer was set to 128, 64, 32; the number of kernels was set to 10, 5, 3, respectively. Max pooling is added to CNNs following individual convolutional layers. A dropout layer was designed after each convolutional part which helps to prevent overfitting, as the dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time. The dropout rate was set to 0.2. The output dimension of the LSTM layer was set to 70. All layers used rectified linear unit (ReLU) activation function except the output layer. The activation function of the output layer used softmax function. For the optimizer, we chose Adam, and the learning rate was set to 0.001. We chose categorical crossentropy loss function, and epoch was set to 100.

Performance evaluation

To evaluate the performance of the prediction, the assessment measurements used herein included accuracy, recall, precision and F1 score. These are defined as follows:where TP, TN, FN, and FP are the numbers of true positives, true negatives, false negatives, and false positives, respectively. Among these measures, recall indicates the accuracy of predicting positive samples, precision is the ratio of correctly predicted positive observations to total predicted positive observations, F1 score is the weighted average of precision and recall, and accuracy as the most intuitive performance measure is simply a ratio of correctly predicted observations to total observations. For all measures, the maximum value is 1, and the minimum value is 0. A value of 0 indicates the worst performance, which means that the predicted observations differ greatly from the actual observations, while a value of 1 indicates the best performance, which means that the predicted observations are very close to the actual observations. Moreover, an AU-ROC curve (area under the receiver operating characteristics curve) was used to visualize the performance of the multiclass classification problem in this study. The ROC curve is a graph showing the performance of a classification model at all classification thresholds, which can evaluate classifier output quality. ROC curves typically feature true-positive rates on the Y axis, and false-positive rates on the X axis, which means that a larger area under the curve (AUC) is usually better.

Results and discussion

Performance of the DeepMC-iNABP predictor on the test data and independent dataset

We evaluated the performance of DeepMC-iNABP on the test data we collected and two dependent datasets (TEST474 and DRBP206 datasets). The corresponding results are shown in Table 3 and Fig. 2.

Table 3

Performance evaluation of test data collected in this study.

	Precision	Recall	F1-score
DBP class	0.889	0.707	0.787
RBP class	0.804	0.819	0.812
DRBP class	0.926	0. 625	0.746
Non-NABP class	0.700	0.853	0.769
Average value	0.835	0.725	0.759

Abbreviations: DBP, DNA-binding protein; RBP, RNA-binding protein; DRBP, DNA- and RNA- binding protein; non-NABP, non-nucleic acid-binding protein.

Fig. 2

Performance of the DeepMC-iNABP model on the test dataset and independent test datasets. A and B. Confusion matrix and ROC-AUC curves of our model on the test dataset collected in this study, C and D. ROC-AUC curves of our model on independent test datasets (TEST474 and DRBP206). Class 0–3 in ROC-AUC curves refer to non-NABPs, DBPs, RBPs and DRBPs, respectively.

Performance evaluation of test data collected in this study. Abbreviations: DBP, DNA-binding protein; RBP, RNA-binding protein; DRBP, DNA- and RNA- binding protein; non-NABP, non-nucleic acid-binding protein. Performance of the DeepMC-iNABP model on the test dataset and independent test datasets. A and B. Confusion matrix and ROC-AUC curves of our model on the test dataset collected in this study, C and D. ROC-AUC curves of our model on independent test datasets (TEST474 and DRBP206). Class 0–3 in ROC-AUC curves refer to non-NABPs, DBPs, RBPs and DRBPs, respectively. The confusion matrix shown in Fig. 2A shows the results of multiclass classification on the test data collected in this study. It shows that 1024 of 1200 total sequences of non-NABPs (Class 0), 848 of 1200 total sequences of DBPs (Class 1), 983 of 1200 total sequences of RBPs (Class 2) and 75 of 120 total sequences of DRBP (Class 3) were correctly identified. Most multiclass data in the test data were correctly classified. From the confusion matrix (Fig. 2A), 64 and 112 sequences of the non-NABP class were identified as DBPs and RBPs, respectively; 241 sequences of the DBP class and 182 sequences of the RBP class were identified as non-NABPs. Furthermore, 109 sequences of the DBP class were identified as RBPs, while 31 sequences of the RBP class were identified as DBPs. However, few non-NABPs, DBPs and RBPs were identified as DRBPs. Moreover, the evaluation metrics, including precision, recall, F1 score and accuracy, were calculated, as shown in Table 3, to assess the performance of the DeepMC-iNABP predictor on the test data we collected. These results indicate that the evaluation values of DBP, RBP, DRBP and non-NABP are good, especially the classification of DRBP. Fig. 2B, 3C and 3D show the ROC curves and AUCs of DeepMC-iNABP on test data collected in this study, TEST474 and DRBP206, respectively. The confusion matrix (Fig. 2A), ROC-AUC curves (Fig. 2B) and evaluations (Table 3) indicate that the DeepMC-iNABP predictor achieves good results on the collected test data. Meanwhile, the ROC curves and AUCs in Fig. 2C and 3D also show that the DeepMC-iNABP model preforms well on the two independent test datasets.

Comparison with other methods on the independent dataset TEST474

For objective evaluation, we compared our model with other nucleic acid-binding protein classification methods on the independent test dataset TEST474. The predicted results are presented in Table 4 and Fig. 3.

Table 4

Comparison of DeepMC-iNABP and existing models on the independent dataset TEST474.

Model	DNA-binding			RNA-binding			DNA- and RNA-binding			non-NABP
Model	Recall	Precision	F1	Recall	Precision	F1	Recall	Precision	F1	Recall	Precision	F1
DeepDRBP-2La	0.817	0.877	0.846	0.456	0.620	0.525	0.125	0.043	0.065	0.888	0.832	0.859
iDRBP_MMC b	0.869	0.950	0.907	0.706	0.727	0.716	0.125	0.333	0.182	0.933	0.849	0.889
DeepMC-iNABP	0.834	0.869	0.851	0.588	0.714	0.645	0.500	0.800	0.615	0.892	0.812	0.850

The results were obtained using the webserver of DeepDRBP-2L [34].

The results were obtained using the webserver of iDRBP_MMC [32].

Fig. 3

Confusion matrix of DeepMC-iNABP and existing models on the independent dataset TEST474.

Comparison of DeepMC-iNABP and existing models on the independent dataset TEST474. The results were obtained using the webserver of DeepDRBP-2L [34]. The results were obtained using the webserver of iDRBP_MMC [32]. Confusion matrix of DeepMC-iNABP and existing models on the independent dataset TEST474. To date, there are almost no multiclass classifiers for NABPs. The iDRBP_MMC predictor [32] is actually composed of two binary classifiers (DBP predictor and RBP predictor), not a multiclass classifier. However, the combination of two classifiers can achieve the classification of non-NABP, DBP, RBP and DRBP. The DeepDRBP-2L predictor utilizes a two-level framework, in which the first level detects NABP or not and the second level further identifies the predicted NABP as DBP, RBP or DRBP [34]. In short, the DeepDRBP-2L predictor consists of a binary classifier (the first level) and a multiclass classifier (the second level). Because of the two-level strategy, it can finally achieve the classification of non-NABP, DBP, RBP and DRBP. Thus, we compared with these two existing nucleic acid-binding protein predictors. The evaluation metrics and predicted results have shown in Table 4, Fig. 3, Fig. 4A and 4C.

Fig. 4

Comparison of DeepMC-iNABP and existing models on the independent dataset. A and C. Independent dataset TEST474, B. Independent dataset DRBP206.

Comparison of DeepMC-iNABP and existing models on the independent dataset. A and C. Independent dataset TEST474, B. Independent dataset DRBP206. DeepMC-iNABP predicted 5 proteins as DRBPs, 4 of which are true DRBPs (Fig. 3C). The precision, F1 score and recall of DeepMC-iNABP are 0.80, 0.615 and 0.50, respectively. While iDRBP_MMC and DeepDRBP-2L predicted 3 and 23 proteins as DRBPs, respectively, but only one of them was a true DRBP (Fig. 3A and 4B). The precision, F1 score and recall of DRBPs in DeepMC-iNABP are far beyond those of iDRBP_MMC and DeepDRBP-2L. For the identification of DBPs and RBPs, DeepMC-iNABP predicted 168 proteins as DBPs, 146 of which were true DBPs, 6 of which were RBPs; 56 proteins were RBPs, of which 40 were true RBPs, and 6 were true DBPs. Whereas DeepDRBP-2L identified 7 proteins of true RBPs as DBPs, and 6 proteins of true DBPs as RBPs; iDRBP_MMC predicted 2 true RBPs as DBPs, and 4 true DBPs as RBPs. According to the results in this study, DeepMC-iNABP alleviates the cross-prediction problem to a certain extent. These results are likely to be related to the multiclass classification strategy adopted in this study, in which DBPs and RBPs as data of different classes were used for training the classification model. However, although a deep neural network-based model is able to automatically learn features from primary protein sequences, because similarity sequences exist between DBPs and RBPs, it may affect sequence representation learning, which may affect the identification of DBPs and RBPs. Further studies that take feature learning of similar sequences for deep neural networks into account will need to be undertaken. For the prediction of all proteins, including DBPs, RBPs, DRBPs and non-NABPs, the recall, precision and F1 score of DeepMC-iNABP were much better than those of the DeepDRBP-2L predictor, as shown in Table 4. Compared with the iDRBP_MMC predictor, our model outperforms it at the identification of DRBPs, although it is slightly closer to it at predicting the DBPs, RBPs and non-NABPs. The results show that the identification of DRBPs is a strong point of the DeepMC-iNABP predictor, rather than ignoring the existence of DRBPs like other methods.

Performance for identifying DNA- and RNA-binding proteins

For the purpose of comprehensively evaluating the ability of the DeepMC-iNABP predictor to identify DRBPs, we utilized another independent dataset called DRBP206 which contains only DRBPs and non-NABPs. Fig. 4B presents the performance of DeepMC-iNABP and the existing predictors on the DRBP206 dataset. According to the comparison, the accuracy, recall and F1 score of DeepMC-iNABP on the DRBP206 test dataset were quite higher than those of DeepDRBP-2L and iDRBP_MMC. The precision values of DeepMC-iNABP were slightly better or almost equal to those of iDRBP_MMC but were even better than those of DeepDRBP-2L. Taken together, these results tested on the DRBP206 dataset suggest that DeepMC-iNABP does have a strong advantage in identifying DRBPs. Moreover, feature visualization of represented sequence data also indicates that DeepMC-iNABP model classified the four types of protein sequences well, especially the DRBPs as shown in Fig. 5.

Fig. 5

Feature visualization of DeepMC-iNABP by t-SNE for dimension reduction. A. Feature representation of test dataset collected in this study, B. Feature representation of independent dataset TEST474, C. Feature representation of independent dataset DRBP206. In the current study, comparing our model with some NABP predictors showed that DeepMC-iNABP can successfully identify DRBPs, while few predictors, such as DeepDRBP-2L and iDRBP_MMC introduced above, may have the ability to recognize DRBPs but perform very poorly. A possible explanation for this might be that DeepDRBP-2L or iDRBP_MMC applied binary classification or multilabel classification, whereas the DeepMC-iNABP predictor employed a multiclass classification scheme, and DRBP data as a separate class were used to train the DeepMC-iNABP model.

Conclusions

There are two challenges in the prediction of NABP: one is the problem of ignoring DRBPs, and the other is the cross-predicting problem. Very little was found in the prior studies. In this study, we focused on these two problems, and proposed a NABP predictor, called DeepMC-iNABP, with the goal of solving these difficulties by utilizing a multiclass classification strategy and deep learning approaches. The classic deep learning model, the architecture of which contains one-hot encoding-based sequence representation and the combination of deep neural networks (CNN and LSTM), was constructed for identifying the NABPs. DBPs, RBPs, DRBPs and non-NABPs were used as separate classes of data for training the DeepMC-iNABP model. The results on several test datasets showed that DeepMC-iNABP has a strong advantage in identifying DRBPs and alleviates the cross-prediction problem to a certain extent. Moreover, the web server of DeepMC-iNABP () was provided, which will be useful for researchers in the field of protein-nucleic acid interactions.

Code availability

The web server of DeepMC-iNABP, data resource and codes are freely available from .

Funding

The work was supported by the National Natural Science Foundation of China (No. 61922020, No. 62102064 and No.62101100) and the Special Science Foundation of Quzhou (2021D004).

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

42 in total

1. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.

Authors: Babak Alipanahi; Andrew Delong; Matthew T Weirauch; Brendan J Frey
Journal: Nat Biotechnol Date: 2015-07-27 Impact factor: 54.908

2. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.

Authors: Emmanuel Boutet; Damien Lieberherr; Michael Tognolli; Michel Schneider; Parit Bansal; Alan J Bridge; Sylvain Poux; Lydie Bougueleret; Ioannis Xenarios
Journal: Methods Mol Biol Date: 2016

3. Prediction of bio-sequence modifications and the associations with diseases.

Authors: Chunyan Ao; Liang Yu; Quan Zou
Journal: Brief Funct Genomics Date: 2020-12-14 Impact factor: 4.241

4. Goals and approaches for each processing step for single-cell RNA sequencing data.

Authors: Zilong Zhang; Feifei Cui; Chunyu Wang; Lingling Zhao; Quan Zou
Journal: Brief Bioinform Date: 2021-07-20 Impact factor: 11.622

5. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning.

Authors: Feifei Cui; Zilong Zhang; Quan Zou
Journal: Brief Funct Genomics Date: 2021-03-02 Impact factor: 4.241

6. A comprehensive review of the imbalance classification of protein post-translational modifications.

Authors: Lijun Dou; Fenglong Yang; Lei Xu; Quan Zou
Journal: Brief Bioinform Date: 2021-04-08 Impact factor: 11.622

7. Prediction of antioxidant proteins using hybrid feature representation method and random forest.

Authors: Chunyan Ao; Wenyang Zhou; Lin Gao; Benzhi Dong; Liang Yu
Journal: Genomics Date: 2020-08-17 Impact factor: 5.736

8. Deepred-Mt: Deep representation learning for predicting C-to-U RNA editing in plant mitochondria.

Authors: Alejandro A Edera; Ian Small; Diego H Milone; M Virginia Sanchez-Puerta
Journal: Comput Biol Med Date: 2021-07-27 Impact factor: 4.589

9. DeepDRBP-2L: A New Genome Annotation Predictor for Identifying DNA-Binding Proteins and RNA-Binding Proteins Using Convolutional Neural Network and Long Short-Term Memory.

Authors: Jun Zhang; Qingcai Chen; Bin Liu
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2021-08-06 Impact factor: 3.710

Review 10. Protein-DNA/RNA interactions: Machine intelligence tools and approaches in the era of artificial intelligence and big data.

Authors: Feifei Cui; Zilong Zhang; Chen Cao; Quan Zou; Dong Chen; Xi Su
Journal: Proteomics Date: 2022-02-13 Impact factor: 3.984