
A survey on deep learning in DNA/RNA motif mining.

Ying He¹, Zhen Shen¹, Qinhu Zhang¹, Siguo Wang¹, De-Shuang Huang².

Abstract

DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- and RNA-protein binding sites, which helps to understand the mechanisms of gene regulation and management. Over the past few decades, researchers have worked on designing efficient and accurate algorithms for motif mining. These algorithms can be roughly divided into two categories: enumeration approaches and probabilistic methods. In recent years, machine learning methods have made great progress, and deep learning algorithms in particular have achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN-RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing, the features of existing deep learning architectures and the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that current methods are relatively simple compared with those in other fields such as computer vision, natural language processing (NLP) and computer games. Therefore, it is necessary to summarize deep learning in motif mining to help researchers understand this field.


Keywords: convolutional neural network; deep learning; motif mining; protein binding site; recurrent neural networks

Year: 2021 PMID: 33005921 PMCID: PMC8293829 DOI: 10.1093/bib/bbaa229

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

Motifs play a key role in regulating gene expression at both the transcriptional and posttranscriptional levels. DNA/RNA motifs are involved in many biological processes, including alternative splicing, transcription and translation [1-4]. From the late 1990s to the early 21st century, researchers gradually identified, through biological experiments, a large number of proteins with binding functions and their corresponding binding sites on genome sequences. The binding sites of the same protein are conserved short sequences regarded as motifs, and conserved sequences were initially used to describe protein binding sites [5-8]. As the understanding of motifs deepened, various motif mining algorithms emerged [9]. Early motif mining methods are mainly divided into two principal types: enumeration methods and probabilistic methods [10]. The first class is based on simple word enumeration. The Yeast Motif Finder (YMF) algorithm, developed by Sinha et al. [11], used a consensus representation to detect short motifs with a small number of degenerate positions in the yeast genome. YMF has two main steps: the first enumerates all motifs in the search space, and the second calculates the z-score of every motif to find the one with the greatest score. Bailey proposed the discriminative regular expression motif elicitation algorithm, which calculates the significance of motifs using Fisher’s exact test [12]. To speed up word enumeration-based motif mining, special techniques such as suffix trees and parallel processing were used [13]. Other motif mining algorithms, such as LMMO [14], DirectFS [9], ABC [15], DiscMLA [16], CisFinder [12], Weeder [17], Fmotif [18] and MCES [19], also used this idea.

In probabilistic motif mining methods, a probabilistic model that needs only a few parameters is constructed [20]. These methods provide a distribution over bases for each site in the binding region to determine whether a motif is present [21]. They usually build this distribution from a position-specific scoring matrix (PSSM), also called a position weight matrix (PWM) or motif matrix [22]. A PWM is an m by n matrix (m is the length of a specific protein binding site and n is the number of nucleotide base types) that indicates the degree of preference of a specific protein binding motif at each position [23]. As Figure 1 shows, a PWM can intuitively express the binding preference of a specific protein with few parameters, so given a set of binding sites for a specific protein, the parameters of the PWM can be learned from these data. Methods based on PWMs include MEME [11], STEME [24], EXTREME [25], AlignACE [26] and BioProspector [27].
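To make the PWM idea concrete, the following minimal sketch builds a position frequency matrix from a few aligned sites, converts it to a log-odds PWM and scores sequence windows. The toy sites, the pseudocount of 1, the uniform background of 0.25 and all function names are assumptions made for illustration; they are not taken from any of the cited tools.

```python
import numpy as np

BASES = "ACGT"  # column order assumed for this sketch

def pfm_from_sites(sites):
    """Count how often each base occurs at each position of aligned sites."""
    pfm = np.zeros((len(sites[0]), len(BASES)))
    for site in sites:
        for pos, base in enumerate(site):
            pfm[pos, BASES.index(base)] += 1
    return pfm

def pwm_from_pfm(pfm, pseudocount=1.0, background=0.25):
    """Turn counts into log2 odds against a uniform background (assumed)."""
    smoothed = pfm + pseudocount
    probs = smoothed / smoothed.sum(axis=1, keepdims=True)
    return np.log2(probs / background)

def score_window(pwm, window):
    """Sum the PWM entries selected by the bases of one window."""
    return float(sum(pwm[pos, BASES.index(b)] for pos, b in enumerate(window)))

def best_window(pwm, sequence):
    """Slide the PWM along the sequence and return (start, score) of the best hit."""
    m = pwm.shape[0]
    scores = [score_window(pwm, sequence[i:i + m])
              for i in range(len(sequence) - m + 1)]
    best = int(np.argmax(scores))
    return best, scores[best]

# Toy aligned binding sites (hypothetical data, m = 8 positions, n = 4 bases).
sites = ["TGACGTCA", "TGACGTAA", "TGACGCAA", "TTACGTCA"]
pfm = pfm_from_sites(sites)
pwm = pwm_from_pfm(pfm)
start, score = best_window(pwm, "CCTGACGTCACC")
```

The matrix has one row per motif position and one column per base, matching the m by n layout described above; the pseudocount only prevents taking the logarithm of zero for bases never observed at a position.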

Figure 1

The process of generating the PSSM, position frequency matrix (PFM) and logo of SPI1 [104]. First, a PFM is generated from the number of times each nucleotide appears at each position of the alignment. The PFM is then converted into a log-scale PSSM/PWM. The score of any DNA sequence window with the same length as the matrix is calculated by adding the corresponding nucleotide values of the PSSM, and the motif can be drawn as a logo.

ChIP-seq and high-throughput sequencing have tremendously increased the amount of in vivo data available [28], which makes it possible to study motif mining with deep learning [29]. Although deep learning methods are not yet numerous in bioinformatics, they are on the rise [30]. Known applications include DNA methylation [31, 32], protein classification [33-35], splicing regulation and gene expression [36-38] and biological image analysis tasks [39-42]. Of particular relevance to our work is the development of applications for motif mining, such as DNA-/RNA-protein binding sites [43], chromatin accessibility [36, 44–46], enhancers [47-49] and DNA shape [50, 51]. DeepBind [43] was the first study to apply deep learning to motif mining. As Figure 2 shows, DeepBind uses a CNN to predict DNA-protein and RNA-protein binding sites in a way that machine learning or genomics researchers can easily understand. It treats a genome sequence window like an image. Unlike an image composed of pixels with three color channels (R, G, B), the genomic sequence is treated as a fixed-length sequence window with four channels (A, C, G, T) or (A, C, G, U). The problem of predicting DNA-protein binding sites therefore resembles binary image classification.

Figure 2

The parallel training process of DeepBind [43]. (A) The DeepBind model processes five independent sequences in parallel. The data first pass through the convolutional layer to extract features and then through the pooling layer to optimize the features. Finally, the features pass through the activation function to output the prediction, which is compared with the target to calculate the loss and update the weights to improve prediction accuracy. (B) The dataset is divided into a training set, a validation set and a test set, which are used to calculate the training AUC (area under the curve), validation AUC and test AUC, respectively, to select the optimal parameters.

After this, a series of studies on deep learning in motif mining appeared. Some researchers focused on the impact on motif mining of various deep learning parameters, such as the number of layers [52]. Others explored new deep learning frameworks, adding a long short-term memory (LSTM) layer to DeepBind to obtain a model that combines CNN and RNN for motif mining [53]. There are also methods such as iDeepS that combine CNN and RNN to target specific RNA binding proteins (RBPs) [54]. The advantage of combining RNN and CNN is that the added RNN layer can capture long-term dependencies between sequence features by learning from the features extracted by the CNN layer, which improves prediction accuracy. Other researchers used a pure RNN-based method: KEGRU [55] creates an internal state of the network using a k-mer representation and an embedding layer, and captures long-term dependencies with a layer of bidirectional gated recurrent units (bi-GRUs). Many researchers have also built on these three basic models, for example, Xiaoyong Pan [56], Qinhu Zhang [51, 57], Wenxuan Xu [58], Dailun Wang [59] and Wenbo Yu [60]. Although there are currently many deep learning methods in motif mining, they are still relatively primitive and simple compared with deep learning methods in computer vision and NLP, such as those for images [61, 62], video [63] and question answering [64]. Therefore, it is necessary to summarize deep learning in motif mining to help researchers better understand the field. In this paper, we introduce the basic biological background of motif mining, provide insights into the differences between the basic deep learning models CNN and RNN, and discuss some new trends in the development of deep learning. We hope this article helps researchers who lack a background in deep learning or biology to quickly understand motif mining. The remainder of this paper is organized as follows. The second section describes the basic biological background, several common databases and the basic knowledge of motifs. The third section describes different deep learning models for DNA/RNA motif mining. Finally, the fourth section discusses new developments and challenges in deep learning for motif mining and possible future directions.

Basic Knowledge of Motif

In this section, we introduce some basic knowledge of motif mining. Motif mining (or motif discovery) in biological sequences can be defined as the problem of finding a set of short, similar, conserved sequence elements (‘motifs’) with common biological functions [65]. Motif mining, for example of transcription factor binding sites (TFBSs), has been a widely studied problem in bioinformatics because of its biological and bioinformatics significance [66, 67]. Figure 3 shows how multiple sequences are recognized by the same transcription factor (CREB). Their ‘consensus’ gives, at each position, the nucleotide preferred by the transcription factor. Since transcription factor binding tolerates approximate matches, all oligos that differ from the consensus sequence by up to a maximum number of nucleotide substitutions can be considered valid instances of the same TFBS.

Figure 3

A set of binding sites recognized by the same TF (CREB) [65]. Zambelli built the ‘consensus’ (bottom left) by counting the frequency of each nucleotide at each position of the sites and taking the most frequent one [65]. The sites can also be summarized by a ‘degenerate’ consensus, which includes positions with no obvious nucleotide preference (K = G or T; M = A or C; N = any nucleotide; according to IUPAC codes [105]). In addition, the sites can be converted into an alignment matrix of nucleotide frequencies (top right) by dividing each column by the number of sites used, and into a ‘sequence logo’ (bottom left) [106] showing nucleotide conservation and the corresponding information content.

Figure 4

Sequence representation of motif mining [78]. It shows two data preprocessing methods (bottom left) and three architectures: CNN-only (left), RNN-only (center) and hybrid CNN–RNN models (right).

The simplest method is one-hot encoding, which is often used to indicate the state of a state machine [71]. For example, one-hot codes encode DNA sequences as binary vectors: A = (1,0,0,0), G = (0,1,0,0), C = (0,0,1,0) and T = (0,0,0,1). RNA sequences can be encoded in the same way by changing T to U. One-hot encoding is easy to design and modify, and illegal states are easy to detect. However, it is sparse and context-free. Another method is to tokenize the sequence into k-mers and vectorize them by embedding [44]. For example, the DNA sequence ‘ATCGCGTACGATCCG’ can be tokenized into different k-mers, as shown in Table 1. The k-mers can be vectorized with embedding methods widely used in the NLP field [72], such as word2vec [73]. RNA sequences can be represented similarly.
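The one-hot scheme described above can be sketched in a few lines. The channel order A, G, C, T follows the example vectors given in the text; the function name and the choice to leave unknown symbols such as N as all-zero rows are assumptions of this sketch.

```python
import numpy as np

CHANNELS = "AGCT"  # channel order taken from the example vectors in the text

def one_hot(sequence, rna=False):
    """Encode a DNA (or RNA) string as a (length, 4) binary matrix."""
    alphabet = CHANNELS.replace("T", "U") if rna else CHANNELS
    encoded = np.zeros((len(sequence), len(alphabet)), dtype=np.float32)
    for pos, base in enumerate(sequence.upper()):
        idx = alphabet.find(base)
        if idx >= 0:  # assumption: unknown symbols (e.g. N) stay all-zero
            encoded[pos, idx] = 1.0
    return encoded

dna = one_hot("AGCT")            # each base activates exactly one channel
rna = one_hot("ACGU", rna=True)  # same scheme with U in place of T
```

Each row has at most one non-zero entry, which illustrates why the representation is sparse and carries no information about neighbouring bases.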

Table 1

Different parameters for k-mers

Length	Window	Tokenized	Vectorization
3	3	ATC GCG TAC GAT CCG	0321 3412 4532 4214
4	4	ATCG CGTA CGAT	0123 3412 4532
5	5	ATCGC GTACG ATCCG	4124 5124 2134
4	2	ATCG CGCG CGTA TACG CGAT ATCC	2563 3124 4236 3578 2145
4	3	ATCG GCGT TACG GATC	4252 5134 2136 3451 2411

It shows how the DNA sequence ‘ATCGCGTACGATCCG’ is cut into different k-mers, and their vectors, when the length is (3,4,5,4,4) and the window is (3,4,5,2,3).
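The tokenization in Table 1 can be reproduced with a short sketch, reading the "Window" column as the step between consecutive k-mers (an interpretation that matches every row of the table). The vocabulary indices and the random embedding matrix below are placeholders standing in for a trained embedding such as word2vec, so they do not reproduce the values in the Vectorization column; all names are illustrative.

```python
import numpy as np

def kmer_tokens(sequence, k, stride):
    """Cut a sequence into k-mers of length k, moving `stride` bases each step."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

def build_vocab(tokens):
    """Assign each distinct k-mer an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

seq = "ATCGCGTACGATCCG"
tokens = kmer_tokens(seq, k=3, stride=3)

# Placeholder embedding: random vectors, NOT a trained word2vec model.
vocab = build_vocab(tokens)
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 8))   # 8 dimensions chosen arbitrarily
vectors = embedding[[vocab[t] for t in tokens]]
```

With a stride smaller than k the k-mers overlap, which produces more tokens per sequence, as in the fourth and fifth rows of the table.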


Deep Learning in Motif Mining

In recent years, deep learning has achieved great success in various application scenarios, which has led researchers to apply it to DNA and RNA motif mining. There are three main types of deep learning frameworks in motif mining: CNN-based models (Figure 4, left), RNN-based models (Figure 4, center) and hybrid CNN–RNN-based models (Figure 4, right). We introduce these models in detail below and summarize several classic deep learning methods in motif mining in Table 2.

Table 2

Deep learning algorithm in DNA motif mining

Model	DeepBind	DeepSNR	DeepSEA	Dilated	DanQ	BiRen	KEGRU	iDeepS
Architecture	CNN	CNN	CNN	CNN	CNN + RNN	CNN + RNN	RNN	CNN + RNN
Embedding	NO	NO	NO	NO	NO	NO	YES	NO
Input	One-hot	One-hot	One-hot	One-hot	One-hot	k-mer	k-mer	One-hot

It shows the architecture, embedding and input of eight classic deep learning models in motif mining.

DeepBind [43] was the first attempt to use a CNN to predict DNA or RNA motifs from raw DNA or RNA sequences. DeepBind used a single CNN layer, consisting of one convolutional layer followed by rectification and pooling operations, with one fully connected network (FCN) added at the end to transform the feature vectors into a scalar binding score. It set a precedent for deep learning in motif mining and provides a basic framework for other deep learning methods. It maps the four bases to four channels, similar to the RGB channels of a color image, and uses one-hot encoding for vectorization. Many subsequent methods build their models on this design. DeepSEA [38] is a CNN-based deep learning method that uses three convolutional layers with 320, 480 and 960 kernels, respectively. Higher-level convolutional layers receive input from a larger spatial range and can represent more complex features. DeepSEA adds an FCN layer on top of the third convolutional layer, in which every neuron receives input from all outputs of the previous layer, so that information from the entire sequence can be used. The convolutional part of the DeepSEA model consists of three convolutional layers and two max-pooling layers, applied in alternating order to learn the motifs. DeepSNR [74] is also a CNN-based deep learning method. The convolutional part of the DeepSNR model has the same structure as the DeepBind network, but DeepSNR adds a deconvolution network that mirrors the convolution network: the convolution network reduces the size of the activations, and the deconvolution network enlarges them through combinations of unpooling and deconvolution operations. Dilated [75] is a deep learning method based on a multilayer dilated CNN. 
This method learns a mapping from the nucleotide sequence of a DNA region to the positions of regulatory markers in that region. Dilated convolutions can capture a hierarchical representation of a much larger input space than standard convolutions, which allows them to scale to larger sequence contexts. DanQ [53] uses a single-layer CNN followed by a bidirectional LSTM (BLSTM). The first layer of the DanQ model scans the sequence for motif positions through convolutional filtering. The convolutional part of DanQ is much simpler than that of DeepSEA: it contains one convolutional layer and one max-pooling layer to learn the motifs. The max-pooling layer is followed by the BLSTM layer. Motifs can follow a regulatory grammar governed by physical constraints that determine the spatial arrangement and frequency of motif combinations in vivo, a feature associated with tissue-specific functional elements such as enhancers; this is why the LSTM layer follows the max-pooling layer. The last two layers of the DanQ model are a dense layer of rectified linear units and a multitask sigmoid output layer, similar to the DeepSEA model. The advantage of combining RNN and CNN is that the added RNN layer can capture long-term dependencies between sequence features by learning from the features extracted by the CNN layer, which improves prediction accuracy. BiRen [49] is a hybrid deep learning architecture that combines the sequence encoding and representation capabilities of CNNs with the ability of bidirectional recurrent neural networks to process long DNA sequences. BiRen was trained on a limited set of experimentally verified enhancer elements from the VISTA Enhancer Browser [76], whose enhancer activity had been evaluated in transgenic mice. 
BiRen can learn regulatory codes directly from genomic sequences and was reported to show high recognition accuracy, robustness to noisy data and generalization to other species when compared with enhancer predictors based on sequence features such as k-mers. BiRen enables researchers to gain a deeper understanding of the regulatory code of enhancer sequences. KEGRU [55], which uses a GRU layer and k-mer embedding, is a pure RNN model without a CNN layer. KEGRU mainly uses the k-mer representation and the embedding layer to take over the feature extraction performed by CNN layers in other models. This structure helps it model relationships within the sequence and achieve good results in motif mining. iDeepS [54] uses CNNs and a BLSTM network to simultaneously identify binding sequence and structure motifs from RNA sequences. The CNN module embedded in iDeepS can also automatically capture interpretable binding motifs of RBPs. The BLSTM network allows the iDeepS framework not only to achieve better performance on binding sequences but also to capture structure motifs easily. Model selection may be the most challenging step in deep learning because the performance of deep learning algorithms is very sensitive to the choice of parameters [77]. deepRAM [78] provides implementations of several existing architectures and their variants: DeepBind (single-layer CNN), DeepBind* (multilayer CNN), DeepBind-E* (multilayer CNN, k-mer embedding), DanQ (single-layer CNN, bidirectional LSTM), DanQ* (multilayer CNN, bidirectional LSTM), Dilated (multilayer dilated CNN), KEGRU (k-mer embedding, single-layer GRU), ECLSTM (k-mer embedding, single-layer CNN and LSTM) and ECBLSTM (k-mer embedding, single-layer CNN and bidirectional LSTM). Its authors conducted extensive experimental comparisons, which give researchers a deeper understanding of these methods. 
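The single-layer CNN design described above for DeepBind (convolution, rectification, pooling, then a fully connected layer producing a scalar score) can be illustrated with a minimal NumPy forward pass. This is a sketch of the general pattern only, not the published DeepBind implementation: the hand-set filter, the weights, the global max pooling and all names are assumptions chosen so that the arithmetic is easy to follow, and no training is performed.

```python
import numpy as np

def one_hot_acgt(sequence):
    """(length, 4) one-hot matrix with channels in A, C, G, T order (assumed)."""
    encoded = np.zeros((len(sequence), 4))
    for pos, base in enumerate(sequence):
        encoded[pos, "ACGT".index(base)] = 1.0
    return encoded

def conv1d_relu(x, filters, bias):
    """Slide each (width, 4) filter along x and apply rectification (ReLU)."""
    n_filters, width, _ = filters.shape
    out_len = x.shape[0] - width + 1
    out = np.empty((out_len, n_filters))
    for i in range(out_len):
        window = x[i:i + width]
        # Dot product of every filter with the current window.
        out[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)

def binding_score(x, filters, bias, w_out, b_out):
    """Convolution -> ReLU -> global max pooling -> linear layer -> scalar."""
    pooled = conv1d_relu(x, filters, bias).max(axis=0)
    return float(pooled @ w_out + b_out)

# One hand-set filter that matches the hypothetical motif ACGT exactly.
filters = one_hot_acgt("ACGT")[None, :, :]   # shape (1, 4, 4)
bias = np.zeros(1)
w_out, b_out = np.ones(1), 0.0

score_with_motif = binding_score(one_hot_acgt("TTACGTTT"), filters, bias, w_out, b_out)
score_without = binding_score(one_hot_acgt("TTTTTTTT"), filters, bias, w_out, b_out)
```

In a trained model the filters play the role of learned motif detectors, which is why convolutional filters are often inspected as PWM-like patterns; the hybrid models discussed above would feed the pooled feature sequence into a recurrent layer instead of pooling over the whole sequence at once.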
Before introducing the experimental results of deepRAM [78], we introduce the two groups of datasets used in the experiments. The first group consists of DNA datasets: 83 ChIP-seq datasets from the ENCODE project [70]. The second group consists of RNA datasets: 31 CLIP-seq datasets for 19 proteins [79-81]. deepRAM [78] ran a large number of experiments on these two groups of datasets and provided an in-depth comparison of the deep learning models above. The results of the models on these datasets are shown in Figure 5.

Figure 5

Comparison results of nine deep learning models [78]. It compares the performance of these models on DNA and RNA motif mining tasks. (A) The AUC distribution of the nine models on 83 ChIP-seq datasets. (B) P-value annotated heat map of pairwise comparisons of the nine models on 83 ChIP-seq datasets. (C) The AUC distribution of the nine models on 31 CLIP-seq datasets. (D) P-value annotated heat map of pairwise comparisons of the nine models on 31 CLIP-seq datasets.

Among all models, ECBLSTM performed best, with a median AUC of 0.930 on the ChIP-seq data and 0.951 on the CLIP-seq data. DeepBind, the simplest model considered here, uses one-hot sequence encoding and a single convolutional layer; its median AUC on the two groups of datasets was 0.902 and 0.914, respectively. Comparing ECBLSTM with DeepBind-E* shows that adding an LSTM layer can further improve performance, because LSTM layers are better than CNN layers at capturing long-term dependencies. Both DeepBind* and DeepBind-E* improve on the original DeepBind. Comparing DanQ with DanQ* further shows that models deeper than a single-layer CNN tend to perform better. These results demonstrate the performance advantages of deeper and more complex networks. Zhang [17] found that the simpler model performs best in this task, which is the opposite of the conclusion from deepRAM’s experiments. Based on the experimental results and theoretical analysis, the complexity of the model should match the task and the data. Too many parameters can easily cause overfitting [82]. In general, the number of parameters of the model should not greatly exceed the number of data samples.
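The AUC values compared above can be computed from predicted scores and binary labels. The sketch below uses the rank interpretation of the AUC, namely the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions and are unrelated to the reported results.

```python
import numpy as np

def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy predictions for four sequences (hypothetical values).
perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
partial = auc_score([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# A per-model summary such as the median over datasets can then be taken
# with np.median over the list of per-dataset AUC values.
median_auc = float(np.median([perfect, partial]))
```

The pairwise form is quadratic in the number of examples, so for large datasets a sorting-based implementation from an established library would be preferable.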

Discussion

From traditional motif mining methods to the latest deep learning methods, great progress has been made alongside the development of sequencing technology and new algorithms. In the third section, we analyzed the existing models and their variants and found that more complex models tend to perform better when data are sufficient. Recent research shows a trend toward increasingly complex models. For example, researchers try to combine existing models with new components, such as attention units [83, 84], capsule networks [85], multiscale convolutional gated recurrent unit networks [86], weakly supervised CNNs [87] and multiple-instance learning [88]. However, the existing deep learning models in motif mining are still simple, with no more than three layers, whereas models in the image field usually have more than 10 layers. Therefore, there is still much room for improvement. Because adversarial training of neural networks can act as regularization and yield higher performance, this area has developed rapidly in recent years, including generative adversarial networks (GANs) [89] and a series of related studies such as Wasserstein GAN [90], MolGAN [91] and NetGAN [92]. In motif mining, GANs may be used to generate negative examples automatically, instead of generating them randomly or shuffling the positive sequences. Pretrained models [93] have also achieved significant results in the NLP field, from word2vec [73, 94] to BERT [95] and GPT [96]. In motif mining, pretraining can be used to enhance the robustness and generalization ability of the model. The great success of AlphaGo [97] set off an unprecedented change in the Go world and made deep reinforcement learning familiar to the public. In particular, AlphaGo Zero does not require any records of human games and uses only deep reinforcement learning [98]. 
After only 3 days of training from scratch, it far exceeded the knowledge of Go that humans have accumulated over thousands of years. In motif mining, reinforcement learning may enable people to discover motifs beyond current human knowledge. As we enter the era of big data, deep learning is already a very important direction of development in both academia and industry. In bioinformatics, where traditional machine learning has made great progress, deep learning is expected to produce encouraging results [99]. In this review, we conducted a comprehensive survey of the application of deep learning in the field of motif mining. We hope that this review will help researchers understand this field and promote the application of motif mining in research. Of course, we also need to recognize the limitations of deep learning methods and the promising directions of future research. Although deep learning is promising, it is not a panacea. Many applications of motif mining still face potential challenges, including unbalanced or limited data, the interpretation of deep learning results [71] and the choice of appropriate architectures and hyperparameters. For unbalanced or limited data, common approaches are data augmentation [48] and few-shot learning [100]. For the interpretation of deep learning results, common approaches are to make the model itself interpretable [101] or to interpret its predictions afterwards [71]. For the choice of architecture and hyperparameters, frameworks such as Spearmint [102], Hyperopt [103] and deepRAM [78] can explore the hyperparameter space automatically. In addition, how to accelerate the training of deep learning models also needs further research. We hope that the issues discussed in this article will be helpful to the success of future deep learning methods in motif mining. 

73 in total

Review 1. DNA binding sites: representation and discovery.

Authors: G D Stormo
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

2. A discrete artificial bee colony algorithm for detecting transcription factor binding sites in DNA sequences.

Authors: D Karaboga; S Aslan
Journal: Genet Mol Res Date: 2016-04-27

3. Identifying protein-binding sites from unaligned DNA fragments.

Authors: G D Stormo; G W Hartzell
Journal: Proc Natl Acad Sci U S A Date: 1989-02 Impact factor: 11.205

Review 4. Revealing protein-lncRNA interaction.

Authors: Fabrizio Ferrè; Alessio Colantoni; Manuela Helmer-Citterich
Journal: Brief Bioinform Date: 2015-06-02 Impact factor: 11.622

5. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins.

Authors: Martin Stražar; Marinka Žitnik; Blaž Zupan; Jernej Ule; Tomaž Curk
Journal: Bioinformatics Date: 2016-01-18 Impact factor: 6.937

6. An Entropy-Based Position Projection Algorithm for Motif Discovery.

Authors: Yipu Zhang; Ping Wang; Maode Yan
Journal: Biomed Res Int Date: 2016-11-02 Impact factor: 3.411

7. Recurrent Neural Network for Predicting Transcription Factor Binding Sites.

Authors: Zhen Shen; Wenzheng Bao; De-Shuang Huang
Journal: Sci Rep Date: 2018-10-15 Impact factor: 4.379

8. Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network.

Authors: Qinhu Zhang; Zhen Shen; De-Shuang Huang
Journal: Sci Rep Date: 2019-06-11 Impact factor: 4.379

9. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts.

Authors: Surag Nair; Daniel S Kim; Jacob Perricone; Anshul Kundaje
Journal: Bioinformatics Date: 2019-07-15 Impact factor: 6.937

10. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics.

Authors: Ehsaneddin Asgari; Mohammad R K Mofrad
Journal: PLoS One Date: 2015-11-10 Impact factor: 3.240
