
Representation learning applications in biological sequence analysis.

Hitoshi Iuchi^1,2, Taro Matsutani^2,3, Keisuke Yamada⁴, Natsuki Iwano³, Shunsuke Sumi^3,5, Shion Hosoda^2,3, Shitao Zhao¹, Tsukasa Fukunaga^6,7, Michiaki Hamada^2,3,4,8.

Abstract

Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.


Keywords: BERT; Natural language processing; Representation learning; Sequence analysis; Word2vec

Year: 2021 PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Considerable advances in high-throughput sequencing have resulted in rapid data accumulation [1]. Although these modern technologies produce a considerable amount of data, they do not provide interpretation or biological information. Thus, the analysis of biological sequences, such as DNA/RNA/protein sequences, to realize biological discoveries has become more critical and challenging. To tackle this issue, the application of natural language processing (NLP) to sequence analysis has attracted considerable attention in terms of treating biological sequences as sentences and k-mers in these sequences as words [2], [3]. NLP aims to allow computers to understand the content of natural language, including the context, to accurately extract information, and to provide valuable insights [4]. Natural language is composed of characters, such as the alphabet, and the meaning is deduced and constructed using grammar and semantics. In the same manner, biological sequences can be regarded as sentences with different letters, and biophysical and biochemical rules can be used to define properties, such as the function and structure [5]. Biological sequences are consistent with natural language where characters are used to define their meaning, and the meaning depends on the neighboring sequence. For example, whether the word “bank” in a sentence refers to a financial institution or raised portion of seabed depends on the context. Similarly, whether a part of an RNA sequence forms a secondary structure depends on its neighboring sequences. Thus, considering the similarities between natural language and biological sequences, the application of NLP has the ability to provide a comprehensive understanding of the function and structure encoded in the biological sequence. Representation learning is an essential step in NLP and indicates automatic systems to explore the representation of raw data, such as words or characters [6]. 
In general, the representation is provided as a real-valued vector known as a distributed representation. Successful representation learning converts words into vectors while preserving their semantic similarity. For example, the names of foods, like “sushi” and “pizza,” should be converted into similar vectors and the names of organisms, such as “frog,” should be assigned entirely different vectors (Fig. 1). In biological sequences, N-methyl-D-aspartate receptor and α-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid receptor, which are both ionotropic glutamate receptors, may be converted into similar vectors, whereas green fluorescent protein may be converted into a completely different vector. Thus, representation learning is the transformation of words into vectors that preserves the similarities and differences between words.

Fig. 1

Ideal representation learning should convert the names of foods, such as “sushi” and “pizza,” into similar vectors and assign different vectors to the names of organisms, such as “cow” and “frog.”

Biological sequences vectorized by representation learning can be directly used for biological tasks, such as function and structure prediction [7], [8]. If the vector similarity between proteins is high, it can be inferred that they possess similar functions and structures. Note that vector similarity/distance can be calculated using linear algebra operations, such as dot product, Euclidean distance, and cosine similarity. In particular, the successful encoding of words via representation learning has been recognized as an essential research area because the performance of NLP and deep learning depends on the quality of the representation [6]. Thus, a good representation of a biological sequence is critical for clustering and for function, structure, and disorder prediction [2]. Considering the significance and growing trend in the application of representation learning in biology (Fig. 2), in the present study, we review representation learning for biological sequence analysis. It should be noted that this review covers the application of representation learning to biological sequence analysis, while its use in biological literature and medical records is beyond the scope of this review. This review is organized as follows: Section 2 introduces the basic representation techniques for NLP. Section 3 provides a comprehensive survey of representation learning approaches for sequence analysis. Section 4 presents a summary and an outlook of representation learning applications in biological sequence analysis.
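The three similarity/distance measures named above can be sketched in a few lines of NumPy. This is only an illustration: the three-dimensional vectors below are hypothetical values chosen to mimic Fig. 1, not the output of any trained model.

```python
import numpy as np

def dot_product(u, v):
    """Inner product of two embedding vectors."""
    return float(np.dot(u, v))

def euclidean_distance(u, v):
    """Straight-line distance; smaller means more similar."""
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means identical direction."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, for illustration only (not learned from data).
embeddings = {
    "sushi": np.array([0.9, 0.8, 0.1]),
    "pizza": np.array([0.8, 0.9, 0.2]),
    "frog": np.array([0.1, 0.2, 0.9]),
}

if __name__ == "__main__":
    for other in ("pizza", "frog"):
        print(other,
              cosine_similarity(embeddings["sushi"], embeddings[other]),
              euclidean_distance(embeddings["sushi"], embeddings[other]))
```

With vectors like these, the two foods score a higher cosine similarity and a smaller Euclidean distance with each other than either does with "frog", which is the behavior expected of a good representation.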

Fig. 2

Change in the number of hits for the search term “representation learning” (in double quotation marks) in PubMed (https://pubmed.ncbi.nlm.nih.gov/).

Representation learning techniques

Currently, the acquisition of distributed representations of biological sequences is mainly achieved using neural networks developed in NLP. In representation learning for NLP, it is assumed that words that appear in the same context have similar meanings, according to the distributional hypothesis [9]. Representation learning methods based on the distributional hypothesis aim to vectorize words or phrases by training neural networks with architectures specialized for understanding the relationships among words from a corpus (a set of documents). Various representation learning methods presented in this review are based on neural-network-based language models specialized for biological sequences; thus, it is essential to understand the underlying architecture of the neural networks developed for NLP. In this section, we briefly summarize the development of basic representation learning techniques. word2vec was the first successful method used to obtain distributed representations using a neural network [10], [11]. Two types of neural networks are used in word2vec: a skip-gram model, which predicts the words around the input word, and a continuous bag-of-words model, which predicts the target word from the surrounding words. Until the advent of word2vec, researchers used neural networks to describe the syntactic structure [12], [13]. The skip-gram model proposed by Mikolov attracted attention owing to its ability to capture not only grammatical correctness but also semantic features, as described in the introduction. word2vec with the skip-gram model acquires a distributed representation for each word by training the three-layer neural network, as shown in Fig. 3. Considering a sentence with T words and the t-th word $w_t$, the model predicts the words present in the vicinity of $w_t$ in that sentence. The pre-defined vicinity is a hyper-parameter denoted as a constant, c, which gives the number of words around $w_t$ that should be included in the prediction. The parameters to be estimated in the skip-gram model include the weight matrix X, which predicts the d-dimensional hidden layer h from the one-hot encoded input layer, and the weight matrix Y, which predicts the output from h. They are estimated using the formula described below:
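The formula is not reproduced in this copy of the text. For reference, the skip-gram objective as it is commonly written in the word2vec literature, which is assumed here to be the one intended, maximizes the average log-probability of the context words within the window c. The one-hot vector $e_{w_t}$, the softmax form, and the shapes ($X$ as $V \times d$, $Y$ as $d \times V$) are notational assumptions:

```latex
\begin{align}
\hat{X}, \hat{Y} &= \operatorname*{arg\,max}_{X,\,Y} \;
  \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}}
  \log p(w_{t+j} \mid w_t) \tag{1} \\
p(w_{t+j} \mid w_t) &= \operatorname{softmax}\!\left(Y^{\top} h\right)_{w_{t+j}},
  \qquad h = X^{\top} e_{w_t}
\end{align}
```

Under these assumptions, $h$ is simply the row of $X$ that corresponds to the input word, which is why the rows of the trained $X$ can be read off as word vectors.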

Fig. 3

Skip-gram model used in word2vec. This neural network model includes the following three fully connected layers: the input, hidden, and output layers. In this case, it attempts to learn the features from the sentence, “I am majoring in biology,” and to predict the words surrounding $w_t$, “majoring.” The model performs the same operation for all sentences and repeats it over multiple epochs to complete training.

In this case, the weight matrix X is a $V \times d$ matrix, where V represents the number of words in the vocabulary. If $w_t$ is the v-th word in the vocabulary, we can obtain the distributed representation of the word as the v-th row vector of the estimated X (i.e., $X_v$). The word2vec representation has additive compositionality and has garnered fame for allowing intuitive operations, such as arithmetic on word vectors, as shown previously [11]. Hence, word2vec succeeded in obtaining highly interpretable distributed representations for the first time and helped to direct subsequent development in representation learning. The fact that word2vec captures semantic features is a remarkable breakthrough in representation learning, which has prompted the proposal of various extended models based on word2vec. GloVe uses word co-occurrence matrices, which have been used in classical latent semantic analysis, such as singular value decomposition [14]. It shows higher semantic accuracy than word2vec. FastText is an embedding method based on the skip-gram model [15]. It considers sub-word information, which allows it to handle words that do not appear in the training data. Additionally, several methods have been developed to obtain a distributed representation for each sentence (not word) based on the word2vec concept. doc2vec utilizes paragraph vectors, which capture the context of each paragraph and provide the features for each sentence [16].
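The (input word, surrounding word) pairs that the skip-gram model of Fig. 3 is trained on can be enumerated with a short sketch. The function name `skipgram_pairs` is hypothetical, and the code only generates training pairs; it does not train the network itself.

```python
def skipgram_pairs(tokens, c):
    """Return (center, context) pairs using a window of c words on each side."""
    pairs = []
    for t, center in enumerate(tokens):
        start = max(0, t - c)                 # clip the window at the sentence start
        stop = min(len(tokens), t + c + 1)    # clip the window at the sentence end
        for j in range(start, stop):
            if j != t:                        # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

if __name__ == "__main__":
    sentence = "I am majoring in biology".split()
    for center, context in skipgram_pairs(sentence, c=1):
        print(center, "->", context)
```

For the example sentence with c = 1, the input word “majoring” is paired with “am” and “in”. The same function applies unchanged to a biological sequence once it has been split into k-mer words.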
Although word2vec has enabled considerable progress in representation learning, it cannot express the polysemy of words because it yields a single d-dimensional vector for each word, as mentioned above. For example, “right” in “right to vote” and in “turn right” differs in meaning; however, both are embedded at the same point using word2vec. The approach to solving this problem is known as word sense disambiguation in NLP [17], and it calls for architectures that consider the context and meaning of a sentence. In biological sequences, the context of a word in a sentence is equivalent to the role of a particular nucleic/amino acid in the whole sequence. Hence, polysemy in biological sequences is critical, similar to that observed in natural languages. Here, we introduce two methods that take such contexts into account: one makes the neural network recursive using a recurrent neural network (RNN) or long short-term memory (LSTM) [18], and the other uses the attention mechanism. RNN and LSTM are developments of the classical autoregressive language models that have been primarily utilized for sequential tasks, such as document generation and machine translation [19], [20]. In the language model with a forward LSTM, as shown in Fig. 4, the occurrence probability of the t-th word in a sentence, $w_t$, depends on the set of words that appear before $w_t$ (denoted as $w_{<t}$). The model trains the parameters to maximize the joint probability for all words, $\prod_{t=1}^{T} p(w_t \mid w_{<t})$. To calculate $p(w_t \mid w_{<t})$, LSTM uses the hidden layer of $w_{t-1}$ (the output of which is denoted by $h_{t-1}$), which depends on $w_{t-1}$ and $h_{t-2}$. As the hidden layer is computed recursively depending on the word order, LSTM-based models allow context-aware learning. Currently, most LSTM-based language models are based on bidirectional LSTM (bi-LSTM), which considers the context not only in the forward but also in the reverse direction. In a backward LSTM, the hidden layer of $w_{t+1}$ and its output $\overleftarrow{h}_{t+1}$ depend on $w_{t+1}$ and $\overleftarrow{h}_{t+2}$. By considering word dependency in the backward direction, bi-LSTM can incorporate relationships among words that cannot be captured by using the forward LSTM alone. In bi-LSTM, all hidden layers are trained to maximize the joint probability of generating the entire sentence as follows:
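The equation is not reproduced in this copy of the text. A common way to write the bidirectional objective, following the formulation used for ELMo-style language models and assumed here to be the one intended, is the sum of the forward and backward log-likelihoods over all T words. The symbol $w_t$ for the t-th word is a notational assumption:

```latex
\begin{equation}
\sum_{t=1}^{T} \Bigl[
  \log p\bigl(w_t \mid w_1, \dots, w_{t-1}\bigr)
  + \log p\bigl(w_t \mid w_{t+1}, \dots, w_T\bigr)
\Bigr] \tag{2}
\end{equation}
```

The first term is the forward LSTM's prediction of each word from its left context, and the second is the backward LSTM's prediction from its right context. The two directions are trained together but each conditions on only one side of the word.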

Fig. 4

Graphical representation of a forward LSTM. Input $x_t$ shows the embedding of the t-th word $w_t$, and the output $h_t$ is transformed to a probability using a softmax function. For example, if $w_t$ is “majoring”, the model is trained to increase the probability that “majoring” is output from $h_{t-1}$, which is calculated from words up to $w_{t-1}$, “am”.

Embeddings from language models (ELMo) are the distributed representations provided by a model consisting of multiple stacked bi-LSTMs [21]. This model is referred to as a bidirectional language model (bi-LM) and contains a stack of L bi-LSTM modules. ELMo is obtained by estimating the weighted sum of the outputs from $2L + 1$ layers, which are the L hidden layers of each of the forward and backward LSTM modules and an input embedding layer. ELMo addresses polysemy because it refers to the hidden layers of the LSTM, which consider the context of the input sentence, in addition to the input embedding layer, which depends only on the word itself. In fact, ELMo successfully embeds the same word at different points in a high-dimensional space, depending on the context. Another approach for addressing the polysemy issue is to use the attention mechanism. Briefly, attention quantifies the degree of dependency between words [22], [23]. Neural networks with attention mechanisms comprise an attention weight that is obtained by calculating the association of hidden layers (e.g., using the inner product) for arbitrary combinations of words in sentences. If the two words used to compute the attention weight originate from different sentences, this attention is referred to as source-target attention. On the other hand, if they originate from the same sentence, it is designated as self-attention. Models that use attention weights in forward propagation are extremely expressive, allowing for a natural introduction of an attention mechanism to representation learning.
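The idea of an attention weight computed from inner products of hidden layers can be illustrated with a deliberately simplified NumPy sketch. It omits the learned query/key/value projections, scaling, and multiple heads used in practical models; the function names and the toy matrix are hypothetical.

```python
import numpy as np

def self_attention_weights(H):
    """H is a (T, d) array of hidden vectors for T words of one sentence.

    Returns a (T, T) matrix of attention weights; row t holds the weights
    of word t over every word in the sentence, and each row sums to 1.
    """
    scores = H @ H.T                                      # all-against-all inner products
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def attend(H):
    """Context-aware vectors: each word becomes a weighted sum of all words."""
    return self_attention_weights(H) @ H

if __name__ == "__main__":
    H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy hidden vectors
    print(self_attention_weights(H))
    print(attend(H))
```

Because every word is compared with every other word, the weight matrix has T × T entries for a sentence of T words.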
Transformer, which implements the attention mechanism and positional encoding [24] in a key-value memory neural network [25], [26] without conventional context-aware architectures, such as RNN or LSTM, has achieved state-of-the-art accuracy in several tasks, including machine translation [27]. Bidirectional encoder representations from transformers (BERT) is a model with multiple stacked transformers (see Fig. 5) [28]. In the pre-training of BERT, the input is a sequence of tokens that joins two sentences. A part of the input words is randomly masked. When the masked word is the t-th word, $w_t$, the model predicts $w_t$ by considering the context before and after $w_t$. This language model is called the masked language model (MLM). Compared to the traditional autoregressive language models, MLM can consider the context before and after “jointly”, rather than “independently”. That is, the occurrence probability of $w_t$, $p(w_t \mid w_{<t}, w_{>t})$, where $w_{<t}$ and $w_{>t}$ denote the words before and after $w_t$, cannot be factorized into $p(w_t \mid w_{<t})\,p(w_t \mid w_{>t})$; this modification contributes to the improved accuracy. Therefore, in contrast to Eq. (2), MLM maximizes the following joint likelihood:

$$\prod_{t \in M} p(w_t \mid w_{<t}, w_{>t})$$

where $M$ shows the set of masked tokens. Additionally, the model performs a binary classification of whether the two input sentences are semantically consecutive. Similar to the approaches used in other methods, we can use the outputs of pre-trained transformer layers as the distributed representations of input sentences.
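The masking step of MLM pre-training can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `mask_tokens` is hypothetical, the default masking fraction of 15% is taken from common BERT practice rather than from this review, and every selected token is replaced with [MASK] (practical implementations also leave some selected tokens unchanged or swap in random tokens).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK].

    Returns the masked token list and a dict {position: original token},
    i.e., the set M of masked positions with the targets to be predicted.
    """
    rng = random.Random(seed)                           # seeded for reproducibility
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for t in positions:
        targets[t] = masked[t]                          # remember the word to predict
        masked[t] = MASK                                # hide it from the model
    return masked, targets

if __name__ == "__main__":
    # Single amino acids as words, as in protein language models.
    protein = list("MKTAYIAKQR")
    masked, targets = mask_tokens(protein, mask_fraction=0.2, seed=1)
    print(masked)
    print(targets)
```

During pre-training, the model receives `masked` as input and is scored only on how well it recovers the entries of `targets` from the unmasked context on both sides.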

Fig. 5

The graphical representation of the bidirectional encoder representations from transformers (BERT) architecture. Special tokens ([CLS], [MASK], and [SEP]) enable the model to extract features based on the self-attention of the whole sentence. BERT is trained with the following two tasks: masked language model (MLM) and next sentence prediction (NSP). In pre-training for MLM, the model predicts the original words of the masked tokens (e.g., predicting “have” and “dollars” from [MASK]) considering the context before and after the masked tokens.

Neural networks with attention mechanisms, such as transformer and BERT, capture distal word associations better than conventional recursive models represented by RNN and LSTM [22], [29]. This is because, in recursive models, a hidden layer of a certain word depends on the hidden layers of the neighboring words only, and the contribution of distal words becomes small or converges to zero. In contrast, attention is robust against such weight loss since the model always refers to its association with all words. This feature of BERT is attractive from the biological perspective since distal interactions are important for structural predictions and other purposes. Meanwhile, the calculation of the all-against-all attention always involves substantial computational complexity, $O(T^2)$ for a sentence with T words. Thus, it is important to reduce this complexity, for which various approximation methods have been proposed [30], [31]. Another advantage of using BERT is task-independent versatility. For instance, when we use ELMo, it is necessary to prepare a task-specific model to transfer the obtained distributed representations to other tasks. In such cases of transfer learning, the new model may forget the features learned in pre-training, thus necessitating careful, sophisticated retraining of the model, as represented by ULMfit [32].
In contrast, with BERT, we can utilize the same architecture used in pre-training (as shown in Fig. 5) without modification. Fine-tuning, which uses pre-trained hidden layers for initialization and optimizes the parameters for each task, has achieved state-of-the-art accuracy in several NLP tasks [28]. The main advantage of obtaining features through unsupervised learning is that it can retain versatility for the transfer learning to various tasks. However, to build a specialized model for a specific task, representation learning in a supervised manner is also useful. StarSpace is a supervised learning method [33], which uses labeled documents as the training dataset, and embeds words and labels in the same space so that a label is close to words associated with it. Embedding with StarSpace allows for text classification, that is, the prediction of labels used in the course of learning with higher accuracy than the other unsupervised methods, and it provides highly interpretable vectors. As shown by this example, supervised representation learning is also a practical option if the correct labels are known. Since the development of word2vec in 2013, the field of representation learning in NLP has been expanding at an astonishing pace. Considering the models based on transformer or BERT, several modern improved methods have continued to provide increased accuracy [34], [35]. Furthermore, similar to the considerable impact of the attention mechanism, the emergence of new concepts may also help reconstruct the current paradigm of language modeling. These substantial developments in machine learning will be useful for bioinformatics and sequence analyses. As numerous examples are introduced in later sections, we believe that application of the latest representation learning techniques to biological sequences will lead to a discovery or elucidation of novel information in this domain.

Survey of representation learning applications in sequence analysis

We conducted an exhaustive survey, as shown in Table 1 and supplementary data, for articles that met the following criteria: (i) peer-reviewed and published in PubMed, except for BERT, which was recently published with a limited number of peer-reviewed articles; (ii) explicitly used a language model, such as word2vec or BERT; (iii) provided the source code or the model for repeatability or verification.

Table 1

Comprehensive survey of representation learning application in biological sequences

Method name	Model	Training data	Task	Avail. and repr.	Ref.
ProtVec	word2vec	547 K proteins	family classification, disorder prediction	+	[36]
HLA-vec	word2vec	HLA-I binding/non-binding peptides	HLA-I binding prediction	++	[37]
m-NGSG	word2vec	0.1 K–3 K proteins	protein classification	++	[38]
Gene2vec	word2vec	89 K positive and 495 K negative mRNAs	N6-methyladenosine site prediction	++	[39]
–	word2vec	3 K–101 K of 300 bp genomic regulatory regions	regulatory region prediction	++	[40]
ProtVecX	word2vec	371–44 K proteins	venom toxin prediction, enzyme prediction	+++	[41]
MHCSeqNet	word2vec	228 K peptide-MHC pairs	MHC binding prediction	+++	[42]
–	word2vec	1 M 16S rRNAs	sample class (e.g., body part) prediction	+++	[43]
fastDNA	word2vec	356–3 K bacterial genomes	species identification	++	[44]
NucleoNN	word2vec	86/72 SNPs in the control/exposure samples	investigating allele-interactions	++	[45]
–	word2vec	3 K–22 K CPI pairs	CPI prediction	+++	[46]
FastTrans	word2vec	1 K membrane transporter and 1 K membrane non-transporter proteins	substrate prediction of transport proteins	++	[47]
INSP	word2vec	78 nuclear proteins	nuclear localization prediction	++	[48]
–	word2vec	9 M proteins	function prediction	++	[49]
Its2vec	word2vec	126 K ITSs	species identification	++	[50]
4mCNLP-Deep	word2vec	C. elegans genome (WBcel235/ce11)	N4-methylcytosine sites prediction	++	[51]
–	doc2vec	525 K proteins	localization, T50, absorption, enantioselectivity prediction	+++	[52]
EP2vec	doc2vec	650 K enhancers and 93 K promoters	enhancer-promoter interaction prediction	++	[53]
IDP-Seq2Seq	Seq2Seq	3 K proteins	disorder prediction	++	[54]
–	GloVe	244 K–504 K chromatin accessible regions	chromatin accessibility prediction	++	[55]
CircSLNN	GloVe	37 datasets of RBP-binding sites on circular RNAs	RBP-binding sites prediction of circRNAs	+	[56]
–	FastText	3 K promoters and 3 K non-promoters	promoter strength classification	++	[57]
iEnhancer-5Step	FastText	1 K human enhancers and 1 K human non-enhancers	enhancer prediction	++	[58]
TNFPred	FastText	18 tumor and 133 non-tumor necrosis factors	tumor necrosis factors classification	++	[59]
eDNN-EG	FastText	518 essential and 1 K non-essential genes	essential gene prediction	+	[60]
ProbeRating	FastText	440 K proteins and 274 K nucleic acids	nucleic acid-binding proteins binding preference prediction	++	[61]
CSCS	bi-LSTM	4 K–58 K viral proteins	viral escape mutation prediction	+++	[62]
UniRep	mLSTM	24 M proteins	structure and function prediction	+++	[63]
UDSMProt	AWD-LSTM language model	499 K proteins	enzyme class prediction, gene ontology prediction, remote homology, fold detection	+++	[64]
USMPep	AWD-LSTM language model	23 K–120 K MHC binding peptides	MHC binding affinity prediction	++	[65]
BindSpace	StarSpace	505 K TF-associated and 505 K non-associated DNA	TF-binding prediction	++	[66]
MutSpace	StarSpace	cancer mutation sites	cancer type prediction	++	[67]
SeqVec	ELMo	33 M proteins	3-state secondary structure prediction, disorder prediction, localization prediction, membrane prediction	++	[68]
NuSpeak	ULMfit	92 K RNAs	designing RNA toehold switches	++	[69]
DNA-transformer	transformer	E. coli genome (MG1655)	transcription start sites, translation initiation sites, 4mC methylation sites prediction	++	[70]
TAPE	BERT	31 M proteins	3-state secondary structure prediction, contact prediction, remote homology detection, fluorescence prediction, stability prediction	+++	[71]
ESM-1b	BERT	27 M–250 M proteins	remote homology detection, 8-state secondary structure prediction, contact map prediction, quantitative prediction of mutational effects	++	[72]
ProtBert	BERT	216 M–2B proteins	3-/8-state secondary structure prediction, subcellular localization prediction, membrane-boundness prediction	++	[73]
DNABERT	BERT	H. sapiens genome (GRCh38.p13)	promoter prediction, TF-binding site prediction, splicing site prediction, functional variant analysis	+++	[74]
BERT4Bitter	BERT and bi-LSTM	256 bitter and 256 non-bitter peptides	prediction of bitter peptides	++	[75]
BERT-Enhancer	BERT and CNN	1 K human enhancers and 1 K human non-enhancers	enhancer prediction	++	[76]
BERT-RBP	BERT	10 K RBP-bound and 10 K RBP-unbound RNA sequences	RNA-RBP interaction prediction	++	[77]

Avail. and repr. indicate availability and reproducibility, respectively. (+++) The source code for generating the model, the pre-trained model, and detailed documentation, including data links and installation instructions, are all available. (++) Either the source code for generating the model or the pre-trained model is available, and detailed documentation, including data links and installation instructions, is available. (+) Either the source code for generating the model or the pre-trained model is available, but the documentation is limited. Model indicates the general model (described in Section 2) utilized in the method. K, kilo; M, mega; B, billion; HLA, human leukocyte antigen; MHC, major histocompatibility complex; CPI, compound–protein interaction; ITS, internal transcribed spacer; RBP, RNA binding protein; TF, transcription factor.


Applications for structure/function prediction

ProtVec is the first model to use the embedding method for biological sequences [36]. This method regarded 3-mers of amino acids as words and used 546,790 protein sequences obtained from the Swiss-Prot database as the training dataset. Subsequently, word2vec using the skip-gram model was applied to the dataset, and 100-dimensional protein vectors were calculated. Originally, ProtVec was evaluated on protein family classification and disordered protein prediction, and it achieved high performance in both. ProtVec has since been utilized for predicting kinase activity [78] and gene function [79]. As ProtVec is a straightforward model, various extensions have been proposed. One of the extensions is seq2vec, which embeds whole protein sequences rather than k-mers of amino acids [80]. Seq2vec utilizes doc2vec [16], an NLP method that embeds documents instead of words, and showed higher performance than ProtVec in protein family classification. Another extension is dna2vec [81], which uses word2vec to embed variable-length DNA k-mers rather than fixed-length ones. ProtVecX is a similar method that uses word2vec to embed variable-length amino acid k-mers [41]. SeqVec is the first model that uses ELMo to obtain amino acid representations based on the whole protein sequence [68]. ELMo was applied to the UniRef50 dataset, which contains 33 M proteins with 9.6 G residues, regarding single amino acids as words. The extracted features were then used as input for per-residue and per-protein prediction. With and without the evolutionary information, the model could accurately predict the secondary structure, disorder, localization, and membrane binding. The performance did not exceed that of the state-of-the-art methods [82], [83]; however, it was better than that of ProtVec [36], which is a context-independent model.
In certain tasks, such as protein function prediction, it outperformed one-hot encoding of k-mer-based embeddings and provided competitive results obtained using ELMo [84]. UDSMProt is another language model representation extractor using a variant of LSTM [64]. The structure used is called AWD-LSTM [85], which is a three-layered bi-LSTM that introduces different types of dropout methods to achieve accurate word-level language modeling. UDSMProt was initially applied to the Swiss-Prot database and then fine-tuned for specific tasks, such as enzyme commission classification, gene ontology prediction, and remote homology detection. UDSMProt showed that upon pre-training with external data, the model performed in a manner that was comparable to the existing methods that were tailored to the task using a position-specific scoring matrix (PSSM) and outperformed them in two out of the three tasks conducted. Additionally, it demonstrated that utilization of pre-training information could compensate for the lack of data, compared to the case where PSSM information was provided. These results and extensions, such as USMPep, which revealed the ability to successfully predict MHC class I binding [65], imply that language models can be used to efficiently contextualize and achieve word-based representation. ESM-1b is a BERT-based model trained on a massive biological corpus, particularly amino acid sequences [72]. The study presented a series of BERT models with varying parameter sizes. After conducting pre-training using up to 250 million protein sequences, where each amino acid residue in a sequence was treated as a word, models could accurately predict the structural characteristics of proteins, including remote homology, secondary structure, and residue–residue contact. 
Representations produced by the pre-trained 34-layer model were combined with multiple sequence alignments (MSAs), which are the original input of the existing secondary structure and contact prediction methods, and their prediction accuracy improved. This result indicated that embedded representations based on the pre-trained BERT incorporated more information than the MSAs. Furthermore, the 34-layer model was fine-tuned to predict the quantitative effect of mutations and was found to outperform the state-of-the-art methods. Apart from the model trained on individual sequences, Rao et al. proposed a model trained on sets of amino acid sequences in the form of MSAs [86]. As attractive alternatives, other protein BERT models, such as TAPE transformer and ProtBert, have also been developed [71], [73]. Meticulous inspection of the TAPE transformer revealed that attention maps extracted from the pre-trained model reflected the context of input amino acid sequences [87]. For instance, one attention module, which specializes in deciphering residue–residue interactions, exhibited a significant correlation with experimental labels although no structural information was provided. This phenomenon was later investigated by reconstructing protein contact maps from the attention maps of pre-trained ESM-1b [7]. Together, these studies illustrate that BERT-based models are highly interpretable and widely applicable to protein-related bioinformatics problems. DNABERT, in contrast, is the only model currently available that can be used to pre-train BERT-based models using a whole human reference genome [74]. During preprocessing, the genome, from which gaps and unannotated regions were excluded, was split into non-overlapping sequences of 5 to 510 consecutive nucleotides, which were subsequently converted to 3- to 6-mer representations. In a simple sense, each subsequence of length 3 to 6 was regarded as a word.
BERT models were pre-trained using k-mers with a masked language modeling objective and applied to downstream tasks. Upon task-specific fine-tuning, DNABERT demonstrated state-of-the-art or comparable performance in predicting promoter regions, binding sites of transcription factors (TFs), and splice sites. Attention analysis revealed that fine-tuned models captured the characteristics of each set of target sequences. For example, DNABERT fine-tuned using splicing datasets exhibited high attention weights in intronic regions in addition to the target splice sites, indicating the ability of the model to learn the contextual significance of splicing enhancers or silencers in predicting splice sites. The study further applied DNABERT to predict promoters in the mouse genome and reported higher performance than that of existing deep learning methods. Additionally, we recently adapted DNABERT for predicting RNA–protein interactions and demonstrated through attention analysis that the fine-tuned model could translate transcript region type and RNA secondary structure [77]. Overall, two-step training of the BERT architecture demonstrated its broad applicability for translating various genomic features in a cross-organism manner.
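
The DNABERT preprocessing described above (non-overlapping chunks of 5 to 510 nucleotides, k-mer "words", and a masked language modeling objective) can be illustrated with a minimal sketch. This is not the DNABERT implementation: the function names, the random choice of chunk lengths, the stride-1 overlapping k-mers, the `[MASK]` token name, and the 15% masking rate are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder token name; an assumption for this sketch


def chunk_sequence(seq, min_len=5, max_len=510, seed=0):
    """Split seq into consecutive, non-overlapping chunks of 5 to 510 nt.

    Chunk lengths are drawn at random (an assumption); a trailing
    fragment shorter than min_len is dropped.
    """
    rng = random.Random(seed)
    chunks, start = [], 0
    while start < len(seq):
        length = rng.randint(min_len, max_len)
        chunk = seq[start:start + length]
        if len(chunk) >= min_len:
            chunks.append(chunk)
        start += length
    return chunks


def kmer_tokens(seq, k):
    """Return the overlapping k-mers of seq (stride 1), one per 'word'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return (masked_tokens, labels).

    labels[i] holds the original token at masked positions, else None.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    genome = "ATGCGTACGTTAGC" * 100  # toy stand-in for a reference genome
    for chunk in chunk_sequence(genome)[:2]:
        words = kmer_tokens(chunk, k=6)
        masked, labels = mask_tokens(words)
        print(len(chunk), words[:3], masked[:3])
```

During pre-training, a model would be asked to recover the entries of `labels` from the surrounding unmasked k-mers. Because adjacent stride-1 k-mers share nucleotides, masking tokens independently as done here leaks information from neighbors, so a practical scheme would likely mask contiguous runs of k-mers instead.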

Applications for molecular interactions

Tsubaki et al. proposed a model combining a graph neural network for compounds and a convolutional neural network (CNN) for proteins to predict compound–protein interactions (CPIs) [46]. Representations of compounds and proteins were obtained in an end-to-end manner. The word embeddings in the protein were learned from the training dataset using word2vec (3-mers of amino acids as words). To obtain the protein vector representation, the d-dimensional hidden vectors produced by hierarchical convolutional filters were averaged. Extensive evaluations were conducted on three CPI datasets (human, C. elegans [88], and DUD-E [89]). The results showed that, using the raw amino acid sequence as the input, the proposed approach significantly outperformed existing methods utilizing traditional chemical and biological features. They also established that the model could highlight 3D structural interaction sites between the compounds and proteins through an attention mechanism similar to that applied to words in sentences.

ProbeRating is a neural network-based recommender system utilizing word embeddings in NLP to infer binding profiles for unexplored nucleic acid-binding proteins (NBPs) [61]. ProbeRating achieves this goal using a two-stage framework. In the first stage, representation learning is performed using a package called FastBioseq, which implements FastText. Input feature vectors are thereby extracted from the NBP sequences and nucleic acid probes. The authors preselected amino acid 3-mers for proteins and 5-mers for nucleic acids as words. Three datasets (Uniprot400k [90], RRM3k [91], and Homeo8k [92]) were used to pre-train the FastBioseq protein embedding models, whereas RNA embedding models were trained directly from the RRM162 dataset [91]. In contrast, 8-mer frequency features were used for the DNA sequences in the Homeo215 dataset [93].
In the second stage, prediction of the NBP binding preference was redefined as a recommender system formulation, where NBPs are considered as users and RNAs or DNAs are considered as products to be recommended. When no preference was available for a given user, the authors adapted and extended a strategy that converted the binding intensity prediction problem into a similarity prediction problem, solved it, and then converted it back. Extensive evaluation experiments were conducted for the following two tasks: RBP–RNA interaction and TF–DNA interaction. The results showed that ProbeRating outperformed three baseline methods (Nearest-Neighbor, Co-Evo [94], and AffinityRegression [93]). Further analysis suggested that this advantage derived from both the neural network approach and the input features extracted via word embeddings.
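
The idea of building a single protein vector by averaging per-word vectors, as in the model of Tsubaki et al., can be sketched in a few lines. This is a simplified illustration under stated assumptions: the convolutional layers are omitted, the random vectors stand in for learned word2vec embeddings, and all function names and the dimension of 8 are hypothetical.

```python
import random


def protein_words(seq, k=3):
    """Overlapping amino acid k-mers, treated as 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_embedding_table(words, dim=8, seed=0):
    """Random vectors standing in for learned word embeddings (hypothetical)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for w in sorted(set(words))}


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed_protein(seq, table, k=3):
    """Look up each k-mer and average the vectors into one protein vector."""
    return mean_pool([table[w] for w in protein_words(seq, k) if w in table])


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # arbitrary toy sequence
    table = toy_embedding_table(protein_words(seq))
    print(embed_protein(seq, table))
```

Averaging yields a fixed-length vector regardless of protein length, which is what allows sequences of different sizes to be fed to the same downstream predictor.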

Applications in synthetic biology

Valeri et al. proposed a model that could predict synthetic riboregulators called toehold switches [95]. The model comprised a language model for toehold switch classification and a CNN-based model for toehold switch performance regression. In the language model, a toehold switch sequence was embedded using ULMFiT, treating each nucleotide as a word. They trained the model using toehold switches experimentally characterized by Angenent-Mari et al. [96]. The results showed that the model exhibited good and robust performance even for sparse training data and that the features obtained by the model revealed unknown properties of the toehold switches. They also showed that the trained model is easily fine-tuned by transfer learning using small external data [97], [98], and the fine-tuned model exhibited superior performance compared to an existing model. Finally, they showed that the fine-tuned model could help in the efficient design of toehold switches for various applications, such as SARS-CoV-2 detection.

UniRep is a representation that comprehensively summarizes the semantics of arbitrary proteins and can be useful for various types of prediction tasks [99]. A protein sequence is embedded into UniRep using multiplicative LSTM (mLSTM), trained with 24 million UniRef50 sequences [100], where an amino acid is regarded as a word. UniRep is used to recapitulate biophysical properties, phylogenetics, and secondary structures of proteins. The authors also showed that UniRep outperformed other representations for predicting the structural and functional properties of de novo proteins, single point mutants, and natural proteins. These results suggest that UniRep is useful for the rational design of proteins. As a proof-of-concept, UniRep re-trained using deep mutational scanning data of GFP [101] was shown to effectively extrapolate GFP brightness outside the training domain. Therefore, UniRep was suggested to markedly reduce the cost of the rational design of GFP.
Collectively, UniRep embodies various known protein characteristics and may be a versatile representation for protein bioinformatics.

Applications for other tasks

StarSpace is a supervised embedding method, in contrast to the unsupervised embedding methods introduced in Section 2 [33]. Although StarSpace was originally developed for general NLP tasks, such as text classification, two bioinformatics applications are currently available. The first application is BindSpace, which is used to predict the binding sites of TFs [66]. BindSpace uses HT-SELEX experiments as the training dataset and applies StarSpace to the dataset by considering 8-mers and TFs as words and labels, respectively. In performance evaluation using the ENCODE ChIP-seq dataset, BindSpace achieved high classification performance even between paralogous TFs, which have highly similar binding motifs. The second application is MutSpace, which is used to estimate the cancer types of patients from somatic mutation patterns [67]. This method regards mutation patterns and cancer types as words and labels, respectively. MutSpace showed state-of-the-art performance in a breast cancer subclass classification problem. The high performance of these two applications suggests that StarSpace is likely to perform well in other bioinformatics problems.

A constrained semantic change search (CSCS) is a method for discovering word changes that significantly alter the semantics of an original sentence based on embedding techniques [102]. The key feature of this method is that it does not detect word changes that would abolish the grammar of the sentence but those that preserve the grammatical structure. For example, in an NLP task, CSCS can change “winegrowers revel in good season” to “winegrowers revel in flu season.” We define $x$ and $\tilde{x}$ as the original and mutated sentences, respectively. The embedded representations of $x$ and $\tilde{x}$ are defined as $z$ and $\tilde{z}$, respectively. Here, the semantic change is modeled as the distance between these embedded representations, that is, $\Delta z = \lVert z - \tilde{z} \rVert$. Additionally, the preservation of the grammatical structure is evaluated by $p(\tilde{x} \mid x)$, which is also modeled using embedding techniques. Finally, CSCS searches for the $\tilde{x}$ maximizing $\Delta z + \beta p(\tilde{x} \mid x)$, where $\beta$ is a scaling factor. One biological application of CSCS is the modeling of viral evolution [62]. This application considered viral proteins, preservation of the infectivity, and escape from antibody recognition as sentences, preservation of grammar, and semantic change, respectively, and detected escape mutations from immune systems as a result of the CSCS analysis. The analyses of HIV-1 and influenza viruses showed that mutations detected by the CSCS were in good agreement with the experimental mutation results.

Woloszynek et al. applied word2vec to a metagenomic dataset by regarding 4–15-mers in sequencing reads as words [43]. They trained word2vec with a skip-gram model using 2,262,986 full-length 16S rRNA amplicon sequences from GreenGenes [103], a microbial 16S rRNA sequence database obtained using metagenomic analysis. They verified the robustness of the model in a taxonomic identification task using an independent dataset of 16,699 full-length 16S rRNA sequences from the KEGG REST server [104] as a validation dataset. The embedding features exhibited superior performance to the k-mer frequency features. Additionally, the embedding was applied to the American Gut project dataset [105], which comprises 11,341 partial 16S rRNA sequences from three body sites (the gut, skin, and oral cavity), and showed performance comparable to that of conventional methods, such as sequence alignment, in the body site classification task. These results suggest that embeddings from pre-trained models can serve as an alternative to sequence alignment for metagenomic sequence profiling.
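
The CSCS objective described in this section, which adds a semantic change term (a distance between the embeddings of the original and mutated sequences) to a grammaticality term weighted by a scaling factor, can be sketched as a simple ranking procedure. The sketch below is illustrative only: the L1 distance, the function names, the candidate names, and the toy embeddings and probabilities are assumptions, and in practice both terms would come from a trained language model.

```python
def l1_distance(a, b):
    """L1 distance between two equal-length vectors (distance choice is an assumption)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cscs_score(z, z_mut, grammaticality, beta=1.0):
    """Semantic change plus beta times the grammaticality of the mutant."""
    return l1_distance(z, z_mut) + beta * grammaticality


def rank_mutations(z, candidates, beta=1.0):
    """Rank candidate mutants from highest to lowest CSCS score.

    candidates maps a mutant name to (embedding, grammaticality probability).
    """
    scores = {name: cscs_score(z, emb, prob, beta)
              for name, (emb, prob) in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    z_wild_type = [0.0, 0.0]  # toy embedding of the original sequence
    toy_candidates = {
        "mut_a": ([1.0, 0.0], 0.50),  # moderate change, plausible
        "mut_b": ([0.1, 0.0], 0.90),  # small change, very plausible
        "mut_c": ([2.0, 1.0], 0.01),  # large change, implausible
    }
    print(rank_mutations(z_wild_type, toy_candidates, beta=1.0))
    print(rank_mutations(z_wild_type, toy_candidates, beta=10.0))
```

Raising the scaling factor shifts the ranking toward mutants that preserve grammaticality (viability, in the viral setting), whereas a small value favors mutants with large semantic change regardless of plausibility.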

Summary and outlook

In this study, we introduced basic algorithms and reviewed the recent literature concerning representation learning applications in sequence analysis. Heinzinger et al. highlighted three difficulties in biological sequence modeling with NLP [68] as follows: (i) proteins range from approximately 30 to 33,000 residues, which is markedly longer than the average English sentence, which consists of 15 to 30 words [106]; (ii) proteins use only 20 amino acids in most cases; if we consider one amino acid as a word, the word repertoire is 1/100,000 of that of the English language, and if we consider a 3-mer as a word, the word repertoire is 1/10 to 1/100 of that of the English language; (iii) UniProt [90] is 10 times larger than Wikipedia in terms of data repository size, and extracting information from a very large biological database may require the use of a commensurate model. Embedding of biological sequences using NLP overcomes these difficulties and outperforms existing methods in several tasks, such as function, structure, localization, and disorder prediction (Table 1). In addition to these general biological tasks, representation learning has also been used to solve specific problems, such as RNA aptamer optimization [107], viral mutation prediction [62], and venom toxin prediction [41]. In these studies, representation learning of biological sequences could capture biophysical and biochemical properties of the biological systems.

The development of novel representation learning methods has been actively studied in machine learning research. For example, hyperbolic embedding methods have been pursued in recent years [108], [109]. These methods embed the data not in Euclidean space, which is utilized in all the studies introduced in this review, but in hyperbolic space.
The hyperbolic space exhibits constant negative curvature; thus, it shows characteristic geometric features not observed in Euclidean space, such as the sum of the interior angles of a triangle being less than 180°. Changes in the embedding space can considerably alter the efficiency of representation learning, and theoretical and experimental analyses have shown that hyperbolic embedding methods are suitable for data with hierarchical latent structure. Furthermore, research on embedding into more complex spaces, such as mixed-curvature spaces, has also attracted attention [110]. Although these non-Euclidean embedding methods have recently been used for various biological analyses, such as phylogenetic [111] and single-cell RNA-seq analyses [112], no applications exist for the biological sequence analysis emphasized in this review. Thus, the development of such an application is warranted.

New approaches are published daily in this field, and the scientific community is making sustained efforts to compare their accuracy and to validate their potential uses [71], [113], [114]. It is, therefore, important to ensure the models are available in an easy-to-use format with documentation. Considering that powerful computer resources are required for the establishment of large-scale language models, such as transformer-based models, researchers without access to these resources will be unable to reproduce them even with the source code. Additionally, considering the rapid growth of biological databases, the source code for creating models should be made available for future updates. Only a limited number of studies have published both the source code and the pre-trained model with the relevant documentation. Finally, participants in this community must publish their papers in a reproducible and verifiable format.

In this study, we comprehensively surveyed and reviewed the application of representation learning to biological sequence analysis.
Although NLP-based biological sequence analysis is in its early stages and warrants further development, in light of novel challenges in biology, such as single-cell analysis, genome design, and personalized medicine, representation learning may contribute to the progression of bioinformatics studies, thus revealing the grammar of life.
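
The hyperbolic embeddings mentioned in the outlook above can be made concrete with a short sketch of how distance is measured in such a space. It assumes the Poincaré ball model, which is one common way to represent hyperbolic space and is not necessarily the model used in the cited studies; the function names are hypothetical.

```python
import math


def squared_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(c * c for c in v)


def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit Poincare ball."""
    if squared_norm(u) >= 1.0 or squared_norm(v) >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


if __name__ == "__main__":
    origin = [0.0, 0.0]
    near_edge_a = [0.9, 0.0]
    near_edge_b = [-0.9, 0.0]
    # Euclidean distance between the two edge points is 1.8, but the
    # hyperbolic distance is far larger because space expands near the boundary.
    print(poincare_distance(origin, near_edge_a))
    print(poincare_distance(near_edge_a, near_edge_b))
```

Distances grow rapidly toward the boundary of the ball, so a tree-like hierarchy can place its root near the origin and its many leaves near the edge while keeping the leaves far apart from one another, which is one intuition for why such spaces suit data with hierarchical latent structure.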

CRediT authorship contribution statement

Hitoshi Iuchi: Conceptualization, Writing - original draft. Taro Matsutani: Writing - original draft. Keisuke Yamada: Writing - original draft. Natsuki Iwano: Writing - original draft. Shunsuke Sumi: Writing - original draft. Shion Hosoda: Writing - original draft. Shitao Zhao: Writing - original draft. Tsukasa Fukunaga: Writing - original draft. Michiaki Hamada: Supervision, Writing - review & editing.

Conflict of Interest

The authors have no conflicts of interest directly relevant to the content of this article.

68 in total

1. IDP-Seq2Seq: identification of intrinsically disordered regions based on sequence to sequence learning.

Authors: Yi-Jun Tang; Yi-He Pang; Bin Liu
Journal: Bioinformatics Date: 2021-01-29 Impact factor: 6.937

2. Protein classification using modified n-grams and skip-grams.

Authors: S M Ashiqul Islam; Benjamin J Heil; Christopher Michel Kearney; Erich J Baker
Journal: Bioinformatics Date: 2018-05-01 Impact factor: 6.937

3. Discovering nuclear targeting signal sequence through protein language learning and multivariate analysis.

Authors: Yun Guo; Yang Yang; Yan Huang; Hong-Bin Shen
Journal: Anal Biochem Date: 2019-12-26 Impact factor: 3.365

4. DeepKinZero: zero-shot learning for predicting kinase-phosphosite associations involving understudied kinases.

Authors: Iman Deznabi; Busra Arabaci; Mehmet Koyutürk; Oznur Tastan
Journal: Bioinformatics Date: 2020-06-01 Impact factor: 6.937

Review 5. A primer on deep learning in genomics.

Authors: James Zou; Mikael Huss; Abubakar Abid; Pejman Mohammadi; Ali Torkamani; Amalio Telenti
Journal: Nat Genet Date: 2018-11-26 Impact factor: 38.330

6. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking.

Authors: Michael M Mysinger; Michael Carchia; John J Irwin; Brian K Shoichet
Journal: J Med Chem Date: 2012-07-05 Impact factor: 7.446

7. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches.

Authors: Baris E Suzek; Yuqi Wang; Hongzhan Huang; Peter B McGarvey; Cathy H Wu
Journal: Bioinformatics Date: 2014-11-13 Impact factor: 6.937

8. Embeddings from deep learning transfer GO annotations beyond homology.

Authors: Maria Littmann; Michael Heinzinger; Christian Dallago; Tobias Olenyi; Burkhard Rost
Journal: Sci Rep Date: 2021-01-13 Impact factor: 4.379

9. CircSLNN: Identifying RBP-Binding Sites on circRNAs via Sequence Labeling Neural Networks.

Authors: Yuqi Ju; Liangliang Yuan; Yang Yang; Hai Zhao
Journal: Front Genet Date: 2019-11-22 Impact factor: 4.599
