Hitoshi Iuchi, Taro Matsutani, Keisuke Yamada, Natsuki Iwano, Shunsuke Sumi, Shion Hosoda, Shitao Zhao, Tsukasa Fukunaga, Michiaki Hamada.
Abstract
Although high-throughput sequencing has advanced remarkably, the ability to adequately analyze the large volume of rapidly generated biological (DNA/RNA/protein) sequence data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increasing attention. In this approach, biological sequences are regarded as sentences, and the single nucleotides/amino acids or k-mers in these sequences are regarded as words. Embedding, an essential step in NLP, converts these words into vectors. Representation learning is an approach for learning this conversion, and it can be applied to biological sequences. Vectorized biological sequences can then be used for function and structure estimation, or as input to other probabilistic models. Given the importance and growing use of representation learning in biological research, this study reviews the existing knowledge on representation learning for biological sequence analysis.
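The sentence/word analogy described in the abstract can be illustrated with a minimal sketch that splits a sequence into overlapping k-mers. The function name `kmer_tokenize`, the `stride` parameter, and the example sequence are assumptions made for illustration; they are not taken from the reviewed methods.

```python
def kmer_tokenize(sequence, k=3, stride=1):
    """Split a sequence (the "sentence") into k-mers (the "words")."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # Slide a window of length k over the sequence in steps of `stride`.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# Hypothetical DNA fragment used only for illustration.
sentence = kmer_tokenize("ATGCGTA", k=3)
print(sentence)  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA']

# Map each distinct k-mer to an integer index, as an embedding layer would expect.
vocab = {kmer: idx for idx, kmer in enumerate(sorted(set(sentence)))}
print(vocab)  # {'ATG': 0, 'CGT': 1, 'GCG': 2, 'GTA': 3, 'TGC': 4}
```

Overlapping k-mers (stride 1) and non-overlapping k-mers (stride equal to k) are both plausible choices; which one works better likely depends on the task.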
Keywords: BERT; Natural language processing; Representation learning; Sequence analysis; Word2vec
Year: 2021 PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Ideal representation learning should convert the names of foods, such as “sushi” and “pizza,” into similar vectors and assign different vectors to the names of organisms, such as “cow” and “frog.”
Fig. 2. Change in the number of hits for the search term “representation learning” (in double quotation marks) in PubMed (https://pubmed.ncbi.nlm.nih.gov/).
Fig. 3. Skip-gram model used in word2vec. This neural network model includes the following three fully connected layers: the input, hidden, and output layers. In this case, it attempts to learn features from the sentence “I am majoring in biology” and to predict the words surrounding the target word, “majoring.”
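As a rough illustration of what the skip-gram model is trained on, the sketch below enumerates (target word, context word) pairs for the example sentence in the caption. The window size of 2 and the helper name `skipgram_pairs` are assumptions; the neural network itself is not implemented here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "I am majoring in biology".split()
pairs = skipgram_pairs(tokens, window=2)

# Context words the model would be asked to predict from "majoring".
print([ctx for target, ctx in pairs if target == "majoring"])
# ['I', 'am', 'in', 'biology']
```

For biological sequences, the same function can be applied to a list of k-mers in place of the English words.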
Fig. 4. Graphical representation of a forward LSTM. Each input is the embedding of the t-th word, and the output is transformed to a probability using a softmax function. For example, if the t-th word is “majoring”, the model is trained to increase the probability that “majoring” is output from the hidden state calculated from the words up to the preceding word, “am”.
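The softmax step mentioned in the caption can be sketched as follows. The vocabulary and the raw output scores are hypothetical values chosen for illustration; in a real language model they would come from the LSTM output at the preceding position.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["I", "am", "majoring", "in", "biology"]
# Hypothetical scores for the next word after "I am" (assumed, not from the paper).
scores = [0.1, 0.2, 2.0, 0.3, 0.5]
probs = softmax(scores)

predicted = vocab[probs.index(max(probs))]
# Training lowers this negative log-likelihood of the true next word.
loss = -math.log(probs[vocab.index("majoring")])
print(predicted, round(sum(probs), 6))
```

Minimizing the negative log-likelihood is equivalent to increasing the probability assigned to the true next word, which is the training signal the caption describes.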
Fig. 5. Graphical representation of the bidirectional encoder representations from transformers (BERT) architecture. The special tokens ([CLS], [MASK], and [SEP]) enable the model to extract features based on self-attention over the whole sentence. BERT is trained with the following two tasks: masked language model (MLM) and next sentence prediction (NSP). In pre-training for MLM, the model predicts the original tokens at the masked positions (e.g., “have” and “dollars”) from the context before and after the masked tokens.
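A minimal sketch of how an MLM training example might be prepared is given below. It uses k-mer tokens and fixed mask positions so that the result is reproducible; both are assumptions for illustration. BERT itself selects about 15% of the tokens at random and does not always replace them with [MASK], which this sketch omits.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Wrap tokens in [CLS]/[SEP] and mask the given positions.

    Returns the masked sequence and a dict mapping each masked index
    (in the wrapped sequence) to the original token the model must recover.
    """
    masked = ["[CLS]"] + list(tokens) + ["[SEP]"]
    labels = {}
    for pos in positions:
        idx = pos + 1  # shift by one to account for the leading [CLS]
        labels[idx] = masked[idx]
        masked[idx] = mask_token
    return masked, labels


# Hypothetical k-mer "sentence"; positions 1 and 3 are masked by hand.
tokens = ["ATG", "TGC", "GCG", "CGT", "GTA"]
masked, labels = mask_tokens(tokens, positions=[1, 3])
print(masked)  # ['[CLS]', 'ATG', '[MASK]', 'GCG', '[MASK]', 'GTA', '[SEP]']
print(labels)  # {2: 'TGC', 4: 'CGT'}
```

During pre-training, the model would see `masked` as input and be scored only on its predictions at the indices listed in `labels`.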
Comprehensive survey of applications of representation learning to biological sequences
| Method name | Model | Training data | Task | Avail. and repr. | Ref. |
|---|---|---|---|---|---|
| ProtVec | word2vec | 547 K proteins | family classification, disorder prediction | + | |
| HLA-vec | word2vec | HLA-I binding/non-binding peptides | HLA-I binding prediction | ++ | |
| m-NGSG | word2vec | 0.1 K–3 K proteins | protein classification | ++ | |
| Gene2vec | word2vec | 89 K positive and 495 K negative mRNAs | N6-methyladenosine site prediction | ++ | |
| – | word2vec | 3 K–101 K of 300 bp genomic regulatory regions | regulatory region prediction | ++ | |
| ProtVecX | word2vec | 371–44 K proteins | venom toxin prediction, enzyme prediction | +++ | |
| MHCSeqNet | word2vec | 228 K peptide-MHC pairs | MHC binding prediction | +++ | |
| – | word2vec | 1 M 16S rRNAs | sample class (e.g., body part) prediction | +++ | |
| fastDNA | word2vec | 356–3 K bacterial genomes | species identification | ++ | |
| NucleoNN | word2vec | 86/72 SNPs in the control/exposure samples | investigating allele-interactions | ++ | |
| – | word2vec | 3 K–22 K CPI pairs | CPI prediction | +++ | |
| FastTrans | word2vec | 1 K membrane transporter and 1 K membrane non-transporter proteins | substrate prediction of transport proteins | ++ | |
| INSP | word2vec | 78 nuclear proteins | nuclear localization prediction | ++ | |
| – | word2vec | 9 M proteins | function prediction | ++ | |
| Its2vec | word2vec | 126 K ITSs | species identification | ++ | |
| 4mCNLP-Deep | word2vec | | N4-methylcytosine site prediction | ++ | |
| – | doc2vec | 525 K proteins | localization, T50, absorption, enantioselectivity prediction | +++ | |
| EP2vec | doc2vec | 650 K enhancers and 93 K promoters | enhancer-promoter interaction prediction | ++ | |
| IDP-Seq2Seq | Seq2Seq | 3 K proteins | disorder prediction | ++ | |
| – | GloVe | 244 K–504 K chromatin accessible regions | chromatin accessibility prediction | ++ | |
| CircSLNN | GloVe | 37 datasets of RBP-binding sites on circular RNAs | RBP-binding site prediction of circRNAs | + | |
| – | FastText | 3 K promoters and 3 K non-promoters | promoter strength classification | ++ | |
| iEnhancer-5Step | FastText | 1 K human enhancers and 1 K human non-enhancers | enhancer prediction | ++ | |
| TNFPred | FastText | 18 tumor and 133 non-tumor necrosis factors | tumor necrosis factors classification | ++ | |
| eDNN-EG | FastText | 518 essential and 1 K non-essential genes | essential gene prediction | + | |
| ProbeRating | FastText | 440 K proteins and 274 K nucleic acids | nucleic acid-binding proteins binding preference prediction | ++ | |
| CSCS | bi-LSTM | 4 K–58 K viral proteins | viral escape mutation prediction | +++ | |
| UniRep | mLSTM | 24 M proteins | structure and function prediction | +++ | |
| UDSMProt | AWD-LSTM language model | 499 K proteins | enzyme class prediction, gene ontology prediction, remote homology, fold detection | +++ | |
| USMPep | AWD-LSTM language model | 23 K–120 K MHC binding peptides | MHC binding affinity prediction | ++ | |
| BindSpace | StarSpace | 505 K TF-associated and 505 K non-associated DNA | TF-binding prediction | ++ | |
| MutSpace | StarSpace | cancer mutation sites | cancer type prediction | ++ | |
| SeqVec | ELMo | 33 M proteins | 3-state secondary structure prediction, disorder prediction, localization prediction, membrane prediction | ++ | |
| NuSpeak | ULMfit | 92 K RNAs | designing RNA toehold switches | ++ | |
| DNA-transformer | transformer | | transcription start site, translation initiation site, and 4mC methylation site prediction | ++ | |
| TAPE | BERT | 31 M proteins | 3-state secondary structure prediction, contact prediction, remote homology detection, fluorescence prediction, stability prediction | +++ | |
| ESM-1b | BERT | 27 M–250 M proteins | remote homology detection, 8-state secondary structure prediction, contact map prediction, quantitative prediction of mutational effects | ++ | |
| ProtBert | BERT | 216 M–2B proteins | 3-/8-state secondary structure prediction, subcellular localization prediction, membrane-boundness prediction | ++ | |
| DNABERT | BERT | | promoter prediction, TF-binding site prediction, splicing site prediction, functional variant analysis | +++ | |
| BERT4Bitter | BERT and bi-LSTM | 256 bitter and 256 non-bitter peptides | prediction of bitter peptides | ++ | |
| BERT-Enhancer | BERT and CNN | 1 K human enhancers and 1 K human non-enhancers | enhancer prediction | ++ | |
| BERT-RBP | BERT | 10 K RBP-bound and 10 K RBP-unbound RNA sequences | RNA-RBP interaction prediction | ++ |
Avail. and repr. indicate availability and reproducibility, respectively. (+++) The source code for generating the model, the pre-trained model, and detailed documentation, including data links and installation instructions, are all available. (++) Either the source code for generating the model or the pre-trained model is available, and detailed documentation, including data links and installation instructions, is available. (+) Either the source code for generating the model or the pre-trained model is available, but the documentation is limited. Model indicates the general model (described in Section 2) used in the method. K, kilo; M, mega; B, billion; HLA, human leukocyte antigen; MHC, major histocompatibility complex; CPI, compound–protein interaction; ITS, internal transcribed spacer; RBP, RNA binding protein; TF, transcription factor.