| Literature DB >> 33897979 |
Dan Ofer, Nadav Brandes, Michal Linial.
Abstract
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.Entities:
Keywords: Artificial neural networks; BERT; Bag of words; Bioinformatics; Contextualized embedding; Deep learning; Language models; Natural language processing; Tokenization; Transformer; Word embedding; Word2vec
Year: 2021 PMID: 33897979 PMCID: PMC8050421 DOI: 10.1016/j.csbj.2021.03.022
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1Computational analysis of natural language and proteins (A) Texts and proteins can be represented as strings of letters and processed with NLP methods to study local and global properties. (B) A common preprocessing step in NLP is the tokenization of text or protein sequences into distinct tokens, which are the atomic units of information. There are many different ways to tokenize text, e.g. as letters, words, or other substring pieces of equal or unequal length. (C) Bag-of-words representation can be used to count unique tokens in a text, turning every input text into a fixed-size vector. Subsequently, these vector representations can be analyzed through any machine-learning algorithm.
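The tokenization and bag-of-words steps of Fig. 1 can be sketched in a few lines. This is a minimal illustration, not from the paper: overlapping k-mers serve as tokens, and each sequence is mapped to a fixed-size count vector over a chosen vocabulary (the function names and the example sequence are invented for this sketch):

```python
from collections import Counter

def kmer_tokenize(sequence, k=3):
    """Split a protein sequence into overlapping k-mers (the 'words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def bag_of_words(sequence, vocabulary, k=3):
    """Count each vocabulary k-mer in the sequence, yielding a fixed-size vector."""
    counts = Counter(kmer_tokenize(sequence, k))
    return [counts[token] for token in vocabulary]

seq = "MKTAYIAKQR"               # toy amino-acid sequence
tokens = kmer_tokenize(seq)       # ['MKT', 'KTA', 'TAY', ...]
vocab = sorted(set(tokens))       # in practice, a vocabulary fixed over the corpus
vector = bag_of_words(seq, vocab)
```

Because every sequence maps to a vector of the same length (the vocabulary size), the resulting representations can be fed directly into any standard machine-learning algorithm, as the caption notes.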
Fig. 2Language models (A) Language models are trained on self-supervised tasks over huge corpora of unlabeled text. For example, in the masked language task, some fraction of the tokens in the original text are masked at random, and the language model attempts to predict the original text. (B) (Pre-)trained language models are commonly fine-tuned on downstream tasks over labeled text, through a standard supervised-learning approach. Fine-tuning is typically much faster and yields better performance than training a model from scratch, especially when labeled data is scarce.
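The masking step of Fig. 2A can be sketched as follows. This is an illustrative toy, not the paper's implementation: individual amino acids act as tokens, 15% of positions are masked (the rate popularized by BERT), and the model's training labels are the hidden tokens it must recover (the mask symbol and helper name are assumptions of this sketch):

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Randomly hide a fraction of tokens, returning the masked sequence
    and a {position: original_token} dict the model must predict."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked, labels = list(tokens), {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK
    return masked, labels

tokens = list("MKTAYIAKQRQISFVK")   # toy amino-acid sequence
masked, labels = mask_tokens(tokens)
```

Since the labels are generated automatically from the unlabeled sequence itself, no human annotation is needed; this is what makes the task self-supervised.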
| Term | Definition |
|---|---|
| Artificial neural networks | Artificial neural networks are a class of machine-learning models that can fit nonlinear, complex data. |
| Attention layer | A type of layer used in deep learning that allows the network to concentrate on specific elements in the input sequence. |
| Deep learning | Neural networks with many hidden layers are commonly referred to as “deep learning”. Deep-learning architectures include convolutional, recurrent and attention layers. |
| Features | Input properties fed into machine-learning algorithms (e.g. the length of a sequence) are commonly referred to as “features”. |
| Feature engineering | The creation and selection of features to extract from data, which is considered a crucial part of machine-learning projects. |
| Low-dimensional embedding | A mathematical mapping from a high-dimensional space of inputs to a lower-dimensional space of representations. |
| Post-translational modification (PTM) | Chemical enzymatic alterations of amino-acid residues in proteins which often lead to functional changes. Major PTMs include phosphorylation, glycosylation and proteolytic cleavage. |
| Protein domain | An evolutionary-conserved protein region with independent, well-defined 3D structure and function. Many proteins contain multiple domains. |
| Protein motif | A short, conserved segment of amino acids in a protein associated with some function such as binding properties. |
| Self-supervised learning | A machine-learning paradigm for training supervised models over unsupervised (namely unlabeled) datasets by automatically generating labels. In text, an example is predicting the next word in a sentence. |
| Transfer learning | Taking a model trained to solve one problem, and fine-tuning its parameters to solve another, related task. For example, training a computer-vision model to recognize cars, and then teaching it to recognize trucks. |
| Transformers | A deep-learning architecture consisting of attention-based layers that is particularly suited for sequence inputs and outputs. |