| Literature DB >> 33897979 |
Dan Ofer, Nadav Brandes, Michal Linial.
Abstract
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.Entities:
Keywords: Artificial neural networks; BERT; Bag of words; Bioinformatics; Contextualized embedding; Deep learning; Language models; Natural language processing; Tokenization; Transformer; Word embedding; Word2vec
Year: 2021 PMID: 33897979 PMCID: PMC8050421 DOI: 10.1016/j.csbj.2021.03.022
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1Computational analysis of natural language and proteins (A) Texts and proteins can be represented as strings of letters and processed with NLP methods to study local and global properties. (B) A common preprocessing step in NLP is the tokenization of text or protein sequences into distinct tokens, which are the atomic units of information. There are many different ways to tokenize text, e.g. as letters, words, or other substring pieces of equal or unequal length. (C) Bag-of-words representation can be used to count unique tokens in a text, turning every input text into a fixed-size vector. Subsequently, these vector representations can be analyzed through any machine-learning algorithm.
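The tokenization and bag-of-words steps of Fig. 1 can be sketched in a few lines. This is a minimal illustration, not from the paper: overlapping k-mers serve as tokens, and each sequence is mapped to a fixed-size count vector over a chosen vocabulary (the function names and the example sequence are invented for this sketch):

```python
from collections import Counter

def kmer_tokenize(sequence, k=3):
    """Split a protein sequence into overlapping k-mers (the 'words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def bag_of_words(sequence, vocabulary, k=3):
    """Count each vocabulary k-mer in the sequence, yielding a fixed-size vector."""
    counts = Counter(kmer_tokenize(sequence, k))
    return [counts[token] for token in vocabulary]

seq = "MKTAYIAKQR"               # toy amino-acid sequence
tokens = kmer_tokenize(seq)       # ['MKT', 'KTA', 'TAY', ...]
vocab = sorted(set(tokens))       # in practice, a vocabulary fixed over the corpus
vector = bag_of_words(seq, vocab)
```

Because every sequence maps to a vector of the same length (the vocabulary size), the resulting representations can be fed directly into any standard machine-learning algorithm, as the caption notes.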
Fig. 2Language models (A) Language models are trained on self-supervised tasks over huge corpora of unlabeled text. For example, in the masked language task, some fraction of the tokens in the original text are masked at random, and the language model attempts to predict the original text. (B) (Pre-)trained language models are commonly fine-tuned on downstream tasks over labeled text, through a standard supervised-learning approach. Fine-tuning is typically much faster and yields better performance than training a model from scratch, especially when labeled data is scarce.
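The masking step of Fig. 2A can be sketched as follows. This is an illustrative toy, not the paper's implementation: individual amino acids act as tokens, 15% of positions are masked (the rate popularized by BERT), and the model's training labels are the hidden tokens it must recover (the mask symbol and helper name are assumptions of this sketch):

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Randomly hide a fraction of tokens, returning the masked sequence
    and a {position: original_token} dict the model must predict."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked, labels = list(tokens), {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK
    return masked, labels

tokens = list("MKTAYIAKQRQISFVK")   # toy amino-acid sequence
masked, labels = mask_tokens(tokens)
```

Since the labels are generated automatically from the unlabeled sequence itself, no human annotation is needed; this is what makes the task self-supervised.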
| Term | Definition |
|---|---|
| Artificial neural networks | Artificial neural networks are a class of machine-learning models that can fit nonlinear, complex data. |
| Attention layer | A type of layer used in deep learning that allows the network to concentrate on specific elements in the input sequence. |
| Deep learning | Neural networks with many hidden layers are commonly referred to as “deep learning”. Deep-learning architectures include convolutional, recurrent and attention layers. |
| Features | Input properties fed into machine-learning algorithms (e.g. the length of a sequence) are commonly referred to as “features”. |
| Feature engineering | The creation and selection of features to extract from data, which is considered a crucial part of machine-learning projects. |
| Low-dimensional embedding | A mathematical mapping from a high-dimensional space of inputs to a lower-dimensional space of representations. |
| Post-translational modification (PTM) | Chemical enzymatic alterations of amino-acid residues in proteins which often lead to functional changes. Major PTMs include phosphorylation, glycosylation and proteolytic cleavage. |
| Protein domain | An evolutionary-conserved protein region with independent, well-defined 3D structure and function. Many proteins contain multiple domains. |
| Protein motif | A short, conserved segment of amino acids in a protein associated with some function such as binding properties. |
| Self-supervised learning | A machine-learning paradigm for training supervised models over unsupervised (namely unlabeled) datasets by automatically generating labels. In text, an example is predicting the next word in a sentence. |
| Transfer learning | Taking a model trained to solve one problem, and fine-tuning its parameters to solve another, related task. For example, training a computer-vision model to recognize cars, and then teaching it to recognize trucks. |
| Transformers | A deep-learning architecture consisting of attention-based layers that is particularly suited for sequence inputs and outputs. |