| Literature DB >> 30867681 |
Sebastian Spänig, Dominik Heider.
Abstract
Antimicrobial peptides (AMPs) are part of the innate immune system. In fact, they occur in almost all organisms, including plants, animals, and humans. Remarkably, they are effective with high selectivity even against multi-resistant pathogens. This is especially crucial at a time when society faces the major threat of an ever-increasing number of antibiotic-resistant microbes. In addition, AMPs can also exhibit antitumor and antiviral effects, so a variety of scientific studies have dealt with the prediction of active peptides in recent years. Owing to this potential, even the pharmaceutical industry is keen on discovering and developing novel AMPs. However, AMPs are difficult to verify in vitro; hence, researchers conduct sequence-similarity searches against known active peptides. Unfortunately, this approach is very time-consuming and limits potential candidates to sequences with high similarity to known AMPs. Machine learning methods offer the opportunity to explore the huge space of sequence variations in a timely manner, and these algorithms have, in principle, paved the way for an automated discovery of AMPs. However, machine learning models require numerical input, so an informative encoding is very important. Unfortunately, developing an appropriate encoding is a major challenge, which has not been entirely solved so far. For this reason, the development of novel amino acid encodings has become established as a stand-alone research branch. The present review introduces state-of-the-art encodings of amino acids as well as their properties in sequence- and structure-based aggregation. Moreover, although a well-chosen encoding is essential, performant classifiers are also required, which is reflected by a tendency towards specifically designed models in the literature. Furthermore, we introduce these models with a particular focus on encodings derived from support vector machines and deep learning approaches.
Although a strong focus has been set on AMP prediction, not all of the mentioned encodings were elaborated as part of antimicrobial research studies; some were rather developed as general protein or peptide representations.
Keywords: Antimicrobial peptides; Encodings; Machine learning
Year: 2019 PMID: 30867681 PMCID: PMC6399931 DOI: 10.1186/s13040-019-0196-x
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Summary of sequence-based encodings
| Encoding | Description | Summary | Used along with |
|---|---|---|---|
| Sparse | each amino acid is represented as a one-hot vector of length 20, in which every position except one is set to 0 | Density: - | Substitution Matrix, Amino Acid Composition |
| Amino Acid Composition | the feature vector contains at each position the proportion of an amino acid relative to the sequence length | Density: + | Distance Frequency |
| Distance Frequency | calculates the distances between amino acids of similar properties and bins the occurrences according to the gap length | Density: + | Amino Acid Composition |
| Quantitative Matrix | encodes the propensity of each amino acid at each position | Density: + | Amino Acid Composition |
| CTD | describes the composition (C), transition (T), and distribution (D) of similar amino acids along the peptide sequence | Density: + | Amino Acid Composition |
| Pseudo-amino Acid Composition (PseAAC) | computes correlations over different ranges between pairs of amino acids | Density: + | Dipeptide Composition |
| Reduced Amino Acid Alphabet | similar amino acids are grouped together | Density: + | N-gram Model, AAIndexLoc |
| N-gram Model | counts the occurrences of n-mers over an alphabet of size m, leading to an m^n-dimensional, sparse representation of the initial sequence | Density: - | Reduced Amino Acid Alphabet |
| AAIndexLoc | k-nearest-neighbor clustering aggregates amino acids into 5 classes based on their amino acid index, i.e., amino acids with the highest (T), high (H), medium (M), low (L), and lowest (B) values of a particular physicochemical property are clustered together | Density: o | Dipeptide Composition |
| Physicochemical Properties | translation of an amino acid into a particular physicochemical property | Density: o | z-descriptor, d-descriptor, and many more |
| z-descriptor | derived from the principal components of physicochemical properties via partial least squares (PLS) projection; PLS yields a subset of five final features capable of describing the 20 proteinogenic as well as 67 additional amino acids | Density: + | Physicochemical Properties |
| d-descriptor | the amino acid sequence is squeezed between the y-axis (N-terminus) and the x-axis (C-terminus) with gradual bending of the single amino acids and subsequent vector summation | Density: + | Physicochemical Properties |
| Autocorrelation | captures the interdependence between two distant amino acids in a peptide sequence | Density: + | |
| Substitution/Scoring Matrix | provides accepted mutations between amino acid pairs, i.e., sequence alterations with either no or a positive impact on protein function | Density: + | BLOMAP, Sparse, Amino Acid Composition, Dipeptide Composition, PseAAC, AAIndexLoc |
| BLOMAP | projects the high-dimensional input space of the BLOSUM62 substitution matrix to a lower dimension using the Sammon projection | Density: + | Substitution/Scoring Matrix |
| Fourier Transformation | transforms the encoded sequence (a signal) into the frequency domain to detect underlying periodic patterns | Density: o | |
+ (good), o (neutral/no declaration), − (bad). For instance, “Density: −” means the encoding results in a high-dimensional feature space, and “Information: +” reflects a representative mapping from the residue sequence to the numerical vector. “o” denotes encodings that are difficult to classify due to missing details in the respective publication, or that can be considered neutral. In general, the classification rests upon the authors’ experience and shall support researchers in quickly grasping suitable encodings. Nevertheless, an encoding rated “−” might still work well for a particular application and should by no means be regarded as conclusively evaluated
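To make the first two table rows concrete, the following minimal Python sketch (not taken from any of the reviewed tools; the peptide `ACCA` is a made-up toy sequence) contrasts the sparse one-hot encoding with the dense amino acid composition:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 proteinogenic amino acids

def sparse_encode(sequence):
    """Sparse (one-hot) encoding: each residue becomes a binary vector of
    length 20 with a single 1 at the index of that amino acid."""
    vectors = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        vector[AMINO_ACIDS.index(residue)] = 1
        vectors.append(vector)
    return vectors

def amino_acid_composition(sequence):
    """Amino acid composition: the proportion of each amino acid relative to
    the sequence length, yielding a dense, fixed-length vector of 20 values."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]

peptide = "ACCA"  # toy example
print(sparse_encode(peptide)[0])        # one-hot vector for the first residue
print(amino_acid_composition(peptide))  # 0.5 for A and C, 0 elsewhere
```

The contrast behind the “Density” ratings is visible directly: the sparse encoding grows with sequence length (one 20-dimensional vector per residue), whereas the composition always yields a single 20-dimensional vector.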
Summary of structure-derived encodings
| Encoding | Description | Summary | Used along with |
|---|---|---|---|
| Quantitative structure-activity relationship (QSAR) | describes amino acid sequences by their chemical properties, molecular characteristics, and structure | Density: o | z-descriptor |
| General Structure | the protein structure is described by its overall 3D shape, secondary structure, solvent accessibility, aggregation tendency, contact number, and residue depth | Density: + | |
| Electrostatic Hull | wraps the superimposed shapes of a protein's substructures | Density: o | Physicochemical Properties |
| Spheres | incorporates structural variations as a consequence of sequential rearrangements | Density: o | Physicochemical Properties |
| Distance Distribution | distribution of the Euclidean distances between each pair of atom types | Density: o | |
| Delaunay Triangulation | encodes the complete protein shape by finding optimal edges between representative atoms | Density: o | |
+ (good), o (neutral/no declaration), − (bad) (see Table 1 for further details)
Summary of alternative encodings (see Table 1 for further details)
| Encoding | Description | Summary | Used along with |
|---|---|---|---|
| Chaos Game Representation (CGR) | a visual encoding of a sequence, generating a fractal | Density: - | Physicochemical Properties |
| Linguistic Model | description of AMPs by a grammar | Density: o | |
Fig. 1 The single-letter amino acid composition counts the occurrences of the respective amino acids
Fig. 2 Sketch of sequence-based encodings derived from autocorrelation and the reduced amino acid alphabet. a Autocorrelation and pseudo-amino acid composition from adjacent residues, considering a gap size of one. b Reduced amino acid alphabet; the clustering corresponds to similar physicochemical properties, according to Veltri et al. (2017)
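As an illustration of a reduced amino acid alphabet, the following Python snippet maps the 20 residues to five coarse physicochemical groups. The grouping shown here is one plausible choice for illustration, not the exact clustering of Veltri et al. (2017), and the peptide is a toy sequence:

```python
# Illustrative five-group reduction of the 20 proteinogenic amino acids by
# broad physicochemical class; published clusterings (e.g., Veltri et al.
# 2017) may group the residues differently.
REDUCED_ALPHABET = {
    "h": "AVILMFWC",  # hydrophobic
    "p": "STNQY",     # polar, uncharged
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
    "s": "GP",        # special conformational roles
}
RESIDUE_TO_GROUP = {aa: symbol
                    for symbol, members in REDUCED_ALPHABET.items()
                    for aa in members}

def reduce_sequence(sequence):
    """Map each residue to its group symbol, shrinking the alphabet from 20 to 5."""
    return "".join(RESIDUE_TO_GROUP[aa] for aa in sequence)

print(reduce_sequence("KWKLFKKI"))  # -> '+h+hh++h'
```

Downstream encodings such as the n-gram model benefit directly: with a 5-letter alphabet, the number of possible 3-mers drops from 20^3 = 8000 to 5^3 = 125.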
Fig. 3 Similar to the amino acid composition, the k-mer composition counts the occurrences of k-mers. In this example, k is set to three
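A minimal sketch of the k-mer composition with k = 3 (the sequence is a toy example, not drawn from the reviewed studies):

```python
from collections import Counter

def kmer_composition(sequence, k=3):
    """Slide a window of length k over the sequence and report the relative
    frequency of every observed k-mer."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return {kmer: n / len(kmers) for kmer, n in counts.items()}

print(kmer_composition("GIGAGIG"))  # 'GIG' occurs twice among five 3-mers -> 0.4
```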
Fig. 4 Sketch of sequence-based encodings derived from physicochemical properties and the Fourier transformation. a The numerical representations are based on the physicochemical properties of serine (S), glutamine (Q), valine (V), threonine (T), asparagine (N), and alanine (A). b Fourier transformation derived from the encoded peptide sequence
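The two panels can be sketched as follows: residues are first translated into a physicochemical signal (here approximate Eisenberg consensus hydrophobicity values; treat the exact numbers as illustrative) and then transformed into the frequency domain via a plain discrete Fourier transform:

```python
import cmath

# Approximate Eisenberg consensus hydrophobicity values for the residues in
# Fig. 4 -- illustrative numbers, not an exact reproduction of the figure.
HYDROPHOBICITY = {"S": -0.18, "Q": -0.85, "V": 1.08,
                  "T": -0.05, "N": -0.78, "A": 0.62}

def encode_physicochemical(sequence, scale=HYDROPHOBICITY):
    """Translate each residue into a numerical physicochemical property."""
    return [scale[aa] for aa in sequence]

def dft_magnitudes(signal):
    """Plain discrete Fourier transform; periodic patterns in the encoded
    sequence appear as peaks in the magnitude spectrum."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(signal)))
            for k in range(n)]

spectrum = dft_magnitudes(encode_physicochemical("SQVTNA"))
print(spectrum)  # one magnitude per frequency component
```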
Fig. 5 Exemplary structure-based encodings for the antimicrobial peptide Human Defensin 5 (PDB: 2LXZ). a Solvent-accessible surface, color-coded according to the hydrophobicity scale of Eisenberg et al. (1984). b Delaunay triangulation of the same peptide, calculated from the Cα atoms. Bose et al. (2011) used the summed distances between amino acid pairs to encode protein structure
Different encodings from deep learning models (see Table 1 for details)
| Encoding | Description | Summary | Used along with |
|---|---|---|---|
| ProtVec | amino acid sequences are encoded as distributed representations of k-mers | Density: + | |
| Voxel | protein structures are encoded as voxels | Density: o | |
| Matrix | mimics images by regarding the entries of a PSSM as pixel intensities | Density: o | PSSM |
| Autoencoder | extracts representative characteristics in order to reproduce the input as faithfully as possible | Density: + | |
Different types of string kernels (see Table 1 for further details)
| Encoding | Description | Summary | Used along with |
|---|---|---|---|
| Spectrum Kernel | generates all possible subsequences of length k and counts the occurrences of these k-mers | Density: - | |
| Mismatch Kernel | allows a certain distance, i.e., mismatches, between two k-mers | Density: - | General Structure |
| Distant Segment Kernel | allows a gap between two k-mers | Density: - | |
| Local Alignment Kernel | obtained from local alignment scores | Density: + | Spectrum Kernel, Mismatch Kernel, Subsequence Kernel |
| Subsequence Kernel | measures sequence similarity; gaps within k-mers are taken into account | Density: + | Frequency of Amino Acid Pairs |
| Frequency of Amino Acid Pairs | similar to the dipeptide composition | Density: - | |
| String Kernels + Physicochemical Properties | extends existing string kernels such that they incorporate physicochemical properties | Density: + | Physicochemical Properties |
| Generic String Kernel | string kernel with physicochemical properties and a penalty for non-adjacent segments | Density: + | |
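As an illustration of the spectrum kernel listed above, the following Python sketch computes the inner product of the k-mer count vectors of two toy sequences (the sequences are made up, not drawn from the reviewed studies):

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count all overlapping k-mers in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=2):
    """Spectrum kernel: the inner product of the two k-mer count vectors,
    i.e., the sum over shared k-mers of the products of their counts."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(counts_a[kmer] * counts_b[kmer]
               for kmer in counts_a.keys() & counts_b.keys())

print(spectrum_kernel("AKKA", "KKAA"))  # shared 2-mers KK and KA -> 1*1 + 1*1 = 2
```

Because the kernel only ever touches shared k-mers, the exponentially large feature space (Density: -) never has to be built explicitly, which is what makes such kernels practical inside support vector machines.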