
Research progress of reduced amino acid alphabets in protein analysis and prediction.

Yuchao Liang¹, Siqi Yang¹, Lei Zheng¹, Hao Wang¹, Jian Zhou¹, Shenghui Huang¹, Lei Yang², Yongchun Zuo¹.

Abstract

Proteins are the executors of cellular physiological activities, and accurate elucidation of their structure and function is crucial for the refined mapping of proteins. As a feature engineering method, the reduction of amino acid composition is not only an important method for protein structure and function analysis, but also opens a broad horizon for the complex field of machine learning. Representing sequences with fewer amino acid types greatly reduces the dimensionality and noise of traditional feature engineering, and yields more interpretable predictive models that help machine learning capture key features. In this paper, we systematically review the strategies and methods of reduced amino acid (RAA) alphabets, and summarize the main research on them in protein sequence alignment, functional classification, and prediction of structural properties. Finally, we give a comprehensive analysis of 672 RAA alphabets from 74 reduction methods.

Entities: Chemical

Keywords: Machine learning; Protein classification; Reduced amino acid alphabets; Sequence alignment; Structure analysis

Year: 2022 PMID: 35860409 PMCID: PMC9284397 DOI: 10.1016/j.csbj.2022.07.001

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155

Introduction

As the direct executors of cellular life activities, proteins have received much attention over the past few decades. With the maturing of technologies such as high-throughput sequencing, mass spectrometry, and co-immunoprecipitation, more and more protein sequence, structure, and function data have been annotated and published, which has opened the way for human proteomics research [1], [2]. However, annotating protein information experimentally has many drawbacks: it is time-consuming, inefficient, and requires expensive consumables. In recent years, analysis and prediction methods based on machine learning and artificial intelligence have been continuously developed and applied to biology and bioinformatics, greatly shortening experimental time and improving experimental efficiency [3], [4]. However, researchers' work is hindered by cumbersome feature engineering, increasingly complex network architectures, and ever-rising hardware requirements [5], [6]. To strike a balance, various feature analysis and optimization methods have been developed, such as principal component analysis (PCA), the Relief algorithm, F-score, and linear discriminant analysis (LDA), along with more streamlined model architectures such as deep residual networks (ResNet) [7], [8], [9], [10], [11]. Simplified amino acid composition greatly reduces the dimensionality of traditional feature engineering, effectively suppresses the negative effects of noise, and provides the model with richer biological prior knowledge for extracting key features [12], [13]. In addition, it is compatible with many existing methods, which helps to promote the further integration of traditional machine learning and biology [14], [15].

Reduced amino acid (RAA) alphabets are not a recent invention; they were mentioned as early as the 1960s. Morita et al. proposed in 1967 that three-cluster random polypeptide segments (Glu, Lys, Ala) can form an α-helix [16]. In 1992, Heinz et al. confirmed through mutation experiments on phage T4 lysozyme that amino acid sequences contain a great deal of redundant information [17]. In the same year, Osawa et al. demonstrated that amino acid types evolved from simple to complex [18]. In 1997, Riddle et al. proposed a five-cluster reduction scheme based on phage display of the SH3 domain [19], which Wolynes examined from an energetic perspective [20]. Schafmeister et al. also proposed a seven-cluster reduction scheme for synthesizing four-helix bundle proteins [21]. In 1999, Wang et al. proposed a minimal-mismatch-based RAA alphabet named HP, which laid a theoretical foundation for research on RAA alphabets [22]. Their model still plays an important role in many theories today.

As part of feature engineering, the defining property of RAA composition is that it fundamentally redefines the sequence. For any protein sequence, the 20 amino acid residues are grouped by a specific method and each group is assigned a new identifier (Fig. 1A). Sequences are then constructed from these cluster identifiers and mapped one-to-one onto the natural sequence (Fig. 1C). Following specific clustering rules, RAA alphabets with different numbers of clusters (Size 2–19) are constructed, which makes it easier to adapt the same reduction method to different protein data (Fig. 1B).
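The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it uses the two-cluster alphabet given in the Fig. 1 caption (AMWLYCFIV-PGHTSDEKNQR), represents each cluster by its first letter, and assumes that characters outside the 20 standard residues are passed through unchanged. The function names and the example sequence are hypothetical.

```python
# Two-cluster alphabet from the Fig. 1 caption, written as hyphen-separated clusters.
TWO_CLUSTER = "AMWLYCFIV-PGHTSDEKNQR"


def build_mapping(alphabet: str) -> dict[str, str]:
    """Map every residue to the first letter of the cluster that contains it."""
    mapping = {}
    for cluster in alphabet.split("-"):
        representative = cluster[0]
        for residue in cluster:
            mapping[residue] = representative
    return mapping


def reduce_sequence(sequence: str, alphabet: str) -> str:
    """Rewrite a protein sequence residue by residue in the reduced alphabet."""
    mapping = build_mapping(alphabet)
    # Assumption: symbols that are not in the alphabet (e.g. X) are kept as they are.
    return "".join(mapping.get(residue, residue) for residue in sequence.upper())


example = "MKTAYIAKQR"  # hypothetical sequence, for illustration only
print(reduce_sequence(example, TWO_CLUSTER))  # prints APPAAAAPPP
```

Because the mapping is applied position by position, the reduced sequence has the same length as the original, which is what allows the one-to-one alignment of natural and reduced sequences.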

Fig. 1

A reduced amino acid alphabet of two clusters (AMWLYCFIV-PGHTSDEKNQR) can be represented in a protein sequence by the letters A and P. A: Sankey diagram of the RAA alphabet; the colors on the left represent the 20 amino acids, and the right side shows them being gradually clustered into two clusters. B: RAA alphabets with different numbers of clusters under the same reduction method (Size 2–11). C: Alignment of the original sequence and the RAA sequence. D: Application of the RAA alphabet in WebLogo. Left, middle, and right show, respectively, the original WebLogo, the RaacLogo labeled by the first letter of each cluster, and the RaacLogo colored by cluster.

In the following, we systematically review the methodological studies of reduced amino acid alphabets and their major progress in protein sequence alignment, functional classification, and prediction of structural properties. The 672 RAA alphabets from the 74 reduction methods are comprehensively discussed at the end.

The reduction methods of natural amino acid alphabets

Since the beginning of the 21st century, the rapid development of computer technology and the rise of various amino acid mutation matrices (such as Miyazawa and Jernigan's MJ matrix [23], the BLOSUM matrices [24], [25], the PAM matrices [26], [27], the JTT matrix [28], and the WAG matrix [29]) have expanded the applications of RAA alphabets. Murphy et al. used the BLOSUM50 mutation matrix to illustrate the effect of RAA alphabets on protein folding and predicted that only 10–12 clusters would be required to represent different protein families [30]. Kosiol et al. constructed new RAA alphabets using a Markov model based on the PAM and WAG matrices, which are well known in sequence alignment and phylogenetics [31]. Cannata et al. used multiple substitution matrices such as PAM and BLOSUM to perform an exhaustive analysis of all possible RAA alphabets and made it available through the web server AlphaSimp [32].

Some biologists then put RAA alphabets into practical use in existing research. Akanuma et al. used site-directed mutagenesis to restrict 88% of the sequence of Escherichia coli orotate phosphoribosyltransferase to a reduced set of nine amino acids (A, D, G, L, P, R, T, V, Y) without affecting the structure and function of the protein [33]. Davies et al. developed a G protein-coupled receptor (GPCR) classifier by combining an artificial immune system (AIS) algorithm with RAA alphabets, and achieved good results [34].

In past research, RAA-based protein prediction methods mostly relied on traditional machine learning techniques such as the support vector machine (SVM) [35], and achieved strong performance in many scenarios. For example, in 2004, Weathers et al. combined RAA alphabets with an SVM to classify and predict intrinsically disordered proteins, achieving an accuracy of about 87% [36]. In 2009, Bacardit et al. used the RAA-based BioHEL to predict the contact number and relative solvent accessibility of protein structures [37]. In 2015, Yang et al. proposed an RAA-SVM model for predicting protein subcellular localization and compared the prediction performance of different machine learning models in detail [38].

A new generation of deep-learning-based algorithms has greatly enhanced the customization and application of RAA alphabets. In 2001, Meiler et al. published an RAA alphabet generation method based on an artificial neural network, and proposed that each amino acid can be replaced by several sets of physical features [39]. In 2020, Oberti et al. used a convolutional neural network with RAA alphabets to predict the intrinsically disordered regions of proteins [40].
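To give a sense of the size of the search space behind an exhaustive analysis of all possible RAA alphabets, the sketch below counts the ways of partitioning 20 amino acids into k non-empty clusters, which is the Stirling number of the second kind S(20, k). It is a generic combinatorial illustration under the assumption that an alphabet is an unordered partition with no further constraints; it does not reproduce the procedure of any cited study, and the function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to split n labelled items into k non-empty, unordered clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # The n-th item either forms a new cluster on its own
    # or joins one of the k clusters already formed by the other n - 1 items.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


# Number of candidate alphabets for a few cluster sizes of the 20 amino acids.
for size in (2, 5, 10, 19):
    print(size, stirling2(20, size))

# Total number of candidate alphabets over Size 2-19.
print(sum(stirling2(20, k) for k in range(2, 20)))
```

Even the two-cluster case already allows 2^19 - 1 distinct groupings, which is why substitution matrices, clustering rules, or search heuristics are used to pick out a small number of meaningful alphabets.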

The application of reduced amino acid alphabets for sequence alignment

Sequence alignment and sequence search algorithms are among the most commonly used methods in bioinformatics and are the cornerstone of many mainstream protein analysis methods. However, with the continuous growth of protein data and sequence complexity, the efficiency of multiple sequence alignment against huge databases has become increasingly unsatisfactory, and alignment algorithms for proteins usually have high time complexity because of sequence diversity. Many studies have sought to speed up sequence alignment in different ways, and RAA composition has been used as a common dimension reduction method in many of them. Murphy et al. analyzed in detail the alignment performance of RAA alphabets of different sizes, and pointed out that alphabets with fewer than 10 clusters lose a great deal of sequence information [30]. Ye et al. developed the fast protein similarity search tools RAPSearch and RAPSearch2 based on a 10-cluster RAA alphabet; they are 20–90 times faster than BLAST, with a larger speed-up for shorter reads [41], [42]. Buchfink et al. developed DIAMOND, a fast protein sequence alignment algorithm based on double indexing with an RAA alphabet [43]. It is 40–20,000 times faster than BLAST with comparable sensitivity, which greatly improves alignment efficiency on large databases. Steinegger et al. proposed that k-mers based on RAA alphabets and the BLOSUM62 matrix can greatly improve the efficiency of sequence alignment, and developed the MMseqs series of sequence search and clustering algorithms and tools on this basis [44], [45], [46]. Melo et al. used RAA composition to align distantly homologous sequences, and pointed out that using fewer amino acid types improves the alignment of conserved structures in distant homologs [47].
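The idea of seeding a search with k-mers over a reduced alphabet can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the algorithm of RAPSearch, DIAMOND, or MMseqs: the reduced alphabet simply merges the five residue pairs ST, FY, RK, DE, and IV that this review reports as the most frequent groupings and leaves every other residue on its own, and both sequences are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Illustrative grouping only: the five frequent pairs reported in this review.
MERGED_PAIRS = ["ST", "FY", "RK", "DE", "IV"]


def build_mapping(pairs):
    """Map each residue to itself, except merged residues, which map to the pair's first letter."""
    mapping = {aa: aa for aa in AMINO_ACIDS}
    for pair in pairs:
        for aa in pair:
            mapping[aa] = pair[0]
    return mapping


def kmers(sequence, k):
    """Return the set of all overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_seeds(a, b, k, mapping=None):
    """Exact k-mer matches between two sequences, optionally after alphabet reduction."""
    if mapping is not None:
        a = "".join(mapping.get(r, r) for r in a)
        b = "".join(mapping.get(r, r) for r in b)
    return kmers(a, k) & kmers(b, k)


query = "MSDKIFLT"   # hypothetical query
target = "MTERVYLS"  # same length, differing only by S/T, D/E, K/R, I/V, F/Y swaps

print(shared_seeds(query, target, 4))                               # no exact 4-mer in common
print(shared_seeds(query, target, 4, build_mapping(MERGED_PAIRS)))  # all 4-mers match after reduction
```

In the 20-letter alphabet the two sequences share no exact 4-mer, whereas after reduction every 4-mer matches. Exact matching on reduced k-mers therefore tolerates conservative substitutions while still allowing fast hash-based lookup, which is the general motivation for using reduced alphabets at the seeding stage.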

The classification of protein function based on reduced amino acid alphabets

With the exponential growth of proteomic data, using machine learning to mine the intrinsic sequence regularities behind the functions of known proteins, and to accurately predict the function, family, and cellular localization of unknown proteins, has become a focus of current research. Simplified amino acid alphabets greatly expand the ways in which protein sequence features can be represented, and restore sequences that appear complex and disordered because of evolutionary mutation to a more conserved and concise state. This not only helps explain the sequence properties and evolutionary direction of proteins, but also improves the prediction performance of models. In 2007, Chen et al. constructed a six-cluster reduced alphabet based on amino acid hydrophilicity and hydrophobicity, which successfully predicted the subcellular localization of apoptotic proteins and emphasized the importance of hydrophilicity in the study of protein subcellular localization [48]. In 2012, Lin et al. constructed a multi-class model of the ketoacyl synthase family based on RAA-SVM, which enabled the SVM to capture important compositional features of proteins [49]. In 2013, Feng et al. developed iHSP-PseRAAAC for predicting heat shock proteins and achieved good performance in complex classification tasks [50]. In 2014, Liu et al. published a prediction model for DNA-binding proteins based on RAA alphabets, which greatly reduced the feature dimension of traditional pseudo-amino acid composition and improved prediction performance [51]. Similarly, our previous work successfully applied RAA alphabets to protein subtype classification, protein subfamily classification, and protein subcellular localization [52], [53], [54]. In 2018, Veltri et al. published a deep-learning-based reduced alphabet model that improved the recognition accuracy of antimicrobial peptides [55].

It is worth noting that the choice of reduced alphabet directly affects classification performance, so it is important to choose the most suitable reduction scheme among the large number of candidates. In 2008, Davies et al. used an artificial immune system (AIS) to screen for the RAA alphabets most suitable for G protein-coupled receptors, and analyzed the contribution and significance of the reduced alphabet in the GPCR classification model through the classifier's prediction results [34]. By comparing different reduced alphabets, they found that cysteine always tends to be grouped on its own, which is closely related to the formation of disulfide bonds and the maintenance of the spatial structure of GPCRs, and is a key feature for GPCR classification. In 2019, we used an RAA-based k-mer method to predict defensins, small antimicrobial proteins that play an important role in nonspecific cellular immunity [12]. By modeling the K = 2 and K = 3 features of more than 600 reduced alphabets, the best prediction performance was achieved with the “PGEKRQDSNTHCIVW-YF-ALM” scheme at K = 2, and excellent results were also obtained for defensins from different species and families.

In addition, many researchers are working on building and popularizing RAA alphabet platforms, which are summarized in Table 1. In 2007, Shimizu proposed POODLE-S, a protein disorder prediction platform based on amino acid physicochemical properties and position-specific scoring matrices, which has been widely noted and cited [56]. In 2017, our group built PseKRAAC, an RAA platform based on pseudo-amino acid composition and k-mers that integrates 16 amino acid sequence reduction schemes and makes the approach accessible to researchers outside bioinformatics [57]. In 2019, Xi et al. proposed RaaMLab, a mapping tool platform based on the RAA method, which organizes a large database of amino acid physicochemical properties and supports user-defined reduced alphabets [58]. In recent years, we have successively constructed iDEF-PseRAAC, RaacLogo, RaacBook, OGFE-RAAC, and other protein analysis and prediction platforms based on RAA alphabets, which have broadened the application of RAA alphabets and emphasized the important role of simplified amino acid composition in the sequence-structure-function relationship (Fig. 1D and Table 1) [12], [13], [14], [15], [59].
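How a reduced alphabet shrinks k-mer features can be shown with a short sketch. With c clusters there are c^K possible k-mers, so the three-cluster scheme quoted in the text at K = 2 gives 3^2 = 9 features instead of 20^2 = 400. The code below is a minimal illustration of plain k-mer composition and does not reproduce the pseudo-amino acid or gapped variants offered by the platforms discussed here; the scheme is written with isoleucine (I) in the first cluster so that all 20 residues are covered, and the function name and example sequence are hypothetical.

```python
from itertools import product

# Three-cluster scheme quoted in the text; each cluster is represented by its first letter.
SCHEME = "PGEKRQDSNTHCIVW-YF-ALM"


def reduced_kmer_features(sequence: str, scheme: str, k: int) -> dict[str, float]:
    """Relative frequency of every possible reduced k-mer in a sequence."""
    clusters = scheme.split("-")
    mapping = {aa: cluster[0] for cluster in clusters for aa in cluster}
    letters = [cluster[0] for cluster in clusters]
    # Assumption: residues that are not in the scheme are skipped.
    reduced = "".join(mapping[aa] for aa in sequence.upper() if aa in mapping)
    # One feature per possible k-mer, so the vector has len(clusters) ** k entries.
    counts = {"".join(p): 0 for p in product(letters, repeat=k)}
    windows = len(reduced) - k + 1
    for i in range(windows):
        counts[reduced[i:i + k]] += 1
    total = max(windows, 1)  # avoid division by zero for very short sequences
    return {kmer: n / total for kmer, n in counts.items()}


features = reduced_kmer_features("ACYFLMKW", SCHEME, 2)  # hypothetical sequence
print(len(features))  # 9 features, compared with 400 for the 20-letter alphabet
print(features)
```

A fixed-length vector of this kind can be passed to any standard classifier, which is one way to read the statement that RAA alphabets act as an upstream step compatible with existing machine learning methods.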

Table 1

RAA Webserver platform summary.

Webserver Name	Link	Cite
PseKRAAC	http://bigdata.imu.edu.cn/	[57]
RAACBook	http://bioinfor.imu.edu.cn/raacbook	[14]
RaacLogo	http://bioinfor.imu.edu.cn/raaclogo	[59]
iSP-RAAC	http://bioinfor.imu.edu.cn/ispraac/public	[60]
iDEF-PseRAAC	http://bioinfor.imu.edu.cn/idpf	[12]
iHEC-RAAC	http://bioinfor.imu.edu.cn/ihecraac	[13]
POODLE-S	http://mbs.cbrc.jp/poodle/poodle-s.html (Inaccessible)	[56]
RaaMLab	https://github.com/bioinfo0706/RaaMLab	[58]
iHSP-PseRAAAC	http://lin-group.cn/server/iHSP-PseRAAAC	[50]
OGFE-RAAC	http://bioinfor.imu.edu.cn/ogferaac	[15]
iDNA-Prot	http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/	[51]
PROFEAT	http://jing.cz3.nus.edu.sg/cgi-bin/prof/prof.cgi (Inaccessible)	[61]
cnnAlpha	https://github.com/mauricioob/shiny-pred	[40]
iDPF-PseRAAAC	http://wlxy.imu.edu.cn/college/biostation/fuwu/iDPF-PseRAAAC/index.asp (Inaccessible)	[54]


The prediction of protein structure property based on reduced amino acid alphabets

Protein structure is a decisive factor in protein function. Many proteins with unique functions have clearly conserved native structures. For example, GPCRs have seven transmembrane domains, and their structures show clear patterns of solvent accessibility. However, methods for determining protein structure and properties are complicated, and manual analysis is inefficient, which has long troubled the biological community. Traditional identification of protein structural properties requires specialized technicians to work step by step with methods such as X-ray crystallography and nuclear magnetic resonance, which takes a long time. After the rise of bioinformatics, researchers used earlier experimental data to analyze structural regularities with machine learning, and tried to predict protein structural properties such as intrinsic disorder, solvent accessibility, and contact number. In 2004, Weathers et al. used hydrophilicity and hydrophobicity as the reduction rule for classifying and predicting intrinsically disordered proteins, and pointed out that hydrophobic amino acids play a central role in stabilizing folded proteins [36]. In 2006, Melo et al. proposed using RAA alphabets to improve the accuracy of sequence alignment and protein fold assessment [47]. They developed a new genetic algorithm to obtain a five-cluster reduction scheme based entirely on structural information, and suggested that the five-cluster model also performs well in evaluating protein folds. In 2009, Bacardit et al. proposed a method for predicting protein contact number and solvent accessibility based on a mutual-information reduced alphabet, and emphasized that the reduction preserved the physicochemical properties of amino acid residues well and improved accuracy [37]. In 2020, Oberti et al. used a convolutional neural network based on simplified amino acid composition to predict the intrinsically disordered regions of proteins, and proposed that RAA alphabets help convolution recognize complex patterns in sequences [40].

In recent years, the AlphaFold series created by Google DeepMind has raised the accuracy and efficiency of protein structure prediction to a new level with a powerful artificial neural network architecture. Supported by the AlphaFold structure database, many analyses and predictions of protein structural properties continue to emerge. Recently, RaacFold, a protein structure analysis platform based on RAA alphabets, has been constructed. It combines RAA alphabets with the structure database predicted by AlphaFold2 and with earlier protein structure databases, providing users with a convenient service for analyzing protein structure and properties under different RAA alphabets [62]. The 3D rendering of reduced structural properties provided by RaacFold has enriched the application of RAA alphabets in the analysis of protein sequence and structural properties.

A comprehensive analysis of the 672 reduced amino acid alphabets

In recent years, we have collected a large number of RAA alphabets and achieved many excellent results in protein functional classification using them. Based on this work, 672 RAA alphabets from 74 reduction methods have been compiled, and the source and reduction method of each reduced alphabet have been annotated in detail (please refer to the supplementary file for the full data). According to their underlying principles, we group the 74 reduction methods into 6 types: Clustering Algorithm, Mutation Matrix, Computer Method, Physical and Chemical Method, Information Theory, and Statistical Analysis (Fig. 2B and Table 2). Clustering Algorithm and Mutation Matrix are widely used in RAA research, accounting for more than half of the papers published in the past 20 years. Many RAA alphabets are still in use today (Fig. 2A).

Fig. 2

Statistics of 672 RAA alphabets in 74 reduction methods. A: The 74 reduction methods are divided into 6 categories according to their principles and arranged on a timeline. B: The 672 RAA alphabets are divided into 6 categories according to their principles, corresponding to Type. C: RAA alphabets with different numbers of clusters (Size 2–19) in the 74 reduction methods. D: All RAA alphabets contained in the 74 reduction methods, clustered by application scenario; the shade of color indicates the number of reduced clusters. See the attachment for the full content.

Table 2

The 6 reduction categories of 74 reduction methods.

Categories	Reduction methods	RAA alphabets	Cite
Clustering Algorithm	24	259	[55], [63], [64], [65], [66], [67], [68], [69], [70], [71]
Mutation Matrix	20	239	[22], [30], [31], [32], [51], [72], [73], [74], [75], [76], [77], [78], [79], [80]
Computer Method	12	60	[34], [37], [47], [78], [81], [82]
Physical and Chemical Method	12	52	[36], [37], [48], [61], [78], [83], [84], [85], [86], [87], [88], [89]
Information Theory	3	32	[90], [91], [92]
Statistical Analysis	3	30	[63], [78], [93]

We counted the 672 RAA alphabets and the reduced sizes they cover (Fig. 2C), and classified them into seven categories according to the application scenario of each reduced alphabet: protein folding, building reduced alphabets, functional classification, secondary structure prediction, sequence alignment, structure prediction, and protein interaction (Fig. 2D). Among all alphabets, Size 2 to Size 5 account for the largest proportion, which is related to the influence of the early results of Wang et al. on a large number of RAA studies (Fig. 2C) [22]. However, as research developed, many studies pointed out that alphabets that are too small easily lose a large amount of sequence information. Reduced alphabets of Size 10 and above perform better for most tasks while retaining the protein information [30], [75]. Of the 672 RAA alphabets, nearly half have only been created and have not been used in specific research. Most of the rest are devoted to protein alignment, folding, and functional structure prediction, laying a solid foundation for diversified protein analysis. The combined frequencies of all words showed that five words, “ST”, “FY”, “RK”, “DE”, and “IV”, occur most frequently (over 40 times) across the alphabets (Fig. 3). This suggests that these five groupings are recognized by many researchers because the paired residues have similar properties in many cases. For example, Wang's article points out that “DE” (Asp and Glu) can be reduced to one class by the MJ matrix and contact potential, which is verified in Yu's article by a multi-species classification model, and the same reduction is obtained in Mirny's article from structurally derived substitution matrices [22], [84], [86].
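A word-frequency analysis of this kind can be sketched as follows. The snippet assumes that a "word" means a pair of residues placed in the same cluster, which is one plausible reading and not necessarily the exact counting rule used for Fig. 3. As input it uses only the two alphabets quoted in this article (the two-cluster alphabet of Fig. 1 and the three-cluster defensin scheme, written with isoleucine, I, in the first cluster so that all 20 residues are covered); the full analysis would run over the 672 alphabets in the supplementary file. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Only the two alphabets quoted in this article; a stand-in for the full collection.
ALPHABETS = [
    "AMWLYCFIV-PGHTSDEKNQR",
    "PGEKRQDSNTHCIVW-YF-ALM",
]


def pair_counts(alphabets):
    """Count, for every residue pair, the number of alphabets that put both in one cluster."""
    counts = Counter()
    for alphabet in alphabets:
        for cluster in alphabet.split("-"):
            # Sort the residues so that each pair has a single canonical key, e.g. "KR".
            for a, b in combinations(sorted(cluster), 2):
                counts[a + b] += 1
    return counts


counts = pair_counts(ALPHABETS)
for word in ("ST", "FY", "RK", "DE", "IV"):
    key = "".join(sorted(word))
    print(word, counts[key])
```

Even in these two coarse alphabets, all five pairs stay together, consistent with the observation that residues with similar side chains are rarely separated by a reduction method.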

Fig. 3

Five high-frequency words and their structures. A: The word ST, composed of serine and threonine, is contained in 18 reduction methods and occurs 43 times; both R groups contain a polar hydroxyl group (-OH). B: The word FY, composed of phenylalanine and tyrosine, is contained in 23 reduction methods and occurs 77 times; both R groups contain a benzene ring. C: The word RK, composed of arginine and lysine, is contained in 15 reduction methods and occurs 66 times; both R groups contain amino groups. D: The word IV, composed of isoleucine and valine, is contained in 20 reduction methods and occurs 94 times; both have nonpolar R groups. E: The word DE, composed of aspartic acid and glutamic acid, is contained in 23 reduction methods and occurs 77 times; both R groups contain a carboxyl group.

Conclusion

Research on the structure and function of proteins has been accelerating, and methods and tools that had gathered dust for many years have gradually shown their strengths. Protein analysis and prediction methods based on machine learning improve analytical efficiency, achieve higher precision, and address deeper biological problems. As an important part of protein feature engineering, the reduction of amino acid alphabets redefines sequence and structure. It is highly compatible, so it can serve as an upstream processing step for almost all existing methods, and it provides models with richer biological prior knowledge, which strengthens the biological grounding of traditional computational models and is expected to help decipher proteins with complex structures. In addition, it offers better solutions to the cumbersomeness and dimension explosion of current machine learning and artificial intelligence methods, and its lower computational load makes it more suitable for deployment on small and medium-sized computers and servers. However, no recognized system of research results and evaluation criteria for RAA alphabets has yet formed, and RAA alphabets have not been fully and maturely used in current research. With the joint efforts of researchers, simplified amino acid composition still has room for optimization, and technologies and platforms based on RAA alphabets may create greater and more far-reaching value in the future.

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 62171241, 62061034, 61861036), the Key Technology Research Program of Inner Mongolia Autonomous Region (2021GG0398), and the Science and Technology Major Project of Inner Mongolia Autonomous Region of China to the State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock (2019ZD031).

CRediT authorship contribution statement

Yuchao Liang: Writing - original draft, Investigation, Formal analysis. Siqi Yang: Investigation, Writing - review & editing. Lei Zheng: Software. Hao Wang: Writing - review & editing. Jian Zhou: Writing - review & editing. Shenghui Huang: Software, Investigation. Lei Yang: Writing - review & editing. Yongchun Zuo: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

88 in total

1. Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses.

Authors: Zu-Guo Yu; Vo Anh; Ka-Sing Lau
Journal: J Theor Biol Date: 2004-02-07 Impact factor: 2.691

2. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

3. Solving the protein sequence metric problem.

Authors: William R Atchley; Jieping Zhao; Andrew D Fernandes; Tanja Drüke
Journal: Proc Natl Acad Sci U S A Date: 2005-04-25 Impact factor: 11.205

4. Grouping of amino acid types and extraction of amino acid properties from multiple sequence alignments using variance maximization.

Authors: James O Wrabl; Nick V Grishin
Journal: Proteins Date: 2005-11-15

5. Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids.

Authors: Jing Li; Wei Wang
Journal: Sci China C Life Sci Date: 2007-06

10. iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition.

Authors: Bin Liu; Jinghao Xu; Xun Lan; Ruifeng Xu; Jiyun Zhou; Xiaolong Wang; Kuo-Chen Chou
Journal: PLoS One Date: 2014-09-03 Impact factor: 3.240