Yuchao Liang, Siqi Yang, Lei Zheng, Hao Wang, Jian Zhou, Shenghui Huang, Lei Yang, Yongchun Zuo.
Abstract
Proteins are the executors of cellular physiological activities, and accurate elucidation of their structure and function is crucial for the refined mapping of proteins. As a feature engineering method, the reduction of amino acid composition is not only an important tool for protein structure and function analysis but also opens a broad horizon for machine learning in this complex field. Representing sequences with fewer amino acid types greatly reduces the dimensionality and noise of traditional feature engineering and yields more interpretable predictive models in which machine learning can capture key features. In this paper, we systematically reviewed the strategies and methods behind reduced amino acid (RAA) alphabets and summarized their main applications in protein sequence alignment, functional classification, and prediction of structural properties. Finally, we gave a comprehensive analysis of 672 RAA alphabets from 74 reduction methods.
Keywords: Machine learning; Protein classification; Reduced amino acid alphabets; Sequence alignment; Structure analysis
Year: 2022 PMID: 35860409 PMCID: PMC9284397 DOI: 10.1016/j.csbj.2022.07.001
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. A two-cluster reduced amino acid alphabet (AMWLYCFIV-PGHTSDEKNQR) can be represented in a protein sequence by the letters A and P. A: Sankey diagram of the RAA alphabet; the colors on the left represent the 20 amino acids, and the right side shows them being gradually merged into two clusters. B: RAA alphabets of different cluster sizes under the same reduction method (sizes 2–11). C: Alignment of the original sequence and the RAA sequence. D: Application of the RAA alphabet in WebLogo. Left, middle, and right show the original WebLogo, the RaacLogo labeled by the first letter of each cluster, and the RaacLogo colored by cluster, respectively.
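The mapping described in the Fig. 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the example sequence, and the choice to mark non-standard residues with `X` are all assumptions introduced here. Only the two-cluster alphabet string comes from the caption.

```python
from collections import Counter

# Two-cluster alphabet quoted in the Fig. 1 caption; clusters are separated by "-".
RAA_ALPHABET = "AMWLYCFIV-PGHTSDEKNQR"


def build_mapping(alphabet):
    """Map every residue to the first letter of the cluster it belongs to."""
    mapping = {}
    for cluster in alphabet.split("-"):
        for residue in cluster:
            mapping[residue] = cluster[0]
    return mapping


def reduce_sequence(sequence, mapping, unknown="X"):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the alphabet become `unknown`
    (a choice made for this sketch, not taken from the source).
    """
    return "".join(mapping.get(residue, unknown) for residue in sequence.upper())


def composition(reduced, symbols):
    """Fraction of each reduced symbol: a len(symbols)-dimensional feature vector."""
    counts = Counter(reduced)
    total = len(reduced) or 1  # avoid division by zero on an empty sequence
    return [counts[symbol] / total for symbol in symbols]


mapping = build_mapping(RAA_ALPHABET)
example = "MKTAYIAKQR"  # hypothetical sequence, for illustration only
reduced = reduce_sequence(example, mapping)
features = composition(reduced, ["A", "P"])
print(reduced, features)  # APPAAAAPPP [0.5, 0.5]
```

With this alphabet the composition vector has 2 dimensions instead of 20, which is the dimensionality reduction the abstract refers to.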
RAA Webserver platform summary.
| Webserver Name | Link | Cite |
|---|---|---|
| PseKRAAC | ||
| RAACBook | ||
| RaacLogo | ||
| iSP-RAAC | ||
| iDEF-PseRAAC | ||
| iHEC-RAAC | ||
| POODLE-S | ||
| RaaMLab | ||
| iHSP-PseRAAAC | ||
| OGFE-RAAC | ||
| iDNA-Prot | ||
| PROFEAT | ||
| cnnAlpha | ||
| iDPF-PseRAAAC | ||
Fig. 2. Statistics of 672 RAA alphabets from 74 reduction methods. A: The 74 reduction methods, divided into 6 categories according to their underlying principles and arranged on a timeline. B: The 672 RAA alphabets, divided into 6 categories according to the same principles and mapped to Type. C: RAA alphabets of different cluster sizes (sizes 2–19) across the 74 reduction methods. D: All RAA alphabets contained in the 74 reduction methods, grouped by application scenario; the shade of color indicates the number of reduced clusters. See the attachment for the full content.
The 6 reduction categories of 74 reduction methods.
| Categories | Reduction methods | RAA alphabets | Cite |
|---|---|---|---|
| Clustering Algorithm | 24 | 259 | |
| Mutation Matrix | 20 | 239 | |
| Computer Method | 12 | 60 | |
| Physical and Chemical Method | 12 | 52 | |
| Information Theory | 3 | 32 | |
| Statistical Analysis | 3 | 30 | |
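As a quick consistency check, the two numeric columns of this table can be summed and compared with the totals stated in the abstract. The sketch below reads the first numeric column as the number of reduction methods per category and the second as the number of RAA alphabets per category; that reading is an interpretation, supported by the fact that the columns add up to 74 and 672. The variable names are illustrative.

```python
# (category, reduction methods, RAA alphabets), copied from the table above.
CATEGORIES = [
    ("Clustering Algorithm", 24, 259),
    ("Mutation Matrix", 20, 239),
    ("Computer Method", 12, 60),
    ("Physical and Chemical Method", 12, 52),
    ("Information Theory", 3, 32),
    ("Statistical Analysis", 3, 30),
]

total_methods = sum(methods for _, methods, _ in CATEGORIES)
total_alphabets = sum(alphabets for _, _, alphabets in CATEGORIES)
print(total_methods, total_alphabets)  # 74 672

# Share of all RAA alphabets contributed by each category.
for name, methods, alphabets in CATEGORIES:
    print(f"{name}: {methods} methods, {alphabets / total_alphabets:.1%} of alphabets")
```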
Fig. 3. Five high-frequency words and their structures. A: The word ST, composed of serine and threonine, is contained in 18 reduction methods and occurs 43 times; both R groups carry a polar hydroxyl (-OH) group. B: The word FY, composed of phenylalanine and tyrosine, is contained in 23 reduction methods and occurs 77 times; both R groups contain a phenyl ring. C: The word RK, composed of arginine and lysine, is contained in 15 reduction methods and occurs 66 times; both R groups contain amino groups. D: The word IV, composed of isoleucine and valine, is contained in 20 reduction methods and occurs 94 times; both have nonpolar R groups. E: The word DE, composed of aspartic acid and glutamic acid, is contained in 23 reduction methods and occurs 77 times; both R groups contain a carboxyl group.
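The caption reports two counts per word: how many reduction methods contain it and how many times it occurs overall. The sketch below shows one way such counts could be computed. It assumes that a "word" is a cluster consisting of exactly those residues, which this excerpt does not define precisely, so treat that as a guess. Apart from the first alphabet, which is the one quoted in the Fig. 1 caption, the alphabets and the method names are toy inputs invented for illustration and are not data from the paper.

```python
def count_word(word, methods):
    """Return (number of methods containing the word, total occurrences).

    A word is counted when some cluster contains exactly the residues
    in `word`, in any order (an assumption made for this sketch).
    """
    target = frozenset(word)
    n_methods = 0
    n_occurrences = 0
    for alphabets in methods.values():
        hits = sum(
            1
            for alphabet in alphabets
            for cluster in alphabet.split("-")
            if frozenset(cluster) == target
        )
        if hits:
            n_methods += 1
            n_occurrences += hits
    return n_methods, n_occurrences


# Toy input: hypothetical method names, each with a list of RAA alphabets.
TOY_METHODS = {
    "method_a": [
        "AMWLYCFIV-PGHTSDEKNQR",  # alphabet quoted in the Fig. 1 caption
        "AMLCIV-WFY-PGST-HNQ-DE-RK",  # invented for illustration
    ],
    "method_b": [
        "AGPST-CILMV-FWY-DE-HKR-NQ",  # invented for illustration
        "AG-PST-C-ILMV-FY-W-DE-H-KR-NQ",  # invented for illustration
    ],
}

for word in ("ST", "FY", "RK", "DE"):
    print(word, count_word(word, TOY_METHODS))
```

On the toy input, DE is found in both methods and occurs three times, which shows why the two counts in the caption can differ: one method may contribute several alphabets that contain the same word.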