Lei Zheng, Shenghui Huang, Nengjiang Mu, Haoyue Zhang, Jiayu Zhang, Yu Chang, Lei Yang, Yongchun Zuo.
Abstract
Reducing the amino acid alphabet can significantly simplify protein complexity, which could improve computational efficiency, decrease information redundancy and reduce the chance of overfitting. Although some reduced alphabets have been proposed, different classification rules can produce distinct results in protein sequence analysis. A systematic framework for reduced alphabets is therefore urgently needed. In this work, we constructed a comprehensive web server called RAACBook for protein sequence analysis and machine learning applications by integrating reduced alphabets. The web server contains three parts: (i) 74 types of reduced amino acid alphabet were manually extracted to generate 673 reduced amino acid clusters (RAACs) for dealing with specific protein problems. Users can easily select the desired RAACs from a multilayer browser tool. (ii) An online tool was developed to analyze the primary sequence of a protein. The tool can produce the K-tuple reduced amino acid composition from three user-defined correlation parameters (K-tuple, g-gap, λ-correlation). The results are visualized as a sequence alignment, the merging of RAA composition, the feature distribution and a logo of the reduced sequence. (iii) A machine learning server is provided to train protein classification models based on K-tuple RAAC. The optimal model can be selected according to the evaluation indexes (ROC, AUC, MCC, etc.). In conclusion, RAACBook provides a powerful and user-friendly service for protein sequence analysis and computational proteomics. RAACBook is freely available at http://bioinfor.imu.edu.cn/raacbook. Database URL: http://bioinfor.imu.edu.cn/raacbook.
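As a rough illustration of what part (ii) of the abstract describes, the following minimal Python sketch maps a sequence onto a reduced alphabet and computes a K-tuple composition with a gap parameter. It is not RAACBook's implementation. The clustering `EXAMPLE_CLUSTERS` and the function names are hypothetical, and the reading of g-gap as "g skipped residues between adjacent positions of a tuple" is an assumption. The λ-correlation parameter is left out because the abstract does not define it.

```python
from collections import Counter
from itertools import product

# Illustrative clustering only (an assumption); it is NOT one of the
# 673 clusters curated in RAACBook.
EXAMPLE_CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]


def reduce_sequence(seq, clusters):
    """Replace each residue by the first letter of its cluster.

    Residues that belong to no cluster are written as 'X'.
    """
    mapping = {aa: cluster[0] for cluster in clusters for aa in cluster}
    return "".join(mapping.get(aa, "X") for aa in seq.upper())


def ktuple_composition(reduced, clusters, k=2, gap=0):
    """Return the normalized frequency of every possible K-tuple.

    Assumption: `gap` residues are skipped between adjacent tuple positions.
    Tuples containing the unknown symbol 'X' are ignored.
    """
    letters = [cluster[0] for cluster in clusters]
    step = gap + 1
    span = (k - 1) * step + 1  # sequence window covered by one tuple
    counts = Counter()
    for i in range(len(reduced) - span + 1):
        tup = reduced[i:i + span:step]
        if "X" not in tup:
            counts[tup] += 1
    total = sum(counts.values())
    all_tuples = ("".join(p) for p in product(letters, repeat=k))
    return {t: (counts[t] / total if total else 0.0) for t in all_tuples}


if __name__ == "__main__":
    reduced = reduce_sequence("MKVDE", EXAMPLE_CLUSTERS)
    features = ktuple_composition(reduced, EXAMPLE_CLUSTERS, k=2, gap=0)
    print(reduced, len(features))
```

With six clusters and K = 2 the feature vector has 6² = 36 entries instead of the 20² = 400 needed for the natural alphabet, which is the kind of dimensionality reduction the abstract refers to.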
Year: 2019 PMID: 31802128 PMCID: PMC6893003 DOI: 10.1093/database/baz131
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. A schematic view of the protein 5TCD from the PDB with its secondary structures. Subfigure (A) shows the three-dimensional structure of the protein. Each secondary structural element is marked with a different label. Subfigure (B) shows the corresponding chain view, where the gray background marks the portion of the reduced amino acid sequence that matches the protein secondary structural elements.
Figure 2. The framework of RAACBook. Block diagrams show the modules and functions of RAACBook. Input data are on the left and output data on the right. The data in the manually curated database were collected from PubMed using a keyword filter. Users can submit protein sequences on the webpage to generate reduced sequence vector files and visualizations. Users can also upload protein sequence datasets as input to obtain the corresponding classifier model and its evaluation.
Figure 3. RAACBook analysis and machine learning workflow. Subfigure (A): the workflow for the reduction analysis of a natural amino acid sequence. Settings panel: primary sequences are uploaded in FASTA format (Step 1) and the alphabet types of interest are selected as input (Step 2). If only these parameters are submitted, the server generates reduced sequence files (Step 3). If the aim is to produce sequence feature files for machine learning, users need to select three parameters (Step 4) and submit (Step 5). Analysis panel: three files are available for download (Step 6). The reduced amino acid sequences are visualized by clicking the ‘Visualization’ button (Step 7). Sequence visualization: three charts are shown: the alignment between the natural and reduced amino acid sequences (Step 8), the merging of natural amino acid composition (Step 9) and the distribution of amino acid composition (Step 10). Feature visualization (Steps 11–13): according to the chosen reduced alphabets and parameters, the service generates a heat map of the K-tuple reduced amino acid composition of multiple sequences and the distribution of peptides in a single reduced sequence. Reduced logo visualization (Steps 14–15): the figure represents the amino acid information at each position of the protein sequence based on the reduced alphabet. Subfigure (B): the machine learning workflow, in which a classifier model is obtained by uploading datasets and setting parameters. Settings panel (Steps 1–3): FASTA files containing the positive and negative datasets are uploaded (Step 1); K-tuple, the alphabet type and the machine learning algorithm are selected (Step 2); the machine learning service is then executed (Step 3). Model evaluation: a chart of Sp, Sn, Acc and MCC and a diagram of the ROC curve are generated (Steps 4 and 5), and the classifier model and vector files can be downloaded (Steps 6–8).
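The evaluation indexes named in the caption (Sn, Sp, Acc, MCC) are usually derived from the confusion matrix of a binary classifier. The sketch below uses the standard textbook definitions; the source does not state how RAACBook computes them, so the function name `evaluation_indexes` and the zero-denominator handling are assumptions made for illustration.

```python
import math


def evaluation_indexes(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts.

    tp/fn count positive samples predicted correctly/incorrectly,
    tn/fp count negative samples predicted correctly/incorrectly.
    Undefined ratios (zero denominator) are returned as 0.0 (an assumption).
    """
    total = tp + tn + fp + fn
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    acc = (tp + tn) / total if total else 0.0   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    for name, value in evaluation_indexes(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Unlike Acc, MCC stays informative when the positive and negative datasets differ greatly in size, which is one reason it is commonly reported alongside Sn and Sp when comparing models.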