PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence.

Z R Li¹, H H Lin, L Y Han, L Jiang, X Chen, Y Z Chen.

Abstract

Sequence-derived structural and physicochemical features have frequently been used in the development of statistical learning models for predicting proteins and peptides of different structural, functional and interaction profiles. PROFEAT (Protein Features) is a web server for computing commonly used structural and physicochemical features of proteins and peptides from amino acid sequence. It computes six feature groups composed of ten features that include 51 descriptors and 1447 descriptor values. The computed features include amino acid composition, dipeptide composition, normalized Moreau–Broto autocorrelation, Moran autocorrelation, Geary autocorrelation, sequence-order-coupling number, quasi-sequence-order descriptors and the composition, transition and distribution of various structural and physicochemical properties. In addition, it can compute the above autocorrelation descriptors based on user-defined properties. Our computational algorithms were extensively tested, and the computed protein features have been used in a number of published works for predicting proteins of functional classes, protein–protein interactions and MHC-binding peptides. PROFEAT is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/prof/prof.cgi.

Year: 2006 PMID: 16845018 PMCID: PMC1538821 DOI: 10.1093/nar/gkl305

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Sequence-derived structural and physicochemical features have frequently been used for predicting protein structural and functional classes (1–5), protein–protein interactions (6–8), subcellular locations (9,10) and peptides of specific properties (11) (J. Cui, L. Y. Han, H. H. Lin, H. L. Zhang, Z. Q. Tang, C. J. Zheng, Z. W. Cao and Y. Z. Chen, manuscript submitted) from their sequence. These features are highly useful for representing and distinguishing proteins or peptides of different structural, functional and interaction profiles, which is essential for the successful application of statistical learning methods in predicting the structural, functional and interaction profiles of proteins and peptides irrespective of sequence similarity (12). While several programs for computing protein structural and physicochemical features have been developed (1,2,6,9–11,13), these are not freely and easily accessible. We introduce PROFEAT, Protein Features, as a freely accessible web-based server for computing the commonly-used structural and physicochemical features of proteins and peptides from amino acid sequence.

WEB SERVER ACCESS

PROFEAT is available at http://jing.cz3.nus.edu.sg/cgi-bin/prof/prof.cgi. The sequence of a protein or peptide, in single-letter code, can be entered in the window provided in either RAW or FASTA format. The RAW format is similar to plain text except that white-space and TAB characters are removed, only alphabetic characters are accepted and anything else is rejected. Multiple sequence entries in FASTA format can also be submitted, which facilitates the export of the generated protein features to machine learning servers. Illustrative examples of submitting a single sequence entry and multiple-sequence files to PROFEAT, and of sending the generated feature vector files to the machine learning server GIST (14), are provided in the online manual at the PROFEAT homepage. An input sequence with fewer than eight amino acids is not accepted, because functional peptides typically contain more than eight amino acids and protein chains are much longer. If an input sequence contains an invalid character or a non-amino-acid letter, or has an abnormal composition such as a long stretch of the same amino acid covering the entire sequence, a message of ‘invalid character …’ or ‘your input sequence is invalid’ is displayed.

The computed features are divided into six groups, each of which has been used separately in protein or peptide studies. Upon submitting a sequence, users are directed to the window shown in Figure 1 to select the feature groups to be displayed and the output file format. Three file formats are provided to support a printer-friendly view and the export of the computed features to computational software or servers such as GIST (14). An index Fi.j.k.l represents the lth descriptor value of the kth descriptor of the jth feature in the ith feature group, and serves as an easy reference to the PROFEAT manual provided on the server homepage.
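
The input checks described above can be approximated locally before a sequence is submitted. The following Python sketch is not the server's code: the function names, the upper-casing of input and the exact homopolymer test are assumptions made for illustration, and only the eight-residue minimum and the two message texts come from the description above.

```python
# Hypothetical local pre-check mirroring the input rules described in the text.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MIN_LENGTH = 8                           # shorter inputs are not accepted


def clean_raw_sequence(text):
    """RAW-style cleaning: drop all white-space (spaces, TABs, newlines).

    Upper-casing is an assumption; the text does not say how case is handled.
    """
    return "".join(text.split()).upper()


def validate_sequence(seq):
    """Return (ok, message) for an already cleaned sequence."""
    for ch in seq:
        if ch not in VALID_AA:
            # Covers digits, punctuation and non-amino-acid letters such as B or Z.
            return False, f"invalid character {ch!r}"
    if len(seq) < MIN_LENGTH:
        return False, "your input sequence is invalid (fewer than 8 residues)"
    if len(set(seq)) == 1:
        # Simplest case of "abnormal composition": one residue type throughout.
        return False, "your input sequence is invalid (single repeated residue)"
    return True, "ok"


if __name__ == "__main__":
    raw = "mkt ayia\tkqr\n"
    print(validate_sequence(clean_raw_sequence(raw)))
```

The server may apply additional composition checks that this sketch does not attempt to reproduce.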

Figure 1

PROFEAT feature-display options window

MATERIALS AND METHODS

As shown in Table 1, PROFEAT computes 10 sets of commonly used structural and physicochemical features, comprising 51 descriptors and 1447 descriptor values. These features can be divided into six groups, each of which has been used as an independent set of features for predicting proteins and peptides of various profiles by using statistical learning methods. The first group includes two features, amino acid composition and dipeptide composition, with 2 descriptors and 420 descriptor values (15–20). Each of the second, third and fourth groups contains a different autocorrelation feature: normalized Moreau–Broto autocorrelation (21,22), Moran autocorrelation (23) and Geary autocorrelation (24). Each of these features has 8 descriptors and 240 descriptor values. The fifth group consists of three feature sets, composition, transition and distribution, with a total of 21 descriptors and 147 descriptor values (2–6,8,25,26) (J. Cui, L. Y. Han, H. H. Lin, H. L. Zhang, Z. Q. Tang, C. J. Zheng, Z. W. Cao and Y. Z. Chen, manuscript submitted). The sixth group contains two sequence-order feature sets (9–11,27): the sequence-order-coupling number, with 2 descriptors and 60 descriptor values, and the quasi-sequence-order descriptors, with 2 descriptors and 100 descriptor values. Apart from these descriptors, PROFEAT can also compute the above autocorrelation descriptors based on user-defined properties. References to the studies that used each of these features are given where each feature is described below.

Table 1

List of structural and physicochemical features of proteins and peptides commonly-used for predicting proteins and peptides of specific properties by using statistical learning methods

Feature group	Feature	Feature index	No. of descriptors	No. of descriptor values
Amino acid, dipeptide composition	Amino acid composition	F1.1	1	20
	Dipeptide composition	F1.2	1	400
Autocorrelation 1	Normalized Moreau–Broto autocorrelation	F2.1	8	240
Autocorrelation 2	Moran autocorrelation	F3.1	8	240
Autocorrelation 3	Geary autocorrelation	F4.1	8	240
Composition, transition, distribution	Composition	F5.1	7	21
	Transition	F5.2	7	21
	Distribution	F5.3	7	105
Sequence-order	Sequence-order-coupling number	F6.1	2	60
	Quasi-sequence-order descriptors	F6.2	2	100
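
As a quick consistency check, the rows of Table 1 can be tallied to confirm the totals of 51 descriptors and 1447 descriptor values, and the Fi.j.k.l index can be formatted from its four components. This is an illustrative Python sketch; the names `TABLE_1`, `totals`, `group_totals` and `descriptor_value_index` are hypothetical and not part of PROFEAT.

```python
# Rows of Table 1: (feature index, number of descriptors, number of descriptor values)
TABLE_1 = [
    ("F1.1", 1, 20),    # amino acid composition
    ("F1.2", 1, 400),   # dipeptide composition
    ("F2.1", 8, 240),   # normalized Moreau-Broto autocorrelation
    ("F3.1", 8, 240),   # Moran autocorrelation
    ("F4.1", 8, 240),   # Geary autocorrelation
    ("F5.1", 7, 21),    # composition
    ("F5.2", 7, 21),    # transition
    ("F5.3", 7, 105),   # distribution
    ("F6.1", 2, 60),    # sequence-order-coupling number
    ("F6.2", 2, 100),   # quasi-sequence-order descriptors
]


def totals(rows):
    """Sum descriptors and descriptor values over all rows."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)


def group_totals(rows):
    """Per-group sums keyed by the group number i in Fi.j."""
    out = {}
    for index, n_desc, n_val in rows:
        group = int(index[1:].split(".")[0])
        d, v = out.get(group, (0, 0))
        out[group] = (d + n_desc, v + n_val)
    return out


def descriptor_value_index(i, j, k, l):
    """Format the index of the l-th value of the k-th descriptor of feature j in group i."""
    return f"F{i}.{j}.{k}.{l}"


if __name__ == "__main__":
    print(totals(TABLE_1))
    print(group_totals(TABLE_1))
    print(descriptor_value_index(5, 3, 1, 4))
```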

Amino acid and dipeptide composition are simple descriptors of protein sequence features (15), which have been used for predicting protein fold and structural classes (19,20), functional classes (16) and subcellular locations (17,18) at accuracy levels of 72–95%, 83–97% and 79–91%, respectively. Amino acid composition is the fraction of each amino acid type in a sequence:

$$f(r) = \frac{N_r}{N},$$

where r = 1, 2, 3, …, 20, $N_r$ is the number of amino acids of type r, and N is the length of the sequence. A total of 20 descriptor values are computed for the 20 types of amino acids. Dipeptide composition is defined as:

$$fr(r,s) = \frac{N_{rs}}{N-1},$$

where r, s = 1, 2, 3, …, 20, and $N_{rs}$ is the number of dipeptides composed of amino acid types r and s (16). A total of 400 descriptor values are computed for the 20 × 20 amino acid combinations.

Autocorrelation features describe the level of correlation between two objects (protein or peptide sequences) in terms of a specific structural or physicochemical property (28), and are defined based on the distribution of amino acid properties along the sequence (29). Eight amino acid properties are used for deriving these autocorrelation descriptors. The first is the hydrophobicity scale derived from the bulk hydrophobic character of the 20 types of amino acids in 60 protein structures (30). The second is the average flexibility index derived from the statistical average of the B-factors of each type of amino acid in the available protein X-ray crystallographic structures (31). The third is the polarizability parameter computed from the group molar refractivity values originally provided by Hansch et al. (32). The fourth is the free energy of amino acid solution in water measured by Hutchins (32). The fifth is the residue accessible surface area taken from average values from folded proteins (33). The sixth is the amino acid residue volume measured by Fisher (34). The seventh is the steric parameter derived from the van der Waals radii of amino acid side-chain atoms (35).
The eighth is the relative mutability obtained by multiplying the number of observed mutations by the frequency of occurrence of the individual amino acids (36). Each of these properties is centralized and standardized such that

$$P_r' = \frac{P_r - \bar{P}}{\sigma},$$

where $\bar{P}$ is the average of the property over the 20 amino acids, and $\bar{P}$ and $\sigma$ are given by:

$$\bar{P} = \frac{1}{20}\sum_{r=1}^{20} P_r, \qquad \sigma = \sqrt{\frac{1}{20}\sum_{r=1}^{20}\left(P_r - \bar{P}\right)^2}.$$

In the definitions that follow, $P_i$ denotes the standardized property value of the residue at sequence position i.

Three different autocorrelation features are computed, each having 8 descriptors and 240 descriptor values. The first is Moreau–Broto autocorrelation (28,37), which has been used for predicting transmembrane protein types (21) and protein secondary structural contents (22) at accuracy levels of 82–94% and 91–94%, respectively. It is defined as:

$$AC(d) = \sum_{i=1}^{N-d} P_i P_{i+d},$$

where d is the lag of the autocorrelation, and $P_i$ and $P_{i+d}$ are the amino acid property values at positions i and i+d, respectively. The normalized Moreau–Broto autocorrelation is defined as:

$$ATS(d) = \frac{AC(d)}{N-d},$$

where d = 1, 2, 3, …, 30.

The second is Moran autocorrelation (38), which has been applied for predicting protein helix contents at an accuracy level of 85% (23), and it is defined as:

$$I(d) = \frac{\dfrac{1}{N-d}\sum_{i=1}^{N-d}\left(P_i - \bar{P}\right)\left(P_{i+d} - \bar{P}\right)}{\dfrac{1}{N}\sum_{i=1}^{N}\left(P_i - \bar{P}\right)^2},$$

where d, $P_i$ and $P_{i+d}$ are defined above, and $\bar{P}$ is here the average of $P_i$ over the sequence, i.e. $\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i$. This algorithm differs from that of Moreau–Broto autocorrelation in using property deviations from the average value, instead of the property values themselves, as the basis for measuring correlations.

The third feature is Geary autocorrelation (39), which has been used for analyzing allele frequencies and population structures (24), and it is defined as:

$$C(d) = \frac{\dfrac{1}{2(N-d)}\sum_{i=1}^{N-d}\left(P_i - P_{i+d}\right)^2}{\dfrac{1}{N-1}\sum_{i=1}^{N}\left(P_i - \bar{P}\right)^2},$$

where d, $\bar{P}$, $P_i$ and $P_{i+d}$ are defined above. This algorithm differs from the other two in using the squared difference of property values, instead of the product of property values or deviations, as the basis for measuring correlations.
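
The composition and autocorrelation descriptors described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PROFEAT's implementation: all function names are hypothetical, the formulas follow the standard definitions of these descriptors, and the property scale in the usage example is a placeholder rather than any published scale.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Fraction of each amino acid type: N_r / N (20 values)."""
    counts = Counter(seq)
    return {r: counts[r] / len(seq) for r in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each ordered dipeptide: N_rs / (N - 1) (400 values)."""
    n = len(seq)
    counts = Counter(seq[i:i + 2] for i in range(n - 1))
    return {r + s: counts[r + s] / (n - 1) for r in AMINO_ACIDS for s in AMINO_ACIDS}


def standardize(prop):
    """Centre and scale a property table (amino acid -> value) to zero mean, unit sigma."""
    mean = sum(prop.values()) / len(prop)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in prop.values()) / len(prop))
    return {aa: (v - mean) / sigma for aa, v in prop.items()}


def moreau_broto(p, d):
    """Normalized Moreau-Broto autocorrelation at lag d for property values p."""
    n = len(p)
    return sum(p[i] * p[i + d] for i in range(n - d)) / (n - d)


def moran(p, d):
    """Moran autocorrelation at lag d (undefined if all values in p are equal)."""
    n = len(p)
    mean = sum(p) / n
    num = sum((p[i] - mean) * (p[i + d] - mean) for i in range(n - d)) / (n - d)
    den = sum((x - mean) ** 2 for x in p) / n
    return num / den


def geary(p, d):
    """Geary autocorrelation at lag d (undefined if all values in p are equal)."""
    n = len(p)
    mean = sum(p) / n
    num = sum((p[i] - p[i + d]) ** 2 for i in range(n - d)) / (2 * (n - d))
    den = sum((x - mean) ** 2 for x in p) / (n - 1)
    return num / den


if __name__ == "__main__":
    # Placeholder scale: arbitrary increasing numbers, NOT a published property.
    toy_scale = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
    z = standardize(toy_scale)
    seq = "MKTAYIAKQRQISFVK"
    p = [z[aa] for aa in seq]
    print(aa_composition(seq)["K"], moreau_broto(p, 1), moran(p, 1), geary(p, 1))
```

Each lag d requires a sequence longer than d residues, so computing all 30 lags assumes sequences of more than 30 residues; how the server handles shorter inputs is not stated here.
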
Composition, transition and distribution features represent the amino acid distribution patterns of a specific structural or physicochemical property along a protein or peptide sequence (5,25). They have been used for recognition of protein folds (5) and prediction of protein–protein interactions (6,8), protein functional families (2–4,26) and MHC-binding peptides (J. Cui, L. Y. Han, H. H. Lin, H. L. Zhang, Z. Q. Tang, C. J. Zheng, Z. W. Cao and Y. Z. Chen, manuscript submitted) at accuracy levels of 74–100%, 77–81%, 67–99% and 97–99%, respectively. Seven types of physicochemical properties have been used for computing these features: hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure and solvent accessibility (2,5,25). The descriptors are computed by the following procedure. The sequence of amino acids is first transformed into a sequence of certain structural or physicochemical properties (attributes) of residues. The twenty amino acids are divided into three groups for each of the seven attributes based on the main clusters of the amino acid indices of Tomii and Kanehisa (5,40). Three groups are used rather than two or four because, although two groups would suffice for most attributes, attributes such as charge (positive, negative and neutral) and secondary structure (helix, strand and coil) require at least three. Dividing amino acids into three groups therefore appears to be the more rational choice, as has been done in a number of studies (2–6,8). The ranges of these numerical values and the amino acids belonging to each group are shown in Table 2.
Three descriptors, composition (C), transition (T) and distribution (D), are then computed for a given attribute. They describe, respectively, the global percent composition of each of the three groups of amino acids in a protein, the percent frequencies with which the attribute changes its index along the entire length of the protein, and the distribution pattern of the attribute along the sequence. Computation of these features can be illustrated by using the hydrophobicity attribute as an example. All amino acids are divided into three groups: polar, neutral and hydrophobic. The composition descriptor C consists of three values: the global percent compositions of polar, neutral and hydrophobic residues in the protein. The transition descriptor T also consists of three values: the percent frequency with which a polar residue is followed by a neutral residue or a neutral residue by a polar residue, a polar residue is followed by a hydrophobic residue or a hydrophobic residue by a polar residue, and a neutral residue is followed by a hydrophobic residue or a hydrophobic residue by a neutral residue. The distribution descriptor D consists of five values for each of the three groups: the fractions of the entire sequence at which the first residue of a given group is located and within which 25, 50, 75 and 100% of the residues of that group are contained. There are 3 descriptors and 3(C) + 3(T) + 5 × 3(D) = 21 descriptor values for the hydrophobicity attribute. Consequently, the seven different amino acid attributes produce a total of 7 × 3 = 21 descriptors and 7 × 21 = 147 descriptor values (Table 2).
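
The hydrophobicity example can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, the three hydrophobicity groups are taken from Table 2, and the rule for choosing the residue that marks 25, 50 and 75% of a group (floor of the rank, with a minimum of 1) is an assumption, since the rounding convention is not specified in the text.

```python
import math

# Hydrophobicity groups from Table 2, numbered 1 (polar), 2 (neutral), 3 (hydrophobic).
HYDROPHOBICITY_GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def encode(seq, groups=HYDROPHOBICITY_GROUPS):
    """Replace each residue by the number (1, 2 or 3) of the group it belongs to."""
    lookup = {aa: g for g, members in enumerate(groups.values(), start=1) for aa in members}
    return [lookup[aa] for aa in seq]


def composition(codes):
    """C: percent of residues in each of the three groups."""
    return [100.0 * codes.count(g) / len(codes) for g in (1, 2, 3)]


def transition(codes):
    """T: percent of adjacent pairs that switch between groups 1-2, 1-3 and 2-3."""
    pairs = list(zip(codes, codes[1:]))
    result = []
    for a, b in ((1, 2), (1, 3), (2, 3)):
        hits = sum(1 for pair in pairs if pair in ((a, b), (b, a)))
        result.append(100.0 * hits / len(pairs))
    return result


def distribution(codes):
    """D: for each group, sequence position (in percent of length) of the first
    residue and of the residues marking 25, 50, 75 and 100% of that group."""
    n = len(codes)
    result = []
    for g in (1, 2, 3):
        positions = [i + 1 for i, c in enumerate(codes) if c == g]
        if not positions:
            result.extend([0.0] * 5)  # assumption: absent groups give zeros
            continue
        m = len(positions)
        # Assumed rounding rule: floor(m * q), but never below the first residue.
        ranks = [1] + [max(1, math.floor(m * q)) for q in (0.25, 0.5, 0.75, 1.0)]
        result.extend(100.0 * positions[r - 1] / n for r in ranks)
    return result


def ctd(seq):
    """All 3 + 3 + 15 = 21 descriptor values for one attribute."""
    codes = encode(seq)
    return composition(codes) + transition(codes) + distribution(codes)


if __name__ == "__main__":
    print(ctd("RGCRGCRG"))
```

Repeating the same computation with the other six groupings of Table 2 would give the 7 × 21 = 147 values of the fifth feature group.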

Table 2

Amino acid attributes and the division of the amino acids into three groups for each attribute

Attribute	Divisions
Hydrophobicity	Polar	Neutral	Hydrophobic
	R,K,E,D,Q,N	G,A,S,T,P,H,Y	C,L,V,I,M,F,W
Normalized van der Waals volume	Volume range 0–2.78	Volume range 2.95–4.0	Volume range 4.03–8.08
	G,A,S,T,P,D,C	N,V,E,Q,I,L	M,H,K,F,R,Y,W
Polarity	Polarity value 4.9–6.2	Polarity value 8.0–9.2	Polarity value 10.4–13.0
	L,I,F,W,C,M,V,Y	P,A,T,G,S	H,Q,R,K,N,E,D
Polarizability	Polarizability value 0–0.108	Polarizability value 0.128–0.186	Polarizability value 0.219–0.409
	G,A,S,D,T	C,P,N,V,E,Q,I,L	K,M,H,F,R,Y,W
Charge	Positive	Neutral	Negative
	K,R	A,N,C,Q,G,H,I,L,M,F,P,S,T,W,Y,V	D,E
Secondary structure	Helix	Strand	Coil
	E,A,L,M,Q,K,R,H	V,I,Y,C,W,F,T	G,N,P,S,D
Solvent accessibility	Buried	Exposed	Intermediate
	A,L,F,C,G,I,V,W	R,K,Q,E,N,D	M,P,S,T,H,Y

The division is based on the clusters of the amino acid indices of Tomii and Kanehisa (5,40) for each of the seven attributes. For such attributes as secondary structure and solvent accessibility, the division is based on statistical appearance of each amino acid in a specific state.

The sequence-order features can also be used for representing amino acid distribution patterns of a specific physicochemical property along a protein or peptide sequence (11,27), and have been used for predicting protein subcellular locations at accuracy levels of 72.5–88.9% (9,10). These descriptors are derived from both the Schneider–Wrede physicochemical distance matrix (9–11) and the Grantham chemical distance matrix (27) between each pair of the 20 amino acids. The dth rank sequence-order-coupling number is defined as:

$$\tau_d = \sum_{i=1}^{N-d} \left(d_{i,i+d}\right)^2, \qquad d = 1, 2, 3, \ldots, 30,$$

where $d_{i,i+d}$ is the distance between the two amino acids at positions i and i+d. For each amino acid type, the type-1 quasi-sequence-order descriptor can be defined as:

$$X_r = \frac{f_r}{\sum_{s=1}^{20} f_s + w\sum_{k=1}^{30}\tau_k}, \qquad r = 1, 2, 3, \ldots, 20,$$

where $f_r$ is the normalized occurrence of amino acid type r and w is a weighting factor (w = 0.1). The type-2 quasi-sequence-order descriptor is defined as:

$$X_d = \frac{w\,\tau_{d-20}}{\sum_{s=1}^{20} f_s + w\sum_{k=1}^{30}\tau_k}, \qquad d = 21, 22, 23, \ldots, 50.$$

With 30 lags and two distance matrices, these definitions give the 2 × 30 = 60 sequence-order-coupling values and 2 × 50 = 100 quasi-sequence-order values listed in Table 1.

The PROFEAT implementation of each of these algorithms was extensively tested by using a number of test sequences, such as homopolymers and copolymers of different types of amino acids. The computed descriptor values were compared with the known values for these sequences to ensure that they match.
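
The sequence-order descriptors can be sketched as follows for a caller-supplied distance function. This Python illustration is not PROFEAT's code: the function names are hypothetical, the normalized occurrence is assumed to be the count divided by the sequence length, and `toy_distance` is a placeholder (0 for identical residues, 1 otherwise) standing in for the Schneider–Wrede or Grantham matrices, whose values are not reproduced here. A homopolymer and an alternating copolymer serve as test sequences because their values are easy to derive by hand.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def coupling_numbers(seq, dist, max_lag=30):
    """tau_d for d = 1..max_lag: sum of squared distances between residues d apart."""
    n = len(seq)
    return [
        sum(dist(seq[i], seq[i + d]) ** 2 for i in range(n - d))
        for d in range(1, max_lag + 1)
    ]


def quasi_sequence_order(seq, dist, max_lag=30, w=0.1):
    """Return the 20 type-1 values followed by the max_lag type-2 values."""
    n = len(seq)
    # Assumption: "normalized occurrence" is taken as count / sequence length.
    f = [seq.count(aa) / n for aa in AMINO_ACIDS]
    tau = coupling_numbers(seq, dist, max_lag)
    denom = sum(f) + w * sum(tau)
    type1 = [f_r / denom for f_r in f]
    type2 = [w * t / denom for t in tau]
    return type1 + type2


def toy_distance(a, b):
    """Placeholder distance, NOT a published matrix: 0 if identical, else 1."""
    return 0.0 if a == b else 1.0


if __name__ == "__main__":
    # Homopolymer: every coupling number is zero, so only the 'A' entry is non-zero.
    print(quasi_sequence_order("AAAAAAAAAA", toy_distance)[:3])
    # Alternating copolymer: odd lags couple unlike residues, even lags like residues.
    print(coupling_numbers("ACACACACAC", toy_distance, max_lag=4))
```

Running the same functions once per distance matrix would yield the two descriptors counted for each feature in Table 1.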

DISCUSSION

The usefulness of the features computed by PROFEAT for representing the structural and physicochemical characteristics of proteins and peptides has been tested in a number of published studies that developed support vector machine (SVM) classification systems for predicting protein functional classes (4,26), protein–protein interactions (8) and MHC-binding peptides (J. Cui, L. Y. Han, H. H. Lin, H. L. Zhang, Z. Q. Tang, C. J. Zheng, Z. W. Cao and Y. Z. Chen, manuscript submitted). These SVM classification systems have been found to give prediction performance with sensitivity and specificity in the range of 53.0–99.3% and 82.1–99.9%, respectively. Because they use these structural and physicochemical features, these SVM classification systems do not rely on sequence similarity, clustering or profiles for predicting protein functional classes, and they have been found to be particularly useful for facilitating the prediction of novel proteins (13,41,42). Moreover, the descriptors predicted to be important for specific classes of proteins have been found to correlate with the experimentally estimated interactions and forces that define the distinguishing activities of these proteins (4,26,43). For instance, an analysis of the SVM prediction of transporters has shown that, in order of prominence, hydrophobicity, amino acid composition, polarity and charge play prominent roles in identifying transporters (26). Amino acid composition and hydrophobicity are important factors for the interaction of a protein with other biomolecules. Studies of structure-activity relationships of transporter-substrate binding have shown that hydrophobic contact, hydrogen bonding (which arises primarily from polar interaction) and charged centers play important roles in substrate binding (44,45). Molecular modeling has also shown that hydrophobic contact and hydrogen bonding play important roles in transporter-substrate binding (46).
A new SVM prediction system was developed for predicting members and non-members of three separate transporter families, TC1.C, TC3.E and TC9.A, by using this reduced set of descriptors and the same protein datasets as those of the earlier study of SVM prediction of transporters, which used the full set of group 5 descriptors (46). The new system gives a similar prediction performance, suggesting that the selected descriptors are highly useful for distinguishing members and non-members of these transporter families.

So far, individual groups of features have been used separately for predicting structural, functional and interaction profiles of proteins and peptides. For instance, the protein functional class prediction server SVMProt was developed by using descriptors of the fifth feature group (2). It is of interest to examine how the use of additional features affects the performance of this and other prediction systems. For this purpose, two new SVM systems were developed for predicting members and non-members of the enzyme EC1.15 family and the transporter TC2.C family, respectively, by using descriptors of all six feature groups and the same datasets as those used for developing the corresponding SVMProt prediction systems (3,26). Comparison of the results of these new SVM systems with those of the corresponding SVMProt systems shows that the sensitivity (percentage of correctly predicted family members) increases from 92.5 to 94.4% for the EC1.15 family and from 76.5 to 83.2% for the TC2.C family, while the specificity (percentage of correctly predicted non-family members) remains unchanged at 99.8% for both families. This suggests that, at least for some protein families, the use of additional features can moderately improve the performance of SVMProt. The contribution of each of these features can be estimated by separately conducting SVM classification using each feature (47,48).
By using the same method, the order of contribution from each of the feature groups was found to be: 5th group (composition, transition and distribution) > 1st group (amino acid and dipeptide composition) > 6th group (sequence-order) > autocorrelation 1 > autocorrelation 3 > autocorrelation 2. Investigation of other prediction systems and of a more extensive range of protein structural, functional and interaction profiles is warranted.

The commonly used structural and physicochemical features appear to be useful in the development of statistical learning systems for predicting protein structural classes (19,20), functional families (2–5,16,26), protein–protein interactions (6,8), subcellular locations (9,10,17,18) and peptides of specific properties (J. Cui, L. Y. Han, H. H. Lin, H. L. Zhang, Z. Q. Tang, C. J. Zheng, Z. W. Cao and Y. Z. Chen, manuscript submitted). Various proteins are known to form covalent bonds with their substrates and inhibitors. Properties of this type are unlikely to be sufficiently covered by the existing set of features. Some of the molecular descriptors widely used for describing the structural and physicochemical properties of chemical compounds (49–52) may be extended to represent them. PROFEAT can be further improved by allowing the input of new structural and physicochemical properties, expanding the program to compute additional descriptors, and providing user-friendly facilities to feed computed features into general and specialized SVM-based servers such as GIST and SVMProt.

REFERENCES (10 of 43 shown)

1. AAindex: amino acid index database.

Authors: S Kawashima; M Kanehisa
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Multi-class protein fold recognition using support vector machines and neural networks.

Authors: C H Ding; I Dubchak
Journal: Bioinformatics Date: 2001-04 Impact factor: 6.937

3. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect.

Authors: K C Chou
Journal: Biochem Biophys Res Commun Date: 2000-11-19 Impact factor: 3.575

4. Support vector machine approach for protein subcellular localization prediction.

Authors: S Hua; Z Sun
Journal: Bioinformatics Date: 2001-08 Impact factor: 6.937

5. Classifying G-protein coupled receptors with support vector machines.

Authors: Rachel Karchin; Kevin Karplus; David Haussler
Journal: Bioinformatics Date: 2002-01 Impact factor: 6.937

6. Accurate prediction of protein secondary structural content.

Authors: Z Lin; X M Pan
Journal: J Protein Chem Date: 2001-04

7. Predicting protein--protein interactions from primary structure.

Authors: J R Bock; D A Gough
Journal: Bioinformatics Date: 2001-05 Impact factor: 6.937

8. Classification of nuclear receptors based on amino acid composition and dipeptide composition.

Authors: Manoj Bhasin; Gajendra P S Raghava
Journal: J Biol Chem Date: 2004-03-23 Impact factor: 5.157

9. Themes in RNA-protein recognition.

Authors: D E Draper
Journal: J Mol Biol Date: 1999-10-22 Impact factor: 5.469

10. Molecular modeling study of diltiazem mimics at L-type calcium channels.

Authors: K J Schleifer; E Tot
Journal: Pharm Res Date: 1999-10 Impact factor: 4.200
