
Natural/random protein classification models based on star network topological indices.

Cristian Robert Munteanu, Humberto González-Díaz, Fernanda Borges, Alexandre Lopes de Magalhães.

Abstract

The development of complex network graphs permits us to describe any real system, such as social, neural, computer or genetic networks, by transforming real properties into topological indices (TIs). This work uses Randic's star networks to convert protein primary structure data into specific topological indices that are used to construct a natural/random protein classification model. The set of natural proteins contains 1046 protein chains selected from the pre-compiled CulledPDB list from PISCES Dunbrack's Web Lab. This set is characterized by a protein homology of 20%, a structure resolution of 1.6 Å and an R-factor lower than 25%. The set of random amino acid chains contains 1046 sequences generated by a Python script with the same types of residues and the same average chain length as the natural set. A new Sequence to Star Networks (S2SNet) wxPython GUI application (with a Graphviz graphics back-end) was designed by our group to transform any character sequence into the following star network topological indices: Shannon entropy of Markov matrices, trace of connectivity matrices, Harary number, Wiener index, Gutman index, Schultz index, Moreau-Broto indices, Balaban distance connectivity index, Kier-Hall connectivity indices and Randic connectivity index. The model was constructed with the General Discriminant Analysis methods of the STATISTICA package and gave a training/predicting set accuracy of 90.77% for the forward stepwise model type. In conclusion, this study extends for the first time the classical TIs to protein star network TIs by proposing a model that can predict whether a protein or protein fragment is natural or random using only amino acid sequence data. This classification can be used in studies of protein function, by replacing some fragments with random amino acid sequences, or to detect fake amino acid sequences or errors in proteins. These results promote the use of the S2SNet application not only for protein structure analysis but also for mass spectroscopy, clinical proteomics and imaging, or DNA/RNA structure analysis.

Substances: Proteins

Year: 2008 PMID: 18692072 PMCID: PMC7094162 DOI: 10.1016/j.jtbi.2008.07.018

Source DB: PubMed Journal: J Theor Biol ISSN: 0022-5193 Impact factor: 2.691

Introduction

A widely used method for predicting protein properties is the quantitative structure activity relationship (QSAR) (Devillers and Balaban, 1999). Graph theory can be used to obtain macromolecular descriptors named topological indices (TIs). The branch of mathematical chemistry dedicated to encoding DNA/protein information in graph representations by means of TIs has become an intense research area, with contributions from Liao (Liao and Wang, 2004a, Liao and Wang, 2004b; Liao and Ding, 2005; Liao et al., 2006), Randic, Nandy, Balaban, Basak, and Vracko (Randic, 2000; Randic et al., 2000; Randic and Basak, 2001; Randic and Balaban, 2003), the Bielinska-Waz team (Bielinska-Waz et al., 2007) and our group (Perez et al., 2004; Aguero-Chapin et al., 2006). Graphical approaches to the study of biological systems can provide useful insights, as indicated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions (Andraos, 2008; Chou, 1989; Chou and Forsen, 1980, Chou and Forsen, 1981; Chou and Liu, 1981; Chou et al., 1979; King and Altman, 1956; Kuzmic et al., 1992; Myers and Palmer, 1985; Zhou and Deng, 1984), protein folding kinetics (Chou, 1990), inhibition kinetics of processive nucleic acid polymerases and nucleases (Althaus et al., 1993a, Althaus et al., 1993b, Althaus et al., 1993c, Althaus et al., 1994a, Althaus et al., 1994b, Althaus et al., 1996; Chou et al., 1994), analysis of codon usage (Chou and Zhang, 1992; Zhang and Chou, 1993, Zhang and Chou, 1994), base frequencies in the anti-sense strands (Chou et al., 1996), and analysis of DNA sequence (Qi et al., 2007). Moreover, graphical methods have been introduced for QSAR studies (Gonzalez-Diaz et al., 2006, Gonzalez-Diaz et al., 2007b; Prado-Prado et al., 2008) and used to deal with complicated network systems (Diao et al., 2007; Gonzalez-Diaz et al., 2007a, Gonzalez-Díaz et al., 2008).
Recently, the “cellular automaton image” (Wolfram, 1984, Wolfram, 2002) has also been applied to study hepatitis B viral infections (Xiao et al., 2006a), HBV virus gene missense mutation (Xiao et al., 2005b), and visual analysis of SARS-CoV (Gao et al., 2006; Wang et al., 2005), as well as to represent complicated biological sequences (Xiao et al., 2005a) and to help identify protein attributes (Xiao and Chou, 2007; Xiao et al., 2006b). The present work describes, for the first time, a natural/random protein classification that uses only the chain sequence and the amino acid connectivity as protein structural data. These data are transformed into sequence and connectivity Star Graph TIs, which are then used as input for a linear statistical method to construct a simple classification model.

Materials and methods

Protein set

Two sets of proteins are compared in the new classification model: a set (Nat) of 1046 natural protein chains taken from the pre-compiled CulledPDB list from PISCES Dunbrack's Web Lab (Wang and Dunbrack, 2003), and a second set (Rnd) of the same size formed by random amino acid sequences generated with Python scripts (Rossum, 2006). The natural set is characterized by a homology of 20%, a structure resolution of 1.6 Å and an R-factor lower than 25%. The random set is composed of the same standard amino acid types, and the average length of its chains is the same as that of the natural set. Python scripts were used to download the PDB files from the PDB data bank (Berman et al., 2000) and to create the corresponding DSSP files with the DSSP application (Kabsch and Sander, 1983). The chain sequences were extracted from these DSSP files with a Python script and filtered with our Prot-2S Web Tool (http://www.requimte.pt:8080/Prot-2S/) to remove the chains that contain non-standard amino acids (usually labelled X).
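
As an illustration of how such a random set could be produced, the following is a minimal sketch and not the authors' script. It assumes that residues are drawn uniformly from the 20 standard one-letter codes and that every random chain receives the rounded average length of the natural set; the function names and the toy input chains are hypothetical.

```python
import random

# The 20 standard amino acids in one-letter code.
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"


def has_only_standard_residues(chain):
    """True if the chain contains nothing outside the 20 standard letters (e.g. no X)."""
    return set(chain) <= set(STANDARD_AA)


def random_chain(length, rng):
    """Random sequence of standard residues, drawn uniformly (an assumption)."""
    return "".join(rng.choice(STANDARD_AA) for _ in range(length))


def random_set_like(natural_chains, seed=0):
    """Random set with the same size and the same average length as the natural set.

    Assumption: every random chain gets the rounded average natural length;
    the text does not say whether individual lengths were varied.
    """
    rng = random.Random(seed)
    avg_len = round(sum(len(c) for c in natural_chains) / len(natural_chains))
    return [random_chain(avg_len, rng) for _ in natural_chains]


# Toy stand-ins for natural chains (not PDB data); the third one contains X.
natural = ["ACADCEFDGH", "MKTAYIAKQR", "GSHMXLE"]
natural = [c for c in natural if has_only_standard_residues(c)]
rnd = random_set_like(natural)
```

Fixing the seed makes the random set reproducible, which is convenient when the same Rnd set has to be reused across several model-building runs.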

Star graph

Each protein can be considered as a real network in which the amino acids are the vertices (nodes), connected in a specific sequence by the peptide bonds. The graph is the abstract representation of the network: a collection of N vertices and the connections between them. The star graph is a special case of a tree with N vertices in which one vertex has degree N−1 and the remaining N−1 vertices have degree one (Harary, 1969). As in any tree, there is a unique path between any pair of vertices. For proteins, each of the 20 possible branches (“rays”) of the star contains a single amino acid type, and the star centre is a non-amino acid vertex. The same protein can be represented in different forms, which are associated with distinct distance matrices (Randic et al., 2007). If the vertices do not carry a label, the sequence information is lost; for that reason, the best method is to construct a standard star graph in which each amino acid/vertex holds its position in the original sequence and the branches are labelled in alphabetical order of the three-letter amino acid code (Randic et al., 2007). In the present study we use the alphabetical order of the one-letter amino acid code. The standard star graph for a random virtual decapeptide (ACADCEFDGH) is illustrated in Fig. 1.

Fig. 1

Non-embedded Star Graph for the ACADCEFDGH sequence.

If the initial connectivity in the protein chain is included, the graph is embedded (Fig. 2). In order to compare the graphs, it is necessary to transform the graphical representation into a connectivity matrix, a distance matrix and a degree matrix. In the case of the embedded graph, the matrices of the connectivity in the sequence and in the star graph are combined. These matrices and their normalized forms are the basis for the calculation of the TIs.

Fig. 2

Embedded Star Graph for the ACADCEFDGH sequence.
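
The construction of the two graphs and their matrices for the ACADCEFDGH decapeptide can be sketched as follows. This is an illustrative reading of the description above, not S2SNet code: it assumes that vertex 0 is the star centre, that the residues of one type are chained along their ray in order of occurrence, and that embedding adds an edge between sequence neighbours. All function names are hypothetical. The alphabetical order of the rays only affects the drawing, so it does not appear in the matrices.

```python
from collections import deque


def star_graph_edges(seq, embedded=False):
    """Edge set of the star graph; vertex 0 is the centre, vertex i is residue i (1-based)."""
    last_on_ray = {}  # residue letter -> last vertex placed on that ray
    edges = set()
    for pos, aa in enumerate(seq, start=1):
        parent = last_on_ray.get(aa, 0)  # first residue of a type hangs off the centre
        edges.add((parent, pos))
        last_on_ray[aa] = pos
    if embedded:
        # Add the original chain connectivity (peptide bonds i -- i+1).
        for pos in range(1, len(seq)):
            edges.add((pos, pos + 1))
    return edges


def connectivity_matrix(n_vertices, edges):
    """Symmetric 0/1 connectivity matrix."""
    m = [[0] * n_vertices for _ in range(n_vertices)]
    for i, j in edges:
        m[i][j] = m[j][i] = 1
    return m


def degrees(m):
    """Diagonal of the degree matrix: number of neighbours of each vertex."""
    return [sum(row) for row in m]


def distance_matrix(m):
    """Topological distances by breadth-first search from every vertex."""
    n = len(m)
    dist = [[0] * n for _ in range(n)]
    for src in range(n):
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if m[u][v] and v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[src][v] = d
    return dist


seq = "ACADCEFDGH"
n = len(seq) + 1  # residues plus the centre
m_ne = connectivity_matrix(n, star_graph_edges(seq))                # non-embedded
m_e = connectivity_matrix(n, star_graph_edges(seq, embedded=True))  # embedded
d_ne, d_e = distance_matrix(m_ne), distance_matrix(m_e)
```

In this sketch the non-embedded graph is a tree with 10 edges, while the embedded graph gains the 9 chain bonds, which shortens many distances (for example between residues 3 and 5).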

TIs for star graph

The protein chain sequences are transformed into Star Graph representations and then characterized by several TIs using our new Sequence to Star Networks (S2SNet) application. S2SNet is a wxPython (Noel Rappin, 2006) GUI application with Graphviz (Koutsofios, 1993) as a graphics back-end. With this interactive tool the user can choose the type of calculation (embedded graph, additional weights for each amino acid, Markov normalization, power of the connectivity matrix), the input files (files with sequences, groups and weights), the output files, the level of detail (files for summary and detailed results) and the type of graph visualization (dot, neato, fdp, twopi, circo). The calculations presented in this work use embedded and non-embedded TIs, no weights, Markov normalization and powers of matrices/indices (n) up to 5.

The summary file contains the following TIs (Todeschini and Consonni, 2002):

Shannon entropy of the n-powered Markov matrices (Sh), where p are the n elements of the p vector that results from multiplying the powered Markov-normalized matrix (n×n) by a vector (n×1) with each element equal to 1/n;
Trace of the n-powered connectivity matrices (Tr), where n runs from 0 to the power limit, M is the connectivity matrix (i×i dimension) and ii denotes the ith diagonal element;
Harary number (H), where d are the elements of the distance matrix, m are the elements of the connectivity matrix M, w are the weight elements and nw is a switch that selects (1) or deselects (0) the weight calculations;
Wiener index (W);
Gutman topological index (S6), where deg are the elements of the degree matrix;
Schultz topological index, non-trivial part (S);
Moreau-Broto autocorrelation of topological structure (ATS, n from 1 to the power limit), calculated only when weights are included, where dp are the elements of the pair distance matrix when the distance is n;
Balaban distance connectivity index (J), where nodes+1 is the number of amino acids (nodes in the Star Graph) plus the origin, and ∑ is the node distance degree;
Kier–Hall connectivity indices (ⁿX);
Randic connectivity index (¹X).

All these TIs are used to construct a natural/random classification model by statistical methods.
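
To make some of these indices concrete, the sketch below computes textbook, unweighted forms of five of them on a small graph: the Wiener index as the sum of distances over vertex pairs, the Harary number as the sum of reciprocal distances, the Randic index as the sum over edges of 1/sqrt(deg_i·deg_j), the trace of a powered connectivity matrix, and a Shannon entropy of a powered Markov matrix. These are standard definitions and may differ from the exact S2SNet expressions (for example in the handling of the weight switch nw). For the entropy, the sketch assumes a row-stochastic normalization and propagates the uniform vector as a row vector, because multiplying a row-stochastic matrix by a uniform column vector would simply return the uniform vector. All function names are hypothetical.

```python
import math


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(a, power):
    """a raised to a non-negative integer power (power 0 gives the identity)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, a)
    return result


def distances(m):
    """All-pairs shortest paths (Floyd-Warshall) for a connected, unweighted graph."""
    n = len(m)
    d = [[0 if i == j else (1 if m[i][j] else math.inf) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d


def wiener(d):
    n = len(d)
    return sum(d[i][j] for i in range(n) for j in range(i + 1, n))


def harary(d):
    n = len(d)
    return sum(1.0 / d[i][j] for i in range(n) for j in range(i + 1, n))


def randic(m):
    deg = [sum(row) for row in m]
    n = len(m)
    return sum(1.0 / math.sqrt(deg[i] * deg[j])
               for i in range(n) for j in range(i + 1, n) if m[i][j])


def trace_of_power(m, power):
    p = mat_pow(m, power)
    return sum(p[i][i] for i in range(len(m)))


def shannon_entropy(m, power):
    """Entropy of the uniform distribution propagated 'power' steps (assumed convention)."""
    n = len(m)
    markov = [[x / sum(row) for x in row] for row in m]  # row-stochastic normalization
    pk = mat_pow(markov, power)
    p = [sum(pk[i][j] for i in range(n)) / n for j in range(n)]  # (1/n, ..., 1/n) times P^k
    return -sum(x * math.log(x) for x in p if x > 0)


# Toy graph: a centre (vertex 0) with three leaves.
m = [[0, 1, 1, 1],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
d = distances(m)
indices = {
    "W": wiener(d),
    "H": harary(d),
    "1X": randic(m),
    "Tr2": trace_of_power(m, 2),
    "Sh1": shannon_entropy(m, 1),
}
```

For this toy graph the Wiener index is 9 (three pairs at distance 1 and three at distance 2) and the trace of the squared connectivity matrix equals the sum of the vertex degrees, 6.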

Statistical analysis

General discriminant analysis (GDA) (Kowalski and Wold, 1982; Van Waterbeemd, 1995) from the STATISTICA 6.0 package (StatSoft.Inc., 2002) was chosen as the simplest and fastest method. In order to decide whether a protein chain is classified as natural (if it exists in the PDB database) or random, we added an extra dummy variable named Nat/Rnd (binary values of 0/1) and a cross-validation variable (CV). Three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test, and the jackknife test (Chou and Zhang, 1995). Chou and Shen (2007, 2008) have argued that, of the three, the jackknife test is the least arbitrary. Therefore, the jackknife test has been increasingly used by investigators to examine the accuracy of various predictors (Chen and Li, 2007a, Chen and Li, 2007b; Diao et al., 2007; Ding et al., 2007; Jiang et al., 2008; Jin et al., 2008; Li and Li, 2008; Lin, 2008; Lin et al., 2008; Niu et al., 2006, Niu et al., 2008; Wang et al., 2008; Xiao and Chou, 2007; Zhou et al., 2007; Zhang et al., 2008). In the present work, the independent data test is used: the data are split at random into a training series (train, 75%) used for model construction and a prediction series (val, 25%) used for model validation (the CV column is filled by repeating 3 train and 1 val). All independent variables are standardized prior to model construction. Using the S2SNet methodology defined previously, we can attempt to develop a simple linear QSAR with the general formula

$$\text{Nat/Rnd-score} = c_0 + c_1\,TI_1 + c_2\,TI_2 + \dots + c_n\,TI_n$$

where Nat/Rnd-score is the continuous score value for the Nat/Rnd classification, TI_1 to TI_n are the TIs described above, c_1 to c_n are the TI coefficients, n is the number of indices and c_0 is the independent term. The quality of the GDA models was determined by examining Wilks' U statistic, the Fisher ratio (F), the p-level (p) and the canonical regression coefficient (R). We also inspected the percentage of good classification, the cases/variables ratios and the number of variables to be explored, in order to avoid over-fitting or chance correlation. The forward, backward and best subset model types were tested on the embedded TIs, the non-embedded TIs and both together.
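
The data preparation described above (a CV column that repeats 3 train and 1 val, standardized variables, and a linear score) can be sketched as follows. This is only an illustration under stated assumptions, not the STATISTICA procedure: the cases are shuffled with a fixed seed before the train/val pattern is applied, standardization uses the sample standard deviation, and the coefficients in the usage example are made-up numbers. All function names are hypothetical.

```python
import random
import statistics


def cv_column(n_cases, seed=0):
    """Label cases 'train'/'val': shuffle the case order, then repeat 3 train + 1 val."""
    order = list(range(n_cases))
    random.Random(seed).shuffle(order)
    labels = [None] * n_cases
    for rank, case in enumerate(order):
        labels[case] = "val" if rank % 4 == 3 else "train"
    return labels


def standardize(column):
    """Centre a variable on its mean and scale it by its sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


def linear_score(c0, coefficients, tis):
    """Nat/Rnd-score = c0 + c1*TI1 + ... + cn*TIn for one chain."""
    return c0 + sum(c * t for c, t in zip(coefficients, tis))


cv = cv_column(2092)                                   # 1046 Nat + 1046 Rnd chains
z = standardize([1.0, 2.0, 3.0])                       # toy TI column
score = linear_score(0.5, [2.0, -1.0], [1.0, 3.0])     # made-up coefficients and TIs
```

With 2092 cases the pattern yields 1569 training and 523 validation cases, matching the 75%/25% split.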

Results

Eight models were constructed in order to find the best GDA equation able to discriminate between natural and random protein chains. They combine the embedded/non-embedded Star Graph TIs obtained with the S2SNet application with the forward, backward and best subset model types. The values obtained for the training/predicting accuracies are presented in Table 1.

Table 1

Training/predicting accuracies for the embedded (E), non-embedded (nE) and both Star Graph TIs

Model	Star Graph type	Train % Nat	Train % Rnd	Train % Total	Cross-validation % Nat	Cross-validation % Rnd	Cross-validation % Total	Total % Nat	Total % Rnd	Total % Total
Forward	nE	86.50	96.17	91.33	83.52	96.95	90.25	85.76	96.37	91.06
Forward	E	80.00	88.65	84.32	78.54	90.08	84.32	79.64	89.01	84.32
Forward	nE and E	85.86	96.17	91.01	81.99	98.09	90.06	84.89	96.65	90.77
Backward	nE	86.11	96.68	91.40	83.52	98.09	90.82	85.47	97.04	91.25
Backward	E	81.27	90.82	86.04	79.69	92.75	86.23	80.88	91.30	86.09
Backward	nE and E	86.75	97.19	91.97	84.67	98.47	91.59	86.23	97.51	91.87
Best	nE	86.75	96.68	91.71	83.52	98.09	90.82	85.95	97.04	91.49
Best	E	81.40	90.05	85.72	79.31	91.60	85.47	80.88	90.44	85.66

The forward stepwise variable selection method combined with the nE and E TIs provides the best results for our data set, with 91.01%, 90.06% and 90.77% of the chains correctly classified in the training, cross-validation and full sets, respectively, using a minimum number of 12 parameters (Eq. (15)). The embedded TIs carry the name of the non-embedded ones plus “e” as a suffix. In Eq. (15), N is the number of studied protein sequences (Nat+Rnd), R is the canonical regression coefficient, U is Wilks' statistic, F is Fisher's statistic and p is the p-level (probability of error). The R value obtained shows a high level of correlation between the input variables and the classification of proteins. Wilks' U measures the statistical significance of the discriminatory power of the model and takes values from 1.0 (no discriminatory power) to 0.0 (perfect discriminatory power). The F value shows the statistical significance of the discrimination between groups; it measures the extent to which a variable makes a unique contribution to the prediction of group membership. The p-level of Fisher's test for the GDA is less than 0.05, which shows that the hypothesis of group overlapping can be rejected at the 5% error level (Hua and Sun, 2001). Results of this kind are typically considered excellent in the literature for LDA-QSAR models (Garcia-Garcia et al., 2004; Marrero-Ponce et al., 2004, Marrero-Ponce et al., 2005). Parametric assumptions such as normality, homoscedasticity (homogeneity of variances) and non-collinearity are as important in the application of multivariate statistical techniques to QSPR (Bisquerra Alzina, 1989; Stewart, 1998) as the correct specification of the mathematical form of the model. The validity and statistical significance of any model are conditioned by these factors.

In our study, a simple linear mathematical form of the model was chosen in the absence of prior information. Figs. 3 and 4 show that the plots of the training cases against the residuals do not present any characteristic pattern (Dillon and Goldstein, 1984). Proteins no. 632 and 864 are the only two cases not shown in Fig. 4, because their raw residuals (ca. −7) are clearly distinct from those of the rest of the set. They correspond to 1QWN, chain A (1014 AAs) and 1JZ8, chain A (1011 AAs). One possible reason for this apparently different statistical behaviour is a limitation of the model when the length of the chains is greater than 1000 amino acids. It is possible that the star network TIs of large proteins become similar to the TIs of the random proteins.

Fig. 3

Training cases against the residuals for the full set.

Fig. 4

A zoom in the training cases against the residuals for the full set that does not include the two abnormal sequences.

A different and better threshold for the a priori classification probability can be estimated by means of the receiver operating characteristic (ROC) curve (James and Hanley, 1982). As Fig. 5 shows, the model is not a random classifier but a statistically significant one, since the area under the ROC curve (0.98 for training and 0.96 for validation) is significantly higher than the area under the random classifier curve (0.5, the diagonal line) (Morales Helguera et al., 2007).

Fig. 5

ROC curve for the Nat/Rnd model.
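
The area under the ROC curve can be computed directly from the classification scores, without drawing the curve, as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (ties counting one half). The sketch below illustrates that standard rank-based identity; it is not the procedure used to produce Fig. 5, and the choice of which class is coded as 1, as well as the toy scores, are assumptions made for the example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each positive/negative pair contributes 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
auc_perfect = roc_auc([0.9, 0.8, 0.4, 0.3], labels)   # classes fully separated
auc_partial = roc_auc([0.9, 0.4, 0.8, 0.3], labels)   # one pair out of four misordered
auc_random = roc_auc([0.5, 0.5, 0.5, 0.5], labels)    # uninformative scores
```

An uninformative score gives 0.5, the area under the diagonal line, which is the reference against which the reported values of 0.98 and 0.96 are compared.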

The validity of GDA models depends on the normal distribution of the sample used as well as on the homogeneity of the variances. Thus, we carried out two significance tests for normality, the chi-square and Kolmogorov–Smirnov tests, and found statistically significant differences (p<0.01) in the respective statistics (chi-square, d). These results lead us to reject the hypothesis that the sample under study is normally distributed (Fig. 6) (Stewart, 1998).

Fig. 6

Distribution for GDA model residuals, chi-square and Kolmogorov–Smirnov tests.
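
For readers unfamiliar with the Kolmogorov–Smirnov statistic d, the sketch below computes it for a sample against a normal distribution: d is the largest gap between the empirical cumulative distribution and the normal one. It is a minimal illustration, not the STATISTICA implementation. It assumes that the mean and standard deviation are estimated from the sample itself (strictly the Lilliefors variant, for which the standard Kolmogorov–Smirnov critical values do not apply), and it does not compute a p-value. The toy samples are invented.

```python
import math
import statistics


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ks_statistic(sample):
    """Largest distance between the empirical CDF and a normal CDF fitted to the sample."""
    xs = sorted(sample)
    n = len(xs)
    mean, sd = statistics.mean(xs), statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mean, sd)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides of the jump.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


d_symmetric = ks_statistic([-2.0, -1.0, 0.0, 1.0, 2.0])   # roughly bell-like toy residuals
d_outlier = ks_statistic([0.0, 0.0, 0.0, 0.0, 10.0])      # toy residuals with one extreme case
```

A sample dominated by an extreme residual produces a much larger d than a symmetric one, which is the kind of departure from normality the test flags.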

Heteroscedasticity in a large set can be detected with a simple graphical method based on examining the residuals against each variable included in the model. Fig. 7 (a and b) shows that the plots of the Nat/Rnd GDA model variables against the residuals do not present any pattern, which indicates that the homoscedasticity assumption is fulfilled (Stewart, 1998).

Fig. 7

(a) Graphical analysis of homogeneity of variances (variables vs. residuals) for Sh0, H, W, S, J and X0. (b) Graphical analysis of homogeneity of variances (variables vs. residuals) for X3, X4, X5, Tr4e and X2e.

Owing to the robustness of GDA multivariate statistical techniques, the predictive ability of the proposed model and the inferences drawn from it should not be affected (see Fig. 8).

Fig. 8

Residuals vs. deleted residuals plot for the GDA model.

Discussion

This study extends for the first time the classical TIs to protein Star Network TIs by proposing a model that can predict whether a protein chain is natural or random. The results show the excellent predictive ability (90.77% overall accuracy) of the simple and fast combination of Star Network TIs and linear GDA models for natural/random protein classification. This classification can support the study of protein function by replacing some fragments with random amino acid sequences, and can help to detect fake amino acid sequences or errors in proteins. The S2SNet application can be used to calculate the protein Star Network TIs, which can serve as the basis of a model for any other protein property. S2SNet can also be used for mass spectroscopy, clinical proteomics and imaging, or DNA/RNA structure analysis.

References (68 in total)

1. Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping triplets of nucleotide bases.

Authors: Bo Liao; Tian-Ming Wang
Journal: J Chem Inf Comput Sci Date: 2004 Sep-Oct

2. Mixtures of tight-binding enzyme inhibitors. Kinetic analysis by a recursive rate equation.

Authors: P Kuzmic; K Y Ng; T D Heath
Journal: Anal Biochem Date: 1992-01 Impact factor: 3.365

3. Recent progress in protein subcellular location prediction (review).

Authors: Kuo-Chen Chou; Hong-Bin Shen
Journal: Anal Biochem Date: 2007-07-12 Impact factor: 3.365

4. Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach.

Authors: Feng-Min Li; Qian-Zhong Li
Journal: Protein Pept Lett Date: 2008 Impact factor: 1.890

5. Predicting membrane protein types by the LLDA algorithm.

Authors: Tong Wang; Jie Yang; Hong-Bin Shen; Kuo-Chen Chou
Journal: Protein Pept Lett Date: 2008 Impact factor: 1.890

6. 3D-QSAR study for DNA cleavage proteins with a potential anti-tumor ATCUN-like motif.

Authors: Humberto González-Díaz; Angeles Sánchez-González; Yenny González-Díaz
Journal: J Inorg Biochem Date: 2006-03-16 Impact factor: 4.155

7. Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo-amino acid composition.

Authors: Ying-Li Chen; Qian-Zhong Li
Journal: J Theor Biol Date: 2007-05-18 Impact factor: 2.691

8. The quinoline U-78036 is a potent inhibitor of HIV-1 reverse transcriptase.

Authors: I W Althaus; A J Gonzales; J J Chou; D L Romero; M R Deibel; K C Chou; F J Kezdy; L Resnick; M E Busso; A G So
Journal: J Biol Chem Date: 1993-07-15 Impact factor: 5.157

9. A probability cellular automaton model for hepatitis B viral infections.

Authors: Xuan Xiao; Shi-Huang Shao; Kuo-Chen Chou
Journal: Biochem Biophys Res Commun Date: 2006-02-08 Impact factor: 3.575

10. The benzylthio-pyrimidine U-31,355, a potent inhibitor of HIV-1 reverse transcriptase.

Authors: I W Althaus; K C Chou; R J Lemay; K M Franks; M R Deibel; F J Kezdy; L Resnick; M E Busso; A G So; K M Downey; D L Romero; R C Thomas; P A Aristoff; W G Tarpley; F Reusser
Journal: Biochem Pharmacol Date: 1996-03-22 Impact factor: 5.858
