
A classification scoring schema to validate protein interactors.

Prashanth Suravajhala, Vijayaraghava Seshadri Sundararajan.

Abstract

Annotating a hypothetical protein (HP) is a great challenge, especially when the protein is only putatively linked or mapped to another protein. Although protein interaction networks (PIN) are now prevalent, many visualizers still do not support HP annotation. In this work, we propose a six-point classification system to validate protein interactions based on diverse features. The HP data-set was used as a training data-set to find putative functional interaction partners for the remaining proteins that await interaction partners. A Total Reliability Score (TRS) was calculated from the six-point classification and evaluated using machine learning algorithms on a single node. We found that a multilayer perceptron neural network yielded an accuracy of 81.08% in modelling the TRS, whereas feature selection algorithms confirmed that all classification features are implementable. Furthermore, statistical results from variance and covariance analyses confirmed the usefulness of these classification metrics. Of all the classification features, subcellular location (sorting signals) has the greatest impact in predicting the function of HPs.


Keywords: hypothetical proteins; protein interaction networks; total reliability score

Year: 2012 PMID: 22359432 PMCID: PMC3282273 DOI: 10.6026/97320630008034

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Protein-protein interaction (PPI) data provide a powerful representation for discerning the organization of cells, predicting biological functions and gaining insight into a variety of biochemical processes [1]. In the recent past, PPI data from protein interaction networks (PIN) have increased twofold, while several advanced methods [2] that connect orthology mapping and comparative approaches have emerged to analyze and visualize proteins. These approaches help bioinformatics algorithms discover families of proteins that share functional modules. However, such methods are usually applicable only when the protein has a known functional relationship, and not to proteins labelled ‘predicted’, ‘similar to’ or ‘hypothetical’. On the other hand, significant experimental efforts have allowed us to analyze the interactions of known proteins in various organisms. The interactomes established so far represent proteins from various organisms and sometimes organelles. With high-throughput data still limited for the human proteome, genome-wide approaches have been used to elucidate the human interactome. Assuming that functional protein interactions are conserved in evolution, one can extend the experimentally determined human protein interaction network using protein interaction data-sets from model organisms. Such transfer of information from known interaction networks to unknown ones has been accomplished [14]; it further requires identifying genes that have a common ancestor and therefore share the same function between orthologs. In addition, web-based databases such as String (http://www.string.embl.de) contain several thousand predicted interactions, assembled by mapping interactions from model organisms to various orthologs using sequence similarity searches and bidirectional and reciprocal best hit approaches.
All the inferred interactions rely on only one or two methods and therefore contain many false positives. Hence it remains a challenge to develop efficient tools that predict and accurately assign function to hypothetical proteins (HPs). Orthology mapping poses a great challenge, as many known HPs remain un-annotated even after being ‘mapped to’ or ‘associated’ with certain known proteins. For example, an HP queried using the visualizer Osprey [3] would still be shown as a grey node (meaning unknown) and therefore remains unsupported. In addition, analyzing huge data-sets of HPs is challenging because the flow of operations has to be executed simultaneously for better protein annotation. Cloud architecture [4] simplifies this work: the data are mapped and then passed to a reduce phase, which combines the outcomes from multiple nodes into a single outcome. Learning algorithms can likewise be implemented over multiple nodes in a map-reduce framework. The data thus analysed can then be used to validate each parameter, and further statistical analysis can be employed to validate the results. To address these challenges, we have employed a six-point classification schema based on a set of features. Our classification was applied to a subset of 20 HPs selected randomly from a group of 1455 HPs [5]. By employing the classification system, we propose that bona fide protein interactions can be determined, making Protein Interaction Networks (PIN) more sensitive.

Methodology

Six-point Classification:

Each classification is a measure of a different method, viz. Pfam score, orthology inference, functional linkages, back to back orthology for protein interactants, subcellular location, and protein associations taken from known databases and visualizers. Each protein is given a value of 1 if it matches the classifier; otherwise it is given 0. The annotation scores based on these features are available in Table 1 (see supplementary material).

Classification 1: Pfam identities:

Score:

The best Pfam score is assigned according to the assignment returned by Pfam [6]: Pfam-B is given a value of 0 and Pfam-A a value of 1.

Principle:

The underlying principle is that the presence of domains in varying combinations in different proteins provides insight into the function of the protein. Pfam, represented by multiple sequence alignments and Hidden Markov Models (HMMs), classifies the query into Pfam-A or Pfam-B. While Pfam-A entries are curated and built from a seed alignment, Pfam-B entries are lower-quality entries generated automatically from electronic annotation using non-redundant clusters.

Classification 2: Orthology mapping:

Hits with an E-value < 1 were given a score of 1, else 0. Orthologous proteins often retain similar functions, so a pair of orthologs that interacts in the subject organism is likely to interact in the target organism too (such putative interactions in other organisms are called interologs). The protein sequences were searched with BLAST against the Arabidopsis thaliana and Non Redundant (NR) databases. In addition, other organisms were targeted as hot spots for functional linkages. We transferred the ortholog information to Arabidopsis thaliana and mapped it to functional linkages.

Classification 3: Functional linkages using protein interactions and associations:

If an association or linkage is found through the Rosetta stone method, a score of 1 is given, else 0. A protein Navigator tool [7] for checking orthology pairs was used to predict the functional linkages associated with them:
1. The Rosetta stone method [8] works on the rationale that two polypeptides X and Y in one organism are likely to interact if their homologs are expressed as a single polypeptide XY (called a Rosetta Stone) in another organism. It is also likely that some proteins are functionally represented as pathways or chemicals, or simply in the GO database.
2. The gene fusion method [9] works on the theory that pairs of monomeric proteins that are fused in other organisms tend to be functionally related or to interact physically.
3. The gene neighbours method works on the assumption that the operons of one organism may be conserved across other organisms.

Classification 4: Back to back orthology:

The associators or interactors found in classifier 3 are searched for in the query organism; if they are found distinctly and are correlated, a score of 1 is given, and 0 if absent. The interaction is linked only if the ortholog of the interactant is also present in the query organism. This is similar to the bidirectional best BLAST hit approach.
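The back to back check can be pictured as a reciprocal best-hit test. The following is a minimal sketch under stated assumptions: the best-hit tables are assumed to have been extracted from BLAST output beforehand, and the function name, dictionary layout and identifiers are hypothetical placeholders rather than anything taken from the study.

```python
def back_to_back_score(interactor, best_hit_in_query, best_hit_in_target):
    """Return 1 if the interactor from the target organism has an ortholog in
    the query organism that maps back to the same interactor, else 0."""
    # Best hit of the target-organism interactor when searched in the query organism.
    candidate = best_hit_in_query.get(interactor)
    if candidate is None:
        return 0  # no ortholog found in the query organism
    # Reverse search: the candidate must point back to the original interactor.
    return 1 if best_hit_in_target.get(candidate) == interactor else 0


# Hypothetical best-hit tables (placeholder identifiers, not real accessions).
target_to_query = {"target_A": "query_1", "target_B": "query_2"}
query_to_target = {"query_1": "target_A", "query_2": "target_C"}

for name in ("target_A", "target_B", "target_Z"):
    print(name, back_to_back_score(name, target_to_query, query_to_target))
```

Only `target_A` passes in this toy example, because its best hit in the query organism points back to it.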

Classification 5: Presence of sorting signals and localization to the same organelle:

If the proteins are localized to the same organelle, a score of 1 is given, else 0. Approximately 50% of interactions occur within the same organelle, and transmembrane interactions are rarely found. If the proteins are predicted to be localized to different organelles, the chance of the protein interacting with the query is lower. TargetP [10] was used for the subcellular location classifier.

Classification 6: Presence of interactors (available through databases and visualizers):

A score of 1 is given for presence and 0 for absence. Experimentally confirmed interacting pairs are documented in databases such as the Database of Interacting Proteins (DIP) [11], MINT [12], IntAct [13] and thebiogrid.org [14], and are visualized using Cytoscape [15] and Osprey [3]. Classification 3, which was developed on the assumption that Rosetta Stone (RS) sequences are present, is based on the presence of interacting partners in existing databases and those that we visualized using Osprey.

Total Reliability Score (TRS):

The total reliability score (TRS) is the sum of the scores from all six classifiers. If the value exceeds 3, we consider the candidate likely to have a probable interacting partner. The first five classifiers, viz. Pfam score, orthology inference, functional linkages, back to back orthology for protein interactants and subcellular location, are based on the manual annotation and prediction that we have employed, while the protein associations/interactions are based on the existence of known protein interactors. A scoring schema of 0 and 1 for the absence and presence of genes in an organism, similar to phylogenetic profiling, was employed to make the PIN sensitive. These scoring patterns are applicable either when annotation is transferred to another organism or when the protein is waiting to be annotated. However, because the use of six classifiers makes this method more stringent, the scores are averaged into two sections, viz. classifiers 1+2+3 and classifiers 4+5+6, with an overall TRS summing them. In employing these scoring patterns, we propose that assessing proteins with any three or more methods gives better results.
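The scoring rule described above can be sketched in a few lines. This is an illustrative sketch only: the classifier key names are invented labels, the example scores are made up, and the two section values are shown as plain subtotals, which is an assumption because the text does not specify how the sections are averaged.

```python
# Hypothetical labels for the six classifiers, in the order given in the text.
CLASSIFIERS = (
    "pfam",
    "orthology",
    "functional_linkage",
    "back_to_back_orthology",
    "subcellular_location",
    "known_interactors",
)


def total_reliability_score(scores):
    """Sum six binary classifier scores and apply the TRS > 3 rule."""
    for name in CLASSIFIERS:
        if scores[name] not in (0, 1):
            raise ValueError(f"{name} must be scored 0 or 1")
    # Section subtotals (assumption: plain sums of classifiers 1-3 and 4-6).
    section_1_3 = sum(scores[name] for name in CLASSIFIERS[:3])
    section_4_6 = sum(scores[name] for name in CLASSIFIERS[3:])
    trs = section_1_3 + section_4_6
    return {
        "section_1_3": section_1_3,
        "section_4_6": section_4_6,
        "trs": trs,
        "likely_partner": trs > 3,  # threshold stated in the text
    }


# Made-up scores for one hypothetical protein.
example = {
    "pfam": 1,
    "orthology": 1,
    "functional_linkage": 1,
    "back_to_back_orthology": 0,
    "subcellular_location": 1,
    "known_interactors": 1,
}
print(total_reliability_score(example))
```

A protein scoring exactly 3 would not pass, since the rule in the text requires the value to exceed 3.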

Evaluation of classification:

To circumvent the problem of protein annotation on the current dataset, we further evaluated the classification scores on a single node using the learning algorithms J48 (a version of the C4.5 decision tree), SMO (a version of the Support Vector Machine), Naive Bayes and Multilayer Perceptron (a version of a Neural Network), and 22 learning algorithms [17-18]. We ran the WEKA machine learning package [16] (Version 3.6.4) on the Cloud with a single node. The data-set containing the proteins with their six-point classification scores was then modeled using the learning algorithms with a ten-fold cross-validation scheme (Table 1, see supplementary material).
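To make the ten-fold cross-validation scheme concrete, the sketch below shows how sample indices can be partitioned into ten folds so that every sample is used for testing exactly once. It is a plain-Python illustration and not the WEKA implementation used in the study; the function name, the seed and the sample count of 20 are assumptions chosen for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative run with 20 samples (an assumed count for the example).
for fold_number, (train, test) in enumerate(k_fold_indices(20, k=10), start=1):
    print(fold_number, len(train), sorted(test))
```

In each round a model would be trained on the training indices and scored on the held-out fold, and the reported accuracy is the aggregate over all ten rounds.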

Statistical analysis using ANOVA and Kruskal-Wallis:

We further interpreted the six-point classification metrics statistically using MATLAB [23] (Version 7.11 on a Windows 7 desktop). Finally, we tested the matrix (proteins and six-point classifiers) using one-way analysis of variance (ANOVA) and the Kruskal-Wallis method.
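For readers unfamiliar with one-way ANOVA, the sketch below computes the F statistic from first principles. It is written in Python rather than MATLAB and assumes that each column of the protein-by-classifier matrix is treated as one group; the numbers are toy values, not data from the study. The p-value would additionally require the F distribution, which is omitted here.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy groups (made-up values), one list per column of the matrix.
toy_columns = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(toy_columns))
```

A large F relative to the degrees of freedom corresponds to a small p-value, which is how the very small ANOVA value reported in the Discussion should be read.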

Discussion

The classification scoring schema from the TRS was applied to the test dataset (Table 1, see supplementary material). The permutations and combinations of the dataset gave valid protein interaction networks. For example, when classifier 3 (functional linkages) and classifier 6 were analyzed, only 4 of the 20 proteins turned out to be Rosetta Stone sequences, making a putative and novel protein interaction map (Figure 1). We observe that the protein NP_057673.2, a mitochondrial protein, is known to interact with AAM10026.1, suggesting that they form a bona fide interaction pair. Further, we evaluated the classification scoring system using machine learning algorithms and statistical interpretation, which suggested that subcellular location (sorting signals) has the strongest impact among all classification features. These results demonstrate that the six-point classification scheme is capable of yielding a final TRS that can be used to interpret a PIN. From Table 1, the best accuracies for each data method are: ALL (MLP: 81.08), Split (MLP: 76.92), CFS (RandomTree: 67.57), PCA (MLP: 81.08) and SVM (MLP: 78.38), using the different approaches mentioned in columns 2-6. Features selected through InfoGain, ChiSquare and Probabilistic Significance and modeled by the SMO-PolyKernel algorithm yielded a similar accuracy of 78.38. The highest accuracy among all data methods and algorithms is ALL with MLP (81.08). This means that all six classifiers are required for accurate modeling of the TRS. Further, we derived the best data subsets from the six classification schemes by choosing the top score from all combinations using the Hill Climbing method [22]. Table 3 (see supplementary material) shows that the all-subset combination “0 1 2 3 4 5” by MLP (81.081) and the Hill-selected data subset “4 0” by MLP (78.378) give the best accuracies for these methods. This adds to the confidence that all six classification schemes together model the TRS better than the subsets do.
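Hill climbing over feature subsets can be sketched as a greedy forward search. The version below is one common variant and is offered only as an assumption about how such a search might look; the exact procedure of [22] is not described in the text. The `evaluate` callback is hypothetical (in practice it might return a cross-validated accuracy), and the toy scoring function is invented purely so that the example is easy to follow.

```python
def hill_climb_subset(features, evaluate):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # local optimum: no extension improves the score
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


def toy_score(subset):
    """Made-up integer score: features 0 and 4 help, every other feature costs 1."""
    gains = {0: 3, 4: 4}
    return sum(gains.get(f, -1) for f in subset)


# Feature indices 0-5 stand for the six classifiers.
print(hill_climb_subset(range(6), toy_score))
```

Because the search only ever accepts improving steps, it can stop at a local optimum, which is one reason to compare its result against the exhaustive all-subset evaluation.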
From the statistical interpretation, we used a matrix whose rows are proteins and whose columns are the different classifiers (Table 4, see supplementary material) and observed that the function returns the p-value after transforming with 37-2 degrees of freedom. The covariance test further revealed that the data are unbiased in the covariance matrix, forming a normal distribution (Table 5, see supplementary material). The statistical results indicate that the six-point classifiers are uniformly useful in modeling the TRS, as evident from the scores (Anova = 8.0645e-031; Kruskal-Wallis = 6.3123e-011; Figure 2 and Figure 3). This adds to our confidence that such a classification schema is valid and could be extended to protein annotation using cloud architecture if the dataset is larger. However, to overcome the impact of ranking all classifiers, we propose either not imposing a specific classifier or employing all classifiers. Disregarding any of the classifiers yields a lower score, for which we propose yet another classifier based on the total number of possible pairwise interactions and the degree of paralogy, because paralogy increases the number of possible interactions and thereby decreases the certainty of the prediction. The bottom line of the proposed classification is that the higher the TRS, the greater the chance that the protein interacts and the more reliable the functional linkage.

Figure 1

A more reliable protein-protein interaction map for protein accession NP_057673.2, based on the functional linkages.

Figure 2

Outcome of Anova on six-point classifiers using MATLAB. X-axis shows 1-6 classification schemes and TRS while Y-axis shows values corresponding to them.

Figure 3

Outcome of Kruskal-Wallis on six-point classifiers using MATLAB. X-axis shows 1-6 classification schemes and TRS while Y-axis shows values corresponding to them.

Conclusions

A six-point classification system is proposed to address the problem of hypothetical protein annotation with respect to interaction networks. In this work, we have illustrated our classification schema using the protein NP_057673.2, which is known to be localized to mitochondria. The functional linkages through Arabidopsis thaliana indicate that the protein is a Rosetta Stone sequence with AAM10026.1, and the fact that the protein has already been shown to interact with HGS (ZFYVE8) suggests a promiscuous interaction (Table 2, see supplementary material). We believe that, with the variety of statistical interpretations we have put forth, our in silico selection strategy can be used to select the most promising candidates from a PIN. Further, the cloud computing resources employed were useful in validating accuracies of TRS modeling across multiple algorithms and data sub-sets.

9 in total

1. Detecting protein function and protein-protein interactions from genome sequences.

Authors: E M Marcotte; M Pellegrini; H L Ng; D W Rice; T O Yeates; D Eisenberg
Journal: Science Date: 1999-07-30 Impact factor: 47.728

2. The Database of Interacting Proteins: 2004 update.

Authors: Lukasz Salwinski; Christopher S Miller; Adam J Smith; Frank K Pettit; James U Bowie; David Eisenberg
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. Iterative cluster analysis of protein interaction data.

Authors: Vicente Arnau; Sergio Mars; Ignacio Marín
Journal: Bioinformatics Date: 2004-09-16 Impact factor: 6.937

4. Locating proteins in the cell using TargetP, SignalP and related tools.

Authors: Olof Emanuelsson; Søren Brunak; Gunnar von Heijne; Henrik Nielsen
Journal: Nat Protoc Date: 2007 Impact factor: 13.491

5. Advanced technologies for studies on protein interactomes.

Authors: Hongtao Guan; Endre Kiss-Toth
Journal: Adv Biochem Eng Biotechnol Date: 2008 Impact factor: 2.635

6. In silico research in the era of cloud computing.

Authors: Joel T Dudley; Atul J Butte
Journal: Nat Biotechnol Date: 2010-11 Impact factor: 54.908

7. The Pfam protein families database.

Authors: Marco Punta; Penny C Coggill; Ruth Y Eberhardt; Jaina Mistry; John Tate; Chris Boursnell; Ningze Pang; Kristoffer Forslund; Goran Ceric; Jody Clements; Andreas Heger; Liisa Holm; Erik L L Sonnhammer; Sean R Eddy; Alex Bateman; Robert D Finn
Journal: Nucleic Acids Res Date: 2011-11-29 Impact factor: 16.971

8. The IntAct molecular interaction database in 2010.

Authors: B Aranda; P Achuthan; Y Alam-Faruque; I Armean; A Bridge; C Derow; M Feuermann; A T Ghanbarian; S Kerrien; J Khadake; J Kerssemakers; C Leroy; M Menden; M Michaut; L Montecchi-Palazzi; S N Neuhauser; S Orchard; V Perreau; B Roechert; K van Eijk; H Hermjakob
Journal: Nucleic Acids Res Date: 2009-10-22 Impact factor: 16.971

9. Hypo, hype and 'hyp' human proteins.

Authors: Prashanth Suravajhala
Journal: Bioinformation Date: 2007-07-10