RANKPROP: a web server for protein remote homology detection.

Iain Melvin, Jason Weston, Christina Leslie, William Stafford Noble.

Abstract

We present a large-scale implementation of the Rankprop protein homology ranking algorithm in the form of an openly accessible web server. We use the NRDB40 PSI-BLAST all-versus-all protein similarity network of 1.1 million proteins to construct the graph for the Rankprop algorithm, whereas previously, results were only reported for a database of 108 000 proteins. We also describe two algorithmic improvements to the original algorithm, propagation from multiple homologs of the query and better normalization of ranking scores, that lead to higher accuracy and to scores with a probabilistic interpretation.
Availability: The Rankprop web server and source code are available at http://rankprop.gs.washington.edu

Year:  2008        PMID: 18990723      PMCID: PMC2638939          DOI: 10.1093/bioinformatics/btn567

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Rankprop (Weston et al., 2004) is a network-based inference algorithm for identifying subtle protein sequence similarities, corresponding to remote homology relationships or to structural similarities. The algorithm operates on a protein similarity network, a graph in which each node is a protein and each weighted edge connecting two proteins indicates their similarity. Such a network can be built using existing tools, such as PSI-BLAST (Altschul et al., 1997). The key idea of the Rankprop algorithm is to extract global information from a protein similarity network by propagating outward from a user-specified query protein. Effectively, the algorithm sums over all possible paths from the query to each target protein. Thus, after propagation, the resulting activation score for each node includes global information about that protein's relationship to the query. Ranking proteins by these scores is analogous to performing a database search using a tool such as PSI-BLAST, except that the ranking induced by Rankprop reflects the global topology of the protein similarity network.
In Weston et al. (2004), PSI-BLAST is used to measure sequence similarity, and the unnormalized weight for the edge from node i to node j is $W_{ij} = \exp(-S_j(i)/\sigma)$, where $S_j(i)$ is the PSI-BLAST E-value assigned to protein i given query j, and the parameter σ is a positive constant. Edges are only included in the network for E-values smaller than a fixed threshold. We obtain a stochastic connectivity matrix M for the protein similarity network by row-normalizing the edge weights W to obtain transition probabilities: $M_{ij} = W_{ij} / \sum_k W_{ik}$.
Given such a network and a query sequence q, the Rankprop algorithm is simple to describe. First, all nodes are assigned initial activation scores that reflect each target protein's similarity to q. Like the edge weights, these scores are computed from PSI-BLAST E-values using the same equation.
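The edge-weight and row-normalization steps can be illustrated with a minimal Python sketch. The function names, the dictionary-based graph representation and the numeric values of σ and the E-value threshold in the usage note are assumptions chosen for illustration; they are not taken from the Rankprop source code.

```python
import math


def edge_weights(evalues, sigma, threshold):
    """Compute unnormalized weights W_ij = exp(-S_j(i) / sigma).

    evalues maps (i, j) to the PSI-BLAST E-value S_j(i) of protein i given
    query j. Edges whose E-value is not below the threshold are dropped.
    """
    return {
        (i, j): math.exp(-e / sigma)
        for (i, j), e in evalues.items()
        if e < threshold
    }


def row_normalize(weights):
    """Turn weights into transition probabilities M_ij = W_ij / sum_k W_ik."""
    row_sums = {}
    for (i, _), w in weights.items():
        row_sums[i] = row_sums.get(i, 0.0) + w
    return {(i, j): w / row_sums[i] for (i, j), w in weights.items()}
```

For example, `row_normalize(edge_weights(evalues, sigma=100.0, threshold=10.0))` returns a sparse matrix in which the outgoing weights of every node sum to one; both numeric arguments here are placeholders rather than the values used by the server.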
At each iteration of the algorithm, the activation score at a given node is replaced by the weighted sum of the scores from all of its incoming edges. The update rule includes a diffusion constant α that controls the rate of diffusion through the network. Formally, we define the initial activation scores as $P_i(0) = \exp(-S_q(i)/\sigma)$. Viewing $P(t)$ as the column vector of activation levels at iteration t, the algorithm is given by $P_i(t+1) = \alpha\,[M P(t)]_i + P_i(0)$ if $i \neq q$ and $P_i(t+1) = 1$ otherwise, where $\alpha \in (0,1)$. One can show that this iterative procedure converges to a fixed point, which in practice happens in a small number of iterations. The output of the Rankprop algorithm is a ranking of the nodes in the network according to their final activation values. Proteins that receive a high activation score are linked to the query via many strongly weighted paths and vice versa.
A multidomain query protein will produce strong matches to any target protein that contains one or more of the query domains. A single-domain query A may connect through a multidomain protein AB to infer a false relationship with B. However, previous work (Weston et al., 2004) has found that as long as the query sequence is connected to many other proteins, the true homologs will be mutually reinforcing and receive a higher rank.
In this work, we extend the original Rankprop algorithm in two ways: (1) improving accuracy by propagating simultaneously from proteins that are very closely related to the query, and (2) improving the interpretability of the scores produced by Rankprop by empirically mapping them to probabilities. The mapped score can be interpreted as the probability that the target protein is a member of the same SCOP superfamily as the query. We also announce the availability of a free web server that allows individual queries against a protein similarity network derived from NRDB40, comprising 1.1 million targets.
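The propagation step can be sketched as a short fixed-iteration loop. This is an illustrative reading of the update rule under stated assumptions, not the server's implementation: the function name `rankprop`, the sparse dictionary representation of M and the fixed iteration count are all hypothetical, and `p0` is assumed to contain an entry for every node that appears in M.

```python
def rankprop(M, p0, query, alpha, n_iter=20):
    """Iterate P_i(t+1) = alpha * [M P(t)]_i + P_i(0), holding the query at 1.

    M     : dict mapping (i, j) to the transition probability M_ij
    p0    : dict mapping every node to its initial activation P_i(0)
    query : identifier of the query node q
    alpha : diffusion constant in (0, 1)
    """
    p = dict(p0)
    p[query] = 1.0
    for _ in range(n_iter):
        # Every node starts the new iteration from its initial activation.
        new_p = {node: p0[node] for node in p0}
        # Add the diffusion term: [M P]_i = sum_j M_ij * P_j.
        for (i, j), m in M.items():
            new_p[i] += alpha * m * p[j]
        # The query node is clamped to an activation of 1.
        new_p[query] = 1.0
        p = new_p
    ranking = sorted(p, key=p.get, reverse=True)
    return ranking, p
```

A production version would more likely stop when the change between successive iterations falls below a tolerance; a fixed number of iterations is used here only to keep the sketch short.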

2 METHODS

The Rankprop server uses the PSI-BLAST all-versus-all similarity matrix for NRDB40 provided by the PairsDB website (Heger et al., 2008). NRDB40 is a subset of the non-redundant sequence database, filtered so that no pair of sequences exhibits >40% sequence identity.
We generalize the Rankprop algorithm to accept a set Q of query proteins, rather than a single query protein. To use this extra information we perform propagation as usual, but we constrain the activation scores of all the query proteins such that they are highly ranked. In particular, we choose the set Q to be all the proteins that match the initial query q with a PSI-BLAST E-value < 0.001. We then constrain our algorithm to have $P_j = 1 - S_q(j)$ for all $j \in Q$. This modification is useful because, instead of propagating from a single query source node in the graph, we can propagate from several source nodes that all belong to the family or superfamily that we are searching for.
The original Rankprop algorithm outputs scalar values that are not directly interpretable. In the new version of the algorithm, we map each Rankprop score to an estimate of the probability that the corresponding query and target proteins belong to the same structural superfamily. We employ the SCOP database (Murzin et al., 1995) to compute a histogram of empirical frequencies of the activation levels $P_i$ for each protein i. More specifically, we choose bin centers $v_k$ and compute the following quantities: $n_k$, the number of times $P_i$ falls into bin $v_k$, and $s_k$, the number of times that this occurs and i is in the same superfamily as the query. We are interested in the value $s_k / n_k$, which can be interpreted as the probability, for each activation value bin, of the target being in the same superfamily as the query. We choose the bin centers v = (0, 0.01, 0.02, …, 0.2, 0.3, …, 1), and we enforce monotonicity in the final output by setting $s_k/n_k = s_{k-1}/n_{k-1}$ if $s_k/n_k < s_{k-1}/n_{k-1}$.
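The two extensions can be sketched in Python as follows. All function names are hypothetical, and several details are assumptions made only for illustration: activations are assigned to the nearest bin center, an empty bin inherits the estimate of the preceding bin, and monotonicity is enforced as a running maximum over bins in order of increasing activation.

```python
def select_query_set(evalues_to_query, cutoff=0.001):
    """Return {j: 1 - S_q(j)} for every protein j with E-value below cutoff."""
    return {j: 1.0 - e for j, e in evalues_to_query.items() if e < cutoff}


def bin_centers():
    """Bin centers 0, 0.01, ..., 0.2 followed by 0.3, 0.4, ..., 1."""
    fine = [round(0.01 * k, 2) for k in range(21)]
    coarse = [round(0.1 * k, 1) for k in range(3, 11)]
    return fine + coarse


def nearest_bin(centers, value):
    """Index of the bin center closest to value (assumed binning rule)."""
    return min(range(len(centers)), key=lambda k: abs(centers[k] - value))


def calibrate(activations, same_superfamily, centers):
    """Estimate s_k / n_k per bin, then make the estimates non-decreasing."""
    n = [0] * len(centers)
    s = [0] * len(centers)
    for p, is_member in zip(activations, same_superfamily):
        k = nearest_bin(centers, p)
        n[k] += 1
        if is_member:
            s[k] += 1
    probs = []
    prev = 0.0
    for k in range(len(centers)):
        estimate = s[k] / n[k] if n[k] else prev  # empty bin: reuse previous
        estimate = max(estimate, prev)            # running maximum
        probs.append(estimate)
        prev = estimate
    return probs


def to_probability(score, centers, probs):
    """Map a Rankprop activation score to its calibrated probability."""
    return probs[nearest_bin(centers, score)]
```

In use, `calibrate` would be fitted once on activation scores for proteins with known SCOP superfamily labels, and `to_probability` would then be applied to the scores of each new query.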

3 RESULTS

Table 1 compares our large-scale Rankprop results with PSI-BLAST (using NRDB40 and the same blastpgp parameters as PairsDB: −j 10 −e 1 −h 0.001 −b 10000 −v 10000) and with the previously published version of Rankprop (using the SWISSPROT database, 108 000 proteins). Rankprop NRDB40 is a straightforward scaling up of the previous Rankprop algorithm to NRDB40. Rankprop+homologs NRDB40 additionally employs the extensions described in Section 2. Accuracy is measured following the methodology given in Weston et al. (2004): SCOP version 1.59 is split into train and test portions, and hyperparameters are chosen by using the training set. Then, each test protein is treated as a query, and the quality of a method's protein ranking is measured by using the area under the receiver operating characteristic (ROC) curve, up to the first (ROC1) or 50th (ROC50) false positive. We report results as average ROC1 and ROC50 scores across all 3083 test proteins. Using a larger network yields improvements across all four performance metrics, and propagating from multiple queries improves performance further on every metric except superfamily ROC50, where the difference is not significant. A Wilcoxon signed rank test, corrected for multiple tests, shows that all differences in Table 1 are significant at the 0.01 level, except for the three pairs of methods marked with asterisks.
Table 1. Ranking accuracy

Method                       Family ROC1   Family ROC50   Superfamily ROC1   Superfamily ROC50
PSI-BLAST                    0.833*        0.851          0.609*             0.628
Rankprop SWISSPROT           0.816*        0.906          0.592*             0.725
Rankprop NRDB40              0.872         0.923          0.696              0.779*
Rankprop+homologs NRDB40     0.884         0.928          0.710              0.775*

*Asterisks indicate pairs of values that are not different at P < 0.01 (Wilcoxon signed rank).
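The ROC1 and ROC50 scores reported in Table 1 are areas under the ROC curve truncated at the n-th false positive. A minimal sketch of that computation for a single query follows; the function name `roc_n` is hypothetical, and the treatment of rankings that contain fewer than n false positives (the missing false positives are counted as if they appeared after every true positive) is an assumption rather than a detail stated in the text.

```python
def roc_n(ranked_labels, n):
    """Normalized area under the ROC curve up to the n-th false positive.

    ranked_labels: booleans ordered from best to worst rank, where True
    marks a true homolog of the query and False a non-homolog.
    """
    total_pos = sum(ranked_labels)
    if total_pos == 0 or n <= 0:
        raise ValueError("need at least one true positive and n >= 1")
    true_pos = 0
    false_pos = 0
    area = 0
    for is_homolog in ranked_labels:
        if is_homolog:
            true_pos += 1
        else:
            false_pos += 1
            # Each false positive contributes the true positives ranked above it.
            area += true_pos
            if false_pos == n:
                break
    # Assumed padding when the list holds fewer than n false positives.
    area += (n - false_pos) * true_pos
    return area / (n * total_pos)
```

Averaging `roc_n(labels, 1)` and `roc_n(labels, 50)` over all test queries would give per-method summary scores of the kind shown in the table.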

We also evaluate the performance of Rankprop using a combined ROC curve across all the queries in our test set, following the protocol of Altschul et al. (1997). Figure 1 shows the combined ROC curves for Rankprop NRDB40 (ranked by activation value), Rankprop+homologs NRDB40 (ranked by probability) and PSI-BLAST (ranked by E-value). In contrast to average per-query ROC scores, the combined ROC curve requires that scores are well calibrated from one query to the next. The figure shows that the mapping of Rankprop scores to probabilities significantly improves the calibration, yielding better performance than PSI-BLAST for all but the first few false positives (across 3083 queries).
Fig. 1. Combined ROC curve across multiple queries. For each method, search results from 3083 queries were sorted into a single list. The figure plots, for varying thresholds in the ranked list, the fraction of all known homologs (SCOP superfamily members) that fall above the threshold, as a function of the number of non-superfamily members above the threshold.

The Rankprop web server first looks for an exact match of the query sequence against the sequences in NRDB40. If such a match is found, the server retrieves the precomputed PSI-BLAST results from the database and then applies the Rankprop algorithm. In this case the server takes around 90 s to process a query. If the sequence is not found in the database, the server runs PSI-BLAST first, which on average takes an additional 15 min.

Funding

National Institutes of Health award (R01 GM074257).
Conflict of Interest: none declared.
REFERENCES

1.  Protein ranking: from local to global structure in the protein similarity network.

Authors:  Jason Weston; Andre Elisseeff; Dengyong Zhou; Christina S Leslie; William Stafford Noble
Journal:  Proc Natl Acad Sci U S A       Date:  2004-04-15       Impact factor: 11.205

2.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors:  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal:  Nucleic Acids Res       Date:  1997-09-01       Impact factor: 16.971

3.  SCOP: a structural classification of proteins database for the investigation of sequences and structures.

Authors:  A G Murzin; S E Brenner; T Hubbard; C Chothia
Journal:  J Mol Biol       Date:  1995-04-07       Impact factor: 5.469

4.  PairsDB atlas of protein sequence space.

Authors:  Andreas Heger; Eija Korpelainen; Taavi Hupponen; Kimmo Mattila; Vesa Ollikainen; Liisa Holm
Journal:  Nucleic Acids Res       Date:  2007-11-05       Impact factor: 16.971
