
SVD-phy: improved prediction of protein functional associations through singular value decomposition of phylogenetic profiles.

Andrea Franceschini¹, Jianyi Lin², Christian von Mering¹, Lars Juhl Jensen³.

Abstract

A successful approach for predicting functional associations between non-homologous genes is to compare their phylogenetic distributions. We have devised a phylogenetic profiling algorithm, SVD-Phy, which uses truncated singular value decomposition to address the problem of uninformative profiles giving rise to false positive predictions. Benchmarking the algorithm against the KEGG pathway database, we found that it has substantially improved performance over existing phylogenetic profiling methods.

AVAILABILITY AND IMPLEMENTATION: The software is available under the open-source BSD license at https://bitbucket.org/andrea/svd-phy
CONTACT: lars.juhl.jensen@cpr.ku.dk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2015 PMID: 26614125 PMCID: PMC4896368 DOI: 10.1093/bioinformatics/btv696

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Phylogenetic profiling methods can predict functional interactions between genes that encode proteins from the same complex or pathway by comparing their phylogenetic distributions (Cheng and Perocchi, 2015; Date and Marcotte, 2003; Enault et al.; Li et al.; Pellegrini et al.; Tabach et al.). The underlying idea is that when two genes are functionally related, they should tend to be co-inherited, since the loss of either gene would be detrimental to that particular function. Here we present a new phylogenetic profiling method, SVD-Phy, which performs considerably better than existing methods for both bacteria and eukaryotes.

2 Phylogenetic profiling algorithm

Our algorithm infers associations among the proteins in a query organism based on their sequence similarity to sequences from a large number of other organisms. Specifically, we construct a matrix with the alignment bit scores of the best-scoring match for each query protein (rows) in each organism (columns), including the query organism itself. We obtain the bit scores from SIMAP (Arnold et al.) via the homology table of STRING v10 (Szklarczyk et al.), but bit scores from BLAST can also be used. If a query protein gives no hit with a bit score of at least 60 in a certain organism, the bit score is set to 0; using higher cutoffs reduced the performance (Supplementary Fig. S1). We convert this matrix to a normalized best hit matrix M by dividing each bit score by the largest score in the same row (typically the self-hit). Similar to earlier work on phylogenetic stratification (Psomopoulos et al.), we then perform truncated singular value decomposition (SVD) of M by calculating the factorization M = USV' and retaining only the first C columns of the resulting unitary matrix U. Different values of C were tested for each organism (Supplementary Figs S2–S5). Finally, we normalize each row of the truncated matrix to a unit vector and calculate all pairwise Euclidean distances between the rows. Other similarity metrics gave similar or worse performance (Supplementary Figs S6–S10). See supplementary material for further details.
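The steps described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released SVD-Phy software: the function name `svd_phy_distances`, its parameter names and the handling of all-zero rows are hypothetical choices made for the example, and the dense pairwise-distance step is only practical for small inputs.

```python
import numpy as np


def svd_phy_distances(bit_scores, n_components, min_bit_score=60.0):
    """Pairwise profile distances following the steps described in the text.

    bit_scores   : (n_proteins, n_organisms) array of best-hit bit scores.
    n_components : the number C of columns of U to retain.
    All names here are illustrative assumptions, not the authors' code.
    """
    scores = np.asarray(bit_scores, dtype=float)
    # Hits below the bit-score cutoff are treated as absent (set to 0).
    scores = np.where(scores >= min_bit_score, scores, 0.0)
    # Normalize each row by its largest score (typically the self-hit).
    row_max = scores.max(axis=1, keepdims=True)
    row_max[row_max == 0] = 1.0  # assumption: rows without hits stay all-zero
    m = scores / row_max
    # SVD M = U S V'; truncate by keeping the first C columns of U.
    u, _, _ = np.linalg.svd(m, full_matrices=False)
    u_c = u[:, :n_components]
    # Scale each row to unit length.
    norms = np.linalg.norm(u_c, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: avoid division by zero
    unit = u_c / norms
    # All pairwise Euclidean distances (dense, O(n^2 * C) memory).
    diff = unit[:, None, :] - unit[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Small distances indicate similar truncated profiles, so a ranked list of predicted associations would be obtained by sorting protein pairs by increasing distance.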

3 Benchmarking and comparison

We tested the algorithm on both prokaryotic and eukaryotic proteins and compared its performance against a simplified algorithm lacking the truncated SVD step and against two established algorithms (Date and Marcotte, 2003; Tabach et al. a,b). For all four algorithms, we generated ranked lists of predicted associations based on phylogenetic profiles across all 1793 prokaryotes and 238 eukaryotes in STRING v10 for prokaryotic and eukaryotic query proteins, respectively. We benchmarked the predicted associations against the KEGG pathway database (Kanehisa et al.). Given a ranked list of predicted functional associations, we evaluate the performance as follows. We first discard all pairs with bit score ≥60, as homologous proteins will trivially have similar phylogenetic profiles and are often involved in the same KEGG pathway. We next map all proteins to KEGG genes and discard pairs where one or both proteins cannot be placed on a KEGG map. The remaining pairs are considered true positives (TP) if the two proteins fall within the same KEGG map and false positives (FP) otherwise. To ensure that the results were not biased by certain atypical KEGG maps (Supplementary Table S1), we repeated all analyses excluding these maps. We also benchmarked the predicted associations for E.coli and H.sapiens using EcoCyc (Keseler et al.) and Reactome (Croft et al.), respectively.

In all benchmarks, SVD-Phy showed dramatically improved performance over the other three algorithms, including the simplified algorithm that differs only by leaving out the truncated SVD step (Fig. 1 and Supplementary Figs S6–S10). When benchmarked on Saccharomyces cerevisiae, SVD-Phy also outperformed the CLIME method (Li et al.) (Supplementary Fig. S11). For example, SVD-Phy predicts over 14-fold more associations at 75% precision than other methods on Escherichia coli, an organism on which all algorithms generally perform well.

When not restricting associations to proteins that can be mapped to KEGG, we predict 14 078 interactions in E.coli and 4090 in H.sapiens at 75% precision. This corresponds to average interaction degrees of 7.2 and 0.4, respectively.
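The TP/FP labelling procedure described in this section can be illustrated with a short Python sketch. The function `label_predictions` and the input structures (a set of homologous pairs and a protein-to-KEGG-map dictionary) are hypothetical and chosen only for the example; how the actual benchmark stores these data is not specified in the text.

```python
def label_predictions(ranked_pairs, homolog_pairs, kegg_maps):
    """Label ranked predictions as TP (True) or FP (False).

    ranked_pairs  : iterable of (protein_a, protein_b), best prediction first.
    homolog_pairs : set of frozensets, pairs whose bit score is >= 60.
    kegg_maps     : dict mapping a protein to its set of KEGG map identifiers.
    All names and data structures are assumptions made for illustration.
    """
    labels = []
    for a, b in ranked_pairs:
        # Discard homologous pairs: their profiles are trivially similar.
        if frozenset((a, b)) in homolog_pairs:
            continue
        maps_a = kegg_maps.get(a)
        maps_b = kegg_maps.get(b)
        # Discard pairs where either protein is not placed on any KEGG map.
        if not maps_a or not maps_b:
            continue
        # TP if the two proteins share at least one KEGG map, else FP.
        labels.append(bool(maps_a & maps_b))
    return labels
```

The returned list preserves the ranking order, so precision can be computed directly over any prefix or window of it.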

Fig. 1. Performance comparison of SVD-Phy and three other methods. We ran SVD-Phy (red), SVD-Phy without the truncated SVD step (gray), the Marcotte (Date and Marcotte, 2003) (black) and the Tabach (Tabach et al. a,b) (blue) algorithms. Graphs show the precision [TP/(TP+FP)], which we measured by scanning the sorted lists with a sliding window of 400 interactions.

The benchmarks also revealed that all algorithms performed considerably worse on eukaryotes than on prokaryotes. To test whether this was purely due to the smaller number of eukaryotic organisms used to build the phylogenetic profiles, we repeated the analyses using profiles based on only 238 prokaryotes (Supplementary Figs S5B and S7B). Although this did lead to an expected decrease in performance, all algorithms continued to perform notably better on prokaryotes than on eukaryotes.
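The sliding-window precision used for Fig. 1, TP/(TP+FP) over 400 consecutive entries of the sorted list, can be sketched as follows. The function name `sliding_window_precision` is hypothetical, and a step size of one interaction between successive windows is an assumption, since the text does not state it.

```python
import numpy as np


def sliding_window_precision(labels, window=400):
    """Precision TP/(TP+FP) for each window of consecutive ranked predictions.

    labels : sequence of booleans (True = TP, False = FP), best-ranked first.
    window : window size in interactions (400 in the Fig. 1 caption).
    Assumes the window advances one interaction at a time.
    """
    hits = np.asarray(labels, dtype=float)
    if hits.size < window:
        return np.empty(0)  # not enough predictions for a single window
    # A cumulative sum lets every window total be computed in O(n).
    csum = np.concatenate(([0.0], np.cumsum(hits)))
    return (csum[window:] - csum[:-window]) / window
```

For a list of n labelled predictions this returns n - window + 1 precision values, one per window position along the ranking.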

4 Discussion

We have shown that SVD-Phy has better predictive power than existing phylogenetic profiling algorithms. This improvement was achieved by performing truncated SVD on the profiles before calculating their similarities. An intuitive explanation of this transformation is that it collapses phylogenetic profiles that are shared by many proteins into fewer dimensions (principal components). This reduces noise (Psomopoulos et al.) and increases the diversity of the resulting profiles, which was recently shown to be beneficial (Škunca and Dessimoz, 2015). The benefit is that it prevents high similarity scores between uninformative profiles that can be trivially explained by simple vertical inheritance of genes along the taxonomic tree, or by broad similarities in the lifestyles of the organisms. This includes highly similar profiles caused by the inclusion of multiple strains of a species, clade-specific proteins and enzymes that have been lost in most parasites (because they instead import metabolites from their hosts).

We fully integrated our protein–protein interaction predictions with the STRING database (Szklarczyk et al.) (Supplementary Figs S12–S13). The data can be browsed online and are freely available for download in tab-delimited format. SVD-Phy runs quickly: its run time is on average about 10–20 min per organism on a normal workstation. This allows us to execute the algorithm for all 2031 species in the STRING database, and makes it possible for others to use the algorithm within their web resources.

In a recent study, Tabach et al. successfully used their method to shed light on several disease pathways. Phylogenetic profiling algorithms have also been applied to analyze non-coding elements (NCEs), such as small RNAs (Ott et al.; Tabach et al. a), showing that phylogenetic profiling is indeed an important technique that can be used to shed light even on NCE functions and interactions (Dimitrieva and Bucher, 2012).

Funding

This work was supported by the Swiss Institute of Bioinformatics and the Novo Nordisk Foundation [NNF14CC0001].

Conflict of Interest: none declared.

References

1. Annotation of bacterial genomes using improved phylogenomic profiles.

Authors: F Enault; K Suhre; C Abergel; O Poirot; J-M Claverie
Journal: Bioinformatics Date: 2003 Impact factor: 6.937

2. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.

Authors: M Pellegrini; E M Marcotte; M J Thompson; D Eisenberg; T O Yeates
Journal: Proc Natl Acad Sci U S A Date: 1999-04-13 Impact factor: 11.205

3. Expansion of biological pathways based on evolutionary inference.

Authors: Yang Li; Sarah E Calvo; Roee Gutman; Jun S Liu; Vamsi K Mootha
Journal: Cell Date: 2014-07-03 Impact factor: 41.582

4. EcoCyc: fusing model organism databases with systems biology.

Authors: Ingrid M Keseler; Amanda Mackie; Martin Peralta-Gil; Alberto Santos-Zavaleta; Socorro Gama-Castro; César Bonavides-Martínez; Carol Fulcher; Araceli M Huerta; Anamika Kothari; Markus Krummenacker; Mario Latendresse; Luis Muñiz-Rascado; Quang Ong; Suzanne Paley; Imke Schröder; Alexander G Shearer; Pallavi Subhraveti; Mike Travers; Deepika Weerasinghe; Verena Weiss; Julio Collado-Vides; Robert P Gunsalus; Ian Paulsen; Peter D Karp
Journal: Nucleic Acids Res Date: 2012-11-09 Impact factor: 16.971

5. STRING v10: protein-protein interaction networks, integrated over the tree of life.

Authors: Damian Szklarczyk; Andrea Franceschini; Stefan Wyder; Kristoffer Forslund; Davide Heller; Jaime Huerta-Cepas; Milan Simonovic; Alexander Roth; Alberto Santos; Kalliopi P Tsafou; Michael Kuhn; Peer Bork; Lars J Jensen; Christian von Mering
Journal: Nucleic Acids Res Date: 2014-10-28 Impact factor: 16.971

6. ProtPhylo: identification of protein-phenotype and protein-protein functional associations via phylogenetic profiling.

Authors: Yiming Cheng; Fabiana Perocchi
Journal: Nucleic Acids Res Date: 2015-05-08 Impact factor: 16.971

7. Phylogenetic profiling: how much input data is enough?

Authors: Nives Škunca; Christophe Dessimoz
Journal: PLoS One Date: 2015-02-13 Impact factor: 3.240

8. Detection of genomic idiosyncrasies using fuzzy phylogenetic profiles.

Authors: Fotis E Psomopoulos; Pericles A Mitkas; Christos A Ouzounis
Journal: PLoS One Date: 2013-01-14 Impact factor: 3.240

9. Identification of small RNA pathway genes using patterns of phylogenetic conservation and divergence.

Authors: Yuval Tabach; Allison C Billi; Gabriel D Hayes; Martin A Newman; Or Zuk; Harrison Gabel; Ravi Kamath; Keren Yacoby; Brad Chapman; Susana M Garcia; Mark Borowsky; John K Kim; Gary Ruvkun
Journal: Nature Date: 2012-12-23 Impact factor: 49.962

10. The Reactome pathway knowledgebase.

Authors: David Croft; Antonio Fabregat Mundo; Robin Haw; Marija Milacic; Joel Weiser; Guanming Wu; Michael Caudy; Phani Garapati; Marc Gillespie; Maulik R Kamdar; Bijay Jassal; Steven Jupe; Lisa Matthews; Bruce May; Stanislav Palatnik; Karen Rothfels; Veronica Shamovsky; Heeyeon Song; Mark Williams; Ewan Birney; Henning Hermjakob; Lincoln Stein; Peter D'Eustachio
Journal: Nucleic Acids Res Date: 2013-11-15 Impact factor: 16.971
