Sergio Picart-Armada1,2, Wesley K Thompson3,4, Alfonso Buil3, Alexandre Perera-Lluna1,2. 1. B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain. 2. Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, Esplugues de Llobregat, Barcelona, 08950, Spain. 3. Mental Health Center Sct. Hans, 4000 Roskilde, Denmark. 4. Department of Family Medicine and Public Health, University of California, San Diego, La Jolla, CA, USA.
Abstract
MOTIVATION: Network diffusion and label propagation are fundamental tools in computational biology, with applications such as gene-disease association, protein function prediction and module discovery. More recently, several publications have introduced a permutation analysis after the propagation process, due to concerns that network topology can bias diffusion scores. This raises the question of the statistical properties of such diffusion processes, and of the presence of bias, in each of their applications. In this work, we characterized some common null models behind the permutation analysis and the statistical properties of the diffusion scores. We benchmarked seven diffusion scores on three case studies: synthetic signals on a yeast interactome, simulated differential gene expression on a protein-protein interaction network and prospective gene set prediction on another interaction network. For clarity, all the datasets were based on binary labels, but we also present theoretical results for quantitative labels. RESULTS: Diffusion scores starting from binary labels were affected by the label codification and exhibited a problem-dependent topological bias that could be removed by statistical normalization. Parametric and non-parametric normalization addressed both points by being codification-independent and by equalizing the bias. We identified and quantified two sources of bias (mean value and variance) that yielded performance differences when normalizing the scores. We provided closed formulae for both and showed how the null covariance is related to the spectral properties of the graph. Although none of the proposed scores systematically outperformed the others, normalization was preferred when the sought positive labels were not aligned with the bias. We conclude that the decision on bias removal should be problem- and data-driven, i.e. based on a quantitative analysis of the bias and its relation to the positive entities.
AVAILABILITY: The code is publicly available at https://github.com/b2slab/diffuBench and the data underlying this article are available at https://github.com/b2slab/retroData. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
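The parametric normalization described above can be illustrated numerically. The sketch below (an assumption-laden toy, not the paper's diffuStats/diffuBench code) diffuses a binary label vector through a regularized Laplacian kernel on a small ring graph, then standardizes the raw scores with the closed-form null mean and variance implied by uniformly permuting the labels, cross-checking against a Monte Carlo permutation analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: a ring of n nodes (assumption; the paper uses real interactomes)
n = 20
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A        # unnormalized graph Laplacian
K = np.linalg.inv(np.eye(n) + L)      # regularized Laplacian kernel

# Binary labels: k positives among n nodes
k = 4
y = np.zeros(n)
y[:k] = 1.0
f = K @ y                             # raw diffusion scores

# Closed-form null moments under uniform label permutation:
# E[y_j] = k/n; Cov(y_j, y_l) = p(1-p) on the diagonal, -p(1-p)/(n-1) off it.
p = k / n
row_sum = K.sum(axis=1)
mu = p * row_sum                      # null mean of each score
c = p * (1 - p) * n / (n - 1)
var = c * ((K ** 2).sum(axis=1) - row_sum ** 2 / n)   # null variance
z_closed = (f - mu) / np.sqrt(var)    # parametric z-score

# Monte Carlo check: permute the labels and re-diffuse
perms = np.array([K @ rng.permutation(y) for _ in range(20000)])
z_mc = (f - perms.mean(axis=0)) / perms.std(axis=0)
```

On this toy graph, `z_closed` and `z_mc` agree up to Monte Carlo error, showing how the closed formulae replace the cost of explicit permutations; the kernel choice and graph are placeholders for the scores and networks benchmarked in the paper.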