Waqar Ali, Anatol E Wegner, Robert E Gaunt, Charlotte M Deane, Gesine Reinert.
Abstract
Networks are routinely used to represent large data sets, making the comparison of networks a tantalizing research question in many areas. Techniques for such analysis vary from simply comparing network summary statistics to sophisticated but computationally expensive alignment-based approaches. Most existing methods either do not generalize well to different types of networks or do not provide a quantitative similarity score between networks. In contrast, alignment-free topology-based network similarity scores empower us to analyse large sets of networks containing different types and sizes of data. Netdis is such a score that defines network similarity through the counts of small sub-graphs in the local neighbourhood of all nodes. Here, we introduce a sub-sampling procedure based on neighbourhoods which links naturally with the framework of network comparisons through local neighbourhood comparisons. Our theoretical arguments justify basing the Netdis statistic on a sample of similar-sized neighbourhoods. Our tests on empirical and synthetic datasets indicate that often only 10% of the neighbourhoods of a network suffice for optimal performance, leading to a drastic reduction in computational requirements. The sampling procedure is applicable even when only a small sample of the network is known, and thus provides a novel tool for network comparison of very large and potentially incomplete datasets.
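The neighbourhood sub-sampling idea can be illustrated with a minimal sketch. It is not the Netdis implementation: the function names, the ego-network radius of 2, the uniform sampling of centre nodes, and the use of triangle counts as a stand-in for the small sub-graph counts are all assumptions made for illustration. The graph is assumed to be a dictionary mapping each node to a set of neighbours.

```python
import random

def ego_network(adj, center, radius=2):
    """Return the set of nodes within `radius` steps of `center` (breadth-first)."""
    seen = {center}
    frontier = {center}
    for _ in range(radius):
        # Expand one step outwards, keeping only nodes not seen before.
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen

def count_triangles(adj, nodes):
    """Count triangles in the subgraph induced by `nodes` (each counted once)."""
    total = 0
    for u in nodes:
        nbrs = adj[u] & nodes
        for v in nbrs:
            if v > u:
                # Only count ordered triples u < v < w to avoid double counting.
                total += sum(1 for w in (adj[v] & nbrs) if w > v)
    return total

def sampled_triangle_counts(adj, fraction=0.1, radius=2, seed=0):
    """Triangle counts for the ego-networks of a random sample of centre nodes.

    `fraction`, `radius`, and `seed` are illustrative parameters (assumptions),
    not values prescribed by the paper.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    k = max(1, round(fraction * len(nodes)))
    centers = rng.sample(nodes, k)
    return {c: count_triangles(adj, ego_network(adj, c, radius)) for c in centers}
```

With `fraction=0.1`, only about a tenth of the ego-networks are extracted and counted, which is where the computational saving described in the abstract would come from in a scheme of this kind.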
Year: 2016 PMID: 27380992 PMCID: PMC4933923 DOI: 10.1038/srep28955
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Effect of ego-network sampling on Netdis performance measured by nearest neighbour scores 1-NN and k-NN: (A) Simulated networks from different random graph models with model parameters matching the DIP-core yeast network with 2,160 nodes, (B) Simulated networks from different random graph models with 10,000 nodes and average degree ≈20, (C) Onnela et al. data containing 151 networks of sizes ranging from 30 to 11,586 nodes and (D) Protein interaction networks of Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fly), Homo sapiens (human), Escherichia coli and Helicobacter pylori. The dashed red lines correspond to the average nearest neighbour scores over a sample of 50 random distance matrices. The performance of the Netdis statistic starts to deteriorate strongly only when fewer than 10% of the ego-networks of each network are sampled. Even when only 1% of the ego-networks are sampled, the statistic performs better than the random baseline.
Figure 2. Netdis performance under sub-sampling measured by nearest neighbour scores 1-NN and k-NN for large simulated network data sets with average degree ≈20, each containing 5 realizations of 5 different random graph models: (A) Networks with 25,000 nodes, (B) Networks with 50,000 nodes and (C) Networks with 100,000 nodes. The dashed red lines correspond to the average nearest neighbour scores over a sample of 50 random distance matrices. Note that the x-axes are scaled logarithmically. While the performance of Netdis deteriorates slightly with smaller sample size, the deterioration is sublinear in the number of nodes. The k-NN score is only slightly smaller than the 1-NN score. For networks with 100,000 nodes, even a sample of only 10 ego-networks contains a strong signal.
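The captions above report nearest neighbour scores computed from distance matrices without defining them here. As a rough sketch, assuming the 1-NN score is the fraction of networks whose closest other network (by pairwise distance) belongs to the same class, it could be computed as follows. The function name `one_nn_score` and this definition are assumptions for illustration, not taken from the record.

```python
def one_nn_score(dist, labels):
    """Fraction of items whose nearest other item shares their label.

    `dist` is a square matrix (list of lists) of pairwise distances and
    `labels` gives the class (e.g. random graph model) of each network.
    This definition is an assumption made for illustration.
    """
    n = len(labels)
    hits = 0
    for i in range(n):
        # Nearest neighbour of i, excluding i itself.
        nearest = min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        hits += labels[nearest] == labels[i]
    return hits / n
```

Under this reading, a score near 1 means the distance matrix groups networks of the same type together, while a matrix of random distances would score close to the chance level for the given class sizes.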