Maksym Byshkin, Alex Stivala, Antonietta Mira, Garry Robins, Alessandro Lomi.
Abstract
A major line of contemporary research on complex networks is based on the development of statistical models that specify the local motifs associated with macro-structural properties observed in actual networks. This statistical approach becomes increasingly problematic as network size increases. In the context of current research on efficient estimation of models for large network data sets, we propose a fast algorithm for maximum likelihood estimation (MLE) that affords a significant increase in the size of networks amenable to direct empirical analysis. The algorithm we propose in this paper relies on properties of Markov chains at equilibrium, and for this reason it is called equilibrium expectation (EE). We demonstrate the performance of the EE algorithm in the context of exponential random graph models (ERGMs), a family of statistical models commonly used in empirical research based on network data observed at a single period in time. Thus far, the lack of efficient computational strategies has limited the empirical scope of ERGMs to relatively small networks with a few thousand nodes. The approach we propose allows a dramatic increase in the size of networks that may be analyzed using ERGMs. This is illustrated in an analysis of several biological networks and one social network with 104,103 nodes.
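To make the Markov chain setting in the abstract concrete, the following is a minimal sketch of a generic Metropolis-Hastings sampler for an undirected ERGM, in which a graph x has probability proportional to exp(θ · z(x)). It is not the authors' EE algorithm and not the IFD sampler mentioned later in this record. The choice of statistics (edge and triangle counts), the function names, and the parameter values are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def stats(adj):
    """Return (edge count, triangle count) for an undirected graph.

    `adj` maps each node to the set of its neighbours.
    """
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    triangles = sum(
        1
        for i, j, k in combinations(sorted(adj), 3)
        if j in adj[i] and k in adj[i] and k in adj[j]
    )
    return edges, triangles


def change_stats(adj, i, j):
    """Change in (edges, triangles) if the dyad (i, j) were toggled."""
    sign = -1 if j in adj[i] else 1
    shared = len(adj[i] & adj[j])  # common neighbours close triangles
    return sign, sign * shared


def toggle(adj, i, j):
    """Add the tie (i, j) if absent, remove it if present."""
    if j in adj[i]:
        adj[i].discard(j)
        adj[j].discard(i)
    else:
        adj[i].add(j)
        adj[j].add(i)


def mh_step(adj, theta, rng):
    """One Metropolis-Hastings step; returns the change in statistics."""
    i, j = rng.sample(sorted(adj), 2)  # uniform random dyad (symmetric proposal)
    d_edges, d_tri = change_stats(adj, i, j)
    log_ratio = theta[0] * d_edges + theta[1] * d_tri
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        toggle(adj, i, j)
        return d_edges, d_tri
    return 0, 0


if __name__ == "__main__":
    rng = random.Random(0)
    adj = {v: set() for v in range(20)}  # hypothetical 20-node empty graph
    theta = (-2.0, 0.3)                  # illustrative parameter values
    z = list(stats(adj))
    for _ in range(5000):
        d_edges, d_tri = mh_step(adj, theta, rng)
        z[0] += d_edges
        z[1] += d_tri
    print("running statistics:", z, "recomputed:", stats(adj))
```

Each step only needs the change statistics of one dyad, so the statistics can be tracked incrementally instead of being recomputed from the full graph.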
Year: 2018 PMID: 30065311 PMCID: PMC6068132 DOI: 10.1038/s41598-018-29725-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Estimation of simulated networks by the suggested EE algorithm and the Method of Moments (MoM) [18]. Horizontal lines show the true value, and the estimates from each of the two methods for each of the four network sizes are shown as boxplots (generated with ggplot2 [65]). Each boxplot represents estimates for 120 networks, except that for MoM only 118 estimations converged when N = 5,000 and only 49 of the 120 estimations converged when N = 10,000.
Figure 2. Estimation times for the EE algorithm and Method of Moments (MoM), with fitted lines. Observed values for different network sizes are shown as circles and triangles. Both axes are on a log scale.
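A straight fitted line on log-log axes corresponds to a power law, time ≈ c · N^p, with the slope giving the exponent p. The sketch below shows one way such a fit can be computed by ordinary least squares on the logged values. The timings are synthetic placeholders generated from a known power law, not measurements from the paper, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(time) = log(c) + p * log(N); returns (c, p)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx                       # slope on the log-log plot
    c = math.exp(mean_y - p * mean_x)   # prefactor from the intercept
    return c, p


# Synthetic data (assumption): exact power law with exponent 1.5.
sizes = [1000, 2000, 5000, 10000]
times = [2e-4 * n ** 1.5 for n in sizes]

c, p = fit_power_law(sizes, times)
print(f"prefactor c = {c:.3g}, exponent p = {p:.3f}")
```

Comparing the fitted exponents of two methods is a compact way to summarize how their run times diverge as the network grows.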
Parameter estimates with 95% confidence interval (see Supplementary Information) for the Arabidopsis thaliana PPI network, estimated using the EE algorithm with the IFD sampler.
| AT | Mismatch E class | Mismatch kinase-phosphorylated | Edge (L) | Isolates | AS | Activity plant specific | Interaction plant specific |
|---|---|---|---|---|---|---|---|
| 1.276 | 1.304 | 0.192 | −14.940 | −7.116 | 2.320 | −0.104 | 0.456 |
| (1.24, 1.31) | (0.77, 1.83) | (0.08, 0.30) | (−14.97, −14.92) | (−7.59, −6.64) | (2.23, 2.41) | (−0.15, −0.06) | (0.21, 0.70) |
Estimation for this 2,160-node network took only 3 minutes on the Lenovo NeXtScale x86 system at Melbourne Bioinformatics.
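One common way to read the table above is to check whether each 95% confidence interval excludes zero. The sketch below does this with the reported values. It also back-computes an implied standard error under the assumption that the intervals are symmetric normal intervals (estimate ± 1.96 · SE); the actual interval construction is described in the Supplementary Information and may differ. The helper names are hypothetical.

```python
# Estimates and 95% confidence intervals copied from the table above.
table = {
    "AT": (1.276, (1.24, 1.31)),
    "Mismatch E class": (1.304, (0.77, 1.83)),
    "Mismatch kinase-phosphorylated": (0.192, (0.08, 0.30)),
    "Edge (L)": (-14.940, (-14.97, -14.92)),
    "Isolates": (-7.116, (-7.59, -6.64)),
    "AS": (2.320, (2.23, 2.41)),
    "Activity plant specific": (-0.104, (-0.15, -0.06)),
    "Interaction plant specific": (0.456, (0.21, 0.70)),
}


def excludes_zero(ci):
    """True if the interval lies entirely above or entirely below zero."""
    low, high = ci
    return low > 0 or high < 0


def implied_se(ci, z=1.96):
    """Standard error implied by a symmetric normal interval (assumption)."""
    low, high = ci
    return (high - low) / (2 * z)


for name, (estimate, ci) in table.items():
    print(
        f"{name}: estimate={estimate}, "
        f"excludes zero={excludes_zero(ci)}, "
        f"implied SE={implied_se(ci):.3f}"
    )
```

Because the reported bounds are rounded to two decimals, the implied standard errors are only approximate.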
Figure 3. Results of estimation of ERGM parameters for Livemocha networks with 104,103 nodes and 2,193,083 ties using the EE algorithm. The starting point is the result of the CD-1 algorithm. Producing these results took 12 hours on one core of the Intel E5-2650 machines available at https://intranet.ics.usi.ch/HPC.