Till Hoffmann1, Leto Peel2, Renaud Lambiotte3, Nick S Jones1,4.
Abstract
We develop a Bayesian hierarchical model to identify communities of time series. Fitting the model provides an end-to-end community detection algorithm that does not extract information as a sequence of point estimates but propagates uncertainties from the raw data to the community labels. Our approach naturally supports multiscale community detection and the selection of an optimal scale using model comparison. We study the properties of the algorithm using synthetic data and apply it to daily returns of constituents of the S&P100 index and climate data from U.S. cities.
Year: 2020 PMID: 32042892 PMCID: PMC6981088 DOI: 10.1126/sciadv.aav1478
Source DB: PubMed Journal: Sci Adv ISSN: 2375-2548 Impact factor: 14.136
Fig. 1. A Bayesian hierarchical model for time series with community structure.
Time series y are generated by a latent factor model with factor loadings A shown as dots in (A). The factor loadings are drawn from a Gaussian mixture model with mean μ and precision Λ. Generated time series are shown next to each factor loading for illustration. (B) Directed acyclic graph (DAG) representing the mixture model (A and all of its parents) and the probabilistic principal components analysis (PCA) (A, its siblings, and y). Observed nodes are shaded gray, and fixed hyperparameters are shown as black dots.
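The generative process described in this caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the isotropic community covariance (standing in for a general precision Λ), the noise level, and all default sizes are assumptions chosen for readability.

```python
import numpy as np


def simulate_communities(n=100, T=50, p=2, K=5, spread=3.0,
                         loading_sd=0.3, noise_sd=0.5, seed=0):
    """Draw time series from a latent factor model with clustered loadings."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=n)            # community label of each series
    mu = spread * rng.normal(size=(K, p))  # community means in loading space
    # Factor loadings A: a Gaussian mixture around the community means.
    # Assumption: isotropic covariance loading_sd**2 * I rather than a full
    # precision matrix per community.
    A = mu[z] + loading_sd * rng.normal(size=(n, p))
    x = rng.normal(size=(T, p))            # latent factors, one row per time step
    # Observed series, shape (T, n): linear mix of factors plus Gaussian noise.
    y = x @ A.T + noise_sd * rng.normal(size=(T, n))
    return y, A, z, mu


y, A, z, mu = simulate_communities()
```

Series in the same community share similar rows of `A`, so they respond to the latent factors in similar ways; that shared response is the structure the hierarchical model is built to recover.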
Fig. 2. The algorithm successfully identifies synthetic communities of time series.
(A) Entries of a synthetic factor loading matrix A as a scatter plot. (B) Inferred factor loading matrix together with the community centers as black crosses and the community covariances as ellipses; error bars correspond to 3 SDs of the posterior. (C) ELBO as a function of the number of latent factors and the prior precision. (D) Difference between the estimated number of communities and the true number of communities K. The model with the highest ELBO is marked with a black dot in (C) and (D); it recovers two latent factors and five communities.
Fig. 3. The prior precision Λ of the communities affects the number of detected communities.
(A) ELBO of the model as a function of the prior expectation of the precision matrices 〈Λ〉. The ELBO has two distinct peaks corresponding to the community assignments shown in panels (B) and (D), respectively. (C) Number of identified communities as a function of the prior precision; data points with arrows represent a lower bound on the number of inferred communities.
Fig. 4. Communities can be recovered even from very short time series.
(A) Median NMI between the true and inferred community assignments obtained using our hierarchical model for n = 100 time series and K = 5 groups as a function of the number of observations T and the community separation h. (B) Median difference between the number of inferred communities and the true number of communities. (C) Median NMI obtained using PCA followed by k-means clustering. (D) Difference in NMI between the two algorithms [see the main text for description of regions (i) to (iii)].
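For readers unfamiliar with the baseline in panel (C), the sketch below shows one way to chain PCA and k-means and to score the result with normalized mutual information (NMI). It is an illustrative stand-in, not the code behind the figure: the helper names, the farthest-point initialization, and the choice to normalize by the mean of the two label entropies are all assumptions (other NMI normalizations exist).

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in nats) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    prob = counts / counts.sum()
    return -np.sum(prob * np.log(prob))


def nmi(a, b):
    """Mutual information normalized by the mean of the two label entropies."""
    _, ai = np.unique(np.ravel(a), return_inverse=True)
    _, bi = np.unique(np.ravel(b), return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)  # contingency table of label pairs
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / outer[nz]))
    denom = 0.5 * (entropy(a) + entropy(b))
    return mi / denom if denom > 0 else 1.0


def pca_loadings(y, p):
    """Point-estimate loadings (n, p) from a (T, n) array via the SVD."""
    yc = y - y.mean(axis=0)
    _, s, vt = np.linalg.svd(yc, full_matrices=False)
    return (vt[:p] * s[:p, None]).T / np.sqrt(y.shape[0])


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations with farthest-point initialization."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

Given a hypothetical data array `y` of shape `(T, n)` and known labels `z`, the baseline score would be `nmi(kmeans(pca_loadings(y, p=2), k=5), z)`. Unlike the hierarchical model, this pipeline passes only point estimates of the loadings to the clustering step and needs the number of clusters as an input.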
Fig. 5. The algorithm identifies 11 communities of stocks in a 10-dimensional factor loading space.
(A) ELBO as a function of the number of latent factors of the model peaking at p = 10 factors. The ELBO of the best of an ensemble of 50 independently fitted models is shown in blue. (B) ELBO as a function of the prior precision. (C) Number of communities identified by the algorithm. The shaded region corresponds to the range of the number of detected communities in the model ensemble. (D) Factor loadings inferred from 1 year of daily log returns of constituents of the S&P100 index as a heat map. Each row corresponds to a stock, and each column corresponds to a factor. The last column of the loading matrix serves as a color key for different communities. (E) Two-dimensional embedding of the factor loading matrix using t-SNE together with cluster labels including credit card (CC) and fast-moving consumer goods (FMCG) companies.
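The input in panel (D) is described as daily log returns. As a brief reminder of what that transformation is, here is a minimal sketch; the price array is made up, and the optional standardization step is an assumption, not something the caption states.

```python
import numpy as np


def daily_log_returns(prices):
    """Log returns r_t = log(P_t / P_{t-1}) for a (days, stocks) price array."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices), axis=0)


def standardize(returns):
    """Assumption: rescale each series to zero mean and unit variance."""
    return (returns - returns.mean(axis=0)) / returns.std(axis=0)


# Made-up closing prices for two hypothetical stocks over three days.
prices = np.array([[100.0, 50.0], [110.0, 50.0], [132.0, 25.0]])
r = daily_log_returns(prices)
```

Each column of `r` is then one time series whose factor loadings would be inferred.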
Constituents of the S&P100 grouped by inferred community assignment.
| Mixed | Apple (AAPL), Abbott Laboratories (ABT), Accenture (ACN), Amazon (AMZN), American Express (AXP), Cisco (CSCO), Danaher (DHR), |
| Biotech | AbbVie (ABBV), Actavis (AGN), Amgen (AMGN), Biogen (BIIB), Bristol-Myers Squibb (BMY), Celgene (CELG), Costco (COST), CVS (CVS), |
| Financials | American International Group (AIG), Bank of America (BAC), BNY Mellon (BK), BlackRock (BLK), Citigroup (C), Capital One (COF), |
| Manufacturing and | Allstate (ALL), Barnes Group (B), Boeing (BA), Caterpillar (CAT), Comcast (CMCSA), Emerson Electric (EMR), Ford (F), FedEx (FDX), |
| Fast-moving consumer | Colgate-Palmolive (CL), Kraft Heinz (KHC), Coca Cola (KO), Mondelez International (MDLZ), Altria (MO), PepsiCo (PEP), Procter & |
| Oil and gas | ConocoPhillips (COP), Chevron (CVX), Halliburton (HAL), Kinder Morgan (KMI), Occidental Petroleum (OXY), Schlumberger (SLB), |
| Chemicals | DuPont (DD), Dow Chemical (DOW) |
| Utilities | Duke Energy (DUK), Nextera (NEE), Southern Company (SO) |
| Telecoms | Exelon (EXC), Simon Property Group (SPG), AT&T (T), Verizon (VZ) |
| Defense | Lockheed Martin (LMT), Raytheon (RTN) |
| Credit cards | MasterCard (MA), Visa (V) |
Fig. 6. Climate zones of U.S. cities.
(A) City locations colored according to the Köppen-Geiger climate classification system. (B) Inferred climate zones based on the monthly average high and low temperatures and precipitation amounts. We observe qualitative similarities between the two sets of climate zones, but a quantitative comparison reveals a relatively low correlation (NMI ≈ 0.4).
Root mean square error predicting held-out values of the real-world datasets.
The first column indicates the dataset. The second column displays the error of our method using the specific factor loadings of the time series. We include the third column to indicate the error of our method when we predict missing values according to the community means as a more direct comparison to the baseline methods of Köppen-Geiger and Fenn et al. N/A, not applicable.
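| Dataset | Our method (factor loadings) | Our method (community means) | Köppen-Geiger | Fenn et al. |
| --- | --- | --- | --- | --- |
| S&P100 | 0.731 | 0.750 | N/A | 0.803 |
| U.S. climate | 0.301 | 0.578 | 0.706 | 0.727 |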
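The two variants of the method compared in the table differ only in which loadings are used to reconstruct the held-out entries. The sketch below illustrates that difference using point estimates; the function names, the boolean `heldout` mask, and the use of point estimates in place of posterior distributions are assumptions made for illustration.

```python
import numpy as np


def heldout_rmse(y, y_hat, heldout):
    """Root mean square error over the entries flagged in the boolean mask."""
    diff = (np.asarray(y) - np.asarray(y_hat))[heldout]
    return float(np.sqrt(np.mean(diff ** 2)))


def predict_from_loadings(x, A):
    """Reconstruct series from factors x (T, p) and per-series loadings A (n, p)."""
    return x @ A.T


def predict_from_community_means(x, mu, z):
    """Replace each series' loadings by the mean mu[z[i]] of its community."""
    return x @ mu[z].T
```

Predicting from community means discards the series-specific part of the loadings, which is consistent with the third column showing a larger error than the second on both datasets.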