Mark He1, Joseph Glasser1, Nathaniel Pritchard2, Shankar Bhamidi1, Nikhil Kaza3. 1. Statistics & Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States of America. 2. Statistics, University of Wisconsin at Madison, Madison, WI, United States of America. 3. City & Regional Planning, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States of America.
Abstract
We develop a method to identify statistically significant communities in a weighted network with a high proportion of self-looping weights. We use this method to find overlapping agglomerations of U.S. counties by representing inter-county commuting as a weighted network. We identify three types of communities; non-nodal, nodal and monads, which correspond to different types of regions. The results suggest that traditional regional delineations that rely on ad hoc thresholds do not account for important and pervasive connections that extend far beyond expected metropolitan boundaries or megaregions.
We develop a method to identify statistically significant communities in a weighted network with a high proportion of self-looping weights. We use this method to find overlapping agglomerations of U.S. counties by representing inter-county commuting as a weighted network. We identify three types of communities; non-nodal, nodal and monads, which correspond to different types of regions. The results suggest that traditional regional delineations that rely on ad hoc thresholds do not account for important and pervasive connections that extend far beyond expected metropolitan boundaries or megaregions.
Geographic regions map to social, cultural, and economic structures that enable us to make sense of the world. Demarcation of these regions allows institutional responses to shared problems by creating territorial administrations. These regions are useful at different scales and are created for varying purposes (e.g. cities, places, watersheds, economic regions) [1, 2, 3]. In the United States, metropolitan regions are conceived as collections of counties or equivalent areas (sub-state political units) and are used for different statistical, governance and planning purposes. Yet recent work suggests that these metropolitan regions have coalesced and that megaregions spanning multiple states to better project and plan for future growth [4]. At the same time, many urbanized areas are often sub-county regions [5].Methods to identify these metropolitan areas, megaregions or labor markets are often subject to issues of scale and rely on ad hoc definitions, thresholds, and judgments about proximity, connections and similarity. For example, the U.S. Office of Management and Budget (OMB) uses commuting networks to define Core Based Statistical Areas (CBSA). Surrounding counties are added to a CBSA core county (population thresholds of 10,000 or 50,000 in urban areas for Micropolitan (μSA) and Metropolitan areas (MSA)), if social and economic ties (as defined by commuting to the core county) comprise more than 25% of the total ties. Adjacent CBSAs are merged into a single CBSA when the central county (or counties) of one CBSA qualifies as outlying with respect to the other CBSA. The county to CBSA relationship is a one-to-one relationship; i.e. a county belongs to only one CBSA. If a county satisfies threshold requirements for multiple CBSAs, then it is allocated to the CBSA with which it has the strongest ties [6]. On the other hand, megaregions are defined by the Regional Planning Association with collections of counties that are part of CBSAs, then agglomerated using the Delphi method. Megaregions rarely overlap, but expert judgment is used to identify the boundaries when they do.In both of these delineations, expert judgment is relied upon to set thresholds for the definition of core county, the strengths of connections, and where the boundaries fall. These judgments are colored by the analysts’ orientation regarding the appropriate size of the region and the relevance of the connections. Judgments about boundaries are difficult to account for in analytic projects that seek to explain economic growth patterns [7], productivity gains [8] and city size distribution laws [9]. They are also particularly problematic when governmental assistance programs rely on statistical information conditioned by the boundaries (e.g. tax credits for investments in opportunity zones defined on the basis of Median Household Income in the MSA).We present a method for inferring geographic regions systematically from the underlying data using community detection methods in network science. One of the key innovations of this approach is to identify multiple overlapping regions at different scales in the same statistical inference framework. We also extend the notion of community to identify nodal regions where peripheral connections are overwhelmed by connections to the core. We also extend community detection methods to include self-loops that have traditionally been implicit or ignored in other community detection work, but are of great importance in commuting networks. The results of these methods identify unusual regions that neither CBSA nor megaregions identify and allow a more nuanced approach to studying and governing metropolitan areas and labor markets [10]. In summary, our methods are able to differentiate between various types of communities that we classify into three major types:Monads: Nodes preferentially attached to themselves in the sense that the self-looping proportions of these nodes are significantly stronger than the baseline self-looping proportions across the entire network.Nodal communities: Peripheral nodes more strongly connected to some core nodes rather than among themselves, after accounting for the baseline self-loops.Non-nodal communities: Clusters of nodes that are strongly connected to one another. (See Fig 1).
Fig 1
Conceptual diagrams representing a) Overlapping non-nodal communities b) Nodal communities (trees) c) monads.
The different colors represent different clusters/regions.
Conceptual diagrams representing a) Overlapping non-nodal communities b) Nodal communities (trees) c) monads.
The different colors represent different clusters/regions.Networks are used to model the relational structures between individual units of an observed system. A multitude of data structures may be conceived of as networks in the biological, physical, and social sciences. Over the last few years, owing to the explosion in data from a host of areas including social networks, information networks such as the Internet, and biochemical networks such as gene regulatory systems, there has been a concerted inter-disciplinary approach to understanding these data [11, 12, 13, 14, 15]. Due to the inherently relational nature of the commuting data, network methods offer a fitting approach for identifying clusters amongst interconnected regions.
1.1 Related work in region demarcation
Both CBSA and megaregions implicitly or explicitly rely on the notion of cores. In the case of the former, cores are counties. In the case of the latter, cores are CBSAs. The core-based approach to identifying urban regions has a history dating back to 1950s. Key to this approach is the identification of a core and its connections with the hinterland (nodal communities) [16]. This core-based approach ignores the emerging polycentric structure that has come to dominate regions around the world [17]. Since the 1980s, however, further studies have shown that most commuting flows in urban systems are lateral (i.e. between different parts of suburbs and hinterland) rather than core-centered [18]. Methods have been proposed to account for these peripheral commuting patterns and used to delineate regions [19]. In our approach, we eschew the a priori identification of cores and instead rely on entire commuting networks, thereby capturing important peripheral connections as well as polynodal and diffuse regions. We identify the core-centered regions in a posthoc analysis.Non-unique membership is another problem that is not acknowledged in other approaches. Regional delineations tend to partition a set of geographic jfobjects instead of treating them as members in multiple agglomerations. Especially in densely urbanized regions, many counties have large numbers of commuters to different cities that are relatively close to one another [20, 21]. Often, delineations (such as OMB) tend to assign counties to one region by breaking membership ties rather than acknowledging connections with multiple regions. It is useful to relax the unique membership condition between an object and the agglomeration to which it belongs.The other major issue that has received less attention in the literature, both in the context of regional demarcation as well as in the area of network science, is the idea of self-connection. Many agglomeration delineations in both network science and regional science have focused on the connection between two nodes/counties. However, commuting networks have significant self-loops (i.e. commuters within a county): 56% of the total commuters in 2010 in the US commuted within the county where they resided. Ignoring this large commuting pattern skews the results of agglomerations. Since some nodes may be preferentially attached to themselves (measured by the weight of the self-loop), they should be treated as their own agglomerations (see Section 4.4).Community detection methods have been applied to commuting networks to identify regions but traditionally do not account for the above critiques. For example, Nelson and Rae ignore commuting within a node and focus only on commuting between nodes [22]. They also rely on a community detection algorithm that partitions the entire node set rather than identifying statistically significant connections and clusters. Our proposed approach identifies overlapping communities. Unlike their approach, which starts with census tracts, we start with counties because it is easier to fashion institutions for collections of political boundaries (counties) rather than statistical boundaries (census tracts).
1.2 Related work in community detection
Community detection is an approach used to divide a set of nodes in a given network into clusters whose nodes are strongly connected within a grouping as opposed to between groupings. Many techniques have been proposed for unweighted or binary data including modularity optimization [23, 24], stochastic block models [25, 26, 27, 28], and extraction methods [29, 30].The field of community detection is vast, with a host of methodologies from many different fields including computer science, statistical physics, optimization and statistics. We refer the reader to the work of Fortunato [31] for a wide ranging survey. The two themes most closely related to the methodology developed in this paper are those that (1) establish a notion of quality functions such as modularity [32] and (2) statistical techniques for fitting empirical data to networks (such as fitting either Bayesian [33, 34, 35] or frequentist [36, 37, 38] models).More precisely, one of the central techniques in community detection is modularity optimization [32], where for any assignment of community labels c = (c1, c2, …, c) with c ∈ {1, 2, …, K} (if we assume there are at most K communities) the modularity score of such a community assignment is
where δ is the Kronecker delta function with δ(x, y) = 1 if x = y and zero otherwise. A large number of community detection algorithms are built on (approximate) optimization of the modularity score.Another class of methods is based on a network null model. Typically, in these schemes one constructs a network model that preserves some aspects of the observed network (in the context of unweighted networks). The preserved characteristic is often the degree distribution of the network (see Section 3.1 for details). Such a scrambled network with the same degree distribution creates a network with no inherent clustering tendency. The observed network is then compared with this null model to extract subsets which seem more densely connected within the subset as compared to the null model. We use the null-model based approach in our methodology (Section 3). More details on null models are further described in Section 3.1.
1.3 Related work in spatial network analysis
Substantial literature exists for the applications of spatial network analysis onto human mobility data. Barthelemy et al. [39] conducts a general survey of spatial network models and dynamic processes on these models. Batty et al. [40] provides a concise description of network methods specialized to the understanding of cities. Some popular network models for urban and spatial flows known as gravitation and radiation models are described in existing work [41, 42, 43].Some studies use existing community detection techniques on novel geographical datasets to gain insight on how inferred communities are similar to or different from existing points of interest and how they change over time [44, 45]. Community detection methods on commuting behavior were used to partition Japan into synthetic cities in the work of Fujishima et al. [46]. Community detection techniques were used to understand the polycentric spatial patterns within Singapore [47]. Other work uses proxy measures for mobility patterns, such as the flow of banknotes through the US [48], mobile phone data [49] and vehicle GPS tracking data [50], then applies existing community detection techniques. For a comprehensive bibliography of models utilizing network analytic techniques related to human mobility flow and their associated data, see the work of Pappalardo et al. [51]. Conceptually, our work differs from most of the existing work as one of the primary motivations is developing new community detection techniques specifically tied to commuting networks that have a high proportion of self-loop weights.The rest of the paper proceeds as follows. We first briefly describe the commuting dataset for the conterminous United States in 2010. We then present a method to identify clusters in a weighted network with self-loops. Using this method, we analyze the commuting patterns between counties in the United States in 2010 to identify the clusters of counties that are significantly connected. We then present and discuss the results of regional delineation and compare them with other delineations. We conclude with the limitations of our analysis and potential future research directions.
2 Data description
We downloaded our data from the US Census Bureau’s Local Origin Destination Employment Statistics (LODES). This dataset contains commuter data between census tracts for all of the continental United States in the year 2010, which we then aggregated to the county level. The data are stored as an undirected and weighted network with self-loops such that each node represents a county and the weight on each edge represents the number of commuters between the connected counties. Edges with fewer than 100 commuters are removed from the network. Commuters who travel more than 100km (distance between population weighted centroids) are also ignored to remove the effect of telecommuters or super commuters similar to [22]. The resulting network contains 3,091 nodes and 17,632 edges. Los Angeles County has the largest number of commuters to itself (∼ 3.1 million), while Los Angeles County to Orange County in California is the largest non self-loop edge (∼ 0.6 million). As stated earlier, about 56% of the commuters are commuting within the county. As such, this network can be described as a strongly self-looping network.
3 Configuration null model
The Configuration Model, first introduced by Bollobas and Bender [52, 53], is a probability measure on a family of multigraphs that preserves the degree sequence. The input to the model is an observed graph from which we extract the degree sequence, namely a list consisting of the vertices and their corresponding degrees. The model then constructs a random graph as follows: start with the degrees of nodes with d denoting the degree of vertex u; associate every vertex u with d “stubs”. One then performs a uniform matching on these stubs to form full edges, thus resulting in a random graph with the prescribed degree sequence but without any other inherent clustering tendency. The relative proclivity of each node to form ties is determined purely upon its degree.Many of the aforementioned community detection methods utilize the configuration model as the null model [30, 54, 55, 32]. We significantly extend the methodology developed by Palowitch et al. [56] for weighted network data by developing a new null model for weighted networks with self-loops. The outcome of the methodology reveals both significantly connected communities, monads, as well as nodal communities in the context of regional commuting flows (see Fig 1).
3.1 Related null models
A null model in the context of community detection is a random network model which preserves some aspects of an observed network but without any explicit community structure. The most common null model used in the context of unweighted networks is the configuration model (described in Section 1.2). Once a null model for the network is established, communities based on the null model are groups of nodes that deviate from the baseline by being more connected to each other than expected under the null.Various functionals are used to measure the deviation of a set or a partition of the entire node set from the null model. The most popular among these functionals is modularity score [23, 24]. One can then try to optimize such scores to find the best partitions. Fosdick et al. introduced a framework for configuration models that accounts for self loops and used a modularity optimization approach for community detection as an application, but did not focus on weighted networks whose self loops account for the majority of its weights [54]. Another null-model based approach is to assess the statistical significance of the deviation of subsets from what one would expect under the null, correcting these estimates for false-discovery rates, and then extracting communities that appear to be more significantly connected than under the null model. Several approaches have implicitly utilized the notion of deviation against the null partitions, such as likelihood ratios [28] or Bayes factors [27].Though other community detection methods often allow self loops, our method explicitly uses a parameter to account for their effects and integrates them in a iterative testing framework. Most existing methods that allow for self loops, or presume that self-loops are somehow similar in characteristic to the other edges, do not properly account for when self loops are large, as in the case of the algorithm introduced by Palowitch et al. [56]. Previous tree-based methods do mention self-loops, but few focus explicitly on strongly self looping networks, where self loops account for over 50% of the weights. Xiang et al. introduced one such tree-based method [57], which uses a modularity measure that is rescaled by the size of the self loop. Peixoto’s allows for self loops in the Bayesian stochastic blockmodel formulation [35]. The method of Cafieri et al. also accounts for self loops in the context of modularity maximization [58].
3.2 Notation
We denote an undirected weighted network on n nodes by the triple G = ([n], A, W), where [n] ≔ {1, 2, …, n} is the set of n labeled nodes; A = (A) is an n × n square, symmetric adjacency matrix with A = 1 if there is an edge between u and v, and A = 0 otherwise. Since we are interested in networks with self-loops, we assume A ≡ 1 for all u ∈ [n]. Though conventionally the self-loop edge is defined as A = 2, we define it to be 1 as this convention makes the algebra simple when defining the null model. We let W = (W) be another symmetric matrix representing (non-negative) weights on edges with W denoting the weight between u, v ∈ [n] with W ≡ 0 if there is no edge between u and v. We let d = ∑
A denote the degree of a vertex, which specifically is the total number of edges connecting to u ignoring self-loops. The total strength of a node u is defined as: s = ∑
W. We let d = ∑
d and s = ∑
s denote the total degree and weight of the network, respectively. We define to be the propensity of the node to connect to itself byWe define the baseline propensity of the self-loop ratio for the entire network to be:We let d = (d1, …, d) and s = (s1, …, s) denote the degrees and strengths of nodes in [n], respectively.
3.3 Continuous configuration model extraction
Significance-based testing was directly pursued in the context of unweighted networks in [59] and weighted networks in [56]. We extend the significance testing based approach developed by Palowitch et al. [56] in scope and application by adjusting for self-loops. Palowitch et al. [56] developed a method that used a weighted configuration model as a null model that preserved the expected degrees and strengths of any given node u with its actual respective strengths and degrees. The assumptions of the configuration model are as follows:
We refer to this method as CCME.
3.3.1 Model construction
We start by describing the null model for weighted networks with self-loops that will serve as a comparative model for an observed weighted network G = ([n], A, W). The model is indexed by a family of parameters = (d, s, κ, κ, a, b) where d, s are, as before, the degree and weight sequences of the observed network, respectively, κ, κ > 0 are parameters that control the variance of self-loop and non self-loop edge distribution in the null model and a, b > 0 are parameters constrained by the relation a/(a + b) = p where p is, as in (1), the global self-looping tendency of the observed graph. The concentration parameters a, b of the beta distribution with mean p represent the sparseness and tail shapes (tendencies towards zero or one) of the self-looping probability.Implicitly, we fix two distributions F and F on with mean one and variance κ and κ respectively. Using the above ingredients we construct a random weighted graph as follows:Network topology: By design for all u ∈ [n]. For all u ≠ v we letSelf-loop edges: For self-loop edges, we generate edge strengths as follows: First for each vertex u ∈ [n] (independently across vertices), we generate its self-loop propensity
(i.e. a Beta distribution). Next we generate ξ from distribution F (independent of ). Then, we modelNon self-loop edges: For u ≠ v generate edge strengths as follows: first if from step (i) then let . If then let ξ ∼ F, and let
where each q represents the following ratio of strengths and degrees of u and v:Writing D(u) and S(u) for the degree and strength of vertex u in the associated random graph, it is easy to check that
The weight matrix and adjacency matrix represent inherently different, though correlated, modes of relation. For example, in a social network one can imagine two individuals having similar degrees but very different rates of interaction with the individuals they are connected to. In the context of the commuting data, Mesa County, CO and L.A. County, CA have similar degree but very different strengths. Thus part of the aim of this paper was to develop a baseline null model that would preserve both degrees and strengths as well as a baseline level of self-loopiness and then compare an empirically observed network against this null model to extract regions of significantly higher connectivity after accounting for this baseline connectivity.We note that in (7), the first two conditions are identical to that of the ‘ordinary’ CCME method, but the third which preserves the ratios of expected self-loops is novel. The model preserves (on average) the degrees and strengths of the observed graph without any other specific notion of clustering. Each vertex has no particular preferential self-looping proclivity other then the average tendency p of the entire network. We refer to this model as CCME with self-loop adjustment (CCME-SL).
3.4 Parameter specifications
Palowitch et.al [56] use a method-of-moments estimator to specify parameters for CCME. We use this method to learn the parameters from the observed graph. We specify two types of variables to describe the uncertainty arising out of the strengths of the nodes’ connection propensities.Recall that we denote κ as the variance of the self-looping edge weight distribution (with distribution F) and κ as the variance of the non-self-looping edge weight distribution (with distribution F). Both of these variables have mean one to ensure the preservation of the strengths and degrees for the configuration null model and to ensure identifiability.
3.4.1 Variance parameters
The method-of-moments estimates for κ and κ are as follows:
where represents the estimated variance of using empirical method-of-moments, q represents the ratio of strengths to degrees as described in (6).
represents the variation in relative weights between two edges when we know that the strengths (total sum of weights) are the same. represents the variation within a single self-directed edge. These estimates account for the inherent variability of the edge weights of an empirically observed network. The eventual proposed score function (Section 4.2) used to judge the significance of the internal connectivity structure of a community (or any subset of nodes) can use this variability metric in its calibration of significance. Details on the derivations of these parameters can be found in S1 Appendix A.
3.4.2 Beta random variable to model self looping proportion
We specify as adhering to a Beta(a, b) distribution independent across u ∈ [n] with mean p and with variance equal to , the sample variance of . The beta distribution is supported on [0, 1]. We designate the proportion of self-looping commuters in each node as comprised of the averages of decisions to either commute in-county or out-of-county by a host of commuters. The empirical distribution of closely matches the simulated values of , except for a few nodes that are at the upper or lower end of the distribution (in S1 Appendix). We note that for a beta distribution,
We use p and (10) to determine a and b and, from (11), express their estimates as
4 Community detection algorithm
The CCME-SL algorithm is split into three general phases: initialization, update, and filtering. These steps compose the general procedure of iterative testing. Significant communities are groups of nodes with cross-edges that deviate considerably from the expected values under a null model. Significant communities are determined by repeatedly applying an iterative search algorithm that starts with a seed set B0 and finds all nodes with edges connecting to the seed set. The edge-weights are then summed as a test statistic which is evaluated against the expected values of the sums of the weights in the set under the null model (described in S1 Appendix A.4) with respect to each node in the starting seed set B0, imputing a p-value for each node.Each p-value from the candidate set is rejected if it is significant after being corrected by the Benjamini-Hochberg correction. The nodes with p-values that are significant in the present iteration are used as the initial seed sets for the next iteration. The final set B is extracted when the node-set becomes stable: when at some iteration step k, B = B. Nodes in the final set have a stronger affiliation with each other and have fewer edge connections with all other nodes outside the set.In this section, we describe each of the phases of CCME-SL in detail. We also describe the hub and monad detection steps as post-community detection phases of the method.
4.1 Initialization
We initialize (step 1) sets B0 by setting counties with high commuting volume as seed nodes (which represent counties). We select nodes that have above 20,000 self-commuters as seed nodes. We select these nodes because they are proxies for relative population centers where commuter traffic radiates outwards to more peripheral connections upon each iteration. We then find all nodes which are connected to each seed node. The seed node and its connected nodes are used as the initializing sets B0. The seeds are largely irrelevant to the final outcome: the final outcomes reveal similar outcomes regardless of what the initial nodes selected are, so long as a majority of the high-volume nodes are included (see Fig 4). However, because the initial seed nodes are fixed using the above heuristic, the algorithm converges to the same resulting clusters under the same parameters α and τ.
Fig 4
Example of a tightly connected community in the Bay Area in Northern California.
4.2 Update
Stable communities are found using an iterative node-set updating scheme based on the p-value of the connectivity between a single node u ∈ [n] and a candidate (testing) set B ∈ [n]. We denote S(u, B, G) as the connectivity of a single node to the set of nodes which is hypothesized to be a community:
When the observed value S(u, B, G) significantly exceeds the expected sum of weights under the null model, then there is evidence to support the claim that there is some additional structure undergirding the set of nodes than that which is posited by the null model. The null model attributes connectivity between sets of nodes as dependent only on the strengths and degrees of the aggregations of the nodes themselves.The p-value representing the significance of a node-set is given by:
In the above equation G is observed but is random with a distribution given by the null model and with each B representing the candidate set to be tested. When the observed value of is much larger than the expected value, expressed as S(u, B, G), the p-value is low. Low p-values are rejected in an iterative fashion so as to allow the formation of node-sets with edges that are consistently significantly connected to each other. We define these sets as communities.The iterative method is described as follows: for each u ∈ [n], given a set B (or denoted by B at k iteration), we find all counties that are connected to the present set, then p-values are imputed for members of the candidate set B and repeatedly tested until the set becomes stable upon sequential iterations:Calculate p-values . P-values are calculated using a normal approximation for the distribution of . Details on this part of the procedure are given in S1 Appendix A.4Obtain threshold τ(p) using a Benjamini-Hochberg multiple testing procedure [60]. The procedure is used for sets of p-values that are obtained through multiple hypothesis testing. The rejection method ensures that the expected number of falsely rejected hypotheses divided by the total number of rejected hypotheses (false discovery rate, or FDR) has a maximum percentage of α. A false discovery rate threshold α of 0.05 is common in many applications, but for community detection we find empirically that such a threshold should be lower to avoid excess overlaps.The next set reached by the iteration is defined asThe above steps are iterated with B′ replacing B until we reach a fixed point. We set α = 0.01 at each step of iterative testing to perform community detection. The threshold can be made higher or lower, and such adjustments do not change the results drastically (see Supporting Information for details), but the threshold of.01 appears to be optimal for maximizing coverage and minimizing overlaps.
4.3 Filtering
After obtaining M stable communities C, where j = 1, …, M, we remove redundant node-sets that have a high proportion of overlap with other sets. Redundancy in clusters is evaluated using the Jaccard similarity index. A Jaccard similarity index of two sets is defined as the ratio of the size of common elements between the two sets over the total distinct elements, or concisely expressed as J(A, B) = |A ⋂ B|/|A ⋃ B| for two sets A, B [61].We evaluate Jaccard similarities for each pair of found communities C, C. If the Jaccard index J(C, C) is above a given pre-set threshold τ, then the clusters are redundant and we select a preferred cluster by calculating the average weight per connection between nodes. We use a simple formula for a given stable node-set C:
K(C) roughly measures the average sum of weights among cross-edges per node within a candidate set C. Given that two sets C, C have Jaccard overlaps larger than τ, a higher K(C) compared to K(C) signifies more interconnectivity between nodes in C and thus it is kept in the final set of communities while C is removed. We set the τ parameter to be 0.80 when implementing the method on commuting networks.
4.4 Detection of monads
One unique feature of geographical commuting networks is that a non-trivial proportion of the total commuting volume is not found in edges across vertices because most residents commutes within their counties. We define the degree of monadicity of a given node as the following:
Assuming the observed graph originated from the null model, the variable I measures two things. Firstly, I measures how much larger is for a given u than the global mean self-looping tendency p. Secondly, I measures how much larger the latent ξ component of is than its expected value of one (recall that self-loop weights are modeled as from (4)).The exact form of the variance of I is difficult to calculate, but we use a simulation method to approximate a p-value for I. We first determine estimates for a, b from the mean and variance of under the null model. Given p and the sample standard deviation , we find estimates for a, b from (12). We use these estimated parameters to simulate a measurement of how extreme the of a given node is, compared to that of a measurement assuming random generation from a beta() distribution. If the the actual value I is large, as measured by whether it is above the α-th quantile of the simulated values, then the node is deemed significant because it consistently exceeds what would be expected if were randomly generated from a distribution of the same parameters. Such a simulation-based method to approximate p-values is commonly used [62].The process of finding significant monads may be concisely described by the following procedures. First, we simulate from beta() for every node (county). We then obtain the empirical distributions of how monadic a given node is by computing the empirical distribution of . Following this step, we consider I = W − ps. If I is in the 1 − αth tail of the distribution then the node is monadic at this instance of simulation. We repeat the procedure 10,000 times and the nodes that are monadic all of the 10,000 trials are conclusively classified as monads. In practice, we set α equal to.05 for this test. The resultant group of nodes are significantly monadic at the 5% significance level.
4.5 Differentiating nodal communities from non-nodal communities
In the post-processing phase, we identify nodal communities within the communities detected by using the local clustering coefficient as defined by Opshal et al. [63] for weighted networks. For an unweighted network, the local clustering coefficient of a node u is the ratio of the number of present ties over the total number of possible ties between the node’s neighbors. A community will have a low clustering coefficient if there is a single hub-like node that is connected to all its peripheral nodes, but its peripheral nodes do not connect to each other. A node in a complete graph has a coefficient of 1.For a weighted network, Opsahl et al. define the minimum clustering coefficient of a particular node u in a set C using triplets and triangles of nodes [63]. A triplet δ(u, v, w) is defined as a set of three nodes that share at least one edge with another node in the set. A closed triangle Δ(u, v, w) is a set of three nodes whose nodes all connect to both other nodes in the set. One may visually conceive of a triplet as a loosely connected set of three vertices which may have a missing edge, and a closed triangle as a clique of three vertices with three edges.Using triplets and closed triangles, Opsahl et al. [63] define the minimum clustering coefficient of node u in the set C as
The ratio m(u, C) is the ratio of the sum of all closed triangles to triplets associated with u. If the sum of the minimum values of triangles is how compared to those of all triplets within a cluster, then that implies the presence of a dominant node that projects high edge weights across many peripheral nodes. A low clustering coefficient signifies that communities are bound together by a common node, while a high m(u, C) signifies that the nodes are all connected to each other.We determine the overall clustering coefficient of a community by
mtotal(C) is constructed to capture how tree-like the highest-weight nodes are in a given cluster C. The lower m(u, C) is, the more tree-like the node is. 1 − m(u, C) weighted by the strengths of nodes in a given cluster assigns a value of how tree-like and strong a node is. Summing these values gives an overall measure of the monocentricity of a cluster, as the strongest nodes tend to have the smallest m(u, C). The empirical distribution of mtotal(C) is bimodal (see Fig 2) with a split around 0.4. We use this value to identify nodal communities.
Fig 2
Resulting communities from the CCME-SL algorithm.
Communities (non-nodal) (left), nodal clusters (middle), and monads (right).
Resulting communities from the CCME-SL algorithm.
Communities (non-nodal) (left), nodal clusters (middle), and monads (right).
4.6 Methods to compare communities with other delineations
We compare our results with other existing delineations (in particular OMB’s Metropolitan areas) by using the Fuzzy Rand Index (FRI) [64]. The Fuzzy Rand Index is a metric that measures the similarity of two covers, C1 and C2. A cover C is the assignment of vertices of a graph into k groups, where a vertex may belong to more than one group. Counties that are not assigned to any community will belong to their own group, a group of counties that do not belong to any community. Let V represent the set of vertices and C represent a cover of V. Each element v ∈ V is characterized by its membership vector, C(v), which describes the degree of membership between a vertex and each group. A membership vector is subject to the following constraints: C(v) = {C1(v), C2(v), …, C(v)} ∈ [0, 1], where C(v) is the degree of membership of v in the i community, C, and ∑
C(u) = 1. The norm of C(u) − C(v) in a cover with k communities will be defined to be ||C(u) − C(v)|| = ∑|C(u) − C(v)|/2. For some cover C, the similarity measure between two nodes u, v is defined as E(u, v) = 1 − ||C(u) − C(v)||. The distance measure between two covers, C1 and C2 of a network V is defined as:
Likewise, the similarity measure between two covers is 1 − d(C1, C2).
5 Results
We use the term ‘clusters’ to refer to all sets of counties that we obtain from the detection procedures, ‘communities’ to refer to clusters that contain more than one county, and ‘nodal communities’ to refer to clusters that have strong nodal centers with little lateral commuting. We use the term ‘monads’ to refer to those counties that are strongly connected to themselves.From the total 3,091 US counties, we find a total of 182 significant clusters. Of these clusters, 14 are nodal communities, 78 are non-nodal communities, and 90 are monads. Together they cover 90.3% of the population of commuters (93% intra-county, 87% inter-county). The method simultaneously delineates both small (such as monads and dyads) and large clusters consisting of hundreds of counties (see Table 1). For example, Santa Barbara and San Luis Obispo counties in California are strongly connected to one another and are separate from other clusters. Of the regions identified, 99 are monadic or dyadic counties. 68 of these clusters are medium sized, comprising between 2 and 50 counties. In 20 other instances, CCME-SL identifies clusters consisting of 50 or more counties that span several states. These tend to be polycentric regions centered around multiple large cities such as Philadelphia, Washington DC, and Baltimore (see Fig 5). Out of the 182 total clusters, the average size of a cluster is 24 counties, with a standard deviation of 59. The median size, however, is only 2, signifying that the majority of imputed clusters are monads. When only considering communities, the average size is 49, with a standard deviation of 78. The median size of communities is 15, suggesting that the distribution of community sizes is right-skewed (see Table 1). In general, many of the counties belong to overlapping clusters, with some belonging to as many as six different clusters (see Fig 3).
Table 1
Size characteristics of the identified regions.
IQR gives the first and third quartiles of the cluster sizes. Note that the number of counties is the total number captured by these regions with duplicates removed to avoid double counting within each subgroup.
Cluster Type
Count
Mean
Median
IQR
Min
Max
# Counties
Non-nodal communities
78
49
15
[8, 46]
2
356
2044
Nodal communities
14
28
5.5
[4, 19]
4
255
392
Monads
90
1
1
1
1
90
Comparison Delineations
Metropolitan Statistical Areas (MSAs)
363
3.0
2
[1, 3]
1
28
1094
Megaregions
12
80
38
[27, 96]
8
388
967
Fig 5
Comparison of MSAs of New York City Region, major Texas cities, and Minneapolis
(left) with their associated communities (right) in fairly populous regions.
Fig 3
Heatmap of frequencies of each county to appear in any cluster (community, nodal cluster, or monad).
Size characteristics of the identified regions.
IQR gives the first and third quartiles of the cluster sizes. Note that the number of counties is the total number captured by these regions with duplicates removed to avoid double counting within each subgroup.The parameters associated with the null model are as follows: p is.46. p is roughly the mean of intra-county commuters; though weights between cross-edges are counted twice, the proportion of intra-county commuters is 56%. , the sample standard deviation of , is.04, signifying that around half of the commuting volume in the US does not leave the county where it orginates, and that the dispersion about the mean is fairly tight. The estimates for concentration parameters a and b of the beta variable , derived from p and , are respectively 3.86 and 4.50. and are.08 and.79 respectively, signaling a much higher variance for the latent variable undergirding inter-county commuting volume than that of intra-county commuting.
5.1 Benchmarking with other delineations
In general, the clusters found by our proposed method are much larger than existing delineations and account for much more of the commuting activity. We define the coverage rate as the number of commuters between all edges of the given sets of clusters divided by the sum of the total edges. This criterion captures how much of the total commuting activity is ‘captured’ by different modes of aggregation. The rate of coverage for all inter-county commuting was 86% for all clusters, compared to only 48% from OMB delineated MSAs and 77% from megaregions. The coverage rate for all intra-county commuting was 92% from clusters imputed by CCME-SL, compared to 86% from MSAs and 74% from megaregions.Similarities between clusters and MSAs are assessed using the Fuzzy Rand Index (Section 4.6) and by comparing size characteristics in Table 2. Clusters and MSAs are most alike in the West North Central and Mountain regions and least alike in the East South Central region. In the Mountain and Pacific regions, the number of clusters is nearly the same as the number of MSAs, and the mean commuter volumes captured by each method are more comparable than in the eastern states.
Table 2
Characteristics of clusters (monads, nodal and non-nodal communities) compared to MSA for each Census Division.
Census Division
FRI
# Clusters
Mean Commuters
Mean Size
Coverage
Clus/MSA
Clus
MSA
Clus
MSA
Clus
MSA
Cliques
Clus
MSA
Northeast
.66
8
15
703,315
133,223
10
2
86%
92%
60%
Midatlantic
.64
18
31
2,023,890
442,430
16
3
97%
97%
81%
E N Central
.63
24
70
1,557,092
179,574
22
2
81%
83%
68%
W N Central
.76
15
34
796,990
163,363
36
4
61%
68%
63%
S Atlantic
.60
33
80
1,225,996
195,596
32
4
85%
90%
69%
E S Central
.58
15
34
1,062,264
107,392
59
3
97%
97%
56%
W S Central
.66
33
43
661,861
230,651
19
3
76%
85%
70%
Mountain
.80
33
35
265,302
173,790
5
2
51%
82%
71%
Pacific
.70
42
43
706,892
321,860
5
2
47%
79%
74%
Clusters are most consistent with MSA delineations in large urban areas (see Figs 5 and 6). The size differences between the detected clusters and other delineations are muted in the regions around population centers such as Chicago and the cities of New York and Texas. The more populous and denser a core urban area is, the more concentrated a community (a cluster of more than one county) becomes. In Texas, for example, the economic regions surrounding major cities like Austin, Houston, Dallas, and San Antonio are all detected by our method, but these regions are larger and overlapping. The Austin region is the only Texas community that is drastically different from the corresponding MSA. The Houston region is split into coastal and inland commuting zones, which overlap at the core Houston area. The overlapping communities between Austin and San Antonio include Bastrop and Hays counties, identifying the dual memberships of these counties in larger cities’ economic spheres of influence. Each of the four cities are also quantified as monads (not shown), signifying the relationship between large urban centers and their associated suburbs and hinterlands [16]. However, the larger, more diffuse nature of the communities signals emergent lateral commuting behavior between these peripheral counties [18].
Fig 6
Comparison of clusters from of CCME-SL (top) with DC-SBM (middle row, 100 blocks
(left), 350 blocks (right)), Modularity (bottom left), and MSAs and megaregions (bottom right).
Most urban-centric areas yield communities that are larger than and surround MSAs. One such example is the community surrounding Minneapolis, which is inclusive of the MSA but incorporates more peripheral counties (see Fig 5). In California, for example, a community in the Bay area region is similar to the San Francisco–Oakland–Fremont MSA, but the community also incorporates counties from San Jose–Sunnyvale–Santa Clara MSA (see Fig 4 because of the connected nature of inter-regional commuting in the Greater Bay Area. Northern California MSAs are similar to communities in the core area, but the community includes more peripheral counties.Some communities are similar in size but differ in shape from MSAs. Two of the communities associated with New York City are similar in size with their analogous MSAs, but communities do not incorporate as much of New Jersey counties as the MSA. One community in particular stretches North to upstate New York and part of New Hampshire.Megaregions, in general, are larger than clusters from CCME-SL. The largest megaregion is in the Midwest connecting Minneapolis to Western Pennsylvania. The largest community yields a similar number of counties but is located in the Georgia-Tennesee region, centered around Atlanta (see Table 1). The largest communities appear to be in the southern central part of the US, where megaregions cover the least amount of ground (Fig 6). Though megaregions tend to be larger than the average community, the megaregions in the southern and central parts of the US still do not take into account the counties with smaller commuter volumes. The coverage rate for commuter volumes by megaregion is lower than that generated by communities for both intra-county and inter-county cases. The CCME-SL algorithm tends to find larger communities in counties that are more tightly connected by commuting, while megaregions tend to identify large population centers located in relative proximity to one another. The Piedmont megaregion stretches from Atlanta to central North Carolina and only comprises around 150 counties. Most communities around that region are focused on population centers. Cities like Tulsa in the central US are not accounted for by megaregions possibly because they do not have large population centers. However, these cities are still influential commuter hubs that facilitate the formation of large regions tied together by commuting.The mean sizes of communities far exceed those of MSAs in all Census Divisions, except in the Western states, where the communities are only somewhat larger than MSAs (see Table 2). In the East South Central Division, the average cluster size is the largest at 59. In the Mountain and Pacific Divisions, however, the mean size of clusters is so small as to be almost on par with the number of MSAs. The largest delineated cluster is centered on the Atlanta area in the Southwest.Communities (excluding monads) cover, on average, much more commuting volume than MSAs. In the Mountain and Pacific states, communities cover much less of the commuting volume because most populous counties are classified as monads. When taking into account monads, the coverage is higher than that of the MSAs in all regions. The proportion of commuting volume is nearly entirely captured by the clusters from CCME-SL in the East South Central and Midatlantic Divisons. 97% of total commuting activity is accounted for by clusters see Table 2). Communities and clusters (inclusive of monads) mostly cover above 80% of the commuting activity, while the MSAs cover less than 70%.Many of the communities in the Midwest and South have numerous overlaps. For example, Cook and Tulsa counties are members of 5 different communities. North Carolina is covered by three large, heavily overlapping communities. Most nodal clusters are found in the Mountain West, where the counties are on average much larger and more sparsely settled. Monads are much more common in the Western US because the counties tend to be larger. Most major cities are identified as monads, including Cook, Harris, and Los Angeles. Notably, New York (Manhattan) and King (Brooklyn) counties are not monads because the counties that comprise New York City are all highly interconnected with each other (see Fig 5).
5.2 Comparison with other community detection methods
We compare the results found by CCME-SL with several other widely used methods. We examine the results imputed by modularity maximization (Louvain) and the degree-corrected stochastic blockmodel (DC-SBM). The Louvain method naturally finds the optimal number of partitions, while the DC-SBM needs a pre-specified number of partitions. In the Louvain, DC-SBM, and expert judgment (OMB) methods, regions are non-overlapping, while CCME-SL is the only method that demarcates in a way that allows counties to have multiple memberships (see Fig 6). A number of other techniques implicitly allow for self loops. CCME-SL was largely motivated by the need to address settings where self-loops account for a significant proportion of the weight emanating from a vertex.The Louvain method imputes an optimal number of around 350 communities (see Fig 6, bottom left) and approximately maps commuting patterns to regions roughly similar in size to (large) CBSAs. We fit DC-SBMs under two different specifications for number of blocks: 100, which maps approximately to the number of clusters found by CCME-SL, and 350, which are the optimally clustered sets found by modularity maximization. We visualize these clusters alongside pre-defined MSA delineations and megaregions (see Fig 6).Compared to DC-SBM and modularity methods, CCME-SL is capable of capturing overlapping communities finds much more variation in community sizes (though less so in weights). Counties that are influential for several regions, like Harris County in Texas, are strictly partitioned by DC-SBM and Louvain, but yield components in both ‘coastal’ and ‘inland’ counties in communities found by CCME-SL. Communities imputed by DC-SBM are highly dependent on the pre-specified number of blocks chosen. Los Angeles County is a single block under DC-SBM when 100 counties are chosen, but is subsumed by a much larger block when 350 blocks are chosen.DC-SBM was implemented by means of regularized spherical spectral clustering [65], which has been shown to be consistent with DC-SBM in the package randnet. The modularity algorithm was implemented using the package igraph.
6 Discussion
We summarize our findings in this section and offer interpretations in relation to the economic geography of the US. We highlight how and why our findings are different from typical delineations and reconcile these findings with the introduced method and its novel incorporation of self-loops in a null model within a weighted network.
6.1 Substantive differences from existing agglomerations
Our method groups nearly all major urban areas in the U.S. into clusters. Counties like St Louis, Missouri and Fulton, Georgia (Atlanta) connect to many smaller, less populated counties. These clusters are much larger than those in more urban areas such as New York City or Harris (Houston). Counties in Southern California are not identified as multi-county communities but as monadic regions because these counties typically have much higher rates of intra-county commuting. However, in the Greater Bay area, Sacremento and Stockton are grouped together (see Fig 4). The clusters also overlap. Some counties in the Bay Area are also part of communities associated with the Sacramento metropolitan area.In the Northeast, New York City is linked to upstate New York in some communities, Philadelphia is linked to Baltimore and Washington DC, and Baltimore/Washington DC are more linked to each other and to other urban centers along the Eastern seaboard stretching down to North Carolina. However, the Northeast and Midatlantic regions yield smaller communities on average compared to the South and Midwest. Because the commuting volumes in the Northeast are higher and counties are smaller, the counties in this region tend to cluster around each other in medium-sized communities of usually between 20-30 counties, which are larger than those in the West, but smaller compared to those in the South and Midwest.Communities have some similarities to megaregions in the Midatlantic. New York, Philadelphia, Washington DC, and Baltimore are all grouped in the same megaregion, as is the case in several communities (see Fig 6). However, megaregions do not account for the connections between Baltimore, Washington DC and Eastern North Carolina. Communities generated by CCME-SL are allowed to overlap such that Eastern North Carolina and New York City are both significantly connected to Washington DC without being connected to each other. Such categorizations allow different urban areas to be flexibly classified in different regions.The wide range of community sizes is due to both differing county sizes and differences in population between different regions of the United States. A county in California or Colorado tends to be at least a few times larger than a county in a state like Georgia. Thus, small communities are mostly found in the Western part of the contiguous US, while the largest clusters are found in the South and Central parts of the US.Clusters from CCME-SL are larger than MSAs except in populous regions. CCME-SL naturally finds larger communities when inter-county commuting is relatively high compared to total commuting and smaller communities when large population centers are included. Communities show a large variation in size possibly because they account for the power-law distribution of human populations. Compared to MSAs, which are at most a few counties (excepting some metro regions like the Atlanta MSA), and megaregions, which are typically very large, community sizes are much more variable. That a single county can be compared to a massive region is a natural reflection of the highly uneven geographic distribution of the US population, and an expected feature of the results from an unsupervised method such as CCME-SL. Los Angeles County is the most populous county in the US with over 10 million inhabitants and 4 million commuters, a greater number of commuters than all counties combined in most states. LA County is identified as monadic and is within the same class of comparison with the 300-county communities stretching across several states in the South. Such a comparison does not make sense when only considering regions that are similar in size.Communities found by the proposed method account for much more of the total inter-county commuting than is accounted for by MSAs. The discrepancy is especially large in the Southeast, Northeast, and Midwest, where communities and MSAs are most dissimilar according to FRI and differ most in the size, volume, and number of delineations (Table 2). MSAs in these regions capture between 50-70% of inter-county commuting compared to over 90% from communities. Counties in these regions are more interconnected by commuting than counties in other regions, making it likely that their corresponding MSAs are classifying counties in too fragmented a manner to effectively describe the highly interconnected nature of the regions (Table 2). In the Central and Southern US, interwoven commuting patterns are more spread out across counties, resulting in massive clusters.CCME-SL produces clusters that vary greatly in size. The most populous MSAs house similar counties as their corresponding communities (see Fig 5). However, communities can also be as large as megaregions, although they tend to capture counties in different ways. Megaregions capture cities that are in close proximity and which have large overall commuting volumes, but communities capture sets of counties that are closely linked by commuting, even when there are no cities and the gross commuting volume is not large. In the South and Central parts of the US, counties tend to be small and rural yet tightly interconnected. Such aggregations have not been depicted in existing official regional delineations and may be a novel contribution of the method in this study.
6.2 Methodological contributions to region demarcation
Though detection of monads may be antithetical to typical notions of community in network theory, they are very important in this particular application. This research highlights the role of self loops in commuting networks and is broadly applicable to human mobility networks with spatial constraints. As described in [66], most network data arising from nature adhere to power laws, and commuting flows are no exception. This study shows that power laws in spatial settings are intricately linked to self-referential behavior. Many studies have described human populations adhering to heavy-tailed distributions such as the Zipf Law [67, 68], but this study is among the first to indicate how regional delineations can account for such phenomena.Strongly self-commuting counties that are identified as monads are also classified as clusters. Monads are found using tests of similar hypotheses evaluating how unlikely the size of the total commuting activity in a given geographical unit is compared to a scenario where commuting activity was generated at random subject to constraints arising from the weighted configuration model. The null hypothesis for monadicity parallels the null hypothesis for community connectivity in evaluating whether a test node is significantly more strongly connected to other nodes, or itself, than expected.CCME-SL is mostly unsupervised. There are only two tuning parameters controlling the algorithm, α and τ. α tunes the significance of the nodes’ connectivity in relation to its associated community, and τ tunes the threshold of Jaccard distance to filter overlapping clusters. Different values of α and τ are used for both communities and monads, the results of which are shown in the Supporting Information. We set the α = 0.01 and τ = 0.80 for communities and α = 0.05 for monads because monads cover nearly all the major US metropolitan areas. Though they don’t differ much from those presented in Section 5, communities induced by different thresholds promote discovery of more tightly linked or more significantly monadic clusters.Nelson and Rae use a multi-step routine to tune parameters for modularity-based methods, filtering out unwanted ‘outlier’ nodes and validating visual heuristics [22]. Compared to this approach, our method is more parsimonious and statistically interpretable.
6.3 Comparisons with other community detection methods
In Section 5.2, we compared the CCME-SL algorithm with several other standard community detection techniques, including the existing OMB demarcations. The primary advantage of CCME-SL is that it accounts for overlapping memberships for each node. Another advantage of CCME-SL is that the regions demarcated are more defensible because the model accounts for self-loops, which are important in the commuting network. Finally, the simplicity of CCME-SL is a practical advantage compared to other models, especially DC-SBM, which requires that the optimal clustering parameters be determined through cross-validation.Though other techniques implicitly account for self-loops, the self-loops oftentimes cause distortions that create problems in identifying regions. Fig 6 shows that Louvain and DC-SBM yield smaller communities that are all roughly balanced in number of nodes per community. Sizes of regions are highly contingent on their populations and commuting volumes. CCME-SL, on the other hand, yields clusters that are highly variable in size, commensurate with the highly variable populations. The sizes and characteristics of the clusters (monads and communities) imputed by CCME-SL thus appear to be less constrained than the other three methods shown in Fig 6. We posit that the differences between clusters from CCME-SL and other approaches are at least partly due to the extremely strong self-looping weights in high-strength nodes (populous counties).Many algorithms return different results under different initialization scenarios when the likelihoods of partition functions are multimodal and thus give rise to a number of near-optimal partitions [69, 70]. Though this is a common problem in modularity-based approaches with random seedings, CCME-SL is not strongly affected by this issue for two different reasons. First, the method of initialization using the heuristics of starting at nodes with self-loops larger than 20,000 described in Section 4.1 produces the same results upon every run of the algorithm. Second, even if initialization was randomized and subject to different initialization criteria (while still retaining most of the population-center counties), the results do not look very different (see Fig 4 in S1 File). In other applications of CCME-SL, wherein the data do not have interpretable initial seeds, more runs would be required and analysis of sets of partitions would be necessary [35].The number of communities is very important in community detection and its determination is a difficult problem in the field. For example, in DC-SBM the number can be determined by model selection or by cross validation. Under CCME-SL, the number of detected communities is determined by just one sample. This would be just one of the near-optimal states of the assumed model, as there would be other optimal states with different numbers of communities. Although the proposed method finds generally similar communities under a range of parameters and initializations (Figs 3,4 in S1 File), these validations do not allow for the discovery of some exact optimal objective. As such, we reiterate the point that CCME-SL should be viewed as an exploratory method that could give rise to more rigorous modes for proposing novel OMB-designated regions based on the structure of a commuting network.
6.4 Limitations and further research
Several limitations arise from the proposed method and application. Though mostly unsupervised, there are several tuning parameters that must be specified, such as the overlap parameter and the α threshold for the p-value. Sensitivity analysis (see Fig 3 in S1 File) over a range of p-values and overlap thresholds demonstrates that results do not change much when tuning parameters are tweaked. The post-processing step for determining the nodal communities does not capture the extent of the monocentricity that arises from areas that we expect to exhibit these properties, such as Multnomah County (Portland) and Hennepin County (Minneapolis).Two major issues that we plan to explore in future work are:Resolution limit: In future work we plan to explore the notion of resolution limit in the context of CCME and its judgement of significant communities. In brief, consider the simpler setting of unweighted networks. In an unweighted network the null model corresponding to an empirically observed network is created by placing an edge between two vertices u and v with a probability proportional to d
d/d. For two sets A and B with degrees d(A) and d(B), d(A) × d(B) ≪ d will cause the null model to find any presence of an edge between these sets to be “surprising” and cause A and B to be merged into the same community. This is a well-known issue with algorithms underpinned by a null model, as is the case for modularity detection. Similar issues must also arise with CCME-like algorithms. One method to address this would be to incorporate a resolution parameter γ as in the context of the unweighted case, which we now briefly describe (and refer the reader to work in [71] or [31] for more details). In the unweighted case, the natural null model is the configuration model which preserves degrees. This model implies that the null random graph model gives the probability of the existence of an edge between two vertices u and v as in (3). To accommodate for and deal with the issue of the resolution limit, the reference model can be modified so that
As one increases γ ↑ ∞, this implies that we expect non-trivial connectivity between nodes and thus connectivity within subsets to pass “higher bars” before being judged significant. We are exploring similar ideas in the context of the weighted case, including connections between modifications of the reference null model and the corresponding Markov stability of diffusion processes on the null models [72].Spatial null models: In current work we are developing null models which also take into account the spatial component i.e. null models that directly include the spatial component and preserve various functionals such as the degree and the strength. We hope that this new approach will give more accurate results, mitigating issues like that of CCME-SL finding connections between neighboring counties surprising purely because of the resolution limit, since CCME-SL does not directly incorporate the notion that in spatial systems “neighboring counties have a higher propensity to connect”.Several fruitful directions of further research emerge from this study of commuting regions. This study documents the influence of self loops in spatial networks describing complex patterns arising from the collective behavior of many individuals. As network data in these realms become increasingly available, we expect the emergence of more methods accounting for self-looping behavior in networks. We plan to extend the methodology proposed in this study to directed networks so as to model the orientation of commuter flows. A further extension of the method described in this study is to use a temporal model to measure change in communities across time. Another avenue is to investigate the characteristics of power-law distribution of populations that are embedded within the commuting networks and how distributional characteristics of power laws interplay in community detection.This research extends such a methodology to a more general setting of spatial networks that characterizes collective behaviors. Such networks often represent agglomerations of particles that are influenced by both core and peripheral elements. The method of community detection in networks with self-loops may be applied to a variety of spatially-constrained human mobility networks which naturally exhibit significant self-looping characteristics, such as human migration behaviors.This method may find applications in other domains as well. In neuroimaging, one such application may be in analyzing the epicenter-spreading proliferation of biomarkers such as tau, which is significantly linked to Alzheimer’s Disease. Research has revealed that tau develops along a trajectory of concentric spatial accumulation that aggregates at a seed region. Mapping such behavior in brain networks may find a suitable implemention in community detection on strongly self-looping graphs.
7 Conclusions
Traditional delineations of geographic regions have relied on agglomerations of smaller geographies, historical and political boundaries, separating edges and central foci. The boundary characterization is important not only for scientific purposes of tracking and tracing the historical evolution of urban systems, but also for administrative purposes of allocating infrastructure investments and formulating economic development strategies. Boundaries of metropolitan areas in the United States are artifacts of these delineation definitions, yet are central to tracking demographic and economic changes, funding allocations, determination of fair market rents, housing subsidies that depend on area median income and a host of other federal and state programs, even when the agencies caution their use for non-statistical purposes. These delineations are central but invisible to the lives of many. In this paper, we provide a robust method of accounting for the membership of a single place in multiple regions.The main methodological contribution of this paper is its introduction of a community extraction method for a network with strongly self-looping characteristics. The application of the method on US commuting data suggests a way of conceiving delineations of economic geography that differs from existing approaches. CCME-SL accounts for intra-county commuting patterns and produces drastically different results when compared to other community detection methods as well as CBSA-based approaches. Furthermore, allowing regions to overlap allows us to create institutional structures and policies that are tailored not only to singular geographical entities, but also to multitudinous identities interacting across space and place.
Appendix A. Probability Computations for Parameters
A.1 Variance of ξ: κ
Note that
as the two variables are assumed to be independent. The sample variance of W may be decomposed in the following way:We derive another calculation of the sample standard deviation of self looping weights using a method of moments estimator.From the above two equations, we derive the following approximation
Rearranging the above equation and replacing unknown parameters by their estimates, we derive the estimate for
A.2 Variance of W
The properties of the weighted configuration model stipulate that the expectation of an edge weight given that there exist an edge, by Eq 5, is:
q in the above equation is defined in the main body of the paper in Eq 6. We calculate the variance of W by decomposing it into two terms I and II using the following identityTo calculate I, we note that,Therefore, calculating I first necessitates calculating the conditional variance of , given that there exists an edge, under an expectation. As shorthand, we define the operation Var(⋅) ≔ Var(⋅|A) and .Since relation 15 holds for all A, it becomes apparent thatUsing Eq 16, the calculation of I becomes straightforward:The calculation of II, similarly, is as follows:Putting the two Eqs 17 and 18 together, we are able to solve for the variance of W from Eq 14, and substituting the expression for from Eq 3,
where
A.3 Variance of ξ: κ
Now we construct a similar method of moments estimator for κ as was done for κ. However, we also make use of the expression for the conditional variance of W given the existence of an edge in Eq 16.After solving for κ in the above equation and substituting unknown variables with their estimates, then changing the approximation to an equation, we derive the estimate for κ, thus obtaining the estimate as given in Eq 8:
A.4 Central Limit Theorem for in set B
In this section we detail the calculation of the expectation and variance of the relative strength of a given node-set B used in iterative testing, described in the body of the text in Section 4.2.Taking the expectation of each ξ gives the following expression for the strength of the node setWe have found, in Section A.2, that the variance of a given W is expressed as:Adding the variance terms together in set B yields:Then given that B is ‘typical’ and that d and B are sufficiently large, that is approximately normal. For and
Hence in each step of iterative testing in the update step of CCME, the normal p-value is used to iteratively reject insignificant nodes in a candidate community. Assumptions for and proofs of this Central Limit Theorem can be found in [56].Communities (non-nodal) (left), nodal clusters (middle), and monads (right).(TIFF)Click here for additional data file.
Example of a tightly connected community in the Bay Area in Northern California.
(TIFF)Click here for additional data file.(ZIP)Click here for additional data file.26 Jul 2019PONE-D-19-14845Demarcating Geographic Regions using Community Detection in Commuting NetworksPLOS ONEDear Dr. Kaza,Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.We would appreciate receiving your revised manuscript by Sep 09 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocolsPlease include the following items when submitting your revised manuscript:A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.We look forward to receiving your revised manuscript.Kind regards,Renaud LambiotteAcademic EditorPLOS ONEJournal Requirements:1. When submitting your revision, we need you to address these additional requirements.Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found athttp://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf2. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.3. In your data availability statement you write, "All relevant data are within the paper and its Supporting Information files." Please ensure you have listed the third party data sources for your study and information on how other researchers could access that data (http://journals.plos.org/plosone/s/data-availability#loc-faqs-for-data-policy). If you have the necessary permissions, please consider uploading the data used in your specific study to a data repository such as Dryad or Figshare.4. Please include a separate caption for each figure in your manuscript.[Note: HTML markup is below. Please do not edit.]Reviewers' comments:Reviewer's Responses to QuestionsComments to the Author1. Is the manuscript technically sound, and do the data support the conclusions?The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.Reviewer #1: PartlyReviewer #2: Partly**********2. Has the statistical analysis been performed appropriately and rigorously?Reviewer #1: NoReviewer #2: Yes**********3. Have the authors made all data underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: No**********4. Is the manuscript presented in an intelligible fashion and written in standard English?PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.Reviewer #1: YesReviewer #2: No**********5. Review Comments to the AuthorPlease use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)Reviewer #1: The authors identified the clusters of counties in US from commuting data among them, using by a overlapping community detection method with self-loops.Such a data-driven approach to identify spatial units would be promising, overcoming the problems of the traditional heuristic methods. On the other hand, this is not the first paper for this approach, and the reliability of their method should be checked carefully to use the results for statistical, governance and planning purposes that the authors argued in the paper.1.There are some papers to identify spatial units from the human mobility datasets.For examples,- The impact of social segregation on human mobility in developing and industrialized regions by Amini, A., Kung, K., Kang, C. et al. EPJ Data Sci. (2014) 3: 6. https://doi.org/10.1140/epjds31.- Comparing Community Detection Algorithms in Transport Networks via Points of Interest by Huang, Liping & Yang, Yongjian & Gao, Hepeng & Zhao, Xuehua & Du, Zhanwei. (2018). IEEE Access. PP. 1-1. 10.1109/ACCESS.2018.2841321.- Detecting the dynamics of urban structure through spatial network analysis by Chen Zhong and Stefan Müller Arisona and Xianfeng Huang and Michael Batty and Gerhard Schmitt, International Journal of Geographical Information ScienceVolume 28, 2014 - Issue 11. https://doi.org/10.1080/13658816.2014.914521- The Size Distribution of 'Cities' Delineated with a Network Theory-based Method and Smartphone GPS Data by Shota Fujishima, Naoya Fujiwara, Yuki Akiyama, Ryosuke Shibasaki, and Hodaka Kaneda, Center for Spatial Information Science, Discussion Paper http://www.csis.u-tokyo.ac.jp/wp-content/uploads/2018/02/150.pdfThis is not a comprehensive list of the topic(I just searched in a web).Please add a short review in Introduction and give sufficient credits to such previous works.I also think this paper contains original results to be published, because of the new method of community detection with the application to another dataset.2.Concerning the community detection method, I think the reliability is not checked enough to use it for the purposes addressed in this paper.The authors aimed to identify spatial areas, instead of CBSA(core based statistical areas) or MSA(metropolitan areas), for examples.Such areas are the basis of many statistics in US, and should be carefully considered to guarantee the reliability of the detection method.The community detection methods, however, have many problems in general.This is mainly because the likelihood function is multimodal and there are a number of near-optimal partitions, as discussed in literature.- B.H. Good, Y.-A. de Montjoye and A. Clauset, "The performance of modularity maximization in practical contexts." Physical Review E 81, 046106 (2010).- Peel, Leto & B. Larremore, Daniel & Clauset, Aaron. (2016). The ground truth about metadata and community detection in networks. Science Advances. 3. 10.1126/sciadv.1602548.This would be problematic to obtain reliable results.I suspect that the author's optimization method focuses only a single partition that would be optimal, and can not deal with the problem property.Recently, network scientists have examined the ensembles of partitions, not just focusing on the optimal one. For a example, the number of clusters are determined by the model selection or model evidence in Bayesian inference framework.- A review, Tiago P. Peixoto, “Bayesian stochastic blockmodeling”, arXiv: 1705.10225.It would be better if the authors carefully consider this issue and develop their method.At least, the authors should address the issue in Discussion and argue on the potential limitations of their methods.3.The proposed method is not clear, because(i) Two different parts of the method are mixed. The theoretical description of the statistical model and its numerical implementation should be described separately.(ii) The relation to the previous models is unclear.Related to the second point, the proposed method seems to be non-standard in network science.This is okay, but I would like to ask about the resolution limit, that is already studied in the cases of standard methods.This resolution limit is significantly related to the issue of the detection of `Monads', obtained by self-loop consideration.Is the limit is small enough to detect a single node as a community?Moreover, the authors are responsible to allocate their method in literature to join the scientific discussions. The current manuscript is not enough for readers to understand the relationships to previous methods, related to the point (i).For example, could you formulate the model in the Bayesian framework and argue the differences from degree-corrected SBM?4.The details of the proposed model seems to be unnatural to model the commuting data.In the null model, adjacency matrix A and weight matrix W is separately used.(After assigning edges, the weight is allocated only to existing edges.)But, the commuting data is represented only by W, and A is just its binarization by a threshold of 100 commuters.Why not consider a model of W, without A?What is the aim to introduce A?Please explain the logic behind the modelling.In addition, the threshold seems to be ad-hoc.5.Authors argued the novelty of introduction of self-loops to community detection methods.But, I think, this is not novel. The previous studies have also considered it, and in most cases, there is no limitation on self-loops, except that the methods are based on tree-like approximations in sparse graphs.It just depends on datasets in practice.I agree that the self-loops will be important in the commuting data studied in this paper.But, the self-loops will not so important in higher-resolution mobility data, such as 500mx500m grid.Today, because of GPS and other technologies, we are getting such nice datasets, and in the cases, the commuters within the same county are recorded as those between different locations and self-loops will be rare.Please rewrite your arguments on the novelty considering these points.6.I would like to see the result of the community detection in entire US, and visually compare it with the other partitions of MBA, megaregions, and other partitions of standard methods (if possible. e.g. degree-corrected SBM and modularity).Then, the results of the paper would be more effectively presented, and facilitate further scientific discussions.It is possible plot the overlapping communities using pi charts.https://graph-tool.skewed.de/static/doc/_images/lesmis-sbm-marginals.svgReviewer #2: Q1: I think that the method is technically sound from the point of view of the mathematics in it, but I do not think the data support the conclusions. I disagree that including self-edges is novel in the field of community detection (although towards the end in Section 6.4 the authors explain that they mean the use of community detection to study commuting networks). I also believe that one of the biggest points that the work argues is that the proposed method has a high coverage of the commuter travels, but I fail to see why that is the determinant justification of the method working. For example, the coverage can be trivially maximised by taking a single community that covers all of the US, but such a community is obviously not relevant. This fits into the larger issue of validation, of how can the communities found be interpreted (a general problem in the field of community detection) and I think that finding statistically significant will only give you confidence with respect to the thing you are trying to explain (coverage as compared to the null model). I am sceptical that some of the largest communities (such as the light blue in fig 2b) are useful. Finally, I think that the claims made in the conclusion about dramatically changing our understanding of the metropolitan structure of the US are not supported by the work presented before.Q2: I think that the statistical analysis was done rigorously. However, something that is not entirely clear to me from the text is if the CCME algorithm is proposed here or if it has been proposed somewhere else (in which case a citation would be needed). It's not entirely clear why the calculation of the variances is relevant for the analysis and it should be explained in the text (and I am not recommending to remove the calculations). In section 7 the text says that the sensitivity analysis can be found in the supplementary material and this is not the case.Q3: As far as the material that was available to download as an editor, I did not have access to any type of data. This includes data of commuters, networks build, maps, figures and anything that would be needed to reproduce the analysis.Q4: The article is written in generally good English but it has not been revised with care. There are many instances of missing words or extra words, specially towards the end of the article. More citations should be added, for example for the Benjami-Hochberg procedure. One further comment is that the terminology that is new to the article should be presented at the beginning, not in Section 5 (Results) as is the case for non-nodal community, monad, etc. While I agree that it is necessary to distinguish the different types of communties found, I think that in the context of the field is is confusing to use the work cluster to mean a community that contains more than one node. To the best of my knoledge, the word monad has been introduced in this manuscript and this is also a non-standard term. I would prefer the self-explanatory single-node community.**********6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: No[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.3 Oct 2019A separate document that individually addresses the reviewers' concerns is attached to the submission.Submitted filename: RESPONSE_2.pdfClick here for additional data file.20 Dec 2019PONE-D-19-14845R1Demarcating Geographic Regions using Community Detection in Commuting NetworksPLOS ONEDear Dr. Kaza,Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.We would appreciate receiving your revised manuscript by Feb 03 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocolsPlease include the following items when submitting your revised manuscript:A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.We look forward to receiving your revised manuscript.Kind regards,Renaud LambiotteAcademic EditorPLOS ONE[Note: HTML markup is below. Please do not edit.]Reviewers' comments:Reviewer's Responses to QuestionsComments to the Author1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.Reviewer #1: (No Response)Reviewer #2: (No Response)**********2. Is the manuscript technically sound, and do the data support the conclusions?The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.Reviewer #1: YesReviewer #2: Yes**********3. Has the statistical analysis been performed appropriately and rigorously?Reviewer #1: YesReviewer #2: Yes**********4. Have the authors made all data underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: Yes**********5. Is the manuscript presented in an intelligible fashion and written in standard English?PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.Reviewer #1: YesReviewer #2: No**********6. Review Comments to the AuthorPlease use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)Reviewer #1: The revised manuscript has addressed the issues I pointed out, and the authors have responded them carefully. I still have some different opinions on the issues, but I also think that such differences are not critical and should be judged by readers.However, there is one exception related to the point (3). The authors argued that the output of the method is stable if the initial condition is fixed, and the outputs are robust against the change of the initial conditions. This is necessary requirement, but, it is not enough.To clarify the addressed issue, I'd like to focus on a single aspect of the output of the method. This is the number of the detected community by their method. This number is very important and should be determined on a reasonable basis. For example, in DC-SBM, the number can be determined by model selection or by cross validation (in this paper the number is pre-determined by authors in sec. 6. This should be updated, too). On the other hand, in the proposed method, the number of detected communities is determined by just one sample. This would be one of near optimal states of the assumed model, but there would a lot of another optimal states with different numbers of communities. Thus, there is no reasonable validation whether the detected number is good or not.I think this is a difficult problem, and my question did not mean that this issue should be solved perfectly. I also think that the proposed method is good enough to present a new visualization tool of geographic regions. But, this point can be problematic if this paper is still aimed to propose a new identification method for grouping counties for further statistics, and such limitations should be noted.Reviewer #2: I thank the authors for addressing most of my concerns and I think that the manuscript is much improved. I have a few final comments.1. I did not mean that monad should be changed for 'self-explanatory single-node community' which I agree is a mouthful. I was suggesting 'single-node community' and I meant that this term was self-explanatory. I still have the sense that introducing so much new terminology as the authors do makes is less intelligible to their peers. However, this is just a suggestion and I leave it now to their best judgement after having explained myself.2. While the manuscript is much improved in the explanation of the terms, there are still a couple of things that are confusing. In the abstract the term 'non-trivial' to refer to self-loops is really confusing. What is a trivial self-loop? We have to wait until page 12 to find out, which is very bad practice for an abstract. Idem for the term propensity. It is defined in page 6 but discussed much before then. This is also not a standard term, and I would suggest removing it entirely and explaining instead that you mean the proportion of the inter-county commuting flow compared to the total outward flow.3. The authors discuss the other community detection methods in terms of a comparison with a null model. They imply that this is done within the SBM framework, but this is not the case. In this framework we decide to model the data with a particular type of model, and then we fit the model parameters. In page 5 they talc about the 'structured SBM' and I don't know what structured means here.4. I would swap the sections 4.1 and 4.2 because the notation comes before it is needed, and more discussion follows.5. A minor point, but I see no technical reason why the degree of node u needs to exclude self-loops (the degree is fact is almost not used). To make this more similar to the strength which does include node u, it is convenient to define A_uu = 2 as a convention, and in this case the normal degree formulas work and the degree of each node (with self-loops) simply increases by 2.6. While some changes have been made and the authors acknowledge that the SBM methodology accomodates for self-edges, the text still says that most other methods and in particular modularity in general does not account for self-loops. This is not the case, and in fact the Newman-Girvan null model implicitly allows self-loops.7. In page 10 there are two instance of \\hat{\\rho}_u that seem to be referring to two different quantities, an empirical and a simulated one.8. Figure 6 should be moved before the references. This figure really adds to the manuscripts and in particular it is easy to see that the other methods find a very different structure (for example the 'background noise' that their method disregards). I would point out in the text in section 6.2 that the methods give non-overlapping communities (i.e. partitions as opposed to covers) (althought the SBM can be used to get covers, but this is probably much more work and not necessary).9. I did not understand the point of section 7.4, which is completely out of place with the discussion before and after (notably with the definition of modularity).10. Final comment, the text still needs more a good proofread. There a new errors (still missing or extra words), and in the new section 2.2 the authors cannot use '[2]' as a subject, but they can write 'In reference '[2]', the authors do X and Y'. The numbers cannot be used as a subject in the sentence.**********7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: No[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.4 Feb 2020Please refer to the Response to the Reviewers document.Submitted filename: PLOS_ReSub2_RevComments2.docxClick here for additional data file.13 Mar 2020Demarcating Geographic Regions using Community Detection in Commuting Networks with Significant Self-LoopsPONE-D-19-14845R2Dear Dr. Kaza,We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.With kind regards,Akbar AliAcademic EditorPLOS ONEAdditional Editor Comments (optional):Reviewers' comments:3 Apr 2020PONE-D-19-14845R2Demarcating Geographic Regions using Community Detection in Commuting Networks with Significant Self-LoopsDear Dr. Kaza:I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.For any other questions or concerns, please email plosone@plos.org.Thank you for submitting your work to PLOS ONE.With kind regards,PLOS ONE Editorial Office Staffon behalf ofDr. Akbar AliAcademic EditorPLOS ONE