Kai Zhao1, Mirco Musolesi2, Pan Hui3, Weixiong Rao4, Sasu Tarkoma1. 1. 1] Department of Computer Science, University of Helsinki, Helsinki, Finland [2] Helsinki Institute for Information Technology, HIIT, Helsinki, Finland. 2. School of Computer Science, University of Birmingham, Birmingham, UK. 3. Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong. 4. School of Software Engineering, Tongji University, Shanghai, China.
Abstract
Human mobility has been empirically observed to exhibit Lévy flight characteristics and behaviour with power-law distributed jump size. The fundamental mechanisms behind this behaviour has not yet been fully explained. In this paper, we propose to explain the Lévy walk behaviour observed in human mobility patterns by decomposing them into different classes according to the different transportation modes, such as Walk/Run, Bike, Train/Subway or Car/Taxi/Bus. Our analysis is based on two real-life GPS datasets containing approximately 10 and 20 million GPS samples with transportation mode information. We show that human mobility can be modelled as a mixture of different transportation modes, and that these single movement patterns can be approximated by a lognormal distribution rather than a power-law distribution. Then, we demonstrate that the mixture of the decomposed lognormal flight distributions associated with each modality is a power-law distribution, providing an explanation to the emergence of Lévy Walk patterns that characterize human mobility patterns.
Human mobility has been empirically observed to exhibit Lévy flight characteristics and behaviour with power-law distributed jump size. The fundamental mechanisms behind this behaviour has not yet been fully explained. In this paper, we propose to explain the Lévy walk behaviour observed in human mobility patterns by decomposing them into different classes according to the different transportation modes, such as Walk/Run, Bike, Train/Subway or Car/Taxi/Bus. Our analysis is based on two real-life GPS datasets containing approximately 10 and 20 million GPS samples with transportation mode information. We show that human mobility can be modelled as a mixture of different transportation modes, and that these single movement patterns can be approximated by a lognormal distribution rather than a power-law distribution. Then, we demonstrate that the mixture of the decomposed lognormal flight distributions associated with each modality is a power-law distribution, providing an explanation to the emergence of Lévy Walk patterns that characterize human mobility patterns.
Understanding human mobility is crucial for epidemic control1234,
urban planning56, traffic forecasting systems78
and, more recently, various mobile and network applications910111213.
Previous research has shown that trajectories in human mobility have statistically
similar features as Lévy Walks by studying the traces of bank notes14, cell phone users' locations15 and GPS16171819. According to the this model, human movement contains
many short flights and some long flights, and these flights follow a power-law
distribution.Intuitively, these long flights and short flights reflect different transportation
modalities. Figure 1 shows a person's one-day trip
with three transportation modalities in Beijing based on the Geolife dataset
(Table 1)202122. Starting from
the bottom right corner of the figure, the person takes a taxi and then walks
to the destination in the top left part. After two hours, the person takes
the subway to another location (bottom left) and spends five hours there.
Then the journey continues and the person takes a taxi back to the original
location (bottom right). The short flights are associated with walking and
the second short-distance taxi trip, whereas the long flights are associated
with the subway and the initial taxi trip. Based on this simple example, we
observe that the flight distribution of each transportation mode is different.
Figure 1
Illustration of a synthetic trail (taxi, walk, subway, walk, taxi,
walk) for one day trip and their corresponding flights.
This figure shows that the flight distribution of each transportation
mode (walk, taxi, subway) is very different.
Table 1
The Geolife and the Nokia
MDC Human Mobility Datasets
Geolife
Nokia MDC
Location
Beijing
Geneva
Measurement
GPS
GPS
Number of samples
24,876,978
11,077,061
Duration
5 years
1.5 year
Accuracy
3 m
3 m
Sampling interval
1–5 s
10 s
Number of participants
182
200
Number of flights with transportation mode
202,702
224,723
In this paper, we study human mobility with two large GPS datasets, the
Geolife and Nokia MDC datasets (approximately 20 million and 10 million GPS
samples respectively), both containing transportation mode information such
as Walk/Run, Bike, Train/Subway or Car/Taxi/Bus. The four transportation modes
(Walk/Run, Bike, Train/Subway and Car/Taxi/Bus) cover the most frequently
used human mobility cases. First, we simplify the trajectories obtained from
the datasets using a rectangular model, from which we obtain the flight length16. Here a flight is the longest straight-line trip from one point
to another without change of direction1619. One trail from
an origin to a destination may include several different flights (Fig.
1). Then, we determine the flight length distributions for different
transportation modes. We fit the flight distribution of each transportation
mode according to the Akaike information criteria23 in order
to find the best fit distribution.We show that human movement exhibiting different transportation modalities
is better fitted with the lognormal distribution rather than the power-law
distribution. Finally, we demonstrate that the mixture of these transportation
mode distributions is a power-law distribution based on two new findings:
first, there is a significant positive correlation between consecutive flights
in the same transportation mode, and second, the elapsed time in each transportation
mode is exponentially distributed.The contribution of this paper is twofold. First, we extract the distribution
function of displacement with different transportation modes. This is important
for many applications7891011. For example, a population-weighted
opportunities (PWO) model24 has been developed to predict human
mobility patterns in cities. They find that there is a relatively high mobility
at the city scale due to highly developed traffic systems inside cities. Our
results significantly deepen the understanding of urban human mobilities with
different transportation modes. Second, we demonstrate that the mixture of
different transportations can be approximated as a truncated Lévy Walk.
This result is a step towards explaining the emergence of Lévy Walk
patterns in human mobility.
Results
Power-law fit for overall flight
First, we fit the flight length distribution of the Geolife and Nokia MDC
datasets regardless of transportation modes (see Methods section). We fit
truncated power-law, lognormal, power-law and exponential distribution (see Supplementary Table S1). We find that the overall flight length
(l) distributions fit a truncated power-law P(l) ∝ l
with exponent α as 1.57 in the Geolife dataset (γ =
0.00025) and 1.39 in the Nokia MDC dataset (γ = 0.00016) (Fig. 2), better than other alternatives such as power-law,
lognormal or exponential. Figure 2 illustrates the PDFs
and their best fitted distributions according to Akaike weights. The best
fitted distribution (truncated power-law) is represented as a solid line and
the rest are dotted lines. We use logarithm bins to remove tail noises1625. Our result is consistent with previous research (Refs. 14–19), and the exponent α is close to
their results.
Figure 2
Power-law fit for overall flight.
(a–b) Power-law fitting of all flights regardless of transportation
modes in the Geolife and the Nokia MDC dataset. The green points refer to
the flight length samples obtained from the GeoLife and the Nokia MDC dataset,
while the solid red line represents the best fitted distribution according
to Akaike weights. The overall flight length distribution regardless of transportation
modes is well fitted with a truncated power-law distribution.
We show the Akaike weights for all fitted distributions in the Supplementary
Table S2. The Akaike weight is a value between 0 and 1. The larger it
is, the better the distribution is fitted2325. The Akaike
weights of the power-law distributions regardless of transportation modes
are 1.0000 in both datasets. The p-value is less than 0.01 in all our tests,
which means that our results are very strong in terms of statistical significance.
Note that here the differences between fitted distributions are not remarkable
as shown in the Fig. 2, especially between the truncated
power-law and the lognormal distribution. We use the loglikelihood ratio to
further compare these two candidate distributions. The loglikelihood ratio
is positive if the data is more likely in the power-law distribution, and
negative if the data is more likely in the lognormal distribution. The loglikelihood
ratio is 1279.98 and 3279.82 (with the significance value p < 0.01)
in the Geolife and the NokiaMDC datasets respectively, indicating that the
data is much better fitted with the truncated power-law distribution.
Lognormal fit for single transportation mode
However, the distribution of flight lengths in each single transportation
mode is not well fitted with the power-law distribution. Instead, they are
better fitted with the lognormal distribution (see Supplementary
Table S2). All the segments of each transportation flight length are
best approximated by the lognormal distribution with different parameters.
In Fig. 3 and Supplementary Fig. S1,
we represent the flight length distributions of Walk/Run, Bike, Subway/Train
and Car/Taxi/Bus in the Geolife and the Nokia MDC dataset correspondingly.
The best fitted distribution (lognormal) is represented as a solid line and
the rest are dotted lines.
Figure 3
Lognormal fit for single transportation mode in the Geolife dataset.
(a–d) Flight distribution of all transportation modes (Car/Taxi/Bus,
Walk/Run, Subway/Train, Bike). The green points refer to the flight length
samples obtained from the GeoLife, while the solid blue line represents the
best fitted distribution according to Akaike weights. The flight length distribution
in each transportation mode is well fitted with a lognormal distribution.
Table 2 shows the fitted parameter for all the
distributions (α in the truncated power-law, μ and σ
in the lognormal). We can easily find that the μ is increasing
over these transportation modes (Walk/Run, Bike, Car/Taxi/Bus and Subway/Train),
identifying an increasing average distance. Compared to Walk/Run, Bike or
Car/Taxi/Bus, the flight distribution in Subway/Train mode is more right-skewed,
which means that people usually travel to a more distant location by Subway/Train.
Table 2
Parameters of fitted
distributions in the GeoLife and in the Nokia MDC datasets. The p-value is
less than 0.01 in all the fitted distributions, identifying a strong statistical
significance
Dataset
Transportation Mode
Fitted Distribution
p
Parameters
GeoLife
Overall
Truncated Power-law
0.00
α = 1.57, γ = 0.00025
Walk/Run
Lognormal
0.00
μ = 4.08, σ = 0.76
Bike
Lognormal
0.00
μ = 5.03, σ = 0.68
Car/Bus/Taxi
Lognormal
0.00
μ = 5.78, σ = 1.04
Subway/Train
Lognormal
0.00
μ = 7.27, σ = 0.51
Nokia MDC
Overall
Truncated Power-law
0.00
α = 1.39, γ = 0.00016
Walk/Run
Lognormal
0.00
μ = 4.58, σ = 1.09
Bike
Lognormal
0.00
μ = 5.80, σ = 1.08
Car/Bus/Taxi
Lognormal
0.00
μ = 6.89, σ = 0.91
Subway/Train
Lognormal
0.00
μ = 6.93, σ = 0.94
It must be noted that our findings for the Car/Taxi/Bus mode are different
from these recent research results2627, which also investigated
the case of a single transportation mode, and found that the scaling of human
mobility is exponential by examining taxi GPS datasets. The differences are
mainly because few people tend to travel a long distance by taxi due to economic
considerations. So the displacements in their results decay faster than those
measured in our Car/Taxi/Bus mode cases.
Mechanisms behind the power-law pattern
We characterize the mechanism of the power-law pattern with Lévy
flights by mixing the lognormal distributions of the transportation modes.
Previous research has shown that a mixture of lognormal distributions based
on an exponential distribution is a power-law distribution30313233.
Based on their findings, we demonstrate that the reason that human movement
follows the Lévy Walk pattern is due to the mixture of the transportation
modes they take.We demonstrate that the mixture of the lognormal distributions of different
transportation modes (Walk/Run, Bike, Train/Subway or Car/Taxi/Bus) is a power-law
distribution given two new findings: first, we define the change rate as the
relative change of length between two consecutive flights with the same transport
mode. The change rate in the same transportation mode is small over time.
Second, the elapsed time between different transportation modes is exponentially
distributed.
Lognormal in the same transportation mode
Let us consider a generic flight l. The flight length
at next interval of time l+1, given the change
rate c+1, isIt has been found that the change rate c in the same
transportation mode is small over time922. The change rate c
reflects the correlation between two consecutive displacements in one trip.
To obtain the pattern of correlation between consecutive displacements in
each transportation mode, we plot the flight length point (l, l+1)
from the GeoLife dataset (Fig. 4). Here l
represents the t-th flight length and l+1
represents the t + 1-th flight length in a consecutive trajectory in
one transportation mode34. Figure 4 shows
the density of flight lengths correlation in Car/Taxi/Bus, Walk/Run, Subway/Train
and Bike correspondingly. (l, l+1)
are posited near the diagonal line, which identifies a clear positive correlation.
Similar results are also found in the Nokia MDC dataset (see Supplementary
Fig. S2).
Figure 4
Flight length correlation for each transportation mode.
(a–d) Consecutive Flight length correlation of all transportation
modes (Car/Taxi/Bus, Walk/Run, Subway/Train, Bike) in the GeoLife dataset.
A high density of points are near diagonal line l = l
+ 1, identifying a small difference l+1 − l
in the same transportation mode between two time steps.
We use the Pearson correlation coefficient to quantify the strength of
the correlation between two consecutive flights in one transportation mode35. The value of Pearson correlation coefficient r is shown
in the Supplementary Table S3. The p value is less
than 0.01 in all the cases, identifying very strong statistical significances. r
is positive in each transportation mode and ranges from 0.3640 to 0.6445,
which means that there is a significant positive correlation between consecutive
flights in the same transportation mode, and the change rate c
in the same transportation mode between two time steps is small.The difference c in the same transportation mode between
two time steps is small due to a small difference l+1 − l
in consecutive flights. We sum all the contributions as follows:We plot the change rate samples c of the Car/Taxi/Bus
mode from the Geolife dataset as an example in Supplementary
Fig. S3. We observe that the change rate c fluctuates
in an uncorrelated fashion from one time interval to the other in one transportation
mode due to the unpredictable character of the change rate. The Pearson correlation
coefficient accepts the findings at the 0.03–0.13 level with p-value
less than 0.05 (see Supplementary Table S4). By the Central
Limit Theorem, the sum of the change rate c is normally
distributed with the mean μT and the variance σ2T,
where μ and σ2 are the mean and variance
of the change rate c and T is the elapsed time.
Then we can assert that for every time step t, the logarithm of l
is also normally distributed with a mean μt and variance σ2t 36. Note here that l is the length of the flight
at the time T after T intervals of elapsed time. In the same
transportation mode, the distribution of the flight length with the same change
rate mean is lognormal, its density is given bywhich
corresponds to our findings that in each single transportation mode the flight
length is lognormal distributed.
Transportation mode elapsed time
We define elapsed time as the time spent in a particular transportation
mode; we found that it is exponentially distributed. For example, the trajectory
samples shown in Fig. 1 contain six trajectories with
three different transportation modes, (taxi, walk, subway, walk, taxi, walk).
Thus the elapsed time also consists of six samples (t1, t1, t1, t2, t2, t3).
The elapsed time t is weighted exponentially between the different
transportation modes (see Supplementary Fig. S4). Similar
results are also reported in Refs. 26, 37. The exponentially weighted time interval is mainly
due to a large portion of Walk/Run flight intervals. Walk/Run is usually a
connecting mode between different transportation modes (e.g., the trajectory
samples shown in Fig. 1), and Walk/Run usually takes
much shorter time than any other modes. Thus the elapsed time decays exponentially.
For example, 87.93% of the walk flight distance connecting other transportation
modes is within 500 meters and the travelling time is within 5 minutes in
the Geolife dataset.
Mixture of the transportation modes
Given these lognormal distributions P(l)
in each transportation mode and the exponential elapsed time t between
different modes, we make use of mixtures of distributions. We obtain the overall
human mobility probability by considering that the distribution of flight
length is determined by the time t, the transportation mode change
rate c mean μ and variance σ2.
We obtain the distribution of single transportation mode distribution with
the time t, the change rate mean μ and variance σ2
fixed. We then compute the mixture over the distribution of t since t
is exponentially distributed over different transportation modes with an exponential
parameter λ. If the distribution of l, p(l, t),
depends on the parameter t. t is also distributed according
to its own distribution r(t). Then the distribution of l, p(l)
is given by . Here the t in p(l, t)
is the same as the t in the r(t). r(t)
is the exponential distribution of elapsed time t with an exponential
parameter λ.So the mixture (overall flight length P(l))
of these lognormal distributions in one transportation mode given an exponential
elapsed time (with an exponent λ) between each transportation mode
iswhich can be calculated to givewhere the power law exponent α′ is determined by 313233. The calculation to obtain α′
is given in Supplementary Note 1. If we substitute the
parameters presented in Table 2, we will get the α′
= 1.55 in the Geolife dataset, which is close to the original parameter α
= 1.57, and α′ = 1.40 in the Nokia MDC dataset, which is
close to the original parameter α = 1.39. The result verifies that
the mixture of these correlated lognormal distributed flights in one transportation
mode given an exponential elapsed time between different modes is a truncated
power-law distribution.
Discussion
Previous research suggests that it might be the underlying road network
that governs the Lévy flight human mobility, by exploring the human
mobility and examining taxi traces in one city in Sweden19.
To verify their hypothesis, we use a road network dataset of Beijing containing
433,391 roads with 171,504 conjunctions and plot the road length distribution162829. As shown in Supplementary Fig. S5,
the road length distribution is very different to our power-law fit in flights
distribution regardless of transportation modes. The α in road
length distribution is 3.4, much larger than our previous findings α
= 1.57 in the Geolife and α = 1.39 in the Nokia MDC. Thus the underlying
street network cannot fully explain the Lévy flight in human mobility.
This is mainly due to the fact that it does not consider many long flights
caused by metro or train, and people do not always turn even if they arrive
at a conjunction of a road. Thus the flight length tails in the human mobility
should be much larger than those in the road networks.
Methods
Data Sets
We use two large real-life GPS trajectory datasets in our work, the Geolife
dataset20 and the Nokia MDC dataset38. The key
information provided by these two datasets is summarized in Table
1. We extract the following information from the dataset: flight lengths
and their corresponding transportation modes.Geolife202122 is a public dataset with 182 users'
GPS trajectory over five years (from April 2007 to August 2012) gathered mainly
in Beijing, China. This dataset contains over 24 million GPS samples with
a total distance of 1,292,951 kilometers and a total of 50,176 hours.
It includes not only daily life routines such as going to work and back home
in Beijing, but also some leisure and sports activities, such as sightseeing,
and walking in other cities. The transportation mode information in this dataset
is manually logged by the participants.The Nokia MDC dataset38 is a public dataset from Nokia Research
Switzerland that aims to study smartphone user behaviour. The dataset contains
extensive smartphone data of two hundred volunteers in the Lake Geneva region
over one and a half years (from September 2009 to April 2011). This dataset
contains 11 million data points and the corresponding transportation modes.
Obtaining Transportation Mode and The Corresponding Flight Length
We categorize human mobility into four different kinds of transportation
modality: Walk/Run, Car/Bus/Taxi, Subway/Train and Bike. The four transportation
modes cover the most frequently used human mobility cases. To the best of
our knowledge, this article is the first work that examines the flight distribution
with all kinds of transportation modes in both urban and intercity environments.
In the Geolife dataset, users have labelled their trajectories with transportation
modes, such as driving, taking a bus or a train, riding a bike and walking.
There is a label file storing the transportation mode labels in each user's
folder, from which we can obtain the ground truth transportation mode each
user is taking and the corresponding timestamps. Similar to the Geolife dataset,
there is also a file storing the transportation mode with an activity ID in
the Nokia MDC dataset. We treat the transportation mode information in these
two datasets as the ground truth.In order to obtain the flight distribution in each transportation mode,
we need to extract the flights. We define a flight as the longest straight-line
trip from one point to another without change of direction1619.
One trail from an original to a destination may include several different
flights (Fig. 1). In order to mitigate GPS errors, we
recompute a position by averaging samples (latitude, longitude) every minute.
Since people do not necessarily move in perfect straight lines, we need to
allow some margin of error in defining the ‘straight’ line. We
use a rectangular model to simplify the trajectory and obtain the flight length:
when we draw a straight line between the first point and the last point, the
sampled positions between these two endpoints are at a distance less than
10 meters from the line. The same trajectory simplification mechanism has
been used in other articles which investigates the Lévy walk nature
of human mobility16. We map the flight length with transportation
modes according to timestamp in the Geolife dataset and activity ID in the
Nokia MDC dataset and obtain the final (transportation mode, flight length)
patterns. We obtain 202,702 and 224,723 flights with transportation mode knowledge
in the Geolife and Nokia MDC dataset, respectively.
Identifying the Scale Range
To fit a heavy tailed distribution such as a power-law distribution, we
need to determine what portion of the data to fit (x)
and the scaling parameter (α). We use the methods from Refs. 30, 39 to determine x
and α. We create a power-law fit starting from each value in the
dataset. Then we select the one that results in the minimal Kolmogorov-Smirnov
distance, between the data and the fit, as the optimal value of x.
After that, the scaling parameter α in the power-law distribution
is given bywhere x
are the observed values of x > x
and n is number of samples.
Akaike weights
We use Akaike weights to choose the best fitted distribution. An Akaike
weight is a normalized distribution selection criterion23.
Its value is between 0 and 1. The larger the value is, the better the distribution
is fitted.Akaike's information criterion (AIC) is used in combination with Maximum
likelihood estimation (MLE). MLE finds an estimator of that maximizes the likelihood function of one distribution. AIC is used to describe the best
fitting one among all fitted distributions,Here
K is the number of estimable parameters in the approximating model.After determining the AIC value of each fitted distribution, we normalize
these values as follows. First of all, we extract the difference between different
AIC values called Δ,Then Akaike weights W are calculated as follows,
Authors: Moritz U G Kraemer; Adam Sadilek; Qian Zhang; Nahema A Marchal; Gaurav Tuli; Emily L Cohn; Yulin Hswen; T Alex Perkins; David L Smith; Robert C Reiner; John S Brownstein Journal: Nat Hum Behav Date: 2020-05-18
Authors: Robert J Hardwick; Carolin Vegvari; Benjamin Collyer; James E Truscott; Roy M Anderson Journal: J Theor Biol Date: 2021-04-23 Impact factor: 2.691