Marcello Carammia, Stefano Maria Iacus, Teddy Wilkin.
Abstract
The sudden and unexpected migration flows that reached Europe during the so-called 'refugee crisis' of 2015-2016 left governments unprepared, exposing significant shortcomings in the field of migration forecasting. Forecasting asylum-related migration is indeed problematic. Migration is a complex system, drivers are composite, measurement incorporates uncertainty, and most migration theories are either under-specified or hardly actionable. As a result, approaches to forecasting generally focus on specific migration flows, and the results are often inconsistent and difficult to generalise. Here we present an adaptive machine learning algorithm that integrates administrative statistics and non-traditional data sources at scale to effectively forecast asylum-related migration flows. We focus on asylum applications lodged in countries of the European Union (EU) by nationals of all countries of origin worldwide, but the same approach can be applied in any context provided adequate migration or asylum data are available. Uniquely, our approach (a) monitors drivers in countries of origin and destination to detect early onset change; (b) models individual country-to-country migration flows separately and on moving time windows; (c) estimates the effects of individual drivers, including lagged effects; (d) delivers forecasts of asylum applications up to four weeks ahead; (e) assesses how patterns of drivers shift over time to describe the functioning and change of migration systems. Our approach draws on migration theory and modelling, international protection, and data science to deliver what is, to our knowledge, the first comprehensive system for forecasting asylum applications based on adaptive models and data at scale. Importantly, this approach can be extended to forecast other social processes.
Year: 2022 PMID: 35087096 PMCID: PMC8795256 DOI: 10.1038/s41598-022-05241-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The Early Warning and Forecasting System workflow, showing four categories of input variables (1–4); processing of datasets (5); storage (6); early warning in the form of change alerts for each time series, lead-lags, and correlation matrices (7), from which selected drivers (8) move to the forecasting section (9), where they are trained (10) and forecasted (11) in order to eventually forecast the outcome variable, that is, applications for asylum (12).
Google Trends topics (clusters of relevant keywords).
| Migration/travel | Asylum | Transit | Destination |
|---|---|---|---|
| Passport | Refugee | Egypt | Cyprus |
| Travel | Right of Asylum | Iraq | France |
| Travel visa | | Jordan | Germany |
| | | Lebanon | Greece |
| | | Turkey | Italy |
| | | | Spain |
| | | | European Union |
Figure 2. Early Warning Summary, week ending on 10/06/2018. (a) Early Warning Signals Table. Countries of origin included (rows): Afghanistan, Iran, Iraq, Albania, Eritrea, Georgia, Nigeria, Pakistan, Russia, Syria, Turkey, Venezuela. For each country, the table first shows the number of applicants in the EU in the last month, the number of alerting signals observed, and the trend of applications in the EU+ in the previous month. The degree of warning for each covariate (columns) is then shown, from L0 (no warning) to L3 (max warning). Covariates included in the table are event macro-categories (conflicts, governance, political events, social unrest, economic events) and Google Trends topics [searches related to countries of destination (Germany, Italy, Greece, France, Spain, EU) and migration (passports, travel, refugee)]. The table identifies the time series that deserve closer inspection. (b) Iran, week ending on 10/06/2018. Radar plots of the relative level of activity of single covariates in the early warning window, here set to one month, compared to the entire period of analysis: GDELT events and Google Trends searches, level during the early warning window relative to each series’ past values (left); Google Trends relative volume of searches (middle); GDELT event indexes, relative level of activity (right). All series are rescaled to 0–100%. (c) Iran, week ending on 10/06/2018. Time series with signals for individual covariates. In this figure: Google Trends searches for the “Refugee” topic in Iran and Frontex’s “Illegal Border Crossings at the Western Balkan Route” of Iranian nationals. Top of each panel: summary statistics, recent “momentum” signals, and change point statistics for the mean and the variance. Middle panel: data for the early warning windows with signals and change point analysis. Bottom panel: cumulative rolling variance to check for instability of the time series. (d) Iran, week ending on 10/06/2018. Correlation matrices, with (right quadrant) and without (left quadrant) shifting the time series for the optimal lag. At the optimal lag many correlation effects emerge, as shown by the increased density of the lagged correlation plot.
Figure 3. Forecast of applications by Afghan nationals in all EU+ countries for the four weeks following 30/12/2018. (a,b) DynENet model; (c,d) ARIMA model. Panels (a,c) show the full series, while (b,d) zoom in on the period starting with the forecast. Weeks are shown on the x axes, and the number of applications lodged by Afghan nationals on the y axes. The green line shows the number of applications lodged up to the point at which the forecast is launched. The forecast is represented by the red line. The blue dotted line shows the actual number of applications lodged over the forecast period (and afterwards). The chosen week is a very difficult test for both models: the series drops sharply at the end of the year, when few applications were processed, and rebounds shortly after. The DynENet model copes better than ARIMA with the anomaly.
Figure 4. Back-testing performance of the system for forecasted applications by Syrians in Germany. (a) The black line shows the actual number of applications lodged by Syrian nationals in Germany. The dotted blue line is the moving average of the process. The red dashed line shows the DynENet 4-week-ahead forecast at each time point. The pink shaded area represents a confidence band of ±2 standard errors around the moving average. (b–d) Summary statistics for the relative error (b,d) and for the absolute error (c). ARIMA, which is based only on the autocorrelation of the applications time series, is used as a benchmark model.
Figure 5. Predictors of asylum applications lodged by Syrian nationals in Germany in the period considered. The model adapts to the changing nature of the country-to-country dyad. The effect of some predictors is persistent in the first period observed (bottom-left); subsequently, the effect of those predictors fades and other predictors become important (upper right). The vertical axis shows all the predictors that have been selected by DynENet; the variables not shown were dropped by the forecasting model. The horizontal axis is the (weekly) timeline of the training period. Coloured cells denote the activation of given predictors at given weeks. The colour scale represents the relative importance of predictors, evaluated through a Random Forest algorithm on the restricted model selected by DynENet, from 0 (white: predictor not selected in that week) to 1 (red: most important).
Parameters for the early warning function.
| Parameter | Default value | Details |
|---|---|---|
| country | No default value | Two-character ISO code for the country of origin |
| cv.thr | 0.05 | Threshold on the coefficient of variation. Time series with coefficient of variation below the threshold are excluded from the analysis for this country |
| ibc.thr | 100 | IBC data threshold: if the maximal value of a specific IBC time series is below the threshold, the related data are dropped from the analysis |
| applicant.thr | 100 | EPS applicant data threshold: if the maximal value of a specific EPS applicant time series is below the threshold, the related data are dropped from the analysis |
| na.th | 0.3 | If more than na.th × 100% of the values in a time series are missing, the series is considered unreliable and dropped from the analysis |
| write.db | FALSE | Whether to write the results to a database rather than to files. Currently, only FALSE is available, although a subset of the data needed for forecasting is always stored in the backend database |
| refDate | Sys.Date() | The final date of the analysis |
| ma1 | 6 | Length of the first moving average (in weeks) |
| ma2 | 24 | Length of the second moving average (in weeks) |
| ma.th | 1.1 | Threshold on the ratio of the first to the second moving average. If ma1/ma2 > ma.th, the signal is fired |
| clean.w | 6 | Data cleaning threshold, in months. All the dropping/cleaning pre-analysis is done only for the last window of data, i.e. the last 6 months. For example, if the maximal value of the IBC data in the last clean.w months is less than ibc.thr, the time series will be dropped |
| alert.w | 12 | Reference window to analyse the signals (in months) |
| back.w | 24 | Number of past months to consider in the analysis |
| pvalue | 0.05 | p-value threshold for assessing statistically significant structural change points in time series |
| llag.th | 0.05 | p-value threshold for assessing statistically significant lead-lag effects |
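
The filtering and momentum rules in the table above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`keep_series`, `momentum_signal`), the use of `None` for missing weeks, and the choice of the population standard deviation for the coefficient of variation are all assumptions made for illustration. Only the thresholds and the rule "fire if ma1/ma2 > ma.th" come from the table.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the absolute mean (0.0 if the mean is zero)."""
    m = mean(values)
    return pstdev(values) / abs(m) if m != 0 else 0.0


def keep_series(series, cv_thr=0.05, na_th=0.3):
    """Apply the na.th and cv.thr filters to a weekly series (None = missing)."""
    missing_share = sum(v is None for v in series) / len(series)
    if missing_share > na_th:  # too many gaps: series is dropped
        return False
    observed = [v for v in series if v is not None]
    if len(observed) < 2:
        return False
    # Series that barely vary (CV below the threshold) are excluded.
    return coefficient_of_variation(observed) >= cv_thr


def momentum_signal(series, ma1=6, ma2=24, ma_th=1.1):
    """Fire when the short moving average exceeds the long one by the ratio ma_th."""
    # Simplification (assumption): missing weeks are skipped, not interpolated.
    observed = [v for v in series if v is not None]
    if len(observed) < ma2:
        return False
    short_ma = mean(observed[-ma1:])  # last ma1 weeks
    long_ma = mean(observed[-ma2:])   # last ma2 weeks
    return long_ma != 0 and short_ma / long_ma > ma_th


flat = [100] * 24
rising = [100] * 18 + [200] * 6
print(keep_series(flat), momentum_signal(flat))
print(keep_series(rising), momentum_signal(rising))
```

In this sketch a constant series is filtered out because its coefficient of variation is zero, while a series that doubles in its last six weeks passes the filter and fires the momentum signal (short average 200 against a long average of 125).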
Parameters for the forecast function.
| Argument | Default value | Description |
|---|---|---|
| country | No default value | Two-character ISO code for the country of origin (CoO) |
| final.date | No default value | Should be in the format “YYYY-MM-DD” |
| start.date | "2017-01-01" | Start date of the back-testing meta-analysis |
| n.ahead | 4 | Forecast horizon: number of periods ahead to predict (in weeks) |
| prediction.win | 12 | Length of the data window used for the predictive model (in weeks) |
| alpha | 0.5 | ElasticNet parameter (see below) |
| burn | 12 | Number of observations used in the local predictive statistical models (in weeks) |
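
For orientation on the `alpha` argument, the Elastic Net objective is sketched here in the parameterisation used by common implementations such as glmnet. Whether the system uses exactly this convention is an assumption, as are the symbols: $N$ observations in the training window, outcome $y_i$, predictor vector $x_i$, coefficients $\beta_0, \beta$, and overall penalty strength $\lambda$.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\right]
```

Under this convention, $\alpha = 1$ gives the lasso penalty, $\alpha = 0$ gives the ridge penalty, and the default $\alpha = 0.5$ mixes the two, so that some coefficients can be shrunk exactly to zero while correlated predictors are still shrunk jointly.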