Neeti Pokhriyal, Damien Christophe Jacques.
Abstract
More than 330 million people are still living in extreme poverty in Africa. Timely, accurate, and spatially fine-grained baseline data are essential to determining policy in favor of reducing poverty. The potential of "Big Data" to estimate socioeconomic factors in Africa has been proven. However, most current studies are limited to using a single data source. We propose a computational framework to accurately predict the Global Multidimensional Poverty Index (MPI) at the finest spatial granularity and coverage of 552 communes in Senegal using environmental data (related to food security, economic activity, and accessibility to facilities) and call data records (capturing individualistic, spatial, and temporal aspects of people). Our framework is based on Gaussian Process regression, a Bayesian learning technique that provides the uncertainty associated with each prediction. We perform model selection using elastic net regularization to prevent overfitting. Our results empirically prove the superior accuracy when using disparate data (Pearson correlation of 0.91). Our approach is used to accurately predict important dimensions of poverty: health, education, and standard of living (Pearson correlation of 0.84-0.86). All predictions are validated using deprivations calculated from census. Our approach can be used to generate poverty maps frequently, and its diagnostic nature is likely to assist policy makers in designing better interventions for poverty eradication.
Keywords: Gaussian process; mobile phone; poverty mapping; remote sensing
Year: 2017 PMID: 29087949 PMCID: PMC5699027 DOI: 10.1073/pnas.1700319114
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Brief review of poverty estimation methods based on environmental data
| Ref. | Poverty variable | Model | Important variables | Main conclusions | Region |
| ( | Daily consumption expenditure | Regression, correlation | Indoor air pollution (wood/charcoal use), access to clean water, no sanitation, diarrhea, outdoor air pollution (number of deaths from PM10) | Substantial variability across countries | Cambodia, Lao PDR, Vietnam |
| ( | Per capita income | Regression | Mean road density, share in internal revenue allotment, agrarian reform accomplishment rate, population growth, distance to major cities, mean elevation, percentage of slope with agricultural limitations, and mean annual rainfall | Spatial variation in poverty is mainly caused by disparities in access to road infrastructure | Philippines |
| ( | Food expenditure | Regression and clustering | Proportion of irrigation land, average landholding sizes | Poverty maps show significant spatial clustering of poor and nonpoor areas | Sri Lanka |
| ( | Per capita expenditure | Spatial regression | Slope, soil type, distance/travel time to public resources, elevation, type of land use, demographic variables | Increasing access to roads and improving soil conditions would result in decline in poverty | Kenya |
| ( | Per capita expenditure | Regression | Distance to town, soil quality, slope | Poverty in the remote areas is linked to low agricultural potential and lack of market access | Vietnam |
| ( | Household expenditure | Discriminant analysis | Distance to market, agro-climatic variables, disease risk, livestock density | Satellite-derived variables tended to dominate the list of selected variables that determine poverty predictions | Uganda |
| ( | Household consumption expenditure, asset wealth | Transfer learning (deep learning) | Roofing material, distance to urban areas | Interesting potential of machine learning method using limited training data | Nigeria, Tanzania, Uganda, Malawi, Rwanda |
| ( | Household expenditure | Spatial regression, geographically weighted regression | Crop diversity, education, nonagricultural economic activities | Spatial nonstationarity of the relationship between poverty and its determinants | Malawi |
| ( | Household income | Geographically weighted regression | Education, accessibility, and services | High poverty incidence that corresponds with ecologically depressed areas. However, other livelihood-influencing factors such as education, accessibility, and services are significantly correlated with poverty. | Bangladesh |
| ( | Relative welfare (female literacy, land ownership, deprived class, and water source) | Random forests | Travel time to market towns, percentage of a village covered with woodland, and percentage of a village covered with winter crop | Satellite sensor data are strongly associated with aspects of rural welfare for an extensive region of a developing country | India |
Summary statistics and characteristics of the data used—CDRs, environment, census, and MPI
| Summary statistics | CDRs | Environment data | Census | Poverty index |
| Timeline | January–December 2013 | 1960–2014 | 2013 | 2013 |
| Number of total calls and text | 11 billion | N/A | N/A | N/A |
| Number of unique individuals | 9.54 M | N/A | 1.4 M | N/A |
| Spatial granularity of available data | Antenna level (1666) | Vector data—100 m−1 | Household level | Region level ( |
| Cost incurred in data collection and preparation | Low/no cost (data exhaust) | Low/no cost (data exhaust) | US$29 million | Very high cost, and human expertise |
| Frequency of update of data | Real time | 3–5 y | 3–5 y |
Source, unit, and expected relationship to poverty of each environmental variable used in this study
| Feature (no. of statistics) | Unit | Type of data | Endogeneity | Data sources | Expected relationship to poverty |
| Food security, availability | |||||
| Temperature—annual, annual range, diurnal range, warmest month, warmest quarter, coldest month, coldest quarter, wettest quarter, driest quarter, isothermality (11) | Degree Celsius | Ground | Exogenous | WorldClim database, 1960–1990 ( | High temperature (+) |
| Precipitation—annual, wettest month, wettest quarter, driest month, driest quarter, warmest quarter, coldest quarter, coefficient of variation (8) | Millimeter | Ground | Exogenous | WorldClim database, 1960–1990 ( | Low precipitation (+) |
| Elevation (1) | Meter | Remote sensing | Exogenous | CGIAR-SRTM data aggregated to 30 s ( | High elevation (+) |
| Slope (1) | Degree | Remote sensing | Exogenous | CGIAR-SRTM data aggregated to 30 s | High slope (+) |
| Soil type (14) | % of territory | Ground | Exogenous | Soil and Terrain Database for Senegal and the Gambia (version 1.0), scale 1:1 million (SOTER Senegal Gambia, | Poor agronomic soil (+) |
| NDVI (2) | — | Remote sensing | Endogenous | 10-d temporal synthesis of 1 km SPOT-VEGETATION satellite images (2000–2013) ( | Low NDVI (in rural areas) (+) |
| Crop production (7) | Ton | Ground | Endogenous | Direction de Analyse, de la Prévision et des Statistiques Agricoles (DAPSA) 2000–2014 database ( | Low production (in rural areas) (+) |
| Food security (access) | |||||
| Millet price (1) | CFA franc/kilogram | Ground | Endogenous | Modeling based on local supply and demand ( | High millet price (+) |
| Proximity to urban centers, Market (1) | Kilometer | GIS | Endogenous | ANSD | Far from urban centers (+) |
| Proximity to main roads (1) | Kilometer | GIS | Endogenous | Open Street Map ( | Far from main road (+) |
| Economic activity | |||||
| Nighttime lights (2) | | Remote sensing | Endogenous | Version 4 of the 2013 nighttime lights time series captured by the Operational Linescan System of the Defense Meteorological Satellite Program (stable lights) | Low density of light (+) |
| Density of roads (1) | Kilometer | GIS | Endogenous | Open Street Map | Low density of roads (+) |
| Land cover | |||||
| Land cover (20) | % of territory | Remote sensing | Exogenous/endogenous | 2005 1:100,000 scale Senegal Land Cover Map produced by the Global Land Cover Network ( | Urban areas (−), cropland (+), forest (+), grassland (+) |
| Access to facilities | |||||
| Proximity to school/university (1) | Kilometer | GIS | Endogenous | Open Street Map | Far from school/university (+) |
| Proximity to water tower (1) | Kilometer | GIS | Endogenous | Open Street Map | Far from water tower (+) |
| Proximity to hospital (1) | Kilometer | GIS | Endogenous | Open Street Map | Far from hospital (+) |
| Total | 81 |
List of core features extracted for each individual from CDR data using the Bandicoot toolbox (31)
| Features (no. of statistics) | Description |
| Regularity | |
| Interevent time (4) | The interevent time between two records of the user. |
| Diversity | |
| Number of contacts (2) | The number of contacts with whom the user interacted (call and text handled separately). |
| Entropy of contacts (2) | The entropy of the user’s contacts, both for call and text. |
| Balance of contacts (4) | The balance of interactions per contact. This feature is calculated—each for text and call. For every contact, the balance is the number of outgoing interactions divided by the total number of interactions (in + out). |
| Interactions per contact (4) | The number of interactions a user had with each of his or her contacts. |
| Percent pareto interactions (2) | The percentage of user’s contacts that account for 80% of his or her interactions. |
| Percent pareto durations (1) | The percentage of user’s contacts that account for 80% of his or her total time spent on the phone. |
| Active behavior | |
| Percent nocturnal (2) | The percentage of interactions the user had at night (call and text). |
| Percent initiated conversations (1) | The percentage of conversations that have been initiated by the user both for call and text. |
| Percent initiated interactions (1) | The percentage of calls initiated by the user. |
| Response delay (2) | The response delay of the user within a conversation (in seconds). This is calculated for text (SD and mean of the response delay). |
| Response rate (1) | The response rate of the user (between 0 and 1). |
| Basic phone use | |
| Active days (1) | The number of days during which the user was active. |
| Call duration (2) | The SD and the mean of the duration of user’s calls. |
| Number of interactions (6) | The number of interactions. |
| Ratio of text and call interactions (1) | This computes the ratio of the text and call interactions. |
| Spatial behavior | |
| Number of antennas (1) | The number of unique places visited. |
| Entropy of antennas (1) | The entropy of visited antennas. |
| Percent at home (1) | The percentage of interactions the user had while he or she was at home. |
| Radius of gyration (1) | Returns the radius of gyration, the equivalent distance of the mass from the center of gravity, for all visited places. |
| Frequent antennas (1) | The number of locations that accounts for 80% of the locations where the user was. |
| Churn rate (2) | The SD and mean of the frequency spent at every antenna each week. |
| Total | 43 |
Features are grouped into categories based on prior research (29). These features are calculated for each month, so in total there are 43 × 12 = 516 features.
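A few of the indicators in the table above can be illustrated with a minimal Python sketch that follows the table's verbal descriptions. This is not the Bandicoot implementation: the function names, the input layouts (a list of contact identifiers, a list of `(contact, direction)` pairs, planar antenna coordinates), and the use of the natural logarithm are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy_of_contacts(contacts):
    """Shannon entropy (natural log) of how interactions are spread over contacts."""
    counts = Counter(contacts)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def balance_of_contacts(records):
    """Per contact: outgoing interactions divided by all interactions (in + out)."""
    outgoing, total = Counter(), Counter()
    for contact, direction in records:
        total[contact] += 1
        if direction == "out":
            outgoing[contact] += 1
    return {c: outgoing[c] / total[c] for c in total}


def percent_pareto_interactions(contacts, share=0.8):
    """Fraction of contacts that together account for `share` of all interactions."""
    counts = sorted(Counter(contacts).values(), reverse=True)
    target = share * sum(counts)
    running = 0
    for n_contacts, c in enumerate(counts, start=1):
        running += c
        if running >= target:
            return n_contacts / len(counts)
    return 1.0


def radius_of_gyration(positions):
    """Root mean square distance of visited (x, y) positions from their centroid.

    Assumes planar coordinates; real antenna positions would need a projection.
    """
    n = len(positions)
    cx = sum(x for x, _ in positions) / n
    cy = sum(y for _, y in positions) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in positions) / n)


# Feature count stated in the note above: 43 core statistics for each of 12 months.
N_CORE_FEATURES = 43
N_MONTHS = 12
N_TOTAL_FEATURES = N_CORE_FEATURES * N_MONTHS
```

Computing each statistic per month and stacking the twelve monthly vectors gives the 516-dimensional feature vector mentioned in the note.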
Fig. 1. Details about the target country, Senegal. On the Left is a composite map of Senegal. Black dots depict the location of mobile towers (antennas). The Voronoi tessellation formed by these towers is shown in gray. The commune (which is the finest administrative unit in Senegal) boundaries are shown in red. There are 552 communes with 431 rural communes and 121 urban centers. The navy blue boundaries are those of regions, which are the coarsest administrative units in Senegal. There are 14 regions that are named in the map. On the Right is the current (2016) map of Global MPI for four divisions of the country (West, North, South, and Center).
Fig. 2. Quantiles of predicted (Left) and actual (Right) MPI at the commune level. The urban centers are depicted by small circles on the map. The communes in the Dakar and Thiès regions are shown enlarged.
Fig. S6. Residual vs. fit plots to predict incidence of poverty (H) using CDR (Top) and environmental (Bottom) data. (Left) Linear (elastic net regression). (Right) Nonlinear (GPR). Linear model fits indicate nonlinearity in the data. The residuals for GPR are normally distributed. Shapiro–Wilk test statistic: CDR, 0.97 (P value ); environmental, 0.95 (P value ).
Spatially cross-validated results of the predictions of MPI, headcount of poverty (H), and intensity of poverty (A), along with the individual indicators for poverty given by our model using disparate datasets
| | Multisource data | | | CDR | | | Environment | | |
| Poverty indicators and dimensions | Corr. | Rank corr. | RMSE | Corr. | Rank corr. | RMSE | Corr. | Rank corr. | RMSE |
| MPI | 0.91 (0.06) | 0.88 (0.06) | 0.08 (0.01) | 0.89 (0.07) | 0.86 (0.07) | 0.08 (0.01) | 0.84 (0.09) | 0.80 (0.10) | 0.10 (0.02) |
| H | 0.91 (0.07) | 0.85 (0.08) | 10.79 (3.96) | 0.90 (0.08) | 0.84 (0.08) | 10.76 (2.60) | 0.83 (0.11) | 0.75 (0.11) | 13.65 (4.86) |
| A | 0.86 (0.05) | 0.85 (0.07) | 4.71 (0.96) | 0.83 (0.07) | 0.82 (0.08) | 4.98 (1.14) | 0.81 (0.07) | 0.79 (0.08) | 5.36 (0.75) |
| Education | 0.86 (0.05) | 0.84 (0.05) | 11.84 (1.88) | 0.82 (0.05) | 0.81 (0.07) | 13.08 (1.68) | 0.76 (0.07) | 0.74 (0.07) | 14.98 (3.03) |
| Health | 0.49 (0.15) | 0.50 (0.16) | 12.76 (2.12) | 0.50 (0.12) | 0.52 (0.12) | 12.91 (1.92) | 0.36 (0.23) | 0.35 (0.23) | 13.91 (2.32) |
| Standard of living | 0.83 (0.11) | 0.75 (0.13) | 14.82 (3.92) | 0.81 (0.11) | 0.74 (0.11) | 15.24 (3.45) | 0.73 (0.18) | 0.64 (0.20) | 17.88 (4.50) |
The results are compared with those obtained when only single-source data are available. Corr., Pearson’s r correlation; rank corr., Spearman’s rank correlation; RMSE, root mean square error. For both types of correlations, all P values were less than 10⁻²⁰. The SD over the multiple runs for each measurement is reported within parentheses.
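The three scores reported in the table (Pearson's r, Spearman's rank correlation, and RMSE) can be computed as in the following sketch. It is a plain standard-library illustration, not the authors' evaluation code; the function names are chosen here, and averaging the ranks of tied values is an assumption about tie handling.

```python
import math


def pearson(x, y):
    """Pearson's r between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

In a cross-validated setting, these functions would be applied to the held-out predictions of each run, and the mean and SD over runs reported, as in the table.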
Spatially cross-validated results of the predictions of MPI, incidence of poverty (H), and intensity of poverty (A), along with the individual indicators for poverty given by our model using disparate datasets
| | Multisource data | | | CDR | | | Environment | | | Concatenated | | |
| Poverty indicator | Corr. | Rank corr. | RMSE | Corr. | Rank corr. | RMSE | Corr. | Rank corr. | RMSE | Corr. | Rank corr. | RMSE |
| MPI | 0.91 (0.06) | 0.88 (0.06) | 0.08 (0.01) | 0.89 (0.07) | 0.86 (0.07) | 0.08 (0.01) | 0.84 (0.09) | 0.80 (0.10) | 0.10 (0.02) | 0.90 (0.06) | 0.85 (0.07) | 0.10 (0.02) |
| H | 0.91 (0.07) | 0.85 (0.08) | 10.79 (3.96) | 0.90 (0.08) | 0.84 (0.08) | 10.76 (2.60) | 0.83 (0.11) | 0.75 (0.11) | 13.65 (4.86) | 0.90 (0.07) | 0.83 (0.08) | 11.34 (3.87) |
| A | 0.86 (0.05) | 0.85 (0.07) | 4.71 (0.96) | 0.83 (0.07) | 0.82 (0.08) | 4.98 (1.14) | 0.81 (0.07) | 0.79 (0.08) | 5.36 (0.75) | 0.84 (0.07) | 0.82 (0.08) | 5.52 (1.40) |
| Individual indicators of poverty | ||||||||||||
| Years of schooling | 0.85 (0.04) | 0.85 (0.04) | 12.00 (1.21) | 0.81 (0.05) | 0.80 (0.06) | 13.30 (1.55) | 0.76 (0.07) | 0.75 (0.08) | 15.42 (2.48) | 0.85 (0.04) | 0.84 (0.04) | 12.06 (1.01) |
| School attendance | 0.86 (0.05) | 0.83 (0.06) | 11.68 (1.83) | 0.82 (0.07) | 0.81 (0.07) | 12.85 (1.73) | 0.75 (0.09) | 0.72 (0.09) | 14.54 (3.06) | 0.85 (0.05) | 0.83 (0.06) | 11.60 (2.05) |
| Child mortality | 0.45 (0.15) | 0.46 (0.16) | 10.91 (0.58) | 0.45 (0.13) | 0.48 (0.13) | 11.32 (0.73) | 0.34 (0.19) | 0.33 (0.21) | 11.54 (0.65) | 0.45 (0.14) | 0.45 (0.16) | 10.85 (0.49) |
| Nutrition | 0.52 (0.15) | 0.53 (0.15) | 14.61 (3.65) | 0.54 (0.11) | 0.55 (0.11) | 14.49 (3.10) | 0.38 (0.26) | 0.37 (0.25) | 16.28 (3.99) | 0.47 (0.21) | 0.46 (0.22) | 15.33 (4.24) |
| Cooking fuel | 0.86 (0.14) | 0.70 (0.18) | 13.82 (8.76) | 0.83 (0.14) | 0.68 (0.16) | 12.98 (7.00) | 0.76 (0.20) | 0.58 (0.25) | 16.49 (8.78) | 0.86 (0.13) | 0.70 (0.18) | 15.56 (9.19) |
| Sanitation | 0.79 (0.17) | 0.70 (0.18) | 16.99 (3.42) | 0.74 (0.17) | 0.69 (0.17) | 18.05 (3.14) | 0.72 (0.22) | 0.61 (0.26) | 18.64 (4.33) | 0.77 (0.20) | 0.66 (0.23) | 18.69 (3.91) |
| Water | 0.75 (0.14) | 0.72 (0.14) | 14.60 (3.22) | 0.74 (0.13) | 0.71 (0.12) | 14.70 (2.98) | 0.67 (0.20) | 0.61 (0.21) | 16.97 (3.25) | 0.68 (0.21) | 0.62 (0.22) | 17.15 (3.20) |
| Electricity | 0.88 (0.04) | 0.84 (0.07) | 15.09 (0.98) | 0.86 (0.04) | 0.83 (0.06) | 16.67 (1.25) | 0.79 (0.10) | 0.72 (0.13) | 20.27 (1.72) | 0.84 (0.05) | 0.80 (0.09) | 18.61 (1.65) |
| Floor | 0.78 (0.15) | 0.68 (0.14) | 15.79 (5.79) | 0.79 (0.13) | 0.70 (0.12) | 15.24 (4.93) | 0.64 (0.24) | 0.54 (0.23) | 17.87 (6.22) | 0.74 (0.19) | 0.63 (0.16) | 16.58 (5.81) |
| Asset ownership | 0.89 (0.04) | 0.86 (0.05) | 12.61 (1.33) | 0.87 (0.04) | 0.85 (0.04) | 13.81 (1.20) | 0.80 (0.11) | 0.75 (0.11) | 17.05 (2.69) | 0.85 (0.05) | 0.82 (0.06) | 15.37 (1.48) |
The results are compared with models learned on single source and on concatenated feature space. Corr., Pearson’s r correlation; rank corr., Spearman’s rank correlation; RMSE, root mean square error. For both types of correlations, all P values were less than 10⁻²⁰. An SD associated with the multiple runs for each measurement is reported within parentheses.
Fig. 3. Predictive power of the Gaussian process model. Left denotes the comparison of actual and predicted MPI values for all communes and urban areas of Senegal. The rural and urban areas are differentiated using blue and red colors, respectively. The size of the circle denotes the variance of the MPI prediction for that commune. Top Right shows how the actual and predicted values compare for asset ownership, while Bottom Right shows the comparison for years of schooling.
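The per-commune variance shown in this figure is a standard output of Gaussian process regression. The sketch below shows how a predictive mean and variance arise from the textbook GP equations. It is not the authors' model: it assumes a zero-mean prior, a squared-exponential (RBF) kernel with fixed hyperparameters, and a single feature set, and every name in it is chosen here for illustration.

```python
import math


def rbf(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between two feature vectors (assumed kernel)."""
    sq = sum((u - v) ** 2 for u, v in zip(a, b))
    return signal_var * math.exp(-0.5 * sq / length_scale ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def gp_predict(X, y, x_new, noise_var=1e-2):
    """Posterior mean and latent variance at x_new under a zero-mean GP prior."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (noise_var if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, y))   # (K + noise * I)^-1 y
    k_star = [rbf(x, x_new) for x in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    var = rbf(x_new, x_new) - sum(t * t for t in v)
    return mean, var
```

The variance shrinks near training inputs and returns to the prior variance far from them, which is the kind of quantity a circle size in such a plot could encode.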
Fig. S5. Relationship between precision of estimates of poverty and the population density of each commune.
A summary of poverty indicators and associated deprivations, with emphasis on how our methodology calculates them using the RGPHAE census data, keeping in view the OPHI guidelines
| Poverty indicators | Deprivation standards of a household used by OPHI for MPI calculation | RGPHAE census questionnaire response used by our methodology for MPI calculation |
| Health | ||
| Child mortality | At least one child has died | About living and deceased children in the household |
| Nutrition | Any member is undernourished | About going hungry at night for the past few months |
| Education | ||
| School attendance | Any school-aged child is not attending school up to grade 8 | About school-aged children currently not in school |
| Years of schooling | No member has completed at least 5 y of education | About higher schooling of any member |
| Standard of living | ||
| Cooking fuel | Uses solid fuels for cooking | Household does not use electricity or natural gas for cooking |
| Electricity | No access to electricity | No electricity or generator |
| Sanitation | No access to adequate sanitation or if it is shared | Household has no sewer connection or pit |
| Drinking water | No access to safe drinking water | No water tap in household |
| Flooring | Has dirt/earth/dung floor | Household has dirt/earth/dung floor |
| Assets | Has only one small asset (radio, TV, refrigerator, phone, bicycle, motorbike) and it has no car | Household has one asset (radio, TV, refrigerator, phone, bicycle, motorbike) and it has no car |
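How the ten deprivations in this table combine into H, A, and MPI can be sketched with the standard global MPI construction: each of the three dimensions carries weight 1/3, split equally among its indicators, a household is counted as poor when its weighted deprivation score reaches 1/3, and MPI = H × A. The sketch is illustrative rather than the authors' code: the indicator keys and the example households are hypothetical, and households are counted equally instead of being weighted by size.

```python
from fractions import Fraction

# Indicator weights under the standard global MPI scheme:
# health and education indicators 1/6 each, living-standard indicators 1/18 each.
WEIGHTS = {
    "child_mortality": Fraction(1, 6),
    "nutrition": Fraction(1, 6),
    "school_attendance": Fraction(1, 6),
    "years_of_schooling": Fraction(1, 6),
    "cooking_fuel": Fraction(1, 18),
    "electricity": Fraction(1, 18),
    "sanitation": Fraction(1, 18),
    "drinking_water": Fraction(1, 18),
    "flooring": Fraction(1, 18),
    "assets": Fraction(1, 18),
}
POVERTY_CUTOFF = Fraction(1, 3)


def deprivation_score(deprived_indicators):
    """Weighted sum of the indicators in which a household is deprived."""
    return sum(WEIGHTS[name] for name in deprived_indicators)


def mpi_components(households):
    """Return (H, A, MPI) for a list of per-household sets of deprived indicators.

    H is the share of poor households, A the mean deprivation score among the
    poor, and MPI their product. Households are unweighted (an assumption).
    """
    scores = [deprivation_score(h) for h in households]
    poor = [s for s in scores if s >= POVERTY_CUTOFF]
    if not poor:
        return Fraction(0), Fraction(0), Fraction(0)
    H = Fraction(len(poor), len(scores))
    A = sum(poor) / len(poor)
    return H, A, H * A


# Hypothetical households, each described by the indicators it is deprived in.
example_households = [
    {"child_mortality", "nutrition"},
    {"flooring"},
    set(),
    {"school_attendance", "years_of_schooling", "cooking_fuel", "electricity", "sanitation"},
]
H, A, MPI = mpi_components(example_households)
```

Aggregating households by commune before calling `mpi_components` would give commune-level values comparable in kind to those the tables report, with H and A expressed there as percentages.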
Fig. S2. Visualization of selected features using elastic net regularization on environmental data for prediction of selected deprivations. The rows represent the features, which are ranked according to their weights from positive (marked green) to negative (marked red). The different feature groups are color-coded. Features related to food availability are shown in black, whereas those related to food accessibility are colored green. The land cover features are colored yellow, and the features detailing economic activity are in red. Finally, features depicting access to services are shown in blue. The cells in white were given 0 weights by our model.
Fig. S3. Visualization of selected features using elastic net regularization on CDR data for prediction of selected deprivations. The rows represent features, which are ranked according to their weights from positive (marked green) to negative (marked red). The columns are the various deprivations. The feature groups are color-coded. Features related to diversity are colored blue. Those related to spatial aspects are colored yellow. The features related to active behavior are marked in black. The features related to basic phone use are in red, and those related to regularity are in green. The cells in white were given 0 weights by our model. Legend in parentheses corresponds to the different variation in weights. H and A weights vary between 1.85 and , and for others the weights vary between 5.5 and .
List of the important features chosen by our model to predict each of H, A, schooling, school attendance, cooking fuel, sanitation, water, electricity, floor, and assets
| Feature type | H | A | Schooling | School attendance | Cooking fuel | Sanitation | Water | Electricity | Floor | Assets |
| Basic | ||||||||||
| Active days call and text | – | – | – | – | – | – | – | – | ||
| Ratio of call/text interactions | + | + | + | + | ||||||
| Number of interactions in text | – | |||||||||
| Regularity | ||||||||||
| Interevent time call, mean | + | + | + | + | ||||||
| Interevent time call/text, SD | + | + | + | + | + | |||||
| Diversity | ||||||||||
| Balance of contacts text, mean | + | + | + | + | + | + | + | |||
| Percent pareto interactions call | + | + | + | + | + | + | + | |||
| Interactions per contact call, mean | – | – | – | – | – | – | – | |||
| Entropy of contacts call | – | – | – | |||||||
| Active | ||||||||||
| Response delay text, mean | + | – | + | + | + | + | + | + | ||
| Response delay text, SD | – | – | – | – | – | – | – | |||
| Percent initiated interactions, call | + | + | + | + | + | |||||
| Percent initiated conversations, call and text | – | – | ||||||||
| Spatial | ||||||||||
| Frequent antennas | – | – | – | – | – | |||||
| Number of antennas | – | – | – | – | – | |||||
| Radius of gyration | + | + | ||||||||
| Entropy of antennas | – | – | – | – | ||||||
The features having positive relationships with the various deprivations are marked as + in the cell corresponding to the feature name and the deprivation. Otherwise they are marked as –. The various semantic groupings under which the different features fall are also listed.
Comparative table showing how our model performs compared with only nightlights and a previous work (used as a baseline) using only four features—namely, call volume and mobile ownership per capita, nightlights, and population density
| Data source | Model | Results, Pearson’s r |
| Nightlights | Linear regression | 0.39 |
| (20) | Linear regression | 0.84 |
| Our model | GP regression | 0.91 |
Fig. 4. The uncertainty associated with each dataset evidenced by the most accurate one (denoted as CDR and ENV) for the average intensity of poverty (A) (Left) and prediction of the headcount of poverty (H) (Right).
Fig. S4. The highest deprivation by commune as predicted by our model for each dimension of global MPI (from top to bottom: education, health, and standard of living).
Brief review of poverty estimation methods based on CDR data
| Ref. | Data source | Model (number of features) | Sample size | Time period | Results, Pearson’s r | Spatial resolution of validation | Poverty measure | Region |
| ( | CDR and phone survey | Linear regression (5,088) | 1.5 M (CDR) + 856 (survey) | 9 mo | 0.68 | 492 DHS clusters | DHS composite wealth index | Rwanda |
| ( | CDR | Support vector machine (279) | 500 K | 6 mo | 0.80 | — | Socioeconomic levels | Urban area in a Latin American city |
| ( | CDR | Linear regression (OLS) (5) | 5 M and 928 K | 20 wk and 6 wk | — | 11 subprefecture level (CIV) | IMF poverty rate | Côte d’Ivoire and anonymous region B |
| ( | CDR | Linear regression ( | 9 M and 150 K | 12 mo | 0.82 | 14 regions in Senegal | MPI (OPHI) | Senegal |