Felix J Hoffmann, Fabian Braesemann, Timm Teubner.
Abstract
Sustainability in tourism is a topic of global relevance, finding multiple mentions in the United Nations Sustainable Development Goals. The complex task of balancing tourism's economic, environmental, and social effects requires detailed and up-to-date data. This paper investigates whether online platform data can be employed as an alternative data source in sustainable tourism statistics. Using a web-scraped dataset from a large online tourism platform, a sustainability label for accommodations can be predicted reasonably well with machine learning techniques. The algorithmic prediction of accommodations' sustainability using online data can provide a cost-effective and accurate measure that allows developments in tourism sustainability to be tracked across the globe with high spatial and temporal granularity. Supplementary Information: The online version contains supplementary material available at 10.1140/epjds/s13688-022-00354-6.
Keywords: Imbalanced classification; Nowcasting; Platform data; Supervised learning; Sustainable tourism; TripAdvisor
Year: 2022; PMID: 35873664; PMCID: PMC9289659; DOI: 10.1140/epjds/s13688-022-00354-6
Source DB: PubMed; Journal: EPJ Data Sci; ISSN: 2193-1127; Impact factor: 3.630
Tabular overview of empirical approaches to measure sustainable tourism, sustainable development indicators and related phenomena with online platform data
| Publication | Main variable | Data Source | Data type | Ground truth | Country/Region | Sample size |
|---|---|---|---|---|---|---|
| Bassolas et al. (2016) | Touristic site attractiveness | Twitter | Geolocated tweets and user home locations | – | Africa, Asia, Europe, North and South America | 9.6 million geolocated tweets; 59,000 users’ places of residence |
| Batista e Silva et al. (2018) | Average daily numbers of overnight tourists | Booking.com; TripAdvisor; Eurostat | Accommodation location and capacity | – | EU-28 (incl. GB) | 716,103 establishments |
| Buning and Lulla (2020) | Spatio-temporal usage patterns of bike shares | Bike fleet GPS; user zip codes | GPS; User data | – | Indianapolis, USA | 353,733 individual trips |
| Falk and Hagsten (2020) | Visitor flows to world heritage sites | Instagram; UNESCO | Number of posts, hashtags | – | Europe, North America | 680m Instagram posts for 525 sites |
| Fatehkia et al. (2020) | Wealth Index at clustered geographic locations | | Advertisement market audience size estimates | DHS wealth index | India and Philippines | 1,205 (Philippines); 28,043 (India) |
| Gallego and Font (2021) | Air passenger demand forecasts | Skyscanner flight searches (ForwardKeys) | Global air capacity and flight search data | – | Global | 5,000m searches and >600m picks |
| Grybauskas et al. (2021) | Apartment revisions | Property listing sites | Property listings with up to 16 numerical features | – | Vilnius, Lithuania | 18,922 listings |
| Hardy and Arval (2020) | Tourist movements in national park | Mobile app; GNSS | Location data; Demographic survey data | – | Tasmania, Australia | 472 tourists (4-14 days, 1 signal/10 seconds) |
| Kashyap and Verkroost (2021) | LinkedIn GGI = gender gaps along different dimensions | LinkedIn | Advertisement market audience size estimates | Country-level professional gender gaps data from the International Labour Organization (ILO) | Global, up to 234 countries predicted (depending on level of analysis); up to 185 in ground truth | 460 million users, 165.02 million without missing data |
| Londoño and Hernandez-Maskivker (2016) | Customer feedback to GreenLeader program | TripAdvisor | Sustainability mention, gender, nationality, hotel category, city, GL level | Review concerning sustainability or not | 6 cities in Europe and North America | 572 comments |
| Mariani and Borghi (2021) | eWOM (presence and depth of discourse) | Booking.com; TripAdvisor | Text (Comments) | – | Americas, Europe | 4.12 million TripAdvisor and 1.56 million Booking.com reviews |
| Mendoza et al. (2019) | Intensity of an earthquake in terms of damage | Twitter | Tweets (text mining) | Earthquake catalogue provided by the Seismological Center of Chile | Chile | Initially 825,310 tweets; final sample: 187,317 geo-mapped tweets |
| Nurmi et al. (2020) | Nights spent by foreign tourists | GDS; Amadeus, Sabre, Galileo | Flight bookings | Official accommodation statistics (Finland) | Finland | 58 months × 7 countries |
| Quattrone et al. (2018) | Number of Airbnb listings per tract (geographical unit based on census data) | Airbnb | Geolocations of listings | – | 8 cities in USA | 54,681 listings |
| Serrano et al. (2021) | Attributes important to “green tourists” | Airbnb | Text (Comments) | Actual user assessments | Global, 83 cities | more than 176 million comments |
| Sun and Paule (2017) | Restaurant and bar rating clusters | Yelp | User ratings from 1 to 5 | – | Phoenix, USA | 2,578 restaurants, 981 fast food restaurants, 797 bars |
| Talebi et al. (2021) | Potential to become ecotourism destination | Manually collected geo data | Geo data (slope, elevation, soil texture, vegetation, etc.) | Classification from systemic analysis | Arasbaran, Iran | 637 recreational areas |
Figure 1. Differences between GreenLeader and other accommodations in TripAdvisor data. (A)–(D) Distributions of continuous variables: reviews per room, number of rooms, total number of photos per room, and languages spoken by staff in GreenLeader (blue) and other (red) accommodations. (E)–(F) Proportion of accommodation types and hotel class (stars) in the groups of GreenLeader (left) and other (right) accommodations. (G) Proportion of amenities in the groups of GreenLeader (top) and other (bottom) accommodations. GreenLeader accommodations tend to be larger, have more user interactions, be of higher quality, and offer more amenities than other accommodations.
Figure 2. Unsupervised learning techniques applied to TripAdvisor data. (A) Heatmap of principal component loadings of the four main principal components based on dimensionality reduction of the 33 continuous variables in the data set. The algorithm identifies four main dimensions in the data: accommodation size and user interaction (PC1), user rating (PC2), location (PC3), and quality (PC4). (B) Summary statistics of four clusters identified by k-means clustering. The accommodations can be grouped according to quality and user interaction variables. The clusters show different proportions of the GreenLeader outcome variable, varying from 2% to 19%. (C) Two-dimensional representation (PC1, PC2) of TripAdvisor data (10% sample) grouped in four clusters (panels) and GreenLeader/other accommodations (color). The unsupervised learning algorithms are able to split the data into distinct groups with varying proportions of GreenLeader accommodations.
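The clustering step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a plain Lloyd's-algorithm k-means written with the Python standard library, skips the PCA step, and runs on synthetic two-dimensional points with two clusters rather than four. The function names `kmeans` and `label_share_by_cluster`, the toy points, and the toy labels are all hypothetical.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers from the data
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def label_share_by_cluster(assign, labels, k):
    """Share of positive (1) labels within each cluster."""
    shares = []
    for c in range(k):
        member_labels = [y for a, y in zip(assign, labels) if a == c]
        shares.append(sum(member_labels) / len(member_labels) if member_labels else 0.0)
    return shares


# Synthetic data (assumption): two well-separated groups of 50 points each.
rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
# Toy outcome labels: 2 of 50 positive in the first group, 10 of 50 in the second.
labels = [1 if i % 25 == 0 else 0 for i in range(50)]
labels += [1 if i % 5 == 0 else 0 for i in range(50)]

assign = kmeans(points, k=2)
shares = label_share_by_cluster(assign, labels, k=2)
print(shares)
```

The outcome label is not used for clustering; it is only tabulated afterwards, which is how clusters found without supervision can still differ in their share of labelled accommodations.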
Figure 3. Classification performance and extrapolation. (A) Comparison of 360 classification models (90 models per classifier and panel) regarding three performance metrics: F2 score, Recall, and ROC AUC (each dot represents a model). QDA model (1) shows the highest F2 score, QDA model (2) achieves the highest Recall, and Random Forest model (3) the best ROC AUC score. (B) Confusion matrices of the three best performing models (1) to (3) according to the performance metrics (top panels) and a random draw model (4) (lowest panel). Inset: performance comparison between machine learning models (red) and 20,000 random draws (blue) according to F2 score, Recall, and ROC AUC. The machine learning models show a significantly better prediction performance than the random draw models. (C) Predicted share of GreenLeader accommodations in Europe (NUTS-2) according to the QDA model (1). The model predicts urban centres and several regions in Western and Northern Europe to have high shares of sustainable tourist accommodations.
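The F2 score and the random-draw comparison named in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the F-beta formula is the standard one, but the helper names, the toy labels, the hypothetical classifier output `y_pred`, and the choice to draw random predictions at the observed base rate are all assumptions made for this example.

```python
import random


def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta score from confusion-matrix counts; beta > 1 weights recall more."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def random_draw_f2(y_true, n_draws=1000, seed=0):
    """F2 scores of random-draw predictions that guess positives at the base rate."""
    rng = random.Random(seed)
    rate = sum(y_true) / len(y_true)
    scores = []
    for _ in range(n_draws):
        y_rand = [1 if rng.random() < rate else 0 for _ in y_true]
        tp, fp, fn, _tn = confusion_counts(y_true, y_rand)
        scores.append(fbeta_from_counts(tp, fp, fn))
    return scores


# Toy imbalanced data (assumption): 10 positives among 100 accommodations.
y_true = [1] * 10 + [0] * 90
# Hypothetical classifier: finds 8 of 10 positives, raises 20 false alarms.
y_pred = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 70

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
model_f2 = fbeta_from_counts(tp, fp, fn)
baseline = random_draw_f2(y_true)
mean_baseline_f2 = sum(baseline) / len(baseline)
print(model_f2, mean_baseline_f2)
```

With beta set to 2, a missed positive costs more than a false alarm, which suits an imbalanced setting where the labelled class is rare and finding it matters more than avoiding over-prediction.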