
Assessment of urban air quality from Twitter communication using self-attention network and a multilayer classification model.

Thushara Sudheish Kumbalaparambi¹, Ratish Menon², Vishnu P Radhakrishnan³, Vinod P Nair⁴.

Abstract

Social media platforms are among the prominent new-age methods used by the public to spread awareness of, or draw attention to, an issue or concern. This study demonstrates how the Twitter responses of the public can be used for qualitative monitoring of air pollution in an urban area. Tweets discussing air quality in Delhi, India, were extracted during 2019-2020 using a machine learning technique based on a self-attention network. These tweets were cleaned, sorted, and classified into three classes, viz. poor air quality, good air quality, and noise or neutral tweets. The present study used a multilayer classification model with an embedding layer as the first layer and a bi-directional long short-term memory (BiLSTM) layer as the second layer. A method was then devised for estimating PM2.5 concentration from the tweets using 'spaCy' similarity analysis of classified tweets and data extracted from Continuous Ambient Air Quality Monitoring Stations (CAAQMS) in Delhi for the study period. The accuracy of this estimation was found to be high (80-99%) for extreme air quality conditions (extremely good or severe) and lower during moderate variations in air quality. The applicability of this methodology depended on perceivable changes in air quality, Twitter engagement, and environmental consciousness among the public.

Entities: Chemical

Keywords: Air pollution; BiLSTM; Deep learning; Delhi; PM2.5; spaCy

Year: 2022 PMID: 36074292 PMCID: PMC9453714 DOI: 10.1007/s11356-022-22836-w

Source DB: PubMed Journal: Environ Sci Pollut Res Int ISSN: 0944-1344 Impact factor: 5.190

Introduction

Increasing air pollution has become a major concern for the environmental quality of life in urban areas. Recent studies visualize the grim reality of air pollution and its health risks across the cities of the world (Leffel et al. 2022; Castells-Quintana et al. 2021; WHO 2020; Pant et al. 2019). For strategizing any control or eradication program, the first step is to monitor air quality at the highest possible temporal and spatial resolutions to identify the source, location, quality, and quantity of pollutants (Gholami et al. 2021; Shanavas et al. 2020; Xu et al. 2018). The existing regulatory monitoring networks, especially in developing countries such as India, however, make sparse spatial and temporal measurements (CPCB 2020). Any additional information in real time, even if qualitative, would be an advantage for air quality management initiatives in countries like India, which has 6 of the 10 most polluted cities in the world (IQAir 2019).
With increased access to the internet and the emergence of social media platforms, people are now expressing their views on, and their relation to, the events around them like never before. Social media have grown tremendously in number and popularity since their emergence and now include a multitude of platforms such as Twitter, Facebook, LinkedIn, YouTube channels, blogs, chat rooms, and discussion forums. India, with its population of ≃1.3 billion people, has over 560 million internet users and is the second largest online market in the world. Internet accessibility and use in the country vary largely with factors such as gender and the socio-economic divide (Statista 2020b). Analyzing public behaviour on such platforms is not new to researchers and has been part of routine business analytics (Hassani and Mosconi 2022; Fan and Gordon 2014). Many studies have also explored their potential for monitoring environmental events such as natural disasters in real time (Yigitcanlar et al. 2022; Sakaki et al. 2010; Lindsay 2011; Earle et al. 2011; Li et al. 2012; Kent and Capello Jr. 2013; Middleton et al. 2013; Singh et al. 2019). Social media responses also have the potential to yield useful information on air pollution (Robinson and Fialkowski 2010; Jiang et al. 2015; Gurajala and Matthews 2018; Jiang et al. 2019). It has been seen that exposed people tend to share pictures, videos, blogs, and tweets about air quality events on social media within minutes of the occurrence of an event (Robinson and Fialkowski 2010). This user-generated content (UGC) enables us to derive real-time information on air pollution. The response of users or ‘human sensors’ may be direct, in the form of immediate posts, tweets, blogs, etc., or indirect, like re-sharing or supporting already existing content. This rise in the popularity of social media has given people access to a huge volume of information which can be used in a variety of valuable areas.
Extracting and analyzing UGC to derive actionable knowledge is a challenging task. However, recent approaches using machine learning (ML) techniques have been found promising for the purpose (Mohammadifar et al. 2021; Gurajala and Matthews 2018; Jackoway et al. 2011). Many ML techniques are also being explored by researchers for air quality prediction and forecasting (Xu et al. 2020; Chang et al. 2020; Zhang et al. 2021; Al-Janabi et al. 2020, 2021). The use of attention networks in deep learning algorithms is now widely recognized in sequence modelling (Bahdanau et al. 2014; Vaswani et al. 2017) to maximize model performance. Long short-term memory (LSTM) networks are known to be effective for time series prediction tasks, and multiple LSTM units are stacked to form the more effective bi-directional long short-term memory (Bi-LSTM) models (Zhang et al. 2021). Wiedemann et al. (2018) and Wu et al. (2019) have also presented inspiring Bi-LSTM model architectures.
Table 1 lists some of the previous works that attempted to predict air quality using different ML techniques. The use of social media for air quality prediction is less explored than the use of air quality parameters and other polluting factors. Also, self-attention networks have not been prominently used for air quality prediction using social sensors. Comprehensive analysis of various classifiers for air pollution detection is also underrepresented in the prior literature.

Table 1

Comparison between various reference papers that performed air quality prediction

Reference papers	Monitored air quality parameter	Source/social media platform used for prediction	Country or city under study	Prediction models/techniques used, efficiency obtained
Jiang et al. (2015)	AQI	Sina Weibo (Chinese Twitter)	Beijing	Gradient tree boosting (GTB) — 59%
Gurajala and Matthews (2018)	PM_2.5	Twitter	Paris, Delhi, London	Air quality not predicted
Jiang et al. (2019)	AQI	Twitter	California, Idaho, Illinois, Indiana, Ohio	Natural Language Processing (NLP) — 6.9 to 17.7% improvement with social media intervention over base line method.
Xu et al. (2020)	PM_2.5	Historical meteorological data, road network data, administrative boundary vector data, POI data.	Beijing, Tianjin, Hebei	Temporal-spatial-regression-tree model, grid prediction model — 90%
Chang et al. (2020)	PM_2.5	Local and neighbouring station data, chimney and abroad pollution data	Taiwan	Aggregate LSTM — better than GTB, LSTM, SVR
Zhang et al. (2021).	PM_2.5	PM_2.5 — hourly, daily, restructured multi hour data	Beijing	LSTM, Bi-LSTM, EMD-BiLSTM, — more than 95%

SVR support vector regression, EMD empirical mode decomposition

In the present work, Twitter responses related to air quality in Delhi during 2019–2020 were analyzed using a deep learning model to estimate PM2.5 (mass concentration of particulate matter having an aerodynamic particle diameter less than 2.5 μm) from the tweet content. A self-attention network–based classifier was used to characterize the tweets. The PM2.5 data from Continuous Ambient Air Quality Monitoring Stations (CAAQMS) within the city were compared with the tweet content to find its relationship with the pollution level. The study analyzed the fit and relevance of the physically monitored air quality parameter with the tweeting volume and behaviour of people in Delhi, and developed a new self-attention deep learning classification model with high accuracy to classify tweets into those indicating poor air quality (Class I), good air quality (Class II), and neutral or noise tweets (Class 0) with minimal human intervention. A method was then devised for estimating PM2.5 concentration from the tweets using ‘spaCy’ similarity analysis of classified tweets and data extracted from CAAQMS in Delhi for the study period.

Methodology

Figure 1 shows the methodology framework followed in this study. Each step is explained in detail in the following sections.

Fig. 1

Framework of the study

Selection of study area and the social media platform

Twitter was chosen as the social media platform for the present study. Twitter is distinctive for its short and crisp content, limited to 280 characters, and its prominence at the time of an event. In India, there were 13.15 million Twitter users as of April 2020 (Statista 2020a). Many world leaders, governments, ministries, influencers, institutions, and news channels have official Twitter accounts to make announcements, influence, and engage with the general public. Twitter has also become an increasingly important tool in politics, mass media communications, and many other fields, such that it touches almost every sector of life. Tweets for a period of 12 months, from March 2019 to February 2020, were used as data in this study. Tweet response volumes, defined as the sum of the direct and indirect responses of people with respect to air quality, were estimated. Indirect responses included likes and re-tweets for a particular tweet. Works such as Jiang et al. (2015) have considered the effects of direct and indirect responses separately. However, in this study, they were considered together as a collective response of the people of Delhi to the city's air quality issues. Delhi, the capital city of India, was selected as a representative urban area to demonstrate the methodology presented in this paper (Fig. 2). Delhi is bounded by the Indo-Gangetic alluvial plains in the North and East, by the Thar desert in the West, and by the Aravalli hill ranges in the South, and its land use comprises residential, industrial, and commercial areas (Hang and Rahman 2018). Delhi has witnessed tremendous population growth, from 1.7 million in 1951 to over 16 million in 2011 (Census 2011). A preliminary analysis of tweets related to air pollution from the 100 most populated cities in India, conducted as part of this study, identified Delhi as the city with the maximum number of air quality episodes (IQAir 2019) and the maximum number of tweets related to air quality.
Also, a public health emergency was declared in Delhi in November 2019 as the air quality index rose to more than 3 times the ‘hazardous’ level (CNN 2020).

Fig. 2

Study area and air pollution monitoring stations in Delhi, India (Source: CPCB)

Data extraction

Tweets were extracted from Twitter through the Twitter streaming API. The tweets were fetched using keywords pertaining to air pollution along with the name of the city. About 15 different keyword combinations with Delhi (as shown in Table 2) were used for tweet extraction over the study period. The keywords chosen were different combinations of ‘air quality’, ‘air pollution’, and ‘smog’. The data set collected included attributes such as the day, date, and time of the tweet, the tweet text, and the number of likes and retweets received by that tweet. In this manner, a data set of around 82,000 tweets was created. Only tweets in the English language were considered in this study, as Twitter users in India are predominantly English speakers (Poell and Rajagopalan 2015). To reduce the noise in the data set, a few words were used to filter out unnecessary tweets, such as those generated by automated websites on a periodic basis regardless of the change in pollution intensity. The tweets extracted over the year, along with the likes and retweets appended to them, comprised the dataset for the study. As an initial step, the manually classified dataset was tokenized into words. Preprocessing was performed using the Natural Language Toolkit (NLTK) library to remove special symbols, stop words, punctuation, Twitter handles, etc. The data set was further prepared by word indexing, integer encoding, and padding the encoded sequences before using the data for supervised learning.
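As an illustration only, the preprocessing chain described above (cleaning, tokenization, word indexing, integer encoding, and padding) can be sketched in Python. This is a minimal sketch that uses only the standard library instead of NLTK; the stop-word list, the function names, the sample handle and URL, and the sequence length of 6 are all hypothetical and are not taken from the study.

```python
import re

# Tiny illustrative stop-word list (assumption; the study used NLTK resources).
STOP_WORDS = {"the", "is", "in", "a", "an", "of", "to", "and", "here", "just"}


def clean_and_tokenize(tweet):
    """Lowercase, strip URLs, handles, and symbols, then drop stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # Twitter handles
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # punctuation, '#', other symbols
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def build_word_index(token_lists):
    """Give each distinct word an integer id starting at 1 (0 is the pad id)."""
    index = {}
    for tokens in token_lists:
        for tok in tokens:
            index.setdefault(tok, len(index) + 1)
    return index


def encode_and_pad(tokens, word_index, max_len):
    """Integer-encode tokens, truncate to max_len, and right-pad with zeros."""
    ids = [word_index[tok] for tok in tokens if tok in word_index][:max_len]
    return ids + [0] * (max_len - len(ids))


tweets = [
    "Just landed in #Delhi, the air here is just unbreathable",
    "Amazing Air Quality today in Delhi! @someone https://example.com",
]
token_lists = [clean_and_tokenize(t) for t in tweets]
word_index = build_word_index(token_lists)
encoded = [encode_and_pad(toks, word_index, max_len=6) for toks in token_lists]
```

Reserving 0 for padding keeps padded positions distinguishable from real words, which is the usual convention when sequences are fed to an embedding layer.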

Table 2

Keyword combinations used to extract tweets during study period

	Keyword combinations
Air Quality + Delhi	AirPollution + Delhi	DelhiChokes
AirQuality + Delhi	AirPollution + Delhi	Choke + Delhi
AirQualityDelhi	AirPollutionDelhi	Clean air + Delhi
DelhiAirQuality	DelhiAirPollution	DelhiEmergency
Delhi + Smog	DelhiSmog	Delhi + RightToBreathe

In order to relate Twitter activity with ambient air quality information, PM2.5 data from 38 CAAQMS stations within Delhi during the period of March 2019 to February 2020 were collected from the CPCB data repository (CCR-CPCB 2020). The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Data classification

Each tweet, within the allowed 280 characters, contained words indicating good or bad air quality. The extracted dataset also contained some tweets that were either irrelevant or neutral with respect to air quality. The tweet data set was manually read by a team of 15 environmental engineers and sorted into 3 classes: poor air quality tweets (Class I), good air quality tweets (Class II), and noise or neutral tweets (Class 0). The sorted data set was then checked for variations in sorting and re-sorted (if needed) by an expert team of 3 environmental engineers. In the sorted set, Class I and II tweets served as indicators of the ambient air quality, whereas Class 0 consisted of tweets generated by bots, irrelevant tweets, etc. Examples of tweets in their respective classes are shown in Table 3. The classified set was then preprocessed and used to train a self-attention model that would automatically classify a sample tweet into its suitable class.

Table 3

Examples of classified tweets

S. no.	Tweets	Classification
1.	New Delhi: Over 1.2 million people died in India due to air pollution in 2017, said a global report on air pollution	Class 0
2.	#DelhiAirEmergency #DelhiPollution #DelhiBachao #DelhiAirQuality #DelhiNCRPollution	Class 0
3.	Kab tak zindagi katoga bd or cigar mein, kuch din to gujaro delhi, in ncr	Class 0
4.	Just landed in #Delhi, the air here is just unbreathable	Class 1
5.	Amazing Air Quality today in Delhi! Enjoy the blue sky and clean air while it lasts.	Class 2


Data analysis

Relationship between tweet responses and PM2.5 mass concentration

The volume of tweet responses related to air quality was compared with the ambient concentrations of PM2.5 to analyze the similarity in their trends. Semi-log time series graphs of the tweet response volumes for Class I and Class II tweets were plotted against the ground average PM2.5 concentrations obtained from CPCB for the duration of the study. Twitter activity was plotted on a logarithmic scale considering its vast data range. The coherence of peaks and falls in the graphs was analyzed. The graphical representation and the inferences drawn from it are discussed in ‘Temporal variation of tweet responses with PM2.5 concentrations’. To understand the statistical relationship between the two data sets, box plots were made with PM2.5 on the x-axis and tweet response volume on the y-axis. For the box plots, the data were sorted on the basis of the classification of PM2.5 values into 6 categories as defined by the CPCB, namely good (0–30 μg/m3), satisfactory (31–60 μg/m3), moderate (61–90 μg/m3), poor (91–120 μg/m3), very poor (121–250 μg/m3), and severe (above 250 μg/m3). The relationship of tweet response volumes of each class (Class I and II) in these categories was analyzed, and the results are given in ‘Statistical relationship of tweet response with PM2.5’.
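The grouping of daily data into the six CPCB categories quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the sample records are hypothetical, and treating each upper bound as inclusive (so that a non-integer value such as 30.5 μg/m3 falls in ‘satisfactory’) is an assumption.

```python
# Upper bounds (ug/m3) of the CPCB PM2.5 categories quoted in the text.
# Assumption: each bound is inclusive; anything above 250 is 'severe'.
CPCB_PM25_BOUNDS = [
    (30, "good"),
    (60, "satisfactory"),
    (90, "moderate"),
    (120, "poor"),
    (250, "very poor"),
]


def pm25_category(concentration):
    """Return the CPCB category label for a daily mean PM2.5 value."""
    if concentration < 0:
        raise ValueError("PM2.5 concentration cannot be negative")
    for upper, label in CPCB_PM25_BOUNDS:
        if concentration <= upper:
            return label
    return "severe"


def group_volumes_by_category(daily_records):
    """Group daily tweet response volumes by PM2.5 category (box-plot input).

    daily_records: iterable of (daily_mean_pm25, tweet_response_volume).
    """
    groups = {}
    for pm25, volume in daily_records:
        groups.setdefault(pm25_category(pm25), []).append(volume)
    return groups


# Hypothetical daily records: (PM2.5 in ug/m3, Class I tweet response volume).
records = [(25.0, 0), (58.0, 3), (130.0, 40), (550.0, 50000), (140.0, 75)]
volumes_by_category = group_volumes_by_category(records)
```

Each list in `volumes_by_category` then supplies the values for one box in a box plot of tweet response volume against PM2.5 category.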

Behavioural analysis using word clouds

Word clouds were generated for each season to understand how people express changes in air quality through words. A Python program with the necessary supporting libraries was used for the purpose. India has 4 meteorological seasons, namely winter (January, February), pre-monsoon (March–May), southwest monsoon (June–September), and post-monsoon (October–December), as per the India Meteorological Department (IMD). A prominent seasonal transition in air quality levels in Delhi was observed: the post-monsoon and winter seasons had the worst air quality, which improved over the pre-monsoon and southwest monsoon seasons.
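The season-wise grouping behind the word clouds can be sketched as below. This is only an illustrative sketch: it maps a tweet date to the IMD season listed above and counts word frequencies per season with the standard library, which is the kind of input a word cloud generator typically consumes. The function names and the sample tweets are hypothetical, and the plotting step itself is omitted.

```python
from collections import Counter, defaultdict
from datetime import date


def imd_season(day):
    """Map a calendar date to the IMD meteorological season named in the text."""
    month = day.month
    if month <= 2:
        return "winter"
    if month <= 5:
        return "pre-monsoon"
    if month <= 9:
        return "southwest monsoon"
    return "post-monsoon"


def seasonal_word_counts(tweets):
    """Count word frequencies per season.

    tweets: iterable of (date, list_of_preprocessed_tokens).
    """
    counts = defaultdict(Counter)
    for day, tokens in tweets:
        counts[imd_season(day)].update(tokens)
    return counts


# Hypothetical preprocessed tweets.
sample = [
    (date(2019, 11, 3), ["delhi", "severe", "smog", "emergency"]),
    (date(2019, 11, 4), ["delhi", "toxic", "smog"]),
    (date(2019, 8, 17), ["rain", "clean", "sky", "delhi"]),
]
counts = seasonal_word_counts(sample)
top_post_monsoon = counts["post-monsoon"].most_common(2)
```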

Automated data classification model

The dataset was a highly imbalanced multiclass data set with around 62,000 tweets in Class 0, 18,000 tweets in Class I, and 1700 tweets in Class II. The respective classes of the tweets were label encoded. The processed data set was split into training and test sets of 90% and 10%, respectively, for a 10-fold cross-validation analysis to evaluate the overall performance of the model, and into a train-test split of 80–20% for result analysis. The present study used a multilayer classification model. The network architecture is shown in Table 4. The first layer was an embedding layer in which an embedding matrix was constructed from the data set using pretrained global vector (GloVe) embeddings to create the weights for the embedding layer (Pennington et al. 2014). The second layer was a BiLSTM layer (units: 100) in which the input sequences were analyzed in the forward and backward directions; that is, if a tweet has $n$ words $(w_1, w_2, \ldots, w_n)$, the tweet was analyzed from $w_1$ to $w_n$ by the forward LSTM and from $w_n$ to $w_1$ by the backward LSTM, forming a word feature ($W_f$) for each word on the basis of both the forward analysis ($\overrightarrow{f}$) and the backward analysis ($\overleftarrow{f}$) such that

$$W_f = \overrightarrow{f} \,\Vert\, \overleftarrow{f} \qquad (1)$$

where $\Vert$ in Eq. (1) stands for the concatenation function.

Table 4

Network architecture (deep learning model)

Layer
Embedding: weight matrix from pretrained GloVe embeddings
BiLSTM: (units: 100)
SeqSelfAttention
Dense: (units: 50, activation: ReLU)
Dense: (units: 50, activation: ReLU)
Flatten
Dense: (units: 3 (for Class 0, Class I, Class II), activation: softmax)
Epochs: 5, batch size: 128, optimizer: Adam

Similarly, a word feature was formed for every word of the data file. The data then flowed into a self-attention (SeqSelfAttention) layer, where weights were assigned to words, and added to the word features from the previous layer, based on the relative importance of each word to the entire tweet. This layer improved the efficiency of the classification, as ‘attention’ was given to words based on their importance. For further detail on Bi-LSTM and attention neural networks, see Bahdanau et al. (2014), Vaswani et al. (2017), Wiedemann et al. (2018), Zhang et al. (2018), Wu et al. (2019), Xu et al. (2020), Chang et al. (2020), and Zhang et al. (2021). Python libraries such as Keras (https://keras.io/) and Keras_self_attention (https://pypi.org/project/keras-self-attention/) were used to support this layer. The context vectors, or sentence feature vectors, obtained as the output of the attention layer using a weighted sum function were then passed to two dense (fully connected) layers, in which all the inputs and outputs are connected to all the neurons in each layer. Each dense layer contained 50 units and used the Rectified Linear Unit (ReLU) activation function. The output was then flattened in the next layer, the flatten layer. The final output was obtained from a dense layer with units equal to the number of output classes and a softmax output activation, which provided the probabilities of the potential outcomes.
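Two of the operations named above, the attention-weighted sum that yields a context vector and the softmax that turns the final layer's outputs into class probabilities, can be illustrated with a few lines of plain Python. This is a conceptual sketch only and does not reproduce the internals of the SeqSelfAttention layer: the relevance scores are supplied as hypothetical inputs rather than learned, and the word features and logits are made-up numbers.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(word_features, scores):
    """Weight each word feature by its softmax-normalised score and sum them.

    word_features: list of equal-length vectors, one per word.
    scores: one relevance score per word (hypothetical input here; in a
            trained network these would be computed by the attention layer).
    """
    weights = softmax(scores)
    dim = len(word_features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, word_features))
        for i in range(dim)
    ]
    return weights, context


# Three words with 2-dimensional features and equal relevance scores.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context(features, [0.0, 0.0, 0.0])

# Hypothetical final-layer outputs for Class 0, Class I, and Class II.
class_probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = class_probabilities.index(max(class_probabilities))
```

With equal scores the context vector reduces to the plain average of the word features; unequal scores shift it towards the words judged more important.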

PM2.5 estimation from the tweet content

The methodology used similarity analysis to find the tweets most similar to a test tweet within a selected set of tweets, in order to estimate the probable ground PM2.5 concentration that would have been recorded on that day. This model used the ‘spaCy’ library available in Python to find the top 10% most similar tweets out of a season-wise selected set of tweets. A test tweet first underwent model classification as described in the previous section, and the season was then identified from the date of the tweet so that it could be compared with the right set of tweets. The test tweet was then analyzed for similarity with each of the tweets in the selected set, and a similarity index was assigned to each tweet in the selected set. The tweets were then arranged in descending order of the cosine similarity index, and the first 10% of this sorted set was taken as the resultant ‘most similar tweets set’. The arithmetic mean of the PM2.5 concentrations measured by CAAQMS stations on the days corresponding to the tweets in the ‘most similar tweets set’ was then reported as the PM2.5 concentration which influenced the test tweet behaviour and its content. The usefulness of this method was assessed by measuring the correctly estimated concentrations as a percentage of the total estimations under each air quality category specified by CPCB (CPCB 2014), to understand the social media behaviour with respect to changes in air quality (Fig. 6).
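The ranking and averaging step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: tweets are represented by small hand-made vectors instead of spaCy document vectors, cosine similarity is computed directly, the function names and data are hypothetical, and the 10% cut-off is rounded down with a minimum of one tweet.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def estimate_pm25(test_vector, reference_set, top_fraction=0.10):
    """Mean PM2.5 over the days of the most similar reference tweets.

    reference_set: list of (tweet_vector, pm25_on_tweet_day) drawn from the
                   same season and class as the test tweet.
    """
    if not reference_set:
        raise ValueError("reference_set must not be empty")
    ranked = sorted(
        reference_set,
        key=lambda item: cosine_similarity(test_vector, item[0]),
        reverse=True,
    )
    # Assumption: round the 10% cut-off down, but keep at least one tweet.
    k = max(1, int(len(ranked) * top_fraction))
    return sum(pm25 for _, pm25 in ranked[:k]) / k


# Hypothetical reference set of 20 tweets: two resemble the test tweet and
# were posted on heavily polluted days, the rest are dissimilar.
reference = [([2.0, 0.0], 300.0), ([3.0, 0.0], 320.0)]
reference += [([0.0, 1.0], 50.0)] * 18
estimate = estimate_pm25([1.0, 0.0], reference)
```

Because the estimate is an average over neighbouring tweets, it inherits the PM2.5 values of whichever days produced similar wording, which is consistent with the method working best when tweets are unambiguous.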

Fig. 6

Variations in accuracy of PM2.5 estimation from the tweet content by described methodology

Results and discussions

The data classification model sorted the tweets into their corresponding Class (Class 0, I, or II) with an overall accuracy of 96.7% from the 10 fold cross validation testing and an accuracy of 87.4% for the model using test-train split of 80–20%. On the classified data, further analyses were done, the results of which are discussed in the following sections.

Temporal variation of tweet responses with PM2.5 concentrations

A time series analysis of the tweet responses of Class I and II along with the corresponding PM2.5 concentrations was conducted for the duration of the study (Fig. 3). The tweet response volume peaked with a rise in air pollution and dipped correspondingly with a fall in air pollution. There were more responses in Class I than in Class II. The larger response volume in Class I might be due to the alertness of the public to poor air quality that brings distress to their daily lives, and their eagerness to spread the information to others and to draw the attention of the government or other related or powerful individuals in society, apart from the fact that the quality of air remained poor for a major part of the study period. The low volume in Class II was primarily due to the smaller number of good air quality days in Delhi, and also because good air quality days were appreciated mostly after a spike in pollution and not much while the air quality had been staying good for a period of time.

Fig. 3

Time series of PM2.5 mass concentration with a Class I tweets count and b Class II tweets count

During the pre-monsoon season (March–May 2019), Delhi had its worst air quality towards the middle of May, and that was when the highest tweet response (around 200) in Class I was recorded. The day with the best air quality of the season reported no Class I tweets. A high Class II tweet response volume was found during the early days of March, when the air quality improved to the moderate category after a poor air quality period. Most of the time, when the PM2.5 concentrations fell within the desirable limit, there was some Class II tweet response activity. A sudden fall in pollutant level resulted in more tweet responses than a gradual fall.
During the period June–September 2019 (southwest monsoon season), the average PM2.5 concentration levels in Delhi were mostly within the desirable limit. Therefore, Class I tweet response activity was lower (below 100) than in other seasons, and there were more days with minimal or no Class I tweet response volume. In the case of Class II, the maximum tweet response volumes were recorded on the day with the best air quality of the year. Even though the quality of air in this season was mostly good, there was not a tweet response on each of those days. Class II responses were mostly observed on good air quality days after an increase in pollution or if the pollutant concentrations were well below the desirable limits.
In the post-monsoon season, the average PM2.5 concentration in Delhi was seldom within the desirable limits (below 60 μg/m3) specified in the National Ambient Air Quality Standards by CPCB. This was the infamous period when stubble burning in northern parts of India severely polluted the air in the northern states and a public health emergency was declared in Delhi (CNN 2020). This season also coincides with the Indian festival Diwali, which is mostly celebrated with fireworks, crackers, etc. Hence, on almost every day of this season, there was a Class I tweet response. Class I tweet response volume ranged from 100 units to above 50,000 on the most polluted day of the year, when the average PM2.5 concentration over Delhi was around 550 μg/m3. The time series of the tweet response volume in this season closely resembles that of the PM2.5 concentrations. There was a high response in Class II as well, because even a small fall in pollutant concentration was such a relief to the citizens that they readily responded, increasing the volume in Class II.
In the winter season, a Class I response was observed on almost every day, owing to the fact that the average PM2.5 concentration in Delhi during this season was seldom within the desirable limits. As the air pollution was not as bad as in the previous season, the range of tweet response volumes was smaller. The low volume of response on January 1, 2020, contrary to the high ground PM2.5 concentration on the same day, might have resulted from a shift of public attention from air pollution to New Year celebrations and activities. As the air quality in this season was generally poor, the volume of tweet responses in Class II was low. However, Class II tweets were prominent after a sharp fall in pollution, as seen in the previous cases.

Statistical relationship of tweet response with PM2.5

To analyze the statistical relationship between the tweet response volumes and the average ground PM2.5 concentrations, box plots were made for each class as shown in Fig. 4. For this analysis, the data was classified according to the different PM2.5 concentration categories as specified by CPCB.

Fig. 4

Statistical variations of the tweet volume responses with PM2.5 concentration ranges for a Class I and (b) Class II

It was seen from Fig. 4a that the tweet response volumes for Class I were the largest when the PM2.5 concentrations were in the poor-severe categories and the lowest when they were in the good-moderate categories. This indicates that people were more responsive on Twitter and posted tweets indicating poor air quality when the quality of air was in the poor-severe range, and that the Class I tweet response volumes increased with the severity of air pollution. The small volume of Class I tweets even on days with good-moderate air quality could be because the air quality of a day was poorer than that of the previous day, though the concentrations were not above the desired limits. It could also be a response of people visiting the city for the first time and finding the air quality to be bad compared to the place from which they arrived. From the present dataset, it was difficult to verify the behaviour of the floating population. Class II tweets were made not only based on the present day's air quality, but also when there was a decrease in pollution levels from previous days. Therefore, for the statistical analysis, the change in PM2.5 concentration every 3 days (∆3) was calculated, and the cumulative Class II tweet response volume during that period was considered and plotted as a box plot. From Fig. 4b, it was observed that days with good-satisfactory PM2.5 concentrations had smaller ∆3 and therefore had lower tweet response volumes. During the very poor-severe air pollution days, the tweet response volumes were higher in Class I than in Class II, as the pollution was too high for there to be anything positive to tweet about. It was only when the air quality improved from severe to moderate that people were more inclined to tweet in a positive manner, which explains the larger tweet response volume observed under the moderate-poor category.
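One possible reading of the ∆3 calculation is sketched below. The text does not specify how the 3-day windows were formed, so this sketch makes explicit assumptions: ∆3 for a day is taken as that day's PM2.5 minus the value 3 days earlier, and the cumulative Class II volume is summed over the 3 days ending on that day. The function name and the sample series are hypothetical.

```python
def delta3_with_volume(pm25_series, class2_volumes, window=3):
    """Pair each 3-day PM2.5 change with the Class II volume over that window.

    Assumptions (not stated in the source): the change on day t is
    pm25[t] - pm25[t - window], and the cumulative volume covers the
    `window` days ending on day t.
    """
    if len(pm25_series) != len(class2_volumes):
        raise ValueError("series must have the same length")
    pairs = []
    for t in range(window, len(pm25_series)):
        change = pm25_series[t] - pm25_series[t - window]
        volume = sum(class2_volumes[t - window + 1 : t + 1])
        pairs.append((change, volume))
    return pairs


# Hypothetical daily mean PM2.5 (ug/m3) and Class II tweet response volumes.
pm25 = [100.0, 90.0, 80.0, 40.0, 45.0]
class2 = [0, 1, 2, 5, 3]
pairs = delta3_with_volume(pm25, class2)
```

A negative change marks an improvement in air quality, which is the situation in which the text reports larger Class II response volumes.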

Behavioural analysis from word clouds

Word clouds were generated from the model classified and pre-processed data segregated season wise in order to study the responsive nature of public about air quality over various situations and seasons. Word clouds help in identifying the most frequent words in a season, the nature of emotions expressed through these words and the focus of people. The word cloud for each season of the study period is shown in Fig. 5.

Fig. 5

Word clouds for a) pre monsoon, b) southwest monsoon, c) post monsoon, and d) winter season

The word clouds for the different seasons in the study period collectively suggest that the citizens of Delhi were much interested in following the air quality index (AQI) of the city regularly. People seem to pay continuous attention to news updates, which are a major source of AQI values apart from dedicated apps, Twitter handles, etc., and to like and retweet tweets put up by news channels regarding the varying air quality conditions. During poor air quality seasons, the tweeting pattern seems to depend on how distressed people feel about air pollution, along with concern for health and calls for help and attention from the government and related departments. During good air quality seasons, people were found to tweet about their contentment and happiness at having a clear sky and to ask the government to maintain the good conditions. The reason for an improvement or deterioration of the air quality is also of interest to the people.
From the word cloud for the pre-monsoon season, it was evident that the air quality was mostly in the ‘moderate’ to ‘severe’ zone. The social media responses reported when the air quality turned ‘poor’, when it was ‘deteriorate’-ing, or when it showed an ‘improve’-ment. This was also the summer season in Delhi, and there were tweets correlating air quality with ‘temperature’ and ‘dust’ and describing how it was an ‘unhealthy’ condition in which it was difficult to ‘breathe’. In the monsoon season, tweets were mostly about ‘improve’-d air quality and about ‘rain’, ‘wind’, and ‘storm’, with an ‘AQI’ mostly in the ‘good’, ‘satisfactory’, or ‘moderate’ ranges, and asked the ‘government’ to maintain the ‘clean’ ‘sky’. The post-monsoon season had a large number of tweets in comparison to the other seasons, and its word cloud had the most words.
The public mostly spoke about how the condition was ‘severe’, ‘poor’, ‘toxic’, ‘hazardous’, and an ‘emergency’ situation which ‘choked’ and affected the daily life of the ‘people’ and led to ‘school’ closures and ‘flights’ being delayed or redirected. They also reported the effect of ‘diwali’ in turning Delhi into a ‘gas chamber’. Even the slightest ‘improve’-ment brought a ‘relief’, causing people to tweet about it. In the winter season, tweets were mainly about the ‘poor’ quality of air still ‘remain’-ing in the city, with its share of ‘improve’-ments. But the condition was mostly ‘toxic’ and ‘bad’ as ‘hell’. The ‘fog’ contributed much to the ‘smog’.

PM2.5 estimation results

A deep learning-based model that can estimate PM2.5 concentrations from the content of tweets was developed in the present study. Figure 6 shows the percentage accuracy of these estimations for each PM2.5 concentration category (CPCB 2014).

Variations in accuracy of PM2.5 estimation from the tweet content by described methodology

Percent accuracy was found to be high for the extreme air quality categories, namely good air quality (80%) and severe air quality (99%). This was because the public could clearly experience these conditions without ambiguity and then reflected these experiences with appropriate words in their tweets, which made the conditions easy for our prediction model to estimate. As the PM2.5 concentration moved into other categories, mixed tweeting behaviour was observed, which reduced the prediction accuracy of the model. Also, in a city like Delhi, the citizens were used to moderate to poor air quality conditions for the major part of the year, and thus their reaction to moderate air pollution situations was not so prominent. It was the deviations from the moderate conditions that made people more alert and produced a larger number of specific tweets suitable for accurate prediction.
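
A per-category accuracy of the kind reported above could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: this section does not give the accuracy formula, so an estimate is assumed to count as correct when it falls in the same category as the measured value. The category breakpoints are assumed to follow the 24-hour PM2.5 ranges of the Indian National AQI and should be checked against CPCB (2014). All names and sample values are hypothetical.

```python
from collections import defaultdict

# Assumed upper bounds (ug/m3, 24-hour PM2.5) per category; verify against CPCB (2014).
BREAKPOINTS = [
    (30, "good"),
    (60, "satisfactory"),
    (90, "moderate"),
    (120, "poor"),
    (250, "very poor"),
]


def pm25_category(value):
    """Map a PM2.5 concentration to its assumed category name."""
    for upper, name in BREAKPOINTS:
        if value <= upper:
            return name
    return "severe"


def accuracy_by_category(observed, estimated):
    """Percent of estimates landing in the same category as the observation."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for obs, est in zip(observed, estimated):
        category = pm25_category(obs)
        totals[category] += 1
        if pm25_category(est) == category:
            hits[category] += 1
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}


# Made-up measured and tweet-derived values, for illustration only.
observed = [20, 25, 75, 80, 300]
estimated = [28, 45, 70, 130, 400]
result = accuracy_by_category(observed, estimated)
```

Grouping by the observed category, rather than the estimated one, matches how the results are described: accuracy is reported for each measured air quality condition.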

Conclusions

The present study explored the social media behaviour of urban dwellers towards air pollution and developed a deep learning-based model to predict the concentration of PM2.5 in the city from air pollution-related social media responses. The population of Delhi expressed their agony and disappointment about urban air quality through various modes, and the present study analyzed their response to air quality on the popular social media forum Twitter. The Twitter-generated data were sorted before analysis using an efficient self-attention network. The relationship of the Twitter responses to the actual pollution level was inferred from temporal and statistical analysis of average PM2.5 concentrations against the tweet response volumes received per day, over the various seasons of a year and across the categories of PM2.5 concentration ranges. The emotions and attitudes that the Delhi public hold towards air quality are depicted through the words they use in tweets; these were analyzed using word clouds, which showed that the Delhi public closely follows the AQI in Delhi through news streams, dedicated apps, or social media. It is evident from these analyses that social media is a powerful tool for monitoring air quality variations in an urban city like Delhi. The PM concentration estimation by this methodology showed high estimation accuracy (above 80%) for extreme air quality conditions and lower accuracy for moderate air quality conditions. Similar results have been reported by Gurajala and Matthews (2018). For low PM2.5 values, the correlation with tweet numbers was poor, and it improved with increasing PM. The observation of increasing correlation with increasing PM values is consistent with the findings of Jiang et al. (2015). It was observed that few people in India geotag their tweets, and it was therefore difficult to find the location of specific tweets unless the location was mentioned in the tweet.
The present study considered only the text of tweets for analysis. There were Twitter responses in other forms, such as pictures and re-directing internet links, which were not considered in this study. This may be overcome by developing a system to identify multiple languages, image interpretations, etc. Tweets in regional languages were also overlooked in this study; the models could be developed further to include more vernacular languages of the area and mixed scripts using cross-lingual knowledge transfer and analysis. The nature of people's response on Twitter in terms of likes or retweets may be influenced by the tweets of influential persons, popular organisations, bots, etc. This influence may be taken into account, and the current study could be advanced further by analyzing the account origins of the tweets. It was seen during the pilot study that many Indian cities face air quality issues, but most of them do not have an active Twitter response culture. Metro cities showed relatively better responses, and Delhi, with its severe air quality issues, had a high response volume. Though the study focused on the social media behaviour of people in Delhi and depends on several factors such as access to social media, environmental awareness among the public, and the range of changes in air quality experienced by the public, the methodology is replicable in any urban area in the world. Twitter was also not very popular in India compared to other social media platforms like Facebook, YouTube, and Instagram. Such platforms may also be explored in the future to determine whether they hold higher relevance than Twitter for monitoring real-time air pollution-related responses in India.
In the estimation of pollutant concentrations, similarity analysis might be insufficient for unusual air quality conditions that are unfamiliar to the current training data set, such as the unusually sharp fall in air pollution levels in India during the Covid-19 pandemic lockdown.

3 in total

1. Evaluation of machine learning techniques with multiple remote sensing datasets in estimating monthly concentrations of ground-level PM2.5.

Authors: Yongming Xu; Hung Chak Ho; Man Sing Wong; Chengbin Deng; Yuan Shi; Ta-Chien Chan; Anders Knudby
Journal: Environ Pollut Date: 2018-08-11 Impact factor: 8.071

2. Spatial modelling of soil salinity: deep or shallow learning models?

Authors: Aliakbar Mohammadifar; Hamid Gholami; Shahram Golzari; Adrian L Collins
Journal: Environ Sci Pollut Res Int Date: 2021-03-23 Impact factor: 5.190

3. Using Social Media to Detect Outdoor Air Pollution and Monitor Air Quality Index (AQI): A Geo-Targeted Spatiotemporal Analysis Framework with Sina Weibo (Chinese Twitter).

Authors: Wei Jiang; Yandong Wang; Ming-Hsiang Tsou; Xiaokang Fu
Journal: PLoS One Date: 2015-10-27 Impact factor: 3.240