Topic classification of electric vehicle consumer experiences with transformer-based deep learning.

Sooji Ha^1,2, Daniel J Marchetto^3, Sameer Dharur^4, Omar I Asensio^3,5.

Abstract

The transportation sector is a major contributor to greenhouse gas (GHG) emissions and is a driver of adverse health effects globally. Increasingly, government policies have promoted the adoption of electric vehicles (EVs) as a solution to mitigate GHG emissions. However, government analysts have failed to fully utilize consumer data in decisions related to charging infrastructure. This is because a large share of EV data is unstructured text, which presents challenges for data discovery. In this article, we deploy advances in transformer-based deep learning to discover topics of attention in a nationally representative sample of user reviews. We report classification accuracies greater than 91% (F1 scores of 0.83), outperforming previously leading algorithms in this domain. We describe applications of these deep learning models for public policy analysis and large-scale implementation. This capability can boost intelligence for the EV charging market, which is expected to grow to US$27.6 billion by 2027.

Keywords: artificial intelligence; consumer behavior; deep learning; electric vehicles; machine learning; natural language processing; policy analysis; sustainable transportation; topic classification; transformers

Year: 2021 PMID: 33659911 PMCID: PMC7892356 DOI: 10.1016/j.patter.2020.100195

Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899

Introduction

In recent years, there has been a growing emphasis on vehicle electrification as a means to mitigate the effects of greenhouse gas emissions and related health impacts from the transportation sector. For example, typical calculations suggest that electric vehicles (EVs) reduce emissions from 244 to 98 g/km, and this number could further decrease to 10 g/km with renewable energy integration. The environmental benefits vary by fuel type, with reported carbon intensities of 8,887 g CO2 per gallon of gasoline and 10,180 g CO2 per gallon of diesel. Government-driven incentives for switching to EVs, including utility rebates, tax credits, exemptions, and other policies, have been rolled out in many US states [5, 6, 7]. In this effort, public charging infrastructure remains a critical complementary asset for consumers in building range confidence for trip planning and in EV purchase decisions [8, 9, 10]. Prior behavioral research has shown that policies designed to enhance EV adoption have largely focused on increasing the quantity of cars and connected infrastructure as opposed to the quality of the charging experience. However, a fundamental challenge in deploying large-scale EV infrastructure is the regular assessment of quality. Private digital platforms such as mobility apps for locating charging stations and other services have become increasingly popular. Reports by third-party platform owners suggest there are already over 3 million user reviews of EV charging stations in the public domain [12, 13, 14, 15].
In this paper, we evaluate whether transformer-based deep learning models can automatically discover experiences about EV charging behavior from unstructured data and whether supervised deep learning models perform better than human benchmarks, particularly in complex technology areas. Because mobile apps facilitate exchanges of user texts on the platform, multiple topics of discussion exist in EV charging reviews. For example, a review states: “Fast charger working fine. Don't mind the $7 to charge, do mind the over-the-phone 10 min credit card transaction.” A multilabel classification algorithm may be able to discover that the station is functional, that a user reports an acceptable cost, and that a user reports issues with customer service. Consequently, text classification algorithms that can automatically perform multilabel classification are needed to interpret the data.
Multilabel classification of these reviews is important for three principal reasons. First, these algorithms can enable analysis of massive digital data. This is important because behavioral evidence about charging experiences has primarily been inferred through data from government surveys or simulations. These survey-based approaches have major limitations, as they are often slow and costly to collect, are limited to regional sampling, and are often subject to self-report or recency bias. Second, multilabel algorithms with digital data can characterize phenomena across different EV networks and regions. Some industry analysts have criticized EV mobility data for poor network interoperability, which prevents data from easily being accessed, shared, and collected. This type of multilabeled output is also important for application programming interface (API) standardization across the industry, such as with emerging but not yet widely accepted technology standards, including the Open Charge Point Protocol that would help with real-time data sharing across regions. Third, this capability may be critical for standardizing software and mobile app development in future stages of data science maturity (see https://www.cell.com/patterns/dsml) to detect behavioral failures in near real-time from user-generated data.
Modern computational algorithms from natural language processing (NLP) could uniquely address the need for fast, real-time consumer intelligence related to electric mobility, but these algorithms need to be appropriately tailored to domains to be useful. Large-scale analysis of unstructured EV user data remains difficult to carry out, especially when there are multiple topics discussed in each review and the datasets are imbalanced. Imbalanced data create challenges for models to learn important but less frequently occurring labels and often lead to algorithmic bias. In this paper, we demonstrate the use of deep neural networks to automatically discover insights for topic analysis. We use supervised learning to overcome prior challenges with unsupervised methods that could produce clusters with very little theoretical or social meaning. We provide a proof of concept for the complex task of multilabel topic classification in this domain, which builds on an earlier demonstration of binary sentiment classification with NLP. We apply transformer neural networks, a recent class of pre-trained contextual language models, to accurately detect long-tail discussion topics with imbalanced data, a capability that has been elusive with prior approaches. Prior research demonstrated the efficacy of convolutional neural networks (CNNs) [18, 19, 20, 21] and long short-term memory (LSTM), a commonly used variant of recurrent neural networks, for NLP. These models have been recently applied to sentiment classification and single-label topic classification tasks in this domain.
As a result, the use of NLP has increased our understanding of potential EV charging infrastructure issues, such as the prevalence of negative consumer experiences in urban locations compared with non-urban locations. Although these models showed promise for binary classification of short texts, generalizing these models to reliably identify multiple discussion topics automatically from text presents researchers with an unsolved challenge of underdetection, particularly in corpora with wide-ranging topics and imbalances in the training data. Prior research using sentiment analysis indicates negative user experiences in EV charging station reviews, but it has not been able to extract the specific causes. As a result, multilabel topic classification is needed to understand behavioral foundations of user interactions in electric mobility. In this paper, we achieve state-of-the-art multilabel topic classification in this domain using the transformer-based deep neural networks BERT, which stands for bidirectional encoder representations from transformers, and XLNet, which integrates ideas from Transformer-XL architectures. We benchmark the performance of these transformer models against classification results obtained from adapted CNNs and LSTMs. We also evaluate the potential for super-human performance of the classifiers by comparing human benchmarks from crowd-annotated training data with expert-annotated training data and transformer models. The extent of this improvement could significantly accelerate automated research evaluation using large-scale consumer data for performance assessment and regional policy analysis. We discuss implications for scalable deployment, real-time detection of failures, and management of infrastructure in sustainable transportation systems.

Results and discussion

Discovering topics

Charging station reviews can be considered asynchronous social interactions within a community of EV drivers. To characterize user experiences, we introduce 8 main topics and 32 sub-topics that make up a typology of charging behavior. This typology allows for easier identification of behavioral issues with the charging process (Table 1). The definitions we use for supervised learning are as follows: Functionality refers to comments describing whether particular features or services are working properly at a charging station. Range anxiety refers to comments regarding EV drivers' fear of running out of fuel mid-trip and comments concerning tactics to avoid running out of fuel. Availability refers to comments concerning whether charging stations are available for use at a given location. Cost refers to comments about the amount of money required to park and/or charge at particular locations. User interaction refers to comments in which users are directly interacting with other EV drivers in the community. Location refers to comments about various features or amenities specific to a charging station location. The Service time topic refers to comments reporting charging rates (e.g., 10 miles of range per hour charged) experienced in a charging session. The Dealership topic refers to comments concerning specific dealerships and users' associated charging experiences. Reviews that do not fall into the previous eight topics are assigned to the Other topic and are relatively rare. For more information on the robustness of the typology, see Supplemental experimental procedures and Tables S5–S7 in the Supplemental information.

Table 1

EV mobile app typology of user reviews

Topic	Sub-topic examples
Functionality	general functionality, charger, screen, power level, connector type, card, reader, connection, time, error message, station, mobile application, customer service
Range anxiety	trip, range, location accessibility
Availability	number of stations available, ICE, general congestion
Cost	parking, charging, payment
User interactions	charger etiquette, anticipated time available, user tips
Location	general location, directions, staff, amenities, points of interest, user activity, signage
Service time	charging rate
Dealership	dealership charging experience, competing brand quality, relationship with dealers
Other	general experiences

ICE refers to situations where a charging station is blocked by an internal combustion engine vehicle.

In preliminary experiments, we investigated several unsupervised topic modeling techniques that did not provide theoretically meaningful clusters. By contrast, our empirically driven typology is ideally suited to hypothesis testing, spatial analysis, benchmarking with other corpora in this domain, and real-time tracking of station failures, all of which are not identifiable with current information systems. For additional details on how the typology and coding scheme were developed from prior work and theory, see Developing the coding scheme for supervised learning.
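As a minimal sketch of what a multilabel target looks like under this typology, the snippet below encodes a set of topic labels as a multi-hot vector over the eight main topics. The topic ordering and the helper names `encode_labels` and `decode_labels` are assumptions made for illustration and are not taken from the paper. The example labels are the ground truth reported for review 2 in Table 4.

```python
# Eight main topics from Table 1. The ordering here is an assumption
# for illustration; the paper does not prescribe one.
TOPICS = [
    "Functionality", "Range anxiety", "Availability", "Cost",
    "User interaction", "Location", "Service time", "Dealership",
]


def encode_labels(labels):
    """Return a multi-hot list with 1 where a topic is present, else 0."""
    present = set(labels)
    unknown = present - set(TOPICS)
    if unknown:
        raise ValueError(f"Unknown topics: {sorted(unknown)}")
    return [1 if topic in present else 0 for topic in TOPICS]


def decode_labels(vector):
    """Map a multi-hot list back to topic names, in TOPICS order."""
    return [topic for topic, flag in zip(TOPICS, vector) if flag]


# Ground truth labels reported for review 2 in Table 4.
vector = encode_labels(["Functionality", "Availability", "Dealership"])
print(vector)
print(decode_labels(vector))
```

A classifier for this task would then output one score per position of the vector, so that several topics can be predicted for the same review.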

Transformers beat other deep neural networks

Overall performance

We evaluated the accuracy of BERT and XLNet transformer models against other leading models, CNN and LSTM, which were previously dominant architectures in this domain. Given that we have imbalanced data for machine classification, we also report the F1 score, which is the harmonic mean of precision and recall and is considered a measure of detection efficiency. As shown in Table 2, we achieved high overall accuracy scores for BERT and XLNet of 91.6% (0.13 SD) and 91.6% (0.07 SD), and F1 scores of 0.83 (0.0037 SD) and 0.84 (0.0015 SD), respectively. The standard deviations were generated from 10 cross-validation runs. Although the CNN and LSTM models had only slightly lower accuracy, both transformer models outperform them on both accuracy and F1 score. We report 2 to 4 percentage point improvements in the F1 scores for both transformer models. For implementation details, see Supplemental experimental procedures and Figure S1. For reference, we provide the hyperparameters used for the transformer models in Table S1. We also open-sourced the model weights (see Resource availability).

Table 2

Overall model performance

Model	Accuracy % (SD)	F1 score (SD)
BERT	91.6 (0.13)	0.83 (0.0037)
XLNet	91.6 (0.07)	0.84 (0.0015)
Majority classifier	81.1 (0.00)	0.45 (0.0000)
LSTM	90.3 (0.17)	0.80 (0.0036)
CNN	90.9 (0.12)	0.81 (0.0032)

Models were trained and tested on expert annotated data.

The F1 scores for the transformer models are also roughly 40 percentage points higher than that of the majority classifier (0.83 and 0.84 versus 0.45; Table 2). This means the models learned to detect minority classes effectively. Briefly, the majority classifier provides a measure of the level of imbalance. For a given category, the majority classifier simply predicts the most prevalent label. For example, if 90% of training data has not been selected for a topic, then the classifier predicts all data as not selected, giving a high accuracy of 90%. Thus, for highly imbalanced data, a majority classifier can provide arbitrarily high accuracy without significant learning. Because it is possible that misclassification errors may not distribute equally across the topics, in the next section, we also evaluate the performance by topics.
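The 90% example above can be sketched in a few lines. The snippet builds a toy single-topic label set with 10% positives, applies a majority classifier, and computes accuracy together with precision, recall, and F1 for the positive class. The data and function names are hypothetical. How the paper aggregates F1 across topics is not stated in this section, so the aggregated value in Table 2 will differ from this single-topic figure.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def majority_predict(y_train, n):
    """Predict the most prevalent training label for all n items."""
    majority = 1 if 2 * sum(y_train) > len(y_train) else 0
    return [majority] * n


# Toy topic: selected in 10 of 100 reviews (hypothetical data).
y_true = [1] * 10 + [0] * 90
y_pred = majority_predict(y_true, len(y_true))

print(accuracy(y_true, y_pred))             # high accuracy from imbalance alone
print(precision_recall_f1(y_true, y_pred))  # no positives are ever detected
```

Accuracy rewards the classifier for the 90 negatives, while F1 for the positive class exposes that the topic is never detected, which is why both metrics are reported.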

Increasing detection of imbalanced labels

A key challenge was to evaluate whether we could improve multilabel classifications even in the presence of imbalanced data. Figure 1A shows a large percentage point increase in accuracy for all the deep learning models tested, compared with the majority classifier. This evidence of learning is especially notable for the most balanced topics (i.e., Functionality, Location, and Availability). As shown in Figure 1B, we report improvements in the F1 scores for BERT and XLNet across most topics versus the benchmark models. In particular, this result holds for the relatively imbalanced topics (i.e., Range anxiety, Service time, and Cost), which have presented technical hurdles in prior implementations. In comparison with the previously leading CNN algorithm, BERT and XLNet produce F1 score increases of 1–3 percentage points on Functionality, Availability, Cost, Location, and Dealership topics and 5–7 percentage points on User interaction and Service time topics. For Range anxiety, BERT is within the statistical uncertainty of the CNN performance, while XLNet produces an increase in the F1 score of 4 percentage points. These numbers represent considerable improvements in topic level detection. For detailed point estimates, see Tables S2 and S3.

Figure 1

Topic level classification performance

(A) For the baseline model we use the majority classifier, which predicts the simple majority for a given topic. For higher values in accuracy, the majority classifier reflects more imbalance in the training and testing data. We find that the deep learning models outperform the majority classifier in model accuracy, particularly for more frequently occurring labels, the Functionality, Location, and Availability topics.

(B) We also compare the relative performance of the transformer models with CNN and LSTM classifiers. High F1 scores for imbalanced topics indicate strong detection of true positives. Our results indicate that transformer models, BERT and XLNet, which achieve similar performance, improve upon the CNN and LSTM benchmarks in the F1 score across all topics. The error bars represent upper and lower 95% confidence intervals.

Computation time

An important metric to consider while running deep learning models for large-scale deployment is the computation time. Deep neural networks have recently been criticized for the large amount of resources needed, such as graphics processing units (GPUs) and distributed computing clusters, frequently leading to higher costs of deployment. Further, NLP researchers have also considered the environmental costs of the power consumption and CO2 emissions for computing, which necessarily involve trade-offs. In our application, we report the training times per epoch for BERT and XLNet as 196 and 346 s, respectively. These results were generated using four widely available NVIDIA Tesla P100 GPUs with 16 GB of memory. We find that the training and testing times are considerably longer for the transformer models compared with CNN and LSTM. For transformers, total computing times vary from 1 to 4 h, and for CNN and LSTM, computing times vary from 1 to 90 min, depending on the number of GPUs (see Table S4 for details). We argue that the added computational cost of the transformer models may be justified for large-scale deployment because it is offset by substantial gains in accuracy and F1 score. When comparing BERT and XLNet within the class of transformers, we also show BERT to be considerably faster in total computing time for a comparable level of performance. Therefore, we argue that, as further enhancements to BERT and its optimized variants are rapidly advancing in the literature [31, 32, 33], BERT could be a preferred text classification algorithm for this domain. In the next section, we consider scalability of the models by evaluating potential sources of training data.

Trained experts beat the crowd

In Table 3, we compare the machine classification results based on training data from a crowd of non-experts versus a group of trained expert annotators. For performance comparison of models trained with expert- and crowd-annotated data, we created a ground truth dataset by conducting researcher audits to ensure 100% agreement on the ground truth labels. See Human annotation of training data for further details. Not surprisingly, we find that human experts are closer to the ground truth (random holdout sample; n = 100) in both accuracy and F1 score, as shown in Table 3. This is consistent with related literature on limitations to wise crowds. In fact, prior research has found gaps in general public knowledge about EVs and consumer misperceptions [35, 36, 37, 38]. In the next section, we quantify the performance of crowd-trained versus expert-trained transformer models, using the two experimentally curated sources of training data.

Table 3

Ground truth evaluation of human performance versus transformer models

Classifier	Training set	Accuracy % (SD)	F1 score (SD)
BERT	Expert annotated	89.1 (4.09)	0.82 (0.06)
BERT	Crowd annotated	73.2 (3.85)	0.53 (0.06)
XLNet	Expert annotated	91.0 (4.70)	0.85 (0.06)
XLNet	Crowd annotated	74.2 (4.15)	0.54 (0.07)
Crowd (κ = 0.007)	–	73.9 (6.06)	0.61 (0.09)
Human experts (κ = 0.538)	–	86.0 (4.40)	0.79 (0.07)

Cross validation was for 10 runs.

Crowd-trained models perform poorly

The transformer models trained with crowd-annotated data produced accuracies of 73.2% (3.85 SD) and 74.2% (4.15 SD) and F1 scores of 0.53 (0.06 SD) and 0.54 (0.07 SD) for BERT and XLNet, respectively (see Table 3). By contrast, we see a remarkable improvement in these results with the expert-trained BERT and XLNet models, which produced model accuracies of 89.1% (4.09 SD) and 91.0% (4.70 SD) and F1 scores of 0.82 (0.06 SD) and 0.85 (0.06 SD), respectively. We discovered that the enhancement in the F1 score is largely due to gains in the interrater reliability, which is the result of improvements in the quality of the training data between crowds and experts (see the increase in Fleiss' κ from 0.007 to 0.538 in Table 3). We argue that interrater agreement is critical when working with annotated data from complex domains such as EV mobility. For reference, at the sub-topic level, values for Fleiss' κ range from −0.001 to 0.019 for the crowd, and from 0.30 to 0.72 for the experts, which indicates considerable disagreement on the labeling task within a sample of adults, 18 years and over, representative of the US population. See Experimental procedures for details on human annotation experiments. Although sourcing strategies with online labor pools may be inexpensive, we find that the cost advantage does not justify the poor performance (F1 score 0.61, 0.09 SD). These results indicate that the use of low-cost crowd-sourcing approaches to build massive training sets is likely not feasible for large-scale implementation in this domain. This is in stark contrast to other deep learning domains, such as computer vision, where cheap, crowd-sourced training data can be easily acquired. For example, identifying sections of a road or public bus in an image is an easy task for the average person, but the average person cannot easily categorize the topics of EV user reviews. 
To provide an example of this, in our experiments, the review, “… What an inconvenience when I need to drive to Glendale and I have a very low charge …,” was cognitively difficult for general crowd annotators to correctly classify as Range anxiety, even when annotators had unrestricted access to definitions and related examples. This was not the case for most experts. As a result, for these complex domains, expert-curated training data will be required for large-scale implementation. In the next section, we compare the performance of our best classifiers, using artificial intelligence versus human intelligence.
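For readers unfamiliar with the interrater statistic used above, the following is a minimal sketch of the standard Fleiss' κ computation on toy data. The function name, the input layout (one row of per-category rater counts per item), and the example counts are assumptions for illustration. The paper's exact annotation setup, such as the number of raters per review, is described in its Experimental procedures and is not reproduced here.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j. Every item must have the
    same total number of raters (at least 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("Each item needs the same number (>= 2) of raters")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_exp) / (1.0 - p_exp)


# Toy binary task (topic selected / not selected), three raters per review.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2]]
print(fleiss_kappa(perfect))
print(fleiss_kappa(split))
```

Values near 1 indicate near-complete agreement, and values near or below 0 indicate agreement no better than chance, which is the regime reported for the crowd annotators.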

Possibility of super-human classification

During hand validations of the transformer-based experiments, we noticed that some test data that were not correctly labeled by the human experts were being correctly labeled by the transformer models. This caught our attention, as it indicated the possibility that BERT and XLNet could in some cases exceed the human experts in multilabel classification. In Table 3, we see that expert-trained transformer models performed about 3–5 percentage points higher in accuracy and 0.03–0.06 points higher in the F1 score compared with our human experts. In Table 4, we provide six specific examples of this phenomenon where the expert-trained transformers do better than human experts. Exceeding human expert benchmarks could happen in multiple ways. It could be that the algorithm correctly detects a topic that the human experts did not detect (i.e., reviews 1 and 2 in Table 4), or that it does not detect a topic that has been incorrectly labeled by an expert (i.e., reviews 4–6 in Table 4), or that the sum of misclassification errors is smaller than that of human experts (i.e., reviews 3–6 in Table 4). We also provide quantitative measures in accuracy for these examples in Table 4.

Table 4

Examples where expert-trained transformers exceed human benchmarks

Review	Ground truth labels	Human expert labels	Human expert acc. (%)	BERT labels	BERT acc. (%)	XLNet labels	XLNet acc. (%)
1. “… unit says decommissioned but it will still release the charger after a long pause.”	Functionality	User interaction	75	Functionality	100	Functionality	100
2. “Thanks very busy dealership but happy to allow use of qcdc.”	Functionality, Availability, Dealership	Functionality, Dealership	87.5	Functionality, Availability, Dealership	100	Functionality, Availability, Dealership	100
3. “Charging on the quick charger - will be done by 12:15.”	Functionality, User interaction	Functionality, Location	75	User interaction	87.5	User interaction	87.5
4. “Went from 18-82% in 27 min! First time DC charging and met another nice Leaf owner who showed me how to use the machine. Thanks for the charge!”	Functionality, Service time	Functionality, Availability, Location, User interaction, Dealership	62.5	Service time	87.5	Functionality, Service time, Dealership	87.5
5. “The CHAdeMO charger does work …. Nissan Hill had to move an ICE for me to gain access, but did so quickly. The CHAdeMO did not cost me any $ Charged quick! Don't hesitate to use.”	Functionality, Availability, Cost, Dealership	Functionality, Availability, Cost, User interaction, Location, Service time, Dealership	62.5	Functionality, Cost, Dealership	87.5	Functionality, Cost, Service time, Dealership	75
6. “So the dealer had all of their cars being serviced parked in every spot including the quick charger. I called and asked them for at least access to the quick charger and they agreed but never did anything so I left and drove to Larry h nissan. I was willing to pay because I was in a hurry and obviously the Toyota dealer doesn't want my business.”	Availability, Cost, Dealership	Functionality, Availability, User interaction, Location, Dealership	50	Availability, Dealership	87.5	Availability, Location, Dealership	75

Although a full investigation of super-human performance for these transformer neural networks is outside the scope of the current study, we suggest this as important future work. Evidence that artificial intelligence can outperform human benchmarks on multilabel classification tasks can have practical benefits for station managers and investors to be able to accurately predict system problems or examine customer needs at high resolution in ways not previously possible.
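The per-review accuracies for reviews 1 and 6 in Table 4 are consistent with counting, for each review, how many of the eight main topics were marked correctly as present or absent. The sketch below reproduces those values under that assumption. The function name and topic ordering are illustrative and are not taken from the paper.

```python
# Eight main topics from Table 1 (ordering is an assumption).
TOPICS = [
    "Functionality", "Range anxiety", "Availability", "Cost",
    "User interaction", "Location", "Service time", "Dealership",
]


def review_accuracy(truth, predicted):
    """Percentage of the eight topics whose presence or absence in
    `predicted` matches `truth` (a per-review, Hamming-style accuracy)."""
    truth, predicted = set(truth), set(predicted)
    correct = sum(1 for t in TOPICS if (t in truth) == (t in predicted))
    return 100.0 * correct / len(TOPICS)


# Review 1 in Table 4.
truth_1 = ["Functionality"]
print(review_accuracy(truth_1, ["User interaction"]))  # human expert
print(review_accuracy(truth_1, ["Functionality"]))     # BERT and XLNet

# Review 6 in Table 4.
truth_6 = ["Availability", "Cost", "Dealership"]
human_6 = ["Functionality", "Availability", "User interaction",
           "Location", "Dealership"]
print(review_accuracy(truth_6, human_6))                                  # human expert
print(review_accuracy(truth_6, ["Availability", "Dealership"]))           # BERT
print(review_accuracy(truth_6, ["Availability", "Location", "Dealership"]))  # XLNet
```

Under this measure, one missed or one spurious topic each cost 12.5 percentage points, so a model that misses a single topic can still score above an annotator who adds several incorrect ones.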

Applications to local and regional policy

As EV consumer data expands, we comment on the possibility to apply this computational approach widely to local and regional policy analysis. We note that, previously, this type of extracted consumer intelligence has not been easily accessible to policy makers or governments due to the nature of unstructured data and issues with data access. For example, the US Department of Energy's Alternative Fuels Data Center maintains a list of all publicly accessible stations in the United States and Canada. This includes location information, such as station name, address, phone number, charging level (e.g., L1, L2, or L3), number of connectors, and operating hours with a developer-friendly API. However, these aggregated data sources do not typically include real-time usage or station availability, due to challenges with network interoperability. This means that due to the presence of different charging standards of manufacturers in regional EV networks, there remain structural issues with sharing and receiving EV usage data between regions. Recently, there has been a movement by a global consortium of public and private EV infrastructure leaders to promote open standards such as the Open Charge Point Protocol and the Open Smart Charging Protocol. As these technology standards become more widely adopted, there will be a rapid increase in the amount of real-time data that can be shared with researchers and analysts. For instance, a growing number of digital platform providers have begun moving toward open data. These include platforms such as Open Charge Map, Recharge, and Google Maps. In the future, it should be possible to easily merge consumer review data with other spatial features and information. This could provide a wealth of commonly used features for analysis such as socioeconomic indicators, including population, income levels, educational attainment, age, poverty rates, unemployment, and affordability of nearby housing. 
Other important features could include transportation economic indicators, air pollution, health data, mobile phone tracking data, point-of-interest information, and local and regional incentives. To provide an example of possible data insights for urban policy, we conducted a spatial analysis of metropolitan and micropolitan statistical areas (MSAs and μSAs). One of the dominant topics is Availability, which is predicted when a user reports whether a given charging station is available for use. In Figure 2, we visualize the spatial distribution of predicted station availability by US census regions. To create this map, we merged the predicted review topics with counties based on shape files from the Office of Management and Budget's (OMB) 2013 specification of MSAs and μSAs. In the United States, there are 1,167 MSAs (population larger than 50,000) and 641 μSAs (population greater than 10,000), and 1,335 non-core-based statistical areas (population less than 10,000). To visualize model predictions, we standardized the predicted frequency of the Availability topic into quantiles for each census region (West, Midwest, Northeast, and South), with 0%–44%, rarely; 45%–69%, sometimes; 70%–90%, a moderate amount; and over 90%, a great deal (see Figure 2). The map reveals areas with high and low predicted Availability issues in all core-based statistical areas.
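The binning step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the percentile-rank definition (share of areas in the same region with a frequency at or below the given area's), the handling of values between 44% and 45%, the function names, and the toy records are all hypothetical.

```python
from collections import defaultdict


def percentile_ranks(values):
    """Percent of values in the group that are <= each value.
    This is one of several possible percentile-rank definitions."""
    n = len(values)
    return [100.0 * sum(1 for v in values if v <= x) / n for x in values]


def label_for(pct):
    """Map a within-region percentile rank to the four map categories."""
    if pct > 90:
        return "a great deal"
    if pct >= 70:
        return "a moderate amount"
    if pct >= 45:
        return "sometimes"
    return "rarely"  # assumption: anything below 45 falls in the lowest bin


def label_areas(records):
    """records: iterable of (area, census_region, predicted_frequency).
    Returns {area: label}, ranking each area within its own region."""
    by_region = defaultdict(list)
    for area, region, freq in records:
        by_region[region].append((area, freq))
    labels = {}
    for items in by_region.values():
        ranks = percentile_ranks([freq for _, freq in items])
        for (area, _), pct in zip(items, ranks):
            labels[area] = label_for(pct)
    return labels


# Ten hypothetical areas in one region with increasing predicted frequency.
records = [(f"area_{i}", "West", i / 100) for i in range(1, 11)]
labels = label_areas(records)
for area, label in labels.items():
    print(area, label)
```

Ranking within each census region, rather than nationally, means an area's label reflects how it compares with its regional peers.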

Figure 2

Predicted discussion frequency of station availability for US metropolitan and micropolitan statistical areas

The map reveals areas with high and low discussion frequency for predicted Availability issues in all metropolitan statistical areas (e.g., population greater than 50,000). Micropolitan statistical areas (e.g., population 10,000–49,999) have higher Availability discussions in some states in the West and Midwest regions. The algorithms predict that many micropolitan statistical areas could be underserved with regard to station availability.

Using this approach, we find that predicted station availability issues are not necessarily concentrated in the large central metro counties (MSAs over 1 million population), but rather away from the city centers, such as smaller μSAs of population less than 50,000. This is particularly true in the West (e.g., Oregon, Utah, Colorado, Wyoming, New Mexico) and Midwest (e.g., South Dakota and Nebraska) and Hawaii. By contrast, for the South (e.g., Texas, Alabama, Florida, North Carolina, South Carolina, Tennessee) and Northeast regions (e.g., New York, New Jersey, Massachusetts, Maryland, Pennsylvania), we find the highest frequency of availability issues in the major MSAs for the period of analysis. One primary insight from this analysis is that μSAs could be underserved with regard to station availability. In additional analyses, we also used our methodology to detect whether a specific station is functioning. Based on the rate of consumers leaving reviews at charging stations across the United States, we find that the deep learning algorithms can detect the functioning of a certain station, daily. For further details of these estimates, see Supplemental experimental procedures. This type of detection could also be done with any of our introduced topics and with expanded sample datasets from network providers. 
Given the proliferation of EV policies worldwide, this spatial analysis could be expanded globally, for example, to the European Union under policies such as the Alternative Fuels Infrastructure Directive (previously known as the Directive on Alternative Fuels Infrastructure). In addition, the European Commission has supported implementation of fast charging infrastructure through the Trans-European Network for Transport and Connecting Europe Facility Transport programs. This type of national-scale infrastructure expansion is part of an overall strategy by the European Union to reduce CO2 emissions from the transportation sector by 60% by 2050. The capability to deploy accurate and more efficient deep learning models can be applied to evaluate other charging infrastructure rollout policies that aim to increase the number of charge points, reduce charging congestion, and promote vehicle-to-grid and overnight charging, as well as solar adoption. For recent reviews on how charging behavior can guide charging infrastructure implementation policy, see van der Kam et al. and McCollum et al. Other applications that use artificial intelligence and NLP to discover hard-to-reveal patterns in unstructured data, especially those that merge spatial information, should generate fruitful areas of future inquiry.

Concluding remarks

In this study, we report state-of-the-art results for multilabel topic classification of consumer reviews in EV infrastructure. This represents a potential step change in our ability to aggregate data and insights for EV business model development and public policy advisory. Implementing automated topic modeling solutions has been challenging because of the technical nature of the corpus and training data imbalances. Our experimental protocols highlight the importance of the quality of training data annotations in the data processing pipeline. First, human expert annotators outperform the general crowd in both accuracy and F1 score metrics. This is due to improvements in interrater reliability, which is critical when working with data from complex domains. Second, improvements in training data quality also produce more accurate and reliable detection. This is seen in the approximate increase of 15 percentage points in accuracy and 50% improvement in the F1 score in the expert-trained transformer models compared with the crowd-trained models (Table 3). Third, when the models are trained on high-quality, expert-curated training data, surprisingly, the transformer neural networks can outperform even human experts. This provides evidence of super-human classification on imbalanced corpora. As deep learning models have often been criticized for their black-box nature, we suggest technical enhancements that focus on model interpretability as future work, such as the use of rationales, influence functions, or sequence tagging approaches that can offer deeper insights into the models and the reasons for their predictions. This is an area of active research. Further applications of the methods that we propose, particularly those that integrate artificial intelligence with real-time data and spatial analysis, can greatly enhance new ways of thinking about infrastructure management as well as economic and policy analysis. Other opportunities abound.

Experimental procedures

Resource availability

Lead contact

Further information and requests for resources and materials should be directed to and will be fulfilled by the lead contact, Dr. Omar I. Asensio (asensio@gatech.edu).

Materials availability

The trained model weights for BERT and XLNet generated in this study have been deposited in Figshare: https://doi.org/10.6084/m9.figshare.12612092.v1.

Data and code availability

The anonymized datasets and code generated during this study have been deposited in Zenodo: https://doi.org/10.5281/zenodo.4276350. The raw data may not be posted publicly due to privacy restrictions.

Data

We reanalyzed data derived from a nationally representative collection of unstructured consumer reviews from 12,720 charging station locations across the United States. It comprised 127,257 reviews, all written in English, by 29,532 registered and unregistered EV drivers across a 4-year duration from 2011 to 2015. The spatial coverage of the dataset includes reviews from 750 MSAs (309 large MSAs of population 1 million or more; 228 medium MSAs of population 250,000–999,999; 213 small MSAs of population 50,000–249,999). This also includes 294 μSAs (i.e., population 10,000–49,999) and 232 non-core-based statistical areas (i.e., population less than 10,000). This spatial coverage is based on the 2013 OMB delineation of MSAs and μSAs. The data are statistically representative of the entire US EV market, which includes all major EV networks and a mix of both public and private stations, urban and rural stations, and both low and highly rated stations. The data include the text of consumer reviews and contain other useful indicators such as the timestamp of the reviews and the car make and model. We also geo-coded the station location and related points of interest using the Google Places API. However, the dataset does not contain EV transactions data, such as how many kilowatt hours were transferred. The data are also observable only on condition of a user checking in and posting a review. This type of data is expanding globally and we estimate that there are already over 3.2 million reviews through 2020 across more than 15 charge station locator apps.12, 13, 14, 15, 16 This includes English-language reviews as well as reviews in over 42 languages on all continents, such as Ukrainian, Russian, Spanish, French, German, Finnish, Italian, Croatian, Icelandic, Haitian-Creole, Ganda, Sudanese, Kinyarwanda, Afrikaans, Nyanja, Korean, Mandarin, Japanese, Indonesian, and Cebuano.
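The population bands quoted above can be written as a simple lookup. The following is a minimal sketch rather than the authors' code: the function name `classify_area`, the returned labels, and the `MSA_COUNTS` dictionary are hypothetical, and the function classifies on a single population figure, whereas the official OMB delineation relies on additional criteria.

```python
# Counts of MSAs by size class, as reported in the text.
MSA_COUNTS = {"large": 309, "medium": 228, "small": 213}


def classify_area(population: int) -> str:
    """Map a population count to the size classes quoted in the text."""
    if population < 0:
        raise ValueError("population must be non-negative")
    if population >= 1_000_000:
        return "large MSA"
    if population >= 250_000:
        return "medium MSA"
    if population >= 50_000:
        return "small MSA"
    if population >= 10_000:
        return "micropolitan"
    return "non-core"


if __name__ == "__main__":
    # The three MSA size classes add up to the 750 MSAs reported.
    print(sum(MSA_COUNTS.values()))
    print(classify_area(49_999))
```

The thresholds are checked from largest to smallest so that each band's lower bound is inclusive, matching ranges such as 250,000–999,999.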

Developing the coding scheme for supervised learning

We developed the coding scheme for our typology from prior work and theory using three strategies. First, we reviewed the extant literature to capture the most important potential behavioral issues for EV drivers. This led to identification of Range anxiety (49, 50, 51, 52), Dealership practices (53, 54, 55), Cost (56, 57, 58), Service time, Availability issues, User interaction (61, 62, 63), station Functionality, and Location. Second, to find evidence of the importance of these topics from the data, we hand-coded 8,953 randomly selected reviews to validate the 8 topics from prior literature and used these to generate 34 sub-topics for classification. We found that only 1% of the reviews were unclassifiable according to our 8 main categories (i.e., Other). Third, to validate the coding scheme, we also interviewed industry experts and practitioners, which allowed us to further refine our main topics and sub-topics shown in Table 1. This included informal communications with representatives from firms such as General Motors, ChargePoint, ReCharge Technologies, Electrada, Electrify America, and charging station managers (e.g., representatives from Ford and Georgia Tech Parking and Transportation Services) who were not directly involved in the research.

Human annotation of training data

A common criticism of deep neural networks is the high cost and annotator skill requirements for implementations in specialized corpora. We evaluated possible methods to lower implementation costs, such as crowd sourcing by using online labor pools for human annotation. This led us to conduct human annotator experiments with two training sets, one labeled by a crowd of non-experts and the other by a small group of trained experts. Given the known possible biases with historical data, we investigated whether protocols related to the labeling of the training data could have an impact on performance. The crowd and expert annotators each labeled a random sample of 10,652 reviews. We used an 80:10:10 split for training, validation, and testing, which met our objective of having equal amounts of training data for both annotator groups. We conducted statistical tests to determine whether the sampled training dataset was representative of the full dataset in key observable station characteristics. We confirmed that the training dataset was statistically representative in the mix of urban and non-urban stations (t test, p = 0.426) and public and private stations (t test, p = 0.709), as well as by station points of interest (t test, p = 0.802), e.g., retail, shopping, workplace, transit centers, etc. We also found that the training data were not statistically different in topic distribution from the predictions of the full dataset (Kolmogorov-Smirnov test, p = 0.9801).
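As an illustration of the 80:10:10 split described above, here is a minimal sketch using only the Python standard library. The function name `split_80_10_10` and the fixed seed are assumptions for illustration; the source does not say how the split was implemented or how rounding was handled.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items and split them into train, validation, and test parts."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(items)
    n_train = len(items) * 8 // 10  # integer arithmetic avoids float rounding
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # 10,652 is the number of labeled reviews reported in the text.
    train, val, test = split_80_10_10(range(10_652))
    print(len(train), len(val), len(test))
```

Under this rounding rule, any reviews left over after the integer division fall into the test set, so the three parts always cover the full sample without overlap.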

Crowd annotators

For the crowd-sourced training data sample, 1,000 US adults (age 18+) were pre-recruited via a Qualtrics online panel using their popular online survey platform. The crowd was statistically sampled on the basis of age, income, education, and sex, representative of the US population. This is important to mitigate possible human rater biases that could arise when discussing environmental topics. To enhance understanding of the domain-specific terminology for the general crowd, definitions and examples for the topics and sub-topics as shown in Table 1 were provided for annotation along with a supporting diagram containing typical components of an EV charging station (see Figures S2 and S3). We report the Fleiss κ for crowd annotators as 0.007.

Expert annotators

For the expert-sourced training data sample, five student annotators with technical backgrounds were recruited and trained in a facilitated focus group. They were instructed to recognize the domain-specific topics using a detailed training manual for the annotation. To support scientific replication and to document the protocols, we have open sourced this training manual. These protocols were developed in consultation with EV industry experts who have been in contact with the researchers. Although our expert annotators have been trained to recognize domain-specific terminology, we acknowledge that we were not able to compare the performance of our expert annotators with that of EV industry professionals due to cost reasons. Despite this limitation, we find that our human experts were two orders of magnitude more reliable in the annotation (76-fold increase in our reliability measure) than the crowd annotators (κ = 0.538 and κ = 0.007, respectively). See Model metrics under Performance measures for additional details on computing Fleiss' κ. To provide greater control over the labeling task, we developed a custom web application used by the expert annotators as shown in Figure S3. The web app provides efficient database support for random sampling from a large dataset and overcomes latency and scaling challenges that we encountered during crowd annotation in popular survey software.

Ground truth labels

To generate the ground truth labels, we followed the same training protocols used by the expert annotators. Then, we randomly sampled 100 overlapping reviews that were annotated by both annotator groups to enable performance comparisons. On this sample, we conducted an additional round of researcher audits that validated 100% agreement on the annotations. Given that the human experts exhibited some level of disagreement (Fleiss' κ = 0.538, Table 3), this sample was used to benchmark the performance of the US crowd and the human experts. The results of these comparisons as well as their statistical uncertainty are reported in Table 3. To generate the uncertainty, we performed a cross validation using block randomization with 10 equal-sized blocks of ground truth data.
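One plausible reading of the block procedure above is sketched here: shuffle the ground truth sample, cut it into 10 equal-sized blocks, score each block, and summarize the spread. This is an assumption-laden illustration, not the authors' implementation; the names `hamming_accuracy` and `blocked_scores`, the choice of metric, and the use of the sample standard deviation as the uncertainty measure are all hypothetical.

```python
import random
import statistics


def hamming_accuracy(pairs):
    """Share of label slots that match across (true, predicted) pairs."""
    slots = 0
    matches = 0
    for true_labels, predicted_labels in pairs:
        for t, p in zip(true_labels, predicted_labels):
            slots += 1
            matches += int(t == p)
    return matches / slots


def blocked_scores(pairs, metric=hamming_accuracy, n_blocks=10, seed=0):
    """Shuffle, cut into equal blocks, and summarize the per-block metric."""
    pairs = list(pairs)
    if not pairs or len(pairs) % n_blocks != 0:
        raise ValueError("sample must be non-empty and divide evenly into blocks")
    rng = random.Random(seed)
    rng.shuffle(pairs)  # random assignment of reviews to blocks
    size = len(pairs) // n_blocks
    scores = [metric(pairs[i * size:(i + 1) * size]) for i in range(n_blocks)]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # 100 toy reviews with 4 labels each; half carry one wrong label.
    right = [((1, 0, 1, 0), (1, 0, 1, 0))] * 50
    wrong = [((1, 0, 1, 0), (1, 0, 1, 1))] * 50
    print(blocked_scores(right + wrong))
```

With 100 reviews, each block holds 10 reviews, so the reported spread reflects how much the metric varies across small random subsets of the ground truth sample.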

Performance measures

Model metrics

To assess model performance, we report the micro-averaged F1 score, which is a standard metric for classifier performance that accounts for both false positives and false negatives. We used standard measures for multilabel accuracy, where annotators could choose multiple labels per review. Our overall accuracy metric accounts for partially correct matches. By convention, this is equivalent to 1 − Hamming loss, where the Hamming loss is a calculation of the dissimilarity (i.e., the fraction of wrong labels compared with the total number of labels). For $L$ categories classified on a sample of size $N$, the accuracy can be calculated as:

$$\text{Accuracy} = 1 - \frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathbb{1}\left[ y_{ij} \neq \hat{y}_{ij} \right]$$

where $y_{ij}$ and $\hat{y}_{ij}$ are the true and predicted labels for category $j$ of review $i$. For example, if a multilabel prediction [1, 1, 1, 0] had a true label [1, 1, 1, 1], the accuracy is 3/4 or 75%.
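The two metrics described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code; the function names `multilabel_accuracy` and `micro_f1` are hypothetical, and labels are assumed to be binary indicator vectors of equal length.

```python
def multilabel_accuracy(y_true, y_pred):
    """Return 1 - Hamming loss: the share of label slots predicted correctly."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("label vectors must have equal length")
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return 1 - wrong / total


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool counts over all labels, then compute F1 once."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # The worked example from the text: one wrong label out of four.
    print(multilabel_accuracy([[1, 1, 1, 1]], [[1, 1, 1, 0]]))  # 0.75
```

Pooling the counts before computing F1 means frequent topics weigh more than rare ones, which is worth keeping in mind on imbalanced corpora such as this one.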

Interrater reliability

To measure the interrater agreement level among the annotators, we used Fleiss' κ, which allows for the measurement of agreement between multiple annotators (i.e., more than 2). It is calculated as shown below:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

where $\bar{P}$ is the average number of agreements on all annotations between rater pairs for the reviews, and $\bar{P}_e$ is the sum of squares of the probability share for the assignment to a topic. As κ is bounded between −1 and 1, when κ is less than 0, agreement between raters is occurring below what would be expected at random, while a κ above 0 means that agreement between raters is occurring at more than what would be expected by random chance. For more information, see Fleiss.
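For concreteness, the standard Fleiss' κ computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `fleiss_kappa` is hypothetical, the input is assumed to be a table of counts (one row per review, one column per category, each cell holding the number of raters who chose that category), and every rater is assumed to assign exactly one category per review. How the study adapted the statistic to multilabel annotations is not specified in the text.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from an items-by-categories table of rater counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement: sum of squared category shares.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(s * s for s in shares)
    if p_e == 1:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Three raters, two reviews, two categories, with partial disagreement.
    print(fleiss_kappa([[2, 1], [1, 2]]))
```

In the toy example, the raters agree less often than chance would predict, so the statistic comes out negative, which illustrates the lower part of the range described above.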

Ethics statement

Human subjects research was conducted under the approved Institutional Review Board Protocol H18250.
