Mattia Zago1, Stefano Longari2, Andrea Tricarico2, Michele Carminati2, Manuel Gil Pérez1, Gregorio Martínez Pérez1, Stefano Zanero2.
Abstract
This article details the methodology and approach used to extract and decode data from the Controller Area Network (CAN) buses of two personal vehicles and three commercial trucks, for a total of 36 million data frames. The dataset is composed of two complementary parts: the raw data and the decoded data. Along with the description of the data, this article reports the hardware and software requirements needed first to extract the data from the vehicles and then to decode the binary data frames into the actual sensor data. Finally, to enable reproducible analysis and future research, the code snippets described in pseudo-code will be made publicly available in a code repository. Sufficiently motivated actors may intercept, interact with, and recognize vehicle data using consumer-grade technology, which once again refutes the security-through-obscurity paradigm that automotive manufacturers use as a primary defensive countermeasure.
Keywords: Automotive; Controller area network (CAN); Dataset; Reverse engineering
Year: 2020 PMID: 32071958 PMCID: PMC7015990 DOI: 10.1016/j.dib.2020.105149
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
Fig. 1. Data repository structure and data samples. (b) Sample of RAW data, obtained as described in Section 2.3 and indicated in Fig. 1a as raw.csv. Note that the data column is truncated due to space concerns. (c) Sample of decoded data, obtained as described in Section 2.4 and indicated in Fig. 1a as unified.csv.
Fig. 2. Code repository [2].
List of experiments per vehicle.
| Vehicle | Test | Date | Start time | End time | IDs | Frames | Description |
|---|---|---|---|---|---|---|---|
| C-1 | #1 | 2018-07-26 | 15:15:58 | 15:35:20 | 77 | 3,062,691 | city and highway driving |
| C-1 | #2 | 2018-07-26 | 15:46:13 | 15:48:32 | 76 | 364,863 | city and highway driving |
| C-1 | #3 | 2018-07-26 | 15:49:10 | 15:49:23 | 76 | 33,005 | repeated brake tests |
| C-1 | #4 | 2018-07-26 | 15:50:29 | 16:10:54 | 83 | 3,227,315 | city and highway driving |
| C-1 | #5 | 2018-07-26 | 16:10:57 | 16:20:16 | 83 | 1,473,625 | city and highway driving |
| C-1 | #6 | 2018-07-26 | 16:20:20 | 16:30:59 | 83 | 1,684,769 | city and highway driving |
| C-1 | #7 | 2018-07-26 | 16:53:17 | 17:10:31 | 83 | 2,723,484 | city and highway driving |
| C-1 | #8 | 2019-02-01 | 16:31:01 | 16:40:58 | 82 | 1,569,776 | city and highway driving |
| C-1 | #9 | 2019-02-01 | 15:18:55 | 16:30:36 | 88 | 10,942,747 | city and highway driving |
| C-2 | #1 | 2019-10-02 | 08:54:16 | 09:22:40 | 78 | 3,467,855 | city and highway driving |
| T-1 | #1 | 2019-02-20 | 16:04:06 | 16:35:04 | 31, 47 | 1,798,602 | city and highway driving |
| T-2 | #1 | 2019-11-08 | 14:51:57 | 15:07:43 | 22 | 498,721 | city driving |
| T-2 | #2 | 2019-11-08 | 14:34:33 | 14:43:20 | 22 | 263,269 | vehicle not moving |
| T-3 | #1 | 2019-11-08 | 11:48:56 | 12:14:58 | 23 | 1,729,623 | city driving |
| T-3 | #2 | 2019-11-08 | 11:16:55 | 11:23:42 | 19 | 2,795,321 | vehicle not moving test 1 |
| T-3 | #3 | 2019-11-08 | 12:57:48 | 13:42:52 | 23 | 2,795,321 | vehicle not moving test 2 |
For the T-1 experiment (IDs 31 and 47), both the can0 and can1 lines are included.
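The start time, end time, and frame count of each experiment are enough to derive an average frame rate, which is a quick sanity check when loading the data. The following is a minimal sketch; the helper name `mean_frame_rate` is hypothetical, and the two rows are copied from the table above (C-1 test #1 and T-2 test #2).

```python
from datetime import datetime


def mean_frame_rate(date: str, start: str, end: str, frames: int) -> float:
    """Average number of frames per second over one experiment window."""
    fmt = "%Y-%m-%d %H:%M:%S"
    t0 = datetime.strptime(f"{date} {start}", fmt)
    t1 = datetime.strptime(f"{date} {end}", fmt)
    seconds = (t1 - t0).total_seconds()
    if seconds <= 0:
        raise ValueError("end time must be later than start time")
    return frames / seconds


# C-1 test #1: 19 min 22 s of city and highway driving.
rate_c1 = mean_frame_rate("2018-07-26", "15:15:58", "15:35:20", 3_062_691)
# T-2 test #2: 8 min 47 s with the vehicle not moving.
rate_t2 = mean_frame_rate("2019-11-08", "14:34:33", "14:43:20", 263_269)

print(f"C-1 #1: {rate_c1:.1f} frames/s, T-2 #2: {rate_t2:.1f} frames/s")
```

The rate is only an average over the whole window; it says nothing about how individual identifiers are distributed in time.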
Dataset composition according to vehicle type.
| ID | Vehicle | Type | Connector | FMS |
|---|---|---|---|---|
| C-1 | Alfa Romeo Giulia Veloce | Car | OBD-II | No |
| C-2 | Opel Corsa | Car | OBD-II | No |
| T-1 | Mitsubishi Fuso Canter | Commercial Truck | OBD-II | Yes |
| T-2 | ISUZU M55 | Commercial Truck | OBD-II, direct wire access | No |
| T-3 | Piaggio Porter Maxi | Commercial Truck | OBD-II | No |
Main identifiers for each vehicle.
† Obtained by FMS.
∗ Manually identified.
∗∗ These variables appear to be replicated multiple times in the data frame.
Fig. 3. CAN data frame.
Fig. 4. Architecture of the collection framework, including both the vehicle and the direct human intervention.
Fig. 5. Sample bitflip heatmaps for the Opel Corsa (see Table 2) with different levels of information.


Fig. 6. Number of data frames intercepted for each vehicle, CAN line, and experiment (Table 2). The series are sorted and include only those IDs for which the data extraction algorithm yields at least one data variable, as specified in Section 2.4.
Fig. 7. Statistical information regarding the number of ECU identifiers for each vehicle and experiment. The bottom axis presents the unique count of ECU identifiers, while the top axis reports the boxplots that describe the distributions of data frames for each vehicle, experiment, and ECU identifier. Note that for vehicle T-1 there are two CAN lines.
Fig. 8. Frame interarrival times for the Alfa Romeo Giulia (C-1). Values are on a logarithmic scale.
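As an illustration of the quantity plotted in Fig. 8, interarrival times can be obtained by grouping frames by identifier and differencing consecutive timestamps. This is only a sketch under stated assumptions: the input is taken to be (timestamp, identifier) pairs with timestamps in milliseconds, the identifiers `0FB` and `1F0` are made up, and the function name is hypothetical. The actual column layout of raw.csv is not assumed here.

```python
import math
from collections import defaultdict


def interarrival_times(frames):
    """Return, per CAN ID, the gaps between consecutive frame timestamps.

    `frames` is an iterable of (timestamp, can_id) pairs in any order.
    """
    by_id = defaultdict(list)
    for timestamp, can_id in frames:
        by_id[can_id].append(timestamp)
    gaps = {}
    for can_id, stamps in by_id.items():
        stamps.sort()
        gaps[can_id] = [b - a for a, b in zip(stamps, stamps[1:])]
    return gaps


# Made-up sample: timestamps in milliseconds, hypothetical identifiers.
sample = [(0, "0FB"), (10, "0FB"), (4, "1F0"), (20, "0FB"), (104, "1F0")]
gaps = interarrival_times(sample)

# Logarithmic scale, as in the figure; zero gaps are skipped.
log_gaps = {
    can_id: [math.log10(g) for g in values if g > 0]
    for can_id, values in gaps.items()
}
```

Periodic identifiers show up as a tight cluster of nearly identical gaps, which is why a logarithmic axis is convenient when identifiers with very different periods share one plot.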
Fig. 9. Bitflip magnitude heatmap for vehicle T-1, limited to the ECU identifier that carries the information from the RPM and speed sensors.
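The bitflip heatmaps in Figs. 5 and 9 rest on counting how often each payload bit changes between consecutive frames of the same identifier. A minimal sketch of that count follows; it is not the authors' implementation. The function name, the representation of payloads as integers, and the sample values are assumptions made for illustration.

```python
def bit_flip_rates(payloads, n_bits=64):
    """Fraction of consecutive frame pairs in which each bit changes.

    `payloads` is a time-ordered list of integers, one per data frame of a
    single CAN ID, each holding the payload as an n_bits-wide value.
    Index 0 of the result corresponds to the most significant bit.
    """
    if len(payloads) < 2:
        return [0.0] * n_bits
    flips = [0] * n_bits
    for prev, curr in zip(payloads, payloads[1:]):
        changed = prev ^ curr  # bits set to 1 differ between the two frames
        for i in range(n_bits):
            if (changed >> (n_bits - 1 - i)) & 1:
                flips[i] += 1
    pairs = len(payloads) - 1
    return [count / pairs for count in flips]


# Made-up 8-bit payloads: constant high nibble, counter in the low nibble.
demo = [0xA0, 0xA1, 0xA2, 0xA3, 0xA4]
rates = bit_flip_rates(demo, n_bits=8)
```

In this toy input the least significant bit flips on every frame while the constant high nibble never flips, which is the pattern a counter-like signal next to a static field would produce in a heatmap row.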
Specification Table
| Subject area | Engineering, Computer Science |
| Specific area | Automotive Engineering, Artificial Intelligence |
| Type of data | CSV files |
| How data were acquired | The Controller Area Network (CAN) buses were accessed using a standard CAN connector and a CANtact board. The CAN Utils library, publicly available in the Linux kernel, was used to intercept the vehicle's network traffic. Sensor data were decoded using a state-of-the-art algorithm. The source code for each step of the analysis is publicly available in the repository specified below. |
| Data format | Raw and Filtered |
| Parameters for data collection | |
| Description of data collection | |
| Data source location | Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy |
| Data accessibility | Data repository: ReCAN Data - Reverse engineering of Controller Area Networks [ |
| | Data identification number: 10.17632/76knkx3fzv |
| | Direct URL to data: |
| | Source code repository: ReCAN Source - Reverse engineering of Controller Area Networks [ |
| | Source code URL: |
These data aim to address the lack of large, continuous, machine-learning-ready datasets for automotive analysis. The primary audience is academic researchers who focus on machine-learning-driven research and who may benefit from freshly generated and carefully reviewed data. The data have two main uses: (i) the raw data can be used to train unsupervised automatic decoders, while (ii) the decoded data can be used to power self-optimized intrusion detection systems. The Controller Area Network (CAN) streams are also decoded and interpreted; this preprocessing may provide the scientific community with additional and improved data characterization.