Challenges of Large-Scale Multi-Camera Datasets for Driver Monitoring Systems
Juan Diego Ortega, Paola Natalia Cañas, Marcos Nieto, Oihana Otaegui, Luis Salgado.
Abstract
Tremendous advances in advanced driver assistance systems (ADAS) have been possible thanks to the emergence of deep neural networks (DNN) and Big Data (BD) technologies. Huge volumes of data can be managed and consumed as training material to create DNN models which feed functions such as lane keeping systems (LKS), automated emergency braking (AEB), lane change assistance (LCA), etc. In the ADAS/AD domain, these advances are only possible thanks to the creation and publication of large and complex datasets, which can be used by the scientific community to benchmark and leverage research and development activities. In particular, multi-modal datasets have the potential to feed DNNs that fuse information from different sensors or input modalities, producing optimised models that exploit modality redundancy, correlation, complementariness and association. Creating such datasets poses a scientific and engineering challenge. The BD dimensions to cover are volume (large datasets), variety (wide range of scenarios and contexts), veracity (data labels are verified), visualization (data can be interpreted) and value (data is useful). In this paper, we explore the requirements and technical approach to build a multi-sensor, multi-modal dataset for video-based applications in the ADAS/AD domain. The Driver Monitoring Dataset (DMD) was created and partially released to foster research and development on driver monitoring systems (DMS), a particular sub-case which receives less attention than exterior perception. Details on the preparation, construction, post-processing, labelling and publication of the dataset are presented in this paper, along with the announcement of a subsequent release of DMD material publicly available for the community.
Keywords: ADAS; automotive; datasets; driver monitoring; multi-camera
Year: 2022 PMID: 35408169 PMCID: PMC9003494 DOI: 10.3390/s22072554
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Examples of activities performed in the DMD.
Comparison of public vision-based driver monitoring datasets.
| Dataset | Year | Drivers a | Views b | Size c | GT d | Streams | Scenarios | Usage |
|---|---|---|---|---|---|---|---|---|
| CVRR-Hands | 2013 | 8 (1/7) | 1 | 7 k | Hands, … | RGB | Car | Normal driving, … |
| DrivFace | 2016 | 4 (2/2) | 1 | 0.6 k | Face/Head | RGB | Car | Normal driving, … |
| DROZY | 2016 | 14 (11/3) | 1 | 7 h | Face/Head | IR | Laboratory | Drowsiness |
| NTHU-DDD | 2017 | 36 (18/18) | 1 | 210 k | Actions | RGB | Simulator | Normal driving, drowsiness |
| Pandora | 2017 | 22 (10/12) | 1 | 250 k | Face/Head, … | RGB | Simulator | Head/Body pose |
| DriveAHead | 2017 | 20 (4/16) | 1 | 10.5 h | Face/Head, … | Depth | Car | Normal driving, … |
| UTA-RLDD | 2019 | 60 (9/51) | 1 | 30 h | Subjective KSS | RGB | Laboratory | Drowsiness |
| DD-Pose | 2019 | 24 (6/21) | 2 | 6 h | Face/Head, … | RGB e | Car | Normal driving, … |
| AUC-DD | 2019 | 44 (15/29) | 1 | 144 k | Actions | RGB | Car | Normal driving, distraction |
| Drive&Act | 2019 | 15 (4/11) | 6 | 12 h | Hands/Body, … | RGB e | Car | Autonomous driving, … |
| DMD (ours) | 2021 | 37 (10/27) | 3 | 41 h | Face/Head, … | RGB, Depth, IR | Car, simulator | Normal driving, distraction, drowsiness, gaze–hands |
a Number of drivers (female/male); b Simultaneous camera views of the scene; c h: hours of video, k: image number; d Ground-truth data; e only for side view; f only for face view.
Figure 2. General overview of the DMD creation process.
Figure 3. Metadata label taxonomy for the DMD.
Subset of annotation levels for distraction detection in the DMD.
| Level | Labels |
|---|---|
| Camera Occlusion | - Face camera … |
| Gaze on Road | - Looking road … |
| Talking | - Talking |
| Hands Using Wheel | - Both … |
| Hand on Gear | - Hand on gear |
| Objects in Scene | - Cellphone … |
| Driver Actions | - Safe drive … |
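Since several levels can be active at the same time (e.g., a driver can be texting while keeping one hand on the wheel), each frame effectively carries one label per annotation level rather than a single class. The sketch below is a rough, hypothetical illustration of such interval-based multi-level annotations; the level and label names follow the table above, but the actual DMD annotation schema and file format are defined by the dataset itself.

```python
# Illustrative sketch only: interval-based, multi-level frame annotations.
from dataclasses import dataclass, field

@dataclass
class LabelInterval:
    start: int   # first frame (inclusive)
    end: int     # last frame (inclusive)
    level: str   # annotation level, e.g. "driver_actions"
    label: str   # label within that level, e.g. "safe_drive"

@dataclass
class VideoAnnotation:
    video_id: str
    intervals: list = field(default_factory=list)

    def labels_at(self, frame: int) -> dict:
        """Return the active label of every annotation level at a frame."""
        return {i.level: i.label
                for i in self.intervals if i.start <= frame <= i.end}

ann = VideoAnnotation("session_example")
ann.intervals += [
    LabelInterval(0, 299, "gaze_on_road", "looking_road"),
    LabelInterval(0, 299, "hands_using_wheel", "both"),
    LabelInterval(120, 220, "driver_actions", "texting_right"),
]
print(ann.labels_at(150))
# -> {'gaze_on_road': 'looking_road', 'hands_using_wheel': 'both',
#     'driver_actions': 'texting_right'}
```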
Actions performed in the distraction protocol.
| Car Stopped | Car Driving | Simulator Driving |
|---|---|---|
| Safe driving | Safe driving | Safe driving |
| Reach object backseat | Operating the radio | Brush the hair |
| Reach object side | Drinking | Phone call—right hand |
| Brush the hair | Talk to passenger | Phone call—left hand |
| Phone call—right hand | Texting—right hand | |
| Phone call—left hand | Texting—left hand | |
| Texting—right hand | Drinking | |
| Texting—left hand | | |
Actions performed in the drowsiness protocol.
| Car Stopped and Simulator Driving |
|---|
| Drive normally |
| Sleepy driving |
| Yawn without hand |
| Yawn with hand |
| Micro-sleep |
Actions performed in the gaze–hands protocol.
| Car–Simulator–Driving | |
|---|---|
| Gaze Zones | Hand Actions |
| … | Both hands on (not moving) |
| | Right hand on (not moving) |
| | Left hand on (not moving) |
| | Both hands off (not moving) |
| | Both hands on (moving) |
| | Right hand on (moving) |
| | Left hand on (moving) |
Figure 4. DMD camera setup and recording environments.
Specifications of Intel® RealSense™ cameras used to generate the DMD.
| | D415 | D435 |
|---|---|---|
| Use Environment | Indoor/Outdoor | Indoor/Outdoor |
| Depth FOV (H × V) | 65° × 40° | 87° × 58° |
| Depth Resolution | Up to 1280 × 720 | Up to 1280 × 720 |
| Depth Frame Rate | Up to 90 FPS | Up to 90 FPS |
| RGB FOV (H × V) | 69° × 42° | 69° × 42° |
| RGB Resolution | Up to 1920 × 1080 | Up to 1920 × 1080 |
| RGB Frame Rate | 30 FPS | 30 FPS |
| Min. Depth Distance at Max Resolution | ∼45 cm | ∼28 cm |
| Ideal Range | 0.5 m to 3 m | 0.3 m to 3 m |
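For reference, the sketch below shows how RGB, depth and IR streams within the ranges listed above can be opened with Intel's pyrealsense2 Python bindings. The resolutions and frame rates chosen here are illustrative assumptions, not the DMD recording configuration.

```python
# Minimal pyrealsense2 sketch: open RGB, depth and IR streams on a RealSense
# D415/D435. Stream settings are one plausible choice within the table above.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)      # RGB
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)       # 16-bit depth
config.enable_stream(rs.stream.infrared, 1, 1280, 720, rs.format.y8, 30)  # left IR imager

pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()   # blocks until a coherent frameset arrives
    color = frames.get_color_frame()
    depth = frames.get_depth_frame()
    ir = frames.get_infrared_frame(1)
    # e.g. distance (in metres) at the centre of the depth image
    print(color.get_width(), color.get_height(), depth.get_distance(640, 360))
finally:
    pipeline.stop()
```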
Figure 5. Participants' information.
Camera recording parameters and required bandwidths. The specification selected for the DMD is marked in bold.
| RGB (W × H × FPS) | Depth (W × H × FPS) | IR (W × H × FPS) | Bandwidth, 1 Camera (Mbit/s) | Bandwidth, 3 Cameras (Mbit/s) |
|---|---|---|---|---|
| 1920 × 1080 × 30 | 1280 × 720 × 30 | 1280 × 720 × 30 | 2157 | 6470 |
| … | … | … | … | … |
| 848 × 480 × 30 | 848 × 480 × 30 | 848 × 480 × 30 | 586 | 1758 |
| 848 × 480 × 60 | 848 × 480 × 60 | 848 × 480 × 60 | 1172 | 3517 |
| 640 × 480 × 30 | 640 × 480 × 30 | 640 × 480 × 30 | 442 | 1327 |
| 640 × 480 × 60 | 640 × 480 × 60 | 640 × 480 × 60 | 885 | 2654 |
| 640 × 360 × 90 | 640 × 480 × 90 | 640 × 480 × 90 | 1161 | 3484 |
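These bandwidth figures are consistent with uncompressed streams at 3 bytes/pixel for RGB, 2 bytes/pixel for 16-bit depth and 1 byte/pixel for IR, with the last column covering the three cameras of the setup. A quick back-of-the-envelope check, assuming exactly those pixel sizes:

```python
# Reproduce the bandwidth columns above, assuming uncompressed streams:
# RGB 3 bytes/pixel, depth 2 bytes/pixel (16-bit), IR 1 byte/pixel.
def stream_mbits(w, h, fps, bytes_per_px):
    """Bandwidth of one stream in Mbit/s."""
    return w * h * fps * bytes_per_px * 8 / 1e6

def camera_mbits(rgb, depth, ir):
    """Bandwidth of one camera (RGB + depth + IR) in Mbit/s."""
    return stream_mbits(*rgb, 3) + stream_mbits(*depth, 2) + stream_mbits(*ir, 1)

one = camera_mbits((1920, 1080, 30), (1280, 720, 30), (1280, 720, 30))
print(round(one), round(3 * one))   # 2157 6470, matching the first row

one = camera_mbits((848, 480, 30), (848, 480, 30), (848, 480, 30))
print(round(one), round(3 * one))   # 586 1758
```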
Figure 6. DMD material volume, in raw data size and video count.
Figure 7. Process for multi-sensor stream synchronization: (a) region of interest and signal extraction; (b) processing of the temporal signals and correlation calculation. The blue signal is computed from the face and hands ROIs in the face and body cameras, respectively. The red signal is computed from the face and hands ROIs in the body and hands cameras, respectively.
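The synchronization idea of Figure 7 (extract a 1-D temporal signal from an ROI visible in two cameras, then find the frame offset that maximises their correlation) can be sketched as follows; the signal extraction and peak search details here are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: estimate the temporal offset between two camera streams by
# cross-correlating 1-D activity signals extracted from a shared ROI.
import numpy as np

def estimate_lag(sig_a: np.ndarray, sig_b: np.ndarray) -> int:
    """Frames by which sig_b lags behind sig_a (positive => b is later)."""
    a = (sig_a - sig_a.mean()) / (sig_a.std() + 1e-9)   # normalise both signals
    b = (sig_b - sig_b.mean()) / (sig_b.std() + 1e-9)
    corr = np.correlate(b, a, mode="full")              # lags -(N-1) .. +(N-1)
    return int(np.argmax(corr) - (len(a) - 1))

# Toy check: sig_b is a noisy copy of sig_a delayed by 7 frames.
rng = np.random.default_rng(0)
sig_a = rng.normal(size=500).cumsum()                   # stand-in ROI signal
sig_b = np.concatenate([np.zeros(7), sig_a[:-7]])
sig_b = sig_b + rng.normal(scale=0.1, size=500)
print(estimate_lag(sig_a, sig_b))   # -> 7
```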
Comparison of annotation times using V1 and V2 of TaTo.
| Session | TaTo Version | # Videos | Total Time | Time/Video | Improvement |
|---|---|---|---|---|---|
| s1 | V1 | 4 | 13:02:00 | 3:25:00 | 56.10% |
| s1 | V2 | 1 | 1:30:00 | 1:30:00 | |
| s2 | V1 | 3 | 16:10:00 | 5:36:00 | 63.69% |
| s2 | V2 | 2 | 4:04:00 | 2:02:00 | |
| s3 | V1 | 2 | 2:20:00 | 1:10:00 | 81.43% |
| s3 | V2 | 3 | 0:41:00 | 0:13:00 | |
| s4 | V1 | 2 | 19:45:00 | 8:22:00 | 63.55% |
| s4 | V2 | 3 | 9:10:00 | 3:03:00 | |
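For each session, the Improvement column equals 1 - (Time/Video with V2) / (Time/Video with V1), which is easy to verify:

```python
# Verify the Improvement column from the Time/Video values in the table.
def hms_to_min(t):
    """Convert 'H:MM:SS' to minutes."""
    h, m, s = (int(x) for x in t.split(":"))
    return h * 60 + m + s / 60

for v1, v2 in [("3:25:00", "1:30:00"), ("5:36:00", "2:02:00"),
               ("1:10:00", "0:13:00"), ("8:22:00", "3:03:00")]:
    print(f"{1 - hms_to_min(v2) / hms_to_min(v1):.2%}")
# -> 56.10%, 63.69%, 81.43%, 63.55%
```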
Figure 8. DMD distribution for: (a) Driver_actions; (b) Hands_using_wheel; (c) Talking; (d) Hand_on_gear; (e) Objects_in_scene; (f) Gaze_on_road.
Figure 9. Procedure to distribute the DMD.
Figure 10. DMD file structure.